TextExtraction matching overlapping elements

david3 · June 10, 2024, 6:04am

Product: PDF SDK

Product Version: Python 3

Please give a brief summary of your issue:
I need to get the underlying text elements corresponding to the TextExtractor output. I’ve already matched elements with lines and words using the bounding boxes. In some cases the elements overlap, e.g. a textual watermark in the background and normal text in the foreground. If I match on bounding boxes alone, the matches can collide.

Please describe your issue and provide steps to reproduce it:
Is there a way to use any extra information to know which elements correspond to the foreground or background text (or even which are associated with a form XObject)? The TextExtractor class is able to pull out the text cleanly.
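To make the collision concrete, here is a minimal sketch of one possible tie-breaker: on top of the bounding-box test, require that the candidate element's text contains the extracted word. This is only an illustrative heuristic, not something the SDK provides. The helper names `overlap_area` and `match_word`, and the dict layout with `text` and `bbox` keys, are hypothetical, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x1, y1, x2, y2) boxes; 0.0 if disjoint."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def match_word(word_text, word_box, elements):
    """Return elements that overlap the word box AND contain the word's text.

    `elements` is a list of dicts with hypothetical keys "text" and "bbox".
    Candidates are returned with the largest overlap first.
    """
    hits = [
        e for e in elements
        if overlap_area(word_box, e["bbox"]) > 0 and word_text in e["text"]
    ]
    return sorted(hits, key=lambda e: overlap_area(word_box, e["bbox"]), reverse=True)
```

With a page-sized "CONFIDENTIAL" watermark and a foreground run "Hello world", the word "Hello" overlaps both boxes but only the foreground run passes the text check. The check can still fail when a word is split across several elements or when the watermark happens to contain the same word, so it narrows the collisions rather than removing them.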

Please provide a link to a minimal sample where the issue is reproducible:

nicholas.cote · June 25, 2024, 7:58pm

Hello! Thanks for reaching out. For a use case like this, our ElementReader might be better suited to the task. It groups the elements as they are displayed and returns the actual underlying elements along with their associated information.

An example of using ElementReader can be found in our Element Reader Sample.
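As a rough sketch of the idea, the traversal below records each text run together with its drawing order and how deeply it is nested inside form XObjects. It is written against a duck-typed reader so it does not depend on the SDK being installed. It assumes the reader exposes `Next()`, `FormBegin()` and `End()`, and that elements expose `GetType()`, `GetBBox()` (with `x1`, `y1`, `x2`, `y2`) and `GetTextString()`, as in the SDK's Python samples. The function name `collect_text_elements` and the output dict keys are hypothetical.

```python
def collect_text_elements(reader, text_type, form_type, form_depth=0, out=None):
    """Walk a reader in drawing order and record every text element.

    text_type / form_type are the element-type constants to compare against
    (assumed to be Element.e_text and Element.e_form in the SDK).
    """
    if out is None:
        out = []
    element = reader.Next()
    while element is not None:
        kind = element.GetType()
        if kind == text_type:
            box = element.GetBBox()
            out.append({
                "order": len(out),          # position in drawing order
                "form_depth": form_depth,   # 0 = page content, 1+ = inside a form
                "text": element.GetTextString(),
                "bbox": (box.x1, box.y1, box.x2, box.y2),
            })
        elif kind == form_type:
            # Descend into the form XObject, then return to the parent stream.
            reader.FormBegin()
            collect_text_elements(reader, text_type, form_type, form_depth + 1, out)
            reader.End()
        element = reader.Next()
    return out
```

With the SDK this would presumably be called after `reader = ElementReader()` and `reader.Begin(page)`, passing `Element.e_text` and `Element.e_form`, and followed by `reader.End()`. Elements drawn earlier sit underneath later ones, so `order` and `form_depth` give extra signals for telling a background watermark from foreground text, although a watermark is not guaranteed to live in a form XObject.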

Let me know if this works.