Please give a brief summary of your issue:
The value returned by Element.GetBBox() doesn’t match the Word.GetBBox() value from the TextExtractor output
Please describe your issue and provide steps to reproduce it:
I am trying to match the text extraction output to the text elements. It works well for text that is not within a form object. However, matching by BBox intersection doesn’t work for elements that are within a form object.
I’m reading from the form as follows:
reader.Begin(obj, page.GetResourceDict())
# iterate through elements, grab text elements, look at element.GetBBox()
reader.End()
The bounding boxes aren’t far off from one another, so I wonder if there’s a missing transform somewhere?
To investigate further, could you please provide the following information:
1. Input file(s)
2. Generated output
3. Code and settings used to generate (2) from (1)
4. Screenshots showing the output, and clearly indicating what you expected to get instead, and also clearly indicating the application/browser being used to view.
Sorry for the delay. Here’s an input file with one line of text that has the exact issue: in.pdf (76.4 KB)
The text extraction code looks very similar to the ones in the samples without any special initialization params:
txt = TextExtractor()
txt.Begin(page) # Read the page
After iterating through the lines and words I call
word.GetBBox()
In this case, the word “Subject:” has the following coordinates (x1, y1, x2, y2):
(102.06000000000002, 506.8949599999999, 136.776, 519.4965599999999)
Using code very similar to the elementReader example, I iterate through all text elements, and in this case, the word “Subject:” is a single text element with the following coordinates (x1, y1, x2, y2):
(74.25, 523.66, 113.70, 534.83)
I would expect those to be the same values in default page coordinates as that’s how it appears to work for most text. In this case, the only difference is this text is in a form xobject. I also observed the form xobject gstate has a transform of the following value: (0.8800000000000001 0.0 0.0 0.8800000000000001 36.720000000000006 47.52). Is that related somehow?
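As a quick sanity check on that question, here is a minimal sketch in plain Python (no SDK calls) that applies the form matrix to the element BBox. It assumes the six numbers are a standard PDF matrix in (a b c d h v) order and that the element BBox is expressed in the form's own coordinate space. The helper names apply_matrix and transform_bbox are hypothetical and not part of any library.

```python
def apply_matrix(m, x, y):
    """Map a point through a PDF matrix (a, b, c, d, h, v)."""
    a, b, c, d, h, v = m
    return (a * x + c * y + h, b * x + d * y + v)


def transform_bbox(m, bbox):
    """Transform all four corners and return the enclosing (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    corners = [apply_matrix(m, x, y) for x in (x1, x2) for y in (y1, y2)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Values copied from this post.
form_matrix = (0.8800000000000001, 0.0, 0.0, 0.8800000000000001,
               36.720000000000006, 47.52)
element_bbox = (74.25, 523.66, 113.70, 534.83)
word_bbox = (102.06000000000002, 506.8949599999999,
             136.776, 519.4965599999999)

mapped = transform_bbox(form_matrix, element_bbox)
print("element bbox in page space:", mapped)
print("word bbox from extraction: ", word_bbox)
```

With these numbers the mapped x range comes out at about 102.06 to 136.776, which agrees with the Word.GetBBox() x range. The mapped y range is about 508.34 to 518.17, within roughly 1.5 points of the word's y range, so the remaining gap may simply be a difference in how the two APIs measure glyph height. That suggests, but does not prove, that the form matrix is the missing transform.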
If I open the PDF with Preview, re-export it as PDF, and then try again, the text extraction BBox lines up with the element BBox. However, I think that’s because Preview removes the form xobject and writes the text directly.