GetOCRJsonFromImage - different returned word coordinates based on how image was extracted from PDF

vanye · September 16, 2023, 3:23pm

Product: PDFNet - Python bindings

Product Version: 10.1.0

Please give a brief summary of your issue:
(Think of this as an email subject)
PDFDraw.Export() vs Image(el.GetXObject()).ExportAsPng() give different word coordinates

Please describe your issue and provide steps to reproduce it:

I’m trying to get the JSON encoded results from OCR.

If I use the page’s PDFDraw() → Export(“filename.png”) to get a PNG and feed that to OCRModule.GetOCRJsonFromImage(), the word boundaries in the returned JSON have correct (x,y) locations.

If I instead use the element’s GetXObject() (after checking it’s an “image”) and then image.ExportAsPng(), I get an image that is visually the same as the one above, but after running OCRModule.GetOCRJsonFromImage() on it, each word’s coordinates in the returned JSON are off.

The image in question is a whole page image. But I’d prefer to use the element API in case the image is not the entire page - I don’t want to double-dip non-image content.

I can approximate the correct positions by applying a scale factor of about 1.333 (96/72) to the x,y coordinates. I’ve tried changing the DPI settings, but that doesn’t seem to affect the result.
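
For reference, the scaling workaround described above amounts to something like the following. This is only a sketch: the function name and the (x, y, width, height) box layout are made up for illustration, and whether the factor should be multiplied or divided depends on which export the coordinates came from.

```python
def scale_word_box(box, factor=96 / 72):
    """Uniformly scale an (x, y, width, height) box.

    The box layout and the default factor are assumptions for
    illustration; 96/72 is the ratio mentioned in the post.
    """
    x, y, w, h = box
    return (x * factor, y * factor, w * factor, h * factor)


# Hypothetical word box taken from the OCR JSON of the ExportAsPng() image.
scaled = scale_word_box((72.0, 144.0, 36.0, 18.0))
print(scaled)
```

A fixed factor like this only works while the image fills the whole page with no offset or rotation, which is why it is an approximation rather than a general fix.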

The origin of the full page image is 0,0.

vanye · September 17, 2023, 10:49am

So I guess the root question is how do I map a coordinate from inside an element (image) to its position on the document page?
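
One general way to frame this question: in PDF, an image is painted into a unit square that the current transformation matrix (CTM) maps onto the page, so a pixel position can be normalized to that unit square and then pushed through the matrix. The sketch below is plain Python with no PDFNet calls. The function name is hypothetical, and it assumes the OCR coordinates are pixels with the origin at the top-left of the image and y increasing downward, which should be checked against the actual JSON. In PDFNet the six matrix values would presumably come from the image element's CTM (for example via Element.GetCTM()); that wiring is not shown here.

```python
def image_pixel_to_page(px, py, img_width, img_height, ctm):
    """Map an image pixel to PDF page space.

    px, py      -- pixel position, origin top-left, y pointing down (assumed)
    img_width,
    img_height  -- image size in pixels
    ctm         -- (a, b, c, d, h, v), the matrix that places the image's
                   unit square on the page
    """
    a, b, c, d, h, v = ctm
    u = px / img_width           # 0..1 across the image, left to right
    t = 1.0 - py / img_height    # flip: image rows run down, PDF y runs up
    return (a * u + c * t + h, b * u + d * t + v)


# Hypothetical case: a 2550x3300 px scan covering a 612x792 pt page.
ctm = (612.0, 0.0, 0.0, 792.0, 0.0, 0.0)
print(image_pixel_to_page(1275, 1650, 2550, 3300, ctm))
```

Because the mapping uses the full matrix, it also covers images that are offset, scaled non-uniformly, or rotated on the page. Page rotation or a crop box with a non-zero origin would need an extra adjustment that is not handled here.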