Product: PDFNet - Python bindings
Product Version: 10.1.0
Please give a brief summary of your issue:
(Think of this as an email subject)
PDFDraw.Export(page) vs Image(el.GetXObject()).ExportAsPng() give different OCR word coordinates
Please describe your issue and provide steps to reproduce it:
I’m trying to get the JSON-encoded results from OCR.
If I use PDFDraw on the page and call Export("filename.png") to get a PNG, then feed that to OCRModule.GetOCRJsonFromImage(), the word boundary locations (x, y) in the returned JSON are good.
If I instead use the element’s GetXObject() (after checking that it’s an image) and then Image.ExportAsPng(), I get what looks like the same image as above, but after running OCRModule.GetOCRJsonFromImage() on it, each word’s coordinates in the returned JSON are off.
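One check that might narrow this down is whether the two PNGs really have the same pixel dimensions, since they only look the same on screen. The following is a minimal sketch that reads the width and height straight from the PNG header using only the standard library; it does not use PDFNet, and the helper name `png_pixel_size` is made up for illustration.

```python
import struct

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_pixel_size(data):
    """Return (width, height) in pixels, read from the IHDR chunk of PNG bytes."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width and height as big-endian unsigned 32-bit integers.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])
```

Reading each exported file with `open(path, "rb").read()` and passing the bytes to `png_pixel_size` would show whether the two exports differ in size. If they do, and the ratio is close to the scale factor needed for the coordinates, that could suggest the two paths produce images at different resolutions.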
The image in question is a whole-page image, but I’d prefer to use the element API in case an image does not cover the entire page; I don’t want OCR to also pick up non-image content.
I can approximate the correct positions by applying a scale factor of 1.333 (96/72) to the x,y coordinates. I’ve tried changing the DPI settings, but that doesn’t seem to have any effect.
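For reference, the scaling workaround can be sketched roughly as below. This assumes the OCR result has already been parsed with `json.loads` and that word positions live under keys named `x`, `y` and `length`; those key names and the helper `scale_word_coords` are assumptions for illustration, not confirmed against the actual JSON schema.

```python
def scale_word_coords(node, factor, keys=("x", "y", "length")):
    """Return a copy of parsed OCR JSON with numeric values under `keys` scaled.

    The key names are assumed; adjust them to match the real OCR output.
    """
    if isinstance(node, dict):
        scaled = {}
        for key, value in node.items():
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if key in keys and is_number:
                # Multiply coordinate-like values by the scale factor.
                scaled[key] = value * factor
            else:
                # Recurse into nested objects and arrays; leave other values alone.
                scaled[key] = scale_word_coords(value, factor, keys)
        return scaled
    if isinstance(node, list):
        return [scale_word_coords(item, factor, keys) for item in node]
    return node
```

Calling `scale_word_coords(parsed, 96 / 72)` applies the 1.333 factor to every matching value and leaves the original structure untouched. Whether the factor or its inverse is the right one depends on which direction the coordinates are off.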
The origin of the full page image is 0,0.