How do I detect if a PDF was created by OCR?

Ivanho · April 4, 2012, 9:38pm

Q: I am looking for a way to identify if a PDF Document was created by OCR. Are there any functions or properties available in the PDFNet SDK that will aid in this endeavor? I have been unable to find any.

A:

PDFNet doesn’t offer an out-of-the-box function to detect such files. However, you can implement your own function to detect OCRed files.

Typically, an OCRed PDF has one or several images that represent the rasterized version of the original document. The text extracted by an OCR library is then added to the document for purposes such as text selection in a third-party viewer. The text can be placed above or underneath the image(s) as long as it doesn’t alter the look of the file. Typically, the PDF file sets the text rendering mode to “neither fill nor stroke” to make the text invisible.

With that said, you can use the following rule to detect searchable image PDF files:

A page contains only image elements (possibly monochrome) and hidden text elements (those with element.GState.GetTextRenderMode() == e_invisible_text).

To implement this check, you could use the approach shown in the ElementReaderAdv sample:
http://www.pdftron.com/pdfnet/samplecode.html#ElementReaderAdv
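
As a rough illustration of the rule above, here is a minimal Python sketch of the per-page decision only. It does not call PDFNet: the PageElement record, the kind strings, and the looks_like_ocr_page function are hypothetical names made up for this example. It assumes you have already walked the page with an element reader (as in the ElementReaderAdv sample) and recorded each element's type and, for text, its rendering mode. The value 3 is the PDF text rendering mode for "neither fill nor stroke", which is what e_invisible_text refers to.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# PDF text rendering mode 3 means "neither fill nor stroke" (invisible text).
INVISIBLE_TEXT_MODE = 3


@dataclass(frozen=True)
class PageElement:
    """Hypothetical record for one page element collected during traversal."""
    kind: str                               # e.g. "image", "text", "path"
    text_render_mode: Optional[int] = None  # only meaningful for text


def looks_like_ocr_page(elements: Iterable[PageElement]) -> bool:
    """Apply the rule: only images plus hidden text, and at least one of each."""
    has_image = False
    has_hidden_text = False
    for el in elements:
        if el.kind == "image":
            has_image = True
        elif el.kind == "text":
            if el.text_render_mode != INVISIBLE_TEXT_MODE:
                return False  # visible text: not a pure image-plus-hidden-text page
            has_hidden_text = True
        else:
            return False      # any other element type breaks the "only" condition
    return has_image and has_hidden_text


if __name__ == "__main__":
    scanned = [PageElement("image"), PageElement("text", 3), PageElement("text", 3)]
    native = [PageElement("text", 0), PageElement("path")]
    print(looks_like_ocr_page(scanned))
    print(looks_like_ocr_page(native))
```

This version is deliberately strict: a single path or visible text element makes the page fail the test. Real scanned files sometimes contain clipping paths or visible stamps, so you may want to relax the last branch for your own documents.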