Text is not searchable when using OCRModule ApplyOCRJsonToPDF

jlucas · June 22, 2022, 4:41pm

Product: PDFTron.NETCore.Windows.x64 (NuGet), OCRModule

Product Version: 9.2.3.79556 (demo)

Please give a brief summary of your issue:

I am using the OCRModule.ApplyOCRJsonToPDF() API to apply OCR results from a third party. I have structured the JSON results as described in the OCR documentation (PDFTron Systems Inc. | Documentation). After calling this method, I am able to highlight, select, and copy the text. However, I am not able to search the text. Is the text expected to be searchable? If not, is there a recommended approach to make it searchable?

My PDF has many scanned images of handwritten text. I am using Adobe Acrobat Reader DC to view the document and to perform the searching.

Reach out if you need more info.

Thanks in advance
-Jade

kmirsalehi · June 22, 2022, 5:32pm

Hi Jade,

To investigate further could you please provide the following information:

1. Input file(s)
2. Generated output file(s)
3. Code and settings used to generate (2) from (1)
4. What is the exact search term, and what is the expected result (screenshot showing page and text)

jlucas · June 23, 2022, 1:48am

Thanks kmirsalehi for your timely response. It is very appreciated.

See the attached example program in C#. It includes a sample PDF called “example_redacted.pdf” that I am applying external OCR results to, and a file called “ocr_results.json” containing the external OCR output. Run the program to generate an output file. I am trying to apply the word “Description” via the OCRModule.ApplyOCRJsonToPDF() method. Using Adobe Acrobat Reader DC, you will notice that in “example_redacted.pdf” you are not able to select the word “Description”. After you run the program, you can select and copy the word “Description” in the output PDF. However, searching for “Description” does not yield any results. The expected result is that searching for “Description” in the output PDF finds 1 instance of the word. I have included two screenshots of what I am experiencing. Let me know if more information is needed and whether you are able to replicate this on your end.

Kindly,
-Jade

ApplyOcrResultExample.zip (246.2 KB)

kmirsalehi · June 23, 2022, 6:55pm

Hi Jade,

Adobe interprets the reading order differently. If you open the file in Xodo, IE, or Chrome, you will see that the text is searchable. However, in Adobe, the text is being displayed as “Descripti2o7n:” (instead of “Description 27:”), so a search for “Description” finds no match.

Please note that text extraction/ordering is not defined at all in the ISO PDF standard. In fact, there is no concept of sentence, paragraph, tables, or anything similar, in a typical PDF file. This means each PDF vendor is left to their own design/implementation, and will extract text differently.

Therefore, ordering is not guaranteed to match the order that a typical reader would follow. This is mainly due to the lack of semantic information in a PDF, and of course is highly dependent on the placement/ordering of text. The reading order of a magazine, a newspaper article, and an academic article are all quite different, and different users may have different expectations of reading order.
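
To make the difference concrete, here is a minimal Python sketch of how two extraction strategies can turn the same words into different strings. It is a toy model only, not how Adobe, Xodo, or the PDFTron SDK actually extract text. The Word class, the glyph widths, and the coordinates are all hypothetical, and the sketch assumes that the boxes for “Description” and “27:” overlap horizontally, which the thread does not confirm.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR word: text plus a simple fixed-pitch layout."""
    text: str
    x: float            # left edge of the first glyph
    glyph_width: float  # horizontal advance per glyph (assumed uniform)


def glyphs(word):
    """Return (x position, character) pairs for every glyph in the word."""
    return [(word.x + i * word.glyph_width, ch) for i, ch in enumerate(word.text)]


def extract_in_stream_order(words):
    """Strategy A: keep words in the order they were written to the page."""
    return " ".join(w.text for w in words)


def extract_by_glyph_position(words):
    """Strategy B: sort every glyph on the line by its x position."""
    all_glyphs = [g for w in words for g in glyphs(w)]
    all_glyphs.sort(key=lambda g: g[0])
    return "".join(ch for _, ch in all_glyphs)


# Assumed layout: "27:" starts before "Description" ends, so the boxes overlap.
words = [
    Word("Description", x=0.0, glyph_width=10.0),
    Word("27:", x=85.0, glyph_width=10.0),
]

stream_text = extract_in_stream_order(words)
position_text = extract_by_glyph_position(words)

print(stream_text)                       # Description 27:
print(position_text)                     # Descripti2o7n:
print("Description" in stream_text)      # True
print("Description" in position_text)    # False
```

Under these assumptions, the second strategy interleaves the glyphs of the two words, so a literal search for “Description” fails even though the text can still be selected and copied. If that is what is happening here, checking the third-party OCR results for overlapping word boxes may be a reasonable first step.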