TextExtractor misses some glyphs

Product: Server SDK

Product Version: 11.9

Please give a brief summary of your issue:
TextExtractor doesn’t extract some characters for the attached doc.

Please describe your issue and provide steps to reproduce it:
When extracting words from the attached document, the characters ‘ro’ from ‘enroll’, and ‘rr’ and ‘re’ from ‘worry-free’, are missing from the output.

Please provide a link to a minimal sample where the issue is reproducible:
problematic_file.pdf (1.1 MB)

Hello, I’m AiRon, an automated tech support bot.
While you wait for one of our customer support representatives, I can help you with your question.


It seems you’re experiencing an issue with the TextExtractor not capturing certain glyphs in your PDF document. This can sometimes occur due to font encoding issues or the presence of special characters that aren’t correctly mapped during extraction.

Some things you can try:

  • Ensure that the PDF document is not using custom or embedded fonts that might not be fully supported.

  • Try using the TextExtractor with different options or settings, such as setting the text extraction mode to e_raw_text to see if it captures more characters.

  • Check if the document has any hidden layers or text that might be affecting extraction.

  • Use the ElementReader class to inspect the content stream of the PDF and see if the text is represented differently.




Hopefully, this helps you solve your problem while you wait for a human to get back to you.

Hello!

We quickly ran the provided PDF through our Text Extractor sample and the text appears to be fine there.

To investigate this further, could you provide a couple of things:

  1. A small runnable sample of the code you are using to extract the text
  2. The exact 11.9 build you are running. You can get this by logging PDFNet.GetVersionString()

Hello,

We noticed the sample doesn’t set any extractor flags. We use the e_no_ligature_exp flag, and when we remove it, the text is extracted correctly. Shouldn’t ‘ro’, ‘rr’, and ‘re’ be extracted regardless of the flag?
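
To illustrate what we mean by ligature expansion, here is a rough sketch in plain Python using only the standard library, not the SDK. It assumes a ligature is stored as a single code point (the standard Unicode ones, U+FB00 to U+FB06) and shows the two outcomes we would expect: expanded into its letters, or kept as one glyph. The helper names are hypothetical, and drop_ligatures only imitates the symptom we see. It is not a claim about how TextExtractor works internally. Note also that ‘ro’, ‘rr’, and ‘re’ have no standard Unicode ligature code points, so whether the font in our document uses custom ligature glyphs for them is a guess on our side.

```python
import unicodedata

# Standard Latin ligatures in the Alphabetic Presentation Forms block
# (ff, fi, fl, ffi, ffl, long-s-t, st).
LIGATURE_FIRST = "\ufb00"
LIGATURE_LAST = "\ufb06"


def is_ligature(ch: str) -> bool:
    """Return True for the standard single-code-point Latin ligatures."""
    return LIGATURE_FIRST <= ch <= LIGATURE_LAST


def expand_ligatures(text: str) -> str:
    """Replace each ligature with its component letters (hypothetical helper)."""
    # NFKC applies the compatibility decomposition, e.g. U+FB01 -> "fi".
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_ligature(ch) else ch
        for ch in text
    )


def drop_ligatures(text: str) -> str:
    """Imitate the symptom: ligature glyphs vanish from the output (hypothetical helper)."""
    return "".join(ch for ch in text if not is_ligature(ch))


if __name__ == "__main__":
    sample = "o\ufb03ce \ufb01le"      # "office file" written with ligature glyphs
    print(expand_ligatures(sample))    # ligatures expanded into letters
    print(drop_ligatures(sample))      # ligatures lost, like our missing characters
```

Our expectation was that disabling expansion would behave like leaving the ligature in place as one glyph, not like drop_ligatures above.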

We are on version 11.9.0-ee437c0.

Thanks!