TextExtractor misses some glyphs

Product: Server SDK

Product Version: 11.9

Please give a brief summary of your issue:
TextExtractor doesn’t extract some characters for the attached doc.

Please describe your issue and provide steps to reproduce it:
When extracting words from the attached document, the characters ‘ro’ in ‘enroll’ and ‘rr’ and ‘re’ in ‘worry-free’ are missing from the output.

Please provide a link to a minimal sample where the issue is reproducible:
problematic_file.pdf (1.1 MB)

Hello, I’m AiRon, an automated tech support bot.
While you wait for one of our customer support representatives, I can help you with your question.


It seems you’re experiencing an issue with the TextExtractor not capturing certain glyphs in your PDF document. This can sometimes occur due to font encoding issues or the presence of special characters that aren’t correctly mapped during extraction.

Some things you can try:

  • Ensure that the PDF document is not using custom or embedded fonts that might not be fully supported.

  • Try using the TextExtractor with different options or settings, such as setting the text extraction mode to e_raw_text to see if it captures more characters.

  • Check if the document has any hidden layers or text that might be affecting extraction.

  • Use the ElementReader class to inspect the content stream of the PDF and see if the text is represented differently.




Hopefully, this helps you solve your problem while you wait for a human to get back to you.

Hello!

We quickly ran the provided PDF through our Text Extractor sample and the text appears to be fine there.

To investigate this further, could you provide a couple of things:

  1. A small runnable sample of the code you are using to extract the text
  2. The exact version of 11.9 you are running. You can get this by logging PDFNet.GetVersionString()

Hello,

We noticed the sample doesn’t set any extractor flags. We use the e_no_ligature_exp flag, and when we remove it, the text is extracted correctly. Shouldn’t ‘ro’, ‘rr’, and ‘re’ be extracted regardless of the flag?

We are on version 11.9.0-ee437c0.

Thanks!
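
For context on the flag in question: assuming e_no_ligature_exp does what its name suggests and turns off the expansion of ligature code points into their component letters, the difference it makes can be illustrated without the SDK. The sketch below is not Apryse's implementation. expandLigatures is a hypothetical helper, and its mapping covers only the Unicode Latin ligatures U+FB00 to U+FB04 and U+FB06.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of the SDK): replaces each Unicode Latin
// ligature code point with its component letters and copies every other
// character through unchanged.
std::u32string expandLigatures(const std::u32string &input)
{
	static const std::unordered_map<char32_t, std::u32string> expansions = {
		{U'\uFB00', U"ff"},  {U'\uFB01', U"fi"},  {U'\uFB02', U"fl"},
		{U'\uFB03', U"ffi"}, {U'\uFB04', U"ffl"}, {U'\uFB06', U"st"},
	};

	std::u32string output;
	output.reserve(input.size());
	for (const char32_t ch : input)
	{
		const auto it = expansions.find(ch);
		if (it != expansions.end())
			output += it->second; // ligature: emit its component letters
		else
			output += ch; // ordinary character: keep as-is
	}
	return output;
}
```

Under that assumption, turning expansion off would leave a ligature code point in the output rather than drop letters, which is why losing ‘ro’, ‘rr’, and ‘re’ looks unexpected.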

Thank you so much for sharing your version and clarifying the extraction flag. We are still working to reproduce this issue on our end. To investigate further, could you share a couple more things?

  1. We have been testing against our latest 11.9.1 version using .NET. Could you grab the latest release from this page and check whether the issue is still present? https://dev.apryse.com/nightly/stable/latest/11.9
  2. Could you share which OS and language you are experiencing this on?
  3. Have you made any other changes to the TextExtractor sample that might point to this issue? Could you share a small runnable code sample that still reproduces it?

We appreciate your patience while we look into this.

We tried with version 11.9.1-a48ef1a and we are still seeing the issue. We are on Linux and using C++. Here is the sample we used where the issue is reproducible:

#include <iostream>
#include <string>

// Header paths as used in the SDK's C++ samples.
#include <PDF/PDFDoc.h>
#include <PDF/TextExtractor.h>

// Prints every word found on page 1, one text line per output line.
void extractText(const std::wstring &filePath)
{
	pdftron::PDF::PDFDoc doc(filePath);
	doc.InitSecurityHandler();

	const auto page = doc.GetPage(1);
	pdftron::PDF::TextExtractor txt;
	const auto cropBox = page.GetCropBox();

	// The combination of flags with which the glyphs go missing.
	txt.Begin(page, &cropBox,
			  pdftron::PDF::TextExtractor::e_remove_hidden_text |
				  pdftron::PDF::TextExtractor::e_extract_using_zorder |
				  pdftron::PDF::TextExtractor::e_no_ligature_exp |
				  pdftron::PDF::TextExtractor::e_no_invisible_text);

	for (auto textLine = txt.GetFirstLine(); textLine.IsValid(); textLine = textLine.GetNextLine())
	{
		for (auto word = textLine.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
		{
			// Copy the word's characters into a UString, then convert for std::wcout.
			pdftron::UString text;
			text.Assign(word.GetString(), word.GetStringLen());
			const auto wordText = text.ConvertToNativeWString();
			std::wcout << wordText << L" ";
		}
		std::wcout << std::endl;
	}
}

The issue seems to be with the combination of the extraction flags used. When we only set e_no_ligature_exp, it seems to work as expected, but we need to also use those other flags.
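
One way to narrow down which flag or flag combination triggers the problem is to try every subset of the flags and record the ones that produce bad output. The following is a minimal sketch of that idea, kept independent of the SDK so it can be compiled on its own. NamedFlag, findFailingMasks, and describeMask are hypothetical names, and the looksCorrect callback is an assumed hook in which one would run the extraction with the given mask and check that words such as ‘enroll’ and ‘worry-free’ come out intact.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical pairing of a flag's display name with its bit value.
struct NamedFlag
{
	std::string name;
	std::uint32_t value;
};

// Tries every subset of `flags` and returns the combined masks for which
// `looksCorrect` reports a bad extraction.
std::vector<std::uint32_t> findFailingMasks(
	const std::vector<NamedFlag> &flags,
	const std::function<bool(std::uint32_t)> &looksCorrect)
{
	assert(flags.size() < 32); // subsets are enumerated with a 32-bit counter

	std::vector<std::uint32_t> failing;
	const std::uint32_t subsetCount = 1u << flags.size();
	for (std::uint32_t subset = 0; subset < subsetCount; ++subset)
	{
		// Build the flag mask for this subset.
		std::uint32_t mask = 0;
		for (std::size_t i = 0; i < flags.size(); ++i)
		{
			if (subset & (1u << i))
				mask |= flags[i].value;
		}
		if (!looksCorrect(mask))
			failing.push_back(mask);
	}
	return failing;
}

// Renders a mask as "flagA | flagB" for logging.
std::string describeMask(const std::vector<NamedFlag> &flags, std::uint32_t mask)
{
	std::string text;
	for (const auto &flag : flags)
	{
		if ((mask & flag.value) == flag.value && flag.value != 0)
		{
			if (!text.empty())
				text += " | ";
			text += flag.name;
		}
	}
	return text.empty() ? "(none)" : text;
}
```

In real use, the flags vector would hold the four TextExtractor enum values from the sample above, and looksCorrect would pass the mask as the third argument to txt.Begin. With four flags that is only 16 extraction runs, so the exhaustive search is cheap.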

Thank you for the code. I was able to reproduce the issue with the TextExtractor::e_remove_hidden_text flag and have forwarded this to the development team for further investigation.

I will get back to you with more information as soon as I have it.
