Poor extraction of Type3 font words and textlines

Product: Server SDK

Product Version: 11.9.0

Please give a brief summary of your issue:
Poor extraction of Type3 font words and textlines with TextExtractor

We have noticed a couple of issues with the Type3 text extraction.

  1. Words are split apart, and stray \n characters are extracted.
    For example, in problematic-file.pdf, “Prescribing Information” is extracted as “Prescri bi ng I nformation”, and “INDICATION” is extracted as “I NDICATION\n”. The file contains other instances of split words.

  2. Text lines are broken up where the font color (?) changes, which scrambles the reading order.
    In problematic-file.pdf, “*As of [July 2025.]” is extracted as "*As of J u ly 2025.
    [
    ] "
    This issue is also seen in Frame1.pdf where the single text line “The quick brown fox jumped over the cat and hit his head.” is extracted as “The quick brown
    jumped over the
    and hit his head
    fox
    cat
    .”

We are on version 11.9.0-ee437c0. We also tried 11.9.1, but with that version all the words come out reversed. Here is the code snippet that reproduces the issue:

void extractText(const std::wstring &filePath)
{
	pdftron::PDF::PDFDoc doc(filePath);
	doc.InitSecurityHandler();

	const auto page = doc.GetPage(1);
	pdftron::PDF::TextExtractor txt;
	const auto cropBox = page.GetCropBox();
	txt.Begin(page, &cropBox,
			  pdftron::PDF::TextExtractor::e_remove_hidden_text | pdftron::PDF::TextExtractor::e_extract_using_zorder | pdftron::PDF::TextExtractor::e_no_ligature_exp);

	for (auto textLine = txt.GetFirstLine(); textLine.IsValid(); textLine = textLine.GetNextLine())
	{
		for (auto word = textLine.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
		{
			pdftron::UString text;
			text.Assign(word.GetString(), word.GetStringLen());
			const auto wordText = text.ConvertToNativeWString();
			std::wcout << wordText << L" ";
		}
		std::wcout << std::endl;
	}
}

Please provide a link to a minimal sample where the issue is reproducible:
Both files are generated using Figma.
problematic-file.pdf (525.2 KB)
problematic-file2.pdf (33.8 KB)


Hello, I’m AiRon, an automated tech support bot.
While you wait for one of our customer support representatives, I can help you with your question.


It seems you’re encountering issues with extracting text from PDFs that use Type3 fonts, resulting in incorrectly ordered or chopped text. This can happen due to the unique way Type3 fonts are embedded and rendered in PDF documents.

Some things you can try:

  • Ensure that you are using the latest version of the Apryse SDK, as updates often include fixes for text extraction issues.

  • Try experimenting with different TextExtractor flags to see if they improve results, such as omitting e_extract_using_zorder or e_no_ligature_exp.

  • Check if the problematic PDFs have any specific font encoding issues or anomalies that might affect extraction.

  • Consider using the ElementReader class, which might provide more control over text processing, especially with complex font types like Type3.

  • Review your PDFs for any unusual text transformations or graphic states that might interfere with text extraction.




Hopefully, this helps you solve your problem while you wait for a human to get back to you.


Hi Kez,

I’ve reproduced the issues and sent them to the developers for a closer look. Additionally, if you run TextExtractor without the e_extract_using_zorder flag, the text in the second file is extracted as a single line rather than being split across multiple lines.
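
If z-order extraction has to stay enabled, one possible stopgap for the reading-order problem is to re-sort the extracted fragments by position before joining them. The following is only a minimal sketch, not SDK code: it assumes each fragment's text and the left edge and baseline of its bounding box have already been collected from the extractor, and the Fragment struct, the rebuildLines helper, and the tolerance value are all hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical container for one extracted fragment. x and y are assumed to be
// the left edge and baseline of the fragment's bounding box, in page units.
struct Fragment {
    double x;
    double y;
    std::wstring text;
};

// Groups fragments whose baselines lie within `tolerance` of each other into
// one visual line, orders the lines top to bottom (PDF y grows upward), and
// orders the fragments inside each line left to right.
std::vector<std::wstring> rebuildLines(std::vector<Fragment> fragments,
                                       double tolerance = 2.0)
{
    // Sort top to bottom so fragments of the same visual line become adjacent.
    std::stable_sort(fragments.begin(), fragments.end(),
                     [](const Fragment &a, const Fragment &b) { return a.y > b.y; });

    // Start a new group whenever the baseline moves by more than the tolerance.
    std::vector<std::vector<Fragment>> groups;
    for (const auto &f : fragments) {
        if (groups.empty() || std::fabs(groups.back().front().y - f.y) > tolerance)
            groups.emplace_back();
        groups.back().push_back(f);
    }

    // Within each group, restore left-to-right order and join with spaces.
    std::vector<std::wstring> lines;
    for (auto &group : groups) {
        std::stable_sort(group.begin(), group.end(),
                         [](const Fragment &a, const Fragment &b) { return a.x < b.x; });
        std::wstring line;
        for (const auto &f : group) {
            if (!line.empty())
                line += L' ';
            line += f.text;
        }
        lines.push_back(line);
    }
    return lines;
}
```

Joining with a single space means punctuation that arrives as its own fragment (such as the trailing “.” in Frame1.pdf) ends up separated from the preceding word, so a real implementation would probably compare the gap between neighbouring fragments before inserting a space.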
