Poor extraction of Type3 font words and textlines

kdorji · December 24, 2025, 8:22am

Product: Server SDK

Product Version: 11.9.0

Please give a brief summary of your issue:
Poor extraction of Type3 font words and textlines with TextExtractor

We have noticed a couple of issues with the Type3 text extraction.

Words get chopped up, and stray \n characters appear in the extracted text.
For example, in problematic-file.pdf, “Prescribing Information” is extracted as “Prescri bi ng I nformation”. “INDICATION” is extracted as “I NDICATION\n”. There are other instances of chopped off word extraction in the file.
Text lines get broken up when the font color (?) changes, which scrambles the reading order.
In problematic-file.pdf, “*As of [July 2025.]” is extracted as "*As of J u ly 2025.
[
] "
This issue is also seen in Frame1.pdf where the single text line “The quick brown fox jumped over the cat and hit his head.” is extracted as “The quick brown
jumped over the
and hit his head
fox
cat
.”

We are on version 11.9.0-ee437c0. We also tried 11.9.1, but with that version all the words come out reversed. Here is the code snippet that reproduces the issue:

#include <iostream>
#include <string>
#include <PDF/PDFDoc.h>
#include <PDF/TextExtractor.h>

// Prints every word on page 1, one extracted text line per output line.
// Assumes PDFNet has already been initialized by the caller.
void extractText(const std::wstring &filePath)
{
	pdftron::PDF::PDFDoc doc(filePath);
	doc.InitSecurityHandler();

	const auto page = doc.GetPage(1);
	pdftron::PDF::TextExtractor txt;
	const auto cropBox = page.GetCropBox();
	// Clip extraction to the crop box and apply the three flags under test.
	txt.Begin(page, &cropBox,
			  pdftron::PDF::TextExtractor::e_remove_hidden_text | pdftron::PDF::TextExtractor::e_extract_using_zorder | pdftron::PDF::TextExtractor::e_no_ligature_exp);

	for (auto textLine = txt.GetFirstLine(); textLine.IsValid(); textLine = textLine.GetNextLine())
	{
		for (auto word = textLine.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
		{
			// GetString() is not null-terminated, so copy exactly GetStringLen() characters.
			pdftron::UString text;
			text.Assign(word.GetString(), word.GetStringLen());
			const auto wordText = text.ConvertToNativeWString();
			std::wcout << wordText << L" ";
		}
		std::wcout << std::endl;
	}
}

Please provide a link to a minimal sample where the issue is reproducible:
Both files are generated using Figma.
problematic-file.pdf (525.2 KB)
problematic-file2.pdf (33.8 KB)

system · December 24, 2025, 8:23am

Hello, I’m AiRon, an automated tech support bot.
While you wait for one of our customer support representatives, I can help you with your question.

It seems you’re encountering issues with extracting text from PDFs that use Type3 fonts, resulting in incorrectly ordered or chopped text. This can happen due to the unique way Type3 fonts are embedded and rendered in PDF documents.

Some things you can try:

Ensure that you are using the latest version of the Apryse SDK, as updates often include fixes for text extraction issues.

Try experimenting with different TextExtractor flags to see if they improve results, such as omitting e_extract_using_zorder or e_no_ligature_exp.

Check if the problematic PDFs have any specific font encoding issues or anomalies that might affect extraction.

Consider using the ElementReader class, which might provide more control over text processing, especially with complex font types like Type3.

Review your PDFs for any unusual text transformations or graphic states that might interfere with text extraction.
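
One way to act on the flag suggestion above is to run the same page through every combination of the flags in question and compare the output. The following is only a minimal sketch and not part of the Apryse SDK: flagCombinations is a hypothetical helper that does nothing but the bit arithmetic, and it assumes each flag occupies its own bit, as the | operators in the original snippet suggest.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not an SDK function): returns every OR-combination
// of the given flags, starting with 0 (no flags set).
// Assumes each flag occupies its own bit.
std::vector<std::uint32_t> flagCombinations(const std::vector<std::uint32_t> &flags)
{
	std::vector<std::uint32_t> combos;
	const std::size_t count = std::size_t{1} << flags.size();
	combos.reserve(count);
	for (std::size_t mask = 0; mask < count; ++mask)
	{
		std::uint32_t combined = 0;
		for (std::size_t i = 0; i < flags.size(); ++i)
		{
			// Bit i of mask decides whether flags[i] is part of this combination.
			if (mask & (std::size_t{1} << i))
				combined |= flags[i];
		}
		combos.push_back(combined);
	}
	return combos;
}
```

Called with the three flags from the original snippet (e_remove_hidden_text, e_extract_using_zorder, e_no_ligature_exp), this yields eight values. Each one can be passed as the third argument to txt.Begin(page, &cropBox, flags) to see which combination, if any, avoids the split lines.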

Here are some resources that might help you:

Extracting text from a PDF on Server/Desktop - This guide covers using the TextExtractor class and flags for better text extraction.

Extracting images from a PDF on Server/Desktop - Useful if you encounter issues with embedded graphics or images.

Extract Text, Read, Parse PDF - TextExtract - Sample code for using Apryse SDK to extract text, with examples in various programming languages.

APIs for basic document operations on Server/Desktop - Overview of utility classes like ElementReader and TextExtractor.

Server/Desktop PDF Content Extraction Library - Detailed guide on extracting various content types from PDFs.

Hopefully, this helps you solve your problem while you wait for a human to get back to you.

kmirsalehi · December 24, 2025, 7:04pm

Hi Kez,

I’ve reproduced the issues and sent them to the devs to take a closer look. Additionally, if you run the text extractor without z-order (that is, without e_extract_using_zorder), it picks up the text in the 2nd file as a single sentence rather than breaking it up into multiple lines.
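
Until a fix is available, one possible stopgap for the split-line case is to ignore the extractor's line grouping and rebuild lines from word positions. The sketch below is an illustration under stated assumptions, not SDK behaviour: WordBox, rebuildLines, and the tolerance parameter are hypothetical names, the coordinates are expected to be copied by the caller from each word's bounding box, and the text is assumed to be horizontal and left-to-right in PDF user space (y grows upward).

```cpp
#include <algorithm>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical record: the caller fills it from each extracted word.
struct WordBox
{
	std::wstring text;
	double x; // left edge of the word's bounding box
	double y; // bottom edge of the word's bounding box
};

// Groups words whose bottom edges lie within `tolerance` of the first word
// in the group, then orders each group left to right.
// Lines are returned top to bottom (largest y first).
std::vector<std::wstring> rebuildLines(std::vector<WordBox> words, double tolerance)
{
	// Sort top to bottom so that words of the same line end up adjacent.
	std::stable_sort(words.begin(), words.end(),
					 [](const WordBox &a, const WordBox &b) { return a.y > b.y; });

	std::vector<std::vector<WordBox>> groups;
	for (const auto &w : words)
	{
		if (groups.empty() || std::fabs(groups.back().front().y - w.y) > tolerance)
			groups.emplace_back();
		groups.back().push_back(w);
	}

	std::vector<std::wstring> lines;
	for (auto &g : groups)
	{
		std::stable_sort(g.begin(), g.end(),
						 [](const WordBox &a, const WordBox &b) { return a.x < b.x; });
		std::wstring line;
		for (const auto &w : g)
		{
			if (!line.empty())
				line += L' ';
			line += w.text;
		}
		lines.push_back(line);
	}
	return lines;
}
```

The tolerance would need tuning per document; a fraction of the font size is a plausible starting point. This only addresses reading order and does nothing for words that are chopped into fragments.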