Product: Server SDK
Product Version: 11.9.0
Please give a brief summary of your issue:
Poor extraction of Type3 font words and textlines with TextExtractor
We have noticed a couple of issues with Type3 text extraction.
- Words get chopped up, and \n characters are extracted.
For example, in problematic-file.pdf, “Prescribing Information” is extracted as “Prescri bi ng I nformation”, and “INDICATION” is extracted as “I NDICATION\n”. There are other instances of chopped-up words in the file.
- A text line gets broken up when the font color (?) changes, which messes up the reading order.
In problematic-file.pdf, “*As of [July 2025.]” is extracted as "*As of J u ly 2025.
[
] "
This issue is also seen in Frame1.pdf where the single text line “The quick brown fox jumped over the cat and hit his head.” is extracted as “The quick brown
jumped over the
and hit his head
fox
cat
.”
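To make the expected result concrete, here is a minimal sketch of the ordering and spacing we would expect for one visual line. It does not use the SDK and is not how TextExtractor works internally. The `Fragment` struct, the `joinLine` helper, the `gapTol` threshold, and all coordinates are hypothetical and only illustrate the idea: sort the fragments left to right, and insert a space only where the horizontal gap between neighbours is larger than a tolerance.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one extracted fragment: its text plus the
// horizontal extent of its bounding box (not an SDK type).
struct Fragment {
    std::string text;
    double x0;  // left edge
    double x1;  // right edge
};

// Sorts the fragments of a single visual line from left to right and joins
// them, adding a space only where the gap to the previous fragment exceeds
// gapTol (an assumed, tunable threshold in page units).
std::string joinLine(std::vector<Fragment> frags, double gapTol)
{
    std::stable_sort(frags.begin(), frags.end(),
                     [](const Fragment &a, const Fragment &b) { return a.x0 < b.x0; });

    std::string out;
    for (std::size_t i = 0; i < frags.size(); ++i) {
        if (i > 0 && frags[i].x0 - frags[i - 1].x1 > gapTol)
            out += ' ';
        out += frags[i].text;
    }
    return out;
}
```

With made-up coordinates for the Frame1.pdf fragments, this yields the sentence in its visual order, and tightly spaced pieces such as “Prescri”, “bi”, “ng” are joined back into one word.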
We are on version 11.9.0-ee437c0. We also tried 11.9.1, but with that version all the words come out reversed. Here is the code snippet with which the issue is reproducible:
void extractText(const std::wstring &filePath)
{
    pdftron::PDF::PDFDoc doc(filePath);
    doc.InitSecurityHandler();
    const auto page = doc.GetPage(1);
    pdftron::PDF::TextExtractor txt;
    const auto cropBox = page.GetCropBox();
    txt.Begin(page, &cropBox,
              pdftron::PDF::TextExtractor::e_remove_hidden_text |
                  pdftron::PDF::TextExtractor::e_extract_using_zorder |
                  pdftron::PDF::TextExtractor::e_no_ligature_exp);
    for (auto textLine = txt.GetFirstLine(); textLine.IsValid(); textLine = textLine.GetNextLine())
    {
        for (auto word = textLine.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
        {
            pdftron::UString text;
            text.Assign(word.GetString(), word.GetStringLen());
            const auto wordText = text.ConvertToNativeWString();
            std::wcout << wordText << L" ";
        }
        std::wcout << std::endl;
    }
}
Please provide a link to a minimal sample where the issue is reproducible:
Both files are generated using Figma.
problematic-file.pdf (525.2 KB)
problematic-file2.pdf (33.8 KB)