Space character skipped during extraction

Product: Apryse SDK

Product Version: 11.9.0-ee437c0

Please give a brief summary of your issue:
TextExtractor skips certain space characters and concatenates words

Please describe your issue and provide steps to reproduce it:
In the attached file, certain space characters are skipped, so text that should be extracted as two separate words is extracted as one.

Actual output:
Lorsque vous pensez que vous avez un saignement
que vous discutiez avec votre médecin et que vous suiviez

Expected output:
Lorsque vous pense z que vous avez un saignement
que vous discutiez avec votre médecin et que vous suivie z

i.e.,
“pensez” should be “pense z”
“suiviez” should be “suivie z”
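
For anyone comparing the two outputs, a small standalone check along these lines can pinpoint which extracted words are really two expected words joined together. This is only a sketch: `splitWords` and `findMergedWords` are hypothetical helper names, not part of the Apryse SDK, and it assumes words are separated by plain whitespace and that the lines differ only by merged word pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a line on whitespace, mirroring the
// word-by-word output printed by the extraction loop.
std::vector<std::string> splitWords(const std::string &line)
{
    std::istringstream stream(line);
    std::vector<std::string> words;
    for (std::string word; stream >> word;)
    {
        words.push_back(word);
    }
    return words;
}

// Hypothetical helper: return every extracted word that equals two
// adjacent expected words concatenated without the space between them.
std::vector<std::string> findMergedWords(const std::string &actualLine,
                                         const std::string &expectedLine)
{
    const auto actual = splitWords(actualLine);
    const auto expected = splitWords(expectedLine);

    std::vector<std::string> merged;
    std::size_t e = 0; // index into the expected words
    for (std::size_t a = 0; a < actual.size() && e < expected.size(); ++a)
    {
        if (actual[a] == expected[e])
        {
            ++e; // words agree, move on
        }
        else if (e + 1 < expected.size() &&
                 actual[a] == expected[e] + expected[e + 1])
        {
            merged.push_back(actual[a]); // a space was dropped here
            e += 2;
        }
        else
        {
            break; // the lines differ in some other way; stop comparing
        }
    }
    return merged;
}
```

Applied to the first pair of lines above, this would report “pensez” as the only merged word.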

Steps to reproduce:
TextExtractor flags used: e_remove_hidden_text, e_extract_using_zorder, e_no_ligature_exp, e_no_invisible_text

for (auto textLine = textExtractor.GetFirstLine(); textLine.IsValid(); textLine = textLine.GetNextLine())
{
    for (auto word = textLine.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
    {
        pdftron::UString text;
        text.Assign(word.GetString(), word.GetStringLen());
        const auto wordText = text.ConvertToNativeWString();
        std::wcout << wordText << L" ";
    }
    std::wcout << std::endl;
}

Please provide a link to a minimal sample where the issue is reproducible:
problem_file.pdf (2.2 MB)


Could you include how you’re passing your input to the TextExtractor?


I have tried to share a simplified version of our setup:

void extract(const std::wstring &filePath)
{
    // logic to initialize pdftron 
    
    pdftron::PDF::PDFDoc doc;
    if (!filePath.empty())
    {
        doc = pdftron::PDF::PDFDoc(filePath);
        if (!doc.InitSecurityHandler())
        {
            throw pdftron::Common::Exception();
        }
    }

    pdftron::PDF::TextExtractor textExtractor;
    for (auto pageIter = doc.GetPageIterator(); pageIter.HasNext(); pageIter.Next())
    {
        const auto &currentPage = pageIter.Current();
        textExtractor.Begin(currentPage, currentPage.GetCropBox(), textExtractorFlags()); // flags used: e_remove_hidden_text, e_extract_using_zorder, e_no_ligature_exp, e_no_invisible_text

        for (auto textLine = textExtractor.GetFirstLine(); textLine.IsValid(); textLine = textLine.GetNextLine())
        {
            for (auto word = textLine.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
            {
                pdftron::UString text;
                text.Assign(word.GetString(), word.GetStringLen());
                const auto wordText = text.ConvertToNativeWString();
                std::wcout << wordText << L" ";
            }
            std::wcout << std::endl;
        }
    }
}

Thanks for the clarification. I’ve reproduced this and have forwarded it to our development team. I’ll reach out when I have any updates from them.
