TextExtractor picks up certain text twice

Product: Apryse SDK

Product Version: 11.9.0-ee437c0

Please give a brief summary of your issue:
Some text gets extracted twice by TextExtractor

Please describe your issue and provide steps to reproduce it:
Text from some rows in the table is picked up twice. For example, when running the sample code on problem_file.pdf, the first 4 lines of output are duplicates of numbers from the 1st, 5th, and last rows of the table:

  1. 00 is coming from 100 mcg in the first row
  2. 31.153 mg from the 5th row is extracted twice
  3. 1.000 mg from the last row is extracted twice
  4. 0.010 is coming from 0.010 g in the last row

The rest of the output shows that everything after those 4 lines is extracted correctly.

When running the same code on problem_file2.pdf, the & is extracted twice.

When viewing both cases with Adobe, the duplicates are not present. We even tried moving the text elements to see if they were on top of each other, but that’s not the case.

void extract(const std::wstring &filePath)
{
    // logic to initialize pdftron 
    
    pdftron::PDF::PDFDoc doc;
    if (!filePath.empty())
    {
        doc = pdftron::PDF::PDFDoc(filePath);
        if (!doc.InitSecurityHandler())
        {
            throw pdftron::Common::Exception();
        }
    }

    pdftron::PDF::TextExtractor textExtractor;
    for (auto pageIter = doc.GetPageIterator(); pageIter.HasNext(); pageIter.Next())
    {
        const auto &currentPage = pageIter.Current();
        // flags used: e_remove_hidden_text, e_extract_using_zorder, e_no_ligature_exp, e_no_invisible_text
        textExtractor.Begin(currentPage, currentPage.GetCropBox(), textExtractorFlags());

        for (auto textLine = textExtractor.GetFirstLine(); textLine.IsValid(); textLine = textLine.GetNextLine())
        {
            for (auto word = textLine.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
            {
                pdftron::UString text;
                text.Assign(word.GetString(), word.GetStringLen());
                const auto wordText = text.ConvertToNativeWString();
                std::wcout << wordText << L" ";
            }
            std::wcout << std::endl;
        }
    }
}
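
To narrow down whether the repeated words sit at the same position on the page or somewhere else, something like the following could be run over the extracted words. This is only a sketch using the standard library: `ExtractedWord`, `DuplicatePair`, `sameBox`, and `findDuplicateWords` are hypothetical names, and it assumes each word's text and bounding box have already been copied out of the extractor (for example from a per-word bounding box, if the SDK version in use exposes one).

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record for one extracted word: its text plus a bounding box
// (x1, y1, x2, y2) in page units. How the box is obtained is an assumption.
struct ExtractedWord {
    std::wstring text;
    double x1, y1, x2, y2;
};

// One pair of words with identical text; sameBox tells whether their boxes coincide.
struct DuplicatePair {
    std::size_t first;
    std::size_t second;
    bool sameBox;
};

// True when every edge of the two boxes differs by at most `tolerance`.
bool sameBox(const ExtractedWord &a, const ExtractedWord &b, double tolerance) {
    return std::fabs(a.x1 - b.x1) <= tolerance &&
           std::fabs(a.y1 - b.y1) <= tolerance &&
           std::fabs(a.x2 - b.x2) <= tolerance &&
           std::fabs(a.y2 - b.y2) <= tolerance;
}

// Returns every index pair (i, j), i < j, whose words have identical text,
// ordered by i and then j. This is a quadratic scan, which should be fine for
// the number of words on a single page.
std::vector<DuplicatePair> findDuplicateWords(const std::vector<ExtractedWord> &words,
                                              double tolerance = 0.5) {
    std::vector<DuplicatePair> pairs;
    for (std::size_t i = 0; i < words.size(); ++i) {
        for (std::size_t j = i + 1; j < words.size(); ++j) {
            if (words[i].text == words[j].text) {
                pairs.push_back({i, j, sameBox(words[i], words[j], tolerance)});
            }
        }
    }
    return pairs;
}
```

A pair reported with `sameBox == true` would point to text drawn twice in the same place, while `sameBox == false` would mean the same string appears at two different positions. Partial repeats such as `00` from `100` are not caught by this exact-match comparison.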

Please provide a link to a minimal sample where the issue is reproducible:

problem_file2.pdf (2.1 MB)

problem_file.pdf (2.1 MB)


Hi kdorji,

My name is Christopher and I am a Support Engineer with the Apryse toolkit.
Thank you for reaching out regarding the issue you are experiencing with some text being extracted twice. Thank you as well for providing sample files and a code snippet.
We are currently investigating this issue and will reach out soon with an update.
In the meantime, it looks like you are working with an older version of the toolkit, v11.9.
Can you please try updating your application to use the latest available version, v11.12, which can be found on our downloads page? Note also that v11.13 is due for release tomorrow and may already include a fix if v11.12 does not.
Please let me know the results of your testing with these updated versions.

Thanks,
-Christopher Thompson
Support Engineer


Hi kdorji,
Were you able to test with the latest version of the Apryse toolkit, v11.13, to see if that resolved your issue?

Thanks
-Christopher Thompson
Support Engineer
