Product: Apryse SDK
Product Version: 11.9.0-ee437c0
Please give a brief summary of your issue:
Some text gets extracted twice by TextExtractor
Please describe your issue and provide steps to reproduce it:
Text from some rows in the table is getting picked up twice. For example, when running the sample code on problem_file.pdf, the first 4 lines of output are duplicates of numbers from the 1st, 5th, and last rows of the table:
00 is coming from 100 mcg in the first row
31.153 mg from the 5th row is extracted twice
1.000 mg from the last row is extracted twice
0.010 is coming from 0.010 g in the last row
Reviewing the rest of the output shows that everything else is extracted correctly.
When running the same code on problem_file2.pdf, the & is extracted twice.
When viewing both cases with Adobe, the duplicates are not present. We even tried moving the text elements to see if they were on top of each other, but that’s not the case.
    void extract(const std::wstring &filePath)
    {
        // logic to initialize pdftron
        pdftron::PDF::PDFDoc doc;
        if (!filePath.empty())
        {
            doc = pdftron::PDF::PDFDoc(filePath);
            if (!doc.InitSecurityHandler())
            {
                throw pdftron::Common::Exception();
            }
        }
        pdftron::PDF::TextExtractor textExtractor;
        for (auto pageIter = doc.GetPageIterator(); pageIter.HasNext(); pageIter.Next())
        {
            const auto &currentPage = pageIter.Current();
            // flags used: e_remove_hidden_text, e_extract_using_zorder, e_no_ligature_exp, e_no_invisible_text
            textExtractor.Begin(currentPage, currentPage.GetCropBox(), textExtractorFlags());
            for (auto textLine = textExtractor.GetFirstLine(); textLine.IsValid(); textLine = textLine.GetNextLine())
            {
                for (auto word = textLine.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
                {
                    pdftron::UString text;
                    text.Assign(word.GetString(), word.GetStringLen());
                    const auto wordText = text.ConvertToNativeWString();
                    std::wcout << wordText << L" ";
                }
                std::wcout << std::endl;
            }
        }
    }
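To help narrow this down, here is a minimal sketch of how the suspected duplicates could be flagged programmatically instead of by eye. It is not part of our actual code and does not use the SDK: `WordBox`, `textsRelated`, `overlapRatio`, and `findSuspectedDuplicates` are hypothetical names, and the 0.5 overlap threshold is an arbitrary assumption. The idea is to record each word's text and bounding box, then report pairs where one text contains the other (as with 00 and 100) and the boxes largely overlap.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record: one extracted word and its bounding box.
// Assumes a normalized box, i.e. x1 < x2 and y1 < y2.
struct WordBox
{
    std::wstring text;
    double x1, y1, x2, y2;
};

// True when one text equals or contains the other (e.g. "00" inside "100").
bool textsRelated(const std::wstring &a, const std::wstring &b)
{
    if (a.empty() || b.empty())
    {
        return false;
    }
    return a.find(b) != std::wstring::npos || b.find(a) != std::wstring::npos;
}

// Intersection area divided by the area of the smaller box (0.0 if disjoint).
double overlapRatio(const WordBox &a, const WordBox &b)
{
    const double w = std::min(a.x2, b.x2) - std::max(a.x1, b.x1);
    const double h = std::min(a.y2, b.y2) - std::max(a.y1, b.y1);
    if (w <= 0.0 || h <= 0.0)
    {
        return 0.0;
    }
    const double areaA = (a.x2 - a.x1) * (a.y2 - a.y1);
    const double areaB = (b.x2 - b.x1) * (b.y2 - b.y1);
    const double smaller = std::min(areaA, areaB);
    return smaller > 0.0 ? (w * h) / smaller : 0.0;
}

// Returns index pairs (i, j), i < j, of words that look like duplicates:
// related text and boxes overlapping by at least minOverlap (assumed threshold).
std::vector<std::pair<std::size_t, std::size_t>>
findSuspectedDuplicates(const std::vector<WordBox> &words, double minOverlap = 0.5)
{
    std::vector<std::pair<std::size_t, std::size_t>> result;
    for (std::size_t i = 0; i < words.size(); ++i)
    {
        for (std::size_t j = i + 1; j < words.size(); ++j)
        {
            if (textsRelated(words[i].text, words[j].text) &&
                overlapRatio(words[i], words[j]) >= minOverlap)
            {
                result.emplace_back(i, j);
            }
        }
    }
    return result;
}
```

In the extraction loop above, each `WordBox` would presumably be filled from `wordText` plus the word's bounding box (we believe `word.GetBBox()` provides it, but have not verified this against our SDK version). Any pair reported for problem_file.pdf would indicate that the duplicated text occupies the same location as the original, which is what we would expect if the text is present twice in the page content.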
Please provide a link to a minimal sample where the issue is reproducible:
problem_file2.pdf (2.1 MB)
problem_file.pdf (2.1 MB)