Incorrect Parsing of Hyphen Character

ggao · December 17, 2024, 9:14pm

Product: Server SDK Version

Product Version: 11.1.0

Please give a brief summary of your issue: Incorrect Parsing of Hyphen Character to Soft Hyphen

Please describe your issue and provide steps to reproduce it:
Attached is a file containing one word. There should be one hyphen in the word, but the hyphen is recognized as a soft hyphen when using the latest text extractor. Note that I am using the text extractor sample code provided by Apryse, PDFNetC64\Samples\TextExtractTest\CPP\TextExtractTest.cpp, for testing.

Our usage requires accurate extraction of the hyphen for correct text rendering. Therefore, this result will affect our application’s performance. Your help will be appreciated.

Please provide a link to a minimal sample where the issue is reproducible:
Incorrect Soft Hyphen.pdf (251.9 KB)

btompkinson1 · December 18, 2024, 9:01pm

Thank you for the detailed report. The hard hyphen character is being replaced with the soft hyphen (U+00AD) when extracted. We have reproduced the issue and are investigating. We will keep you up to date; in the meantime, thank you for your patience.

btompkinson1 · December 19, 2024, 4:34pm

The reason you are seeing an odd result is that your input file has an invalid ToUnicode map that contains conflicting data.

<0001> <005f> <0020>
<0001> <0001> <00a0>
<0060> <006b> <00a1>
<000e> <000e> <00ad>
…
<000e> <000e> <2010>
<000e> <000e> <2011>

In other words, <000e> can be a hyphen-minus (U+002D), a soft hyphen (U+00AD), a hyphen (U+2010), or a non-breaking hyphen (U+2011), which is why you are getting the strange result. Character maps are supposed to be unambiguous: each character code maps to exactly one Unicode value, no more and no less.
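To make the conflict concrete, here is a minimal Python sketch that expands the entries quoted above and reports every character code that ends up with more than one Unicode target. It assumes the three-column rows are bfrange-style entries (first code, last code, first Unicode value), so that <000e> in the first range lands on U+0020 + 0x0D = U+002D. The names `BFRANGE` and `find_conflicts` are hypothetical and are not part of the Apryse SDK, and the elided rows ("…") are left out.

```python
from collections import defaultdict

# Entries quoted in the thread, read as (first_code, last_code, first_unicode).
# Assumption: these are bfrange-style rows; the elided rows are not reproduced.
BFRANGE = [
    (0x0001, 0x005F, 0x0020),
    (0x0001, 0x0001, 0x00A0),
    (0x0060, 0x006B, 0x00A1),
    (0x000E, 0x000E, 0x00AD),
    (0x000E, 0x000E, 0x2010),
    (0x000E, 0x000E, 0x2011),
]


def find_conflicts(ranges):
    """Return {code: [unicode values]} for codes mapped to more than one value."""
    targets = defaultdict(list)
    for first, last, start in ranges:
        for code in range(first, last + 1):
            # Each code in a range maps to start + its offset within the range.
            value = start + (code - first)
            if value not in targets[code]:
                targets[code].append(value)
    return {code: vals for code, vals in targets.items() if len(vals) > 1}


conflicts = find_conflicts(BFRANGE)
for code, vals in sorted(conflicts.items()):
    print(f"<{code:04x}> -> " + ", ".join(f"U+{v:04X}" for v in vals))
```

A check along these lines only flags ambiguous codes; deciding which mapping is the intended one still requires the document's author.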

You should check with the author of the document and fix the ToUnicode map.