Please give a brief summary of your issue: Incorrect Parsing of Hyphen Character to Soft Hyphen
Please describe your issue and provide steps to reproduce it:
Attached is a file containing one word. There should be one hyphen in the word, but the hyphen is instead recognized as a soft hyphen when using the latest text extractor. Note that I am testing with the text extractor sample code provided by Apryse: PDFNetC64\Samples\TextExtractTest\CPP\TextExtractTest.cpp.
Our usage requires accurate extraction of the hyphen for correct text rendering. Therefore, this result will affect our application’s performance. Your help will be appreciated.
Please provide a link to a minimal sample where the issue is reproducible: Incorrect Soft Hyphen.pdf (251.9 KB)
Thank you for the detailed report. The hard hyphen character is being replaced with a soft hyphen (U+00AD) when extracted. We have reproduced the issue and are investigating. We will keep you up to date; in the meantime, thank you for your patience.
In other words, <000e> can be a hyphen-minus (U+002D), a soft hyphen (U+00AD), a hyphen (U+2010), or a non-breaking hyphen (U+2011), which is why you are getting the strange result. Character maps are supposed to be unambiguous: each character code should map to exactly one Unicode value, no more and no less.
You should check with the author of the document and have the ToUnicode map fixed.
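If the document cannot be regenerated, one possible stopgap is to normalize the hyphen variants listed above after extraction. The following is only a minimal sketch, not part of the Apryse SDK: `NormalizeHyphens` is a hypothetical helper, and it assumes the extracted text has already been converted to UTF-32 (`std::u32string`). Be aware that it also rewrites genuine soft hyphens and non-breaking hyphens, so it only makes sense when your application treats all of them as a plain hyphen.

```cpp
#include <string>

// Code points of the hyphen variants mentioned in this thread.
constexpr char32_t kHyphenMinus       = 0x002D;  // '-'
constexpr char32_t kSoftHyphen        = 0x00AD;
constexpr char32_t kHyphen            = 0x2010;
constexpr char32_t kNonBreakingHyphen = 0x2011;

// Hypothetical helper (not an Apryse API): returns a copy of `text` in which
// every soft hyphen, hyphen, and non-breaking hyphen is replaced by U+002D.
// All other characters are left untouched.
std::u32string NormalizeHyphens(std::u32string text) {
    for (char32_t& ch : text) {
        if (ch == kSoftHyphen || ch == kHyphen || ch == kNonBreakingHyphen) {
            ch = kHyphenMinus;
        }
    }
    return text;
}
```

You would call this on each extracted word or line before rendering. It is a workaround on the output only; the ambiguous ToUnicode entry in the PDF itself remains unchanged.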