Couple of questions related to low-level text extraction from PDF.

Aaron_Gravesdale · August 13, 2008, 7:23pm

Q: We have a couple of questions related to extracting text from
PDF.

When we extract the text from a particular “Rectangle region” of the
PDF using the CharIterator included in the
“ElementReaderAdvTestVB_2005” sample project, we get the attached
file <<First_Art_7065002-1.txt>> as the result. In that file,
characters like “ü” come out wrong. Can you please let us know why
this is happening?

Similarly, when we use the text extraction class with the PDFView
class, we get a different result for the same rectangle region. The
case of the characters changes, e.g. “rückgang”. The attached file
<<Copy.txt>> shows the results. Can you please let us know why this
is? In the “First_Art_7065002-1.txt” file, by contrast, the “R” comes
out capitalized but the “ü” is a completely different character.
Another example is the mixed-case word “ZürICH”. In
<<First_Art_7065002-1.txt>> we get the words in the exact case in
which they appear in the PDF, but the character codes are different.
-----
A: CharIterator does not return Unicode code points. Instead, it
returns 'charcodes' as they are stored in the PDF content stream. As a
result, you see garbage characters.

The 'charcodes' can be mapped to Unicode values using
pdftron.PDF.Font.MapToUnicode(charcode, ...). If you use the
text_element.GetTextString() utility function, it will automatically
map all charcodes in a text run to Unicode (using the above-mentioned
MapToUnicode()).
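
To illustrate why raw charcodes look like garbage, here is a minimal
Python sketch. It does not call PDFNet. The table TO_UNICODE, the
helper decode_run() and the charcode values are all hypothetical and
only stand in for a font's charcode-to-Unicode mapping.

```python
# Hypothetical charcode -> Unicode table, standing in for the mapping
# that a font's ToUnicode data would provide. The codes are made up.
TO_UNICODE = {
    0x01: "R", 0x02: "\u00fc", 0x03: "c", 0x04: "k",
    0x05: "g", 0x06: "a", 0x07: "n",
}

def decode_run(charcodes, to_unicode):
    """Map each charcode of a text run to Unicode.

    Codes missing from the table become U+FFFD (replacement character).
    """
    return "".join(to_unicode.get(code, "\ufffd") for code in charcodes)

# Charcodes as they might appear in a content stream (assumed values).
run = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x05]

# Treating charcodes as if they were Unicode yields control characters.
raw = "".join(chr(code) for code in run)

# Mapping them through the font's table yields readable text.
text = decode_run(run, TO_UNICODE)
print(repr(raw))
print(text)
```

The point of the sketch is only that a charcode has no meaning outside
its font: the same number can stand for different characters in
different fonts, so it has to be mapped per font before it is treated
as text.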

CharIterator is fairly low-level, so unless you have significant
expertise with PDF, we recommend using higher-level functions instead
(e.g. element.GetTextString(), element.GetBBox(), TextExtractor, etc.).

Please keep in mind that TextExtractor can return results that differ
considerably from a low-level text enumeration of the PDF document
using ElementReader. TextExtractor includes a complex AI engine that
analyzes document layout and reconstructs words and reading order,
which is not available in ElementReader. You should be able to relate
words to the underlying text-run elements through their bounding boxes
(word.GetBBox() and element.GetBBox()).
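
The bounding-box matching mentioned above could be sketched roughly as
follows. This is plain Python with no PDFNet calls. The Rect tuple
layout, the helper names overlap_area() and runs_for_word(), and the
coordinates are assumptions made for illustration; in practice the
rectangles would come from word.GetBBox() and element.GetBBox().

```python
from typing import List, Tuple

# Assumed layout: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Rect = Tuple[float, float, float, float]

def overlap_area(a: Rect, b: Rect) -> float:
    """Return the area of the intersection of two rectangles (0 if none)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0

def runs_for_word(word_box: Rect, run_boxes: List[Rect]) -> List[int]:
    """Return indices of text runs whose box overlaps the word's box."""
    return [i for i, box in enumerate(run_boxes)
            if overlap_area(word_box, box) > 0]

# Made-up boxes: two text runs on consecutive lines, one word on line 1.
run_boxes = [(72.0, 700.0, 200.0, 712.0), (72.0, 686.0, 180.0, 698.0)]
word_box = (100.0, 701.0, 140.0, 711.0)

matches = runs_for_word(word_box, run_boxes)
print(matches)
```

A word may overlap more than one text run (for example when a
generator emits each glyph separately), which is why the helper
returns a list and not a single index.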

Regarding the last issue, it is caused by a corruption introduced by
the PDF generator (i.e. a bad ToUnicode mapping): the
semantics/meaning of the character codes does not correspond to the
actual glyphs. If you try to copy & paste text from Acrobat or any
other PDF consumer, you will get the same results. Using the PDFNet
library you could recognize and correct these malformed PDFs;
however, it is not a trivial task.