Converting PDF text to Unicode

Aaron_Gravesdale · April 20, 2007, 9:50pm

Q:

When I extract text, how do I tell which encoding it is in? I need to
extract it as Unicode, or at least know the encoding so that I can
convert it to Unicode myself.
----

A:

You can use 'element.GetTextData()' to extract the raw text content.
This text is encoded according to the encoding information in the font
dictionary (i.e. font.GetSDFObj()) and possibly in the embedded font
itself, so it is not Unicode in general.

The simplest way to extract text as Unicode is to call the
'element.GetTextString()' method (see the TextExtract sample project
for a concrete code snippet). You can also map individual char codes to
Unicode, as shown in the ElementReaderAdv sample project (in the
ProcessText function). For example:

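// 'font' is assumed to be the Font of the current text run (taken from
// the element's graphics state); MapToUnicode is called on it.
CharIterator end = element->CharEnd();
for (CharIterator itr = element->CharBegin(); itr != end; ++itr) {
  Unicode uni;
  int uni_sz = 0;
  if (font.MapToUnicode(itr->char_code, &uni, 1, uni_sz) && uni_sz > 0) {
    // mapping ok: 'uni' holds the Unicode value for this char code
  }
}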
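
If you then need the mapped values as a UTF-8 byte string, a small helper along these lines may be enough. This is only a sketch and is not part of PDFNet: 'AppendUtf8' is a hypothetical name, and it assumes each successfully mapped value is a single Unicode code point (it rejects surrogate values rather than pairing them).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not a PDFNet function): append one Unicode code
// point to a UTF-8 encoded string. Returns false and appends nothing if
// the value is a surrogate (U+D800..U+DFFF) or is above U+10FFFF.
inline bool AppendUtf8(std::string& out, std::uint32_t cp) {
  if (cp >= 0xD800 && cp <= 0xDFFF) return false;
  if (cp > 0x10FFFF) return false;
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return true;
}
```

Inside the loop above you would call something like AppendUtf8(text, uni) in the "mapping ok" branch, where 'text' is a std::string you declared before the loop.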