What data is in Element.GetTextData?

Q:

We are trying to manipulate text data for Arabic text. What exactly is contained in Element.GetTextData and what goes into Element.SetTextData?

A:

The text data of a text element (Element.GetType() == e_text) is the sequence of raw character codes stored in the PDF. These character codes only have meaning with respect to the currently active font (Element.GetGState().GetFont()).

If the font is simple, each character code is a single byte (0-255). Otherwise, the data is multi-byte UTF-16BE.

byte[] text_data = element.GetTextData();
if (element.GetGState().GetFont().IsSimple())
{
    // each byte is a 'character'
}
else
{
    // multibyte data, treat as UTF16-BE
}
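To make the two branches above concrete, here is a minimal standalone sketch that splits raw text data into integer character codes. It does not call the PDF library: the class name `TextDataCodes`, the method `Split`, and the `isSimpleFont` parameter are hypothetical and stand in for the result of `element.GetGState().GetFont().IsSimple()`. It also assumes, following the description above, that multi-byte data uses exactly two big-endian bytes per code.

```csharp
using System;
using System.Collections.Generic;

public static class TextDataCodes
{
    // Splits raw text data into integer character codes.
    // isSimpleFont is a stand-in for element.GetGState().GetFont().IsSimple().
    public static List<int> Split(byte[] textData, bool isSimpleFont)
    {
        var codes = new List<int>();
        if (isSimpleFont)
        {
            // Simple font: every byte is one character code (0-255).
            foreach (byte b in textData)
            {
                codes.Add(b);
            }
        }
        else
        {
            // Assumption: two bytes per code, high byte first.
            // A trailing odd byte, if any, is ignored.
            for (int i = 0; i + 1 < textData.Length; i += 2)
            {
                codes.Add((textData[i] << 8) | textData[i + 1]);
            }
        }
        return codes;
    }
}
```

For example, the bytes 0x06 0x27 yield the single code 0x0627 when the font is not simple, and the two codes 0x06 and 0x27 when it is. The same bytes therefore mean different things depending on the active font.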

This means that switching fonts can completely break the text's appearance and its Unicode values, because the two fonts may not share the same character encoding.
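The effect of a font switch can be illustrated without the PDF library. In the sketch below, `FontSwitchDemo`, `FontA`, `FontB`, and `Decode` are hypothetical names, and the two code-to-Unicode tables are invented for illustration; in a real document the mapping comes from each font's own encoding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class FontSwitchDemo
{
    // Hypothetical code-to-Unicode tables for two simple fonts.
    public static readonly Dictionary<byte, char> FontA = new Dictionary<byte, char>
    {
        { 0x41, 'A' }, { 0xC7, '\u00C7' }  // 0xC7 is a Latin letter in font A
    };

    public static readonly Dictionary<byte, char> FontB = new Dictionary<byte, char>
    {
        { 0x41, 'A' }, { 0xC7, '\u0627' }  // 0xC7 is an Arabic letter in font B
    };

    // Maps each single-byte character code through the given table.
    // Codes missing from the table become U+FFFD (replacement character).
    public static string Decode(byte[] textData, Dictionary<byte, char> table)
    {
        var sb = new StringBuilder();
        foreach (byte code in textData)
        {
            sb.Append(table.TryGetValue(code, out char ch) ? ch : '\uFFFD');
        }
        return sb.ToString();
    }
}
```

Decoding the same bytes 0x41 0xC7 through `FontA` and `FontB` gives two different strings, even though the text data itself is unchanged. That is why text data written for one font generally has to be re-encoded before it is used with another.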