Question:
When I use the following code, the word "hello!" is split into separate single-character words. When I copy the same word from Acrobat, it comes out as a single word.
using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
{
    Page page = doc.GetPage(1);
    using (TextExtractor txt = new TextExtractor())
    {
        txt.Begin(page);
        String text = txt.GetAsXML(TextExtractor.XMLOutputFlags.e_words_as_elements);
    }
}
Output from the program above:
<Word>h</Word> <Word>e</Word> <Word>l</Word> <Word>l</Word> <Word>o</Word> <Word>!</Word>
In the page content it looks like this:
BT 1 G 1 g 0.66667 0 0 1 60.024 337.01 Tm /F0 0.96 Tf [(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ ET
Answer:
This is the correct output when using the TextExtractor.XMLOutputFlags.e_words_as_elements flag.
Given the TJ array from the page content:
[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ
Each character is a separate string in the array, so each one is its own “element” in this case. e_words_as_elements refers to our lower-level Element class, returned by ElementReader. TextExtractor uses ElementReader to parse the raw PDF content stream, and then builds a higher-level representation of the text in human reading order.
If you run this document through the ElementReader sample, you will see each character reported as its own element.
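To make the structure of that TJ array easier to see, here is a minimal standalone sketch that splits it into its strings and numeric adjustments. It does not use PDFNet at all: the TjArrayDemo class, its Item type and the Parse method are hypothetical names made up for this illustration, and the parser is deliberately simplified. Per the PDF specification, each number in a TJ array is expressed in thousandths of a text-space unit and is subtracted from the current position, so the negative values here widen the gap between glyphs.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper for illustration only; not part of PDFNet.
public static class TjArrayDemo
{
    // One entry of a TJ array: a string to show, or a numeric adjustment.
    public sealed class Item
    {
        public string Text;        // null for numeric entries
        public double Adjustment;  // thousandths of a text-space unit; 0 for strings
        public bool IsText { get { return Text != null; } }
    }

    // Simplified parser: literal strings (a backslash keeps the next character
    // as-is) and plain numbers. Nested parentheses, octal escapes and hex
    // strings are not handled.
    public static List<Item> Parse(string array)
    {
        var items = new List<Item>();
        int i = 0;
        while (i < array.Length)
        {
            char c = array[i];
            if (c == '(')
            {
                var sb = new StringBuilder();
                i++; // skip '('
                while (i < array.Length && array[i] != ')')
                {
                    if (array[i] == '\\' && i + 1 < array.Length) i++;
                    sb.Append(array[i]);
                    i++;
                }
                i++; // skip ')'
                items.Add(new Item { Text = sb.ToString() });
            }
            else if (c == '-' || c == '+' || c == '.' || char.IsDigit(c))
            {
                int start = i;
                i++;
                while (i < array.Length && (char.IsDigit(array[i]) || array[i] == '.')) i++;
                double value = double.Parse(array.Substring(start, i - start),
                                            CultureInfo.InvariantCulture);
                items.Add(new Item { Adjustment = value });
            }
            else
            {
                i++; // brackets and whitespace
            }
        }
        return items;
    }

    public static void Main()
    {
        const string tj = "[(h)-61(e)-228(l)-215(l)-96(o)-249(!)]";
        const double fontSize = 0.96; // from "/F0 0.96 Tf" in the page content

        foreach (Item item in Parse(tj))
        {
            if (item.IsText)
            {
                Console.WriteLine("string  \"" + item.Text + "\"");
            }
            else
            {
                // Extra advance in text space, before the text matrix (Tm) is applied.
                double extra = -item.Adjustment / 1000.0 * fontSize;
                Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
                    "adjust  {0} -> extra advance {1:0.#####} text-space units",
                    item.Adjustment, extra));
            }
        }
    }
}
```

For the array in question this yields six one-character strings separated by five adjustments, which matches the six <Word> entries in the XML output above. The text matrix in the page content (0.66667 horizontally) scales those advances further when mapping to page coordinates.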