TextExtractor is returning each letter as its own Word

Ryan · June 7, 2017, 6:54pm

Question

When I use the following code, a single word is split into separate words, one per letter. When I copy the same word from Acrobat, it comes out as a single word.

using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
{
    Page page = doc.GetPage(1);
    using (TextExtractor txt = new TextExtractor())
    {
        txt.Begin(page);
        String text = txt.GetAsXML(TextExtractor.XMLOutputFlags.e_words_as_elements);
    }
}

Output from the program above:

<Word>h</Word> <Word>e</Word> <Word>l</Word> <Word>l</Word> <Word>o</Word> <Word>!</Word>

In the page content it looks like this:

BT
1 G
1 g
0.66667 0 0 1 60.024 337.01 Tm
/F0 0.96 Tf
[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ
ET

Answer

This is the correct output when using the TextExtractor.XMLOutputFlags.e_words_as_elements flag.

Given

[(h)-61(e)-228(l)-215(l)-96(o)-249(!)] TJ

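In this TJ array every character is a separate string, with a positioning adjustment between each pair, so each character is its own “element” in this case. The e_words_as_elements flag refers to our lower-level Element class, which is what ElementReader returns. TextExtractor uses ElementReader to parse the raw PDF content stream and then arranges the result into a higher-level, human reading order.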
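For comparison, TextExtractor also exposes its own line and word segmentation directly, without going through the XML flag. Below is a minimal sketch, modeled on the TextExtract sample that ships with the SDK and reusing the file path from the question. Whether this particular page comes back as a single word depends on how the extractor interprets the spacing, which this thread does not confirm. The bare PDFNet.Initialize() call is an assumption; newer SDK versions expect a license key there.

```csharp
using System;
using pdftron;
using pdftron.PDF;

class WordListExample
{
    static void Main()
    {
        // Assumption: newer SDK versions may require a license key argument here.
        PDFNet.Initialize();

        using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
        {
            doc.InitSecurityHandler();
            Page page = doc.GetPage(1);

            using (TextExtractor txt = new TextExtractor())
            {
                txt.Begin(page);

                // Walk the extractor's own lines and words in reading order.
                for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine())
                {
                    for (TextExtractor.Word word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
                    {
                        Console.WriteLine(word.GetString());
                    }
                }
            }
        }
    }
}
```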

If you run this document through the ElementReader sample test, you will see what I mean.
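A minimal sketch of that check, modeled on the ElementReader sample and reusing the path from the question. It prints every text element found on page 1 on its own line, so you can see how the content stream is divided before TextExtractor does any grouping. The class name and the bracketed output format are just for illustration, and the bare PDFNet.Initialize() call is an assumption, since newer SDK versions expect a license key.

```csharp
using System;
using pdftron;
using pdftron.PDF;

class TextElementDump
{
    // Read elements from the content stream and print each text element.
    static void ProcessElements(ElementReader reader)
    {
        Element element;
        while ((element = reader.Next()) != null)
        {
            switch (element.GetType())
            {
                case Element.Type.e_text:
                    // One line of output per text element.
                    Console.WriteLine("[{0}]", element.GetTextString());
                    break;

                case Element.Type.e_form:
                    // Descend into form XObjects, which can also contain text.
                    reader.FormBegin();
                    ProcessElements(reader);
                    reader.End();
                    break;
            }
        }
    }

    static void Main()
    {
        // Assumption: newer SDK versions may require a license key argument here.
        PDFNet.Initialize();

        using (PDFDoc doc = new PDFDoc(@"C:\temp\Source.pdf"))
        using (ElementReader reader = new ElementReader())
        {
            doc.InitSecurityHandler();
            reader.Begin(doc.GetPage(1));
            ProcessElements(reader);
            reader.End();
        }
    }
}
```

If the page behaves as described in the answer, each bracketed line should hold a single character.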