PDF Text extraction

Aaron_Gravesdale · December 15, 2010, 7:52pm

Q: What is the best way to extract text with PDFNet SDK (in as natural
a reading order as possible, with hidden/duplicate text removed, etc.)
while also capturing the bounding box (x1,y1,x2,y2) coordinates of
each extracted word?
---------------------------
A: The simplest approach is to use 'pdftron.PDF.TextExtractor', as
shown in the TextExtract sample: http://www.pdftron.com/pdfnet/samplecode.html#TextExtract

To extract text from a specific region, pass a rectangle as the
second parameter of the TextExtractor.Begin() method. You can also
tweak text extraction by passing the following bit flags in the third
parameter:

--------------------------------------------------
// Disables expanding of ligatures using a predefined mapping.
// Default ligatures are: fi, ff, fl, ffi, ffl, ch, cl, ct, ll, ss,
// fs, st, oe, OE.
e_no_ligature_exp = 1,

// Disables removing duplicated text that is frequently used to
// achieve visual effects of drop shadow and fake bold.
e_no_dup_remove = 2,

// Treat punctuation (e.g. full stop, comma, semicolon, etc.) as
// word break characters.
e_punct_break = 4,

// Enables removal of text that is obscured by images or
// rectangles. Since this option carries a small performance
// penalty, it is not enabled by default.
e_remove_hidden_text = 8,

// Enables removing text that uses rendering mode 3 (i.e. invisible
// text). Invisible text is usually used in 'PDF Searchable Images'
// (i.e. scanned pages with a corresponding OCR text). As a result,
// invisible text will be extracted by default.
e_no_invisible_text = 16
--------------------------------------------------
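
Since these are bit flags, several options can be combined with a
bitwise OR before being passed as the third parameter. A minimal
sketch in Python: the ProcessingFlags class and the combine() helper
are stand-ins written for illustration (they only mirror the values
listed above and are not the PDFNet types themselves).

```python
from enum import IntFlag


class ProcessingFlags(IntFlag):
    """Stand-in that mirrors the flag values listed above (not the PDFNet enum)."""
    e_no_ligature_exp = 1
    e_no_dup_remove = 2
    e_punct_break = 4
    e_remove_hidden_text = 8
    e_no_invisible_text = 16


def combine(*flags):
    """OR the given flags together into a single integer bit mask."""
    mask = 0
    for flag in flags:
        mask |= int(flag)
    return mask


# Remove obscured text and skip invisible (rendering mode 3) text.
flags = combine(ProcessingFlags.e_remove_hidden_text,
                ProcessingFlags.e_no_invisible_text)
print(flags)  # 24

# Check whether a particular option is set in the mask.
print(bool(flags & ProcessingFlags.e_punct_break))  # False
```

Leaving a flag out keeps the default behavior for that option, e.g.
duplicate removal stays on unless e_no_dup_remove is set.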

If you want to implement your own text extraction engine (to assemble
words, remove hidden/duplicate text, etc.), you could use
'pdftron.PDF.ElementReader' (e.g. along the lines of the ElementReaderAdv
sample - http://www.pdftron.com/pdfnet/samplecode.html#ElementReaderAdv).
In that case, please keep in mind that writing a text extraction
engine (such as TextExtractor) is far from trivial.
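
To give a rough idea of what such an engine has to do, here is a toy
sketch in Python of two of the steps mentioned above: dropping
duplicated glyphs (drop shadow / fake bold) and assembling words with
their bounding boxes. It is not how TextExtractor works internally.
The Glyph type, both function names, and the tolerance values are
assumptions made up for this example, and it only handles a single
left-to-right line of already-positioned glyphs.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical positioned character: text plus its box (x1, y1, x2, y2)."""
    ch: str
    x1: float
    y1: float
    x2: float
    y2: float


def remove_duplicates(glyphs, tol=1.0):
    """Drop a glyph if the same character was already kept within `tol` units."""
    kept = []
    for g in glyphs:
        is_dup = any(k.ch == g.ch
                     and abs(k.x1 - g.x1) <= tol
                     and abs(k.y1 - g.y1) <= tol
                     for k in kept)
        if not is_dup:
            kept.append(g)
    return kept


def assemble_words(glyphs, gap=2.0):
    """Group glyphs of one line into words; return (text, bbox) pairs."""
    groups, current = [], []
    for g in sorted(glyphs, key=lambda g: g.x1):
        # A space, or a horizontal gap wider than `gap`, ends the current word.
        if g.ch.isspace() or (current and g.x1 - current[-1].x2 > gap):
            if current:
                groups.append(current)
                current = []
        if not g.ch.isspace():
            current.append(g)
    if current:
        groups.append(current)
    # The word box is the union of the boxes of its glyphs.
    return [("".join(g.ch for g in w),
             (min(g.x1 for g in w), min(g.y1 for g in w),
              max(g.x2 for g in w), max(g.y2 for g in w)))
            for w in groups]


line = [
    Glyph("H", 10, 100, 16, 110),
    Glyph("H", 10.5, 99.5, 16.5, 109.5),  # shadow copy of the first "H"
    Glyph("i", 16, 100, 18, 110),
    Glyph("y", 25, 100, 31, 110),
    Glyph("o", 31, 100, 37, 110),
]
words = assemble_words(remove_duplicates(line))
for text, bbox in words:
    print(text, bbox)
```

A real engine also has to deal with multiple lines and columns,
rotated text, font encodings, and obscured or invisible text, which is
why using TextExtractor is usually the better choice.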