Q: We are working on a tool based upon the sample PDF viewer. We
identified a rectangle on the page from which we want to extract the
text on every page of the document. We use TextExtractor and
This works fairly well for our purposes, except...
We sometimes get a bit more than we want. If a word intersects with the
extraction rectangle then the entire word is returned. Is there a way
to return only those characters that intersect with the extraction
rectangle?
Our application is to extract account numbers from a static location
a generic tool, and for some documents there are extra digits at the
we want to exclude. We have an extraction rectangle defined that
intersects only with the digits we want.
A: How do you extract text? In order to extract text from a specific
region you can use a version of TextExtractor.Begin() that takes a
rectangle as a parameter.
To extract text from a specific rectangle, pass the selection rectangle
as the second parameter, and
TextExtractor.ProcessingFlags.e_remove_hidden_text (in case you would
like to remove invisible text obscured by other rectangles, clipping
paths, etc.) as the third parameter in the call to TextExtractor.Begin().
Because TextExtractor provides you with a bounding box (actually a
quadrilateral) for each glyph/character on the page, you can
alternatively decide what text is visible on your own by iterating
through all lines, words, and glyphs and testing for intersection with
the given region.
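
The per-glyph test described above could look roughly like the sketch
below. It is plain Python with no PDFNet calls, so the names (Rect,
quad_bounds, intersects, clip_word) and the sample data are
hypothetical. It assumes each glyph quadrilateral is available as eight
numbers (x1, y1, ..., x4, y4) in the same coordinate space as the
extraction rectangle, and it simplifies the test by using the
axis-aligned bounding box of each quad.

```python
from typing import List, Sequence, Tuple

# Hypothetical rectangle type: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Rect = Tuple[float, float, float, float]


def quad_bounds(quad: Sequence[float]) -> Rect:
    """Axis-aligned bounding box of a quad given as x1, y1, ..., x4, y4."""
    xs, ys = quad[0::2], quad[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def clip_word(chars: str, quads: List[Sequence[float]], region: Rect) -> str:
    """Keep only the characters whose glyph box overlaps the region."""
    return "".join(
        c for c, q in zip(chars, quads) if intersects(quad_bounds(q), region)
    )


# Made-up word: nine digits, each glyph 10 units wide and 12 units tall.
word = "123456789"
quads = [
    (10 * i, 0, 10 * i + 10, 0, 10 * i + 10, 12, 10 * i, 12)
    for i in range(len(word))
]
region = (0, 0, 60, 12)  # covers only the first six digits
print(clip_word(word, quads, region))  # prints 123456
```

Glyphs that merely touch the edge of the rectangle are excluded here
because the comparison is strict. If partially covered digits should be
dropped as well, a test on the glyph center or on the overlapping area
may be a better fit.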