Extracting text within a rectangle

Aaron_Gravesdale · June 15, 2010, 4:30pm

Q: We are working on a tool based on the sample PDF viewer. We have
identified a rectangle on the page from which we want to extract the
text on every page of the document. We use TextExtractor and
GetAsText. This works fairly well for our purposes, except...

We sometimes get a bit more than we want. If a word intersects the
extraction rectangle, then the entire word is returned. Is there a way
to return only those characters that intersect the extraction
rectangle?

Our application is a generic tool that extracts account numbers from a
static location. For some documents there are extra digits at the end
that we want to exclude. We have defined an extraction rectangle that
intersects only the digits we want.
-------------------
A: How do you extract text? In order to extract text from a specific
region you can use a version of TextExtractor.Begin() that takes a
rectangle as a parameter.

To do this, pass the selection rectangle as the second parameter. If
you would also like to remove invisible text (text obscured by other
rectangles, clipping paths, etc.), pass
TextExtractor.ProcessingFlags.e_remove_hidden_text as the third
parameter:
text_extractor.Begin(page, select,
TextExtractor.ProcessingFlags.e_remove_hidden_text).

Because TextExtractor provides you with a bounding box (actually a
quadrilateral) for each glyph/character on the page, you can
alternatively decide for yourself which text is visible by iterating
through all lines, words, and glyphs and testing each one for
intersection with the given region.
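
The per-glyph test can be sketched independently of the PDF library.
The Python below is a minimal illustration, not the library's API: the
Glyph and Rect types, make_glyphs, and the min_overlap threshold are
all hypothetical names introduced here, and each glyph quadrilateral
is approximated by its axis-aligned bounds. In a real tool the glyph
characters and quads would come from iterating the extractor's lines
and words.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; assumes x1 <= x2 and y1 <= y2."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass(frozen=True)
class Glyph:
    """Hypothetical stand-in for one extracted character and its quad."""
    char: str
    quad: Tuple[Point, Point, Point, Point]


def quad_bounds(quad: Iterable[Point]) -> Rect:
    """Approximate a quadrilateral by its axis-aligned bounding box."""
    pts = list(quad)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return Rect(min(xs), min(ys), max(xs), max(ys))


def overlap_fraction(box: Rect, region: Rect) -> float:
    """Fraction of the glyph box's area that lies inside the region."""
    w = min(box.x2, region.x2) - max(box.x1, region.x1)
    h = min(box.y2, region.y2) - max(box.y1, region.y1)
    area = (box.x2 - box.x1) * (box.y2 - box.y1)
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def text_in_region(glyphs: Iterable[Glyph], region: Rect,
                   min_overlap: float = 0.5) -> str:
    """Keep only the glyphs that sufficiently overlap the region."""
    return "".join(
        g.char for g in glyphs
        if overlap_fraction(quad_bounds(g.quad), region) >= min_overlap
    )


def make_glyphs(text: str, x0: float, y0: float,
                width: float, height: float) -> List[Glyph]:
    """Build evenly spaced test glyphs along a horizontal baseline."""
    glyphs = []
    for i, ch in enumerate(text):
        left, right = x0 + i * width, x0 + (i + 1) * width
        quad = ((left, y0), (right, y0),
                (right, y0 + height), (left, y0 + height))
        glyphs.append(Glyph(ch, quad))
    return glyphs


# Ten digits, each 6 units wide; the region covers only the first eight.
digits = make_glyphs("1234567890", x0=100, y0=700, width=6, height=10)
account = text_in_region(digits, Rect(100, 698, 148, 712))
```

Here account contains only the digits whose boxes fall inside the
rectangle, so trailing digits outside it are dropped. The overlap
threshold is a design choice: requiring at least half of a glyph to
lie inside the region avoids picking up a character that merely
touches the rectangle's edge.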