Q: I use the following code to extract a piece of text (a number) from a PDF
that is located in the same place in every document. This is not very
efficient, because the whole document is read line by line. Is there a
way to read only within the specified WordBox?
    // For each line on the page...
    for (line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine())
    {
        // For each word in the line...
        for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
        {
            // Get the bounding box for the word.
            bbox = word.GetBBox();
            // Check whether the word is at the AHV number position;
            // if so, read the word.
            if ((bbox.x1 > 70 && bbox.x1 < 160) &&
                (bbox.y1 > 760 && bbox.y1 < 772))
            {
                currentAHVNumber = word.GetString();
            }
        }
    }
A code sample that shows how to read text from a certain area of a
page would be great.
A: You can either pass an optional clipping rectangle as the second
parameter to the text_extractor.Begin(page, box) method, or you can iterate
through all words on the page and test for intersection between each word's
bounding box (word.GetBBox()) and the selection rectangle.
So your code is on the right track; however, the rectangle intersection
test is not correct. You may want to alter the code as follows:
if ((bbox.x1 > 70 && bbox.x2 < 160) && (bbox.y1 > 760 && bbox.y2 <
or use the pdftron.PDF.Rect.IntersectRect(r1, r2) utility method.
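
To make the difference between the two tests concrete, here is a minimal, library-independent sketch in C++. The Box struct and the Contains and Intersects helpers are hypothetical names made up for illustration; they are not part of the PDFTron API. The sketch assumes x1,y1 is the lower-left corner and x2,y2 is the upper-right corner of each box. Contains corresponds to a test that checks both corners of the word against the selection rectangle; Intersects is the looser test that also accepts words that only partly overlap it.

```cpp
// Hypothetical stand-in for a rectangle in page coordinates (not a PDFTron type).
// Assumption: (x1, y1) is the lower-left corner and (x2, y2) the upper-right.
struct Box {
    double x1, y1, x2, y2;
};

// True when `word` lies entirely inside `region`.
bool Contains(const Box& region, const Box& word) {
    return word.x1 >= region.x1 && word.x2 <= region.x2 &&
           word.y1 >= region.y1 && word.y2 <= region.y2;
}

// True when the two boxes share some area (partial overlap is enough).
bool Intersects(const Box& a, const Box& b) {
    return a.x1 < b.x2 && b.x1 < a.x2 &&
           a.y1 < b.y2 && b.y1 < a.y2;
}
```

With the selection rectangle from the question, Box{70, 760, 160, 772}, a word box that sticks out past the right edge would pass Intersects but fail Contains. Which test you want depends on whether a partly overlapping word should count as a match.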