How do I relate characters in 'words' from TextExtractor to characters in 'text-run' elements?

Aaron_Gravesdale · December 1, 2008, 11:45pm

Q: I wish to modify an existing PDF by swapping certain strings of
text with alternative values based on user input. The location of the
text is arbitrary and will change from one document to another.

I have read the recommendations about replacing rather than editing
text and understand this is going to be simpler and more reliable.

I have also followed the details on using the text extraction features
to find the text that needs to be swapped, the location, fonts used
etc.

What I cannot work out is how to identify the PDF elements that
correspond to the bounding box (bbox) values in the XML returned by
the extractor class.

What I would like to be able to do is specify a bounding box, get a
list of all text elements that encompass that box, delete them, then
replace with the new text. The information returned by the text
extractor class is useful and reliable with all documents tested, but
I cannot work out how to reconcile it back to the elements.

Is this approach feasible? If so, how do I do it?

P.S. I'm very impressed with PDFNet; it's the best PDF manipulation
tool I have come across, having checked literally hundreds of
alternatives.
------
A: One way to relate text found using TextExtractor to text runs
extracted using ElementReader is to compare their bounding boxes (i.e.
element.GetBBox(rect) and word.GetBBox()). If there is sufficient
vertical overlap (rect.IntersectRect(bbox1, bbox2)), it is likely that
you can delete the given element. The only tricky part is when a text
run partially intersects a given rectangle. For example, a text run
may consist of an entire sentence, and only the middle part of the
sentence may intersect the replacement rectangle. In this case you
would need to break the text run into 2 or 3 elements (the text
before the rectangle, the text inside it, and the text after it). For
an example of how this can be implemented, you may want to take a look
at the RecolorPDFText.zip and PDFViewSimple_CS.zip samples located at
http://groups.google.com/group/pdfnet-sdk/files.
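
The comparison described above can be sketched in plain Python without
touching the PDFNet API. This is only an illustration of the geometry,
not the code from the samples: Box, classify and split_run are
hypothetical names, the 0.5 overlap threshold is an arbitrary
assumption, and the per-character x extents passed to split_run are
assumed to be available from whatever extraction step you use. In a
real program the rectangles would come from the GetBBox calls
mentioned above.

```python
from typing import List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def intersection(a: Box, b: Box) -> Optional[Box]:
    """Return the overlapping rectangle, or None if the boxes do not overlap."""
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    x2, y2 = min(a.x2, b.x2), min(a.y2, b.y2)
    if x1 >= x2 or y1 >= y2:
        return None
    return Box(x1, y1, x2, y2)


def vertical_overlap(run: Box, target: Box) -> float:
    """Fraction of the run's height that lies inside the target's vertical range."""
    height = run.y2 - run.y1
    if height <= 0:
        return 0.0
    shared = min(run.y2, target.y2) - max(run.y1, target.y1)
    return max(0.0, shared) / height


def classify(run: Box, target: Box, min_vertical: float = 0.5) -> str:
    """Decide what to do with a text run relative to the replacement rectangle.

    The 0.5 threshold is an assumption; tune it for your documents.
    """
    if intersection(run, target) is None:
        return "keep"
    if vertical_overlap(run, target) < min_vertical:
        return "keep"      # boxes touch, but the run sits on another line
    if target.x1 <= run.x1 and run.x2 <= target.x2:
        return "delete"    # run lies horizontally inside the rectangle
    return "split"         # run only partially intersects the rectangle


def split_run(chars: List[Tuple[str, float, float]],
              target: Box) -> Tuple[str, str, str]:
    """Split a run into (before, inside, after) by each character's midpoint.

    chars holds (character, x_start, x_end) triples in reading order.
    """
    before, inside, after = [], [], []
    for ch, x_start, x_end in chars:
        mid = (x_start + x_end) / 2
        if mid < target.x1:
            before.append(ch)
        elif mid > target.x2:
            after.append(ch)
        else:
            inside.append(ch)
    return "".join(before), "".join(inside), "".join(after)
```

A run classified as "split" yields up to three pieces: the "before"
and "after" strings would be written back as separate text runs, and
the "inside" string is the part to replace. Either outer piece may be
empty, which is why the answer speaks of 2 or 3 elements.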

PDFNet also offers some lower-level methods of relating characters in
'words' to characters in 'text-run' elements (based on character
ids); however, bbox comparison is simpler to understand and should
work in most cases. If this method does not work for you, please let
us know and we will provide more info on the char-id approach.