Implementing search and replace in PDF.

Aaron_Gravesdale · December 20, 2008, 12:20am

Q: I am trying to take a PDF file, scan through it for a character
string and replace that string with another string. I want the strings
to have the same font and location, but the strings may not be the
same length.

I have the program working, based on the EditTextTest sample. However,
the program has introduced some tabs, and I believe it has covered up
some text, although I guess it is possible that some extra text is
being erased.

Couple of things I do not understand.

1) It seems that most of the elements are very small, 2-4 characters
long. The original document was a Word file that someone else
prepared; however, I entered each field as one continuous word, such
as [[DATE_LONG]]. Even that text is broken across multiple elements. I
know that Word introduces a lot of tags for use with undo; however, I
thought that as long as you type it without breaks, backspaces, etc.,
it would be one block for Word. I converted it myself using the Save
As PDF option that Adobe Acrobat 9 adds. Why are they all broken down
into these little elements?

2) My test document is doing 4 substitutions. 3 are isolated, like
address blocks; the 4th is inline in a paragraph. I was hoping that it
would replace the characters and redo the paragraph spacing, or at
least the line. But it seems to have introduced a tab that wasn’t
there before and covered up the adjacent word. A tab was also
introduced in the other 3 locations. I think it is a tab because it
behaves as a single character when you move over it with the
left/right arrow keys.
a. Why was the tab introduced?
b. Why was the adjacent text covered?
-----
A: The simplest way to implement "find/replace" on text within an
existing PDF document using PDFNet is as follows:

1) Search for all occurrences of the string on the PDF page. There are
several ways to implement this, but probably the simplest one is to use
pdftron.PDF.TextExtractor as shown in the TextExtract sample project
(http://www.pdftron.com/net/samplecode.html#TextExtract). The result
of this step is the positioning information for each placeholder on
the page (i.e. the bounding boxes of its word(s)).

2) Edit the existing page (e.g. as illustrated in the ElementEdit
sample - www.pdftron.com/net/samplecode.html#ElementEdit). You could
use the bounding boxes of the strings identified in 1) to detect
whether a given text run should be deleted (i.e. skipped). This step
would essentially remove specific text runs from the page.

3) Finally, you can add new content in place of the old placeholders
(e.g. see www.pdftron.com/net/faq.html#how_watermark). For this step
you would also use the positioning information identified in 1).
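
To make step 2) a bit more concrete, here is a minimal sketch in plain
Python. It is not PDFNet code: Rect, split_runs and the sample
coordinates are made up for illustration, and it assumes you already
have a bounding box for every text run and for every placeholder found
in step 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in page space (hypothetical helper, not a PDFNet type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def intersects(self, other: "Rect") -> bool:
        # Strict comparisons: boxes that only share an edge do not count.
        return (self.x1 < other.x2 and other.x1 < self.x2 and
                self.y1 < other.y2 and other.y1 < self.y2)


def split_runs(runs, placeholder_boxes):
    """Separate text runs into those to keep and those to drop.

    runs: list of (text, Rect) pairs standing in for the page's text runs.
    placeholder_boxes: list of Rect produced by the search step.
    """
    kept, dropped = [], []
    for text, box in runs:
        if any(box.intersects(p) for p in placeholder_boxes):
            dropped.append((text, box))
        else:
            kept.append((text, box))
    return kept, dropped


# Assumed sample data: one line of text whose placeholder spans three runs.
runs = [
    ("Dated ", Rect(72, 700, 105, 712)),
    ("[[DA", Rect(105, 700, 130, 712)),
    ("TE_LO", Rect(130, 700, 165, 712)),
    ("NG]]", Rect(165, 700, 190, 712)),
    (" in Boston", Rect(190, 700, 245, 712)),
]
placeholders = [Rect(105, 700, 190, 712)]

kept, dropped = split_runs(runs, placeholders)
print([t for t, _ in kept])     # ['Dated ', ' in Boston']
print([t for t, _ in dropped])  # ['[[DA', 'TE_LO', 'NG]]']
```

The strict comparisons are a deliberate choice in this sketch: a run
that merely touches the edge of a placeholder box is kept. In a real
document you may need a small tolerance instead, since neighbouring
boxes can overlap slightly.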

You may want to search the PDFNet Knowledge Base
(http://groups.google.com/group/pdfnet-sdk) for more info on this
topic (e.g. "text replace", etc).

1) Why are they all broken down into these little elements?

Text in PDF may be broken into many little elements (text runs)
because MS Word may be applying kerning (fine spacing adjustments)
between adjacent elements, or because the text runs use different
fonts, font sizes, or other properties. Unfortunately, the PDF format
usually does not preserve the semantic structure of text the way HTML
or Word does. For purposes of text extraction it is better to use the
TextExtractor class than ElementReader. Unlike ElementReader,
TextExtractor can recognize words, lines, and paragraphs within PDF
pages and can provide precise positioning information for each word.
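
To illustrate why a per-element search misses a placeholder such as
[[DATE_LONG]], here is a small sketch in plain Python (again, not
PDFNet code). It assumes the runs are already available as strings in
reading order; find_across_runs is a hypothetical helper that searches
the joined text and reports which runs each match touches.

```python
def find_across_runs(runs, needle):
    """Return one list of run indices per occurrence of `needle`.

    runs: list of strings, assumed to be in reading order.
    """
    if not needle:
        return []

    joined = "".join(runs)

    # starts[i] is the offset in `joined` where run i begins.
    starts = []
    pos = 0
    for r in runs:
        starts.append(pos)
        pos += len(r)

    matches = []
    begin = joined.find(needle)
    while begin != -1:
        end = begin + len(needle)
        # A run is touched if its character range overlaps [begin, end).
        touched = [i for i, s in enumerate(starts)
                   if s < end and s + len(runs[i]) > begin]
        matches.append(touched)
        begin = joined.find(needle, end)
    return matches


runs = ["Dated ", "[[DA", "TE_LO", "NG]]", " in Boston"]

# No single run contains the placeholder...
print(any("[[DATE_LONG]]" in r for r in runs))   # False
# ...but the joined text does, spread over runs 1, 2 and 3.
print(find_across_runs(runs, "[[DATE_LONG]]"))   # [[1, 2, 3]]
```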

a. Why was the tab introduced?

The placement of text is completely determined by the current
transformation matrix [element.GetCTM()], the text matrix
[element.GetTextMatrix()] and a number of properties in the graphics
state (e.g. character spacing, word spacing, etc). Also, depending on
which element you select for replacement, the content may end up at
different locations on the page. For a better understanding of what is
happening under the hood you may want to read Section 9 'Text' in the
PDF Reference and use a tool such as CosEdit
(http://www.pdftron.com/cosedit) to inspect the low-level content of a
PDF file.
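
As a rough illustration of how the two matrices combine, here is a
sketch in plain Python. The tuples stand in for the values you would
read from element.GetTextMatrix() and element.GetCTM(); the numbers,
and the helpers concat and apply, are assumptions for this example
only. It ignores font size, character spacing and the other text state
parameters, so it only locates the origin of a run.

```python
def concat(m, n):
    """Multiply two PDF affine matrices given as (a, b, c, d, e, f).

    The result maps a point through m first and then through n.
    """
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(m, x, y):
    """Transform the point (x, y) by the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


# Assumed values: the text matrix moves the run to (100, 700) in text
# space; the CTM scales everything by 0.5 and shifts it by (36, 36).
text_matrix = (1, 0, 0, 1, 100, 700)
ctm = (0.5, 0, 0, 0.5, 36, 36)

# Text space goes through the text matrix first, then through the CTM.
combined = concat(text_matrix, ctm)
print(apply(combined, 0, 0))   # (86.0, 386.0)
```

The point of the example is that the same text matrix can land at very
different places on the page depending on the CTM in effect, so new
text written with only one of the two taken into account will not line
up with the run it replaces.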

Am I using the wrong tool for the job? Is a
PDF document just not suited for the task?

It really depends on your requirements. Text search and replace can be
implemented with PDF; however, it is undoubtedly more complicated than
editing HTML or plain text files.

Please let me know if this helps and if you have any questions.
