Selecting text in PDFNet based on search results from dtSearch

Q: We are trying to visualize search hits in PDF documents by marking them on a rendered image of the page where they were found. The search engine we use is dtSearch – maybe you already have some experience with it.

The search engine's result is the word number of the hit. Our next step is to correlate that word number with the output of PDFNet TextExtractor to get the actual coordinates of the word.

Our problem is the algorithm TextExtractor uses to decide what counts as a word. Is there a way to parameterize it besides using the ProcessingFlags, such as specifying which characters should be treated as whitespace?

A: So your task is to match a word reported by dtSearch with the same word in the PDFNet TextExtractor output, so that you can get the word’s spatial information. Is this correct?

In theory, text on a PDF page lacks structural information, and characters can be specified in any order as long as their positions make sense to a human reader. A visible gap between characters can come from an actual whitespace character, or simply from the starting position of the second character being shifted by a certain amount (in which case there is no space character at all). The text extraction algorithm in PDFNet has to deal with these difficulties and try to recover structural information such as words, lines, and paragraphs. That said, it is possible that a word seen by dtSearch will not be seen by PDFNet, and vice versa. So the bad news is that your task cannot be achieved with 100% accuracy.
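To illustrate why two tools can disagree on word boundaries, here is a minimal sketch of gap-based word splitting. It is not PDFNet's actual algorithm; the glyph tuple, the `WordSplitter` class, and the `gapThreshold` parameter are all hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class WordSplitter
{
    // Groups the glyphs of a single text line into words.
    // Each glyph is a hypothetical (character, x position, width) tuple.
    // A word ends at an explicit whitespace glyph, or when the horizontal
    // gap to the previous glyph is larger than gapThreshold.
    public static List<string> Split(
        IEnumerable<(char Ch, double X, double Width)> glyphs,
        double gapThreshold)
    {
        // Glyphs may appear in any order in the page content, so sort by position.
        var sorted = new List<(char Ch, double X, double Width)>(glyphs);
        sorted.Sort((a, b) => a.X.CompareTo(b.X));

        var words = new List<string>();
        var current = new StringBuilder();
        double prevEnd = double.NaN; // right edge of the previous glyph

        foreach (var g in sorted)
        {
            bool bigGap = !double.IsNaN(prevEnd) && (g.X - prevEnd) > gapThreshold;
            bool isSpace = char.IsWhiteSpace(g.Ch);

            if ((bigGap || isSpace) && current.Length > 0)
            {
                words.Add(current.ToString());
                current.Clear();
            }
            if (!isSpace)
            {
                current.Append(g.Ch);
            }
            prevEnd = g.X + g.Width;
        }

        if (current.Length > 0)
        {
            words.Add(current.ToString());
        }
        return words;
    }
}
```

With the same three glyphs and no space character between them, a small threshold yields two words while a larger one yields a single word. This is the kind of difference that can make word numbers from two extractors drift apart.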

The good news is that in most PDFs the text structure is relatively easy to parse. I believe that, most of the time, the words extracted by dtSearch will be extracted by PDFNet as well. To match a word, can you use the TextSearch class from PDFNet? For example, given a key word, does dtSearch output all of its instances on a page in order? If so, you can search for that key word using the TextSearch class and then simply match the instances by their order.
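The order-based matching described above can be sketched as follows. This is only an illustration under stated assumptions: both word lists are assumed to cover the same page, and `HitMatcher`, `searchWords`, and `pdfWords` are hypothetical names, not part of dtSearch or PDFNet. In practice the PDF-side list would come from whatever PDFNet reports for that page.

```csharp
using System;
using System.Collections.Generic;

public static class HitMatcher
{
    // Returns the index in pdfWords of the occurrence that corresponds to the
    // hit at hitIndex in searchWords, or -1 if no such occurrence exists.
    public static int Match(
        IReadOnlyList<string> searchWords,
        int hitIndex,
        IReadOnlyList<string> pdfWords)
    {
        if (hitIndex < 0 || hitIndex >= searchWords.Count)
        {
            return -1;
        }

        string key = searchWords[hitIndex];

        // How many equal words precede the hit on the search-engine side.
        int ordinal = 0;
        for (int i = 0; i < hitIndex; i++)
        {
            if (string.Equals(searchWords[i], key, StringComparison.OrdinalIgnoreCase))
            {
                ordinal++;
            }
        }

        // Find the occurrence with the same ordinal on the PDF side.
        for (int j = 0; j < pdfWords.Count; j++)
        {
            if (string.Equals(pdfWords[j], key, StringComparison.OrdinalIgnoreCase))
            {
                if (ordinal == 0)
                {
                    return j;
                }
                ordinal--;
            }
        }
        return -1;
    }
}
```

Matching by occurrence order rather than by absolute word number tolerates the two tools splitting other words differently, as long as both see the same instances of the key word. A return value of -1 signals a page where the two disagree and the hit cannot be placed reliably.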

I am not sure whether the third-party solution supports character offsets (and Adobe’s XML highlight format). In any case, you can also use PDFNet to select text based on character offsets (and the XML highlight format).

This is a more vendor-independent way to select text in a PDF. Unfortunately, the feature for some reason disappeared in Acrobat 10.

To select text using the XML highlight format, simply call doc.AddHighlights(highlight), where the ‘highlight’ argument can be either a path to a locally stored file or a string buffer containing the actual XML data:

PDFNet.Initialize();

PDFDoc doc = …

doc.AddHighlights(@"c:/hilite.xml");

doc.Save(…);

doc.Close();
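If the hits are available as character offsets, the highlight data could be generated in memory rather than stored in a file. The sketch below builds such a string. The element and attribute names (`Body`, `Highlight`, `loc`, `pg`, `pos`, `len`), the unquoted attribute style, and the zero-based page index follow my recollection of Adobe's highlight file format and should be checked against the specification. The `HighlightXml` class itself is a hypothetical helper, not a PDFNet API.

```csharp
using System.Collections.Generic;
using System.Text;

public static class HighlightXml
{
    // Builds highlight data from (page, character offset, length) triples.
    // Assumption: pages are zero-based and offsets are counted in characters.
    public static string Build(IEnumerable<(int Page, int Pos, int Len)> hits)
    {
        var sb = new StringBuilder();
        sb.Append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.Append("<XML>\n");
        // Assumed header layout; the color value is an arbitrary example.
        sb.Append("<Body units=characters color=#ff00ff mode=active version=2>\n");
        sb.Append("<Highlight>\n");
        foreach (var h in hits)
        {
            sb.Append($"<loc pg={h.Page} pos={h.Pos} len={h.Len}>\n");
        }
        sb.Append("</Highlight>\n");
        sb.Append("</Body>\n");
        sb.Append("</XML>\n");
        return sb.ToString();
    }
}
```

The returned string would then be passed to doc.AddHighlights() in place of the file path, since the argument can also be a buffer containing the XML data.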

To generate PNG/JPEG thumbnail images with text highlights, use the PDFDraw class (as shown in the PDFDraw sample project: http://www.pdftron.com/pdfnet/samplecode.html#PDFDraw) to render pages after calling AddHighlights().

Other relevant APIs are the pdftron.PDF.Highlights class and PDFViewCtrl.Select(Highlights).