Q: We are trying to visualize search hits in PDF documents by marking them on an image of the page where they were found. The search engine we use is dtSearch; maybe you already have some experience with it.
The search engine's result is the word number of the hit. Our next step is to correlate that word number with the output of PDFNet TextExtractor to get the actual coordinates of the word.
Our problem is the algorithm TextExtractor uses to decide what counts as a word. Is there a way to parameterize it besides using the ProcessingFlags, for example by setting which characters should be treated as whitespace?
A: So your task is to match a word reported by dtSearch with the same word from PDFNet TextExtractor, so that you can get the word's spatial information. Is this correct?
In theory, text on a PDF page lacks structural information, and characters can be specified in arbitrary order as long as their positions make sense to a human reader. A visual gap between two characters may be an actual whitespace character, or it may exist only because the starting position of the second character is shifted by a certain amount (in which case there is no space character at all). The text extraction algorithm in PDFNet has to deal with these difficulties and try to recover the structural information, such as words, lines, and paragraphs. That said, it is possible that a word seen by dtSearch is not seen by PDFNet, and vice versa. So the bad news is that your task cannot be achieved with 100% accuracy.
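To illustrate why word boundaries are a matter of heuristics, here is a minimal Python sketch of gap-based word grouping on a single line of text. It is not PDFNet's actual algorithm. The `Glyph` type, the `group_into_words` function, and the `gap_ratio` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Glyph:
    """Hypothetical positioned character on one text line."""
    char: str
    x: float      # left edge of the glyph
    width: float  # horizontal advance of the glyph


def group_into_words(glyphs: List[Glyph], gap_ratio: float = 0.3) -> List[str]:
    """Split one line of glyphs into words.

    A word ends at an explicit whitespace glyph, or where the horizontal
    gap to the next glyph exceeds gap_ratio times the wider of the two
    glyphs (an assumed threshold, not a value taken from PDFNet).
    """
    words: List[str] = []
    current = ""
    prev: Optional[Glyph] = None
    # Glyphs may arrive in any order, so sort by position first.
    for g in sorted(glyphs, key=lambda item: item.x):
        if g.char.isspace():
            if current:
                words.append(current)
                current = ""
            prev = None
            continue
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * max(prev.width, g.width):
                # Large gap with no space character: still a word break.
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words
```

Two tools that pick a different threshold, or treat different characters as separators, will number the words on a page differently, which is where mismatches between word counts tend to come from.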
The good news is that in most PDFs the structural text information is relatively easy to recover. I believe that, most of the time, the words extracted by dtSearch will be extracted by PDFNet as well. To match a word, could you use the TextSearch class from PDFNet? For example, given a keyword, does dtSearch output all of its instances on a page in order? If so, you can search for that keyword using the TextSearch class and then simply pair up the instances from both tools by their order.
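The pairing by order could be sketched roughly as follows in Python. This assumes you have already collected, for one keyword on one page, the dtSearch word numbers and the bounding boxes from a PDF-side search in reading order. The `pair_hits_in_order` function and the `Box` tuple are hypothetical helpers, not part of either product's API.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical bounding box: (x1, y1, x2, y2) in page coordinates.
Box = Tuple[float, float, float, float]


def pair_hits_in_order(dt_hits: Sequence[int],
                       pdf_boxes: Sequence[Box]) -> Dict[int, Box]:
    """Pair the n-th dtSearch hit with the n-th PDF-side hit.

    dt_hits:   word numbers reported by the search engine for one
               keyword on one page (sorted here into ascending order).
    pdf_boxes: boxes for the same keyword on the same page, assumed
               to be in reading order already.
    """
    if len(dt_hits) != len(pdf_boxes):
        # The two tools disagree on how many instances exist, so
        # positional pairing would be unreliable for this page.
        raise ValueError(
            f"hit count mismatch: {len(dt_hits)} vs {len(pdf_boxes)}"
        )
    return dict(zip(sorted(dt_hits), pdf_boxes))
```

Raising on a count mismatch is a deliberate choice in this sketch: when the two tools disagree on the number of instances, it is safer to flag the page than to highlight the wrong word.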