[PDFNet] Full Text Search

James1 · November 2, 2012, 1:19am

PDFNet includes a sophisticated text extraction engine that can be used to build an index of the text found in a set of PDFs. For details, see the documentation for the TextExtractor class, available online here: http://www.pdftron.com/pdfnet/documentation.html, and the TextExtract sample project, found online here: http://www.pdftron.com/pdfnet/samplecode.html#TextExtract

Ivanho · November 2, 2012, 1:29am

There is also the TextSearch class / sample:
http://www.pdftron.com/pdfnet/samplecode.html#TextSearch

With TextExtractor you can pass extracted text to Lucene for indexing.

If you need to highlight text, you can index the text per page (say, with the help of Lucene).
Then run a quick page-specific search with TextSearch on just the matching pages. This gives you the bbox position of each match, and you can also save the hits in the XML highlight format (pdftron.PDF.Highlights.Save(…)). PDFViewCtrl can then load the selection from that file, etc.
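
To illustrate the page-level indexing idea, here is a minimal sketch in plain Python. It is only a stand-in for what Lucene would do, not PDFNet or Lucene code: it assumes the text of each page has already been extracted (for example with TextExtractor) and is passed in as a string. The names build_page_index and pages_matching, and the sample file names, are hypothetical.

```python
import re
from collections import defaultdict


def build_page_index(pages):
    """Build an inverted index: term -> set of (doc_name, page_number).

    `pages` maps (doc_name, page_number) to the already-extracted page text.
    """
    index = defaultdict(set)
    for page_key, text in pages.items():
        # Lowercase and split on word characters; a real indexer such as
        # Lucene would apply its own analyzer here.
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(page_key)
    return index


def pages_matching(index, query):
    """Return the pages that contain every term of the query (AND semantics)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample data standing in for text extracted from real PDFs.
sample_pages = {
    ("a.pdf", 1): "Full text search in PDF",
    ("a.pdf", 2): "Highlight annotations",
    ("b.pdf", 1): "Text search and highlight",
}
sample_index = build_page_index(sample_pages)
candidate_pages = sorted(pages_matching(sample_index, "text search"))
```

The candidate pages returned here are the ones you would then hand to TextSearch, so that the more expensive positional search runs only where a hit is already known to exist.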