What is the fastest way to search text in a PDF?

Aaron_Gravesdale · October 25, 2008, 12:16am

Q: If I have a large PDF, 700 pages for example, what is the fastest
way to search these pages for a string of text and find out what page
it occurs on?

After checking the APIs, my current solution is to loop through each
page and use the text extractor. This takes about 1.5 seconds to loop
through all pages. Is there some other way that can search the entire
PDF document at once and return the page that the text is found on?

My sample code (C#, .NET 3.5)

// Use PDF Tron to search a large doc for a string of text.
// Add page text to a list that I can search after the loop completes.
List<string> pageText = new List<string> ( );
PDFNet.Initialize ( );
Page page = null;
TextExtractor txt = new TextExtractor ( );
PDFDoc doc = new PDFDoc ( "700PagePDF.pdf" );
for ( int i = 1; i <= doc.GetPageCount(); i++ )
{
       page = doc.GetPage ( i );
       txt.Begin ( page );
       pageText.Add ( txt.GetAsText ( ) );
}

txt.Dispose();

...Search the list for my wanted text.
...
--------
A: A faster approach is to search each page's text as soon as it is
extracted and to stop at the first match. This is faster than
accumulating text from all pages (which may also take a lot of memory)
and then searching all of the page buffers.
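
As a rough sketch of the early-exit idea: the helper below checks each page's text as it is produced and returns at the first hit. The names PageSearch, FindFirstPage and getPageText are hypothetical and not part of any library. The callback stands in for the per-page extraction step, and the case-insensitive comparison is an assumption that may not suit every search.

```csharp
using System;

public static class PageSearch
{
    // Returns the 1-based number of the first page whose text contains
    // `needle`, or -1 if no page matches. `getPageText` is a hypothetical
    // callback that returns the extracted text for a given page number.
    public static int FindFirstPage(int pageCount,
                                    Func<int, string> getPageText,
                                    string needle)
    {
        for (int i = 1; i <= pageCount; i++)
        {
            string text = getPageText(i);

            // Assumption: a case-insensitive ordinal match is wanted.
            if (text != null &&
                text.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return i; // stop here; later pages are never extracted
            }
        }
        return -1;
    }
}
```

With the code from the question, the callback would call txt.Begin ( doc.GetPage ( i ) ) and return txt.GetAsText ( ). Pages after the first match are then never extracted, and only one page of text is held in memory at a time. If the text is absent or only on the last page, the cost is about the same as the original loop.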

You may also want to improve your string search procedure (e.g. by
using a fast sub-linear algorithm), but I am not sure whether this will
be worth the effort in the case of PDF.
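
For reference, one well-known sub-linear algorithm is Boyer-Moore-Horspool. The sketch below is a minimal, case-sensitive version. The class name Horspool is just an illustrative choice. Whether it beats the built-in string.IndexOf on real page text is something you would have to measure.

```csharp
using System;
using System.Collections.Generic;

public static class Horspool
{
    // Returns the index of the first occurrence of `pattern` in `text`
    // (ordinal, case-sensitive), or -1 if it does not occur.
    public static int IndexOf(string text, string pattern)
    {
        int m = pattern.Length;
        int n = text.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // For every pattern character except the last one, store the
        // distance from its rightmost occurrence to the pattern's end.
        Dictionary<char, int> shift = new Dictionary<char, int>();
        for (int k = 0; k < m - 1; k++)
        {
            shift[pattern[k]] = m - 1 - k;
        }

        int pos = 0;
        while (pos <= n - m)
        {
            // Compare the pattern against the window from right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;

            // Skip ahead based on the text character under the pattern's
            // last position; characters not in the table skip a full
            // pattern length.
            int s;
            char last = text[pos + m - 1];
            pos += shift.TryGetValue(last, out s) ? s : m;
        }
        return -1;
    }
}
```

The skip table lets the search jump several characters at a time when the character under the end of the pattern does not occur in it, which is where the sub-linear behaviour on typical text comes from.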