How do I search a PDF using a regular expression?

Aaron_Gravesdale · February 26, 2016, 1:24am

Q:

The regular expression “[0-9silo]{3}[\s-.~,][0-9silo]{2}[\s-.~,][0-9silo]{4}” gives the error below. The same expression works with the .NET regular expression engine, so I am trying to understand how TextSearch handles regular expressions differently.

Invalid regular expression encountered:

pdftron.PDF.TextSearch.SetPattern

My code is similar to your TextSearch sample:

http://www.pdftron.com/pdfnet/samplecode.html#TextSearch

A:

PDFNet uses the Boost regular expression engine. Details of its syntax can be found at:

http://www.boost.org/doc/libs/release/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

The problem comes from the set expression “[\s-.~,]”. It looks like you are trying to match any one of several separator characters. Note that ‘-’ is special inside a set: unless it is the first or last character of the set, or is escaped, it defines a range. For example, [s-v] matches the characters s, t, u, and v.

Here the ‘-’ sits between “\s” and ‘.’, so the search engine tries to read “\s-.” as a range that ends at ‘.’. That range is not valid: “\s” is a character class rather than a single character, so it cannot start a range, and a range from a literal ‘s’ to ‘.’ would be reversed, because ‘s’ sorts after ‘.’. The .NET engine appears to treat the ‘-’ as a literal in this position, which would explain why the same expression works there.

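With that said, you might want to search for:

“[0-9silo]{3}[\s.~,][0-9silo]{2}[\s.~,][0-9silo]{4}”

Note that this set no longer matches ‘-’. If you still need to match a literal hyphen, place it last in the set, for example “[\s.~,-]”, or escape it as “\-”.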
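To sanity-check a set that keeps the hyphen as a literal, a small test along these lines may help. This is only a sketch: it uses the .NET engine (System.Text.RegularExpressions) as a stand-in, not the Boost engine that TextSearch uses, and the class name and sample strings are made up for illustration. Placing ‘-’ last in a set is standard Perl-style syntax, so the pattern should behave the same way in both engines.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeparatorPatternDemo
{
    // Verbatim string (@"...") so the backslash in \s reaches the regex engine unchanged.
    // The '-' is last in each set, so it is a literal hyphen rather than a range operator.
    public const string Pattern =
        @"[0-9silo]{3}[\s.~,-][0-9silo]{2}[\s.~,-][0-9silo]{4}";

    // Uses the .NET engine only as a stand-in; PDFNet's TextSearch uses Boost.
    public static bool IsMatch(string text)
    {
        return Regex.IsMatch(text, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsMatch("123-45-6789"));  // True: hyphen separators
        Console.WriteLine(IsMatch("1s3.o5~678l"));  // True: look-alike letters, '.' and '~' separators
        Console.WriteLine(IsMatch("12345-6789"));   // False: no separator after the first three characters
    }
}
```

Once the pattern behaves as expected here, the same string can be passed to TextSearch, keeping in mind that the two engines can still differ on other constructs.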

Poulami_Maity · May 6, 2015, 11:10am

Hi, I have been using PDFNet to search for text and highlight it as needed.
The text that I search for may contain newline or carriage return characters.
In some PDF documents, matches that span a line break are not highlighted properly.
Here is the code:

        internal static List<nENTITIES.Coordinate> GetPDFCoordinate(string pdfLocation, nENTITIES.SnippetCollection snptColl)
        {
            List<nENTITIES.Coordinate> lstOrdinates = new List<nENTITIES.Coordinate>();
            WebClient downloadClient = new WebClient();
            byte[] byteContent = downloadClient.DownloadData(pdfLocation);
            PDFDoc doc = new PDFDoc(byteContent, byteContent.Length);
            doc.InitSecurityHandler();
            try
            {
                using (doc)
                {
                    Int32 page_num = 0;
                    String result_str = "", ambient_string = "";
                    Highlights hlts = new Highlights();

                    TextSearch txt_search = new TextSearch();
                    Int32 mode = (Int32)(TextSearch.SearchMode.e_reg_expression | TextSearch.SearchMode.e_page_stop | TextSearch.SearchMode.e_highlight | TextSearch.SearchMode.e_whole_word);
                    foreach (var snippet in snptColl.SnippetData.Single().Match)
                    {
                        if (!string.IsNullOrEmpty(snippet.Text))
                        {
                            snippet.TextLeft = snippet.TextLeft.Replace(@"\", @"\\");
                            snippet.TextLeft = snippet.TextLeft.Replace("?", "");
                            snippet.TextLeft = snippet.TextLeft.Replace("(", "\\(");
                            snippet.TextLeft = snippet.TextLeft.Replace(")", "\\)");
                            snippet.TextLeft = snippet.TextLeft.Replace("+", "\\+");
                            snippet.TextLeft = snippet.TextLeft.Replace("*", "\\*");
                            snippet.TextLeft = snippet.TextLeft.Replace("^", "\\^");
                            snippet.TextLeft = snippet.TextLeft.Replace("$", "\\$");
                            snippet.TextLeft = snippet.TextLeft.Replace("|", "\\|");
                            snippet.TextLeft = snippet.TextLeft.Replace("[", "\\[");
                            snippet.TextLeft = snippet.TextLeft.Replace("{", "\\{");
                            snippet.TextLeft = snippet.TextLeft.Replace("}", "\\}");

                            snippet.TextRight = snippet.TextRight.Replace(@"\", @"\\");
                            snippet.TextRight = snippet.TextRight.Replace("?", "");
                            snippet.TextRight = snippet.TextRight.Replace("(", "\\(");
                            snippet.TextRight = snippet.TextRight.Replace(")", "\\)");
                            snippet.TextRight = snippet.TextRight.Replace("+", "\\+");
                            snippet.TextRight = snippet.TextRight.Replace("*", "\\*");
                            snippet.TextRight = snippet.TextRight.Replace("^", "\\^");
                            snippet.TextRight = snippet.TextRight.Replace("$", "\\$");
                            snippet.TextRight = snippet.TextRight.Replace("|", "\\|");
                            snippet.TextRight = snippet.TextRight.Replace("[", "\\[");
                            snippet.TextRight = snippet.TextRight.Replace("{", "\\{");
                            snippet.TextRight = snippet.TextRight.Replace("}", "\\}");

                            string keyword = snippet.HighLight;
                            string pattern = string.Empty;
                            int flag = 0;
                            //pattern = "(?<=" + snippet.Text + ")" + keyword;

                            if (string.IsNullOrEmpty(snippet.TextRight))
                            {
                                flag = 1;
                                pattern = "(?<=" + snippet.TextLeft + ")" + keyword;
                            }
                            else if (string.IsNullOrEmpty(snippet.TextLeft))
                            {
                                flag = 2;
                                pattern = keyword + "(?=" + snippet.TextRight + ")";
                            }
                            else
                            {
                                pattern = "(?<=" + snippet.TextLeft + ")" + keyword + "(?=" + snippet.TextRight + ")";
                            }
                            // Call Begin() to initialize the text search.
                            txt_search.Begin(doc, pattern, mode, -1, -1);
                            bool done = false;
                            while (!done)
                            {
                                TextSearch.ResultCode code = txt_search.Run(ref page_num, ref result_str, ref ambient_string, hlts);
                                switch (code)
                                {
                                    case TextSearch.ResultCode.e_found:
                                        hlts.Begin(doc);
                                        while (hlts.HasNext())
                                        {
                                            Page cur_page = doc.GetPage(hlts.GetCurrentPageNumber());

                                            double[] quads = hlts.GetCurrentQuads();
                                            int quad_count = quads.Length / 8;
                                            for (int i = 0; i < quad_count; ++i)
                                            {
                                                //assume each quad is an axis-aligned rectangle
                                                int offset = 8 * i;
                                                double x1 = Math.Min(Math.Min(Math.Min(quads[offset + 0], quads[offset + 2]), quads[offset + 4]), quads[offset + 6]);
                                                double x2 = Math.Max(Math.Max(Math.Max(quads[offset + 0], quads[offset + 2]), quads[offset + 4]), quads[offset + 6]);
                                                double y1 = Math.Min(Math.Min(Math.Min(quads[offset + 1], quads[offset + 3]), quads[offset + 5]), quads[offset + 7]);
                                                double y2 = Math.Max(Math.Max(Math.Max(quads[offset + 1], quads[offset + 3]), quads[offset + 5]), quads[offset + 7]);
                                                if (flag == 1)
                                                {
                                                    lstOrdinates.Add(new nENTITIES.Coordinate { PageNumber = hlts.GetCurrentPageNumber(), X1 = x1, X2 = x2, Y1 = y1, Y2 = y2, Snippet = snippet.TextLeft + "" + keyword.ToLower() + "" });
                                                }
                                                else if (flag == 2)
                                                {
                                                    lstOrdinates.Add(new nENTITIES.Coordinate { PageNumber = hlts.GetCurrentPageNumber(), X1 = x1, X2 = x2, Y1 = y1, Y2 = y2, Snippet = snippet.TextRight + "" + keyword.ToLower() + "" });
                                                }
                                                else
                                                {
                                                    lstOrdinates.Add(new nENTITIES.Coordinate { PageNumber = hlts.GetCurrentPageNumber(), X1 = x1, X2 = x2, Y1 = y1, Y2 = y2, Snippet = snippet.TextLeft + "" + keyword.ToLower() + "" + snippet.TextRight });
                                                }
                                            }
                                            hlts.Next();
                                        }
                                        break;
                                    case TextSearch.ResultCode.e_done:
                                        done = true;
                                        break;
                                    case TextSearch.ResultCode.e_page:
                                        break;
                                    default:
                                        break;
                                }
                            }
                        }
                    }
                }
            }
            catch (PDFNetException ex)
            {
                _mhcLogger.Fatal(ex.Message, ex, _currentMethodName);
                throw new ApplicationException(nCOMMON.ErrorCode.c_UNEXPECTED_SYSTEM_ERROR.ToString(), ex);
            }
            catch (Exception ex)
            {
                _mhcLogger.Fatal(ex.Message, ex, _currentMethodName);
            }

            return lstOrdinates;
        }

For example, the search text (?<=wood. H. Trim Members For Replacement Windows: 1. Trim members for )vinyl clad window is not highlighted properly when, in the PDF, there is a line break after "wood. " and after “Windows:”.
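The chained Replace calls in the code above can also be written as a single pass over the string, which avoids any risk of escaping an earlier replacement a second time. The following is only a sketch under stated assumptions: EscapeLiteral and Build are hypothetical helper names, the metacharacter list is an assumption about Perl-style syntax, and unlike the posted code it escapes ‘?’, ‘.’ and ‘]’ and also escapes the keyword. It does not address the line-break issue itself.

```csharp
using System;
using System.Text;

public static class SnippetPattern
{
    // Characters given a backslash prefix. This list is an assumption based on
    // common Perl-style metacharacters; adjust it to the engine you target.
    private const string Metacharacters = @"\.^$|?*+()[]{}";

    // Hypothetical helper: escapes every metacharacter in a single pass,
    // so an earlier replacement can never be escaped a second time.
    public static string EscapeLiteral(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;
        var sb = new StringBuilder(text.Length * 2);
        foreach (char c in text)
        {
            if (Metacharacters.IndexOf(c) >= 0) sb.Append('\\');
            sb.Append(c);
        }
        return sb.ToString();
    }

    // Hypothetical helper: wraps the keyword in a lookbehind and/or lookahead,
    // omitting whichever side has no context text.
    public static string Build(string left, string keyword, string right)
    {
        var sb = new StringBuilder();
        if (!string.IsNullOrEmpty(left)) sb.Append("(?<=").Append(EscapeLiteral(left)).Append(')');
        sb.Append(EscapeLiteral(keyword));
        if (!string.IsNullOrEmpty(right)) sb.Append("(?=").Append(EscapeLiteral(right)).Append(')');
        return sb.ToString();
    }

    public static void Main()
    {
        // Prints: (?<=Trim members for )vinyl clad window
        Console.WriteLine(Build("Trim members for ", "vinyl clad window", null));
        // Prints: (?<=cost \(USD\) )1\.50(?= each)
        Console.WriteLine(Build("cost (USD) ", "1.50", " each"));
    }
}
```

If the keyword is meant to be a regular expression in its own right, it should be appended unescaped instead.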

Ryan · May 7, 2015, 11:44pm

The actual PDF file in question is almost certainly required to diagnose this. Please modify the TextSearch sample in the SDK to reproduce the issue, and send the modified code, the input PDF file, and your expected output/behavior to support at pdftron.