Regex Search for Japanese Text

Product: WebViewer

Product Version: 8.12.0

Please give a brief summary of your issue:
Regex Search for Japanese Text

Please describe your issue and provide steps to reproduce it:
Hi, I’m trying to add search patterns that will work for Japanese characters and other writing systems. I’ve read online about Unicode categories (Regex Tutorial - Unicode Characters and Properties) so I wrote these patterns that should match any string of characters in any writing system:

\p{Letter}+
[\p{Letter}\p{Mark}]+
[A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+

Those work on other platforms but not in WebViewer. I tried searching for all three in the online demo (JS Search & Highlight Text in a PDF Demo | Apryse WebViewer), using the document attached below. The first one gives me this error:


And the second matches English text but not Japanese. I get the same results when I add them to our custom WebViewer with this code:

instance.UI.addRedactionSearchPattern({label: "Letter", type: "Letter", regex: /\p{Letter}+/iu})
instance.UI.addRedactionSearchPattern({label: "unicode", type: "unicode", regex: /[A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+/iu})
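
For comparison, here is a minimal sketch of how the three patterns behave in a plain JavaScript engine (Node.js or a browser console), outside WebViewer. The sample string and the findAll helper are made up for illustration and are not part of the WebViewer API.

```javascript
// Hypothetical sample text mixing English and Japanese (not taken from the attached PDF).
const sample = 'Patient 患者 カルテ report';

// The three patterns from the post, with the g flag added so every match is returned.
const patterns = {
  letter: /\p{Letter}+/gu,
  letterOrMark: /[\p{Letter}\p{Mark}]+/gu,
  ranges: /[A-Za-z\u00C0-\u1FFF\u2800-\uFFFD]+/gu,
};

// Illustrative helper: collect every match of a global regex as an array of strings.
function findAll(regex, text) {
  return Array.from(text.matchAll(regex), (match) => match[0]);
}

for (const [name, regex] of Object.entries(patterns)) {
  console.log(name, findAll(regex, sample));
}
```

In an engine that supports Unicode property escapes, each pattern returns both the English and the Japanese words from the sample, which is the behavior I expected from WebViewer as well.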

Can anyone explain why that error is happening and if there is any way around it? Is it because these are not valid patterns in the Core library?

japanese language clinical document 4 pages.pdf (444.0 KB)

Thank you for contacting Apryse support.

I’m reviewing your request and will get back to you shortly.

I am sending this over to the WebViewer Team. They should be in contact shortly.

Thanks @jmccarthy1, I appreciate it. Can they provide me with some technical information about what kinds of regexes are allowed in webviewer-core? I’ve been trying to figure out which regular expression grammar is used (Regular Expressions (C++) | Microsoft Learn).

Also, is there a way I can look at the code in the core library? If I could see what’s in TextSearch.cpp that would help.

Hello kanderson,

We have examples of what Regex format we support here:
https://docs.apryse.com/api/web/UI.html#.addRedactionSearchPattern

Unfortunately, we cannot provide the source code to our customers.

Best regards,
Tyler

@tgordon Thanks but that still doesn’t explain why I get an error when I search with this regex: \p{Letter}+

I think it has something to do with the library that the core is using for regexes and whether it supports Unicode. See here:

https://www.regular-expressions.info/unicode.html#category

So can you tell me what library the core is using for regexes?

Hello kanderson,

We use the regular JavaScript RegExp: RegExp - JavaScript | MDN

So the / would be invalid.
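
Assuming the pattern does end up in a standard JavaScript RegExp, one thing worth checking is whether the u flag survives along the way. The sketch below is plain JavaScript, not WebViewer code, and says nothing about how the core handles flags; it only shows how \p{Letter} behaves with and without u.

```javascript
const japanese = '患者';

// With the u flag, \p{Letter} is a Unicode property escape.
const withFlag = /\p{Letter}+/u;
// Without it, \p is just an escaped "p" and the braces are literal text.
const withoutFlag = /\p{Letter}+/;

console.log(withFlag.test(japanese));       // true
console.log(withoutFlag.test(japanese));    // false
console.log(withoutFlag.test('p{Letter}')); // true: the pattern is read literally

// Rebuilding a regex from its source keeps the property escape
// only if the original flags are passed along as well.
const rebuilt = new RegExp(withFlag.source, withFlag.flags);
const rebuiltNoFlags = new RegExp(withFlag.source);

console.log(rebuilt.test(japanese));        // true
console.log(rebuiltNoFlags.test(japanese)); // false
```

If a pattern is ever copied by its source string alone, the last case applies and Japanese text stops matching even though no error is raised.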

Best regards,
Tyler

Ok, I’m gonna ask about this in a different channel.
