I have found that the Apryse SDK does not seem to support the Hebrew alphabet: if a PDF with content in Hebrew is sent to the SDK, the resulting docx file has random numbers and Latin letters in it.
I have looked through the documentation, but could not find a list of supported scripts. Is it possible to obtain such a list more quickly than by trial and error?
Hello, I’m AiRon, an automated tech support bot. While you wait for one of our customer support representatives, I can help you with your question.
It seems you are facing issues with text conversion for Hebrew script using the Apryse SDK. While there isn’t a readily available list of supported alphabets or scripts in the documentation, the SDK generally supports a wide range of languages and scripts. However, the accuracy of conversion can depend on the specifics of the PDF content and the fonts used.
Some things you can try:
Ensure that the PDF uses embedded fonts for Hebrew text, as missing fonts can lead to improper text extraction.
Check if the latest version of the SDK has improved support for Hebrew or other complex scripts.
Consider using OCR (Optical Character Recognition) features if the text is not being recognized correctly.
Reach out to Apryse support for specific inquiries about language support and potential workarounds.
Here are some resources that might help you:
Text Extraction Guide: Learn how to extract text from PDFs and troubleshoot common issues.
OCR Guide: Explore using OCR capabilities for recognizing text in images or non-standard fonts.
PDF/UA Accessibility Guide: Understand how to work with accessible PDFs, which may help with language support.
Apryse Community Forum: Engage with other users and developers for insights and shared experiences.
Hopefully, this helps you solve your problem while you wait for a human to get back to you.
My name is Christopher, and I am a Support Engineer with the Apryse toolkit. Would it be possible for you to provide the file containing Hebrew that you are having issues with?
In the meantime, I have carried out an experiment by myself and am convinced that the issue is due to the file’s font not being embedded.
I am attaching two files. “NOK” is an excerpt of the file I was trying to convert (I have deleted everything that could be considered confidential information). “OK” is a file that I created by saving a Word file with some sample Hebrew text in it; this file converts correctly, hence the name. NOK.pdf (137.2 KB) OK.pdf (32.8 KB)
I was able to see what you were referring to when doing a straight conversion from .pdf to .docx: certain numbers showed up in place of the Hebrew letters. That said, I got better results after adding a SetCustomOCRLanguage() call. Please try the following code snippet and see if it works on your end.
```csharp
// Required namespaces (place at the top of the file).
using System;
using pdftron;
using pdftron.Common;

// inputPath and outputPath are assumed to be defined elsewhere by the caller.
PDFNet.Initialize(PDFTronLicense.Key);
// Replace this with the Lib folder of your own Structured Output module.
PDFNet.AddResourceSearchPath(@"C:\Users\ChristopherThompson\OneDrive - Apryse\Desktop\StructuredOutputWindows\Lib");
bool err = false;
try
{
    // Convert PDF document to Word
    Console.WriteLine("Converting PDF to Word");
    string outputFile = outputPath;
    pdftron.PDF.Convert.WordOutputOptions options = new pdftron.PDF.Convert.WordOutputOptions();
    // Run the conversion through OCR, using Hebrew ("heb") as the language.
    options.SetCustomOCRLanguage("heb");
    pdftron.PDF.Convert.ToWord(inputPath, outputFile, options);
    Console.WriteLine("Result saved in " + outputFile);
}
catch (PDFNetException e)
{
    Console.WriteLine("Unable to convert PDF document to Word, error: " + e.Message);
    err = true;
}
catch (Exception e)
{
    Console.WriteLine("Unknown Exception, error: ");
    Console.WriteLine(e);
    err = true;
}
```
Thank you for responding!
Are you by any chance able to send the result of the conversion using that code snippet?
Also, am I correct in assuming that the code snippet you shared forces the SDK to use OCR, rather than the actual character by character conversion (sorry if I misnamed it, I don’t really know the correct name for it)?
Yes, the code snippet that I shared does use OCR (Optical Character Recognition) to convert the file. Below I have added the output .docx files for each of the .pdf files provided.
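For anyone who wants to compare the two outputs without reading Hebrew, a rough sanity check might help. The sketch below is not part of the Apryse SDK: `HebrewCheck` and `HebrewRatio` are hypothetical names, and it assumes you have already pulled the plain text out of the converted .docx by some other means. It simply measures what share of the letters and digits in a string fall inside the Unicode Hebrew block (U+0590 to U+05FF), so a failed conversion full of numbers and Latin letters should score close to 0.

```csharp
// Hypothetical helper, not an Apryse API: a rough check of how "Hebrew" a string is.
public static class HebrewCheck
{
    // Returns the fraction of letters and digits in `text` that lie in the
    // Unicode Hebrew block (U+0590..U+05FF). Whitespace, punctuation, and
    // combining marks are ignored. Returns 0 when there is nothing to count.
    public static double HebrewRatio(string text)
    {
        int total = 0;
        int hebrew = 0;
        foreach (char c in text)
        {
            if (!char.IsLetterOrDigit(c))
            {
                continue; // skip spaces, punctuation, vowel points, etc.
            }
            total++;
            if (c >= '\u0590' && c <= '\u05FF')
            {
                hebrew++;
            }
        }
        return total == 0 ? 0.0 : (double)hebrew / total;
    }
}
```

A possible use: call `HebrewCheck.HebrewRatio(text)` on the text of each converted file and treat a low value for a document you know to be Hebrew as a sign that the conversion needs the OCR route. What counts as "low" is a judgment call that depends on how much Latin text and how many digits the document legitimately contains.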