List of supported alphabets

I have found that the Apryse SDK does not support the Hebrew alphabet: if a PDF with Hebrew content is sent to the SDK, the resulting .docx file contains random numbers and Latin letters.

I have looked through the documentation but could not find a list of supported scripts. Is it possible to obtain such a list more quickly than by trial and error?

Thanks!


Hello, I’m AiRon, an automated tech support bot.
While you wait for one of our customer support representatives, I can help you with your question.


It seems you are facing issues with text conversion for Hebrew script using the Apryse SDK. While there isn’t a readily available list of supported alphabets or scripts in the documentation, the SDK generally supports a wide range of languages and scripts. However, the accuracy of conversion can depend on the specifics of the PDF content and the fonts used.

Some things you can try:

  • Ensure that the PDF uses embedded fonts for Hebrew text, as missing fonts can lead to improper text extraction.

  • Check if the latest version of the SDK has improved support for Hebrew or other complex scripts.

  • Consider using OCR (Optical Character Recognition) features if the text is not being recognized correctly.

  • Reach out to Apryse support for specific inquiries about language support and potential workarounds.



Here are some resources that might help you:

  • Text Extraction Guide: Learn how to extract text from PDFs and troubleshoot common issues.

  • OCR Guide: Explore using OCR capabilities for recognizing text in images or non-standard fonts.

  • PDF/UA Accessibility Guide: Understand how to work with accessible PDFs, which may help with language support.

  • Apryse Community Forum: Engage with other users and developers for insights and shared experiences.



Hopefully, this helps you solve your problem while you wait for a human to get back to you.


Hi Leonardo,

My name is Christopher, and I am a Support Engineer with the Apryse toolkit. Would it be possible for you to provide me with the Hebrew file that you are having issues with?

Thanks,


Hello Christopher,

Thank you for your response!

In the meantime, I have carried out an experiment by myself and am convinced that the issue is due to the file’s font not being embedded.

I am attaching two files. The first, named “NOK”, is an excerpt of the file I was trying to convert (I have deleted everything that could be considered confidential information). The second is a file that I created by saving a Word document with some sample Hebrew text in it as a PDF; it converts correctly, so I have named it “OK”.
NOK.pdf (137.2 KB)
OK.pdf (32.8 KB)

Please let me know if you need anything else.

Thanks!

Leo
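
The embedded-font hypothesis above can be checked before conversion by listing the fonts a PDF uses and whether each one is embedded. The following is a minimal sketch, not an official Apryse recipe. It assumes the PDFNet .NET SDK is referenced, that ElementReader, GState.GetFont(), Font.GetName() and Font.IsEmbedded() behave as in the SDK samples, and that the license key string and the command-line argument are placeholders. It only walks top-level page content, so fonts used solely inside nested form XObjects would be missed.

```csharp
using System;
using System.Collections.Generic;
using pdftron;
using pdftron.PDF;

class FontEmbeddingCheck
{
	static void Main(string[] args)
	{
		// Placeholder: substitute your own license key.
		PDFNet.Initialize("YOUR_LICENSE_KEY");

		// Hypothetical usage: FontEmbeddingCheck.exe NOK.pdf
		string inputPath = args[0];
		var seen = new HashSet<string>();

		using (PDFDoc doc = new PDFDoc(inputPath))
		using (ElementReader reader = new ElementReader())
		{
			doc.InitSecurityHandler();

			for (PageIterator itr = doc.GetPageIterator(); itr.HasNext(); itr.Next())
			{
				reader.Begin(itr.Current());

				Element element;
				while ((element = reader.Next()) != null)
				{
					// Only text elements carry a font in their graphics state.
					if (element.GetType() != Element.Type.e_text) continue;

					pdftron.PDF.Font font = element.GetGState().GetFont();
					string name = font.GetName();

					// Report each font name once per document.
					if (seen.Add(name))
					{
						Console.WriteLine("{0}: embedded = {1}", name, font.IsEmbedded());
					}
				}

				reader.End();
			}
		}
	}
}
```

Running this against both attachments would show whether the fonts in “NOK” are reported as not embedded while those in “OK” are, which is what the hypothesis predicts.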


Hi Leonardo,

I was able to reproduce what you described when doing a straight conversion from .pdf to .docx: certain numbers showed up in place of the letters. That said, I got better results after adding the SetCustomOCRLanguage() call. Please try the following code snippet and let me know whether it works on your end.

using System;
using pdftron;
using pdftron.Common;
using pdftron.PDF;

// inputPath and outputPath are assumed to be defined elsewhere.
PDFNet.Initialize(PDFTronLicense.Key);

// Folder containing the Structured Output module used by the conversion.
PDFNet.AddResourceSearchPath(@"C:\Users\ChristopherThompson\OneDrive - Apryse\Desktop\StructuredOutputWindows\Lib");

bool err = false;

try
{
	// Convert PDF document to Word
	Console.WriteLine("Converting PDF to Word");

	string outputFile = outputPath;

	pdftron.PDF.Convert.WordOutputOptions options = new pdftron.PDF.Convert.WordOutputOptions();

	// Run OCR with Hebrew ("heb") as the recognition language.
	options.SetCustomOCRLanguage("heb");

	pdftron.PDF.Convert.ToWord(inputPath, outputFile, options);

	Console.WriteLine("Result saved in " + outputFile);
}
catch (PDFNetException e)
{
	Console.WriteLine("Unable to convert PDF document to Word, error: " + e.Message);
	err = true;
}
catch (Exception e)
{
	Console.WriteLine("Unknown Exception, error: ");
	Console.WriteLine(e);
	err = true;
}

Hello Christopher,

Thank you for responding!
Are you by any chance able to send the result of the conversion using that code snippet?

Also, am I correct in assuming that the code snippet you shared forces the SDK to use OCR, rather than the actual character by character conversion (sorry if I misnamed it, I don’t really know the correct name for it)?

Thanks!
Leo


Hi Leonardo,

Yes, the code snippet that I shared does use OCR (Optical Character Recognition) to convert the file. Below I have added the output .docx files for each of the .pdf files provided.

test_NOK.docx (15.6 KB)
test_OK.docx (6.7 KB)

Thanks,
Christopher


Hi Christopher,

Thank you so much!

Leo
