How to use OCR when converting PDF to Office?

Product:

Product Version: 11.3.8.5075

Please give a brief summary of your issue:
How to use OCR when converting PDF to Office?

Please describe your issue and provide steps to reproduce it:
  • What is the tessdata folder in the structured output module for?

  • Does the OCR using the structured output module default to using the engine in the tessdata folder?

  • Why is there no difference before and after executing SetPreferredOCREngine(OutputOptionsOCR.PreferredOCREngine.e_engine_tesseract)? Is it because there is no engine file and the default engine is used?

  • If there is no engine, do I need to add the engine file myself?

  • Where should the new engine be placed when changing the engine?


Hello, I’m AiRon, an automated tech support bot :robot:
While you wait for one of our customer support representatives, I can help you with your question.

To use OCR when converting a PDF to an Office document using Apryse’s PDF SDK, you need to ensure that the OCR feature is correctly set up and configured. The tessdata folder is essential for the Tesseract OCR engine, as it contains the language data files required for text recognition. If you’re not seeing any difference after setting the OCR engine, it might be due to the absence of necessary OCR data files or incorrect configuration.

Some things you can try:


  • Ensure that the tessdata folder contains the necessary language data files for Tesseract.

  • Verify that the OCR engine is correctly set up by checking the configuration and file paths.

  • If using Tesseract, you may need to download and place the language data files in the tessdata directory.

  • Specify the correct path to the tessdata folder in your configuration.

  • Check if the OCR engine is being correctly selected by your configuration settings.
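To check the first and third points quickly, it can help to list which language files are actually present in the folder. The snippet below is a minimal Python sketch, not part of the Apryse SDK: `list_tesseract_languages` is a hypothetical helper, and it assumes the folder follows the standard Tesseract convention of one `<lang>.traineddata` file per language (for example `eng.traineddata`).

```python
from pathlib import Path


def list_tesseract_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file.

    Assumes the standard Tesseract layout: one <lang>.traineddata file
    per language, stored directly inside the tessdata folder.
    """
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        # A missing folder means no language data can be found at all.
        return []
    return sorted(
        p.stem for p in folder.glob("*.traineddata") if p.is_file()
    )
```

Calling `list_tesseract_languages("path/to/tessdata")` with your own tessdata path (the path shown here is a placeholder) returns an empty list if the folder is missing or contains no language data, which would suggest the Tesseract engine has nothing to load.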




Hopefully, this helps you solve your problem while you wait for a human to get back to you.


Our online documentation on the OCR modules will answer most of these questions.

What is the tessdata folder in the structured output module for?

  • These are language data files for the Tesseract engine.

Does the OCR using the structured output module default to using the engine in the tessdata folder?

Why is there no difference before and after executing SetPreferredOCREngine(OutputOptionsOCR.PreferredOCREngine.e_engine_tesseract)?

  • Can you explain in more detail? Do you mean before and after executing with a different preferred engine on the same document? It’s possible that both engines will create the same output.