How to use OCR when converting PDF to Office?

nj.lrq12138 · March 12, 2025, 6:36pm

Product:

Product Version: 11.3.8.5075

Please give a brief summary of your issue:
How to use OCR when converting PDF to Office?

Please describe your issue and provide steps to reproduce it:
What is the tessdata folder in the Structured Output module for? Does OCR in the Structured Output module use the engine in the tessdata folder by default? Why is there no difference before and after calling SetPreferredOCREngine(OutputOptionsOCR.PreferredOCREngine.e_engine_tesseract)? Is it because the engine file is missing, so the default engine is used instead? If the engine is missing, do I need to add the engine file myself? And when changing engines, where should the new engine be placed?

system · March 12, 2025, 6:36pm

Hello, I’m AiRon, an automated tech support bot.
While you wait for one of our customer support representatives, I can help with your question.

To use OCR when converting a PDF to an Office document with Apryse’s PDF SDK, the OCR feature must be set up and configured correctly. The tessdata folder is essential for the Tesseract OCR engine, because it contains the language data files required for text recognition. If you see no difference after setting the OCR engine, the OCR data files may be missing or the configuration may be incorrect.

Some things you can try:

Ensure that the tessdata folder contains the necessary language data files for Tesseract.

Verify that the OCR engine is correctly set up by checking the configuration and file paths.

If using Tesseract, you may need to download and place the language data files in the tessdata directory.

Specify the correct path to the tessdata folder in your configuration.

Check if the OCR engine is being correctly selected by your configuration settings.
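
The first suggestion above (checking that the tessdata folder actually holds language data) can be done with a short script. This is only a minimal sketch: it assumes Tesseract's usual naming, where each language ships as a file called <lang>.traineddata, and the function name list_tessdata_languages is hypothetical, not part of the Apryse SDK.

```python
from pathlib import Path


def list_tessdata_languages(tessdata_dir):
    """Return the sorted language codes found in a tessdata folder.

    Assumption: language data files follow Tesseract's usual
    "<lang>.traineddata" naming (for example "eng.traineddata").
    """
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        # A missing folder is reported the same way as an empty one.
        return []
    # Path.stem drops the ".traineddata" suffix and keeps the language code.
    return sorted(p.stem for p in folder.glob("*.traineddata") if p.is_file())
```

Calling list_tessdata_languages on your own tessdata path and getting an empty list back would suggest that the folder is missing or contains no language data files.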

Here are some resources that might help you:

Apryse OCR Guide: Provides an overview of OCR capabilities and setup instructions.

Converting PDF to Office: Details on how to convert PDFs to Office formats with potential OCR integration.

Tesseract Language Data: Repository for downloading language data files for Tesseract.

Structured Output Module: Information on how to use the structured output module effectively.

Hopefully, this helps you solve your problem while you wait for a human to get back to you.

btompkinson1 · March 13, 2025, 2:47pm

Our online documentation on the OCR modules will answer most of these questions.

What is the tessdata folder in the structured output module for?

These are language data files for the Tesseract engine.

Does the OCR using the structured output module default to using the engine in the tessdata folder?

The default engine is the Solid Documents engine; the Tesseract engine is the alternative. Please see the online documentation for the PreferredOCREngine settings: Enum Convert.OutputOptionsOCR.PreferredOCREngine

Why is there no difference before and after executing

Can you explain in more detail? Do you mean before and after executing with a different preferred engine on the same document? It’s possible that both engines will create the same output.
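
If it helps to confirm whether two conversions really produced the same result, one rough check is to compare the text of the two output files rather than the raw bytes. The sketch below is an assumption-laden illustration, not an Apryse API: it assumes both outputs are .docx files, reads word/document.xml with the Python standard library, and uses the hypothetical helper names docx_text and same_text.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml in .docx files.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the text of a .docx file, one line per paragraph element."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    # Every <w:p> is visited, including ones nested in tables or text boxes,
    # so this is a comparison aid rather than a faithful text export.
    for p in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def same_text(path_a, path_b):
    """True when both .docx files contain identical paragraph text."""
    return docx_text(path_a) == docx_text(path_b)
```

Running same_text on the file converted with the default engine and the file converted with the Tesseract engine shows whether the recognized text differs, even when the two files are not byte-for-byte identical.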