How do I support additional OCR languages when converting PDF to Office?

shakthi124 · October 24, 2024, 10:26pm

Question:
The Structured Output module supports OCR languages outlined in the documentation here. How do I support additional OCR languages when converting PDF to Office?

Answer:
You can set the preferred OCR engine to Tesseract with SetPreferredOCREngine() and then specify the languages you need with SetCustomOCRLanguage(). As the documentation states, languages are given as 3-letter ISO 639-2 codes separated by spaces, for example “eng deu spa fra”. The default is English.

For an example, please refer to the following code:

// Assumes inputPath and outputFile are defined elsewhere.
pdftron.PDF.Convert.WordOutputOptions wordOutputOptions = new pdftron.PDF.Convert.WordOutputOptions();
// Select Tesseract as the preferred OCR engine before setting custom languages.
wordOutputOptions.SetPreferredOCREngine(pdftron.PDF.Convert.OutputOptionsOCR.PreferredOCREngine.e_engine_tesseract);
// Other codes such as "chi_tra", "jpn", "kor", or "ara" can be used here instead.
wordOutputOptions.SetCustomOCRLanguage("chi_sim");
pdftron.PDF.Convert.ToWord(inputPath + "input.pdf", outputFile, wordOutputOptions);
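
To request several languages at once, the codes go into a single space-separated string, as in the “eng deu spa fra” example above. The following is a minimal sketch of one way to build that string from individual codes. OcrLanguageHelper and BuildOcrLanguageString are hypothetical names used for illustration and are not part of the SDK; the helper only formats the string and does not check whether a given code is supported.

```csharp
using System;
using System.Linq;

public static class OcrLanguageHelper
{
    // Hypothetical helper (not part of the SDK): trims each code, drops
    // blank entries and duplicates, and joins the rest with single spaces.
    public static string BuildOcrLanguageString(params string[] codes)
    {
        return string.Join(" ", codes
            .Where(c => !string.IsNullOrWhiteSpace(c))
            .Select(c => c.Trim())
            .Distinct());
    }
}

// Usage with the options object from the snippet above:
// wordOutputOptions.SetCustomOCRLanguage(
//     OcrLanguageHelper.BuildOcrLanguageString("eng", "chi_sim", "jpn"));
```

Building the string from a list keeps the separator handling in one place, which may help if the language set comes from configuration or user input.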