TextExtractor cannot extract correct quad information for Type3 text.
Please describe your issue and provide steps to reproduce it:
When using TextExtractor on a file with Type3 font text, the Line.GetQuad(), Word.GetQuad(), and Glyph.GetQuad() functions return an invalid bounding box with no area. Is there a way to get this information without having to go through ElementReader and .GetType3GlyphStream()?
This also causes the order of the characters in the extracted word to be off (though maybe that's unrelated).
Steps to reproduce:
Use TextExtractor on attached file with no processing flags set.
Call Line.GetQuad(), Word.GetQuad(), and Glyph.GetQuad() and notice that the returned values describe a bounding box with no width or height.
Please provide a link to a minimal sample where the issue is reproducible: Sample file.pdf (1.1 MB)
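To make the "no area" symptom easy to check programmatically, here is a minimal sketch of a degenerate-quad test. It does not call the SDK. It assumes the eight coordinates returned by a GetQuad() call have already been copied into a flat sequence (x1, y1, x2, y2, x3, y3, x4, y4); the helper names quad_area and is_degenerate and the eps tolerance are hypothetical, not part of the Apryse API.

```python
from typing import Sequence


def quad_area(quad: Sequence[float]) -> float:
    """Area of a quad given as (x1, y1, x2, y2, x3, y3, x4, y4).

    Uses the shoelace formula over the four corners in the order given.
    """
    if len(quad) != 8:
        raise ValueError("expected 8 coordinates (4 corner points)")
    pts = [(quad[i], quad[i + 1]) for i in range(0, 8, 2)]
    twice_area = 0.0
    # Pair each corner with the next one, wrapping back to the first.
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


def is_degenerate(quad: Sequence[float], eps: float = 1e-6) -> bool:
    """True when the quad encloses (almost) no area.

    eps is an assumed tolerance in page units, not an SDK constant.
    """
    return quad_area(quad) < eps
```

Running is_degenerate over every line, word, and glyph quad from the attached file would show whether only the Type3 text is affected or all text on the page.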
Hello, I’m AiRon, an automated tech support bot. While you wait for one of our customer support representatives, I can help you with your question.
It seems you’re encountering an issue with extracting quad information from Type3 font text using the TextExtractor in Apryse’s Server SDK. Type3 fonts can be complex because they allow custom glyph shapes, which may not be handled as expected by the TextExtractor’s default methods for bounding box extraction.
Some things you can try:
Verify if the TextExtractor is set up correctly for handling Type3 fonts. Try using different processing flags or options that might affect how the text is extracted.
Consider using the ElementReader class to access the raw Type3 glyph streams directly. This can provide more control and potentially more accurate bounding box calculations.
Check if there are updates or patches for the SDK that might address issues with Type3 font handling.
Ensure your document is correctly formatted and doesn’t have any corruption that might affect text extraction.
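As a rough illustration of the ElementReader route, the sketch below shows only the geometry, not the SDK calls. It assumes you have already read a glyph's bounding box (for example the llx lly urx ury operands of the d1 operator that can open a Type3 glyph stream) and the matrices that apply to it (the font's FontMatrix, then font size and text position). Matrices use the PDF order [a, b, c, d, e, f]; the function names are hypothetical and are not Apryse APIs.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[float]  # PDF order: [a, b, c, d, e, f]


def concat(m1: Matrix, m2: Matrix) -> List[float]:
    """Return the matrix that applies m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return [
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    ]


def apply_matrix(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Transform one point: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def bbox_to_quad(bbox: Sequence[float], m: Matrix) -> List[float]:
    """Map a glyph-space bbox (llx, lly, urx, ury) to a flat 8-number quad."""
    llx, lly, urx, ury = bbox
    corners = [(llx, lly), (urx, lly), (urx, ury), (llx, ury)]
    quad: List[float] = []
    for x, y in corners:
        quad.extend(apply_matrix(m, x, y))
    return quad
```

For example, a glyph-space box could be mapped with concat(font_matrix, placement), where placement is an assumed matrix combining font size, text matrix, and CTM. If the glyph streams use d0 (which carries no box) the box would have to be derived from the glyph's path data instead.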
Here are some resources that might help you:
Text Extraction Guide: Offers a comprehensive overview of text extraction capabilities and options.
Working with Text: Details on handling text with different fonts, including potential issues and solutions.
ElementReader Documentation: Provides information on using ElementReader to access and manipulate raw PDF content.
Advanced Text Extraction: Discusses advanced techniques for extracting text, which might be useful for Type3 fonts.
Hopefully, this helps you solve your problem while you wait for a human to get back to you.
There seem to be some issues with the PDF file itself: it contains Type3 fonts and does not describe any font bounding boxes. Unfortunately, Adobe does not seem to render these bounding boxes correctly either. I would suggest trying a different file that was not generated by Figma and seeing whether TextExtractor behaves better.
Hey Luke, wondering if there are any updates on this? It would be great to see TextExtractor working properly for these Figma-exported PDFs without having to go through ElementReader.