Q:
I am working with your PDFNet component, a very impressive piece of
work indeed. I have one question - if I have a task of extracting
text from a PDF to another format (Rich Text, ASCII, etc), can your
component preserve the structuring of the tables in the document, and
not just extract delimited text?
---
A:
Thank you for your compliment. PDFNet SDK (www.pdftron.com/net) can be
used to extract any information present in the document. If the PDF
document contains structure information (i.e. if it is 'tagged'),
PDFNet can also be used to extract the logical structure.
Unfortunately, PDF documents generated by most third-party tools lack
logical structure. In that case the only option is to reconstruct the
structure, including tables, using some document analysis technique (see
www.pdftron.com/net/faq.html#struct_01,
www.pdftron.com/net/faq.html#text_00).
Also, we are currently beta testing a new document analysis add-on
module for PDFNet, and will offer it as part of the SDK in the near
future. Because no document analysis approach is 'perfect', PDFNet
users will still be able to use their own implementations.
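
To give a rough idea of what a home-grown document analysis step for
tables can look like, here is a minimal Python sketch. It is not PDFNet
code and it is not the algorithm used by the add-on module. The
Fragment type, the function names, and the tolerance values are all
hypothetical. It assumes you have already extracted text fragments
together with their page coordinates, that each fragment belongs to a
single cell, and that columns are left-aligned.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical container for one extracted piece of text."""
    text: str
    x: float  # left edge, in page units
    y: float  # baseline; PDF y coordinates grow upward


def group_rows(fragments, y_tol=2.0):
    """Group fragments whose baselines lie within y_tol of each other."""
    rows = []
    # Visit fragments from the top of the page to the bottom.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if rows and abs(rows[-1][0].y - frag.y) <= y_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Order each row left to right.
    return [sorted(row, key=lambda f: f.x) for row in rows]


def column_starts(fragments, x_tol=5.0):
    """Cluster left edges; return the smallest x of each cluster."""
    starts = []
    for x in sorted(f.x for f in fragments):
        if not starts or x - starts[-1] > x_tol:
            starts.append(x)
    return starts


def to_table(fragments, y_tol=2.0, x_tol=5.0):
    """Return a list of rows, each a list of cell strings."""
    starts = column_starts(fragments, x_tol)
    table = []
    for row in group_rows(fragments, y_tol):
        cells = [""] * len(starts)
        for frag in row:
            # Index of the last column start at or left of this fragment.
            col = bisect_right(starts, frag.x) - 1
            cells[col] = (cells[col] + " " + frag.text).strip()
        table.append(cells)
    return table
```

Calling to_table() on the fragments of one page gives a grid of strings
that can then be written out as RTF table rows or as padded ASCII
columns. Real documents need more than this (right-aligned or centered
columns, cells that span several columns, wrapped cell text), which is
why no single approach works for every PDF.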