Extracting XML data from PDF.

Aaron_Gravesdale · February 3, 2009, 8:31pm

Q: We want to extract the data in a PDF by paragraphs and sections so
that we can build an XML document from it. For example:

<page>
<chapter1>
<section>
<paragraph>
<body>Hello World!!!</body>
</paragraph>
</section>
</chapter1>
</page>

How can I use PDFNet to implement this functionality?
----
A: You can use 'pdftron.PDF.TextExtractor', as shown in the TextExtract
sample project
(http://www.pdftron.com/net/samplecode.html#TextExtract), to extract
words, lines, and blocks of text. The sample also includes a snippet
showing how to serialize this information as XML.

Because there is an infinite number of possible document grammars,
TextExtractor does not try to reconstruct higher-level logical
structures (such as chapters, sections, footers, headings, etc.).
However, you can use the information from TextExtractor (positioning
info, font and graphics styles, content, etc.) together with
ElementReader to parse a specific logical structure. In essence, you
would use TextExtractor as a low-level recognizer and identify the
desired higher-level structures from the layout, content, and style
information it provides.
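
As a rough illustration of that second step, here is a minimal Python
sketch that groups already-extracted lines into sections and paragraphs
and writes them out as XML. It does not call PDFNet at all: the 'Line'
record, the 'build_xml' function, and the 'heading_size' and 'para_gap'
thresholds are hypothetical names and values chosen for this example,
and the two heuristics (a large font starts a section, a large vertical
gap starts a paragraph) are assumptions you would replace with rules
that fit your own documents.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Line:
    """Hypothetical record for one extracted line of text."""
    text: str
    font_size: float  # font size in points
    top: float        # distance from the top of the page, in points


def build_xml(lines, heading_size=14.0, para_gap=18.0):
    """Group lines into <section>/<paragraph>/<body> elements.

    Assumed heuristics (not part of PDFNet):
      - a line whose font size is >= heading_size starts a new section
      - a vertical gap larger than para_gap starts a new paragraph
    """
    page = ET.Element("page")
    section = None
    parts = []       # text of the lines in the paragraph being built
    prev_top = None  # vertical position of the previous line

    def flush():
        # Emit the pending paragraph, if any, into the current section.
        nonlocal section, parts
        if not parts:
            return
        if section is None:
            # Body text before any heading gets an untitled section.
            section = ET.SubElement(page, "section")
        paragraph = ET.SubElement(section, "paragraph")
        ET.SubElement(paragraph, "body").text = " ".join(parts)
        parts = []

    for line in lines:
        if line.font_size >= heading_size:
            flush()
            section = ET.SubElement(page, "section", title=line.text)
        else:
            if prev_top is not None and line.top - prev_top > para_gap:
                flush()
            parts.append(line.text)
        prev_top = line.top

    flush()
    return page


if __name__ == "__main__":
    sample = [
        Line("Chapter 1", 18.0, 50.0),
        Line("Hello", 10.0, 80.0),
        Line("World!!!", 10.0, 92.0),
        Line("Second paragraph.", 10.0, 130.0),
    ]
    print(ET.tostring(build_xml(sample), encoding="unicode"))
```

In practice you would fill the list of 'Line' records from the
positions and style information that TextExtractor reports for each
line, and add further rules (for example, for footers or multi-column
pages) as your documents require.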