Can I convert PDF to UTF-8 compliant XML files using PDFNet SDK?

Aaron_Gravesdale · June 3, 2009, 10:11pm

Q: I'd like to know whether your PDF converter can take PDFs and
convert them to UTF-8 compliant XML files.
Specifically, I want to create an XML source file (type xmlpipe2)
from PDF to be indexed by the Sphinx search engine.
------
A: You can use either PDFNet SDK (http://www.pdftron.com/pdfnet) or
PDF2Text (http://www.pdftron.com/pdf2text) to convert PDF to
UTF-8 encoded text or XML files.

If you are looking for a more programmatic solution, you may want to
take a look at the TextExtract sample project
(http://www.pdftron.com/pdfnet/samplecode.html#TextExtract) in PDFNet.
In this sample, pdftron.PDF.TextExtractor is used to extract Unicode
strings, which can then be encoded using any desired encoding (e.g.
UTF-8, UTF-16BE/LE, MBC, etc.).
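
Once the Unicode strings have been extracted, wrapping them in an
xmlpipe2 stream for Sphinx is mostly a matter of escaping and encoding.
The following is a minimal Python sketch, not PDFNet code: it assumes
the text has already been extracted by some other step, and the
function name to_xmlpipe2 and the field names "title" and "content"
are hypothetical choices made for illustration. The element names
follow the xmlpipe2 layout described in the Sphinx documentation.

```python
import re
import sys
from xml.sax.saxutils import escape

# Control characters that XML 1.0 does not allow (tab, LF and CR are kept).
# Extracted PDF text can contain these, and they would make the stream invalid.
_INVALID_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def _clean(text):
    """Drop characters XML 1.0 forbids, then escape &, < and >."""
    return escape(_INVALID_XML.sub("", text))


def to_xmlpipe2(docs):
    """Build an xmlpipe2 stream from already-extracted text.

    docs is an iterable of (doc_id, title, text) tuples, where doc_id is
    a unique positive integer and title/text are Unicode strings.
    Returns the whole stream as UTF-8 encoded bytes.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',    # hypothetical field name
        '<sphinx:field name="content"/>',  # hypothetical field name
        "</sphinx:schema>",
    ]
    for doc_id, title, text in docs:
        parts.append('<sphinx:document id="%d">' % int(doc_id))
        parts.append("<title>%s</title>" % _clean(title))
        parts.append("<content>%s</content>" % _clean(text))
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts).encode("utf-8")


if __name__ == "__main__":
    # Sample input standing in for text extracted from a PDF.
    sample = [(1, "Report & Summary", "Caf\u00e9 prices < 5 EUR")]
    # Write raw bytes so the platform's default encoding is never applied.
    sys.stdout.buffer.write(to_xmlpipe2(sample))
```

Sphinx reads an xmlpipe2 source from the standard output of the command
configured as xmlpipe_command, which is why the sketch writes the
encoded bytes directly to stdout instead of printing a string.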

If you are looking for a command-line utility or an SDK with a very
simple interface, you may want to take a look at PDF2Text.