Extract section titles from a PDF (that has no TOC or bookmarks)?


Are there simple text extraction APIs that will let me extract all the section titles from a PDF that has no TOC, bookmarks, or tags?


Have you looked at ‘pdftron.PDF.TextExtractor’? See the sample code: http://www.pdftron.com/pdfnet/samplecode.html#TextExtract

This class can return positioning info for each line, word, and character on the page, along with font and style info (which may be important for logical structure extraction).

At the moment, PDFNet does not include APIs that return high-level structure (e.g. section titles) inferred from the PDF layout information. In general, such inference would not always work, and the results would vary depending on how the page is formatted. Having said that, you could use PDFNet APIs to implement such functions yourself.
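
As a rough illustration of what such a function might look like, here is a minimal sketch of a font-based heuristic in plain Python. It does not call PDFNet at all: the `Line` record, the `guess_section_titles` function, and the `size_ratio` and `max_words` thresholds are all hypothetical names and values chosen for this example. It assumes you have already collected the text, font size, and bold flag for each line (for instance from the line and style info mentioned above). A short line is treated as a likely title if its font is noticeably larger than the dominant body font, or if it is bold and does not end like a sentence.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one extracted text line (not a PDFNet type)."""
    text: str
    font_size: float
    bold: bool = False


def guess_section_titles(lines, size_ratio=1.15, max_words=12):
    """Return the text of lines that look like section titles.

    size_ratio and max_words are assumed thresholds; tune them per document.
    """
    # Estimate the body font size as the size that covers the most characters.
    weight = Counter()
    for ln in lines:
        weight[round(ln.font_size, 1)] += len(ln.text)
    if not weight:
        return []
    body_size = weight.most_common(1)[0][0]

    titles = []
    for ln in lines:
        text = ln.text.strip()
        # Skip empty lines and lines too long to be a heading.
        if not text or len(text.split()) > max_words:
            continue
        larger = ln.font_size >= body_size * size_ratio
        bold_short = (
            ln.bold and ln.font_size >= body_size and not text.endswith(".")
        )
        if larger or bold_short:
            titles.append(text)
    return titles
```

As noted above, a heuristic like this will not work on every document: headings set in the body font, multi-column layouts, or decorative large text can all produce wrong results, so expect to adjust the thresholds for each document family.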