Extracting tables from PDF

Ivanho · August 22, 2013, 1:06am

Q:

We are using PDFTron Text Extractor to extract data (especially tabular info) from a PDF page.

Some pages contain tables, and in these cases we may get the lines in the wrong order. For example:

John Doe

Albert Square

00150 England

This comes out in the following order when we print the extracted text in C#:

00150

John Doe

Albert Square

England

Do you guys have any solution for this?

A:

Could you please send us a sample document so we can take a look? TextExtractor does not have a built-in capability to recognize structures such as tables, figures, or headers/footers. Unfortunately, this type of structure information is usually not stored explicitly in a PDF, so we need to rely on potentially error-prone techniques (similar to OCR) to reconstruct it.

For example, you could use text positioning and styling information provided by TextExtractor to figure out what text belongs to a table etc.
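
As a rough illustration of the positioning idea, here is a minimal sketch that restores reading order by grouping words that share a baseline into rows, then sorting rows top to bottom and words left to right. It does not call the PDFNet API: the `Word` record, the `ReadingOrder.ToLines` helper, and the 2-point `tolerance` default are hypothetical names and values chosen for this example, and the coordinates are assumed to be in PDF user space (Y grows upward), as you would get from the word bounding boxes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a word reported by a text extractor:
// its text, the left edge (X) and the baseline (Y) in PDF user space.
public sealed record Word(string Text, double X, double Y);

public static class ReadingOrder
{
    // Groups words whose baselines lie within `tolerance` of the first word
    // of a row, then returns the rows top to bottom, each joined left to right.
    public static List<string> ToLines(IEnumerable<Word> words, double tolerance = 2.0)
    {
        var rows = new List<List<Word>>();

        // Larger Y is higher on the page, so descending Y is top to bottom.
        foreach (var word in words.OrderByDescending(item => item.Y))
        {
            bool sameRow = rows.Count > 0
                && Math.Abs(rows[rows.Count - 1][0].Y - word.Y) <= tolerance;

            if (sameRow)
                rows[rows.Count - 1].Add(word);
            else
                rows.Add(new List<Word> { word });
        }

        return rows
            .Select(row => string.Join(" ", row.OrderBy(item => item.X).Select(item => item.Text)))
            .ToList();
    }
}
```

With words like those in the question, where "00150" and "England" sit on almost the same baseline but are reported out of order, the two end up on one line after "John Doe" and "Albert Square". A fixed tolerance is a simplification; real pages usually need a tolerance derived from the font size, and multi-column tables need a second pass that clusters X positions into columns.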

We have implemented a prototype solution (based on TextExtractor) that tries to recognize text and dumps reflow-able HTML that contains tables.

The following is a C# sample that converts a PDF to reflowable HTML and also recognizes tables:

using System;
using System.IO;
using pdftron;
using pdftron.Common;
using pdftron.PDF;

namespace pdftron
{
    class test
    {
        static void Main(string[] args)
        {
            // Assumption: the original snippet leaves input_file and output_path
            // undefined; here they are read from the command line.
            string input_file = args[0];
            string output_path = args[1];

            PDFNet.Initialize();
            try
            {
                using (PDFDoc doc = new PDFDoc(input_file))
                {
                    doc.InitSecurityHandler();
                    pdftron.PDF.Convert.HtmlOutputOptions options = new pdftron.PDF.Convert.HtmlOutputOptions();
                    // Request reflowable HTML output
                    options.SetReflow(true);
                    // Creates a file with original filename in the given folder
                    pdftron.PDF.Convert.ToHtml(doc, output_path, options);
                }
            }
            catch (PDFNetException e)
            {
                Console.WriteLine(e.Message);
            }
        }
    }
}

To test drive this functionality you can use one of the following links:

(.Net 4, 64-bit) : https://pdftron.com/ID-zJWLuhTffd3c/22jdk340d/PDFNet64DotNet4.zip

(.Net 1.1-3.5, 32-bit) : https://pdftron.com/ID-zJWLuhTffd3c/22jdk340d/PDFNet.zip

The other PDFNet variants will be available in the near future.