Please help me!
I would like to extract data from a PDF file.
I plan to use PDFNet to convert the PDF file to an HTML file and then extract the data from the HTML.
I chose this approach because the HTML file produced by PDFNet looks very similar to the original PDF.
However, I am having some trouble:
1. The HTML code does not reflect the table data structure
- I cannot distinguish the table data area from other data areas.
- I cannot distinguish the data of individual rows.
=> How can I configure the conversion to fix this?
2. The data in a single cell is spread across many HTML tags
=> Is there a way to have the data of a cell contained in a single HTML tag?
3. The HTML tags do not have id and name attributes
=> How can I configure the conversion to add them?
The PDF to HTML conversion you are using is aimed at graphical accuracy, and reflects exactly what is in the page content streams of the PDF file.
Furthermore, a PDF content stream has no concept of sentences, paragraphs, or tables. To complicate things, “words” in a PDF are often broken up into their individual characters for exact letter placement, so a single word can be made up of multiple spans in the HTML output.
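To illustrate why one word ends up in several spans, and how they could be stitched back together, here is a minimal Python sketch. It does not use the PDFNet API. It assumes you have already parsed the HTML output into positioned text fragments; the Fragment class, its fields, and the tolerance values are hypothetical and would need tuning for real documents.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned piece of text, e.g. parsed from one HTML span."""
    text: str
    x: float      # left edge
    y: float      # baseline position
    width: float  # horizontal extent


def merge_fragments(fragments, y_tol=1.0, gap_tol=1.5):
    """Join fragments that share a baseline and are separated by at most gap_tol."""
    # Step 1: bucket fragments into lines by baseline.
    lines = []
    for frag in sorted(fragments, key=lambda f: f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Step 2: within each line, walk left to right and merge close neighbours.
    merged = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        current = line[0]
        for frag in line[1:]:
            gap = frag.x - (current.x + current.width)
            if gap <= gap_tol:
                current = Fragment(
                    current.text + frag.text,
                    current.x,
                    current.y,
                    frag.x + frag.width - current.x,
                )
            else:
                merged.append(current)
                current = frag
        merged.append(current)
    return merged


# Made-up sample: the word "Total" split into four spans, followed by a distant "42".
sample = [
    Fragment("T", 10.0, 100.0, 5.0),
    Fragment("o", 15.0, 100.0, 5.0),
    Fragment("t", 20.2, 100.3, 3.0),
    Fragment("al", 23.2, 100.0, 6.0),
    Fragment("42", 80.0, 100.0, 10.0),
]
words = [f.text for f in merge_fragments(sample)]
```

A gap threshold like this is only a heuristic: set it too low and words stay split, set it too high and neighbouring table cells get glued together.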
It sounds like you would be a lot more interested in our PDFGenie solution, which does high-level analysis, including table extraction.
You can download the Windows version of PDFGenie from https://www.pdftron.com/downloads/pdfgenie.zip and try it out on your own documents.
We are still porting over content from our old site to the new one, so for now you can see more about PDFGenie here.
To summarize, PDF files do not contain tables, rows, cells, or paragraphs, so you need a higher level of analysis to recover them.
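As a rough idea of what "a higher level of analysis" means, the Python sketch below rebuilds rows purely from coordinates. It is not how PDFGenie works and it does not call PDFNet. It assumes each piece of text is available as a (text, x, y) tuple with y measured from the top of the page; the function name and the tolerance are hypothetical.

```python
def group_into_rows(items, y_tol=2.0):
    """Group (text, x, y) tuples into rows, top to bottom, cells left to right.

    Items whose y values differ by at most y_tol from the first item
    of a row are treated as belonging to that row.
    """
    rows = []
    for text, x, y in sorted(items, key=lambda item: item[2]):
        if rows and abs(rows[-1]["y"] - y) <= y_tol:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Sort the cells of each row by x and keep only the text.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up sample: a two-row, two-column table with slightly uneven baselines.
items = [
    ("Name", 10.0, 50.0),
    ("Qty", 120.0, 50.5),
    ("Apple", 10.0, 70.0),
    ("3", 120.0, 69.8),
]
table = group_into_rows(items)
```

Real tables are harder than this: cells can wrap over several lines, columns can be empty, and text outside the table shares the same coordinates space, which is why a dedicated analysis tool is usually the better choice.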
There may be a better solution to what you are trying to do. If you could elaborate on why you would like to “extract data from pdf”, then I could advise further. If you would like to move this into a private conversation, please go here.