Convert PDF to HTML / SVG using PDFNet

Aaron_Gravesdale · April 1, 2011, 7:20pm

Q: I have spent two days going through the .NET version of the PDF
control and am looking at using it, but I have a couple of questions.

Basically we are looking at the quickest way to export from PDF to
HTML and/or SVG (pdftron.PDF.Convert.ToSvg()).

There is a GPL app (pdftohtml) that exports to HTML and does a
better job with the HTML than the sample you provide, because of
the following issues.

- Is there any way to get it to add a div or span around each
line rather than each word?
- If not, is there a better sample to just extract the images and
shapes (for the background image) and just extract the text, which I
can then run through pdftohtml? That removes the fonts, though. We'll use
CSS3 to register the exact same fonts.
- How do we remove the clipping etc. from PDFs sent to print? They have
markers but I can't find a way to remove these. I have attached a
sample. Is there a way to apply a consistent clip? Keep in mind, though,
that not all pages have these. No idea why. So I basically want to
clip them all to get a standard size.

- Also, is there a way to downgrade a PDF version if need be, i.e. 1.6
to 1.5?

- Finally, do you offer consulting for stuff we can't work out? These
publishers do some strange stuff that doesn't export to HTML well.
--------------

A: The main purpose of the PDF to HTML sample
(http://www.pdftron.com/pdfnet/samplecode.html#Html2Pdf) is to illustrate
the use of the PDFNet content extraction API, rather than to serve as a
production-quality PDF to HTML converter.

Is there any way to get it to add a div or span around
each line rather than each word?

Yes, definitely. We provide all the source code so you can tweak it as
required. Adding a div/span around each line shouldn't be a problem.
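
As a rough illustration of the idea (this is not taken from the PDFNet sample and does not use the PDFNet API), the grouping step could look something like the Python sketch below. It assumes you have already collected each word as a hypothetical (text, x, baseline_y) tuple from whatever extractor you use; the function name, the tuple layout and the y_tolerance parameter are all made up for this example.

```python
from html import escape


def words_to_line_spans(words, y_tolerance=2.0):
    """Group (text, x, baseline_y) word tuples into lines, one <span> per line.

    `words` is a hypothetical input format, not a PDFNet data structure.
    """
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    # PDF y coordinates grow upward, so sorting by -y visits the top line first.
    for text, x, y in sorted(words, key=lambda w: (-w[2], w[1])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))   # same baseline: same line
        else:
            lines.append([y, [(x, text)]])   # new baseline: start a new line

    spans = []
    for y, items in lines:
        # Order the words left to right and join them with single spaces.
        line_text = " ".join(t for _, t in sorted(items))
        left = min(x for x, _ in items)
        spans.append(
            f'<span style="position:absolute;left:{left:g}pt;bottom:{y:g}pt">'
            f"{escape(line_text)}</span>"
        )
    return spans
```

A fixed baseline tolerance is the simplest possible rule; rotated text or multi-column layouts would likely need something smarter.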

just extract the images and shapes (for the background image) and just extract
the text, which I can then run through pdftohtml. That removes the fonts, though.

Yes, this is also possible. You can simply comment out all lines that
output text in the PDF to HTML sample.

How do we remove the clipping etc. from PDFs sent to print? They have markers
but I can't find a way to remove these.

Probably the simplest option is to adjust the size of the crop box based
on the trim box (i.e. page.SetCropBox(page.GetBox(Page.Box.e_trim))).

In case there is no reliable trim box, you could physically remove
content from the page (if you can somehow recognize the markers), as
shown in the ElementEdit sample (samplecode.html#ElementEdit).
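
For the pages that have no trim box of their own, one possible way to still end up with a standard size is to reuse the most common trim size from the other pages. In the PDF specification a missing trim box defaults to the crop box, so copying it over changes nothing on those pages. The Python sketch below only does the rectangle arithmetic and does not call PDFNet; the page dictionaries with "media" and "trim" keys are a hypothetical stand-in for the boxes you would read from each page, and the "most common size, centred on the media box" rule is an assumption about how the publisher laid out the pages.

```python
from collections import Counter


def consistent_crop_boxes(pages):
    """Return one crop rectangle (x1, y1, x2, y2) per page.

    `pages` is a hypothetical list of dicts holding a "media" rectangle and an
    optional "trim" rectangle; it is not a PDFNet data structure.
    """
    def size(box):
        return (round(box[2] - box[0], 2), round(box[3] - box[1], 2))

    trims = [p["trim"] for p in pages if p.get("trim")]
    if not trims:
        return [p["media"] for p in pages]  # nothing to go on; leave as-is

    # Assumption: the most common trim size is the intended finished size.
    width, height = Counter(size(t) for t in trims).most_common(1)[0][0]

    result = []
    for p in pages:
        if p.get("trim"):
            result.append(p["trim"])
            continue
        # No trim box: centre a rectangle of the common size on the media box,
        # clamped so it never extends past the media box.
        mx1, my1, mx2, my2 = p["media"]
        cx, cy = (mx1 + mx2) / 2, (my1 + my2) / 2
        result.append((max(mx1, cx - width / 2), max(my1, cy - height / 2),
                       min(mx2, cx + width / 2), min(my2, cy + height / 2)))
    return result
```

Each resulting rectangle could then be applied with page.SetCropBox(...) as described above. Pages whose printer marks are not centred would still need the content-removal approach.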

Also, is there a way to downgrade a PDF version if need be, i.e. 1.6 to 1.5?

You could use PDFNet to implement this type of functionality (e.g.
pdftron.PDF.PDFA.PDFACompliance can downgrade to PDF 1.4); however,
there is no generic function to do this.

Finally, do you offer consulting for stuff we can't work out? These
publishers do some strange stuff that doesn't export to HTML well.

Definitely, PDFTron offers consulting services and we have a large
portfolio of very happy clients
(http://www.pdftron.com/whypdftron/testimonials.html).