Converting PDF to SVG (XML)

Aaron_Gravesdale · April 7, 2009, 10:01pm

Q: I found that you have a tool for SVG extraction
(http://www.pdftron.com/pdf2svg). I ran the demo program and put the file
through a couple of different SVG editors. I am really impressed by
what you guys have achieved with this. I would like to ask about
optimising the structure of the embedded text in the SVG file, so that
text runs do not break in the middle of a word. For example, if you
look at the attached image, you can see how a very short title is
divided into odd text runs. Do you know of any way to consolidate this
type of thing without changing the final appearance of the rendered
SVG document?
------
A: PDF2SVG preserves the same text structure that is present in the
input PDF. Text lines are broken into short text runs in order to
achieve precise text positioning (this differs from HTML where text
layout is usually left to the browser).

> Do you know of any way to consolidate this type of thing without
> changing the final appearance of the rendered SVG document?

Short text runs are used only to specify precise positioning
information. Unfortunately, there is no other way to specify accurate
positioning information in SVG 1.1. SVG 1.2, which is still a draft
standard, adds a textFlow element that can be used to wrap text, but
this element is not suitable for accurate text reproduction.
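
To illustrate why precise positioning leads to short runs, here is a
minimal Python sketch. It is not PDF2SVG's actual algorithm; the Glyph
type, the advance widths, and the sample coordinates are all
hypothetical. It assumes a viewer lays out glyphs inside a run using
their natural advance widths, so a new run has to start wherever the
PDF places a glyph somewhere else (for example, after kerning).

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Glyph:
    char: str
    x: float        # absolute x position required by the source page (hypothetical data)
    advance: float  # natural advance width a viewer would apply (hypothetical data)


def split_into_runs(glyphs, tolerance=0.01):
    """Start a new run whenever a glyph is not where the viewer would put it."""
    runs = []
    pen_x = 0.0  # where the viewer would place the next glyph of the current run
    for g in glyphs:
        if runs and abs(g.x - pen_x) <= tolerance:
            runs[-1]["text"] += g.char                # natural position: extend the run
        else:
            runs.append({"x": g.x, "text": g.char})   # explicit position: new run
            pen_x = g.x
        pen_x += g.advance
    return runs


def to_svg_text(runs, y):
    """Emit one <text> element with one positioned <tspan> per run."""
    spans = "".join(
        f'<tspan x="{r["x"]:g}">{escape(r["text"])}</tspan>' for r in runs
    )
    return f'<text y="{y:g}">{spans}</text>'


# Hypothetical title where "T" and "i" are kerned 0.5 units closer together.
title = [
    Glyph("T", 10.0, 6.0),
    Glyph("i", 15.5, 3.0),
    Glyph("t", 18.5, 3.0),
    Glyph("l", 21.5, 3.0),
    Glyph("e", 24.5, 5.0),
]
runs = split_into_runs(title)
svg = to_svg_text(runs, 100)
print(svg)
```

In this made-up example the single kerning adjustment is enough to
split the word into two runs, which is the kind of mid-word break
described in the question.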

If you would like to extract PDF text in a different form, you may
want to take a look at the TextExtract sample
(http://www.pdftron.com/pdfnet/samplecode.html#TextExtract).
'pdftron.PDF.TextExtractor' will reconstruct PDF text into words,
lines, and paragraphs with full access to positioning information.
The sample also illustrates how to extract text from PDF in a custom
XML format (the third code sample).
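
As a rough idea of what reconstructing words and lines from positioned
runs involves, here is a small Python sketch. It does not use PDFNet
and is not how the extractor is implemented; the Run type, the
tolerances, and the sample data are hypothetical. It assumes SVG-style
coordinates (y grows downward) and runs that contain no spaces.

```python
from dataclasses import dataclass


@dataclass
class Run:
    x: float      # left edge of the run
    y: float      # baseline of the run
    width: float  # rendered width of the run
    text: str


def group_lines(runs, y_tol=0.5):
    """Group runs whose baselines are within y_tol of each other into lines."""
    lines = []
    for run in sorted(runs, key=lambda r: (r.y, r.x)):
        if lines and abs(lines[-1][0].y - run.y) <= y_tol:
            lines[-1].append(run)
        else:
            lines.append([run])
    # Order each line left to right.
    return [sorted(line, key=lambda r: r.x) for line in lines]


def line_to_words(line, gap_tol=1.0):
    """Join runs separated by at most gap_tol; a larger gap starts a new word."""
    words = []
    prev_end = None
    for run in line:
        if prev_end is not None and run.x - prev_end <= gap_tol:
            words[-1] += run.text
        else:
            words.append(run.text)
        prev_end = run.x + run.width
    return words


# Hypothetical runs, listed out of order on purpose.
sample = [
    Run(33.0, 100.0, 12.0, "Pa"),
    Run(10.0, 100.0, 6.0, "T"),
    Run(10.0, 120.0, 20.0, "Next"),
    Run(15.5, 100.0, 14.0, "itle"),
    Run(44.6, 100.0, 10.0, "ge"),
]
words_per_line = [line_to_words(line) for line in group_lines(sample)]
print(words_per_line)
```

The rendered SVG stays untouched with this approach; the word and line
structure is recovered separately from the positions.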