How can I extract all HTTP links from any PDF document?

Aaron_Gravesdale · November 29, 2007, 7:21pm

Q: I am trying to extract http links from a PDF file. I used the
sample code provided on your website, but I encountered the following
problem.
Links are extracted from some PDFs, but for others I don't get any
result. I can see the links in the document text when I open it in
Acrobat Reader, but I cannot find them using code similar to your
example (http://www.pdftron.com/net/samplecode/AnnotationTest.cs).
----
A:
Could you please provide us with a sample file that exhibits this
problem? Thank you. It is possible that the link is stored as part of
a JavaScript action or similar. Also, we noticed that Acrobat
sometimes reads the text of the page and recognizes hyperlinks
even though there are no explicit annotation objects (you could
implement similar functionality using the pdftron.PDF.TextExtractor
class; please see the TextExtract sample). In any case, it is hard to
recommend a solution without taking a peek at the file.

Aaron_Gravesdale · December 1, 2007, 12:02am

Q: I suspect that, as you say, Acrobat Reader parses streams and
shows URL-formatted text as links. But since I am not familiar with
the PDF format, I wanted to make sure I am not doing something wrong.
----
A:
You are right, the file does not contain any hyperlink annotations. If
you would like to mimic Acrobat's behavior, you can use the
pdftron.PDF.TextExtractor class (please see www.pdftron.com/net/samplecode.html#TextExtract)
to extract words and their positioning information. Using a regular
expression, it should then be straightforward to identify HTTP links.
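
As a rough illustration of the regular-expression step only, here is a
minimal Python sketch. It assumes the page text has already been
extracted into a plain string (for example with TextExtractor); it does
not call any PDFTron API. The names URL_PATTERN, TRAILING_PUNCTUATION
and find_http_links are hypothetical, and the pattern is a deliberately
simple one that may need tuning for your documents.

```python
import re

# Hypothetical pattern: "http://" or "https://" followed by a run of
# characters that are not whitespace, angle brackets, or quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Characters that often follow a URL in running text but are
# usually not part of the URL itself.
TRAILING_PUNCTUATION = ".,;:!?)]}"


def find_http_links(text):
    """Return HTTP(S) URLs found in plain text, in order, without duplicates."""
    links = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that the pattern swallowed at the end.
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url not in links:
            links.append(url)
    return links


if __name__ == "__main__":
    sample = "See http://www.example.com/docs, or (https://example.org/a?b=1)."
    for link in find_http_links(sample):
        print(link)
```

Note that a URL wrapped across two lines in the PDF will likely come out
of text extraction as separate fragments, so only the first fragment
would match here. Handling that case would need the word positioning
information mentioned above.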