Does PDFNet support the "#pdfloc" notation?

agravesdale · June 6, 2015, 1:08am

Q:

I’m using Adobe’s SDK, and it stores text selections in this format:

#pdfloc(caf5,0,690,9,0,0,1,1)

I’d like to use PDFNet, but I’m not sure how to convert this notation over.

A:

The comments on Adobe’s API indicate that the #pdfloc notation refers to the order of characters in extracted text.

[ Note that there is no standard for extracting text from PDF documents, which is unfortunate: extracting text from a PDF document is more of an art than a science. The order in which text appears on a rendered PDF page can be completely different from the order in which that text is encoded in the PDF. (This is unlike a text file, word processing file, or HTML file, where the order in which words are encoded in the source file determines the order in which they appear when rendered.) Text in a PDF is sometimes encoded in the order it appears on the page, but it might instead be listed in reverse order, or in an arbitrary order. If you wanted to, you could create a PDF page which first specifies the location of all the “A” characters, then specifies the location of all the “B” characters, and so on. Thus, extracting text in a useful way (one that imitates how the text appears on the rendered page) is not at all straightforward. And since the procedure is not standardized, every PDF toolkit extracts text slightly differently. ]
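To make the point about encoding order concrete, here is a minimal Python sketch. The `Glyph` class, the `reading_order_text` helper, and the coordinates are all made up for illustration. The heuristic (group by baseline, then sort left to right) is only one naive approach, and it is not how PDFNet or Acrobat actually orders extracted text.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical positioned character, as a content stream might place it."""
    char: str
    x: float  # left edge in PDF points
    y: float  # baseline in PDF points (origin bottom-left, y grows upward)


def reading_order_text(glyphs, line_tolerance=2.0):
    """Naive heuristic: group glyphs into lines by baseline, then sort each
    line left to right. Real extractors use far more elaborate rules."""
    lines = []  # list of (baseline_y, [glyphs on that line]), top line first
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tolerance:
            lines[-1][1].append(g)
        else:
            lines.append((g.y, [g]))
    return "\n".join(
        "".join(g.char for g in sorted(line, key=lambda g: g.x))
        for _, line in lines
    )


# Made-up page that places every "A" first, then every "B", then every "C".
glyphs = [
    Glyph("A", 10, 700), Glyph("A", 30, 680),
    Glyph("B", 20, 700), Glyph("B", 10, 680),
    Glyph("C", 30, 700), Glyph("C", 20, 680),
]

encoded_order = "".join(g.char for g in glyphs)  # order as stored in the file
visual_order = reading_order_text(glyphs)        # order as seen on the page

print(encoded_order)
print(visual_order)
```

A character index into `encoded_order` and a character index into `visual_order` point at different characters, so an offset is only meaningful if you know exactly which ordering produced it.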

Since we have no way of finding out the order of characters in Acrobat’s extracted text, a #pdfloc location is not something we can reverse-engineer.

If you are able to merge these #pdfloc annotations back into the original PDF (even temporarily) using Adobe’s SDK, you could then use the PDFNet SDK to extract the annotations as XFDF (an XML-based format which is in the process of becoming an ISO standard for PDF annotations, and which can be processed by PDFNet). That way you would still be able to store the annotations outside the original PDF, and could merge them back in using the PDFNet SDK.
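Because XFDF is plain XML, annotations stored this way can be inspected with any XML parser. The sketch below uses only the Python standard library. The sample document is hand-written for illustration (it is not actual output from PDFNet or Acrobat), and `list_annotations` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"

# Hand-written sample for illustration only; real exports carry more attributes.
SAMPLE_XFDF = """<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve">
  <annots>
    <highlight page="0" rect="72,690,200,702" coords="72,702,200,702,72,690,200,690" color="#FFFF00" name="example-1">
      <contents>Selected passage</contents>
    </highlight>
  </annots>
</xfdf>
"""


def list_annotations(xfdf_text):
    """Return one dict per annotation found under <annots>."""
    root = ET.fromstring(xfdf_text)
    annots = root.find(f"{{{XFDF_NS}}}annots")
    results = []
    if annots is None:
        return results
    for el in annots:
        kind = el.tag.split("}", 1)[-1]  # strip the "{namespace}" prefix
        rect = tuple(float(v) for v in el.get("rect", "").split(",") if v)
        contents = el.findtext(f"{{{XFDF_NS}}}contents", default="")
        results.append({
            "type": kind,
            "page": int(el.get("page", "0")),
            "rect": rect,
            "contents": contents,
        })
    return results


annotations = list_annotations(SAMPLE_XFDF)
for a in annotations:
    print(a["type"], a["page"], a["rect"], a["contents"])
```

Note that the highlight is located by page number and page coordinates (`rect`, `coords`) rather than by character offsets, so it does not depend on any particular toolkit’s text extraction order.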