Low level PDF Parser

Ryan · April 23, 2021, 3:53pm

Yes, the cross-reference (xref) table covers all the indirect objects, and the code referenced in the other post shows how to then recursively iterate all the direct objects (including arrays, streams, and dictionaries). In the end you have visited every object, both indirect and direct.
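To make the traversal idea concrete, here is a minimal sketch that does not use the SDK at all. It models a document with plain Python containers, and the names `Ref`, `walk`, and `walk_document` are hypothetical, chosen only for illustration. The code in the other post does the equivalent through the SDK's own object API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference such as `12 0 R`."""
    num: int


def walk(obj, xref, seen, out):
    """Visit obj and everything reachable from it, depth first."""
    if isinstance(obj, Ref):
        if obj.num in seen:
            return  # already visited; also protects against cycles
        seen.add(obj.num)
        obj = xref[obj.num]
    out.append(obj)
    if isinstance(obj, dict):
        for value in obj.values():
            walk(value, xref, seen, out)
    elif isinstance(obj, list):
        for item in obj:
            walk(item, xref, seen, out)


def walk_document(xref):
    """Start from every xref entry so unreferenced objects are visited too."""
    seen, out = set(), []
    for num in sorted(xref):
        walk(Ref(num), xref, seen, out)
    return out
```

Starting from every xref entry, rather than only from the document root, is what ensures orphaned indirect objects are visited as well.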

even in case of incremental updates and multiple cross-reference tables.

Yes, our SDK handles incremental updates, but you see the final versions of the objects. So if an object was modified or deleted, you would not see the original object. If that is not what you are looking for then please elaborate.
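As an illustration of what "final versions" means, the following is a simplified model of how cross-reference sections stack up across incremental updates. It is not how the SDK is implemented. It assumes each section is a plain mapping from object number to byte offset, with `None` standing in for an entry that an update marks as free, and it ignores generation numbers and cross-reference streams. The function name `resolve_xref` is hypothetical.

```python
def resolve_xref(sections):
    """Merge xref sections given oldest first; later entries win.

    Each section maps object number -> byte offset, or None when the
    update marks that object as free (deleted).
    """
    merged = {}
    for section in sections:
        merged.update(section)
    # Objects whose latest entry is "free" are not visible in the final view.
    return {num: offset for num, offset in merged.items() if offset is not None}
```

With an original section `{1: 15, 2: 80, 3: 150}` and an update `{2: 400, 3: None}`, the merged view contains object 1 at its original offset and object 2 at the updated offset, while object 3 is gone.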

Are there any enumeration callbacks and an object parser class for SDFDoc that can enumerate the complete document for us?

Yes, the code in the other forum post does all the enumerating for you, using our APIs.

Extract all the JavaScript. This is very much possible with PDFTron.

Yes, the other forum post referenced deletes all the JavaScript, but you could extract it instead. See this post.

Extract all /ObjStm objects.

Yes, our SDK parses all objects, including those in a compressed object stream.
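For background, this is roughly what parsing a compressed object stream involves once the stream data has been decoded. It is a sketch based on the layout described in the PDF specification, not the SDK's code: the decoded data starts with N pairs of integers (object number, offset relative to /First), followed by the objects themselves. The function name `split_object_stream` is hypothetical, and the sketch assumes the offsets appear in increasing order.

```python
def split_object_stream(decoded: bytes, n: int, first: int) -> dict:
    """Split decoded /ObjStm data into {object number: raw object bytes}.

    n     -- value of /N in the stream dictionary (number of objects)
    first -- value of /First (offset of the first object in the decoded data)
    """
    header = decoded[:first].split()
    pairs = [(int(header[2 * i]), int(header[2 * i + 1])) for i in range(n)]
    body = decoded[first:]
    objects = {}
    for i, (num, offset) in enumerate(pairs):
        # Assumes increasing offsets: each object ends where the next begins.
        end = pairs[i + 1][1] if i + 1 < n else len(body)
        objects[num] = body[offset:end].strip()
    return objects
```

The returned byte strings would still need to be parsed as PDF objects; the SDK does all of this for you.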

Extract the decoded embedded file streams and other object streams. - It does have APIs like GetDecodedStream and GetRawStream.

Yes, exactly, there are APIs to access the stream as it is stored in the PDF, and PDFTron can also decode the streams for you so you can get the actual data.
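To show the difference between the raw and decoded forms, here is a small standard-library sketch. It does not call GetRawStream or GetDecodedStream. It assumes a stream that uses only the FlateDecode filter (zlib/deflate) with no predictor, and the sample payload and the helper name `decode_flate` are made up for illustration.

```python
import zlib


def decode_flate(raw: bytes) -> bytes:
    """Decode stream bytes stored with a single FlateDecode filter."""
    return zlib.decompress(raw)


# Hypothetical content stream payload, as it would look after decoding.
payload = b"BT /F1 12 Tf (Hello) Tj ET"

# "Raw" form: the compressed bytes as they would sit in the PDF file.
raw = zlib.compress(payload)

# "Decoded" form: the actual data.
decoded = decode_flate(raw)
```

Real streams may chain several filters or use predictors, which is why letting the SDK decode them is usually the simpler route.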