PDF throws exceptions accessing DocInfo

Question:

We found a few documents that throw exceptions when accessing doc.GetDocInfo(). Here are the top 4 exceptions we get:

PDFNetException - Code:0
File: ObjParser.cpp
Func: trn::SDF::ObjParser::GetObj
Line: 340
Expr: m_operand_stack.size() >= 1
Message: Operator endobj expects a single argument

PDFNetException - Code:0
File: ObjParser.cpp
Func: trn::SDF::ObjStmParser::ObjStmParser
Line: 40
Expr: GetObj()
Message: Compressed object is corrupt

PDFNetException - Code:0
File: Parser.cpp
Func: trn::SDF::Parser::LexDict
Line: 349
Expr: num_elements>=0 && (num_elements % 2 == 0)
Message: the number of key-value elements should be even

PDFNetException - Code:0
File: Parser.cpp
Func: trn::SDF::Parser::LexDict
Line: 358
Expr: !m_operand_stack[i-1]->IsIndirect() && m_operand_stack[i-1]->IsName()
Message: Bad key

In these cases where the document info is corrupt, should we assume that the entire document is corrupt? Should we avoid reading/rendering/writing to the file?

Answer:

There are essentially two types of “corruption” in a PDF: a bad XRef table, and malformed content.

The first thing that happens when opening a PDF is reading the XRef table, which provides the exact byte offsets of objects. If this turns out to be incorrect, the table is “repaired”. For an example XRef table, see the red section here: https://www.pdftron.com/pdfnet/intro.html#pdf_intro

To detect this case, see this post.
https://groups.google.com/d/msg/pdfnet-sdk/uPLT9156YYY/c3fU7Y0NAwAJ

Ideally, a file whose XRef table had to be repaired is then saved again with the e_remove_unused flag.
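To illustrate what an incorrect XRef offset looks like, here is a minimal Python sketch that checks a single classic cross-reference entry against the raw file bytes. It is not how PDFNet implements its repair. The function name xref_entry_is_valid is made up for this example, and it only handles the plain-text table format (10-digit offset, 5-digit generation, then "n" or "f"), not cross-reference streams.

```python
import re


def xref_entry_is_valid(pdf_bytes: bytes, obj_num: int, entry: bytes) -> bool:
    """Check one classic xref entry such as b"0000000009 00000 n \n".

    Hypothetical helper for illustration only: returns True when the
    entry is well formed and, for an in-use entry, the byte offset
    really points at the "<obj_num> <gen> obj" header.
    """
    m = re.fullmatch(rb"(\d{10}) (\d{5}) ([nf])\s*", entry)
    if m is None:
        return False  # entry itself is malformed
    offset, gen, kind = int(m.group(1)), int(m.group(2)), m.group(3)
    if kind == b"f":
        return True  # free entry: there is no object to look up
    # An in-use entry must point at the start of the object header.
    header = rb"%d\s+%d\s+obj\b" % (obj_num, gen)
    return re.match(header, pdf_bytes[offset:]) is not None
```

If a check like this fails for an entry, the offsets in the table cannot be trusted, which is the situation where the table has to be rebuilt by scanning the file for object headers.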

Either way, at this point only the XRef table has been repaired; no objects have actually been accessed. Once you start operating on the file (viewing it, for example), any object can throw an exception such as the ones you see above, which is the second kind of “corruption”. PDFNet has over 10 years of experience dealing with malformed PDF files and does its best to complete each action, but when that is not possible an exception is thrown.

In these cases where the document info is corrupt, should we assume that the entire document is corrupt? Should we avoid reading/rendering/writing to the file?

Generally, you can keep using these files and interacting with them. Reading and viewing should be fine, though writing to them could make things worse.

For example, if you are trying to populate a UI element with the document info (such as Author), then just catch the PDFNet exceptions, and leave the fields blank. Typically PDF viewers would not report errors in this case.
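A rough Python sketch of that approach follows. The helper name read_doc_info_fields is made up for this example. It assumes the document object exposes GetDocInfo() and that the returned info object exposes getters such as GetTitle() and GetAuthor(), and it assumes PDFNet errors surface as ordinary Python exceptions. It takes the document as a parameter, so it does not import PDFNet itself.

```python
def read_doc_info_fields(
    doc,
    getters=("GetTitle", "GetAuthor", "GetSubject", "GetKeywords"),
):
    """Return a dict of doc info values, leaving unreadable ones blank.

    Hypothetical helper: `doc` is assumed to behave like a PDFDoc, i.e.
    doc.GetDocInfo() returns an object with the getters listed above.
    """
    fields = {name: "" for name in getters}
    try:
        info = doc.GetDocInfo()
    except Exception:
        # The info dictionary itself could not be parsed: all blank.
        return fields
    for name in getters:
        try:
            fields[name] = getattr(info, name)()
        except Exception:
            # This entry is malformed; keep the blank default.
            pass
    return fields
```

Catching each getter separately means one bad entry does not blank out the fields that can still be read. In real code you may prefer to catch a narrower exception type than Exception if your binding provides one.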