Does PDFTron provide a way to compare two PDF documents programmatically to determine whether they are identical? I have read the documentation on the semantic compare feature, which generates PDFs that visually highlight differences, but I am just looking for a simple way to load two PDFs in a test and determine whether their content is identical.
One approach I have tried is to load the two files with fs.readFileSync and compare them with Buffer.compare. If I take a PDF, make a copy of it under a different name, and compare the two, Buffer.compare returns 0, indicating that the files are identical. However, that is not the case if I use PDFTron to convert the same Word document to a PDF twice and then compare the buffers using the following technique. Why is that? Shouldn't the two files be identical?
const docx = fs.readFileSync(inputPath);
const pdf = await PDFNet.Convert.office2PDFBuffer(docx);
const pdfBuffer = Buffer.from(pdf);
const pdf2 = await PDFNet.Convert.office2PDFBuffer(docx);
const pdfBuffer2 = Buffer.from(pdf2);
console.log(Buffer.compare(pdfBuffer, pdfBuffer2)); // expected 0, but the result is never 0
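The thread does not say why the two conversions differ. PDF writers commonly embed values that change on every run, such as creation timestamps and a file identifier in the trailer, so that is one plausible cause. It is an assumption here, not something confirmed for office2PDFBuffer. A minimal sketch for checking it is to find the first differing byte and look at the bytes around it. The helper names firstDifference and contextAt are hypothetical and not part of PDFTron.

```typescript
import { Buffer } from "node:buffer";

/**
 * Returns the offset of the first byte at which two buffers differ,
 * or -1 if they are byte-for-byte identical.
 * (Hypothetical helper, not a PDFTron API.)
 */
export function firstDifference(a: Uint8Array, b: Uint8Array): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return i;
  }
  // Same bytes over the shared length: identical, or one is a prefix of the other.
  return a.length === b.length ? -1 : shared;
}

/**
 * Returns the bytes surrounding an offset as a latin1 string for inspection.
 * (Hypothetical helper, not a PDFTron API.)
 */
export function contextAt(buf: Uint8Array, offset: number, radius = 40): string {
  const start = Math.max(0, offset - radius);
  const end = Math.min(buf.length, offset + radius);
  return Buffer.from(buf.subarray(start, end)).toString("latin1");
}
```

Calling firstDifference(pdfBuffer, pdfBuffer2) and printing contextAt for both buffers at that offset should show whether the mismatch sits in metadata, such as a date string, or in the page content itself.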
I just discovered that the highlightTextDiff function returns a count of differences, so I am giving that a try. Unfortunately, it doesn't seem to give accurate results. When I compare two completely different documents, I get a diff count of only 1. I also tried comparing two documents whose page order is reversed, and I got 0 differences.
@kenneth.cruz
I am working in Node.js, but ultimately what I found worked best was to convert the docs to images using PDFTron functions and then do a simple Buffer.compare. We were comparing multipage documents, so we found that the TIFF format worked best for comparison. Here's a helper function we use in our tests. It is written in TypeScript, but it should give you a decent idea of what you could do in Python.
/**
 * Converts PDFs to TIFFs, and does a binary comparison.
 */
export async function compareAsTiff(actual: Buffer, expected: Buffer): Promise<boolean> {
  const actualDoc = await PDFNet.PDFDoc.createFromBuffer(actual);
  const expectedDoc = await PDFNet.PDFDoc.createFromBuffer(expected);
  const actualPageCount = await actualDoc.getPageCount();
  const expectedPageCount = await expectedDoc.getPageCount();
  expect(actualPageCount).toBe(expectedPageCount);
  // may need to experiment with DPI - if it's too low, may not catch minor changes (e.g. font differences)
  // PDFTron support recommended 150 as a starting point
  const options = new PDFNet.Convert.TiffOutputOptions().setDPI(150);
  const actualTiff = await PDFNet.Convert.toTiffBuffer(actualDoc, options);
  const expectedTiff = await PDFNet.Convert.toTiffBuffer(expectedDoc, options);
  expect(Buffer.compare(actualTiff, expectedTiff)).toBe(0);
  return true;
}
Thanks @dan.obrien. I was also looking into that; I used PDFDraw in my case. The tradeoff is speed, though, since rendering each page into an image is slow. Anyway, I just thought I'd ask in case there is another way.
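When pages are rendered to images one at a time, as with the PDFDraw approach mentioned above, the comparison can report which pages differ instead of a single pass/fail result. The sketch below assumes the pages have already been rendered to image bytes by some other step. The function name differingPages is hypothetical, and no PDFTron calls are shown.

```typescript
import { Buffer } from "node:buffer";

/**
 * Compares two lists of rendered page images and returns the 1-based
 * numbers of the pages that differ. A page present on only one side
 * is reported as differing.
 * (Hypothetical helper; rendering is assumed to happen elsewhere.)
 */
export function differingPages(actual: Uint8Array[], expected: Uint8Array[]): number[] {
  const pageCount = Math.max(actual.length, expected.length);
  const differing: number[] = [];
  for (let i = 0; i < pageCount; i++) {
    const a = actual[i];
    const e = expected[i];
    // Buffer.compare accepts Uint8Array and returns 0 only for identical bytes.
    if (a === undefined || e === undefined || Buffer.compare(a, e) !== 0) {
      differing.push(i + 1);
    }
  }
  return differing;
}
```

An empty result means every page image matched. A non-empty result points a failing test at the specific pages to inspect.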