Compare two PDFs

dan.obrien · February 26, 2022, 1:51pm

Product:
pdfnet-node
Product Version:
9.2

Does PDFTron provide a way to compare two PDF documents to programmatically determine if they are identical? I have read the documentation on the semantic compare feature that generates PDFs that visually highlight differences, but I am just looking for a simple way to load two PDFs in a test and determine whether their content is identical.

One approach I have tried is simply loading two files using fs.readFileSync, and then using Buffer.compare. If I take a PDF and make a copy of it with a different name, Buffer.compare returns 0, indicating the two files are identical. However, this is not the case if I use PDFTron to convert the same Word doc to a PDF twice and then compare the buffers using the following technique. Why is that? Shouldn’t the two files be identical?

    const docx = fs.readFileSync(inputPath);
    const pdf = await PDFNet.Convert.office2PDFBuffer(docx);
    const pdfBuffer = Buffer.from(pdf);
    
    const pdf2 = await PDFNet.Convert.office2PDFBuffer(docx);
    const pdfBuffer2 = Buffer.from(pdf2);
    
    console.log(Buffer.compare(pdfBuffer, pdfBuffer2)); // expected 0, but it is never 0
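
To narrow down why two conversions of the same input differ, it can help to find where the buffers first diverge and look at the surrounding bytes. The sketch below is only an illustration: `firstDifference` and `contextAt` are hypothetical helper names, not part of PDFNet. PDF files can carry metadata such as creation timestamps and document IDs, so that is one possible cause worth checking, though nothing in this thread confirms it.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical helper: returns the offset of the first differing byte,
// or -1 if the two buffers are byte-for-byte identical.
function firstDifference(a: Uint8Array, b: Uint8Array): number {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    if (a[i] !== b[i]) return i;
  }
  // If one buffer is a prefix of the other, they diverge where the shorter one ends.
  return a.length === b.length ? -1 : shared;
}

// Hypothetical helper: decodes the bytes around an offset as Latin-1 text
// so that any readable PDF syntax near the difference can be inspected.
function contextAt(buf: Uint8Array, offset: number, radius = 40): string {
  const start = Math.max(0, offset - radius);
  const end = Math.min(buf.length, offset + radius);
  return Buffer.from(buf.subarray(start, end)).toString('latin1');
}

// Possible usage with the buffers from the snippet above:
//   const offset = firstDifference(pdfBuffer, pdfBuffer2);
//   if (offset !== -1) {
//     console.log(contextAt(pdfBuffer, offset));
//     console.log(contextAt(pdfBuffer2, offset));
//   }
```

If the differing region sits inside a compressed stream, the printed context will not be readable, so this is a starting point for investigation rather than a full diff.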

Thanks, Dan

anugu14divya · February 28, 2022, 1:33pm

Can anyone help me integrate PDF comparison with AngularJS? Please provide steps.

dan.obrien · March 1, 2022, 5:08pm

Just discovered that the highlightTextDiff function returns a count of differences, so I am giving that a try. Unfortunately, it doesn’t seem to provide accurate results. When I compare two completely different documents, I only get a diff count of 1. I also tried comparing two documents that have their page order reversed, and I get 0 differences.

kenneth.cruz · July 14, 2022, 5:21am

I have a somewhat similar problem. For a simple example (Python):

doc1 = PDFDoc()
doc2 = PDFDoc()
assert doc1.Save(0) == doc2.Save(0)

Is there a way to make sure they are always the same? How do I compare two PDFs programmatically? If not at the PDF file level, how about at the page level?

dan.obrien · July 14, 2022, 12:51pm

@kenneth.cruz
I am working in Node.js, but ultimately what I found worked best was to convert the docs to images using PDFTron functions, and then do a simple Buffer.compare. We were comparing multipage documents, so we found that the TIFF format worked best for comparison. Here’s a helper function we use in our tests. It is written in TypeScript, but should give you a decent idea of what you could do in Python.

import { PDFNet } from '@pdftron/pdfnet-node';

/**
 * Converts PDFs to TIFFs, and does a binary comparison.
 */
export async function compareAsTiff(actual: Buffer, expected: Buffer): Promise<boolean> {
  const actualDoc = await PDFNet.PDFDoc.createFromBuffer(actual);
  const expectedDoc = await PDFNet.PDFDoc.createFromBuffer(expected);
  const actualPageCount = await actualDoc.getPageCount();
  const expectedPageCount = await expectedDoc.getPageCount();
  expect(actualPageCount).toBe(expectedPageCount);

  // May need to experiment with DPI: if it's too low, minor changes (e.g. font differences) may not be caught.
  // PDFTron support recommended 150 as a starting point.
  const options = new PDFNet.Convert.TiffOutputOptions().setDPI(150);
  const actualTiff = await PDFNet.Convert.toTiffBuffer(actualDoc, options);
  const expectedTiff = await PDFNet.Convert.toTiffBuffer(expectedDoc, options);
  expect(Buffer.compare(actualTiff, expectedTiff)).toBe(0);

  return true;
}

kenneth.cruz · July 15, 2022, 7:02am

Thanks @dan.obrien. I was also looking into that, and I used PDFDraw in my case. The tradeoff is slow speed, though, when rendering each page into an image. Anyway, I just thought I’d ask in case there is another way.
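
For the page-level comparison asked about earlier, rendering each page separately makes it possible to report which pages differ instead of a single pass/fail. A minimal sketch, assuming the pages have already been rendered to image buffers by whatever means is used (PDFDraw, for example). `findDifferingPages` is a hypothetical name, and the rendering step is deliberately left out.

```typescript
import { Buffer } from 'node:buffer';

// Hypothetical helper: compares two lists of rendered page images and
// returns the 1-based numbers of the pages whose bytes differ.
function findDifferingPages(
  actualPages: Uint8Array[],
  expectedPages: Uint8Array[],
): number[] {
  const differing: number[] = [];
  const pageCount = Math.max(actualPages.length, expectedPages.length);
  for (let i = 0; i < pageCount; i++) {
    const actual = actualPages[i];
    const expected = expectedPages[i];
    // A page missing from either document counts as a difference.
    if (!actual || !expected || Buffer.compare(actual, expected) !== 0) {
      differing.push(i + 1);
    }
  }
  return differing;
}
```

An empty result means every rendered page matched byte for byte. This does not make rendering any faster, but it does show where to look when a comparison fails.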