Running out of memory when merging 8000 PDF files into one

Ryan · March 29, 2022, 11:40pm

Question:

We need to merge 8000 PDF files into a single PDF file, but memory usage exceeds what is available on the system.

What can we do to reduce the memory consumption?

Answer:

First, make sure disk caching is enabled. With it on, PDFNet temporarily writes any streams (such as fonts, images and page content) to disk rather than keeping them in memory.

By default, PDFNet writes temporary changes to disk to minimize memory usage. The API below controls this setting, and its default value is true.
https://www.pdftron.com/api/PDFTronSDK/dotnet/pdftron.PDFNet.html#pdftron_PDFNet_SetDefaultDiskCachingEnabled_System_Boolean_

The above is the default PDFNet behavior on Server/Desktop.
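
If you prefer to make the setting explicit rather than rely on the default, a minimal sketch might look like this. It assumes the PDFNet .NET SDK is referenced; `PdfNetSetup`, `Init` and the `licenseKey` parameter are hypothetical names used only for illustration, and calling the setter before any document is created is an assumption rather than a documented requirement.

```csharp
using pdftron;

static class PdfNetSetup
{
    // Hypothetical helper: call once at startup, before creating any PDFDoc (assumed ordering).
    public static void Init(string licenseKey)
    {
        PDFNet.Initialize(licenseKey);
        // true is already the default on Server/Desktop; set here only to be explicit.
        PDFNet.SetDefaultDiskCachingEnabled(true);
    }
}
```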

Next, you need to periodically Save, Close and re-open the file. This clears out all the in-memory objects (e.g. the ones pointing to the streams written to disk, as described above).

Since saving to disk is a comparatively slow operation, we take advantage of the Incremental option afforded by the PDF standard, so each intermediate save only appends the changed bytes rather than rewriting the whole file over and over again. Then, on the last save, we use Linearized (AKA Fast Web View), which cleans up the file, minimizes its size and optimizes it for fast first-page viewing.

The C# code below provides one implementation.

// Requires: using System; using pdftron.PDF; using pdftron.SDF;
static void Merge(string[] filesToMerge, string outputFileName)
{
	// There is no one way to determine at what point to save. Below are some options.
	// every nth PDF file merged
	// every nth PDF Page merged
	// accumulate the sizes of the merged PDF files and save after nth bytes loaded from disk
	// track memory of the process (can be very hard depending on your environment)
	// track elapsed time

	// For demo purposes we will just do Page count.
	int pageCountSaveThreshold = 500; // after merging 500+ pages we will make an intermediate save to clear memory
	int pagesMerged = 0;
	PDFDoc mergedPdfDoc = new PDFDoc();
	foreach(string fileToMerge in filesToMerge)
	{
		try
		{
			using (PDFDoc sourcePdfDoc = new PDFDoc(fileToMerge))
			{
				if (!sourcePdfDoc.InitSecurityHandler())
				{
					// PDF has a read password, cannot effectively open and parse, do not merge.
					continue;
				}
				int currentPageCount = mergedPdfDoc.GetPageCount();
				int pagesToMerge = sourcePdfDoc.GetPageCount();
				// append pages to end of mergedPdfDoc
				mergedPdfDoc.InsertPages(currentPageCount + 1, sourcePdfDoc, 1, pagesToMerge, PDFDoc.InsertFlag.e_none);
				pagesMerged += pagesToMerge;
			}
		}
		catch (Exception e)
		{
			// Log the failing file and carry on with the remaining files.
			Console.WriteLine($"{fileToMerge}\n{e}");
		}
		if (pagesMerged >= pageCountSaveThreshold)
		{
			Console.WriteLine("temp saving");
			mergedPdfDoc.Save(outputFileName, SDFDoc.SaveOptions.e_incremental); // incremental is fastest way to save
			mergedPdfDoc.Close(); // release file handles and allocated memory
			mergedPdfDoc = new PDFDoc(outputFileName); // re-open
			pagesMerged = 0;
		}
	}
	mergedPdfDoc.Save(outputFileName, SDFDoc.SaveOptions.e_linearized); // Save as Fast Web View which will re-write and optimize the file
	mergedPdfDoc.Close(); // final release of memory and file handles.
}
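
The code comments above mention accumulating the sizes of the merged files as another way to decide when to save. A rough sketch of that idea follows; `ByteBudgetSaveTrigger` is a hypothetical helper, not part of PDFNet, and the threshold you pass in is something you would need to tune for your own files and hardware.

```csharp
using System;

// Hypothetical helper (not part of PDFNet): signals when an intermediate save is due,
// based on the cumulative size of the source files merged since the last save.
public sealed class ByteBudgetSaveTrigger
{
    private readonly long _thresholdBytes;
    private long _bytesSinceLastSave;

    public ByteBudgetSaveTrigger(long thresholdBytes)
    {
        if (thresholdBytes <= 0)
            throw new ArgumentOutOfRangeException(nameof(thresholdBytes));
        _thresholdBytes = thresholdBytes;
    }

    // Record one merged source file; returns true once the byte budget is reached.
    public bool Add(long fileSizeBytes)
    {
        _bytesSinceLastSave += Math.Max(0, fileSizeBytes);
        return _bytesSinceLastSave >= _thresholdBytes;
    }

    // Call after the intermediate save so counting starts again from zero.
    public void Reset() => _bytesSinceLastSave = 0;
}
```

Inside the merge loop, you could call `trigger.Add(new System.IO.FileInfo(fileToMerge).Length)` after each successful merge, and run the same Save, Close and re-open sequence followed by `trigger.Reset()` whenever it returns true. File size on disk is only a proxy for memory use, so treat the threshold as a starting point.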