Find whitespace in PDF

Ryan · June 27, 2021, 9:24pm

Are you looking for a general solution that would work with any PDF file, or just one that works with your specific PDF files?

Note that, by default, if nothing is drawn on a PDF page, the entire page can be considered transparent. PDF viewers, however, render the page as white by default, and some viewers allow changing the page color.

This makes sense if you think of PDF viewing as a print preview, where the graphics commands are instructions to a physical printer. A blank PDF page, when printed, would simply be the color of the physical paper, which neither the PDF file nor the PDF viewer has any way of knowing.

The issue, then, is that some PDF files do explicitly define a background color, for example by painting a filled rectangle over the whole page, and this color could be white.

So no, PDFTron does not have an automatic way to detect “whitespace”.

The simplest way is to use our ElementReader sample and track the bounding boxes of every element. Compare those to the page’s CropBox and you can find areas that definitely have no graphics drawn in them.
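As a rough illustration of the comparison step, here is a minimal sketch in plain Python. It assumes you have already collected the element bounding boxes (for example from the ElementReader sample) as `(x1, y1, x2, y2)` tuples in PDF points. The function name `empty_horizontal_bands` is hypothetical and not part of any SDK, and it only finds full-width horizontal bands, which is the simplest case.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in PDF points


def empty_horizontal_bands(crop: Rect, boxes: List[Rect]) -> List[Tuple[float, float]]:
    """Return (y_low, y_high) bands of the crop box that no bounding box touches.

    Hypothetical helper: the boxes are assumed to come from your own
    element-walking code, one box per drawn element.
    """
    crop_x1, crop_y1, crop_x2, crop_y2 = crop

    # Keep only the vertical extent of each box, clipped to the crop box.
    spans = []
    for x1, y1, x2, y2 in boxes:
        if max(x1, x2) <= crop_x1 or min(x1, x2) >= crop_x2:
            continue  # entirely left or right of the visible area
        low = max(min(y1, y2), crop_y1)
        high = min(max(y1, y2), crop_y2)
        if low < high:
            spans.append((low, high))

    # Sweep upward from the bottom of the crop box, recording uncovered gaps.
    spans.sort()
    bands = []
    cursor = crop_y1
    for low, high in spans:
        if low > cursor:
            bands.append((cursor, low))
        cursor = max(cursor, high)
    if cursor < crop_y2:
        bands.append((cursor, crop_y2))
    return bands


# Example: a US Letter page with a header line and a body block.
crop_box = (0.0, 0.0, 612.0, 792.0)
element_boxes = [(72.0, 700.0, 540.0, 720.0), (72.0, 100.0, 540.0, 400.0)]
print(empty_horizontal_bands(crop_box, element_boxes))
```

Bounding boxes are conservative, so a band reported here really has nothing drawn in it, but an area covered by a box may still look empty (for instance inside a large transparent or white image).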

But if you also want to exclude white content, such as white images or paths/rectangles filled with white, then that is more complicated. Note that PDF supports many color spaces, such as CMYK and spot colors, so it can be unclear what even counts as “white”.
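To show why this gets complicated, here is a minimal sketch of a "is this fill white?" test for the three device color spaces only. It uses the naive CMYK-to-RGB formula and ignores ICC profiles, blend modes, and overprint, so treat it as an approximation. The function names are hypothetical, and the color space name and components are assumed to have been read from the element by your own code.

```python
from typing import Optional, Sequence, Tuple


def to_naive_rgb(space: str, components: Sequence[float]) -> Optional[Tuple[float, float, float]]:
    """Convert a device color (components in 0..1) to RGB with the naive formulas.

    Returns None for color spaces this sketch does not handle
    (Separation/spot, ICCBased, Indexed, Lab, patterns, ...).
    """
    if space == "DeviceGray":
        (gray,) = components
        return (gray, gray, gray)
    if space == "DeviceRGB":
        red, green, blue = components
        return (red, green, blue)
    if space == "DeviceCMYK":
        cyan, magenta, yellow, black = components
        # Naive conversion: no ICC profile or ink model is applied.
        return (
            (1.0 - cyan) * (1.0 - black),
            (1.0 - magenta) * (1.0 - black),
            (1.0 - yellow) * (1.0 - black),
        )
    return None


def is_nearly_white(space: str, components: Sequence[float], tolerance: float = 0.02) -> Optional[bool]:
    """Return True/False for device colors, or None when "white" is undecidable here."""
    rgb = to_naive_rgb(space, components)
    if rgb is None:
        return None
    return all(channel >= 1.0 - tolerance for channel in rgb)


print(is_nearly_white("DeviceCMYK", [0.0, 0.0, 0.0, 0.0]))
print(is_nearly_white("Separation", [0.0]))
```

The `None` result is deliberate: for a spot color the appearance depends on the ink and its alternate color space, so the sketch refuses to guess rather than report a wrong answer.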

Therefore, the best general solution would be to rasterize the page to an image and do image analysis. See this forum post on how to translate between PDF and image coordinates.
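A minimal sketch of the image-analysis idea, in plain Python with no imaging library: it scans a grayscale raster for runs of all-white rows and maps them back to PDF coordinates. It assumes the page was rasterized from its CropBox with no rotation, that row 0 is the top of the page, and that pixels are 0 to 255 gray values. The function names and the `threshold` default are illustrative assumptions, not an SDK API.

```python
from typing import List, Tuple


def blank_row_runs(pixels: List[List[int]], threshold: int = 250) -> List[Tuple[int, int]]:
    """Return (first_row, end_row_exclusive) runs where every pixel is >= threshold."""
    runs = []
    start = None
    for row_index, row in enumerate(pixels):
        is_blank = all(value >= threshold for value in row)
        if is_blank and start is None:
            start = row_index  # a blank run begins
        elif not is_blank and start is not None:
            runs.append((start, row_index))  # the run ended on the previous row
            start = None
    if start is not None:
        runs.append((start, len(pixels)))
    return runs


def rows_to_pdf_band(run: Tuple[int, int], dpi: float, page_top: float) -> Tuple[float, float]:
    """Map a pixel-row run to a (y_low, y_high) band in PDF points.

    PDF user space has 72 points per inch and y grows upward, while image
    rows grow downward, so the vertical axis is flipped around page_top
    (the top edge of the rasterized box in PDF points).
    """
    points_per_pixel = 72.0 / dpi
    first_row, end_row = run
    return (page_top - end_row * points_per_pixel, page_top - first_row * points_per_pixel)


# Tiny 4-row "page" rendered at 72 DPI: only row 1 contains dark pixels.
raster = [
    [255, 255, 255],
    [255, 0, 255],
    [255, 255, 255],
    [255, 255, 255],
]
for run in blank_row_runs(raster):
    print(run, rows_to_pdf_band(run, dpi=72.0, page_top=4.0))
```

The same scan applied to columns gives blank vertical strips. Because this works on the rendered result, it treats white fills and white images as whitespace regardless of the color space they were defined in.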