Using PDFNet SDK to determine whether a PDF page is 'black and white', 'color', or 'composite'.

Aaron_Gravesdale · March 12, 2007, 7:08pm

Q:

We need to determine whether each page within the PDF is "Black and
White", "Composite Black" or "Colour".

- If the page contains a colour graphic, chart, logo or coloured text
then we would count that page as "Colour".

- If the page contains Black and White text, but the text is made up
of colours to create the Black, we would count that as "Composite".

- If the page contains only normal Black and White text, then the page
would be B&W.
----

A:
You can use PDFNet SDK (www.pdftron.com/net) to determine whether the
PDF page is 'black & white' or color. There are a couple of approaches
to this problem:

a) The simplest approach is to render the page using the PDFDraw class
(with anti-aliased rendering turned off). You can then scan through
the RGB pixels to check which colors are present in the rendered page.
This approach is the simplest because all colors are normalized to
RGB. The main drawback is that you don't have access to the original
color spaces, colorants, etc.
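
The pixel scan in approach (a) could look roughly like the sketch
below. It is only an illustration under stated assumptions: it does
not call PDFNet at all, and it assumes you have already obtained the
rendered page as an iterable of 8-bit (R, G, B) tuples. The function
name and the 'tolerance' parameter are hypothetical. Because the
colorants are lost once everything is RGB, this scan can separate
"color" from neutral pages, but it cannot tell composite black from
plain black.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # assumed 8-bit (R, G, B) values


def classify_rgb_pixels(pixels: Iterable[Pixel], tolerance: int = 0) -> str:
    """Classify rendered pixels as 'color', 'gray' or 'bw'.

    'color': at least one pixel whose channels differ by more than
             `tolerance`.
    'gray' : every pixel is neutral, but some pixel is not pure black
             or pure white.
    'bw'   : only pure black and pure white pixels (or no pixels).
    """
    saw_gray = False
    for r, g, b in pixels:
        # A neutral pixel has (nearly) equal R, G and B channels.
        if max(r, g, b) - min(r, g, b) > tolerance:
            return "color"  # one colored pixel is enough to decide
        if (r, g, b) not in ((0, 0, 0), (255, 255, 255)):
            saw_gray = True
    return "gray" if saw_gray else "bw"
```

For example, classify_rgb_pixels([(0, 0, 0), (255, 255, 255)]) gives
'bw', while adding a single (200, 10, 10) pixel gives 'color'. With
anti-aliasing turned off, as suggested above, black text on white
should not produce intermediate gray pixels at glyph edges.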

b) The second approach is to use the PDF.ElementReader class to iterate
through all graphical elements of the page. Using the Element
interfaces it is possible to obtain the color spaces and colorants
used to paint graphical elements (see
http://www.pdftron.com/net/samplecode.html#ElementReader and
http://www.pdftron.com/net/index.html#ElementReaderAdv). The advantage
of this approach is that you have full access to the complete graphics
state of every element used on the page (including obscured or
invisible elements). As a result, you would be able to further
discriminate between different classes (e.g. based on color spaces,
bits per pixel, etc). This solution is probably also a bit faster than
rasterizing the entire page. The main disadvantage is that the
implementation is trickier and you would need to write more code.
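
Once the color space and colorant values of each element have been
collected, the classification step of approach (b) might be sketched as
follows. This is not PDFNet code: the (color space name, components)
tuples are a hypothetical intermediate format that you would fill in
from the ElementReader output yourself, and the rules are assumptions
chosen to match the three categories in the question. In particular,
the sketch assumes that neutral DeviceRGB colors count as "Composite",
it uses a naive CMYK-to-RGB conversion to test for neutrality, and it
treats any other color space (ICC-based, Indexed, Separation, image
data, etc.) conservatively as "Colour".

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical record: (color space name, components in the range 0..1)
Paint = Tuple[str, Sequence[float]]

EPS = 1e-6
RANK = {"bw": 0, "composite": 1, "color": 2}


def classify_paint(space: str, comps: Sequence[float]) -> str:
    """Classify a single fill or stroke color."""
    if space == "DeviceGray":
        return "bw"  # single gray channel: black ink only
    if space == "DeviceRGB":
        r, g, b = comps
        if max(r, g, b) - min(r, g, b) > EPS:
            return "color"
        # Pure white lays down no ink; other neutrals are assumed to
        # be built from several colorants when printed.
        return "bw" if min(r, g, b) >= 1.0 - EPS else "composite"
    if space == "DeviceCMYK":
        c, m, y, k = comps
        if max(c, m, y) <= EPS:
            return "bw"  # K-only black, gray or white
        # Naive conversion, used only to test whether the result
        # looks neutral (for example rich black with K = 1).
        rgb = [(1.0 - x) * (1.0 - k) for x in (c, m, y)]
        return "composite" if max(rgb) - min(rgb) <= EPS else "color"
    return "color"  # unhandled color space: be conservative


def classify_page(paints: Iterable[Paint]) -> str:
    """Return the 'strongest' class found among all paints on a page."""
    result = "bw"
    for space, comps in paints:
        kind = classify_paint(space, comps)
        if RANK[kind] > RANK[result]:
            result = kind
        if result == "color":
            break  # nothing can outrank 'color'
    return result
```

For example, a page whose only paints are ("DeviceGray", (0.0,)) and
("DeviceCMYK", (0.6, 0.4, 0.4, 1.0)) would be classified as
'composite', because the CMYK black uses all four colorants.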