Extracting EXIF meta-data from JPEG images embedded in PDFs

Aaron_Gravesdale · February 3, 2010, 11:09pm

Q: We're using PDFNet as our PDF toolkit. We recently ran into a need
to extract and read the EXIF metadata of JPEG images embedded in
PDFs. I've spent some time researching the PDFNet API but I haven't
been able to find anything to that effect. I'd appreciate any
pointers.
----------------
A: You can do this with the PDFNet SDK. As a starting point, you may
want to take a look at the ImageExtract sample project:

http://www.pdftron.com/pdfnet/samplecode.html#ImageExtract

Using pdftron.PDF.Image.Export(fname) you can export embedded
DCTDecode (JPEG) images without any transcoding. You can then use one
of the numerous free EXIF libraries to read the EXIF metadata from the
exported file.

Please keep in mind that PDFNet11 is a fairly old version of PDFNet.
We are about to release v5 of PDFNet so you may want to upgrade your
license.

Aaron_Gravesdale · February 17, 2010, 11:43pm

Q: Thanks to your email, I think I now have a solution for extracting
the EXIF info. But it involves first writing a JPEG file and then
extracting the EXIF info from that file. Ideally I'd like the JPEG
data to be written to a memory stream so that the extraction is
performed entirely in memory. Is this also possible with PDFNet?

On a different front, I have a PDF that has several PNG images
embedded as the background of another image. When I use the following
code to extract and write the images, the output image is completely
black. I tried attaching the PDF to this email but failed, so I'll
send the PDF to you via gmail. When you open the PDF, page 10 contains
two images sitting on top of a drop shadow. The drop shadows are
actually PNG images, and when extracted they are black. I'd very much
appreciate it if you could give me a pointer on this.

if (element.GetType() == Element.Type.e_image)
{
    Random rand = new Random();
    byte[] bytes = new byte[4];
    int iRandomNumber = rand.Next(1000000);
    string fname = "image_extract_" + iRandomNumber.ToString();
    pdftron.PDF.Image image = new pdftron.PDF.Image(element.GetXObject());
    image.Export(fname);
}
----------------
A: The problem is that some of the images are associated with a soft
mask. In PDF, a soft (or image) mask is used to compute the alpha value
of a bitmap. For example, some applications achieve a drop shadow
effect using a base image which is a solid black rectangle associated
with a soft mask (e.g. a gradient). Exported on its own, without the
mask, such a base image is simply black.

You can check whether an image has a soft mask using
image.GetSoftMask() (or image.GetImageMask() to check for an image
mask). For example:

if (image.GetSoftMask() != null) {
    // The soft mask is itself an image and can be exported separately.
    Image soft_mask = new Image(image.GetSoftMask());
    soft_mask.Export(...);
}
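
To illustrate why the base image looks black on its own, here is a
minimal sketch of how a soft-mask sample can act as alpha when a pixel
is composited over a backdrop. It is not PDFNet code: the class and
method names (SoftMaskDemo, Blend) are hypothetical, and it ignores
details of the PDF imaging model such as matte colors and color
spaces.

```csharp
using System;

public static class SoftMaskDemo
{
    // Blends one 8-bit color channel over a backdrop, treating the
    // soft-mask sample as alpha (0 = fully transparent, 255 = opaque).
    // Hypothetical helper for illustration only.
    public static byte Blend(byte source, byte backdrop, byte maskSample)
    {
        int a = maskSample;
        // Integer form of source * alpha + backdrop * (1 - alpha),
        // with rounding to the nearest value.
        return (byte)((source * a + backdrop * (255 - a) + 127) / 255);
    }
}
```

With a black source (0) over a white backdrop (255), a gradient in the
mask yields a gradient from white to black on the page, which is the
drop shadow. Without the mask, only the black source samples remain.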

> memory stream and the extraction process is performed in memory

Yes, you can extract the JPEG image directly in memory using
image.GetSDFObj().GetDecodedStream(). For example:

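// GetDecodedStream() returns a Filter, which FilterReader can read from.
Filter stm = image.GetSDFObj().GetDecodedStream();
FilterReader reader = new FilterReader(stm);
// Size the buffer from the stream length stored in the PDF.
byte[] buff = new byte[image.GetSDFObj().GetRawStreamLength()];
reader.Read(buff);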
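
Once the JPEG bytes are in a buffer such as buff, the EXIF block can be
located without touching the disk. The following is a rough sketch
that uses only the JPEG segment layout and no PDFNet calls; the names
ExifLocator and FindExif are hypothetical, and it assumes the buffer
holds a complete, well-formed JPEG. It walks the marker segments and
returns the offset of the TIFF data that follows the "Exif\0\0"
signature in the APP1 segment.

```csharp
using System;

public static class ExifLocator
{
    // Returns the offset of the TIFF header inside the Exif APP1
    // segment, or -1 if none is found before the image data starts.
    // 'length' receives the size of the TIFF data in bytes.
    public static int FindExif(byte[] jpeg, out int length)
    {
        length = 0;
        // Every JPEG file starts with the SOI marker FF D8.
        if (jpeg == null || jpeg.Length < 4 ||
            jpeg[0] != 0xFF || jpeg[1] != 0xD8)
            return -1;

        int pos = 2;
        while (pos + 4 <= jpeg.Length)
        {
            if (jpeg[pos] != 0xFF) return -1;         // not at a marker
            byte marker = jpeg[pos + 1];
            if (marker == 0xFF) { pos++; continue; }  // fill byte
            // SOS or EOI: the metadata segments are over.
            if (marker == 0xDA || marker == 0xD9) return -1;

            // The segment length is big-endian and includes its own
            // two bytes, but not the two marker bytes.
            int segLen = (jpeg[pos + 2] << 8) | jpeg[pos + 3];
            if (segLen < 2 || pos + 2 + segLen > jpeg.Length) return -1;

            int payload = pos + 4;
            int payloadLen = segLen - 2;
            if (marker == 0xE1 && payloadLen >= 6 &&
                jpeg[payload] == (byte)'E' &&
                jpeg[payload + 1] == (byte)'x' &&
                jpeg[payload + 2] == (byte)'i' &&
                jpeg[payload + 3] == (byte)'f' &&
                jpeg[payload + 4] == 0 && jpeg[payload + 5] == 0)
            {
                length = payloadLen - 6;
                return payload + 6;
            }
            pos += 2 + segLen;                        // next segment
        }
        return -1;
    }
}
```

The bytes starting at the returned offset form a TIFF structure
(beginning with "II" or "MM"), which most EXIF libraries can parse
from a byte array or a MemoryStream. Decoding the individual tags is
left to such a library.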