Low-level editing of PDF page content bytes

Q:

We would like to directly edit the decoded bytes of a page content stream and would like to know the most efficient and safe way to do so with PDFNet. By “safe”, we mean in terms of making the fewest changes possible to the PDF as a whole, ideally changing only the stream bytes (and of course updating the stream’s “Length” entry accordingly) - we understand that it’s our responsibility to ensure that our modifications to the stream’s contents are still valid according to the PDF spec. Currently our approach is as follows:

  1. Read decoded stream data to a byte array:

private static byte[] GetStreamBytes(Obj stream)
{
    byte[] bytes;
    using (MemoryStream memoryStream = new MemoryStream())
    {
        Filter filter = stream.GetDecodedStream();
        FilterReader filterReader = new FilterReader(filter);
        byte[] buffer = new byte[filter.Size()];
        int length = 0;
        while ((length = filterReader.Read(buffer)) > 0) {
            memoryStream.Write(buffer, 0, length);
        }
        bytes = memoryStream.ToArray();
        memoryStream.Close();
    }
    return bytes;
}

  2. Modify stream bytes as desired.

  3. Write the modified stream bytes back to the original stream object. We have tried three approaches for this:

a)

stream.SetStreamData(byteArray);

This approach does not appear to preserve the encoding type of the original stream data (i.e. just writes uncompressed data).

b)

stream.SetStreamData(byteArray, new FlateEncode(null));

This approach always uses Flate encoding.

c)

Obj contentsNew = pdf.CreateIndirectStream(memoryStream.ToArray(), new FlateEncode(null));
pdf.GetSDFDoc().Swap(contents.GetObjNum(), contentsNew.GetObjNum());
// TODO: Copy additional entries from old stream dictionary to new one

This approach always uses Flate encoding and also requires some manual copying of entries from one dictionary to another.

Can you please comment on what the best way is to accomplish our overall goal - whether it is one of the approaches taken above, some variation thereof, or something completely different?

A:

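Without knowing the concrete objective you’re trying to accomplish with this code, approach (b) (SetStreamData with a FlateEncode filter) seems the least problematic.

Because it rewrites the data of the existing stream object in place, rather than creating and swapping in a new object as in (c), it should preserve as much of the original PDF structure as possible.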

With this approach, streams in older PDFs that use obsolete compression schemes (such as LZW) will end up re-encoded with Flate (or left uncompressed, if you prefer approach (a)). Preserving the original stream filter type would be fairly difficult and is not supported out of the box: keep in mind that some filters take compression parameters, many of which are not explicitly stored in the PDF.
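For illustration, the read-modify-write cycle with approach (b) could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it uses only the PDFNet calls that already appear in the question (GetDecodedStream, FilterReader.Read, SetStreamData, FlateEncode). The class name ContentStreamEditor, the method names, the `edit` callback, the 64 KB buffer size, and the `pdftron.Filters` / `pdftron.SDF` namespace imports are assumptions made for the example.

```csharp
using System;
using System.IO;
using pdftron.Filters; // assumed namespace for Filter, FilterReader, FlateEncode
using pdftron.SDF;     // assumed namespace for Obj

// Hypothetical helper class; the name is illustrative only.
static class ContentStreamEditor
{
    // Reads the fully decoded bytes of a stream object.
    // A fixed-size buffer is used here as an illustrative choice,
    // instead of sizing the buffer from filter.Size() as in the question.
    private static byte[] ReadDecoded(Obj stream)
    {
        using (MemoryStream memoryStream = new MemoryStream())
        {
            Filter filter = stream.GetDecodedStream();
            FilterReader reader = new FilterReader(filter);
            byte[] buffer = new byte[64 * 1024];
            int length;
            while ((length = reader.Read(buffer)) > 0)
            {
                memoryStream.Write(buffer, 0, length);
            }
            return memoryStream.ToArray();
        }
    }

    // Applies a caller-supplied edit to the decoded bytes and writes the
    // result back to the same stream object, re-encoded with Flate.
    // The caller is responsible for producing valid content stream syntax.
    public static void Rewrite(Obj stream, Func<byte[], byte[]> edit)
    {
        byte[] original = ReadDecoded(stream);
        byte[] edited = edit(original);
        stream.SetStreamData(edited, new FlateEncode(null));
    }
}
```

A call such as `ContentStreamEditor.Rewrite(contents, bytes => bytes);` would round-trip the stream unchanged apart from the re-encoding, which can serve as a quick check that nothing else in the file is disturbed before real edits are applied.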

However, keep in mind that such low-level operations might not be the best approach. Could you explain more about why you’re writing to the content stream directly?