How to extract XML from an XFA PDF form?

Ivanho · January 8, 2013, 7:41pm

Q:
I am trying to extract XML data from an XFA form (using the PDFNet low-level APIs to extract/edit XFA data from PDF files). Can you please show me a concrete example so I can use this feature?

I have looked at the following example https://groups.google.com/forum/#!msg/pdfnet-sdk/zgoiXfCf_bU/ktXNzkpTspkJ

I couldn’t understand which part of the code gets the XML string. How do I get the XML string? And once I have it, how do I put it back after the update?

A:

I would recommend you try running the test implementation provided on the forum, as it may make things clearer. It might also be helpful to inspect the internal structure of XFA forms using our CosEdit utility:

http://www.pdftron.com/pdfcosedit/index.html

In any case, I will do my best to explain what the code is doing. From what I understand, the XML data you want to extract is held inside the XFA Array, within the AcroForm dictionary. To extract all of the XFA data, you need to iterate through this Array and extract the content of each stream. The following example shows how to extract the XML data at one specific index in the Array. I’ve simplified the source code and added comments, which should make it clearer:

// Example code for extracting an XML string from the XFA form
// and putting it back after an update.

// Create the PDFDoc
PDFDoc doc = new PDFDoc("some_file.pdf");
doc.InitSecurityHandler();

// Get the AcroForm dictionary
Obj acro_form = doc.GetAcroForm();
if (acro_form != null)
{
    // This PDF document contains forms...
    Obj obj = acro_form.FindObj("XFA");
    if (obj != null)
    {
        // This PDF document contains XFA forms...

        // The XFA entry in this PDF is an Array, so in this case
        // we read the XML stream stored at index 5 of the Array.
        Obj old_stm = obj.GetAt(5);
        pdftron.Filters.Filter filter = old_stm.GetDecodedStream();
        pdftron.Filters.FilterReader fr = new pdftron.Filters.FilterReader(filter);

        // Read the whole decoded stream into buff. A single fixed-size buffer
        // would truncate longer XML and pad shorter XML with zero bytes.
        byte[] buff;
        using (System.IO.MemoryStream ms = new System.IO.MemoryStream())
        {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = fr.Read(chunk)) > 0)
                ms.Write(chunk, 0, n);
            buff = ms.ToArray();
        }
        // At this point the XML string is stored inside buff,
        // and you can make whatever modifications you want.

        // Modify XML string HERE

        // Create an indirect stream object that contains the
        // newly modified XML string.
        Obj new_xml_stm = doc.CreateIndirectStream(buff);
        // The Swap method switches all indirect references to the old stream
        // so that they point to the newly created stream.
        doc.GetSDFDoc().Swap(new_xml_stm.GetObjNum(), old_stm.GetObjNum());

        doc.Save("output_filename.pdf", SDFDoc.SaveOptions.e_linearized);
        doc.Close();
    }
}
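
The "Modify XML String HERE" step is left open in the code above. As a rough sketch only, the bytes in buff could be edited with System.Xml.Linq before they are handed to CreateIndirectStream. This assumes the packet you read is a complete, well-formed XML element (a datasets packet, for example), and it always writes the result back as UTF-8 without an XML declaration. The names XfaXmlEdit and SetFieldValue, and the idea of looking a field up by its local element name, are made up for illustration; none of this is part of PDFNet.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class XfaXmlEdit
{
    // Hypothetical helper: parses one XFA packet, sets the text of the first
    // element whose local name equals fieldName, and returns the new bytes.
    public static byte[] SetFieldValue(byte[] packet, string fieldName, string newValue)
    {
        XElement root;
        using (var ms = new MemoryStream(packet))
        {
            // The XML reader detects the input encoding (BOM or declaration).
            root = XElement.Load(ms, LoadOptions.PreserveWhitespace);
        }

        // Match on the local name so namespace prefixes do not matter.
        XElement field = root.DescendantsAndSelf()
                             .FirstOrDefault(e => e.Name.LocalName == fieldName);
        if (field == null)
            throw new ArgumentException("No element named '" + fieldName + "' in the packet.");

        field.Value = newValue;

        // XElement.ToString never emits an XML declaration; encode without a BOM.
        string xml = root.ToString(SaveOptions.DisableFormatting);
        return new UTF8Encoding(false).GetBytes(xml);
    }
}
```

With this helper, the placeholder step would become something like buff = XfaXmlEdit.SetFieldValue(buff, "Name", "new value"); where "Name" stands for whatever field your form actually contains.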