Structuring documents for easy slicing

Ken · November 16, 2010, 4:01pm

Hello,

How can I create a stream containing the content between pageStart,
lineStart and pageEnd, lineEnd?
How can I assemble a pdf from streams created this way?

Here is the context for what I am trying to do:

I am working for a company that sells pdf research documents online.
The customer can buy the whole report or sections out of the report.

When the pdf comes to us from the publisher, it is not necessarily
bookmarked in a way that corresponds directly to the sections of the
report that will be available for sale.

We need a way to tag/mark the pdf with the start and end points of
sections in a way that will allow us to quickly extract the section(s)
of a report the customer would like to buy.

We currently work with a third party that is doing some of this for
us. We would like to be able to do it ourselves. The current
solution seems to be that the original pdf is restructured so that
report sections are placed in separate content streams. Also, an xml
file functioning as an index is created and stored as xmp metadata.
The index maps section names to a list of (page number, stream number)
pairs. This then allows looking up and grabbing the content
(pageContents.GetAt(streamIndex)).

Here is a small piece of a typical xml index that gets stored with the
document
<INDEX creation_date="12/30/2009 10:47:05 PM">
<SECTION st="S1">
      <CONTENT page="4" stream="1"/>
</SECTION>
<SECTION st="S2">
      <CONTENT page="4" stream="2"/>
</SECTION>
<SECTION st="S3">
      <CONTENT page="5" stream="1"/>
      <CONTENT page="6" stream="1"/>
</SECTION>
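
To make the lookup concrete, here is a minimal C++ sketch of how an index like the fragment above could be loaded into a map from section name to (page, stream) pairs. The names ContentRef, SectionIndex and ParseSectionIndex are hypothetical and not part of PDFNet. The scan assumes one element per line and the attribute order shown in the fragment; it is not a general XML parser, and a real XML library would be the safer choice in production.

```cpp
#include <map>
#include <regex>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One (page number, stream number) pair taken from a CONTENT element.
using ContentRef = std::pair<int, int>;
// Section name (the "st" attribute) -> its content references, in file order.
using SectionIndex = std::map<std::string, std::vector<ContentRef>>;

// Hypothetical helper: line-based scan of the index text.
// Assumes one element per line and page="..." before stream="...".
SectionIndex ParseSectionIndex(const std::string& xml) {
    static const std::regex section_re("<SECTION\\s+st=\"([^\"]*)\"");
    static const std::regex content_re(
        "<CONTENT\\s+page=\"(\\d+)\"\\s+stream=\"(\\d+)\"");

    SectionIndex index;
    std::string current;  // name of the open SECTION, empty when outside one
    std::istringstream in(xml);
    std::string line;
    std::smatch m;
    while (std::getline(in, line)) {
        if (std::regex_search(line, m, section_re)) {
            current = m[1].str();
            index[current];  // create the entry even if no CONTENT follows
        } else if (line.find("</SECTION>") != std::string::npos) {
            current.clear();
        } else if (!current.empty() &&
                   std::regex_search(line, m, content_re)) {
            index[current].emplace_back(std::stoi(m[1].str()),
                                        std::stoi(m[2].str()));
        }
    }
    return index;
}
```

With the fragment above as input, looking up "S3" in the returned map would give the pairs for pages 5 and 6, which can then be passed to whatever code fetches the stream from each page.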

I can do the splitting once the streams and index file are in place in
the pdf. I need help doing the restructuring of the document.

Aaron_Gravesdale · November 18, 2010, 2:35am

It seems that your PDF workflow is very specific and most of this
would not relate to other PDFNet users (since you expect your files to
be structured in a very specific way).

You should be able to use the 'pdftron.SDF' API to splice and merge
low-level page content streams. You can access a page's low-level
content stream(s) as follows:

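// Destination page contents. PushBack() below assumes this is already
// an array of content streams.
Obj c2 = page2.GetContents();

// Source page contents: either a single stream or an array of streams.
Obj c = page.GetContents();
if (c.IsStream()) { ...
}
else if (c.IsArray()) {
    for (int i=0; i<c.Size(); ++i) {
        Obj content_stm = c.GetAt(i);
        // Check the element itself, not the enclosing array.
        if (content_stm.IsStream()) {
            c2.PushBack(content_stm);
        }
    }
}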

This is not the full story, though: if the pages do not share the
same Resource dictionary (page.GetResourceDict()), you may also need
to copy or merge these dictionaries.
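
The reason merging matters is that a content stream refers to its fonts, images and so on by name, and each name is resolved through the page's Resource dictionary. The C++ sketch below models only the merge logic, using plain maps rather than PDFNet objects. ResourceMap and MergeResources are hypothetical names, and the "obj N" strings are placeholder identifiers, so treat it as an illustration under those assumptions rather than working PDFNet code.

```cpp
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for one resource category (for example /Font):
// maps the name used in the content stream to an identifier of the
// object that name refers to.
using ResourceMap = std::map<std::string, std::string>;

// Copies the entries of `src` that `dst` lacks. Returns the names that
// both maps define with different targets; those cannot be merged as-is.
std::vector<std::string> MergeResources(ResourceMap& dst,
                                        const ResourceMap& src) {
    std::vector<std::string> conflicts;
    for (const auto& entry : src) {
        auto it = dst.find(entry.first);
        if (it == dst.end()) {
            dst.insert(entry);                 // new name: safe to copy
        } else if (it->second != entry.second) {
            conflicts.push_back(entry.first);  // same name, different object
        }
        // Same name and same object: nothing to do.
    }
    return conflicts;
}
```

Any name reported as a conflict would need to be handled separately, for instance by renaming it in both the copied stream and the merged dictionary, before the moved content renders correctly.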