Using PDFNet to split and merge PDF in a web service.

Aaron_Gravesdale · January 5, 2011, 11:19pm

Q: We need to develop a Web Service to obtain a statement from within
a large PDF file containing as many as 300,000 statements. The Web
Service will obtain the PDF filename (including path) and pass this
information to the PDF Splitter Service. Our requirements are:

1. A server-based API where the PDF Split Service is given the source
file name (including path) and the output path/filename. The output
file and a notification that the file is available are returned.

2. The ability for the PDF splitting to be multi-threaded. It is
possible that a PDF file containing 250,000 documents will be accessed
(for splitting) as many as 250,000 times concurrently, although around
20,000 concurrent accesses is more likely in this case.

3. The process time is to be under 2 seconds (1 second and under is
preferred)

4. The source PDF file containing the documents could be as much as
2GB in size (and possibly larger) and is not to be written to disk
(i.e. a copy is not to be taken of the source file).

5. The average extracted document size is 4 pages.

6. The source PDF file is not to be locked while splitting.
-------------------------

A: With PDFNet SDK you can split and process PDF documents with any
number of pages. If you also need to process very large files, use the
64-bit version (http://www.pdftron.com/pdfnet/downloads.html). With
PDFNet v.5.3+ (64-bit) it is possible to process PDF files of
arbitrary size (e.g. multiple terabytes).

PDFNet is designed to run in server and multi-threaded environments
and is fairly efficient. Please keep in mind that a large number of
concurrent requests could bring down any server, so if you need to
handle a large number of requests you may need more than one machine.
The number of servers would be proportional to the number of
concurrent users and would most likely need to be determined
empirically.
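
One way to keep a single machine from being overwhelmed while you measure its capacity is to cap the number of split requests that run at once. The following is only a sketch under stated assumptions: 'SplitGate' is a hypothetical helper (it is not part of PDFNet), and the limit and timeout you pass in would have to come from your own load tests.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of PDFNet: limits how many split
// requests are processed at the same time on one machine.
public sealed class SplitGate
{
    private readonly SemaphoreSlim _slots;

    public SplitGate(int maxConcurrent)
    {
        _slots = new SemaphoreSlim(maxConcurrent, maxConcurrent);
    }

    // Runs 'work' if a slot becomes free within 'timeout'.
    // Returns false without running it otherwise, so the caller can
    // reject the request or forward it to another server.
    public bool TryRun(Action work, TimeSpan timeout)
    {
        if (!_slots.Wait(timeout)) return false;
        try
        {
            work();
            return true;
        }
        finally
        {
            // Always free the slot, even if 'work' throws.
            _slots.Release();
        }
    }
}
```

A web service method could wrap its PDFNet call in gate.TryRun(...) and return a "busy" response when it gets false back.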

Attached (http://groups.google.com/group/pdfnet-sdk/web/PDFSplitMergeTest.zip.doc)
is a sample showing how to extract, split, and merge PDF files using
PDFNet. To run the sample, rename the file to .zip and extract the
archive into the 'PDFNet/Samples' folder. Another relevant sample is
the PDFPage sample (samplecode.html#PDFPage); however, it does not
show how to use the pdfdoc.InsertPages() method.

// The following C# code snippet shows how to extract the first 'cnt'
// pages into a new, blank PDF. 'fileName', 'output_path' and 'cnt' are
// supplied by the caller.
PDFNet.Initialize();
using (PDFDoc doc = new PDFDoc(fileName)) {
  doc.InitSecurityHandler();
  using (PDFDoc new_doc = new PDFDoc()) {
    // Page numbers are 1-based; clamp the range to the source page count.
    int from = 1, to = Math.Min(cnt, doc.GetPageCount());
    new_doc.InsertPages(0, doc, from, to, PDFDoc.InsertFlag.e_none);
    // e_remove_unused drops objects the new document does not reference.
    new_doc.Save(output_path + Path.GetFileName(fileName),
                 SDFDoc.SaveOptions.e_remove_unused);
  }
}
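
The snippet above always starts at page 1, whereas a single statement usually sits somewhere in the middle of the source file. Below is a small sketch of how the range could be computed in that case. It assumes the caller already knows the statement's first page and page count (for example from its own index), and 'PageRange' is a hypothetical helper, not a PDFNet class.

```csharp
using System;

// Hypothetical helper, not part of PDFNet: clamps a requested 1-based
// page range to the pages that actually exist in the source document.
public static class PageRange
{
    // On success, 'from' and 'to' are 1-based and inclusive.
    // Returns false if no requested page exists in the document.
    public static bool TryClamp(int firstPage, int count, int pageCount,
                                out int from, out int to)
    {
        from = 0;
        to = 0;
        if (firstPage < 1 || count < 1 || firstPage > pageCount) return false;

        // Number of pages available from firstPage to the end of the file.
        int remaining = pageCount - firstPage + 1;
        from = firstPage;
        to = firstPage + Math.Min(count, remaining) - 1;
        return true;
    }
}
```

The resulting 'from' and 'to' values can be passed to new_doc.InsertPages(0, doc, from, to, PDFDoc.InsertFlag.e_none) in place of the fixed range used above, with doc.GetPageCount() supplying 'pageCount'.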

3. The process time is to be under 2 seconds (1 second and under is preferred)

This is generally achievable, but it really depends on the type of
processing operation you need to perform. For example, do you simply
need to extract a small number of pages from a large document, or do
you need to split a whole document into many (possibly thousands of)
little pieces? The latter operation would be slower due to the
variable number of pages and higher I/O. With PDFNet there are many
tricks that you can use to decrease the processing time and the size
of the resulting files.

6. The source PDF file is not to be locked while splitting.

Because PDFNet loads the file incrementally (for efficiency reasons),
the file will be locked while it is open. If you would like to prevent
file locking, you would need to load the file into a memory buffer
(e.g. as shown in the PDFDocMemory sample -
http://www.pdftron.com/pdfnet/samplecode.html#PDFDocMemory)
before processing the file.
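
As a rough illustration of the buffer approach, the sketch below reads a file into memory using plain .NET calls while still allowing other processes to read, write, or delete it. 'SharedFileReader' is a hypothetical helper, not part of PDFNet. Note that a single .NET byte array cannot hold more than about 2 GB, so this only applies to source files below that size.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of PDFNet: reads a whole file into a
// byte array without taking an exclusive lock on it.
public static class SharedFileReader
{
    public static byte[] ReadAllShared(string path)
    {
        // FileShare.ReadWrite | FileShare.Delete lets other processes keep
        // using the file while it is being read.
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.ReadWrite | FileShare.Delete))
        {
            if (fs.Length >= int.MaxValue)
                throw new IOException("File is too large for a single byte array.");

            var buf = new byte[(int)fs.Length];
            int offset = 0;
            // Stream.Read may return fewer bytes than requested, so loop.
            while (offset < buf.Length)
            {
                int n = fs.Read(buf, offset, buf.Length - offset);
                if (n == 0) throw new EndOfStreamException("File shrank while reading.");
                offset += n;
            }
            return buf;
        }
    }
}
```

The file handle is closed as soon as the method returns, and the returned buffer can then be handed to PDFNet in the way the PDFDocMemory sample demonstrates.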