We are looking into optimizing our PDFs and are therefore testing the PDFNet library. We will mostly be working with magazines and newspapers. Some of these contain very large vector images, which take up a lot of space. The only way we have found to reduce the file size of these PDFs is to rasterize them. However, we want the text to remain unchanged, so it is still readable.
Is there a way to do this?
I tried extracting all the text from the PDF, rasterizing the remainder, and then putting the two files back together. This works pretty well, but in some cases it fails because it seems to be difficult to accurately separate text from other page graphics. The PDFTron Optimizer doesn't seem to make a big difference for my PDFs.
I hope that there is a way to accomplish what I’m trying to do. If you need more information, feel free to ask.
Based on your project description, it seems that you are looking to implement something along the lines of the PDF to HTML sample project:
Because many PDFs use soft masks, transparency, weird blend modes, and other exotic features, creating an accurate (i.e. bullet-proof) text/graphics separator is very difficult. The good news is that this separator is already available as part of the PDFNet SDK WebPublisher Add-on (i.e. pdftron.PDF.Convert.ToXod()).
As a starting point to get familiar with the WebViewer publishing platform please take a look at some online samples:
There are multiple ways to convert PDF and other document formats to XOD (including hosting PDFNet SDK on your own servers), but probably the simplest starting point is to use the Cloud API (http://www.pdftron.com/pdfnet/cloud/started.html). Basically, you can start without any programming and then add more customizations, up to the point where you host everything on your own servers.
You can also test your own files without creating an account via the bookstore sample (http://s84786.gridserver.com/website/demo/bookstore/bookstore.php, http://www.pdftron.com/pdfnet/cloud/samples.html).
The WebViewer itself does not need to rasterize non-text vector graphics (since it uses HTML5 Canvas to render paths, etc.), but on mobile platforms ‘flattening’ content could result in better performance. To flatten content you can use the ‘flattenContent’ conversion option (http://www.pdftron.com/pdfnet/cloud/advanced.html). If you are using PDFNet you can flatten content with the following snippet:
pdftron.PDF.Convert.XODOutputOptions xodOptions = new pdftron.PDF.Convert.XODOutputOptions();
// The width and height of a square in which all thumbnails will be contained.
xodOptions.SetThumbnailSize(600);
// Specifies the maximum image size in pixels.
xodOptions.SetMaximumImagePixels(2000000);
// The number of elements and the amount of image data are taken into account when
// determining whether to flatten or not. If the numbers pass a certain threshold
// on a page, the non-text content on that page will be flattened. The threshold
// is quite low, so a page would have to be mostly text-based to not be flattened.
xodOptions.SetFlattenContent(true);
// Where possible, output JPG files rather than PNG. This applies to both
// thumbnails and document images.
// xodOptions.SetPreferJPG(true);
For a full sample, see the WebViewerConvert and WebViewerStreaming samples that are included as part of the PDFNet SDK.
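To get a feel for what a cap of 2,000,000 pixels means for page-sized artwork, here is a small arithmetic sketch in Python. It assumes the limit applies to width times height and that an oversized image would be scaled down uniformly, keeping its aspect ratio. This is only an illustration of the numbers involved, not a description of how PDFNet actually resamples images, and the function name fit_within_pixel_budget is hypothetical.

```python
import math


def fit_within_pixel_budget(width, height, max_pixels):
    """Scale (width, height) down uniformly so that the pixel count fits max_pixels.

    Illustrative only: assumes the budget is width * height and that the
    aspect ratio is preserved. Not taken from PDFNet.
    """
    if width <= 0 or height <= 0 or max_pixels <= 0:
        raise ValueError("width, height and max_pixels must be positive")

    # Images already within the budget are returned unchanged.
    if width * height <= max_pixels:
        return width, height

    # Uniform scale factor that brings the area down to the budget.
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(1, math.floor(width * scale))
    new_h = max(1, math.floor(height * scale))

    # Extremely thin images get clamped to 1 pixel on the short side, which
    # can push the area over budget again; trim the long side in that case.
    if new_w * new_h > max_pixels:
        if new_w >= new_h:
            new_w = max_pixels // new_h
        else:
            new_h = max_pixels // new_w
    return new_w, new_h


if __name__ == "__main__":
    # An A4 page rendered at 300 dpi is 2480 x 3508 pixels.
    print(fit_within_pixel_budget(2480, 3508, 2_000_000))
```

Under these assumptions, a full-page A4 image at 300 dpi (about 8.7 million pixels) would end up at a bit under half its original width and height, which is worth keeping in mind when choosing the limit for print-quality magazine pages.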