How do I extract text from a given PDF layer using PDFNet SDK?

Ivanho · January 26, 2013, 12:21am

Q:

Our scenario is this:

· Input file is a layered PDF (normally one page, but could be more)

· We need to check that a particular layer has live (not outlined) text on it

· We know the layer name we are looking for will contain the word ‘artwork’

· Therefore, we want to attempt to extract text only on this particular layer (if it is found)

· If the extracted text is empty, we will fail the process, otherwise we continue

Is there a recommended approach to this? My developers have been struggling a little with this, as there doesn’t appear to be a way to extract text from only one layer.

A:

Yes, this is somewhat tricky. One approach that comes to mind is to extract the required layer onto a temporary page and then use ‘pdftron.PDF.TextExtractor’ to get the text from that page.
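
As a rough illustration of the two ends of that process, the sketch below finds the layer whose name contains ‘artwork’ and then checks whether a page yields any text. It is only a sketch: the class and method names (ArtworkLayerCheck, FindLayer, HasLiveText) are made up for this example, the case-insensitive name match is an assumption, and it assumes PDFNet has already been initialized and that the temporary page holding only the layer's elements has already been built.

```csharp
using System;
using pdftron.PDF;
using pdftron.PDF.OCG;
using pdftron.SDF;

// Hypothetical helper class; the names here are illustrative, not part of PDFNet.
static class ArtworkLayerCheck
{
    // Returns the first layer (OCG) whose name contains `needle`,
    // or null if the document has no layers or none of them match.
    public static Group FindLayer(PDFDoc doc, string needle)
    {
        Obj ocgs = doc.GetOCGs(); // array of all OCGs; null if the document has none
        if (ocgs == null) return null;

        for (int i = 0; i < ocgs.Size(); ++i)
        {
            Group ocg = new Group(ocgs.GetAt(i));
            // Assumption: layer names are matched case-insensitively.
            if (ocg.GetName().IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0)
                return ocg;
        }
        return null;
    }

    // Returns true if the page contains any extractable (live) text.
    // `tempPage` is assumed to contain only the elements copied from the target layer.
    public static bool HasLiveText(Page tempPage)
    {
        using (TextExtractor txt = new TextExtractor())
        {
            txt.Begin(tempPage);
            return !string.IsNullOrWhiteSpace(txt.GetAsText());
        }
    }
}
```

A caller would fail the process when FindLayer(doc, "artwork") returns null or when HasLiveText returns false for the temporary page, and continue otherwise.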

To extract the layer you can use the approach shown in the ElementEdit sample: http://www.pdftron.com/pdfnet/samplecode.html#ElementEdit

To copy the elements, initialize ElementReader with an OCG Context, similar to the way PDFDraw is set up in the PDFLayers sample (http://www.pdftron.com/pdfnet/samplecode.html#PDFLayers):

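// Start from the document's default layer configuration.
Config init_cfg = doc.GetOCGConfig();

Context ctx = new Context(init_cfg);

// Hide every layer, then show only the target layer
// (ocg is the OCG Group for the layer of interest).
ctx.ResetStates(false);

ctx.SetState(ocg, true);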

…

reader.Begin(page, ctx);

…

// Copy only the elements that are visible in the selected layer.
if (element.IsOCVisible()) {

    writer.WriteElement(element);