Q: We are evaluating the PDF2Text command-line product
(http://www.pdftron.com/pdf2text) and have come across a weird situation.
We are trying to read the text from a PDF document.
Is there any way to determine if the text in the PDF has been struck
out? The XML output I am seeing for this document provides no
indication the words are struck out.
Any help would be appreciated... I might need to switch to the SDK?
A: You can't use PDF2Text to determine whether the text is crossed out;
however, you could use PDFNet SDK (http://www.pdftron.com/pdfnet). You
can extract all text as shown in the TextExtract sample project
(http://www.pdftron.com/pdfnet/samplecode.html#TextExtract). Depending on how
text is crossed out, you could check whether there are any strikeout
annotations on the page (see the Annotation sample project). In the worst
case you would need to use ElementReader (see the ElementReaderAdv sample
project) to traverse all paths on the page and see if there is any
horizontal line that intersects a text bounding box.
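
Once the path segments and the text bounding boxes have been collected, the worst-case check is plain geometry. Here is a minimal sketch in Python, assuming the segments and boxes were already extracted as in the ElementReaderAdv and TextExtract samples. The Rect and Segment types, the function names, and the tolerance values are hypothetical and are not part of PDFNet:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned text bounding box in page coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


@dataclass(frozen=True)
class Segment:
    """Straight line segment taken from a page path (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def is_horizontal(seg, tol=0.5):
    """True if both end points share (almost) the same y coordinate."""
    return abs(seg.y1 - seg.y2) <= tol


def strikes_through(seg, box, min_cover=0.8, tol=0.5):
    """Heuristic: a horizontal segment that crosses the middle band of the
    box and covers at least min_cover of the box width."""
    if not is_horizontal(seg, tol):
        return False
    left, right = sorted((box.x1, box.x2))
    bottom, top = sorted((box.y1, box.y2))
    width, height = right - left, top - bottom
    if width <= 0 or height <= 0:
        return False
    y = (seg.y1 + seg.y2) / 2.0
    # Ignore lines near the top or bottom edge (likely underlines or rules).
    if not (bottom + 0.25 * height <= y <= top - 0.25 * height):
        return False
    seg_left, seg_right = sorted((seg.x1, seg.x2))
    overlap = min(right, seg_right) - max(left, seg_left)
    return overlap >= min_cover * width


def struck_out_boxes(boxes, segments):
    """Return the boxes that at least one segment strikes through."""
    return [b for b in boxes if any(strikes_through(s, b) for s in segments)]


if __name__ == "__main__":
    words = [Rect(10, 100, 60, 112), Rect(70, 100, 120, 112)]
    lines = [Segment(9, 106, 61, 106),   # through the middle of the first word
             Segment(70, 99, 120, 99)]   # just below the second word
    print(struck_out_boxes(words, lines))
```

The middle-band test is there so that underlines and table rules are not reported as strikeouts. Some PDF producers may draw the strike as a thin filled rectangle rather than a stroked line, in which case the rectangle would first have to be reduced to its horizontal center line before calling strikes_through.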