Hi All,
I'm running into a strange issue with the getAsXML method. If I call the method in the exact same manner twice on the same PDF, I get slightly different results. Has anyone seen this before or can anyone guess why I might be seeing this behavior?
I'm using the Java PDFNet implementation. Here's the relevant code:
import java.io.*;
import pdftron.Common.PDFNetException;
import pdftron.PDF.*;

public class VSMExtractText {
    public static void main(String[] args) {
        PDFNet.initialize();
        String input_path = args[0];
        String output_pre = args[1];
        String output_post = ".xml";
        try {
            PDFDoc doc = new PDFDoc(input_path);
            doc.initSecurityHandler();
            int page_num = doc.getPageCount();
            // Write one XML file per page, e.g. <output_pre>0010.xml
            for (int i=1; i<=page_num; ++i) {
                try {
                    File file = new File(output_pre + String.format("%04d", i) + output_post);
                    Writer output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF8"));
                    if (!file.exists()) {
                        file.createNewFile();
                    }
                    Page page = doc.getPage(i);
                    TextExtractor txt = new TextExtractor();
                    txt.begin(page, null, 0);
                    String text = txt.getAsXML(7);
                    // The file is written as UTF-8, so fix the encoding named in the XML header
                    String utf8text = text.replace("utf-16","utf-8");
                    output.write(utf8text);
                    output.flush();
                    output.close();
                    txt.destroy();
                }
                catch (IOException e) {
                    System.out.println(e);
                }
            }
            doc.close();
        }
        catch (PDFNetException e) {
            System.out.println(e);
        }
        PDFNet.terminate();
    }
}
I'm running this via the command line on a Mac with (for example) this call:
java -Djava.library.path=bin/libs -classpath .:bin/libs/PDFNet.jar:bin VSMExtractText pdfnet-test.pdf pdfnettest1/pdfnet-test_
And here is an example diff between two output files, produced by running the same program on the same PDF twice in a row. Note that in one run the words carry a style attribute (with the bold font, HelveticaNeueLTStd-Bd), and in the other the attribute is missing:
diff pdfnettest1/pdfnet-test_0010.xml pdfnettest2/pdfnet-test_0010.xml
63c63
< <Word box="268.84, 663.025, 8.10467, 9.43">(a</Word>
---
> <Word box="268.84, 663.025, 8.10467, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">(a</Word>
65c65
< <Word box="291.621, 663.025, 8.50853, 9.43">b)</Word>
---
> <Word box="291.621, 663.025, 8.50853, 9.43" style="font-family:HelveticaNeueLTStd-Bd; font-size:10.25; color: #231F20;">b)</Word>
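In case it helps anyone check this without running diff by hand, here is a minimal sketch of a same-process comparison. It uses only the Java standard library; the class and method names (XmlRunCompare, differingLines) are made up for illustration and are not part of PDFNet. The idea would be to pass in the strings returned by two getAsXML calls on the same page. The sample strings in main are shortened stand-ins, not real output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFNet): reports the 1-based numbers of
// lines that differ between two XML strings, compared position by position.
public class XmlRunCompare {

    public static List<Integer> differingLines(String first, String second) {
        // Split on \n or \r\n; the -1 limit keeps trailing empty lines.
        String[] a = first.split("\\r?\\n", -1);
        String[] b = second.split("\\r?\\n", -1);
        List<Integer> diffs = new ArrayList<>();
        int max = Math.max(a.length, b.length);
        for (int i = 0; i < max; i++) {
            // A line missing from the shorter input counts as a difference.
            String left = i < a.length ? a[i] : null;
            String right = i < b.length ? b[i] : null;
            if (left == null || !left.equals(right)) {
                diffs.add(i + 1);
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for two getAsXML results on the same page.
        String run1 = "<Word box=\"1, 2, 3, 4\">(a</Word>\n<Word box=\"5, 6, 7, 8\">x</Word>";
        String run2 = "<Word box=\"1, 2, 3, 4\" style=\"font-size:10.25;\">(a</Word>\n<Word box=\"5, 6, 7, 8\">x</Word>";
        System.out.println("Differing lines: " + differingLines(run1, run2));
    }
}
```

This is a positional comparison rather than a true diff, so an inserted or deleted line would mark every line after it as different. For the case above, where only attributes change, that should be enough.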
Thanks for any suggestions!
Nick