Removing dashes from the output of TextExtract sample project

Aaron_Gravesdale · May 20, 2008, 8:33pm

Q: When I use TextExtractTest.java to extract text from PDFs, I
always get rows of dashes printed to the console after the PDF
text. They look like this:

I have been experimenting with this, trying to figure out where they
are coming from and why they are printing out, but haven’t been able
to come up with any logical reason. Can you tell me why this is
happening?

other: Here is what I’m getting when I extract the text from a simple
test PDF file that I made. Notice the two lines of dashes at the end:

Word Count: 32

GetAsText --------------------------
Credit card number: 1234123412341234
Credit card number, rotated 90 degrees: 1234123412341234
Credit card number, mirrored and rotated 90 degrees: 1234123412341234

GetAsXML --------------------------
<Page num="1" crop_box="0, 0, 595, 842" media_box="0, 0, 595, 842"
rotate="0">

…

A: The dashes are printed by the sample code; they are not part of
the extracted PDF content. They are there to delimit the output
generated by the different code snippets.

To get rid of the dashes, simply comment out the
‘System.out.println’ (Java) or ‘Console.WriteLine()’ (C# or VB.NET)
statements in the sample project that print them.
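
If you would rather keep the headers available than delete them, one option is to route them through a small helper guarded by a flag. This is only a sketch: SeparatorDemo, printSeparators, header and printHeader are made-up names for illustration, not part of the sample project or the library, and the header text is simply copied from the output shown above.

```java
import java.io.PrintStream;

public class SeparatorDemo {
    // Hypothetical switch (not in the shipped sample): set to true to get the dashes back.
    static boolean printSeparators = false;

    // Builds a header line in the same style as the sample output.
    static String header(String label) {
        return label + " --------------------------";
    }

    // Prints the header only when separators are enabled.
    static void printHeader(PrintStream out, String label) {
        if (printSeparators) {
            out.println(header(label));
        }
    }

    public static void main(String[] args) {
        // With printSeparators == false, only the extracted text line is printed.
        printHeader(System.out, "GetAsText");
        System.out.println("Credit card number: 1234123412341234");
    }
}
```

In the sample itself, the equivalent change would be to replace each println that writes a dashed header with a call like printHeader(System.out, "GetAsText"), leaving the lines that print the extracted text untouched.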