Using 'pdftron.PDF.TextExtractor' in Ruby to extract text from PDF

Do you have documentation (i.e. Ruby Doc) for the Ruby API? I’m trying to use the TextExtractor and pass the e_no_ligature_exp flag to the Begin method, but it’s unclear how to pass in a Rect from Ruby. When I pass one in, no content is extracted.

When I use the GetAsText method, the returned string is encoded as ASCII-8BIT. How can I tell the library to return the string encoded as UTF-8 or UTF-16?


The API for PDFNet’s other language bindings (Ruby/PHP/Python/etc.) is the same as for C/C++. As such, you can use the C/C++ documentation: all method and class names should be identical. Additionally, we provide several Ruby samples to help you get more comfortable with our PDFNet Ruby API.

As for the TextExtractor questions, one reason why no content was extracted is that the Rect you passed was a default rectangle, which has 0 width and 0 height. Such a rectangle does not intersect any text. What you may want to do is specify the values of x1, y1, x2, and y2 when constructing the Rect, i.e. (x1, y1, x2, y2). Please see the samples for more information.

Finally, GetAsText does indeed return a string whose encoding is reported as ASCII-8BIT. When you take a closer look at the bytes, you will notice that they are actually UTF-8. As a temporary fix, you can safely relabel the string as UTF-8 by calling force_encoding("UTF-8") on it (string.encoding only reports the current encoding; it does not change it). All our Ruby APIs return strings in UTF-8 character encoding. It is up to the user to convert them to the desired encoding, for example with string.encode for UTF-16.
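
To make the encoding fix concrete, here is a minimal sketch in plain Ruby. It does not call PDFNet at all: the `raw` string below is only a stand-in that simulates what GetAsText is described as returning (UTF-8 bytes tagged as ASCII-8BIT), and `relabel_as_utf8` is a hypothetical helper name, not part of the library.

```ruby
# Hypothetical helper: relabel UTF-8 bytes that arrived tagged as ASCII-8BIT.
# force_encoding changes only the encoding tag; the bytes stay untouched.
def relabel_as_utf8(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  # Guard against input that is not really UTF-8.
  raise ArgumentError, "bytes are not valid UTF-8" unless text.valid_encoding?
  text
end

# Stand-in for the library output (assumption): String#b returns a copy
# of the string tagged as ASCII-8BIT, with the same bytes.
raw = "caf\u00E9 \uFB01le".b

text  = relabel_as_utf8(raw)              # same bytes, now tagged UTF-8
utf16 = text.encode(Encoding::UTF_16LE)   # real conversion to UTF-16

puts raw.encoding    # ASCII-8BIT (shown as BINARY on newer Ruby versions)
puts text.encoding   # UTF-8
puts utf16.encoding  # UTF-16LE
```

Note the difference between the two calls: force_encoding is the right tool when the bytes are already correct and only the label is wrong, while encode actually transcodes the bytes and should only be used once the string carries the correct label.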