Why is TextExtractor's bounding box x value different from the x value from GetTextMatrix?

agravesdale · July 18, 2014, 1:17am

Q:

We found that there is a difference in coordinates between element.GetTextMatrix() and TextExtractor. In theory, we would expect the coordinate values to be the same.

Output generated using element.GetTextMatrix():
X: 133.731

Output generated using TextExtractor:
x1: 134.044824

A:

The bounding box returned by the text extractor is as tight a bounding box as PDFNet can calculate. The first character is drawn slightly after the X coordinate of the text matrix, so the bounding box from the text extractor begins there. Its X coordinate is therefore slightly larger than the X coordinate from the text matrix.

GetTextMatrix() returns the so-called text matrix (as documented in the PDF Specification – Text space details: http://xodo.com/view/#/c0c11968-ee14-478e-9b09-6dc5635c0915).

The TextExtractor bbox (or element.GetBBox()) is computed from the concatenation of the text matrix (element.GetTextMatrix()), the Current Transformation Matrix (element.GetCTM()), and a number of other text state parameters in the graphics state.
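
To make the concatenation above concrete, here is a rough sketch in plain Python. It does not call PDFNet. It follows the text rendering matrix formula from the PDF Specification (text state matrix x text matrix x CTM), and every number in it is hypothetical, including the font size and the 0.05 unit gap between the glyph origin and the glyph's left edge, which is assumed here as one possible reason the first character starts after the text matrix origin. The helper names concat, apply and text_rendering_matrix are made up for this illustration.

```python
# Plain-Python sketch; no PDFNet calls. Matrices are PDF-style 6-tuples
# (a, b, c, d, e, f) standing for [[a, b, 0], [c, d, 0], [e, f, 1]].

def concat(m, n):
    """Return the product m x n of two PDF affine matrices."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )

def apply(m, x, y):
    """Transform the point (x, y) as the row vector [x, y, 1] times m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

def text_rendering_matrix(font_size, h_scale, rise, tm, ctm):
    """Combine text state parameters, text matrix and CTM (PDF spec formula)."""
    state = (font_size * h_scale, 0.0, 0.0, font_size, 0.0, rise)
    return concat(concat(state, tm), ctm)

# Hypothetical values, chosen only for illustration.
tm = (1.0, 0.0, 0.0, 1.0, 100.0, 200.0)   # text matrix, origin at (100, 200)
ctm = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)      # identity CTM
trm = text_rendering_matrix(10.0, 1.0, 0.0, tm, ctm)

# Glyph origin: this is where the text matrix X coordinate comes from.
origin_x, _ = apply(trm, 0.0, 0.0)

# Assumed left edge of the glyph outline, 0.05 text-space units past the origin.
left_edge_x, _ = apply(trm, 0.05, 0.0)

print("text matrix X:", origin_x)
print("tight bbox x1:", left_edge_x)
```

With these made-up numbers the tight box starts 10 x 0.05 = 0.5 units to the right of the text matrix origin, which is the same kind of small offset as the 133.731 vs 134.044824 difference in the question.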