PDF text/positioning extraction issue: why is GetNumGlyphs != GetStringLen?

Q: I am experiencing a strange issue when extracting text and
positioning information from a PDF when using the PdfTron library.

When extracting from a given PDF using the code below, the third word I
come across is “Confidential”. Calling GetNumGlyphs() on the word
object returns 11, which puzzles me because “Confidential” has 12
letters. Stranger still, calling the GetGlyphQuad(int) method with i=11
(which I assumed would throw an error, since the index starts at 0 and
there are 11 glyphs) returns a glyph quad.

I can understand why the number of glyphs could be larger than the
number of letters in a word, but I can’t see how it could be smaller.
Could you please let me know whether this is normal behaviour?

Additionally, could you let me know what the difference is between
GetNumGlyphs(), GetStringLen() and GetString().Length in this
context?

Page page = _pdf.GetPage(pageNumber);

using (TextExtractor txt = new TextExtractor())
{
    txt.Begin(page, null,
        TextExtractor.ProcessingFlags.e_remove_hidden_text |
        TextExtractor.ProcessingFlags.e_no_invisible_text);

    new object().GetType();

    TextExtractor.Line line;
    TextExtractor.Word word;

    for (line = txt.GetFirstLine(); line.IsValid(); line = line.GetNextLine())
    {
        for (word = line.GetFirstWord(); word.IsValid(); word = word.GetNextWord())
        {
            int startTextOffset = _textBuilder.Length;

            for (int i = 0; i < word.GetNumGlyphs(); i++)
            {
                double[] quad = word.GetGlyphQuad(i);

                int offset = startTextOffset++;

                _mappingDictionary.Add(offset,
                    GetCoordinateSet(pageNumber, quad));
            }

            _textBuilder.Append(word.GetString());
            _textBuilder.Append(" ");
        }
    }
}

private static Coordinates GetCoordinateSet(int pageNumber, double[] quad)
{
    return new Coordinates()
    {
        Page = pageNumber,
        StartX = Math.Min(quad[0], quad[2]),
        StartY = Math.Min(quad[1], quad[3]),
        EndX = Math.Max(quad[4], quad[6]),
        EndY = Math.Max(quad[5], quad[7])
    };
}


A: Note that the “fi” in “Confidential” is reported as a single glyph
quad. Here “fi” is drawn as one glyph, which is called a ligature
(http://en.wikipedia.org/wiki/Typographic_ligature). If you zoom in far
enough in a PDF viewer, you will see that it is a single glyph rather
than two (or three in some cases: ‘ffi’, ‘ffl’, etc.), and you won’t be
able to select the letters individually. For this reason,
Word.GetStringLen() won’t always return the same value as
Word.GetNumGlyphs(). The former is always greater than or equal to the
latter, and it is greater when a ligature is present.

You can use TextExtractor::e_no_ligature_exp to disable the expansion
of ligatures for character mapping. Ligature expansion is on by default
(when the flag is 0), but it only applies to character (i.e. Unicode)
mapping.
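
To make the length difference concrete, here is a minimal sketch that
does not call PDFTron at all. It assumes that, with ligature expansion
disabled, the extractor would return the single code point U+FB01 for
the “fi” glyph; the class and method names are hypothetical. It uses
.NET's string.Normalize with NFKC, which also rewrites other
compatibility characters, so treat it as an illustration rather than
the library's own expansion logic.

```csharp
using System;
using System.Text;

public static class LigatureLengthDemo
{
    // Hypothetical helper: expands compatibility characters such as
    // U+FB01 (the "fi" ligature) into their component letters by
    // applying Unicode NFKC normalization.
    public static string Expand(string perGlyphText)
    {
        return perGlyphText.Normalize(NormalizationForm.FormKC);
    }

    // Number of characters in the text as stored, one per glyph under
    // the assumption stated above.
    public static int GlyphLikeLength(string perGlyphText)
    {
        return perGlyphText.Length;
    }
}
```

For the assumed input "Con\uFB01dential", GlyphLikeLength returns 11
and Expand returns "Confidential", which has 12 characters. That is the
same 11 versus 12 gap seen between GetNumGlyphs() and the expanded
string.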

Q: Thank you for your explanation.

What I am interested in is being able to detect ligatures through
pdftron::PDF::TextExtractor::Word.
But it appears there are no APIs for that (perhaps there is a way?).
I was thinking of something along the lines of: Ligature
Word.GetLigature(int glyph_idx). Through the returned ligature object
(or null if there is none) you could then query the characters it
represents.

Can you please help with this?


A: You can identify ligatures as follows:

  • Disable ligature expansion (using TextExtractor::e_no_ligature_exp)
  • Iterate through Unicode characters in a word.
  • If you encounter char codes for fi, ff, fl, ffi, ffl, ch, cl, ct,
    ll, ss, fs, st, oe, OE (e.g. http://unicode.org/charts/PDF/UFB00.pdf)
    you have identified a ligature.
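
The steps above could be sketched roughly as follows. This is an
illustration under stated assumptions, not part of the PDFTron API:
LigatureDetector, GetLigature and MapExpandedCharsToGlyphs are
hypothetical names, the table covers only the Latin ligatures at
U+FB00 to U+FB06 from the chart linked above, and the mapping assumes
that text extracted with e_no_ligature_exp contains exactly one UTF-16
character per glyph.

```csharp
using System;
using System.Collections.Generic;

public static class LigatureDetector
{
    // Latin ligatures in the Unicode Alphabetic Presentation Forms block.
    private static readonly Dictionary<char, string> Components =
        new Dictionary<char, string>
        {
            { '\uFB00', "ff" },
            { '\uFB01', "fi" },
            { '\uFB02', "fl" },
            { '\uFB03', "ffi" },
            { '\uFB04', "ffl" },
            { '\uFB05', "st" }, // long s + t, treated as "st" here
            { '\uFB06', "st" },
        };

    // Returns the letters a ligature character stands for,
    // or null if the character is not in the table.
    public static string GetLigature(char c)
    {
        string letters;
        return Components.TryGetValue(c, out letters) ? letters : null;
    }

    // For each character of the expanded text, returns the index of the
    // glyph it came from. Assumes one character per glyph in the input.
    public static List<int> MapExpandedCharsToGlyphs(string perGlyphText)
    {
        var glyphIndexes = new List<int>();
        for (int glyph = 0; glyph < perGlyphText.Length; glyph++)
        {
            string letters = GetLigature(perGlyphText[glyph]);
            int charCount = letters == null ? 1 : letters.Length;
            for (int k = 0; k < charCount; k++)
            {
                glyphIndexes.Add(glyph);
            }
        }
        return glyphIndexes;
    }
}
```

In the extraction loop from the question, the string passed in would be
word.GetString() obtained with e_no_ligature_exp set. The quad for the
character at expanded offset i could then be looked up with
word.GetGlyphQuad(map[i]), so both letters of “fi” point at the same
quad instead of shifting every later offset by one.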