Text extractor for non-PDF documents

I am trying to extract the text under an annotation (highlight, underline, etc.). For PDFs, I am using TextExtractor from the PDFNet library. Is there an equivalent capability for HTML and MS Office documents?

