I'm trying to remove all fully embedded fonts and embedded font subsets from a PDF file, basically removing every font stream.
So far I found the FAQ entry "How do I remove embedded fonts?" at http://www.pdftron.com/pdfnet/faq.html, which works quite well for fully embedded fonts.
The problem I'm having is that removing an embedded font subset results in a broken PDF file.
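For context, the FAQ approach boils down to deleting the embedded font program streams from each font descriptor. Here is a minimal stdlib sketch of that idea, using plain Python dicts to stand in for PDF objects (the keys FontFile, FontFile2 and FontFile3 are the actual PDF-spec entries for Type 1, TrueType and CFF font programs; the function name and dict representation are mine, not the FAQ's code):

```python
# Illustrative only: plain dicts stand in for PDF cos objects.
#   FontFile  -> Type 1 font program
#   FontFile2 -> TrueType font program
#   FontFile3 -> CFF/Type1C (and similar) font programs
EMBEDDED_FONT_KEYS = ("FontFile", "FontFile2", "FontFile3")

def strip_font_streams(font_descriptor):
    """Drop the embedded font program entries from a font descriptor dict."""
    for key in EMBEDDED_FONT_KEYS:
        font_descriptor.pop(key, None)
    return font_descriptor

descriptor = {
    "Type": "FontDescriptor",
    "FontName": "Helvetica-Bold",
    "Flags": 4,
    "FontFile3": b"...compressed CFF data...",
}
strip_font_streams(descriptor)
print("FontFile3" in descriptor)  # False: the viewer must now substitute the font
```

Once the stream entries are gone, the descriptor's metrics remain, and the viewer falls back to a substitute font, which is exactly where the subset problem below comes from.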
Is there a way to remove those fonts as well?
Or maybe the font type can be changed from "Type 1 (CID)" to "TrueType"?
Thanks for your help in advance.
By "broken", do you mean the file will not open at all? I’m not able to reproduce that.
I’ve attached two Python files: one lists the embedded and subsetted fonts in all PDFs in a folder, and the other removes all the embedded fonts. I wasn’t able to break any PDF files using the latter script.
If you still have an issue, please send the input and output PDF files, along with the version of PDFNet you are using, to PDFTron support.
Note: to use the Python files, you can place them in the root of a PDFNetC folder, and they will pick up the libraries from PDFNetC/Lib. If you use 64-bit Python, you need PDFNetC64.
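Since the attachments aren’t inlined here, one detail worth knowing for the listing side: subsetted fonts are easy to recognize, because the PDF spec requires a subset’s BaseFont name to carry a tag of six uppercase letters followed by "+". A small check along those lines (the function name is mine, not taken from the attached script):

```python
import re

# Per the PDF spec, subset font names look like "ABCDEF+Arial":
# a six-uppercase-letter tag, a plus sign, then the base name.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

def is_subset_font(base_font_name):
    """True if the BaseFont name carries a subset tag, e.g. 'ABCDEF+Arial'."""
    return SUBSET_TAG.match(base_font_name) is not None

print(is_subset_font("ABCDEF+Arial"))  # True
print(is_subset_font("Helvetica"))     # False
```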
list_fonts.py (2.15 KB)
remove_embedded_fonts.py (1.72 KB)
Unfortunately, as with many things in PDF, removing fonts is not as simple as it seems.
If you indiscriminately remove all embedded font streams, you will visually ‘break’ the file only if the substituted font is missing specific glyphs or if the font does not use a standard/predefined encoding. The latter case is more likely, and the end result will be gibberish text.
Perhaps you can identify these cases (e.g. custom encoding) so that you can skip font removal for them?
If you need to remove all fonts, period, you could still use PDFNet, but you would need to rewrite all content streams, normalizing all text to Unicode.
The idea is similar to the ElementEdit sample, except that you create a new (non-embedded) Font and associate all text with it. You would also need to map the text to Unicode (element.GetTextString() → uni, then element.SetData(uni.GetBuffer(), uni.GetLength()*2)).
This would work for most, but not all, files. The problem is that some/many PDFs don’t have proper Unicode maps, which means you could still end up with gibberish text. As a next step, you could try to use OCR to verify and fix these mappings …