Fully embedded font is extracted only partially if it occupies more than one object

Description

I have a PDF in which, according to my PDF reader, some fonts are fully embedded, so I decided to test my freshly installed PyMuPDF 1.21.0-rc2 on it. It reported that it "saved 16 fonts to" my working directory, but I found only 11 files there. The font list from pdffonts made the reason clear: some of the "fully" embedded fonts use more than one object to store pieces of the font. A check with mupdf confirmed that PyMuPDF extracted only the last object into font-name.cff, probably because the output file (font-name.cff) is the same for all pieces/objects of the font named font-name.
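To illustrate the suspected overwrite, here is a minimal sketch of an extraction loop that puts the font object number into each file name, so that objects sharing a font name cannot clobber each other. It is not PyMuPDF's actual implementation of `extract -fonts`; the helper names `font_filename` and `extract_fonts` are made up for this example, and it assumes that `Document.get_page_fonts()` and `Document.extract_font()` behave as documented for PyMuPDF 1.21.

```python
from pathlib import Path


def font_filename(basefont: str, xref: int, ext: str) -> str:
    """Hypothetical naming scheme: font name plus object number (xref)."""
    safe = basefont.replace("/", "_")  # keep the name usable as a file name
    return f"{safe}-{xref}.{ext}"


def extract_fonts(pdf_path: str, out_dir: str = ".") -> list:
    """Write every extractable font object of a PDF to its own file."""
    import fitz  # PyMuPDF, imported here so font_filename works without it

    doc = fitz.open(pdf_path)
    seen = set()
    written = []
    for pno in range(doc.page_count):
        # Each item starts with (xref, ext, type, basefont, ...).
        for xref, ext, _type, _basefont, *_rest in doc.get_page_fonts(pno):
            if xref in seen or ext == "n/a":  # "n/a": not extractable
                continue
            seen.add(xref)
            name, ext, _type, content = doc.extract_font(xref)
            if not content:
                continue
            target = Path(out_dir) / font_filename(name, xref, ext)
            target.write_bytes(content)
            written.append(target.name)
    return written
```

With such a scheme, objects 567, 569 and 306 would end up in Times-Roman-567.cff, Times-Roman-569.cff and Times-Roman-306.cff instead of a single Times-Roman.cff.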

How To Reproduce

You must have one of those PDFs that embed a font fully by storing pieces of it in multiple objects. Let's say x.pdf is one of them. pdffonts lists the fonts of x.pdf as follows:

pdffonts x.pdf

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
Times-Roman                          Type 1C           MacRoman         yes no  no     567  0
CAJCNM+intiri                        Type 1C           Custom           yes yes yes    593  0
CAJCMM+intirr                        Type 1C           Custom           yes yes yes    595  0
CAJCON+intirsc                       Type 1C           Custom           yes yes yes    594  0
intirsc                              Type 1C           WinAnsi          yes no  no     596  0
DFMPPD+Springnew-Regular             Type 1C           Custom           yes yes yes    576  0
Times-Roman                          Type 1C           Custom           yes no  no     569  0
Times-Italic                         Type 1C           Custom           yes no  no     568  0
Times-Bold                           Type 1C           WinAnsi          yes no  no     321  0
Times-Roman                          Type 1C           Custom           yes no  no     306  0
Times-Italic                         Type 1C           Custom           yes no  no     315  0
Times-BoldItalic                     Type 1C           WinAnsi          yes no  no     350  0
MT2SYS                               Type 1C           Custom           yes no  yes    379  0
MT2MIT                               Type 1C           MacRoman         yes no  no     392  0
MT2MIS                               Type 1C           WinAnsi          yes no  no     332  0
Times-Bold                           Type 1C           WinAnsi          yes no  no     311  0

You can see that, for example, Times-Roman has 'no' in the "sub" (subsetted) column, meaning it is NOT subsetted - therefore it is fully embedded. You can also see that it occupied objects with numbers 567, 569 and 306.
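The same check can be done mechanically. The following sketch groups the rows of the pdffonts listing above by font name and prints the names that appear under more than one object number; the function name `objects_by_name` is just for this example.

```python
from collections import defaultdict

# (font name, object number) pairs copied from the pdffonts listing above.
PDFFONTS_ROWS = [
    ("Times-Roman", 567), ("CAJCNM+intiri", 593), ("CAJCMM+intirr", 595),
    ("CAJCON+intirsc", 594), ("intirsc", 596),
    ("DFMPPD+Springnew-Regular", 576), ("Times-Roman", 569),
    ("Times-Italic", 568), ("Times-Bold", 321), ("Times-Roman", 306),
    ("Times-Italic", 315), ("Times-BoldItalic", 350), ("MT2SYS", 379),
    ("MT2MIT", 392), ("MT2MIS", 332), ("Times-Bold", 311),
]


def objects_by_name(rows):
    """Map each font name to the list of object numbers that carry it."""
    groups = defaultdict(list)
    for name, obj in rows:
        groups[name].append(obj)
    return dict(groups)


# Keep only the names that occur under more than one object.
shared = {
    name: objs
    for name, objs in objects_by_name(PDFFONTS_ROWS).items()
    if len(objs) > 1
}
for name, objs in shared.items():
    print(name, objs)
# Times-Roman [567, 569, 306]
# Times-Italic [568, 315]
# Times-Bold [321, 311]
```

So Times-Italic and Times-Bold are affected in the same way as Times-Roman.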

mutool shows practically the same with its 'info' command:

mutool info -F x.pdf

Fonts (16):
        2       (820 0 R):      Type1 'Times-Roman' MacRomanEncoding (567 0 R)
        3       (819 0 R):      Type1 'CAJCNM+intiri' (593 0 R)
        3       (819 0 R):      Type1 'CAJCMM+intirr' (595 0 R)
        3       (819 0 R):      Type1 'CAJCON+intirsc' (594 0 R)
        3       (819 0 R):      Type1 'intirsc' WinAnsiEncoding (596 0 R)
        4       (301 0 R):      Type1 'DFMPPD+Springnew-Regular' (576 0 R)
        5       (298 0 R):      Type1 'Times-Roman' WinAnsiEncoding (569 0 R)
        5       (298 0 R):      Type1 'Times-Italic' WinAnsiEncoding (568 0 R)
        6       (295 0 R):      Type1 'Times-Bold' WinAnsiEncoding (321 0 R)
        6       (295 0 R):      Type1 'Times-Roman' WinAnsiEncoding (306 0 R)
        6       (295 0 R):      Type1 'Times-Italic' WinAnsiEncoding (315 0 R)
        10      (292 0 R):      Type1 'Times-BoldItalic' WinAnsiEncoding (350 0 R)
        22      (47 0 R):       Type1 'MT2SYS' (379 0 R)
        30      (798 0 R):      Type1 'MT2MIT' MacRomanEncoding (392 0 R)
        207     (620 0 R):      Type1 'MT2MIS' WinAnsiEncoding (332 0 R)
        220     (134 0 R):      Type1 'Times-Bold' WinAnsiEncoding (311 0 R)

We could use mutool extract x.pdf, but that is not user-friendly: a) it extracts all fonts and all images together, and b) it names the fonts font-XXXX.cff (or, possibly, font-XXXX.ttf), where XXXX has no relation to the font object number (contrary to what its documentation claims). So in practice you don't know which file is which font unless you open each one of them, or at least read its metadata somehow.

Enter PyMuPDF, which promises to a) extract all fonts and b) give the extracted files sensible names. Alas, trying it on our x.pdf yields just 11 fonts, contrary to the claimed 16:

python -m fitz extract -fonts x.pdf

saved 16 fonts to ...

ls -l | awk -e '{print $5, $9}'

24759 CAJCMM+intirr.cff
25860 CAJCNM+intiri.cff
23898 CAJCON+intirsc.cff
1295 DFMPPD+Springnew-Regular.cff
217 MT2MIS.cff
506 MT2MIT.cff
286 MT2SYS.cff
17076 Times-Bold.cff
18332 Times-BoldItalic.cff
18302 Times-Italic.cff
24847 Times-Roman.cff

(first column is file size)

What has happened? Comparing this to the output of mutool extract

mutool extract x.pdf
...
ls -l font-* | awk -e '{print $5, $9}'

217 font-0330.cff
506 font-0522.cff
286 font-0534.cff
18332 font-0549.cff
18302 font-0557.cff
17076 font-0558.cff
25078 font-0561.cff
26087 font-0564.cff
1295 font-0574.cff
23496 font-0579.cff
24759 font-0583.cff
23898 font-0587.cff
25860 font-0591.cff
24847 font-0599.cff

and looking carefully at the file sizes (first column), we see that Times-Roman.cff (the Times-Roman font as extracted by PyMuPDF) is exactly font-0599.cff, a font extracted by mutool. Note that 599 is NOT its object number: there is no font object with such a number in x.pdf. But this is only one of the three pieces (objects) that store parts of Times-Roman!
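The size comparison can be sketched in a few lines of Python. The two dictionaries are copied from the listings above; `match_by_size` is a name made up for this example. Equal size is only a heuristic and does not prove that two files are identical; a byte-for-byte comparison would be stronger.

```python
# File sizes as listed above for the PyMuPDF output.
PYMUPDF_FILES = {
    "CAJCMM+intirr.cff": 24759, "CAJCNM+intiri.cff": 25860,
    "CAJCON+intirsc.cff": 23898, "DFMPPD+Springnew-Regular.cff": 1295,
    "MT2MIS.cff": 217, "MT2MIT.cff": 506, "MT2SYS.cff": 286,
    "Times-Bold.cff": 17076, "Times-BoldItalic.cff": 18332,
    "Times-Italic.cff": 18302, "Times-Roman.cff": 24847,
}

# File sizes as listed above for the mutool extract output.
MUTOOL_FILES = {
    "font-0330.cff": 217, "font-0522.cff": 506, "font-0534.cff": 286,
    "font-0549.cff": 18332, "font-0557.cff": 18302, "font-0558.cff": 17076,
    "font-0561.cff": 25078, "font-0564.cff": 26087, "font-0574.cff": 1295,
    "font-0579.cff": 23496, "font-0583.cff": 24759, "font-0587.cff": 23898,
    "font-0591.cff": 25860, "font-0599.cff": 24847,
}


def match_by_size(named, numbered):
    """Pair files from both listings whose sizes are equal (heuristic)."""
    by_size = {}
    for fname, size in numbered.items():
        by_size.setdefault(size, []).append(fname)
    matches = {name: by_size.get(size, []) for name, size in named.items()}
    matched = {f for files in matches.values() for f in files}
    unmatched = sorted(f for f in numbered if f not in matched)
    return matches, unmatched


matches, unmatched = match_by_size(PYMUPDF_FILES, MUTOOL_FILES)
print(matches["Times-Roman.cff"])  # ['font-0599.cff']
print(unmatched)  # ['font-0561.cff', 'font-0564.cff', 'font-0579.cff']
```

By this measure, three of the 14 files written by mutool have no counterpart of the same size among the 11 files written by PyMuPDF.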

Your configuration (mandatory)

Operating system: Gentoo
Python version: 3.10
PyMuPDF version: 1.21.0-rc2, installation method: generated from source, using installed mupdf 1.21.0.

More precisely:

python -c 'import sys; import fitz; print(sys.version, "\n", sys.platform, "\n", fitz.__doc__)'

3.10.0 (default, Feb 11 2022, 00:50:04) [GCC 11.2.0] 
 linux 
 
PyMuPDF 1.21.0rc2: Python bindings for the MuPDF 1.21.0 library.
Version date: 2022-11-07 00:00:01.
Built for Python 3.10 on linux (64-bit).