Decode error when trying to get drawings

Describe the bug (mandatory)

Starting with version 1.22.0, I'm seeing the following exception when calling page.get_drawings() on one of our PDF files.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 0: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<...>/pdf_test.py", line 67, in <module>
    main()
  File "<...>/pdf_test.py", line 60, in main
    page.get_cdrawings()
  File "<...>/lib/python3.9/site-packages/fitz/fitz.py", line 6612, in get_cdrawings
    val = _fitz.Page_get_cdrawings(self, extended, callback, method)
SystemError: <built-in function Page_get_cdrawings> returned a result with an error set

But I do not get any error with previous versions like 1.21.1.
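Since the traceback shows the SystemError was raised "from" a UnicodeDecodeError, one way to narrow this down without sharing the file might be to walk the exception chain and look at the bytes attached to the decode error. The sketch below uses only standard Python exception attributes (`__cause__`, `__context__`, and `UnicodeDecodeError.object` / `.start` / `.end`). The names `find_decode_error`, `describe` and `failing_call` are hypothetical helpers for illustration, and `failing_call` is only a stand-in for `page.get_cdrawings()`; I am assuming the real exception carries its cause the same way.

```python
def find_decode_error(exc):
    """Walk the exception chain and return the first UnicodeDecodeError, or None."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        if isinstance(exc, UnicodeDecodeError):
            return exc
        # Prefer the explicit cause ("raise ... from ..."), else the implicit context.
        exc = exc.__cause__ or exc.__context__
    return None


def describe(err, context=16):
    """Summarise which bytes failed to decode and what surrounds them."""
    lo = max(0, err.start - context)
    return {
        "encoding": err.encoding,
        "reason": err.reason,
        "offset": err.start,
        "bad": err.object[err.start:err.end],
        "around": err.object[lo:err.end + context],
    }


def failing_call():
    """Hypothetical stand-in for page.get_cdrawings(); simulates the chained error."""
    try:
        b"\x90abc".decode("utf-8")
    except UnicodeDecodeError as e:
        raise SystemError("returned a result with an error set") from e


try:
    failing_call()  # in the real case: page.get_cdrawings()
except SystemError as e:
    cause = find_decode_error(e)
    if cause is not None:
        print(describe(cause))
```

If the real call behaves the same way, the "around" bytes should show which string was being decoded, and that may be enough to post here without the PDF itself.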

To Reproduce (mandatory)

I'm a bit stuck here: unfortunately I cannot share the PDF in question because it is sensitive, and I have been struggling to create a new PDF that reproduces the issue.

Is there any chance you could provide some guidance on how to isolate the drawing issue?

So far I have tried copying the failing drawing content stream into a new PDF using version 1.21.1, so that I could post it here, but the newly created PDF shows no issue with 1.22.0+.

Here is my script for copying the stream:

import fitz

doc = fitz.open(fp)  # fp: path to the original (sensitive) PDF
page = doc[0]
xref_content = page.get_contents()
# >> in this case = [4]

stream = doc.xref_stream(xref_content[0])
# >> returning bytes: b' BT /F2 11.000 Tf ET\n1.000 g\n0.000 G\n/GS1 gs\n0.567 w\n<...>'
# the problem is with b'\xac' which can't be decoded with utf-8

page.get_cdrawings()
print(stream)

new_doc = fitz.open()
new_page = new_doc.new_page(width=page.rect.width, height=page.rect.height)

# create a dummy drawing to overwrite with the failing one
shape = new_page.new_shape()
shape.draw_line((10, 10), (15, 15))
shape.finish()
shape.commit()

# overwrite the dummy drawing with the failing one
new_xref = new_page.get_contents()[0]
new_doc.update_stream(new_xref, stream, compress=True)
new_doc.save("new_doc.pdf")
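To pin down exactly which bytes in the stream are not valid UTF-8, something like the following could be run on the bytes returned by `doc.xref_stream(...)`. It is a minimal sketch using only the standard library; `find_non_utf8` is a hypothetical helper name, and the sample bytes are made up for illustration (they are not from the real file).

```python
def find_non_utf8(data, context=16):
    """Return (offset, bad_bytes, surrounding_bytes) for each undecodable run."""
    hits = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice, so shift by pos.
            start = pos + err.start
            end = pos + err.end
            around = data[max(0, start - context):end + context]
            hits.append((start, data[start:end], around))
            pos = end  # resume scanning after the bad bytes
    return hits


# Made-up sample resembling a content stream with two stray bytes.
sample = b"0.567 w\n(\x90abc) Tj\n\xac"
for offset, bad, around in find_non_utf8(sample):
    print(offset, bad, around)
```

With the real file, `find_non_utf8(stream)` would list every offending offset along with a few neighbouring bytes, which may show which operator or string they belong to.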

Expected behavior (optional)

Since getting the drawings works in versions prior to 1.22.0, I would expect it to work in newer versions as well.

Screenshots (optional)

Not sure if that can help, but here is a cropped screenshot of the drawing stream bytes:

[image: cropped screenshot of the drawing stream bytes]

Your configuration (mandatory)

For example, the output of print(sys.version, "\n", sys.platform, "\n", fitz.__doc__) would be sufficient (for the first two bullets).

3.9.13 (main, Sep  8 2022, 09:21:48)
[GCC 9.4.0]
 linux

PyMuPDF 1.22.0: Python bindings for the MuPDF 1.22.0 library.
Version date: 2023-04-14 00:00:01.
Built for Python 3.9 on linux (64-bit).

Installed via pip install pymupdf==1.22.0