LookupError: unknown encoding: utf16-le · Issue #6054 · pypa/pip
Environment
- pip version: 18.1
- Python version: 3.7.1
- OS: Fedora 30 s390x
This bug manifests itself on a big-endian architecture when the tests are run.
However, it can be examined on little-endian as well.
Description
This is the test failure on s390x:
=================================== FAILURES ===================================
____________________ TestEncoding.test_auto_decode_utf16_le ____________________
self = <tests.unit.test_utils.TestEncoding object at 0x3ff9cb5b5c0>
    def test_auto_decode_utf16_le(self):
        data = (
            b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
>       assert auto_decode(data) == "Django==1.4.2"
tests/unit/test_utils.py:459:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
data = '\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'
    def auto_decode(data):
        """Check a bytes string for a BOM to correctly detect the encoding
        Fallback to locale.getpreferredencoding(False) like open() on Python3"""
        for bom, encoding in BOMS:
            if data.startswith(bom):
>               return data[len(bom):].decode(encoding)
E               LookupError: unknown encoding: utf16-le
src/pip/_internal/utils/encoding.py:25: LookupError
Expected behavior
The tests should pass on all architectures alike.
How to Reproduce
- Get a big endian machine (virtualize maybe?)
- Run the tests.
More info
I've checked and pip has:
BOMS = [
    (codecs.BOM_UTF8, 'utf8'),
    (codecs.BOM_UTF16, 'utf16'),
    (codecs.BOM_UTF16_BE, 'utf16-be'),
    (codecs.BOM_UTF16_LE, 'utf16-le'),
    (codecs.BOM_UTF32, 'utf32'),
    (codecs.BOM_UTF32_BE, 'utf32-be'),
    (codecs.BOM_UTF32_LE, 'utf32-le'),
]
And:
for bom, encoding in BOMS:
    if data.startswith(bom):
        return data[len(bom):].decode(encoding)
So this has 2 problems:
- Why does this fail only on a big-endian architecture and not on all architectures?
- pip tries to use nonexistent encodings.
I have a small reproducer here (run on my machine, x86_64):
>>> from pip._internal.utils.encoding import BOMS
>>> for bom, encoding in BOMS:
...     print(bom, encoding, end=': ')
...     try:
...         _ = ''.encode(encoding)
...         print('ok')
...     except Exception as e:
...         print(type(e), e)
...
b'\xef\xbb\xbf' utf8: ok
b'\xff\xfe' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\xff\xfe\x00\x00' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le
This is the output on s390x:
b'\xef\xbb\xbf' utf8: ok
b'\xfe\xff' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\x00\x00\xfe\xff' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le
Clearly, the utf16-be, utf16-le, utf32-be and utf32-le encodings cannot be used at all.
Is that expected? The code should not reach those anyway?
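One way to check which spellings Python's codec registry accepts is codecs.lookup, which raises LookupError for an unknown name. The following is only a minimal sketch: the helper name is_known_encoding is hypothetical and not part of pip. It checks the hyphenated spellings (utf-16-le and friends), which are the names listed in the Python documentation for these codecs.

```python
import codecs


def is_known_encoding(name):
    """Return True if Python's codec registry recognises `name`.

    Hypothetical helper for illustration; not part of pip.
    """
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# Hyphenated spellings as listed in the Python codecs documentation.
for name in ("utf-8", "utf-16", "utf-16-be", "utf-16-le",
             "utf-32", "utf-32-be", "utf-32-le"):
    print(name, is_known_encoding(name))
```

Running the same helper over the names in BOMS would show which entries can never be decoded, independent of the machine's byte order.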
The testing bytestring is:
b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'
It starts with \xff\xfe and hence should be decoded by the first encoding in BOMS that has this BOM. On little endian, that is utf16: everything works, because the nonexistent encodings are never reached.
However, on a big-endian system, the utf16 BOM is big endian (\xfe\xff), so the first item with the \xff\xfe BOM is utf16-le, and it blows up.
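The byte-order dependence can be simulated on a single machine. The sketch below is an illustration under stated assumptions: first_match is a hypothetical helper, and its byteorder parameter stands in for sys.byteorder so that the native BOM (what codecs.BOM_UTF16 and codecs.BOM_UTF32 resolve to) can be swapped. The table copies the order and spellings of pip's BOMS shown above.

```python
import codecs
import sys


def first_match(data, byteorder):
    """Return the first encoding name whose BOM prefixes `data`.

    `byteorder` ("little" or "big") simulates sys.byteorder, which decides
    what the native BOM constants resolve to. Hypothetical helper.
    """
    little = byteorder == "little"
    native16 = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
    native32 = codecs.BOM_UTF32_LE if little else codecs.BOM_UTF32_BE
    boms = [
        (codecs.BOM_UTF8, 'utf8'),
        (native16, 'utf16'),
        (codecs.BOM_UTF16_BE, 'utf16-be'),
        (codecs.BOM_UTF16_LE, 'utf16-le'),
        (native32, 'utf32'),
        (codecs.BOM_UTF32_BE, 'utf32-be'),
        (codecs.BOM_UTF32_LE, 'utf32-le'),
    ]
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None


data = b'\xff\xfeD\x00j\x00'
print(sys.byteorder)                # byte order of the machine running this
print(first_match(data, "little"))  # utf16
print(first_match(data, "big"))     # utf16-le
```

The same input therefore selects a working name on little endian and a failing one on big endian, which matches the test failure seen only on s390x.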
To reproduce this problem on little-endian architectures, add a test_auto_decode_utf16_be test with:
    def test_auto_decode_utf16_be(self):
        data = (
            b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
        assert auto_decode(data) == "Django==1.4.2"
>>> data = (
...     b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
...     b'=\x001\x00.\x004\x00.\x002\x00'
... )
>>> from pip._internal.utils.encoding import auto_decode
>>> auto_decode(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/encoding.py", line 25, in auto_decode
    return data[len(bom):].decode(encoding)
LookupError: unknown encoding: utf16-be
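For comparison, here is a rough sketch of how the lookup might behave with the hyphenated codec names. This is an assumption for illustration, not pip's actual code or its eventual fix: BOMS_FIXED and auto_decode_sketch are hypothetical names, and only the spellings differ from the table quoted above. Note that the payloads below are encoded with the byte order that matches their BOM.

```python
import codecs
import locale

# Hypothetical table: same order as pip's BOMS, with hyphenated codec names.
BOMS_FIXED = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF32, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
]


def auto_decode_sketch(data):
    """Decode bytes by BOM, falling back to the locale's preferred encoding."""
    for bom, encoding in BOMS_FIXED:
        if data.startswith(bom):
            # Strip the BOM and decode the rest with the matching codec.
            return data[len(bom):].decode(encoding)
    return data.decode(locale.getpreferredencoding(False))


text = "Django==1.4.2"
le_data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(auto_decode_sketch(le_data))
print(auto_decode_sketch(be_data))
```

With every name resolvable, whichever entry matches first decodes successfully, so the result no longer depends on the byte order of the machine running the tests.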