LookupError: unknown encoding: utf16-le · Issue #6054 · pypa/pip

Environment

pip version: 18.1
Python version: 3.7.1
OS: Fedora 30 s390x

This bug manifests itself when the tests are run on a big-endian architecture.
However, it can be examined on little-endian machines as well.

Description

This is the test failure on s390x:

=================================== FAILURES ===================================
____________________ TestEncoding.test_auto_decode_utf16_le ____________________
self = <tests.unit.test_utils.TestEncoding object at 0x3ff9cb5b5c0>
    def test_auto_decode_utf16_le(self):
        data = (
            b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
>       assert auto_decode(data) == "Django==1.4.2"
tests/unit/test_utils.py:459: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
data = '\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'
    def auto_decode(data):
        """Check a bytes string for a BOM to correctly detect the encoding
    
        Fallback to locale.getpreferredencoding(False) like open() on Python3"""
        for bom, encoding in BOMS:
            if data.startswith(bom):
>               return data[len(bom):].decode(encoding)
E               LookupError: unknown encoding: utf16-le
src/pip/_internal/utils/encoding.py:25: LookupError

Expected behavior

The tests should pass on all architectures alike.

How to Reproduce

Get a big endian machine (virtualize maybe?)
Run the tests.

More info

I've checked and pip has:

BOMS = [
    (codecs.BOM_UTF8, 'utf8'),
    (codecs.BOM_UTF16, 'utf16'),
    (codecs.BOM_UTF16_BE, 'utf16-be'),
    (codecs.BOM_UTF16_LE, 'utf16-le'),
    (codecs.BOM_UTF32, 'utf32'),
    (codecs.BOM_UTF32_BE, 'utf32-be'),
    (codecs.BOM_UTF32_LE, 'utf32-le'),
]

And:

for bom, encoding in BOMS:
    if data.startswith(bom):
        return data[len(bom):].decode(encoding)

So this has 2 problems:

Why does this fail only on a big-endian architecture and not on all of them?
pip tries to use nonexistent encodings.

I have a small reproducer here (run on my machine, x86_64):

>>> from pip._internal.utils.encoding import BOMS
>>> for bom, encoding in BOMS:
...     print(bom, encoding, end=': ')
...     try:
...         _ = ''.encode(encoding)
...         print('ok')
...     except Exception as e:
...         print(type(e), e)
...
b'\xef\xbb\xbf' utf8: ok
b'\xff\xfe' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\xff\xfe\x00\x00' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

This is the output on s390x:

b'\xef\xbb\xbf' utf8: ok
b'\xfe\xff' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\x00\x00\xfe\xff' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

Clearly, the utf16-be, utf16-le, utf32-be and utf32-le encodings cannot be used at all.
Is that expected? Is the code never supposed to reach those entries anyway?
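One way to check which spellings Python accepts is to ask `codecs.lookup` directly. The snippet below is only an illustration, not pip's code, and the helper name `resolves` is hypothetical. It compares some of the spellings from the BOMS list with the hyphenated spellings (`utf-16-le` and so on) listed in Python's standard encodings table.

```python
import codecs


def resolves(name):
    """Return the canonical codec name, or None if Python cannot find the codec."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


# Spellings taken from the BOMS list, next to the hyphenated spellings
# from the standard encodings table.
for spelling in ("utf16", "utf16-le", "utf-16-le", "utf32-be", "utf-32-be"):
    print(spelling, "->", resolves(spelling))
```

A `None` result corresponds to the `LookupError` seen in the reproducer above.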

The testing bytestring is:

b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'

It starts with \xff\xfe and hence should be decoded by the first encoding that has this BOM. On little endian, that is utf16: everything works, because we never reach the nonexistent encodings.

However, on a big-endian system the utf16 BOM is big endian, so the first item with the \xff\xfe BOM is utf16-le, and it blows up.
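The difference between the two architectures can be simulated on any machine. The sketch below rebuilds the BOMS list as it would look for each byte order (the helpers `boms_for` and `first_match` are hypothetical names used only here) and reports which entry is hit first for data that starts with \xff\xfe.

```python
import codecs


def boms_for(byteorder):
    """Rebuild the BOMS list from this report as it looks for a given byte order."""
    le = byteorder == "little"
    return [
        (codecs.BOM_UTF8, "utf8"),
        # codecs.BOM_UTF16 / BOM_UTF32 follow the native byte order.
        (codecs.BOM_UTF16_LE if le else codecs.BOM_UTF16_BE, "utf16"),
        (codecs.BOM_UTF16_BE, "utf16-be"),
        (codecs.BOM_UTF16_LE, "utf16-le"),
        (codecs.BOM_UTF32_LE if le else codecs.BOM_UTF32_BE, "utf32"),
        (codecs.BOM_UTF32_BE, "utf32-be"),
        (codecs.BOM_UTF32_LE, "utf32-le"),
    ]


def first_match(data, boms):
    """Return the encoding name of the first entry whose BOM prefixes data."""
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None


data = b"\xff\xfeD\x00j\x00"
print(first_match(data, boms_for("little")))  # utf16
print(first_match(data, boms_for("big")))     # utf16-le
```

On the simulated little-endian list the working `utf16` entry shadows `utf16-le`; on the big-endian list it does not, so the broken name is reached.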

To reproduce this problem on little-endian architectures, add a test_auto_decode_utf16_be test with:

def test_auto_decode_utf16_be(self):
    data = (
        b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
        b'=\x001\x00.\x004\x00.\x002\x00'
    )
    assert auto_decode(data) == "Django==1.4.2"

>>> data = (
...     b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
...     b'=\x001\x00.\x004\x00.\x002\x00'
... )
>>> from pip._internal.utils.encoding import auto_decode
>>> auto_decode(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/encoding.py", line 25, in auto_decode
    return data[len(bom):].decode(encoding)
LookupError: unknown encoding: utf16-be
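For comparison, here is a minimal sketch of the same loop using the hyphenated codec names that Python documents (`utf-16-be`, `utf-16-le`, and so on). It is an assumption about one possible fix, not necessarily what pip should adopt; the names `BOMS_HYPHENATED` and `auto_decode_sketch` are made up for this example, and the list order is kept exactly as in the BOMS list quoted above.

```python
import codecs
import locale

# Same order as the BOMS list quoted in this report; only the spellings differ.
BOMS_HYPHENATED = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF32, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
]


def auto_decode_sketch(data):
    """Decode bytes by BOM, falling back to the locale's preferred encoding."""
    for bom, encoding in BOMS_HYPHENATED:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(locale.getpreferredencoding(False))


text = "Django==1.4.2"
# Build correctly encoded samples for both byte orders.
le_sample = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_sample = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(auto_decode_sketch(le_sample))  # Django==1.4.2
print(auto_decode_sketch(be_sample))  # Django==1.4.2
```

With these spellings, every entry in the list names a codec that exists, so the result no longer depends on which entry happens to match first on a given architecture.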