
LookupError: unknown encoding: utf16-le #6054

Closed
hroncok opened this issue Nov 30, 2018 · 8 comments · Fixed by #6311
Labels
  • auto-locked (Outdated issues that have been locked by automation)
  • C: encoding (Related to text encoding and likely, UnicodeErrors)
  • type: bug (A confirmed bug or unintended behavior)

Comments

hroncok commented Nov 30, 2018

Environment

  • pip version: 18.1
  • Python version: 3.7.1
  • OS: Fedora 30 s390x

This bug manifests on a big-endian architecture when the tests are run.
However, it can be examined on little-endian systems as well.

Description

This is the test failure on s390x:

=================================== FAILURES ===================================
____________________ TestEncoding.test_auto_decode_utf16_le ____________________
self = <tests.unit.test_utils.TestEncoding object at 0x3ff9cb5b5c0>
    def test_auto_decode_utf16_le(self):
        data = (
            b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
>       assert auto_decode(data) == "Django==1.4.2"
tests/unit/test_utils.py:459: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
data = '\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'
    def auto_decode(data):
        """Check a bytes string for a BOM to correctly detect the encoding
    
        Fallback to locale.getpreferredencoding(False) like open() on Python3"""
        for bom, encoding in BOMS:
            if data.startswith(bom):
>               return data[len(bom):].decode(encoding)
E               LookupError: unknown encoding: utf16-le
src/pip/_internal/utils/encoding.py:25: LookupError

Expected behavior

The tests should pass on all architectures alike.

How to Reproduce

  1. Get a big endian machine (virtualize maybe?)
  2. Run the tests.

More info

I've checked and pip has:

BOMS = [
    (codecs.BOM_UTF8, 'utf8'),
    (codecs.BOM_UTF16, 'utf16'),
    (codecs.BOM_UTF16_BE, 'utf16-be'),
    (codecs.BOM_UTF16_LE, 'utf16-le'),
    (codecs.BOM_UTF32, 'utf32'),
    (codecs.BOM_UTF32_BE, 'utf32-be'),
    (codecs.BOM_UTF32_LE, 'utf32-le'),
]

And:

for bom, encoding in BOMS:
    if data.startswith(bom):
        return data[len(bom):].decode(encoding)

So there are two problems here:

  • Why does this fail on big-endian architectures only, and not on all of them?
  • pip tries to use nonexistent encodings.

I have a small reproducer here (run on my machine, x86_64):

>>> from pip._internal.utils.encoding import BOMS
>>> for bom, encoding in BOMS:
...     print(bom, encoding, end=': ')
...     try:
...         _ = ''.encode(encoding)
...         print('ok')
...     except Exception as e:
...         print(type(e), e)
... 
b'\xef\xbb\xbf' utf8: ok
b'\xff\xfe' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\xff\xfe\x00\x00' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

This is the output on s390x:

b'\xef\xbb\xbf' utf8: ok
b'\xfe\xff' utf16: ok
b'\xfe\xff' utf16-be: <class 'LookupError'> unknown encoding: utf16-be
b'\xff\xfe' utf16-le: <class 'LookupError'> unknown encoding: utf16-le
b'\x00\x00\xfe\xff' utf32: ok
b'\x00\x00\xfe\xff' utf32-be: <class 'LookupError'> unknown encoding: utf32-be
b'\xff\xfe\x00\x00' utf32-le: <class 'LookupError'> unknown encoding: utf32-le

Clearly, the utf16-be, utf16-le, utf32-be and utf32-le encodings cannot be used at all.
Is that expected? Should the code never reach them anyway?

The testing bytestring is:

b'\xff\xfeD\x00j\x00a\x00n\x00g\x00o\x00=\x00=\x001\x00.\x004\x00.\x002\x00'

It starts with \xff\xfe and hence should be decoded by the first encoding in the list that has this BOM. On little endian, that is utf16: everything works, and we never reach the nonexistent encodings.

However, on a big-endian system, the utf16 BOM is big endian, and hence the first entry with the \xff\xfe BOM is utf16-le, which blows up.
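This can be confirmed from the interpreter: codecs.BOM_UTF16 is an alias for the native-byte-order BOM, while the _LE and _BE constants are fixed, so the same BOMS list matches different entries depending on the machine it runs on:

```python
import codecs
import sys

# BOM_UTF16 follows the native byte order of the running machine,
# while the _LE and _BE constants are fixed byte strings.
print(sys.byteorder)                            # 'little' or 'big'
print(codecs.BOM_UTF16_LE)                      # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)                      # b'\xfe\xff'
print(codecs.BOM_UTF16 == codecs.BOM_UTF16_LE)  # True on little endian
```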

To reproduce this problem on little-endian architectures, add a test_auto_decode_utf16_be test with:

    def test_auto_decode_utf16_be(self):
        data = (
            b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
            b'=\x001\x00.\x004\x00.\x002\x00'
        )
        assert auto_decode(data) == "Django==1.4.2"
>>> data = (
...     b'\xfe\xffD\x00j\x00a\x00n\x00g\x00o\x00=\x00'
...     b'=\x001\x00.\x004\x00.\x002\x00'
... )
>>> from pip._internal.utils.encoding import auto_decode
>>> auto_decode(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/pip/_internal/utils/encoding.py", line 25, in auto_decode
    return data[len(bom):].decode(encoding)
LookupError: unknown encoding: utf16-be
@pradyunsg pradyunsg added S: needs triage Issues/PRs that need to be triaged type: bug A confirmed bug or unintended behavior C: encoding Related to text encoding and likely, UnicodeErrors labels Dec 14, 2018
hroncok commented Feb 28, 2019

Still happens on 19.x.

cjerdonek commented:

It looks like that code was added in PR #3485. @xavfernandez, can you take a look?

cjerdonek commented:

It looks like the fix might be as simple as changing utf16-be to utf-16-be and similarly for the others.

There should be a regression test to iterate over the BOMS list and check that its entries are valid.
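A minimal sketch of that suggestion, assuming nothing about the exact diff that eventually landed in #6311: spell each encoding with the canonical hyphenated names that Python's codec registry actually knows, and add a regression test that iterates over BOMS:

```python
import codecs

# Sketch of the suggested fix: use canonical encoding names
# (e.g. 'utf-16-be' rather than the unregistered 'utf16-be').
BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF32, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
]

def test_all_encodings_are_valid():
    # Regression test: every name in BOMS must resolve in the codec registry,
    # otherwise auto_decode() can raise LookupError instead of decoding.
    for bom, encoding in BOMS:
        codecs.lookup(encoding)

test_all_encodings_are_valid()
```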

hroncok commented Mar 1, 2019

Indeed, utf-16-be seems to exist.

hroncok commented Mar 1, 2019

I'll submit a PR with the fix and regression test.

hroncok commented Mar 1, 2019

#6311

pfmoore commented Mar 1, 2019

The table of aliases here would seem to confirm that utf16-be isn't a valid alias (although utf-16be is...)
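For the record, only names in the codec registry's alias table resolve; a quick check in a stock CPython interpreter (the exact set of registered aliases is an assumption about the standard library, not about pip) shows which spellings work:

```python
import codecs

# These spellings all resolve to the same codec via registered aliases.
for name in ('utf-16-be', 'utf-16be', 'utf_16_be', 'UTF-16BE'):
    print(name, '->', codecs.lookup(name).name)   # all 'utf-16-be'

# ...but 'utf16-be' is not a registered alias and raises LookupError.
try:
    codecs.lookup('utf16-be')
except LookupError as exc:
    print(exc)   # unknown encoding: utf16-be
```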

@cjerdonek cjerdonek removed the S: needs triage Issues/PRs that need to be triaged label Mar 1, 2019
hroncok added a commit to hroncok/pip that referenced this issue Mar 1, 2019
utils.encoding.auto_decode() was broken when decoding Big Endian BOM
byte-strings on Little Endian or vice versa.

The TestEncoding.test_auto_decode_utf_16_le test was failing on Big Endian
systems, such as Fedora's s390x builders. A similar test, but with BE BOM
test_auto_decode_utf_16_be was added in order to reproduce this on a Little
Endian system (which is much easier to come by).

A regression test was added to check that all listed encodings in
utils.encoding.BOMS are valid.

Fixes pypa#6054
lock bot commented May 28, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
