Unicode errors when locale.getpreferredencoding() is ASCII #123

Closed
kalikaneko opened this issue Feb 10, 2013 · 9 comments

@kalikaneko

When the locale's encoding is set to ASCII (for example, inside a freshly created Debian chroot), the test suite raises uncaught UnicodeDecodeErrors.

In [1]: from locale import getpreferredencoding
In [2]: getpreferredencoding()
Out[2]: 'ANSI_X3.4-1968'
..............................................E............................E...
======================================================================
ERROR: test_non_ascii_error (test.Basic)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/sh-1.08/test.py", line 1276, in test_non_ascii_error
    self.assertRaises(ErrorReturnCode, ls, test)
  File "/usr/lib/python2.6/unittest.py", line 336, in failUnlessRaises
    callableObj(*args, **kwargs)
  File "/tmp/sh-1.08/sh.py", line 730, in __call__
    return RunningCommand(cmd, call_args, stdin, stdout, stderr)
  File "/tmp/sh-1.08/sh.py", line 291, in __init__
    self.wait()
  File "/tmp/sh-1.08/sh.py", line 295, in wait
    self._handle_exit_code(self.process.wait())
  File "/tmp/sh-1.08/sh.py", line 309, in _handle_exit_code
    self.process.stderr
  File "/tmp/sh-1.08/sh.py", line 121, in __init__
    (full_cmd, tstdout.decode(DEFAULT_ENCODING), tstderr.decode(DEFAULT_ENCODING))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 24: ordinal not in range(128)

======================================================================
ERROR: test_unicode_arg (test.Basic)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/sh-1.08/test.py", line 60, in test_unicode_arg
    p = echo(test).strip()
  File "/tmp/sh-1.08/sh.py", line 389, in __getattr__
    return getattr(unicode(self), p)
  File "/tmp/sh-1.08/sh.py", line 375, in __unicode__
    self.call_args["decode_errors"])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)

----------------------------------------------------------------------
Ran 79 tests in 24.988s

FAILED (errors=2)
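
Both tracebacks come down to the same operation: UTF-8 bytes from a child process are decoded with the ASCII codec. The following is a minimal Python 3 sketch of that failure, not code from sh; the helper name `try_decode` is made up for illustration.

```python
# Minimal sketch (Python 3) of the failure mode in the tracebacks above.
# try_decode is a hypothetical helper, not part of sh.

def try_decode(data, encoding):
    """Decode bytes, returning a short failure note instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: %s" % exc.reason


# "\u00e9" (e with acute accent) encodes to the two bytes 0xc3 0xa9 in UTF-8;
# 0xc3 is the byte named in the first traceback.
raw = "\u00e9".encode("utf-8")

if __name__ == "__main__":
    print(try_decode(raw, "utf-8"))
    print(try_decode(raw, "ascii"))
```

The first call succeeds; the second hits the same "ordinal not in range(128)" condition reported by the test suite.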
@kalikaneko
Author

Here's a rough patch for this issue (against the 0.8 release):

--- a/sh.py
+++ b/sh.py
@@ -117,8 +117,14 @@
             if err_delta: 
                 tstderr += ("... (%d more, please see e.stderr)" % err_delta).encode()

-        msg = "\n\n  RAN: %r\n\n  STDOUT:\n%s\n\n  STDERR:\n%s" %\
-            (full_cmd, tstdout.decode(DEFAULT_ENCODING), tstderr.decode(DEFAULT_ENCODING))
+        try:
+            msg = "\n\n  ran: %r\n\n  stdout:\n%s\n\n  stderr:\n%s" %\
+                (full_cmd, tstdout.decode(DEFAULT_ENCODING),
+                    tstderr.decode(DEFAULT_ENCODING))
+        except UnicodeDecodeError:
+            msg = "\n\n  ran: %r\n\n  stdout:\n%s\n\n  stderr:\n%s" %\
+                (full_cmd, tstdout.decode('utf-8'), tstderr.decode('utf-8'))
+
         super(ErrorReturnCode, self).__init__(msg)


@@ -371,8 +377,12 @@

     def __unicode__(self):
         if self.process and self.stdout:
-            return self.stdout.decode(self.call_args["encoding"],
-                self.call_args["decode_errors"])
+            try:
+                return self.stdout.decode(self.call_args["encoding"],
+                    self.call_args["decode_errors"])
+            except UnicodeDecodeError:
+                return self.stdout.decode('utf-8',
+                    self.call_args["decode_errors"])
         return ""

     def __eq__(self, other):
@@ -561,7 +571,11 @@
             # if the argument is already unicode, or a number or whatever,
             # this first call will fail.  
             try: arg = unicode(arg, DEFAULT_ENCODING).encode(DEFAULT_ENCODING)
-            except TypeError: arg = unicode(arg).encode(DEFAULT_ENCODING)
+            except TypeError:
+                try:
+                    arg = unicode(arg).encode(DEFAULT_ENCODING)
+                except UnicodeEncodeError:
+                    arg = unicode(arg).encode('utf-8')
         return arg


@@ -633,7 +647,11 @@

     def __str__(self):
         if IS_PY3: return self.__unicode__()
-        else: return unicode(self).encode(DEFAULT_ENCODING)
+        else:
+            try:
+                return unicode(self).encode(DEFAULT_ENCODING)
+            except UnicodeEncodeError:
+                return unicode(self).encode('utf-8')

     def __eq__(self, other):
         try: return str(self) == str(other)
--- a/test.py
+++ b/test.py
@@ -1338,9 +1338,9 @@
 import sys
 sys.stdout.write("te漢字st")
 """)
-        fn = partial(python, py.name, _encoding="ascii")
-        def s(fn): str(fn())
-        self.assertRaises(UnicodeDecodeError, s, fn)
+        #fn = partial(python, py.name, _encoding="ascii")
+        #def s(fn): str(fn())
+        #self.assertRaises(UnicodeDecodeError, s, fn)

         p = python(py.name, _encoding="ascii", _decode_errors="ignore")
         self.assertEqual(p, "test")
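
The patch repeats one pattern in four places: try the configured encoding first, and fall back to UTF-8 if that raises. A compact Python 3 sketch of the decoding half of that pattern follows; `decode_with_fallback` and its parameters are illustrative names, not part of sh.

```python
# Sketch of the try-then-fall-back pattern used in the patch above.
# decode_with_fallback is a hypothetical helper, not an sh API.

def decode_with_fallback(data, encoding, errors="strict", fallback="utf-8"):
    """Decode with `encoding`; if that raises, retry with `fallback`."""
    try:
        return data.decode(encoding, errors)
    except UnicodeDecodeError:
        # The retry can still raise if the bytes are not valid in `fallback`.
        return data.decode(fallback, errors)


# Same payload as the test script in the patch: "te" + two CJK characters + "st".
payload = "te\u6f22\u5b57st".encode("utf-8")

if __name__ == "__main__":
    print(decode_with_fallback(payload, "ascii"))
    print(decode_with_fallback(payload, "ascii", errors="ignore"))
```

With `errors="ignore"` the first attempt never raises, so the fallback is not used and the non-ASCII bytes are dropped. With the default strict handling, an ASCII request silently becomes a UTF-8 decode, which is why the patch has to comment out the test that expected a UnicodeDecodeError.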

@amoffat
Owner

amoffat commented Feb 20, 2013

@kalikaneko could you go ahead and test the code on the dev branch? I had a hard time getting the locale to be respected as ascii on my machine.

@abadger
Contributor

abadger commented Apr 13, 2013

If you're running on any Linux machine, the easiest way to get an ASCII locale is:

$ export LC_ALL=C
$ run_tests
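
To check what a Python interpreter actually reports under that setting, something like the sketch below may help. It is an illustration under assumptions, not part of sh or its test suite: the helper name is made up, and the two `PYTHON*` variables are set only so that Python 3.7 and later do not switch to UTF-8 on their own in the C locale. The reported name varies by platform.

```python
# Sketch: ask a child interpreter for its preferred encoding under a given LC_ALL.
# preferred_encoding_under is a hypothetical helper name.
import os
import subprocess
import sys


def preferred_encoding_under(locale_name):
    """Return locale.getpreferredencoding() as seen by a child Python process."""
    env = dict(os.environ, LC_ALL=locale_name)
    # Python 3.7+ may enable UTF-8 mode or coerce the C locale; opt out of both.
    env["PYTHONUTF8"] = "0"
    env["PYTHONCOERCECLOCALE"] = "0"
    code = "import locale; print(locale.getpreferredencoding())"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    # Encoding names are plain ASCII, so decoding the output this way is safe.
    return out.decode("ascii").strip()


if __name__ == "__main__":
    print(preferred_encoding_under("C"))
```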

@abadger
Contributor

abadger commented Apr 13, 2013

Still fails with master. Trying dev branch now.

@abadger
Contributor

abadger commented Apr 13, 2013

Heh. The dev branch fails the unit tests even with LC_ALL=en_US.utf8. With LC_ALL=C I get two tracebacks and then the unit tests hang, so there might be more failures after those two.

@abadger
Contributor

abadger commented Apr 13, 2013

Tracebacks from the two dev branch runs (LC_ALL=C was first): http://paste.fedoraproject.org/7380/65870359/

@amoffat
Owner

amoffat commented Apr 17, 2013

I'm wondering if the source of the problem is really line 51:

DEFAULT_ENCODING = getpreferredencoding() or "utf-8"

Does it make sense to use the user's default system encoding for a script they (or someone else) may have written with utf-8? Should we always just assume utf-8?
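
One thing worth noting about that line: the `or "utf-8"` only applies when `getpreferredencoding()` returns something falsy. Under an ASCII locale it returns a non-empty name such as 'ANSI_X3.4-1968', so ASCII is kept. The sketch below shows one possible policy (treat ASCII as "unknown" and use UTF-8 instead); it is an assumption for discussion, not what sh does, and `pick_default_encoding` is a made-up name.

```python
# Sketch of one possible policy: promote an ASCII locale encoding to UTF-8.
# pick_default_encoding is hypothetical; it is not how sh defines DEFAULT_ENCODING.
import codecs
from locale import getpreferredencoding


def pick_default_encoding(preferred=None):
    """Return a codec name, substituting utf-8 for missing, unknown, or ASCII encodings."""
    if preferred is None:
        preferred = getpreferredencoding()
    if not preferred:
        return "utf-8"
    try:
        # codecs.lookup resolves aliases, e.g. 'ANSI_X3.4-1968' -> 'ascii'.
        canonical = codecs.lookup(preferred).name
    except LookupError:
        return "utf-8"
    return "utf-8" if canonical == "ascii" else canonical


if __name__ == "__main__":
    # The plain `or` keeps the ASCII alias because the string is non-empty.
    print("ANSI_X3.4-1968" or "utf-8")
    print(pick_default_encoding("ANSI_X3.4-1968"))
```

Whether always assuming UTF-8 is the right trade-off is exactly the open question here; the sketch only makes the two behaviours easy to compare.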

@abadger
Contributor

abadger commented Apr 17, 2013

It depends on what DEFAULT_ENCODING is being used for. When interpreting arguments, filenames, and other things that are going to be passed on to subprocess, it probably makes sense to just use bytes on Python 2 (I'm not sure about Python 3; I'll have to do some experimenting to see what gets handled as bytes and what gets handled as strings). When displaying that output to the user afterwards, it might make sense to use repr(byte_string) or something else that would display in the encoding of the user's locale but not raise a traceback if that encoding can't represent the byte sequence.

The problem is, of course, that sh deals with a lot of things that come from outside of Python: from the C world. In that world, strings are sequences of bytes, and many of them have no encoding associated with them. Because of that, there is often a need to use several different strategies depending on what the code is attempting to achieve at the time.
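
As a rough illustration of the "display without a traceback" strategy, the Python 3 sketch below decodes for display with the `backslashreplace` error handler (available for decoding since Python 3.5), which behaves much like the repr idea: undecodable bytes show up as escapes instead of raising. The function name `for_display` is hypothetical.

```python
# Sketch: render bytes for display without ever raising UnicodeDecodeError.
# for_display is a hypothetical helper, not part of sh.

def for_display(data, encoding):
    """Decode bytes for display; undecodable bytes become \\xNN escapes."""
    return data.decode(encoding, "backslashreplace")


if __name__ == "__main__":
    raw = b"caf\xc3\xa9"  # UTF-8 bytes for "caf" + e with acute accent
    print(for_display(raw, "utf-8"))
    print(for_display(raw, "ascii"))
```

The raw bytes stay untouched for anything passed on to a subprocess; only the human-readable rendering is lossy.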

@amoffat
Owner

amoffat commented Sep 8, 2013

By default, the Python sh.py test suite runs for all Python versions, with both the C locale and the en_US.UTF-8 locale, and all tests pass on my test machines.
