
Fix UnicodeEncodeError on file encoding detection #274

Merged
merged 1 commit into python-babel:master on Jan 4, 2016

Conversation

imankulov
Contributor

If the first line of a Python file is not a valid latin-1 string,
parse_encoding dies with a UnicodeDecodeError. Such strings can nonetheless be
valid in some scenarios (for example, the Mako extractor uses
babel.messages.extract.extract_python), and it makes more sense to ignore the
exception and return None.

@codecov-io

Current coverage is 84.21%

Merging #274 into master will not affect coverage as of a3950c0

@@            master    #274   diff @@
======================================
  Files           22      22       
  Stmts         3770    3770       
  Branches         0       0       
  Methods          0       0       
======================================
  Hit           3175    3175       
  Partial          0       0       
  Missed         595     595       

Review entire Coverage Diff as of a3950c0


@etanol
Contributor

etanol commented Oct 15, 2015

Do you have a working example where this happens? Any byte sequence can be decoded as Latin-1. The decoded result may not make sense, but decoding Latin-1 to Unicode never fails.
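That claim is easy to check directly; a quick standalone sketch (plain Python 3, not code from this PR):

```python
# Latin-1 maps every byte 0x00-0xFF to the Unicode code point with the
# same value, so decoding arbitrary bytes can never raise.
data = bytes(range(256))           # every possible byte value
text = data.decode("latin-1")
assert len(text) == 256            # all 256 bytes decoded

# ASCII, by contrast, rejects any byte >= 0x80:
try:
    b"\xd1\x8f".decode("ascii")    # the UTF-8 bytes of "я"
except UnicodeDecodeError:
    print("ascii decode failed, latin-1 would not have")
```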

@imankulov
Contributor Author

Oh, to be frank, I misunderstood the source of the problem initially. Of course, when you have a byte sequence (Python 2.x string) and you decode it, you cannot end up with UnicodeEncodeError.

The exception is thrown later on, when we try to parse the resulting object with parser.suite. The thing is, in Python 2.x parser.suite expects a byte sequence (py2.x str), and in Python 3.x it expects a unicode object (py3.x str).

Python 2.x example:

```
$ python
>>> import parser
>>> parser.suite('print("я")')
<parser.st object at 0x7f3150d18090>
>>> parser.suite(u'print("я")')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u044f' in position 7: ordinal not in range(128)
```

Python 3.x example:

```
$ python3
>>> import parser
>>> parser.suite('print("я")')
<parser.st object at 0x7f1f222e3f90>
>>> parser.suite('print("я")'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: suite() argument 1 must be str, not bytes
```

As it turned out, to support both 2.x and 3.x properly we have to switch between py3.x str and py2.x str depending on the interpreter version. The current solution is to "die quietly" if the first line contains non-ASCII symbols. As I feel it, this is "good enough" to cover almost everything. A non-ASCII first line could theoretically also contain encoding information, like # coding: utf-8 😈, but do we really have to handle that case properly?
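The "die quietly" fallback described above can be sketched roughly as follows. This is a hypothetical `parse_encoding`, not the actual babel implementation (and the stdlib `parser` module shown earlier was removed in Python 3.10, so the sketch only looks for a PEP 263 coding comment):

```python
import re

# Simplified PEP 263 magic-comment pattern, for illustration only
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def parse_encoding(first_line):
    """Return the encoding declared on the first line, or None.

    Instead of letting a Unicode error propagate to the caller,
    give up quietly and return None so the caller can fall back
    to a default charset.
    """
    try:
        first_line.decode("ascii")
    except UnicodeDecodeError:
        return None  # non-ASCII first line: "die quietly"
    match = CODING_RE.search(first_line)
    return match.group(1).decode("ascii") if match else None
```

With this sketch, `parse_encoding(b"# -*- coding: utf-8 -*-\n")` returns `"utf-8"`, while a first line containing the raw UTF-8 bytes of "я" returns None instead of raising.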

imankulov added a commit to imankulov/mako that referenced this pull request Oct 23, 2015
If mako templates contain something like "_('Köln')", the babel extractor converts
it to pure ASCII, so the resulting .po file would contain "K\xf6ln". Not all
translation tools and translators are ready for that kind of escape sequence.

Babel allows message ids to be non-ASCII; the plugin just has to return Unicode
objects instead of ASCII strings (and that's exactly how Babel's built-in Python
and JavaScript extractors work).

This fix ensures the mako extractor doesn't escape non-ASCII symbols, works well
for both Unicode and non-Unicode input (there is a test for cp1251 encoding),
and also provides a workaround for the babel charset detector, python-babel/babel#274.
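The escaping problem the commit message describes can be illustrated standalone (plain Python, not mako or babel code):

```python
# A non-ASCII message id, as extracted from a template
msgid = "Köln"

# Escaping it to pure ASCII is what translators would have seen:
escaped = msgid.encode("unicode_escape").decode("ascii")
assert escaped == "K\\xf6ln"   # backslash-escape soup in the .po file

# Returning the Unicode object keeps the .po entry readable:
assert msgid == "K\u00f6ln"
```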
@akx
Member

akx commented Jan 3, 2016

@imankulov Can you rebase this, please?

@imankulov
Contributor Author

@akx Done.

@akx
Member

akx commented Jan 4, 2016

Thank you, this looks good!

akx added a commit that referenced this pull request Jan 4, 2016
Fix UnicodeEncodeError on file encoding detection
@akx akx merged commit 07aa84f into python-babel:master Jan 4, 2016
@imankulov
Contributor Author

@akx Awesome! Thanks a lot for merging it.

@akx akx added this to the Babel 2.3 milestone Jan 4, 2016
@pyup-bot pyup-bot mentioned this pull request Apr 11, 2017