
Fix UnicodeEncodeError on file encoding detection #274

Merged
merged 1 commit into python-babel:master on Jan 4, 2016

Conversation

imankulov
Contributor

If the first line of a Python file is not a valid latin-1 string,
parse_encoding dies with a UnicodeDecodeError. Such strings can nonetheless be
valid in some scenarios (for example, the Mako extractor uses
babel.messages.extract.extract_python), and it makes more sense to ignore the
exception and return None.

@codecov-io

Current coverage is 84.21%

Merging #274 into master will not affect coverage as of a3950c0

@@            master    #274   diff @@
======================================
  Files           22      22       
  Stmts         3770    3770       
  Branches         0       0       
  Methods          0       0       
======================================
  Hit           3175    3175       
  Partial          0       0       
  Missed         595     595       

Review entire Coverage Diff as of a3950c0


@etanol
Contributor

etanol commented Oct 15, 2015

Do you have a working example where this happens? Any byte sequence can be decoded as Latin-1. The decoded result may not make sense, but decoding Latin-1 to Unicode never fails.
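That claim is easy to check directly; a quick standalone sketch (plain Python 3, not code from this PR):

```python
# Latin-1 maps every byte 0x00-0xFF to the Unicode code point with the
# same value, so decoding arbitrary bytes can never raise.
data = bytes(range(256))           # every possible byte value
text = data.decode("latin-1")
assert len(text) == 256            # all 256 bytes decoded

# ASCII, by contrast, rejects any byte >= 0x80:
try:
    b"\xd1\x8f".decode("ascii")    # the UTF-8 bytes of "я"
except UnicodeDecodeError:
    print("ascii decode failed, latin-1 would not have")
```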

@imankulov
Contributor Author

Oh, to be frank, I misunderstood the source of the problem initially. Of course, when you have a byte sequence (Python 2.x string) and you decode it, you cannot end up with UnicodeEncodeError.

The exception is thrown later on, when we try to parse the resulting object with parser.suite. The thing is, in Python 2.x parser.suite expects a byte sequence (py2.x str), and in Python 3.x it expects a unicode object (py3.x str).

Python 2.x example:

```
$ python
>>> import parser
>>> parser.suite('print("я")')
<parser.st object at 0x7f3150d18090>
>>> parser.suite(u'print("я")')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u044f' in position 7: ordinal not in range(128)
```

Python 3.x example:

```
$ python3
>>> import parser
>>> parser.suite('print("я")')
<parser.st object at 0x7f1f222e3f90>
>>> parser.suite('print("я")'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: suite() argument 1 must be str, not bytes
```

As it turned out, to support both 2.x and 3.x properly we have to switch between py3.x str and py2.x str depending on the interpreter version. The current solution is to "die quietly" if the first line contains non-ASCII symbols. As I feel it, this is "good enough" to cover almost everything. A non-ASCII first line could theoretically also contain encoding information, like # coding: utf-8 😈, but do we really have to handle that case properly?
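The "die quietly" fallback described above can be sketched roughly as follows. This is a hypothetical `parse_encoding`, not the actual babel implementation (and the stdlib `parser` module shown earlier was removed in Python 3.10, so the sketch only looks for a PEP 263 coding comment):

```python
import re

# Simplified PEP 263 magic-comment pattern, for illustration only
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def parse_encoding(first_line):
    """Return the encoding declared on the first line, or None.

    Instead of letting a Unicode error propagate to the caller,
    give up quietly and return None so the caller can fall back
    to a default charset.
    """
    try:
        first_line.decode("ascii")
    except UnicodeDecodeError:
        return None  # non-ASCII first line: "die quietly"
    match = CODING_RE.search(first_line)
    return match.group(1).decode("ascii") if match else None
```

With this sketch, `parse_encoding(b"# -*- coding: utf-8 -*-\n")` returns `"utf-8"`, while a first line containing the raw UTF-8 bytes of "я" returns None instead of raising.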

imankulov added a commit to imankulov/mako that referenced this pull request Oct 23, 2015
If mako templates contain something like "_('Köln')", the babel extractor converts
it to pure ASCII, so the resulting .po file would contain "K\xf6ln". Not all
translation tools and translators are ready for that kind of escape sequence.

Babel allows message ids to be non-ASCII; the plugin just has to return Unicode
objects instead of ASCII strings (and that's exactly how Babel's built-in Python
and JavaScript extractors work).

This fix ensures the mako extractor doesn't escape non-ASCII symbols, works well
for both Unicode and non-Unicode input (there is a test for cp1251 encoding),
and also provides a workaround for the babel charset detector, python-babel/babel#274.
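The escaping problem the commit message describes can be illustrated standalone (plain Python, not mako or babel code):

```python
# A non-ASCII message id, as extracted from a template
msgid = "Köln"

# Escaping it to pure ASCII is what translators would have seen:
escaped = msgid.encode("unicode_escape").decode("ascii")
assert escaped == "K\\xf6ln"   # backslash-escape soup in the .po file

# Returning the Unicode object keeps the .po entry readable:
assert msgid == "K\u00f6ln"
```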
@akx
Member

akx commented Jan 3, 2016

@imankulov Can you rebase this, please?

@imankulov
Contributor Author

@akx Done.

@akx
Member

akx commented Jan 4, 2016

Thank you, this looks good!

akx added a commit that referenced this pull request Jan 4, 2016
Fix UnicodeEncodeError on file encoding detection
@akx akx merged commit 07aa84f into python-babel:master Jan 4, 2016
@imankulov
Contributor Author

@akx Awesome! Thanks a lot for merging it.

@akx akx added this to the Babel 2.3 milestone Jan 4, 2016
@pyup-bot pyup-bot mentioned this pull request Apr 11, 2017