preprocess.py doesn't handle unicode character sequences correctly #29
Comments
Convert print statements to print function calls. Use `items` instead of `iteritems`. Tested on Python 2.7 and Python 3.5. Running preprocess.py under Python 3.3+ fixes jcjohnson#29.
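For readers unfamiliar with the two changes named above, here is a minimal sketch of code that runs unchanged on Python 2.7 and Python 3. The `count_tokens` helper is hypothetical and only for illustration; it is not taken from preprocess.py.

```python
# Hypothetical helper, not from preprocess.py: counts characters in a string.
def count_tokens(text):
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

counts = count_tokens("hello")

# dict.items() exists on both Python 2 and Python 3;
# dict.iteritems() exists only on Python 2.
for token, n in sorted(counts.items()):
    # Calling print with one parenthesized argument behaves the same on both.
    print("%s %d" % (token, n))
```

On Python 2.7, calling `print` with several arguments would also need `from __future__ import print_function` at the top of the file.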
It is not strictly related, but when I try to train on text containing emojis, Lua throws an error. I've also tried the solution in #52, but I keep getting the error. Any idea? I ran the preprocessing with Python 2.7.
Rewriting preprocessor script to address this.
I'm using the latest version. The preprocessing works fine; it's when I try to start training that I get an error.
I don't have many emoji in my text, so I was able to just remove them.
Great project!
I just ran into one small problem with text containing emojis. These are currently not encoded correctly by `preprocess.py`:
Outputs the following json:
As you can see, the emoji has been broken into two characters: `\ud83d` and `\ude00`. cjson throws an error when it attempts to decode this since `\ud83d` is not a valid unicode character.
I prototyped a fix in Python 3.3+ based on this SO question that I can submit a pull request for, but that requires updating `print` and unrelated code for Python 3 as well. I'm not sure what the proper fix is for Python 2.x.
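To illustrate where `\ud83d` and `\ude00` come from, here is a minimal sketch assuming Python 3.3+. It is not the prototyped fix mentioned above. The two values are the UTF-16 surrogate pair for U+1F600, and each half on its own is not a valid character. The `to_surrogate_pair` helper and the `token_to_idx` vocabulary built here are hypothetical names used only for this example.

```python
import json

def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return high, low

emoji = "\U0001F600"
high, low = to_surrogate_pair(ord(emoji))
print(hex(high), hex(low))  # 0xd83d 0xde00

# On Python 3.3+, iterating over a str yields whole code points, so the
# emoji becomes a single vocabulary entry instead of two lone surrogates.
token_to_idx = {}
for ch in "hi " + emoji:
    if ch not in token_to_idx:
        token_to_idx[ch] = len(token_to_idx) + 1

# json.dumps escapes the emoji as one adjacent surrogate pair inside one key,
# and json.loads recombines the pair into a single character.
encoded = json.dumps(token_to_idx)
decoded = json.loads(encoded)
```

With the pair kept together in a single key, a JSON decoder sees a complete character rather than an isolated `\ud83d`.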