preprocess.py doesn't handle Unicode character sequences correctly #29

Closed
mattbierner opened this issue Mar 11, 2016 · 3 comments

Comments

@mattbierner

Great project!

I just ran into one small problem with text containing emojis. These are currently not encoded correctly by preprocess.py:

Test 😀!

outputs the following JSON:

{"idx_to_token": {"1": "T", "2": "e", "3": "s", "4": "t", "5": " ", "6": "\ud83d", "7": "\ude00", "8": "!", "9": "\n"}, "token_to_idx": {"!": 8, " ": 5, "e": 2, "\ude00": 7, "\n": 9, "s": 3, "T": 1, "\ud83d": 6, "t": 4}}

As you can see, the emoji has been broken into two characters: \ud83d and \ude00. cjson throws an error when it attempts to decode this, since \ud83d on its own is a lone surrogate and not a valid Unicode character.
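
For illustration, here is a minimal Python 3 sketch of why the emoji shows up as two tokens. U+1F600 lies above U+FFFF, so UTF-16 (and JSON's `\u` escapes) represent it as the surrogate pair D83D DE00; each half alone is not a character. The helper name `is_lone_surrogate` is hypothetical and is not part of preprocess.py.

```python
import json

emoji = "\U0001F600"  # the emoji from the example above

# UTF-16 encodes code points above U+FFFF as two 16-bit units (a surrogate pair).
utf16 = emoji.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
surrogates = [hex(u) for u in units]

# With the default ensure_ascii=True, json.dumps escapes the emoji as that same pair.
escaped = json.dumps(emoji)


def is_lone_surrogate(ch):
    """Hypothetical helper: True if ch is a single surrogate code unit."""
    return len(ch) == 1 and 0xD800 <= ord(ch) <= 0xDFFF


# A lone surrogate cannot be encoded as UTF-8, which is why it is not usable as a token.
try:
    "\ud83d".encode("utf-8")
    lone_encodes = True
except UnicodeEncodeError:
    lone_encodes = False
```

The pair is only meaningful when both halves stay together, so a vocabulary that stores each half as its own token produces escapes that a strict decoder may reject.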

I prototyped a fix for Python 3.3+ based on this SO question and can submit a pull request for it, but it also requires updating print and other unrelated code for Python 3. I'm not sure what the proper fix is for Python 2.x.
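
As a rough sketch of the Python 3.3+ behaviour (not the actual preprocess.py code or the prototype mentioned above): iterating over a `str` yields whole code points, so the emoji stays a single token. The function name `build_vocab` and the 1-based indexing are assumptions chosen to mirror the JSON shown earlier; whether the Lua-side decoder accepts the resulting file is not verified here.

```python
import json


def build_vocab(text):
    """Hypothetical helper: map each code point in text to a 1-based index."""
    token_to_idx = {}
    for ch in text:  # Python 3.3+ iterates by code point, not by UTF-16 unit
        if ch not in token_to_idx:
            token_to_idx[ch] = len(token_to_idx) + 1
    # JSON object keys must be strings, so the indices are stringified here.
    idx_to_token = {str(i): ch for ch, i in token_to_idx.items()}
    return {"idx_to_token": idx_to_token, "token_to_idx": token_to_idx}


vocab = build_vocab("Test \U0001F600!\n")

# ensure_ascii=False writes the emoji literally instead of as \u escapes
# (an assumption about what is convenient downstream, not a confirmed requirement).
vocab_json = json.dumps(vocab, ensure_ascii=False)
```

With this input the vocabulary has eight entries instead of nine, because the emoji is counted once.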

mattbierner added a commit to mattbierner/torch-rnn that referenced this issue Mar 11, 2016
Print statements to print function calls.
Use `items` instead of `iteritems`.

Tested on Python 2.7 and Python 3.5

Running preprocess.py under Python 3.3+ fixes jcjohnson#29
@manuchis

It's not strictly related, but when I try to train on text containing emojis, Lua throws an error. I've also tried the solution in #52, but I keep getting: Expected value but found invalid unicode escape code at character

Any ideas? I ran the preprocessing with Python 2.7.

@dgcrouse

Rewriting preprocessor script to address this.

@Benimation

I'm using the latest version. The preprocessing works fine; it's when I try to start training that I get an error:

train.lua:77: Expected value but found invalid unicode escape code at character 1080

I don't have many emoji in my text, so I was able to just remove them.
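
For anyone taking the same workaround, a minimal Python 3 sketch that strips such characters before preprocessing might look like the following. It assumes the characters causing trouble are those above U+FFFF (the ones that need surrogate pairs), which this thread does not confirm; the name `strip_non_bmp` is hypothetical.

```python
import re

# Matches any code point above U+FFFF, i.e. outside the Basic Multilingual Plane.
NON_BMP = re.compile("[\U00010000-\U0010FFFF]")


def strip_non_bmp(text):
    """Hypothetical helper: drop every character that needs a surrogate pair."""
    return NON_BMP.sub("", text)


cleaned = strip_non_bmp("Test \U0001F600!")
```

This loses the emoji entirely, so it only makes sense when they are not important to the training text.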
