preprocess.py doesn't handle Unicode character sequences correctly #29

mattbierner · 2016-03-11T08:08:02Z

Great project!

I just ran into one small problem with text containing emojis. These are currently not encoded correctly by preprocess.py:

Test 😀!

preprocess.py outputs the following JSON:

{"idx_to_token": {"1": "T", "2": "e", "3": "s", "4": "t", "5": " ", "6": "\ud83d", "7": "\ude00", "8": "!", "9": "\n"}, "token_to_idx": {"!": 8, " ": 5, "e": 2, "\ude00": 7, "\n": 9, "s": 3, "T": 1, "\ud83d": 6, "t": 4}}

As you can see, the emoji has been split into two tokens, \ud83d and \ude00, which are the two halves of a UTF-16 surrogate pair. cjson throws an error when it attempts to decode this, since \ud83d on its own is not a valid Unicode character.
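For reference, the two escapes in the output above are what UTF-16 produces for any code point beyond U+FFFF. A minimal sketch of that arithmetic follows; the function name `to_surrogate_pair` is made up for illustration and is not part of preprocess.py:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


# U+1F600 is the emoji from the sample text.
high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

Each half is only meaningful next to the other, which is why storing them as separate vocabulary entries yields JSON that a strict decoder rejects.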

I prototyped a fix in Python 3.3+ based on this Stack Overflow question that I can submit a pull request for, but that requires updating the print statements and other unrelated code for Python 3 as well. I'm not sure what the proper fix is for Python 2.x.
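One possible direction for Python 2.x, sketched here as an assumption rather than a tested fix for this repository, is to rejoin adjacent high/low surrogates into a single token before building the vocabulary. `iter_code_points` is a hypothetical helper name, not something in preprocess.py, and this is not the prototype mentioned above:

```python
def iter_code_points(text):
    """Yield one token per character, rejoining UTF-16 surrogate pairs.

    Assumption: `text` is a unicode string in which characters beyond
    U+FFFF may appear as two separate code units.
    """
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        is_high = 0xD800 <= ord(ch) <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF:
            # High surrogate followed by a low surrogate: keep them together.
            yield ch + text[i + 1]
            i += 2
        else:
            yield ch
            i += 1


# The sample text with the emoji written as its two code units.
tokens = list(iter_code_points(u"Test \ud83d\ude00!"))
# tokens == ['T', 'e', 's', 't', ' ', '\ud83d\ude00', '!']
```

With the pair kept in one token, the JSON writer emits both escapes inside the same string, which is the form a decoder expects for a character outside the Basic Multilingual Plane.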


Convert print statements to print function calls. Use `items` instead of `iteritems`. Tested on Python 2.7 and Python 3.5. Running preprocess.py under Python 3.3+ fixes jcjohnson#29.

manuchis · 2016-10-19T19:01:10Z

It is not strictly related, but when I try to train on text with emojis, Lua throws an error. I've also tried the solution in #52, but I keep getting: Expected value but found invalid unicode escape code at character

Any ideas? I ran the preprocessing with Python 2.7.

dgcrouse · 2017-04-27T04:46:36Z

Rewriting preprocessor script to address this.

Benimation · 2017-10-06T17:19:55Z

I'm using the latest version. The preprocessing works fine; it's when I try to start training that I get an error:

train.lua:77: Expected value but found invalid unicode escape code at character 1080

I don't have many emoji in my text, so I was able to just remove them.

mattbierner mentioned this issue Mar 11, 2016

Update preprocess.py to support Python 3 #30

Closed

dgcrouse closed this as completed Apr 27, 2017
