Properly handle unicode_errors arg parameter when loading a vocab file #2570

wmtzk · 2019-08-07T06:13:03Z

Simply, added "unicode_errors" option on loading vocab file process.

I've tested and all of them passed with changing numpy version from 1.14.5 to 1.17.0 in tox.ini.

tox -e flake8
tox -e py37-linux

The reason I've changed the version of numpy is to avoid the follwoing error, which also happens the HEAD revision of f97d0e7

tox -e py37-linux

py37-linux recreate: /home/ubuntu/gensim/.tox/py37-linux
py37-linux installdeps: pip>=19.1.1, numpy==1.14.5, scipy==1.1.0, .[test]
py37-linux installed: atomicwrites==1.3.0,attrs==19.1.0,boto==2.49.0,boto3==1.9.202,botocore==1.12.202,certifi==2019.6.16,chardet==3.0.4,Cython==0.29.13,docutils==0.14,gensim==3.8.0,idna==2.8,importlib-metadata==0.19,jmespath==0.9.4,joblib==0.13.2,mock==3.0.5,more-itertools==7.2.0,Morfessor==2.0.2a4,numpy==1.14.5,packaging==19.1,Pillow==6.1.0,pluggy==0.12.0,py==1.8.0,pyparsing==2.4.2,pytest==5.0.1,pytest-rerunfailures==7.0,python-dateutil==2.8.0,python-Levenshtein==0.12.0,pyzmq==18.0.2,requests==2.22.0,s3transfer==0.2.1,scikit-learn==0.21.3,scipy==1.1.0,six==1.12.0,smart-open==1.8.4,testfixtures==6.10.0,torchfile==0.1.0,tornado==6.0.3,urllib3==1.25.3,visdom==0.1.8.8,wcwidth==0.1.7,websocket-client==0.56.0,zipp==0.5.2
py37-linux run-test-pre: PYTHONHASHSEED='1'
py37-linux run-test: commands[0] | python --version
Python 3.7.3
py37-linux run-test: commands[1] | pip --version
pip 19.2.1 from /home/ubuntu/gensim/.tox/py37-linux/lib/python3.7/site-packages/pip (python 3.7)
py37-linux run-test: commands[2] | python -c 'from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/gensim/gensim/__init__.py", line 5, in <module>
    from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
  File "/home/ubuntu/gensim/gensim/corpora/__init__.py", line 6, in <module>
    from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
  File "/home/ubuntu/gensim/gensim/corpora/indexedcorpus.py", line 15, in <module>
    from gensim import interfaces, utils
  File "/home/ubuntu/gensim/gensim/interfaces.py", line 21, in <module>
    from gensim import utils, matutils
  File "/home/ubuntu/gensim/gensim/matutils.py", line 1104, in <module>
    from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
  File "__init__.pxd", line 918, in init gensim._matutils
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
ERROR: InvocationError for command /home/ubuntu/gensim/.tox/py37-linux/bin/python -c 'from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)' (exited with code 1)
______________________________________________________ summary ______________________________________________________
ERROR:   py37-linux: commands failed

Versions

Ubuntu 18.04

>>> import platform; print(platform.platform())
Linux-4.15.0-1044-aws-x86_64-with-debian-buster-sid
>>> import sys; print("Python", sys.version)
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.17.0
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.3.0
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

Thank you for checking that and your consideration!

mpenkov · 2019-08-07T06:17:16Z

Looks good, can you please add some tests? The tests should fail without your changes, and pass when your changes are active.

wmtzk · 2019-08-07T06:44:40Z

Thank you for checking that. I'll do that. Please give me a moment.

wmtzk · 2019-08-22T05:57:20Z

I've changed the way to take care of fvocab, which is same as model file is doing.
Also I made a few tests for that.
I'd appreciate it if you could check that. Thanks!

gensim/models/utils_any2vec.py

wmtzk

I've reverted except "add unicode_errors arg" and confirmed the test codes work.

mpenkov · 2019-08-23T08:11:36Z

Your tests fail under Py2.7, can you please take a look?

wmtzk · 2019-08-23T14:32:22Z

Mmm, It's embarrassing to be mentioned such a thing. ^^; I've confirmed the tests passed on Python2.7.15 and 3.7.3 env.

mpenkov · 2019-08-26T09:22:50Z

OK, looks good to merge. Congrats on your first contribution, @wmtzk ! 🥇

Good work!

wmtzk · 2019-08-26T10:15:59Z

Thanks!

Added unicode_errors arg in loading a vocab file.

5508bbb

wmtzk added 2 commits August 22, 2019 13:43

add a vocab file decoding error handling as same as a model file

c0df618

add test and test data of loading a vocab file which has decoding error

447a15e

add 'rstrip()' to load vocab file to make it same as model file

680da0c

piskvorky reviewed Aug 22, 2019

View reviewed changes

gensim/models/utils_any2vec.py Outdated Show resolved Hide resolved

reverted except adding unicode_errors arg.

6fc8705

wmtzk commented Aug 22, 2019

View reviewed changes

handle key values as unicode for Python2.

fc19f4a

mpenkov merged commit 4ab2167 into piskvorky:develop Aug 26, 2019

mpenkov changed the title ~~Added unicode_errors arg in loading a vocab file.~~ Properly handle unicode_errors arg parameter when loading a vocab file Aug 26, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Properly handle unicode_errors arg parameter when loading a vocab file #2570

Properly handle unicode_errors arg parameter when loading a vocab file #2570

wmtzk commented Aug 7, 2019 •

edited

Loading

mpenkov commented Aug 7, 2019

wmtzk commented Aug 7, 2019

wmtzk commented Aug 22, 2019

wmtzk left a comment

mpenkov commented Aug 23, 2019

wmtzk commented Aug 23, 2019 •

edited

Loading

mpenkov commented Aug 26, 2019

wmtzk commented Aug 26, 2019

Properly handle unicode_errors arg parameter when loading a vocab file #2570

Properly handle unicode_errors arg parameter when loading a vocab file #2570

Conversation

wmtzk commented Aug 7, 2019 • edited Loading

Versions

mpenkov commented Aug 7, 2019

wmtzk commented Aug 7, 2019

wmtzk commented Aug 22, 2019

wmtzk left a comment

Choose a reason for hiding this comment

mpenkov commented Aug 23, 2019

wmtzk commented Aug 23, 2019 • edited Loading

mpenkov commented Aug 26, 2019

wmtzk commented Aug 26, 2019

wmtzk commented Aug 7, 2019 •

edited

Loading

wmtzk commented Aug 23, 2019 •

edited

Loading