Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle unicode_errors arg parameter when loading a vocab file #2570

Merged
merged 6 commits into from
Aug 26, 2019

Conversation

wmtzk
Copy link
Contributor

@wmtzk wmtzk commented Aug 7, 2019

Fix #2569

Simply, added "unicode_errors" option on loading vocab file process.

I've tested and all of them passed with changing numpy version from 1.14.5 to 1.17.0 in tox.ini.

tox -e flake8
tox -e py37-linux

The reason I've changed the version of numpy is to avoid the follwoing error, which also happens the HEAD revision of f97d0e7

tox -e py37-linux
py37-linux recreate: /home/ubuntu/gensim/.tox/py37-linux
py37-linux installdeps: pip>=19.1.1, numpy==1.14.5, scipy==1.1.0, .[test]
py37-linux installed: atomicwrites==1.3.0,attrs==19.1.0,boto==2.49.0,boto3==1.9.202,botocore==1.12.202,certifi==2019.6.16,chardet==3.0.4,Cython==0.29.13,docutils==0.14,gensim==3.8.0,idna==2.8,importlib-metadata==0.19,jmespath==0.9.4,joblib==0.13.2,mock==3.0.5,more-itertools==7.2.0,Morfessor==2.0.2a4,numpy==1.14.5,packaging==19.1,Pillow==6.1.0,pluggy==0.12.0,py==1.8.0,pyparsing==2.4.2,pytest==5.0.1,pytest-rerunfailures==7.0,python-dateutil==2.8.0,python-Levenshtein==0.12.0,pyzmq==18.0.2,requests==2.22.0,s3transfer==0.2.1,scikit-learn==0.21.3,scipy==1.1.0,six==1.12.0,smart-open==1.8.4,testfixtures==6.10.0,torchfile==0.1.0,tornado==6.0.3,urllib3==1.25.3,visdom==0.1.8.8,wcwidth==0.1.7,websocket-client==0.56.0,zipp==0.5.2
py37-linux run-test-pre: PYTHONHASHSEED='1'
py37-linux run-test: commands[0] | python --version
Python 3.7.3
py37-linux run-test: commands[1] | pip --version
pip 19.2.1 from /home/ubuntu/gensim/.tox/py37-linux/lib/python3.7/site-packages/pip (python 3.7)
py37-linux run-test: commands[2] | python -c 'from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ubuntu/gensim/gensim/__init__.py", line 5, in <module>
    from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
  File "/home/ubuntu/gensim/gensim/corpora/__init__.py", line 6, in <module>
    from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
  File "/home/ubuntu/gensim/gensim/corpora/indexedcorpus.py", line 15, in <module>
    from gensim import interfaces, utils
  File "/home/ubuntu/gensim/gensim/interfaces.py", line 21, in <module>
    from gensim import utils, matutils
  File "/home/ubuntu/gensim/gensim/matutils.py", line 1104, in <module>
    from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
  File "__init__.pxd", line 918, in init gensim._matutils
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject
ERROR: InvocationError for command /home/ubuntu/gensim/.tox/py37-linux/bin/python -c 'from gensim.models.word2vec import FAST_VERSION; print(FAST_VERSION)' (exited with code 1)
______________________________________________________ summary ______________________________________________________
ERROR:   py37-linux: commands failed

Versions

Ubuntu 18.04

>>> import platform; print(platform.platform())
Linux-4.15.0-1044-aws-x86_64-with-debian-buster-sid
>>> import sys; print("Python", sys.version)
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.17.0
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 1.3.0
>>> import gensim; print("gensim", gensim.__version__)
gensim 3.8.0
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 1

Thank you for checking that and your consideration!

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 7, 2019

Looks good, can you please add some tests? The tests should fail without your changes, and pass when your changes are active.

@wmtzk
Copy link
Contributor Author

wmtzk commented Aug 7, 2019

Thank you for checking that. I'll do that. Please give me a moment.

@wmtzk
Copy link
Contributor Author

wmtzk commented Aug 22, 2019

I've changed the way to take care of fvocab, which is same as model file is doing.
Also I made a few tests for that.
I'd appreciate it if you could check that. Thanks!

Copy link
Contributor Author

@wmtzk wmtzk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted except "add unicode_errors arg" and confirmed the test codes work.

@mpenkov
Copy link
Collaborator

mpenkov commented Aug 23, 2019

Your tests fail under Py2.7, can you please take a look?

@wmtzk
Copy link
Contributor Author

wmtzk commented Aug 23, 2019

Mmm, It's embarrassing to be mentioned such a thing. ^^; I've confirmed the tests passed on Python2.7.15 and 3.7.3 env.

@mpenkov mpenkov merged commit 4ab2167 into piskvorky:develop Aug 26, 2019
@mpenkov mpenkov changed the title Added unicode_errors arg in loading a vocab file. Properly handle unicode_errors arg parameter when loading a vocab file Aug 26, 2019
@mpenkov
Copy link
Collaborator

mpenkov commented Aug 26, 2019

OK, looks good to merge. Congrats on your first contribution, @wmtzk ! 🥇

Good work!

@wmtzk
Copy link
Contributor Author

wmtzk commented Aug 26, 2019

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

loading a vocab file doesn't seem to support unicode_errors
3 participants