datatype parameter in load_word2vec_format doesn't work as expected #1682

Closed
jayantj opened this issue Nov 1, 2017 · 0 comments
Labels
bug, difficulty easy, good first issue

jayantj commented Nov 1, 2017

Description

Passing datatype=np.float64 to a KeyedVectors.load_word2vec_format call doesn't work as expected: the loaded floats seem to lose precision. The dtype of syn0 is still float64, though, so it seems the values are cast to float32 first while loading and then cast to float64 when the array is created.
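The suspected round trip can be illustrated with NumPy alone, without gensim. This is a minimal sketch: the helper names `parse_direct` and `parse_via_float32` are hypothetical and only mimic the two conversion paths described above; they are not gensim functions.

```python
import numpy as np


def parse_direct(text):
    # Parse the text straight into float64: no intermediate narrowing.
    return np.float64(text)


def parse_via_float32(text):
    # Parse into float32 first, then widen to float64. The dtype ends up
    # float64, but the digits beyond float32 precision are already gone.
    return np.float64(np.float32(text))


text_value = "-0.0008546282343595379"  # value quoted in this report

direct = parse_direct(text_value)
round_tripped = parse_via_float32(text_value)

print(direct.dtype, round_tripped.dtype)  # both are float64
print(repr(float(direct)))
print(repr(float(round_tripped)))
print(direct == round_tripped)
```

Both results report float64 as their dtype, which matches the observation that syn0 has the requested dtype even though the stored values differ from the ones in the file.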

Steps/Code/Corpus to Reproduce

Using this file -
test.kv.txt

from gensim.models.keyedvectors import KeyedVectors
import numpy as np

kv = KeyedVectors.load_word2vec_format('test.kv.txt', datatype=np.float64)
print(kv['horse.n.01'][0] == -0.0008546282343595379)
# False
print(kv['horse.n.01'].dtype)
# float64

Expected Results

print(kv['horse.n.01'][0] == -0.0008546282343595379)
# True

Actual Results

print(kv['horse.n.01'][0] == -0.0008546282343595379)
# False

Looking at the code and making a quick hack here, changing..

word, weights = parts[0], [REAL(x) for x in parts[1:]]

to..

word, weights = parts[0], [datatype(x) for x in parts[1:]]

..leads to the correct result. However, I imagine there are other cases to be covered as well.
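As a rough, standalone sketch of that hack (not the actual gensim loader), a single line of the text format could be parsed like this. The function name `parse_vector_line` and the sample line are made up for illustration, and the space-separated "word v1 v2 ..." layout is assumed.

```python
import numpy as np


def parse_vector_line(line, datatype=np.float32):
    # Assumed layout: "<word> <v1> <v2> ...", separated by single spaces.
    parts = line.rstrip().split(" ")
    word = parts[0]
    # Convert each token with the requested datatype instead of a fixed
    # float32 type, so float64 values are never narrowed on the way in.
    weights = np.array([datatype(x) for x in parts[1:]], dtype=datatype)
    return word, weights


word, vec = parse_vector_line("horse.n.01 0.1 0.2\n", datatype=np.float64)
print(word, vec.dtype, float(vec[0]) == 0.1)
```

This only covers the text-format parsing path; the binary format and the save path would presumably need their own handling, as the report notes.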

menshikh-iv added the bug, difficulty easy, and good first issue labels Nov 1, 2017
horpto added a commit to horpto/gensim that referenced this issue Nov 5, 2017
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this issue Feb 21, 2018
…iskvorky#1682 (piskvorky#1819)

* load vector with high precision

* Test changes

* Fix flake8 error

* Fix path error

* Reformat code

* Fix precision loss issue for binary word2vec

* Fix precision loss during saving model in text format

* Fix binary file loading issue

* Test other datatypes as well.

* Test type conversion

* Fix build error

* Use better names

* Test type after conversion