Fix datatype parameter for KeyedVectors.load_word2vec_format. Fix #1682 (#1819)

Conversation
Hi @pushpankar, thanks for your PR. What about the "other cases" from #1682?

Ping @pushpankar, how is it going? You also need to write additional tests to check this behavior and prevent regression, and please merge develop from upstream.
Sorry, I thought my work was done. Should I write a test script in

@pushpankar two types are already enough; also, use the example from the original issue.
gensim/test/test.kv.txt (Outdated)
@@ -0,0 +1,3 @@
2 2
should be in gensim/test/test_data
gensim/test/test_datatype.py (Outdated)
def test_datatype(self):
    path = os.path.join(os.path.dirname(__file__), 'test.kv.txt')
    kv = KeyedVectors.load_word2vec_format(path, datatype=np.float64)
    self.assertEqual(kv['horse.n.01'][0], -0.0008546282343595379)
- You should use assertAlmostEqual instead of assertEqual for float comparison.
- You also need to check the datatype.
gensim/test/test_datatype.py (Outdated)
class TestDataType(unittest.TestCase):
    def test_datatype(self):
        path = os.path.join(os.path.dirname(__file__), 'test.kv.txt')
When you move test.kv.txt into gensim/test/test_data, the construction should look like:

from gensim.test.utils import datapath
path = datapath('test.kv.txt')
Hi @pushpankar, thanks for the PR! Apart from the suggestions @menshikh-iv mentioned, it'd also be good to have a test for word2vec files in binary format.

@@ -0,0 +1,18 @@
import logging
Please add a file header, like in the other test files.
Saving the model also causes a loss of precision, in both text and binary format.

I see many failed tests, @pushpankar, please have a look into it.
@menshikh-iv I am working on it. Changing the way binary files are loaded is causing it. |
@menshikh-iv @jayantj I need some help.
gensim/models/keyedvectors.py (Outdated)
@@ -220,7 +220,7 @@ def add_word(word, weights):
    result.index2word.append(word)

    if binary:
-        binary_len = dtype(REAL).itemsize * vector_size
+        binary_len = dtype(datatype).itemsize * vector_size
I am having a problem detecting the binary_len of a vector saved with a custom datatype. The only clue is that the next vector starts after a " ", but before the space comes a string (also converted to Python bytes) which can be of any length. @menshikh-iv any suggestion?
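For context, the ambiguity comes from the word2vec binary layout: each record is the word's bytes, a single space, then exactly vector_size * itemsize raw bytes, with no per-record length field. A minimal sketch of a writer/reader pair (hypothetical helpers, not gensim's actual code) shows why the reader must already know the on-disk dtype:

```python
import io
import numpy as np

def write_record(buf, word, vec):
    # word2vec binary layout: word bytes, one space, then raw vector bytes.
    buf.write(word.encode('utf8') + b' ' + vec.tobytes())

def read_record(buf, vector_size, dtype):
    # Read the word byte by byte until the separating space...
    word = bytearray()
    ch = buf.read(1)
    while ch != b' ':
        word.extend(ch)
        ch = buf.read(1)
    # ...then read exactly itemsize * vector_size bytes. If this dtype does
    # not match the one used when writing, the wrong number of bytes is
    # consumed and every following record becomes misaligned.
    binary_len = np.dtype(dtype).itemsize * vector_size
    vec = np.frombuffer(buf.read(binary_len), dtype=dtype)
    return word.decode('utf8'), vec

buf = io.BytesIO()
write_record(buf, 'horse.n.01', np.array([0.1, 0.2], dtype=np.float32))
write_record(buf, 'cat.n.01', np.array([0.3, 0.4], dtype=np.float32))
buf.seek(0)
word1, _ = read_record(buf, 2, np.float32)  # matching dtype: parses cleanly
word2, _ = read_record(buf, 2, np.float32)
```

Reading the same buffer with np.float64 would consume 16 bytes per vector instead of 8 and swallow the start of the next word, which is exactly the failure described above.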
I tried adding \n at the end of each vector when saving in binary, but that broke many other tests.
@pushpankar how does it work if datatype=REAL here?
Since floats were previously saved only with 32-bit precision, knowing the size of each vector in binary format was easy. Casting to lower precision is only done after loading the vectors.

Please also note that in the develop branch too, casting a vector to lower precision, saving it in binary, and then loading it leads to errors. This is because float32 is assumed while loading, but the model was saved with lower precision such as float16. I am adding some code to make it clearer.
from gensim.models.keyedvectors import KeyedVectors
import numpy as np

model = KeyedVectors.load_word2vec_format('./test_data/test.kv.txt', datatype=np.float16)
print(model['horse.n.01'][0])
model.save_word2vec_format('./test_data/test.kv.bin', binary=True)
model2 = KeyedVectors.load_word2vec_format('./test_data/test.kv.bin', datatype=np.float32, binary=True)
print(model2['horse.n.01'][0])

This gives:
Traceback (most recent call last):
File "convert2binary.py", line 7, in <module>
print(model2['horse.n.01'][0])
File "/home/pushpankar/gensim/gensim/models/keyedvectors.py", line 326, in __getitem__
return self.word_vec(words)
File "/home/pushpankar/gensim/gensim/models/keyedvectors.py", line 453, in word_vec
raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'horse.n.01' not in vocabulary"
This is because float32 was assumed while reading the binary vectors, but they were originally saved with float16, so more bytes than necessary were read for every vector.
Let me know if I am not clear enough.
So probably the easiest solution for this case is to read/write with the REAL type and cast it at the end of the "load" process. Wdyt @jayantj?
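That suggestion, keeping all binary I/O in the fixed 32-bit REAL type and casting only at the very end of loading, can be sketched as follows (hypothetical helper names; gensim's actual logic lives in save_word2vec_format / load_word2vec_format):

```python
import io
import numpy as np

REAL = np.float32  # the fixed on-disk type

def save_binary(buf, vocab):
    # Always write vectors as REAL, whatever their in-memory dtype,
    # so the record length on disk stays unambiguous.
    for word, vec in vocab.items():
        buf.write(word.encode('utf8') + b' ' + vec.astype(REAL).tobytes())

def load_binary(buf, vector_size, datatype=REAL):
    vocab = {}
    binary_len = np.dtype(REAL).itemsize * vector_size  # fixed, known size
    while True:
        ch = buf.read(1)
        if not ch:  # end of stream
            break
        word = bytearray()
        while ch != b' ':
            word.extend(ch)
            ch = buf.read(1)
        vec = np.frombuffer(buf.read(binary_len), dtype=REAL)
        # Cast to the caller's datatype only at the end of loading.
        vocab[word.decode('utf8')] = vec.astype(datatype)
    return vocab

# Round trip: vectors held as float16 in memory survive a binary
# save/load because the file itself is always float32.
buf = io.BytesIO()
save_binary(buf, {'horse.n.01': np.array([0.5, -0.25], dtype=np.float16)})
buf.seek(0)
loaded = load_binary(buf, 2, datatype=np.float64)
```

With this scheme the crash above cannot happen: the reader never needs to guess the writer's precision, because the file format pins it to float32.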
This fixes:
gensim/test/test_datatype.py (Outdated)
def test_text(self):
    path = datapath('test.kv.txt')
    kv = KeyedVectors.load_word2vec_format(path, binary=False,
                                           datatype=np.float64)
What about different datatypes?
Will np.float16, np.float32, and np.float64 be enough?
Yes
gensim/models/keyedvectors.py (Outdated)
@@ -147,9 +147,10 @@ def save_word2vec_format(self, fname, fvocab=None, binary=False, total_vec=None)
    for word, vocab in sorted(iteritems(self.vocab), key=lambda item: -item[1].count):
        row = self.syn0[vocab.index]
        if binary:
+            row = row.astype(REAL)
Why is this needed?
Because:

from gensim.models.keyedvectors import KeyedVectors
import numpy as np

model = KeyedVectors.load_word2vec_format('./test_data/test.kv.txt', datatype=np.float16)
print(model['horse.n.01'][0])
model.save_word2vec_format('./test_data/test.kv.bin', binary=True)
model2 = KeyedVectors.load_word2vec_format('./test_data/test.kv.bin', datatype=np.float32, binary=True)

this causes a crash.

This is another bug that exists in the develop branch. Steps to reproduce the error:
- load a model with low precision,
- save that model in binary format,
- then try to load the binary model.
Ah, thanks for investigating that and reporting it. Do you have any ideas why that might be occurring? The fix seems a little hack-ish and might mask other genuine problems.
#1777 should be merged first.

Hello @pushpankar, can you resolve the merge conflict?
…into fix_load_wv

Force-pushed from 69a4b2e to 4d0661f, then from 4d0661f to 991bcb6.
Good work @pushpankar, LGTM. Wdyt @jayantj?
gensim/test/test_datatype.py (Outdated)
class TestDataType(unittest.TestCase):
    def load_model(self, datatype):
        path = datapath('test.kv.txt')
A slightly more descriptive name would be helpful, there's already a lot of test data and it can easily get confusing.
gensim/test/test_datatype.py
Outdated
model1.save_word2vec_format(binary_path, binary=True) | ||
model2 = KeyedVectors.load_word2vec_format(binary_path, datatype=np.float64, binary=True) | ||
self.assertAlmostEqual(model1["horse.n.01"][0], np.float16(model2["horse.n.01"][0])) | ||
|
Another test, verifying that the type of a value in the loaded array matches the datatype passed to load_word2vec_format, would explicitly confirm the new behaviour.
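Such a dtype-checking test could look like the following sketch, which uses a tiny text-format parser as a stand-in for KeyedVectors.load_word2vec_format (the helper, its name, and the sample data are hypothetical):

```python
import numpy as np

def load_vectors(raw_text, datatype):
    # Stand-in for KeyedVectors.load_word2vec_format on a text-format file:
    # the first line is "<vocab_size> <vector_size>", then one word per line
    # followed by its vector components.
    vocab = {}
    for line in raw_text.strip().splitlines()[1:]:
        parts = line.split()
        vocab[parts[0]] = np.array(parts[1:], dtype=np.float64).astype(datatype)
    return vocab

RAW = "2 2\nhorse.n.01 -0.0008546282343595379 0.5\ncat.n.01 0.25 0.75\n"

# The dtype of a loaded value must match the datatype that was requested --
# this is the behaviour the reviewer wants asserted explicitly.
for dt in (np.float16, np.float32, np.float64):
    kv = load_vectors(RAW, dt)
    assert kv['horse.n.01'].dtype == dt
```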
LGTM too (left some minor comments which should be taken care of, IMO), thanks for the fix @pushpankar.

@pushpankar good work, congrats on your first contribution! 👍
…iskvorky#1682 (piskvorky#1819)

* load vector with high precision
* Test changes
* Fix flake8 error
* Fix path error
* Reformat code
* Fix precision loss issue for binary word2vec
* Fix precision loss during saving model in text format
* Fix binary file loading issue
* Test other datatypes as well
* Test type conversion
* Fix build error
* Use better names
* Test type after conversion
Minor code style request.
class TestDataType(unittest.TestCase):
    def load_model(self, datatype):
        path = datapath('high_precision.kv.txt')
        kv = KeyedVectors.load_word2vec_format(path, binary=False,
Hanging indent please (not vertical).
I guess you are talking about lines 22-23. I have merged them.