Fix datatype parameter for KeyedVectors.load_word2vec_format. Fix #1682 #1819

Merged Feb 16, 2018 (19 commits)
Changes from 8 commits
6 changes: 3 additions & 3 deletions gensim/models/keyedvectors.py
@@ -220,7 +220,7 @@ def add_word(word, weights):
                result.index2word.append(word)

            if binary:
-                binary_len = dtype(REAL).itemsize * vector_size
+                binary_len = dtype(datatype).itemsize * vector_size
Contributor Author:
I am having a problem detecting binary_len for vectors saved with a custom datatype. The only clue is that the next vector starts after a " ", but before that space comes a string (also converted to Python bytes) which can be of any length. @menshikh-iv any suggestion?
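For anyone following along, a minimal sketch of the layout in question (a hypothetical two-word file, not the gensim implementation itself):

import numpy as np

vector_size = 2
records = b''
for word, vec in [(b'kangaroo.n.01', [0.1, 0.2]), (b'horse.n.01', [0.3, 0.4])]:
    # word bytes + b' ' + raw vector bytes; nothing marks where the vector ends
    records += word + b' ' + np.array(vec, dtype=np.float16).tobytes()

# The reader must know binary_len up front: the vector bytes may themselves
# contain b' ' or b'\n', so scanning for a delimiter is unreliable.
binary_len = np.dtype(np.float16).itemsize * vector_size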

Contributor Author (@pushpankar), Jan 18, 2018:
I tried adding \n at the end of each vector when saving in binary, but that broke many other tests.

Contributor:
@pushpankar how does it work if datatype=REAL here?

Contributor Author (@pushpankar), Jan 19, 2018:
Since floats were previously saved only with 32-bit precision, knowing the size of each vector in the binary format was easy; casting to lower precision was only done after loading the vectors. Please also note that in the develop branch too, casting a vector to lower precision, saving it in binary and then loading it leads to errors, because float32 is assumed while loading even though the data was saved with a lower precision like float16. I am adding some code to make this clearer.

import numpy as np
from gensim.models.keyedvectors import KeyedVectors

model = KeyedVectors.load_word2vec_format('./test_data/test.kv.txt', datatype=np.float16)
print(model['horse.n.01'][0])
model.save_word2vec_format('./test_data/test.kv.bin', binary=True)
model2 = KeyedVectors.load_word2vec_format('./test_data/test.kv.bin', datatype=np.float32, binary=True)
print(model2['horse.n.01'][0])

Gives

Traceback (most recent call last):
  File "convert2binary.py", line 7, in <module>
    print(model2['horse.n.01'][0])
  File "/home/pushpankar/gensim/gensim/models/keyedvectors.py", line 326, in __getitem__
    return self.word_vec(words)
  File "/home/pushpankar/gensim/gensim/models/keyedvectors.py", line 453, in word_vec
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'horse.n.01' not in vocabulary"

This is because float32 was assumed while reading the binary vectors, but they were originally saved as float16, so more bytes than necessary were read for every vector and every word boundary after the first was corrupted.
Let me know if I am not clear enough.
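Concretely, for this 2-dimensional test file the mismatch works out as follows (a sketch with the sizes spelled out):

import numpy as np

vector_size = 2
stored = np.dtype(np.float16).itemsize * vector_size   # 4 bytes per vector on disk
assumed = np.dtype(np.float32).itemsize * vector_size  # 8 bytes read by the loader
# Each read overshoots by 4 bytes and swallows the start of the next word,
# so 'horse.n.01' never enters the vocabulary intact.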

Contributor:
So probably the easiest solution for this case is to read/write with the REAL type & cast it before the end of the "load" process, wdyt @jayantj?
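A rough sketch of that suggestion (assuming REAL is np.float32, as in gensim, with np.frombuffer standing in for the deprecated fromstring): the file is always read as float32, and the cast to the requested datatype happens once at the end.

import numpy as np

REAL = np.float32

def read_vector(fin, vector_size, datatype=REAL):
    # The file is always written as float32, so binary_len stays unambiguous.
    binary_len = np.dtype(REAL).itemsize * vector_size
    weights = np.frombuffer(fin.read(binary_len), dtype=REAL)
    # Cast to the requested precision only at the end of loading.
    return weights.astype(datatype)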

                for _ in xrange(vocab_size):
                    # mixed text and binary: read text first, then binary
                    word = []
@@ -233,7 +233,7 @@ def add_word(word, weights):
                        if ch != b'\n':  # ignore newlines in front of words (some binary files have)
                            word.append(ch)
                    word = utils.to_unicode(b''.join(word), encoding=encoding, errors=unicode_errors)
-                    weights = fromstring(fin.read(binary_len), dtype=REAL)
+                    weights = fromstring(fin.read(binary_len), dtype=datatype)
                    add_word(word, weights)
            else:
                for line_no in xrange(vocab_size):
@@ -243,7 +243,7 @@ def add_word(word, weights):
                    parts = utils.to_unicode(line.rstrip(), encoding=encoding, errors=unicode_errors).split(" ")
                    if len(parts) != vector_size + 1:
                        raise ValueError("invalid vector on line %s (is this really the text format?)" % line_no)
-                    word, weights = parts[0], [REAL(x) for x in parts[1:]]
+                    word, weights = parts[0], np.array(parts[1:], dtype=datatype)
                    add_word(word, weights)
        if result.syn0.shape[0] != len(result.vocab):
            logger.info(
2 changes: 2 additions & 0 deletions gensim/test/test_data/test.kv.bin
@@ -0,0 +1,2 @@
2 2
kangaroo.n.01 8��&�%H��.���horse.n.01 \O�($L���k�P6I?
3 changes: 3 additions & 0 deletions gensim/test/test_data/test.kv.txt
@@ -0,0 +1,3 @@
2 2
kangaroo.n.01 -0.0007369244245224787 -8.269973595356034e-05
horse.n.01 -0.0008546282343595379 0.0007694142576316829
37 changes: 37 additions & 0 deletions gensim/test/test_datatype.py
@@ -0,0 +1,37 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Automated tests for checking the datatype parameter of KeyedVectors.load_word2vec_format.
"""

import logging
Contributor:
Please add a file header, like in the other test files.

import unittest

import numpy as np

from gensim.test.utils import datapath
from gensim.models.keyedvectors import KeyedVectors


class TestDataType(unittest.TestCase):
    def test_binary(self):
        path = datapath('test.kv.bin')
        kv = KeyedVectors.load_word2vec_format(path, binary=True,
                                               datatype=np.float64)
        self.assertAlmostEqual(kv['horse.n.01'][0], -0.0008546282343595379)
        self.assertEqual(kv['horse.n.01'][0].dtype, np.float64)

    def test_text(self):
        path = datapath('test.kv.txt')
        kv = KeyedVectors.load_word2vec_format(path, binary=False,
                                               datatype=np.float64)
Contributor:
What about different datatypes?

Contributor Author:
Will np.float16, np.float32, and np.float64 be enough?

Contributor:
Yes
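A possible way to cover all three at once (a hypothetical addition to this test class, not part of the PR as written; float16 needs a looser tolerance since it carries only about three significant digits):

    def test_dtypes(self):
        path = datapath('test.kv.txt')
        for dt in (np.float16, np.float32, np.float64):
            kv = KeyedVectors.load_word2vec_format(path, binary=False, datatype=dt)
            # float16 rounds the stored value, hence the loose delta
            self.assertAlmostEqual(kv['horse.n.01'][0], -0.0008546282343595379, delta=1e-5)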

        self.assertAlmostEqual(kv['horse.n.01'][0], -0.0008546282343595379)
        self.assertEqual(kv['horse.n.01'][0].dtype, np.float64)

Contributor:
Another test to verify that the type of a value in the loaded array matches the datatype passed to load_word2vec_format would explicitly confirm the new behaviour.
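Such a check might look like this (a sketch of the reviewer's suggestion, reusing the test data above; the method name is made up):

    def test_dtype_matches_parameter(self):
        path = datapath('test.kv.txt')
        kv = KeyedVectors.load_word2vec_format(path, binary=False, datatype=np.float16)
        # The whole loaded matrix, not just one value, should carry the requested type.
        self.assertEqual(kv.syn0.dtype, np.float16)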


if __name__ == '__main__':
    logging.root.setLevel(logging.WARNING)
    unittest.main()