Loading fastText models using only bin file #1341

Merged: 34 commits merged into piskvorky:develop on Jun 28, 2017

Conversation

prakhar2b (Contributor):

Solves #1236. The mismatch between the .bin and .vec files when loading the wiki.fr pretrained model is resolved in this PR.

@prakhar2b (Contributor, Author):

@tmylk the error is about whitespace in the .vec file that I uploaded.
@jayantj As of now, this code reads the additional word (vocab_size is one more than nwords) but skips loading any information about it into the dictionary, since that seems undesirable. Should we consider loading that information?

@jayantj (Contributor) commented May 25, 2017:

Thanks for the PR @prakhar2b
As discussed, I believe the ideal fix for this would be loading FastText models from the .bin file only. It wouldn't require us to handle special cases in code, as is the case with the current fix.

Related to issue #1261

@prakhar2b (Contributor, Author) commented May 25, 2017:

@jayantj yes, it was supposed to be only a workaround for now. As for issue #1261, please look at this comment as well.

@prakhar2b changed the title from "Loading french pretrained vector issue resolved" to "Loading fastText models using only bin file" on Jun 1, 2017
from gensim.models.word2vec import Word2Vec

from six import string_types

from numpy import exp, log, dot, zeros, outer, random, dtype, float32 as REAL,\
double, uint32, seterr, array, uint8, vstack, fromstring, sqrt, newaxis,\
Contributor:

Do we need all these imports?

@@ -224,7 +228,7 @@ def load_word2vec_format(cls, *args, **kwargs):
return FastTextKeyedVectors.load_word2vec_format(*args, **kwargs)

@classmethod
def load_fasttext_format(cls, model_file, encoding='utf8'):
def load_fasttext_format(cls, model_file, bin_only = False, encoding='utf8'):
jayantj (Contributor), Jun 5, 2017:

Please add bin_only to the docstring

if not bin_only:
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
if len(self.wv.vocab) != vocab_size:
logger.warnings("If you are loading any model other than pretrained vector wiki.fr, ")
Contributor:

logger.warning?

self.wv.syn0 = zeros((vocab_size, self.vector_size), dtype=REAL)
logger.info("here?")
# TO-DO : how to update this
ntokens= self.struct_unpack(file_handle, '@1q')
jayantj (Contributor), Jun 5, 2017:

Please take care of code style: space before =

logger.warnings("If you are loading any model other than pretrained vector wiki.fr, ")
logger.warnings("Please report to gensim or fastText.")
else:
self.wv.syn0 = zeros((vocab_size, self.vector_size), dtype=REAL)
Contributor:

This doesn't belong here; we need to be loading the vectors for the in-vocab words after loading the ngram weight matrix (syn0_all).

assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
self.struct_unpack(file_handle, '@1q') # number of tokens
if not bin_only:
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
Contributor:

Let's also log vocab_size/nwords/len(self.wv.vocab) in case of mismatches.
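For illustration, a sketch of the kind of diagnostic being asked for (it assumes the vocab_size, nwords and self.wv.vocab variables from the surrounding diff; the message wording is illustrative):

logger.warning(
    "vocab size mismatch: vocab_size=%d, nwords=%d, len(self.wv.vocab)=%d",
    vocab_size, nwords, len(self.wv.vocab))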

@@ -298,8 +312,24 @@ def load_dict(self, file_handle, encoding='utf8'):
char_byte = file_handle.read(1)
word = word_bytes.decode(encoding)
count, _ = self.struct_unpack(file_handle, '@qb')
assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
self.wv.vocab[word].count = count
if bin_only and word not in self.wv.vocab:
Contributor:

This seems unnecessarily convoluted - can the word be in self.wv.vocab if we're loading from bin file only?

if bin_only:
if self.wv.syn0.shape[0] != len(self.wv.vocab):
logger.info(
"duplicate words detected, shrinking matrix size from %i to %i",
jayantj (Contributor), Jun 5, 2017:

Not sure I understand the need for this. Can duplicate words exist?

""" Test model succesfully loaded from fastText (new format) .bin files only """
new_model = fasttext.FastText.load_fasttext_format(self.test_new_model_file, bin_only = True)
vocab_size, model_size = 1763, 10
self.assertEqual(self.test_new_model.wv.syn0.shape, (vocab_size, model_size))
Contributor:

Shouldn't we be testing new_model and not self.test_new_model?

word_vec = np.zeros(self.wv.syn0.shape[1])

num_word_ngram_vectors = len(word_ngrams)
for word_ngram in word_ngrams:
jayantj (Contributor), Jun 6, 2017:

This seems redundant and possibly error prone.
Instead you could simply use self.word_vec(word) to obtain the syn0 weights for the in-vocab words, after the syn0_all trimming takes place. Could you please try that, and see if it initializes the correct values for self.syn0?

Contributor:

Okay, so from looking at the FastText code, it seems this is more complicated than what we originally thought.
The final in-vocab word vectors (aka the syn0 vectors, the vectors in the .vec file) are obtained by summing up -

  1. all vectors for its component ngrams
  2. the vector for the original word itself

The large weight matrix in the .bin file contains -

  1. vectors for the original words in the first len(vocab) rows
  2. in the remaining rows, weights for component ngrams

This will require a change in logic then.

Possibly useful reference - https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc#L71
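To make this concrete, a minimal sketch of the reconstruction just described (illustrative only, not the gensim API; it assumes the word's own row and its component-ngram rows have already been looked up in the .bin weight matrix, and averages them the same way the diff later in this thread does):

import numpy as np

def reconstruct_in_vocab_vector(word_row, ngram_rows):
    # word_row: the word's own row, from the first len(vocab) rows of the matrix
    # ngram_rows: the rows for its component ngrams, from the remaining rows
    vec = np.array(word_row, dtype=np.float32)
    for row in ngram_rows:
        vec = vec + np.array(row, dtype=np.float32)
    return vec / (len(ngram_rows) + 1)  # average over the word and its ngrams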

@@ -284,12 +285,12 @@ def load_model_params(self, file_handle):
def load_dict(self, file_handle, encoding='utf8'):
vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i')
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
logger.info("loading projection weights")
Contributor:

This message doesn't seem right, we're loading the vocabulary here, not the weights.

Contributor (Author):

ok, I tried to replicate the logging that we were getting earlier.
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L204

Contributor:

Yes, but here we aren't loading exactly the same thing as in the case linked above - in that case, we're loading the vocab words along with their vectors, here we're simply building a dictionary.

assert self.wv.vocab[word].index == i, 'mismatch between gensim word index and fastText word index'
self.wv.vocab[word].count = count

if word != "__label__":
Contributor:

Please add a note briefly clarifying this, since it's a hack-ish fix and likely to confuse any future developers.

self.wv.vocab[word].count = count

if word != "__label__":
self.wv.vocab[word] = Vocab(index=i, count=count)
Contributor:

Is this correct? The word "__label__" doesn't have a corresponding vector in the weight matrix, if I understand our previous discussion correctly. As a result, any of the words after the "__label__" word will have indices shifted.

Contributor (Author):

Actually, you were right about it: this is the last word, but we can't skip reading it, otherwise the subsequent byte reads will be off.

jayantj (Contributor), Jun 18, 2017:

Yes, I understand we have to read the bytes, agreed.
In that case, I'd prefer something more explicit and clear, like -

if i == nwords and i < vocab_size:
    assert word == "__label__"
    continue   # don't add word to vocab
   

self.wv.syn0[self.wv.vocab[w].index] += np.array(ngram_weights[self.wv.ngrams[word_ngram]])

self.wv.syn0[self.wv.vocab[w].index] /= (len(word_ngrams) + 1)
logger.info("loaded %s matrix from %s" % (self.wv.syn0.shape))
Contributor:

This message seems inconsistent with self.wv.syn0.shape

Contributor:

Okay, but you're passing in a tuple, so the message is going to look something like loaded 30000 matrix from 100. Please update the message.

piskvorky (Owner), Jun 18, 2017:

Don't use % to format the string at all; instead, pass the arguments as function arguments.
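For example, the quoted line could pass the values lazily instead (a sketch; the message wording is illustrative):

logger.info(
    "loaded weight matrix of shape %s for fastText model from %s",
    self.wv.syn0.shape, self.file_name)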

for w, v in self.wv.vocab.items():
all_ngrams += self.compute_ngrams(w, self.wv.min_n, self.wv.max_n)
self.wv.syn0[self.wv.vocab[w].index] += np.array(self.wv.syn0_all[self.wv.vocab[w].index])
Contributor:

v instead of self.wv.vocab[w]? Please rename the variable to something more verbose when making this change.

Contributor (Author):

ok, vocab should be fine?

@jayantj (Contributor) commented Jun 18, 2017:

Good point, let's remove this and any other unused code.
Please also remove the .vec files from the tests, as we only need the .bin files now.


assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
if len(self.wv.vocab) != vocab_size:
logger.warning("If you are loading any model other than pretrained vector wiki.fr, ")
Contributor:

Any particular reason for two separate warning statements? Why not a single one?

@@ -363,13 +377,15 @@ def init_ngrams(self):

ngram_weights = self.wv.syn0_all

for w, v in self.wv.vocab.items():
logger.info("loading vocabulary weights")
Owner:

Try to make the INFO messages more informative: loading how many weights, loading from where etc.

When there's a problem, it really helps when the log contains concrete numbers and messages. The problem often becomes apparent even without debugging.
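A concrete example of the kind of message being asked for, roughly along the lines the later revisions in this PR adopt (sketch only; it assumes self.file_name holds the model path):

logger.info(
    "loading %d vocabulary words for fastText model from %s.bin",
    len(self.wv.vocab), self.file_name)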

@@ -220,6 +220,10 @@ def save(self, *args, **kwargs):
super(FastText, self).save(*args, **kwargs)

@classmethod
def load_word2vec_format(cls, *args, **kwargs):
return FastTextKeyedVectors.load_word2vec_format(*args, **kwargs)
Collaborator:

I believe that a load using this method only learns the full-word vectors as in the .vec file. If so, isn't it true that the resulting object doesn't have any other capabilities beyond a plain KeyedVectors? In that case, using a specialized class like FastTextKeyedVectors – that maybe is trying to do more, such as ngram-tracking, but inherently is not because that info was lost in the sequence-of-steps used to load it – seems potentially misleading. So unless I'm misunderstanding, I think this load-technique should use a plain KeyedVectors.

Contributor (Author):

yes, this method is not used now for loading from the bin file only. I removed this unused code, but got a strange flake8 error for Python 3+, so I re-added it for this PR. I'll try removing this unused code later, maybe in a different PR. @gojomo

Collaborator:

That is an odd error! I suspect it's not really the presence/absence of that method that triggered it, but something else either random or hidden in the whitespace.

Contributor (Author):

@gojomo ok, test passed this time after removing this code 😄

jayantj (Contributor), Jun 28, 2017:

For reference, this was a bug in the flake8 script, fixed in cebb3fc


"""
model = cls()
model.wv = cls.load_word2vec_format('%s.vec' % model_file, encoding=encoding)
model.file_name = model_file
model.load_binary_data('%s.bin' % model_file, encoding=encoding)
Contributor:

I wonder if it would be a good idea to allow taking the whole model filename (including the .bin) as valid input - with the latest changes, we're loading from the bin file only, so it makes intuitive sense for the entire filename to be valid input.
The only reason IMO we're still allowing the filename without extension as valid input is backward compatibility.
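A minimal sketch of what accepting both forms could look like (the helper name is hypothetical; the PR ultimately handles this inside the wrapper, per the "input file name with or without .bin both accepted" commit listed below):

def normalize_fasttext_path(model_file):
    # Accept either 'wiki.fr' or 'wiki.fr.bin' and return the path without the
    # extension, so that '%s.bin' % path resolves to the binary file either way.
    if model_file.endswith('.bin'):
        return model_file[:-len('.bin')]
    return model_file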

@@ -284,12 +285,12 @@ def load_model_params(self, file_handle):
def load_dict(self, file_handle, encoding='utf8'):
vocab_size, nwords, _ = self.struct_unpack(file_handle, '@3i')
# Vocab stored by [Dictionary::save](https://github.com/facebookresearch/fastText/blob/master/src/dictionary.cc)
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
logger.info("loading vocabulary words for fastText model from %s.bin", self.file_name)
Contributor:

Adding the number of words would be helpful in this logging statement.


if i == nwords and i < vocab_size:
"""
To handle the error in pretrained vector wiki.fr (French).
jayantj (Contributor), Jun 22, 2017:

This format -

"""
Some comments
"""

is generally reserved for docstrings. Regular comments would be preferable.

if len(self.wv.vocab) != vocab_size:
logger.warning("mismatch between vocab sizes")
logger.warning("If you are loading any model other than pretrained vector wiki.fr, ")
logger.warning("Please report to Gensim.")
Contributor:

Please change the multiple warning statements to a single concatenated statement.

piskvorky (Owner), Jun 22, 2017:

Also, "Please report to Gensim" is vague, and probably unnecessary.

If people encounter bugs or get exceptions, they'll let us know, don't worry about that.

I'd prefer if the logging message was more concrete instead: what mismatch, what are the mismatched "vocab sizes"?
We want the logs as concrete and useful as possible, for our own sanity (remote debugging via mailing list etc).

Contributor (Author):

@jayantj I remember that a few weeks ago I received a review comment from @piskvorky saying that a concatenated statement would contain whitespace, so it was better to split it into multiple statements. Correct me if I misunderstood that comment.

jayantj (Contributor), Jun 22, 2017:

I believe @piskvorky meant to split the string itself across multiple lines, like this -

logger.warning(
	"mismatch between vocab sizes "
	"If you are loading any model other than pretrained vector wiki.fr, "
	"Please report to Gensim.")

He left a comment later in the PR clarifying it too.

Contributor (Author):

ohh, thanks for clarifying 😄

@@ -348,6 +370,18 @@ def init_ngrams(self):
self.wv.ngrams[ngram] = i
self.wv.syn0_all = self.wv.syn0_all.take(ngram_indices, axis=0)

ngram_weights = self.wv.syn0_all

logger.info("loading weights for %s vocabulary words for fastText models from %s.bin", len(self.wv.vocab), self.file_name)
Contributor:

Typo: fastText model (not models)

@@ -115,7 +114,7 @@ def testNormalizedVectorsNotSaved(self):
self.assertTrue(loaded_kv.syn0_all_norm is None)

def testLoadFastTextFormat(self):
"""Test model successfully loaded from fastText .vec and .bin files"""
"""Test model successfully loaded from fastText .bin files"""
Contributor:

Typo here and below: .bin file (not files)

@jayantj (Contributor) commented Jun 28, 2017:

Pushed some changes to model filename handling and logging.
Also, a minor fix to the flake8 diff script that was causing the weird utf8 error in python3+ discussed above.
IMO, this PR can be merged now.

@menshikh-iv merged commit e1d89c2 into piskvorky:develop on Jun 28, 2017
@prakhar2b mentioned this pull request on Jul 3, 2017
saparina pushed a commit to saparina/gensim that referenced this pull request on Jul 9, 2017 (…y#1341):

* french wiki issue resolved

* bin and vec mismatch handled

* added test from bin only loading

* [WIP] loading bin only

* word vec from its ngrams

* [WIP] word vec from ngrams

* [WIP] getting syn0 from all n-grams

* [TDD] test comparing word vector from bin_only and default loading

* cleaned up test code

* added docstring for bin_only

* resolved wiki.fr issue

* pep8 fixes

* default bin file loading only

* logging info modified plus changes a/c review

* removed unused code in fasttext.py

* removed unused codes and vec files from test

* added lee_fasttext vec files again

* re-added removed files and unused codes

* added file name in logging info

* removing unused load_word2vec_format code

* updated logging info and comments

* input file name with or without .bin both accepted

* resolved typo mistake

* test for file name

* minor change to input filename handling in ft wrapper

* changes to logging and assert messages, pep8 fixes

* removes redundant .vec files

* fixes utf8 bug in flake8_diff.sh script