
Improve speed of FastTextKeyedVectors __contains__ #1499

Merged: 3 commits into piskvorky:develop, Sep 11, 2017

Conversation

@ELind77 (Contributor) commented Jul 23, 2017

The current implementation of __contains__ in FastTextKeyedVectors is O(n*m), where n is the number of character ngrams in the query word and m is the size of the vocabulary. This is very slow for large corpora. The new implementation is O(n).

I was working with the Wikipedia-trained vectors provided by Facebook, which have a vocabulary of a few million words, and creating averaged vectors from sentences was extremely slow. I looked into the code and found that __contains__ was implemented inefficiently, so I replaced it with an implementation that is algorithmically faster and short-circuits to give good average-case complexity.
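For context, a rough sketch of the two membership checks (the pre-existing code may have differed in detail; self.ngrams is the dict of known character ngrams):

    # Old, O(n*m)-style check: materializes every key of self.ngrams in a new
    # set before intersecting, so each query touches the whole ngram vocabulary.
    word_ngrams = set(FastText.compute_ngrams(word, self.min_n, self.max_n))
    found = len(word_ngrams & set(self.ngrams.keys())) > 0

    # New check: each membership test is an average O(1) dict lookup, and
    # any() short-circuits on the first ngram that is found.
    word_ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
    found = any(ng in self.ngrams for ng in word_ngrams)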

Testing

  • No new tests were required to test this implementation change
  • All tests in test_fasttext_wrapper.py pass when run from PyCharm

Eric Lind added 2 commits July 22, 2017 15:49
    # FastTextKeyedVectors.__contains__ after this change:
    if word in self.vocab:
        return True
    else:
        word_ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
        return any(ng in self.ngrams for ng in word_ngrams)
@piskvorky (Owner)
Thanks! The original was a bizarre construct indeed.

How often can we expect compute_ngrams to return duplicates? If it's a non-trivial number, iterating over set(word_ngrams) would be better.
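The variant being discussed would change only the last line of the snippet above, trading one extra set allocation for deduplicated lookups:

    # Deduplicate first; only a win if compute_ngrams returns many repeats.
    return any(ng in self.ngrams for ng in set(word_ngrams))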

@ELind77 (Contributor, Author)

It's difficult to say. As the character ngrams get smaller there is a greater chance of repeats, and inputs may not always be English. Do character ngrams follow a Zipfian distribution? Even if they do, is the standard-library loop in the set constructor really that much faster than iterating in pure Python? Since the list of character ngrams is bound to be relatively small and we're already looping over all of the ngrams, it doesn't seem like making them into a set would gain us anything.

When I played around with this before submitting the PR I actually memoized both __contains__ and compute_ngrams to get more speed out of it, but I didn't think that would be in line with gensim's goal of memory independence.
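A minimal sketch of that kind of memoization, assuming Python 3's functools.lru_cache and a hypothetical module-level wrapper (not what was submitted in this PR):

    from functools import lru_cache  # functools32 backport needed on Python 2

    @lru_cache(maxsize=100000)
    def cached_ngrams(word, min_n, max_n):
        # Assumes FastText from the surrounding fasttext wrapper module.
        # Returning a tuple keeps the cached value immutable; frequent words
        # skip recomputation entirely, at the cost of extra memory.
        return tuple(FastText.compute_ngrams(word, min_n, max_n))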

Also, I think word_ngrams should become character_ngrams.

@piskvorky (Owner) commented Jul 25, 2017

Agreed. Since dict.__contains__ is fast (relative to allocating and filling a new set object), and the expected duplication rate is low (I think), the current approach is probably for the best.

IMO the set optimization should come from compute_ngrams() itself, to avoid creating the intermediate list. An iterator over (unique?) ngrams, e.g. unique_ngrams(), would be ideal -- no full explicit list or set needed (since we short-circuit anyway). An optimized inlining of compute_ngrams into FastText.__contains__ seems the simplest & most efficient solution (starting the iteration from the smallest/most frequent ngrams, since these are the most likely to exist => early out). CC @jayantj
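A rough sketch of that generator idea, with iter_ngrams as a hypothetical helper (not part of gensim), yielding the shortest ngrams first so any() can bail out early:

    def iter_ngrams(word, min_n, max_n):
        # Yield character ngrams lazily, shortest first, so callers that
        # short-circuit (like any()) never build a full list or set.
        extended = '<%s>' % word
        for ngl in range(min_n, min(len(extended), max_n) + 1):
            for i in range(len(extended) - ngl + 1):
                yield extended[i:i + ngl]

    # __contains__ could then end with:
    # return any(ng in self.ngrams for ng in iter_ngrams(word, self.min_n, self.max_n))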

@jayantj (Contributor) commented Jul 25, 2017

compute_ngrams used to return a set of unique ngrams. The reason this was changed is that the original FastText implementation sums up the vectors for all character ngrams to obtain a word vector (so ngrams occurring more than once are actually added multiple times), and we wanted to replicate FastText behaviour as closely as possible.
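In other words, an out-of-vocabulary word's vector is assembled roughly like this (a simplified sketch; ngram_vectors is a placeholder for the model's ngram-to-vector mapping):

    import numpy as np

    def oov_word_vector(word, ngram_vectors, min_n, max_n):
        # Every occurrence of a character ngram is added, so a duplicated
        # ngram contributes multiple times, matching the behaviour described above.
        ngrams = FastText.compute_ngrams(word, min_n, max_n)
        vecs = [ngram_vectors[ng] for ng in ngrams if ng in ngram_vectors]
        return np.sum(vecs, axis=0) if vecs else None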

An iterator would probably be ideal, and the char ngrams are already returned in increasing order of length, so the any(ng in self.ngrams for ng in word_ngrams) should be fairly efficient. Returning them in decreasing order of frequency would be non-trivial though (partly because we don't even store ngram counts anyway), and probably unnecessary. It is possible to create a new method to iterate over unique ngrams only, but I don't think it would result in a significant gain in speed.

I'm not sure why the previous construct was the way it was - I can't think of or remember any good reason for it. There is already a very similar snippet in the word_vec method:

    ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
    ngrams = [ng for ng in ngrams if ng in self.ngrams]

So I'm not sure why I didn't already stick to that. Thanks a lot for the fix.

@ELind77 (Contributor, Author) commented Jul 29, 2017

I changed a variable name and the docstring to improve clarity but didn't change anything else.

This is what a hyper-optimized inlined ngrams function would look like:

    def __contains__(self, word):
        """
        Check if `word` or any character ngrams in `word` are present in the vocabulary.
        A vector for the word is guaranteed to exist if `__contains__` returns True.
        """
        if word in self.vocab:
            return True
        else:
            # Inline creation of character ngrams for speed
            extended_word = '<{}>'.format(word)
            n_chars = len(extended_word)
            min_n = self.min_n
            max_len = min(n_chars, self.max_n) + 1
            char_ngrams = {extended_word[i:i+ngl] for ngl in xrange(min_n, max_len)
                           for i in xrange(0, n_chars - ngl + 1)}

            return any(ng in self.ngrams for ng in char_ngrams)

Is this something we are comfortable having in the codebase? I could benchmark it over the Wikipedia fastText vocabulary, but before I do, I'd like some feedback on whether this code is at a sufficiently high level of clarity for the project.
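A simple way to run such a benchmark, sketched with timeit (kv and words are placeholders for the loaded FastTextKeyedVectors and a sample of query words):

    import timeit

    # Mix in-vocabulary and OOV words to exercise both branches of __contains__.
    elapsed = timeit.timeit(lambda: [w in kv for w in words], number=10)
    print('10 passes over %d words: %.2f s' % (len(words), elapsed))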

Also, I used xrange here out of habit, but in doing so I noticed that there don't seem to be any imports from six in this module, and there are some things, like FileNotFoundError, that I don't think are compatible with Py2. Are there compatibility tests for this module? Should I start a different PR for that?
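For reference, the usual six-based way to keep the xrange usage portable across Python 2 and 3 (a side note, not part of this PR):

    from six.moves import xrange  # plain range on Python 3, xrange on Python 2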

@ELind77 (Contributor, Author) commented Jul 29, 2017

This could be even faster if "lazy set generation" were a thing in Python, but I think that might be overdoing it. At that point I think an LFU cache would actually be better; we could be sure it would see use by virtue of Zipf's law.

@menshikh-iv (Contributor)

Thank you @ELind77

@menshikh-iv merged commit 224566c into piskvorky:develop on Sep 11, 2017