Improve speed of FastTextKeyedVectors __contains__ #1499

Merged: 3 commits, Sep 11, 2017
7 changes: 2 additions & 5 deletions gensim/models/wrappers/fasttext.py
@@ -123,11 +123,8 @@ def __contains__(self, word):
         if word in self.vocab:
             return True
         else:
-            word_ngrams = set(FastText.compute_ngrams(word, self.min_n, self.max_n))
-            if len(word_ngrams & set(self.ngrams.keys())):
-                return True
-            else:
-                return False
+            word_ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
+            return any(ng in self.ngrams for ng in word_ngrams)
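For context, compute_ngrams returns the character ngrams of a word, wrapped in the < and > boundary markers that FastText uses. A simplified sketch of its behaviour (illustrative only, not gensim's exact code):

    def compute_ngrams_sketch(word, min_n, max_n):
        # FastText wraps each word in angle-bracket markers before extracting
        # ngrams, so 'cat' (with min_n = max_n = 3) yields '<ca', 'cat', 'at>'.
        extended = '<' + word + '>'
        return [
            extended[i:i + n]
            for n in range(min_n, max_n + 1)   # shortest lengths first
            for i in range(len(extended) - n + 1)
        ]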
piskvorky (Owner) commented:
Thanks! The original was a bizarre construct indeed.

How often can we expect compute_ngrams to contain duplicates? If it's a non-trivial number, iterating over set(word_ngrams) might be better.

The PR author (Contributor) replied:

It's difficult to say. As the character ngrams get smaller, there is a greater chance of repeats, and inputs may not always be English. Do character ngrams follow a Zipfian distribution? Even if they do, is the standard library's loop in the set() constructor really that much faster than iterating in pure Python? Since the list of character ngrams is bound to be relatively small, and we're already looping over all of the ngrams, it doesn't seem like making them into a set would gain us anything.
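To get a feel for the duplication rate, one can count repeats directly; duplicate_rate below is a hypothetical helper using the same boundary-marker convention:

    from collections import Counter

    def duplicate_rate(word, min_n=3, max_n=6):
        # How many character ngrams of '<word>' occur more than once?
        extended = '<' + word + '>'
        ngrams = [
            extended[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(extended) - n + 1)
        ]
        counts = Counter(ngrams)
        return len(ngrams) - len(counts), len(ngrams)

    print(duplicate_rate('mississippi'))  # a handful of repeats, e.g. 'iss', 'ssi'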

When I played around with this before submitting the PR, I actually memoized both __contains__ and compute_ngrams to get more speed out of it, but I didn't think that would be in line with gensim's goal of memory independence.
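A minimal sketch of that memoization experiment (hypothetical, not part of this PR; an unbounded cache trades memory for speed, which is the concern mentioned above):

    from functools import lru_cache

    from gensim.models.wrappers.fasttext import FastText

    @lru_cache(maxsize=None)  # cache grows with every distinct (word, min_n, max_n) query
    def cached_ngrams(word, min_n, max_n):
        # tuple() so the cached value is immutable
        return tuple(FastText.compute_ngrams(word, min_n, max_n))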

Also, I think word_ngrams should become character_ngrams.

piskvorky (Owner) commented on Jul 25, 2017:

Agreed. Since dict.__contains__ is fast (relative to allocating and filling a new set object), and the expected duplication rate is low (I think), the current approach is probably for the best.

IMO the set optimization should come from compute_ngrams() itself, to avoid creating the intermediate list. An iterator over (unique?) ngrams, say a unique_ngrams() generator, would be ideal -- no full explicit list or set needed (since we short-circuit anyway). An optimized inlining of compute_ngrams into FastText.__contains__ seems the simplest & most efficient solution (starting the iteration from the smallest/most frequent ngrams, since these are the most likely to exist => early out). CC @jayantj
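A sketch of that inlined, short-circuiting variant (illustrative only; it assumes the < and > boundary markers and iterates shortest ngrams first):

    def __contains__(self, word):
        if word in self.vocab:
            return True
        # Generate ngrams lazily, shortest first, and stop at the first one
        # found in self.ngrams -- no intermediate list or set is built.
        extended = '<' + word + '>'
        for n in range(self.min_n, self.max_n + 1):
            for i in range(len(extended) - n + 1):
                if extended[i:i + n] in self.ngrams:
                    return True
        return False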

jayantj (Contributor) commented on Jul 25, 2017:

compute_ngrams used to return a set of unique ngrams. The reason this was changed is that the original FastText implementation sums up the vectors for all character ngrams to obtain a word vector (so ngrams occurring more than once are added up multiple times), and we wanted to replicate FastText's behaviour as closely as possible.
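To make the duplicate-summing point concrete, a rough sketch of how an out-of-vocabulary vector is assembled (simplified; gensim's actual word_vec differs in details):

    import numpy as np

    from gensim.models.wrappers.fasttext import FastText

    def oov_vector(word, ngram_vectors, min_n, max_n):
        # Duplicated ngrams are deliberately kept: each occurrence contributes
        # to the sum again, mirroring the original FastText implementation.
        ngrams = FastText.compute_ngrams(word, min_n, max_n)
        hits = [ngram_vectors[ng] for ng in ngrams if ng in ngram_vectors]
        if not hits:
            raise KeyError('all ngrams for word absent from model')
        return np.sum(hits, axis=0) / len(hits)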

An iterator would probably be ideal, and the char ngrams are already returned in increasing order of length, so the any(ng in self.ngrams for ng in word_ngrams) check should be fairly efficient. Returning them in decreasing order of frequency would be non-trivial, though (partly because we don't even store ngram counts), and probably unnecessary. It is possible to create a new method to iterate over unique ngrams only, but I don't think it would result in a significant speed gain.

I'm not sure why the previous construct was the way it was - I can't think of or remember any good reason for it. There is already a very similar snippet in the word_vec method:

    ngrams = FastText.compute_ngrams(word, self.min_n, self.max_n)
    ngrams = [ng for ng in ngrams if ng in self.ngrams]

So I'm not sure why I didn't just stick to that. Thanks a lot for the fix.


