Optimize FastText.load_fasttext_model #2340

Merged
merged 35 commits into piskvorky:develop
Jan 24, 2019

Conversation

mpenkov
Collaborator

@mpenkov mpenkov commented Jan 19, 2019

Should fix #1261

@menshikh-iv changed the title from "[WIP] Fb improv" to "Optimize FastText.load_fasttext_model" on Jan 19, 2019
@menshikh-iv changed the title from "Optimize FastText.load_fasttext_model" to "[WIP] Optimize FastText.load_fasttext_model" on Jan 19, 2019
Contributor

@menshikh-iv menshikh-iv left a comment

Thanks @mpenkov, what's still missing

@@ -704,6 +708,14 @@ def train(self, sentences=None, corpus_file=None, total_examples=None, total_wor
>>> model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

"""
cant_train = hasattr(self.trainables, 'syn1neg') and self.trainables.syn1neg is None
Contributor

Stupid question: what if self.trainables does not have a syn1neg attribute at all? Can the model still train?

Collaborator Author

Yes. I don't see any other code that sets syn1neg to None. So, the new code uses that value to mean "cannot continue training".

If trainables does not have syn1neg at all, it is possible to start training.
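
The three cases discussed here can be sketched as follows. This is only an illustration: the two stand-in classes and the `cant_train` helper name are hypothetical, and only the `hasattr(...) and ... is None` expression is taken from the diff above.

```python
class NoSyn1neg:
    """Hypothetical stand-in: trainables without a syn1neg attribute."""


class WithSyn1neg:
    """Hypothetical stand-in: trainables whose syn1neg may be None."""

    def __init__(self, syn1neg):
        self.syn1neg = syn1neg


def cant_train(trainables):
    # Same expression as in the diff: training is blocked only when the
    # attribute exists AND has been explicitly set to None.
    return hasattr(trainables, 'syn1neg') and trainables.syn1neg is None


print(cant_train(NoSyn1neg()))           # False: attribute missing, training may start
print(cant_train(WithSyn1neg(None)))     # True: attribute is None, cannot continue training
print(cant_train(WithSyn1neg([[0.0]])))  # False: weights present
```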

@mpenkov
Collaborator Author

mpenkov commented Jan 24, 2019

I benchmarked the model loading in this PR against 3.7.0 using:

from gensim.models import FastText
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(filename)s:%(lineno)s - %(message)s')
m = FastText.load_fasttext_format("cc.ru.300.bin")

Before: 13 min
After: 2 min 20 s
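
For anyone who wants to repeat the measurement without reading timestamps from the log, a minimal timing sketch might look like this. The `time_call` and `speedup` helpers are hypothetical names, not part of gensim; the final line only restates the two figures reported above as a ratio.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(before_seconds, after_seconds):
    """How many times faster the 'after' run is than the 'before' run."""
    return before_seconds / after_seconds


# Figures reported in this comment: 13 min before, 2 min 20 s after.
print(round(speedup(13 * 60, 2 * 60 + 20), 1))  # 5.6
```

With gensim installed, the load itself could be timed as `model, seconds = time_call(FastText.load_fasttext_format, "cc.ru.300.bin")`.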

fb-improv

370.prof.gz

improv.prof.gz

@menshikh-iv changed the title from "[WIP] Optimize FastText.load_fasttext_model" to "Optimize FastText.load_fasttext_model" on Jan 24, 2019
@menshikh-iv
Contributor

Awesome, thanks @mpenkov and @horpto 👍

@menshikh-iv menshikh-iv merged commit 411f546 into piskvorky:develop Jan 24, 2019
@mpenkov
Collaborator Author

mpenkov commented Jan 24, 2019

We're still considerably slower than the FB app:

(improv.venv) mpenkov@hetrad2:~$ cat words.txt 
команда
маленьких
друзей
возит
грузы
всех
быстрей
(improv.venv) mpenkov@hetrad2:~$ cat bench.py 
from gensim.models import FastText
import sys

m = FastText.load_fasttext_format("cc.ru.300.bin", full_model=False)
for line in sys.stdin:
    word = line.rstrip()
    print(word, m.wv[word])
(improv.venv) mpenkov@hetrad2:~$ time cat words.txt | python bench.py > /dev/null

real    2m31.777s
user    1m50.864s
sys     0m41.336s
(improv.venv) mpenkov@hetrad2:~$ time cat words.txt | fasttext print-word-vectors cc.ru.300.bin > /dev/null

real    0m14.301s
user    0m2.812s
sys     0m11.480s
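
To put a number on "considerably slower", the two `real` values above can be converted to seconds and compared. This is a small sketch; `parse_real_time` is a hypothetical helper that only handles the `NmS.SSSs` form shown in the transcript.

```python
import re


def parse_real_time(text):
    """Convert a `time`-style duration such as '2m31.777s' to seconds."""
    match = re.fullmatch(r'(\d+)m(\d+(?:\.\d+)?)s', text.strip())
    if match is None:
        raise ValueError('unrecognized duration: %r' % text)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


gensim_real = parse_real_time('2m31.777s')    # bench.py run from the transcript
fasttext_real = parse_real_time('0m14.301s')  # fasttext print-word-vectors run
print(round(gensim_real / fasttext_real, 1))  # 10.6
```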

@piskvorky
Owner

piskvorky commented Jan 24, 2019

@mpenkov how much of that time is loading vs access? (in gensim and in fb)

EDIT: n/m, I see there are just a few words accessed, so this must be all loading.

@menshikh-iv
Contributor

@piskvorky you are right, all of that time is loading; retrieving a vector for a word is fast.

Note: I guess the reason is the "retrieve vectors for vocab" step (mostly adjust_vector) during loading. FB doesn't do that (it constructs all vectors on the fly and doesn't precompute them for the vocab), but we do.
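
The trade-off described in this note can be illustrated with a toy sketch. It is neither gensim's nor FB's actual code: every name here is hypothetical, CRC32 stands in for the real n-gram hash, the table sizes are tiny, and a word vector is simplified to the plain average of its n-gram bucket vectors.

```python
import zlib

BUCKETS = 1000  # toy number of hash buckets
DIM = 4         # toy vector dimensionality


def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word padded with '<' and '>'."""
    padded = '<' + word + '>'
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]


def bucket(ngram):
    # Toy hash (CRC32), NOT the hash function FastText or gensim use.
    return zlib.crc32(ngram.encode('utf-8')) % BUCKETS


def compose(word, ngram_table):
    """Average the bucket vectors of the word's n-grams."""
    rows = [ngram_table[bucket(g)] for g in char_ngrams(word)]
    if not rows:
        return [0.0] * DIM
    return [sum(col) / len(rows) for col in zip(*rows)]


def load_eager(vocab, ngram_table):
    # Pay for every vocabulary word up front, at load time.
    return {word: compose(word, ngram_table) for word in vocab}


class LazyVectors:
    """Pay only for the words that are actually requested."""

    def __init__(self, ngram_table):
        self.ngram_table = ngram_table
        self.composed = 0  # how many vectors were built so far

    def __getitem__(self, word):
        self.composed += 1
        return compose(word, self.ngram_table)


# Deterministic toy n-gram table and the words from words.txt above.
table = [[((b * DIM + d) % 7) / 7.0 for d in range(DIM)] for b in range(BUCKETS)]
vocab = ['команда', 'маленьких', 'друзей', 'возит', 'грузы', 'всех', 'быстрей']

eager = load_eager(vocab, table)  # composes all 7 vectors immediately
lazy = LazyVectors(table)         # composes nothing yet
print(lazy['грузы'] == eager['грузы'])
print(lazy.composed, len(eager))
```

With a vocabulary of a few words the difference is invisible, but the eager path scales with vocabulary size at load time, while the lazy path scales with the number of lookups.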

@mpenkov mpenkov deleted the fb-improv branch June 26, 2020 01:58
Development

Successfully merging this pull request may close these issues.

Improve FastText loading times
4 participants