Fix method `estimate_memory` from `gensim.models.FastText` & huge performance improvement. Fix #1824 #1916

jbaiter · 2018-02-19T17:09:46Z

This PR is an attempt to optimize the memory usage of the FastText model and to provide a more accurate FastText.estimate_memory method.

Specifically, it implements the following improvements:

Cythonize the ft_hash function
Cythonize the compute_ngrams function
Do not pre-compute and store the ngrams with the model, calculate them on the fly when needed
Do not store ngrams with the model, completely rely on the ngram hashes

In its current state this PR does not merge and does not pass the test suite, due to these issues:

~~The improvements were done before the refactoring of the word embedding models, likely some code will have to be moved around~~
~~There are some Python2/3 issues with the cythonized compute_ngrams function~~
The tests currently check for OOV ngrams. In the old code, this relied on the ngrams attribute of the model, but in the optimized model I use the ngram hash. Since these hashes are bucketed, an OOV is most often no longer possible (i.e. if all buckets are occupied), i.e. these tests would have to be removed. Is this okay?

I think I can do 1) and 2) on my own when I find the time, but for 3) I'd need some help, since I'm not that familiar with the intentions behind the old code.

menshikh-iv · 2018-02-20T05:49:51Z

@jbaiter Wow! Please resolve merge-conflicts (this is critical right now). Probably, create new branch (based on fresh develop) and apply your changes will be simpler for you than resolve conflicts here.

When you resolve conflicts - please ping me for review / any help.

jbaiter · 2018-02-22T11:54:47Z

@menshikh-iv So I managed to rebase my changes on the latest develop branch.
The Python 2 issue with _compute_ngrams has also been fixed.
The test suit now also passes all of the tests on Python 3.6.
On Python 2.7 and 3.5 however, the OOV vector in testPersistenceForOldVersions does not match the fixture. I'm currently at a loss at what could cause this, it's working fine in Python 3.6.

jbaiter · 2018-02-22T12:00:55Z

gensim/test/test_fasttext_wrapper.py

-        out_expected_vec = numpy.array([-1.34948218, -0.8686831, -1.51483142, -1.0164026, 0.56272298,
-            0.66228276, 1.06477463, 1.1355902, -0.80972326, -0.39845538])
+        out_expected_vec = numpy.array([-0.33959097, -0.21121596, -0.37212455, -0.25057459, 0.11222091,
+            0.17517674, 0.26949012, 0.29352987, -0.1930912, -0.09438948])


This is probably not correct, the vector seems to differ between different Python versions (the above was with 3.6)

…tPersistenceForOldVersions

jbaiter · 2018-02-22T13:58:25Z

I now reverted all changes to the deprecated fasttext_wrapper and the test suite is now passing in all Linux environments. I don't think it's too bad if it doesn't get the optimizations for now, given that it's deprecated.

On Windows there seems to be a memory-related issue, the allocation for the ngram vectors in test_estimate_memory causes a MemoryError.

menshikh-iv · 2018-02-22T18:44:13Z

@jbaiter thanks, for memory - this is known issue for appveyour :( I'll look into later, thanks for your patience.

menshikh-iv

CC: @manneshiva can you review this one, please?

menshikh-iv · 2018-02-23T13:11:18Z

gensim/models/fasttext_inner.pyx

@@ -317,7 +317,8 @@ def train_batch_sg(model, sentences, alpha, _work, _l1):
                continue
            indexes[effective_words] = word.index

-            subwords = [model.wv.ngrams[subword_i] for subword_i in model.wv.ngrams_word[model.wv.index2word[word.index]]]
+            subwords = [model.wv.hash2index[ft_hash(subword) % model.bucket]


Please use only hanging indents (no vertical).

That is, remove the newline and keep the comprehension on a single line?

single line (if line <= 120 characters) or something like

subwords = [ .... .... ]

menshikh-iv · 2018-02-23T13:12:41Z

gensim/models/utils_any2vec_fast.pyx

@@ -0,0 +1,21 @@
+#!/usr/bin/env cython


This related only with n-grams model, I think it's better to move it to fasttext_inner.pyx

I don't think this would work, since both functions are used by both models.fasttext and models.keyedvectors. Moving the functions into models.fasttext would break, since this would cause a circular import.

Aha, so, in this case, better to name it as _utils_any2vec.pyx (similar with _mmreader.pyx and _matutils.pyx)

menshikh-iv · 2018-02-23T13:12:54Z

gensim/models/utils_any2vec_fast.pyx

+# cython: cdivision=True
+# coding: utf-8
+
+def ft_hash(unicode string):


why not cdef with nogil?

Both functions need to be called from both Python and Cython, so cdef won't work, but cpdef should. Will be fixed.

nogil doesn't work, since the function returns a Python object.

menshikh-iv · 2018-02-23T13:13:04Z

gensim/models/utils_any2vec_fast.pyx

+    return h
+
+
+cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n):


why not nogil, you can fix type for word.

Fixing the type for word doesn't really work, since the token might be a str on Python 2.7. nogil won't work either, since the function returns a Python object.

menshikh-iv · 2018-02-23T13:13:51Z

gensim/models/utils_any2vec_fast.pyx

+
+
+cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n):
+    cdef unicode extended_word = f'<{word}>'


f-strings supports only in 3.6 (and we maintain 2.7, 3.5, 3.6), please use simple concatenation (or any alternative) here.

This is in Cython, which, as far as I understood, since 0.24 automatically generates cross-compatible C-code for it. It works fine under 2.7.

wow, really? I didn't know about it, thanks for the information!

You can read about it here: http://cython.readthedocs.io/en/latest/src/tutorial/strings.html#string-literals

menshikh-iv · 2018-02-23T13:14:30Z

gensim/test/test_fasttext.py

        model_neg.build_vocab(new_sentences, update=True)  # update vocab
        model_neg.train(new_sentences, total_examples=model_neg.corpus_count, epochs=model_neg.iter)
        self.assertEqual(len(model_neg.wv.vocab), 14)
-        self.assertTrue(len(model_neg.wv.ngrams), 271)


Why you removed part of tests (I mean all removed lines in tests)?

As I mentioned in the PR, one optimization was to remove the storage of ngrams on the model and solely rely on the hashes. This is why any tests that assert the number of ngrams in the model are no longer neccesary.

With the lack of ngrams, the __contains__ check now also only uses the hashed and bucketed ngrams, which is why a 'real' OOV is a lot rarer (i.e. it will only happen if not all buckets are occupied and none of the ngrams in the token match any occupied bucket).

menshikh-iv · 2018-02-23T13:17:01Z

@jbaiter don't forget to resolve merge-conflict too

…ation

manneshiva

@jbaiter I went through your PR and it looks good to me. Great job!
Just one more small deletion required in /gensim/models/deprecated/fasttext.py. Please delete,

new_model.wv.ngrams_word = old_model.wv.ngrams_word
new_model.wv.ngrams = old_model.wv.ngrams

and the corresponding asserts in test_fasttext.py, here.

I also feel we should also evaluate the effect of the changes in this PR on the quality of vectors learnt. Maybe compare the old and new code by training a FastText model on text8 and looking at the accuracies (using accuracy) of learnt vectors on question-answers.txt. It would also be interesting to see the memory consumption in both cases.

cc: @menshikh-iv

This removes the expensive calls to `compute_ngrams` and `ft_hash` during training and uses a simple lookup in an int -> int[] mapping instead, resulting in a dramatic increase in training performance.

jbaiter · 2018-02-26T18:01:20Z

I ran some benchmarks with my optimized version and the current gensim implementation of FastText.

Initially the performance was about 10x slower, but I implemented an optimization that pre-generates the ngram buckets for each word to avoid calling compute_ngrams and ft_hash in the training loop. This actually improved performance to 2x that of the original implementation.

I trained on the text8 corpus with the default settings on an Xeon E5-1620 with 8 cores.
The test script can be found at https://gist.github.com/3d781a311e536b471b24fb4a46c952a4
All measurements were done with GNU time (/usr/bin/time).

Metric	original	optimized
Training time	585.5	299.3s
Training words/sec	106,804 words/s	208,911 words/s
Training peak memory	1,409.92 MiB	1,181.35 MiB
Evaluation time	60.66s	59.25s
Evaluation peak memory	980.46 MiB	768.42 MiB

As you can see, the goal of reducing memory consumption was achieved and we additionally almost doubled the training speed, while maintaining evaluation speed.

The quality of the vectors seems to suffer a bit, however I think that this might be due to the different random initializations of the two models.

Benchmark	original	optimized
capital-common-countries	8.5% (43/506)	7.1% (36/506)
capital-world	3.6% (52/1452)	3.6% (52/1452)
currency	0.0% (0/268)	0.0% (0/268)
city-in-state	4.3% (68/1571)	4.1% (64/1571)
family	40.8% (125/306)	40.8% (125/306)
gram1-adjective-to-adverb	96.0% (726/756)	95.1% (719/756)
gram2-opposite	91.8% (281/306)	91.2% (279/306)
gram3-comparative	81.4% (1026/1260)	82.2% (1036/1260)
gram4-superlative	85.2% (431/506)	84.6% (428/506)
gram5-present-participle	76.3% (757/992)	75.9% (753/992)
gram6-nationality-adjective	71.3% (978/1371)	68.6% (940/1371)
gram7-past-tense	29.1% (387/1332)	28.4% (378/1332)
gram8-plural	80.9% (803/992)	80.5% (799/992)
gram9-plural-verbs	82.8% (538/650)	82.9% (539/650)
total	50.7% (6215/12268)	50.1% (6148/12268)

menshikh-iv · 2018-02-27T04:18:58Z

@jbaiter first table looks awesome: almost x2 faster + reduce memory usage, fantastic 🔥!

About second table & random init: can you train/evaluate several times & average results (to exclude random effect) please?

Also, please have a look at Appveyor, I see MemoryError in test_estimate_memory.
Besides, please make backward-compatibility test (train FastText with old code, and try to load it with current code).

manneshiva · 2018-02-27T12:58:05Z

@jbaiter The speedup looks great! Thanks for this contribution.
Just a few comments:

You would also want to include the size of wv.buckets_word to estimate_memory() method.
The initial setup time before training (calculating hashes and storing ngrams) should definitely be faster now that you have Cythonized compute_ngrams and ft_hash. But I am not sure what is causing the speedup in training (in terms of words/sec). As far as I know, the initial code (Gensim 3.3.0) did not call compute_ngrams or ft_hash during training (from fasttext_inner.pyx), what do you think is the reason for 2x increase in the number of words processed per second (I am assuming you have pasted these numbers (words/sec) from the logs)? The table values compare your current implementation (optimized) with the implementation from Gensim 3.3.0 (original), am I right?

menshikh-iv · 2018-02-27T14:35:01Z

@manneshiva @jbaiter maybe try to compare on bigger corpus (something ~1GB, not text8, this is more "fair" performance comparison)

jbaiter · 2018-02-27T15:52:40Z

@menshikh-iv

About second table & random init: can you train/evaluate several times & average results (to exclude random effect) please?
Besides, please make backward-compatibility test (train FastText with old code, and try to load it with current code).

Will do, I'll also do a run each with a fixed seed.

Also, please have a look at Appveyor, I see MemoryError in test_estimate_memory.

Do you have any idea what could be causing this? estimate_memory is called in the training tests as well, but doesn't cause an error there. You mentioned that this tends to happen on AppVeyor with other tests as well?

@manneshiva @jbaiter maybe try to compare on bigger corpus (something ~1GB, not text8, this is more "fair" performance comparison)

Can you recommend a suitable dataset that works with the tests in questions-answers.txt? The corpora I currently have on hand are all historical German and not really suitable for these tests.

@manneshiva

You would also want to include the size of wv.buckets_word to estimate_memory() method.

Will do!

The initial setup time before training (calculating hashes and storing ngrams) should definitely be faster now that you have Cythonized compute_ngrams and ft_hash. But I am not sure what is causing the speedup in training (in terms of words/sec). As far as I know, the initial code (Gensim 3.3.0) did not call compute_ngrams or ft_hash during training (from fasttext_inner.pyx), what do you think is the reason for 2x increase in the number of words processed per second (I am assuming you have pasted these numbers (words/sec) from the logs)? The table values compare your current implementation (optimized) with the implementation from Gensim 3.3.0 (original), am I right?

Yes, the values for original are from the latest commit on the develop branch, optimized is the code of this PR.
I think the reason might be that the new code does not use a list comprehension but instead directly looks up the tuple of ngram buckets:

subwords = [model.wv.ngrams[subword_i] for subword_i in model.wv.ngrams_word[model.wv.index2word[word.index]]]

vs

subwords = model.wv.buckets_word[word.index]

The current version does num_ngrams + 2 lookups, while the new code always just uses a single lookup. It could be that the previous code was causing a lot of cache evictions/misses and the new one doesn't? I could try to look at cache usage patterns with perf if I find the time.

menshikh-iv · 2018-02-27T17:04:20Z

Do you have any idea what could be causing this? estimate_memory is called in the training tests as well but doesn't cause an error there. You mentioned that this tends to happen on AppVeyor with other tests as well?

This happens sometimes with appveyour (by memory limit reasons), so, you can try to use the smaller model in this case.

Can you recommend a suitable dataset that works with the tests in questions-answers.txt? The corpora I currently have on hand are all historical German and not really suitable for these tests.

Sample from https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001 should be a good idea (you should pick first 1M articles).

menshikh-iv · 2018-02-28T08:04:49Z

I'll make benchmark myself too (~~will be updated soon~~, finished)

Code: https://gist.github.com/menshikh-iv/ba8cba26744c668e73b59d5972dabbf8
Evaluation dataset: https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt
Input: 0.5M first articles from wiki ("title" + "sect_title" + "sect_content" + ...)
Preprocessing: very simple (without stemming)

from gensim.parsing.preprocessing import (
    preprocess_string, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short
)

prc = partial(
    preprocess_string, 
    filters=[strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short]
)

Model parameters: almost default, only iter=1 (because corpus large enough),

Metric	original	optimized	improvement (x)
Training time (1 epoch)	4823.4s (80.38 minutes)	1873.6s (31.22 minutes)	2.57
Training time (full)	1h 26min 13s	36min 43s	2.35
Training words/sec	72781	187366	2.57
Training peak memory	5,173 MB	3,671 MB	1.4

Benchmark	original	optimized
capital-common-countries	60.3% (305/506)	63.4% (321/506)
capital-world	35.9% (655/1826)	36.6% (669/1826)
currency	0.8% (1/128)	0.8% (1/128)
city-in-state	24.5% (539/2203)	23.4% (515/2203)
family	69.9% (214/306)	72.2% (221/306)
gram1-adjective-to-adverb	85.7% (514/600)	85.7% (514/600)
gram2-opposite	70.0% (147/210)	72.4% (152/210)
gram3-comparative	84.0% (1000/1190)	84.4% (1004/1190)
gram4-superlative	80.1% (442/552)	78.8% (435/552)
gram5-present-participle	71.5% (465/650)	71.7% (466/650)
gram6-nationality-adjective	90.5% (1175/1299)	91.3% (1186/1299)
gram7-past-tense	49.8% (664/1332)	49.7% (662/1332)
gram8-plural	88.0% (873/992)	87.1% (864/992)
gram9-plural-verbs	83.8% (387/462)	86.1% (398/462)
total	60.2% (7381/12256)	60.4% (7408/12256)

jbaiter · 2018-02-28T09:16:54Z

So I ran the text8 tests 100 times with both the current implementation and the optimization, the total score is now at 50.38% (original) and 50.29%, so a pretty small difference.
The averaged tests with the first 1M documents from the wiki corpus are currently running, will be able to report on the outcome tomorrow.

As for the causes of the speedup, it seems that my changes somehow result in a significant increase in parallelism, as can be gathered from these performance counters:

Perfomance counters for training + evaluation on current

    1404012.973981      task-clock (msec)         #    2.145 CPUs utilized
         1,272,950      context-switches          #    0.907 K/sec
            27,026      cpu-migrations            #    0.019 K/sec
           667,005      page-faults               #    0.475 K/sec
 4,603,142,310,724      cycles                    #    3.279 GHz                      (48.93%)
 5,168,917,813,028      instructions              #    1.12  insn per cycle           (61.42%)
 1,032,270,301,259      branches                  #  735.228 M/sec                    (61.21%)
     8,170,082,299      branch-misses             #    0.79% of all branches          (60.97%)
 1,732,560,787,151      L1-dcache-loads           # 1234.006 M/sec                    (45.44%)
   164,838,925,283      L1-dcache-load-misses     #    9.51% of all L1-dcache hits    (28.92%)
    36,178,103,172      LLC-loads                 #   25.768 M/sec                    (25.83%)
    10,871,658,490      LLC-load-misses           #   30.05% of all LL-cache hits     (36.92%)

     654.542443124 seconds time elapsed

Performance counters for training + evaluation on optimized

    1032653.771199      task-clock (msec)         #    3.002 CPUs utilized
         1,721,311      context-switches          #    0.002 M/sec
             9,352      cpu-migrations            #    0.009 K/sec
           623,004      page-faults               #    0.603 K/sec
 3,411,762,671,386      cycles                    #    3.304 GHz                      (48.48%)
 4,492,762,293,396      instructions              #    1.32  insn per cycle           (60.97%)
   877,973,650,752      branches                  #  850.211 M/sec                    (60.70%)
     6,791,892,588      branch-misses             #    0.77% of all branches          (60.63%)
 1,521,323,964,974      L1-dcache-loads           # 1473.218 M/sec                    (43.68%)
   126,194,704,440      L1-dcache-load-misses     #    8.30% of all L1-dcache hits    (29.18%)
    21,929,347,031      LLC-loads                 #   21.236 M/sec                    (26.23%)
     5,608,723,856      LLC-load-misses           #   25.58% of all LL-cache hits     (36.70%)

     343.979779366 seconds time elapsed

Additionally to using a full core more, IPC seems to be higher (1.32 instructions/cycle vs 1.12 before) and the overall number of instructions is lower (could be because of the lower number of lookups?)

menshikh-iv · 2018-02-28T10:30:34Z

I checked current PR with 500,000 articles from wiki, my results #1916 (comment), this looks exciting 🌟 🔥 impressive work @jbaiter!

I also checked, the model from 3.3.0 loads fine, need to check 3.2.0 only now.

menshikh-iv · 2018-02-28T10:38:22Z

@jbaiter last things that missed here

Check that old FastText loads correctly with new code
Fix Appveyour memory issue (reduce model size)

menshikh-iv · 2018-02-28T11:17:01Z

This is files from profiler (current relese version + PR version)

Script:

import gensim.downloader as api
from gensim.models import FastText

data = api.load("text8")
model = FastText(data)

master.txt
optimized.txt

@manneshiva this is for you

jayantj

Hi @jbaiter thanks a lot for the PR! This looks really great, and that's some serious speedup. Would really appreciate if you could address the comments in my review.

jayantj · 2018-02-28T14:04:19Z

gensim/models/fasttext.py

-                    wv.hash2index[ngram_hash] = new_hash_count
-                    wv.ngrams[ngram] = wv.hash2index[ngram_hash]
-                    new_hash_count = new_hash_count + 1
+            wv.num_ngram_vectors = 0


We could probably reduce some variables here - there seems to be some redundancy, if I understand correctly. wv.num_ngram_vectors, new_hash_count and len(ngram_indices) serve effectively the same purpose.
Maybe we could use len(ngram_indices) within the loop and set wv.num_ngram_vectors at the end of the loop?

jayantj · 2018-02-28T14:04:42Z

gensim/models/fasttext.py

-            new_ngrams = list(set(new_ngrams))
-            wv.num_ngram_vectors += len(new_ngrams)
-            logger.info("Number of new ngrams is %d", len(new_ngrams))
+            if not wv.buckets_word:


What is the purpose of this?

jayantj · 2018-02-28T14:06:20Z

gensim/models/fasttext.py

-                    new_hash_count = new_hash_count + 1
-                else:
-                    wv.ngrams[ngram] = wv.hash2index[ngram_hash]
+            num_new_ngrams = 0


There seems to be some redundancy again with new_hash_count, num_new_ngrams.

jayantj · 2018-02-28T14:12:22Z

gensim/models/fasttext.py

+                    continue
+                ngram_indices.append(len(wv.vocab) + ngram_hash)
+                wv.hash2index[ngram_hash] = wv.num_ngram_vectors
+                wv.num_ngram_vectors += 1


Can be set to len(ngram_indices) at the end instead (sorry for nitpicking, but we already have very long code for some of these methods)

jayantj · 2018-02-28T14:16:05Z

gensim/models/keyedvectors.py

-                word_vec += ngram_weights[self.ngrams[ngram]]
+                ngram_hash = _ft_hash(ngram) % self.bucket
+                if ngram_hash in self.hash2index:
+                    word_vec += ngram_weights[self.hash2index[ngram_hash]]
            if word_vec.any():
                return word_vec / len(ngrams)


This probably needs to be updated to only take into account the ngrams for which hashes were present in self.hash2index

jayantj · 2018-02-28T14:19:25Z

gensim/test/test_fasttext.py

-            0.58818,
-            0.57828,
-            0.75801
+            -0.21929,


What is the reason for this change? Could it because of the len(ngrams) issue mentioned in a comment above?

That's really a crucial question, but it's not because of the len(ngrams) issue.
The change was honestly simply made because the numbers were pretty similar and I thought that the vectors just changed a bit since the new code is far more lenient in assigning a vector to unknown ngrams (i.e. once all buckets are occupied, any ngram will result in a vector, even if it was not in the original corpus).

But it looks like there might a bug in the old code that has something to do with this: There are a lot more ngram vectors in the loaded model (17004) than there are in the model on disk (2762). This is probably because wv.vectors_ngram = wv.vectors_ngrams.take(ngram_indices, axis=0) in init_ngrams_post_load will result in a (num_ngrams_total, ngram_vec_len) matrix. Shouldn't vectors_ngram have a shape of (num_buckets, ngram_vec_len)? At least that's the case in the new code, and it follows from my (not necessarily correct) understanding of how the bucketing in this implementation works.

This sound similar to what was reported in #1779

...since the new code is far more lenient in assigning a vector to unknown ngrams (i.e. once all buckets are occupied, any ngram will result in a vector, even if it was not in the original corpus).

Ahh right, that makes sense, thanks for explaining.

Re: the number of ngram vectors being greater than num_buckets (or the number of vectors on disk) - I see why that might have been happening. With a ngram vocab larger than the number of buckets, a lot of ngrams will be mapped to the same indices. And when .take is passed a list that contains multiple occurrences of the same index, the vector at that index is "taken" multiple times.
For example -

all_vectors = np.array([[0.1, 0.3], [0.3, 0.1]]) taken_vectors = all_vectors.take([0, 1, 0], axis=0) taken_vectors.shape >>> (3, 2)

So it wouldn't result in incorrect results, but yeah, it'd result in unexpectedly high memory usage (and kind of blowing out the whole idea of keeping memory usage constant even with increasing ngram vocabs, out of the water). Thanks for investigating this and explaining it!

@menshikh-iv we're good to merge from my side

jayantj · 2018-02-28T14:19:50Z

gensim/test/test_fasttext.py

-            0.18025,
-            -0.14128,
-            0.22508
+            -0.49111,


Ditto:
What is the reason for this change? Could it because of the len(ngrams) issue mentioned in a comment above?

menshikh-iv · 2018-02-28T16:46:18Z

Some missed stuff

Add .c and .pyx files to https://github.com/RaRe-Technologies/gensim/blob/develop/MANIFEST.in
Add .rst for _utils_any2vec.py & _utils_any2vec.pyx to gensim/docs/src/models
Update apiref.rst according to previous changes
Re-compile cython part with doc-directives

jbaiter mentioned this pull request Feb 19, 2018

FastText memory usage greatly exceeds value returned by estimate_memory #1824

Closed

jbaiter added 2 commits February 20, 2018 22:35

Cythonize fasttext.ft_hash for 100x performance improvement

3db9c63

Cythonize fasttext.compute_ngrams for 2x performance improvement

9f3428a

jbaiter force-pushed the fasttext-optimization branch from 7c6afb2 to 08c464a Compare February 21, 2018 23:46

Reduce fasttext memory usage by computing ngrams on the fly

51a1a6e

jbaiter force-pushed the fasttext-optimization branch from 08c464a to 51a1a6e Compare February 22, 2018 00:18

jbaiter added 2 commits February 22, 2018 01:47

Fix compute_ngrams for Python 2

f467ab9

Merge branch 'develop' into fasttext-optimization

783114a

piskvorky requested a review from manneshiva February 22, 2018 07:10

jbaiter commented Feb 22, 2018

View reviewed changes

jbaiter added 2 commits February 22, 2018 13:05

Store OOV vec in variable for more informative assertion error in tes…

5c576ad

…tPersistenceForOldVersions

Revert all changes to fasttext_wrapper

0a5912c

menshikh-iv suggested changes Feb 23, 2018

View reviewed changes

jbaiter added 3 commits February 23, 2018 15:06

Fix indentation for multi-line expressions

9a36b08

Rename utils_any2vec_fast to _utils_any2vec

764071b

Merge remote-tracking branch 'upstream/develop' into fasttext-optimiz…

722cdda

…ation

manneshiva approved these changes Feb 24, 2018

View reviewed changes

jbaiter added 2 commits February 26, 2018 18:51

fasttext: Cache ngram buckets for words during training

c6f347e

This removes the expensive calls to `compute_ngrams` and `ft_hash` during training and uses a simple lookup in an int -> int[] mapping instead, resulting in a dramatic increase in training performance.

Remove last occurences of wv.ngrams_word and wv.ngrams

1d86111

jbaiter added 5 commits February 28, 2018 11:55

fasttext: use buckets_word cache also for non-Cython training

6aaab0a

fasttext: Add buckets_ngram size to memory estimate

85679ed

fasttext: Don't store buckets_word with the model

e574e90

fasttext: Use smaller model for test_estimate_memory

2a090c6

fasttext: Fix pure python training code

33968dc

jbaiter added 2 commits February 28, 2018 12:54

fasttext: Fix asserts for test_estimate_memory

76a0675

fasttext: Fix typo and style errors

0a2ae3c

menshikh-iv changed the title ~~Optimizations for FastText~~ Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824 Feb 28, 2018

jayantj suggested changes Feb 28, 2018

View reviewed changes

fasttext: Simplify code as per @jayantj's review

0fe0f80

Update MANIFEST.in and documentation with utils_any2vec implementations

7cb46e3

jayantj approved these changes Mar 1, 2018

View reviewed changes

menshikh-iv approved these changes Mar 1, 2018

View reviewed changes

last fixes (add option for cython compiler, fix descriptions, etc)

dcc0857

menshikh-iv merged commit 9021ea8 into piskvorky:develop Mar 1, 2018

manneshiva mentioned this pull request Mar 1, 2018

FastText native VS original, different outputs #1940

Closed

quajak mentioned this pull request May 11, 2018

Importing fasttext models still not working #2045

Closed

gojomo referenced this pull request Mar 9, 2019

Update CHANGELOG.md

e69e112

		return h


		cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n):



		cpdef compute_ngrams(word, unsigned int min_n, unsigned int max_n):
		cdef unicode extended_word = f'<{word}>'

Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824 #1916

Fix method estimate_memory from gensim.models.FastText & huge performance improvement. Fix #1824 #1916

Conversation

jbaiter commented Feb 19, 2018 • edited Loading

menshikh-iv commented Feb 20, 2018

jbaiter commented Feb 22, 2018

Choose a reason for hiding this comment

jbaiter commented Feb 22, 2018 • edited Loading

menshikh-iv commented Feb 22, 2018 • edited Loading

menshikh-iv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbaiter Feb 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

menshikh-iv commented Feb 23, 2018

manneshiva left a comment

Choose a reason for hiding this comment

jbaiter commented Feb 26, 2018 • edited Loading

menshikh-iv commented Feb 27, 2018 • edited Loading

manneshiva commented Feb 27, 2018 • edited Loading

menshikh-iv commented Feb 27, 2018

jbaiter commented Feb 27, 2018 • edited Loading

menshikh-iv commented Feb 27, 2018

menshikh-iv commented Feb 28, 2018 • edited Loading

jbaiter commented Feb 28, 2018 • edited Loading

menshikh-iv commented Feb 28, 2018

menshikh-iv commented Feb 28, 2018 • edited Loading

menshikh-iv commented Feb 28, 2018

jayantj left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jbaiter Feb 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

menshikh-iv commented Feb 28, 2018 • edited Loading

Fix method `estimate_memory` from `gensim.models.FastText` & huge performance improvement. Fix #1824 #1916

Fix method `estimate_memory` from `gensim.models.FastText` & huge performance improvement. Fix #1824 #1916

jbaiter commented Feb 19, 2018 •

edited

Loading

jbaiter commented Feb 22, 2018 •

edited

Loading

menshikh-iv commented Feb 22, 2018 •

edited

Loading

jbaiter Feb 23, 2018 •

edited

Loading

jbaiter commented Feb 26, 2018 •

edited

Loading

menshikh-iv commented Feb 27, 2018 •

edited

Loading

manneshiva commented Feb 27, 2018 •

edited

Loading

jbaiter commented Feb 27, 2018 •

edited

Loading

menshikh-iv commented Feb 28, 2018 •

edited

Loading

jbaiter commented Feb 28, 2018 •

edited

Loading

menshikh-iv commented Feb 28, 2018 •

edited

Loading

jbaiter Feb 28, 2018 •

edited

Loading

menshikh-iv commented Feb 28, 2018 •

edited

Loading