Bucket Argument in fasttext not working as expected ? #1765

Closed
saroufimc1 opened this issue Dec 6, 2017 · 3 comments


saroufimc1 commented Dec 6, 2017

Hi, this is about the native FastText implementation in gensim:

My understanding is that, according to the hashing trick, if bucket is smaller than the total number of subwords, there will be collisions and some subwords will be mapped to the same integer. Am I wrong?
However, that is not what I see in a toy example:

import gensim
from gensim.models.fasttext import FastText

sent = [['lol', 'dds', 'sdsf'], ['anticonsti']]
model = FastText(min_count=1, bucket=20)
model.build_vocab(sentences=sent)
model.train(sentences=sent, epochs=1, report_delay=1.0)

model.wv.ngrams

Expected Results

A dictionary mapping ngrams to integers between 0 and 19 (bucket = 20)

Actual Results

A dictionary mapping ngrams to integers between 0 and 55 (there are 56 ngrams here)
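
For illustration, here is a minimal sketch of the hashing trick described above, written independently of gensim. It assumes character n-grams of length 3 to 6 taken over each word wrapped in '<' and '>', and it uses a plain 32-bit FNV-1a hash; the helper names (char_ngrams, fnv1a_32, bucket_index) are hypothetical and are not gensim's API, and the hash may differ in detail from the one gensim uses. The point is only that reducing a hash modulo bucket yields indices in the range 0 to bucket - 1, so 56 n-grams with bucket = 20 must collide.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of '<word>' for lengths min_n..max_n.

    Assumption: the word is wrapped in '<' and '>' before extraction.
    """
    extended = "<" + word + ">"
    grams = []
    for n in range(min_n, min(len(extended), max_n) + 1):
        for i in range(len(extended) - n + 1):
            grams.append(extended[i:i + n])
    return grams


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of text (illustrative choice)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def bucket_index(ngram, bucket):
    """Map an n-gram to an integer in [0, bucket) via hash modulo bucket."""
    return fnv1a_32(ngram) % bucket


vocab = ["lol", "dds", "sdsf", "anticonsti"]  # words from the toy example
bucket = 20

ngrams = sorted({g for word in vocab for g in char_ngrams(word)})
ngram_to_bucket = {g: bucket_index(g, bucket) for g in ngrams}

print("distinct n-grams:", len(ngrams))  # 56 for this vocabulary
print("largest index:", max(ngram_to_bucket.values()))
print("distinct indices used:", len(set(ngram_to_bucket.values())))
```

With this scheme every index stays below bucket regardless of how many n-grams the vocabulary produces, which is the behavior expected in the report.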

Versions

import platform; print(platform.platform())
Windows-10-10.0.14393-SP0
import sys; print("Python", sys.version)
Python 3.5.2 |Continuum Analytics, Inc.| (default, Jul 5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.3
import scipy; print("SciPy", scipy.__version__)
SciPy 1.0.0
import gensim; print("gensim", gensim.__version__)
gensim 3.1.0
from gensim.models import word2vec; print("FAST_VERSION", word2vec.FAST_VERSION)
FAST_VERSION 0

saroufimc1 changed the title from "Bucket Argument in fasttext not working asexpected" to "Bucket Argument in fasttext not working as expected ?" on Dec 6, 2017

menshikh-iv commented

Thanks for the report, @saroufimc1. I went through the code with a debugger and found that a "suspicious resize" of model.wv.syn0_ngrams happens at https://github.com/manneshiva/gensim/blob/2560e1d6eaeb1b6e18fb29bd1a6e460a141c5f70/gensim/models/fasttext.py#L332

@manneshiva, please take a closer look (and fix it if this is a bug).


manneshiva commented Dec 7, 2017

@menshikh-iv My code hasn't been merged yet, so I doubt @saroufimc1 used my code (unless he installed Gensim from my branch). The exact same problem appears to be here:
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/fasttext.py#L150
I did not touch/change this part of the code. Will have a look at it soon and give my comments.

manneshiva commented

Confirmed that this is a bug caused by incorrect setting of ngram_hash here. Thanks, @saroufimc1, for raising this issue.
