-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix min_count handling in phrases detection using npmi #2072
Conversation
This seems to be also the issue here |
Thanks @lopusz ! Sorry it's taking longer to process, we were busy with the documentation updates. This indeed looks like a bug to me. @michaelwsherman thoughts? |
gensim/models/phrases.py
Outdated
else: | ||
# Return the value below minimal npmi, to make sure that phrases | ||
# will be created only out of bigrams more frequent than min_count | ||
return -1.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reasoning behind -1.1? Looks too magical.
Deserves a code comment at the very least, but maybe a principled value would be even better? -inf
? None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That value will be compared against the threshold parameter that was supplied by the user, which even though is not restricted, should fall within -1 and 1.
Also. I would use a >= instead of > in if bigram_count > min_count:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch.
I'd be in favor of returning None
(for all scoring functions) rather than a number for any min_count
violation since that would mean all scorers return the same value that means "min_count not met". But I also don't know if None
may break things. What tests fail with None
? From a quick look this is probably safe.
I don't think returning -1.1 is a great idea either, but I agree about it being commented at the very least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems you can't compare int/float with NoneType (at least in python 3.6.2), so line 168 of phrases.py would have to be something like if score is not None and score > threshold:
. This would also require to change the default scorer to return None
explicitly when min_count
is not met, for consistency across scorers.
Other solution, instead of returning None
in scorers, could be return -float('Inf')
which would result in 'False' when comparing score > threshold
.
I personally like solution 1 more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sorry for a slow reply. The pull-request woke up in a pretty busy time for me.
Thank you for remarks. Indeed -inf sounds a bit less fragile-magical. Code updated.
Personally, I would avoid returning None from function expected to return a number and cluttering the rest of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to -Inf, I learned something :). LGTM.
Code updated along the review lines. @piskvorky, Others do you find it mergeable? |
@lopusz thanks! Let's wait for @menshikh-iv 's verdict. |
@menshikh-iv what do you think? |
gensim/models/phrases.py
Outdated
pb = wordb_count / corpus_word_count | ||
pab = bigram_count / corpus_word_count | ||
return log(pab / (pa * pb)) / -log(pab) | ||
if bigram_count > min_count: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sharp inequality goes against the established pattern in the other classes in this module, and in word2vec etc. Should be >=
IMO.
@piskvorky Code updated. However, I want to bikeshed a bit on that (If the stuff below looks like nonsense, just ignore it...) I arrived at > sign because IMO gensim api for the default scorer seems to indirectly enforces > min_count in the default scorer. It would be great if default and npmi were compatible. How is that enforced, you say? Two things:
Since, I am not allowed to Perhaps the check could enforce only that How I arrived at this problem and why it might be important? I wanted to generate all the phrases above certain What do you think about allowing If I am the only bothered about that, just ignore. |
No problem :) Good points. As a user, without reading the docs, I'd expect The principle of least surprise. If I'm in favour of b). But let's be consistent in the naming and the semantics, across all supported scorers. |
@lopusz I'm +1 for #2072 (comment), i.e. should be Current code looks good, but please resolve merge conflict. |
Great. Corrected doc string and comment to reflect the |
@menshikh-iv Do you think it is now mergeable? |
@lopusz looks good, thanks! Congratz with first contribution 🥇 |
@menshikh-iv :) Yes, my mergeytocin level has never been so high :) I hope to be contributing more in the future. |
@lopusz 🌟 great, you are welcome :) |
Dear Gensim Developers,
First of all, gensim is an amazing tool. Thank you for your work!
Recently, I played a bit with phrases (collocation) detection via gensim.
It seems that the npmi version ignores
min_count
parameter.I believe
min_count
should allow me to filter rarely occurring bigrams no matter how strong the collocation score might be. It works fine with the default (original) scorer, but npmi ignoresmin_count
parameter.The attached example illustrates the point (sorry for txt ext, but github complains about py). It displays detected phrases, calls are with
min_count
=3. Before modification the NPMI version does not drop cases with 3 or fewer occurrences.After the correction, the low count bigrams are correctly pruned:
If my way of reasoning is right, but you want me to add some love to the attached pull
request do not hesitate to let me know (perhaps stating explicitly in the docs that
scorer should handle the
min_count
would be a nice addition, in case somebody wants to write her/his own...)Best regards,
Michał
mwe.txt