
PERF: easy speedup of WMD using POT.emd2 #3312

Closed · TLouf opened this issue Mar 23, 2022 · 2 comments · Fixed by #3327

Comments

TLouf (Contributor) commented Mar 23, 2022

Problem description

The WMD calculation can be sped up by using POT's emd2 function.

Steps/code/corpus to reproduce

Using, for instance, the pretrained glove-twitter-50 model and lengthened sentences from the tutorial:

import gensim.downloader as api

model = api.load('glove-twitter-50')

sentence_obama = "Obama speaks to the media in Illinois, what a great guy he's so nice."
sentence_president = "The president greets the press in Chicago, the greatness of his personality is without equal, I love him."

# wmdistance expects tokenized documents, as in the tutorial
distance = model.wmdistance(sentence_obama.lower().split(), sentence_president.lower().split())

Running this wrapped in a timeit, the current pyemd-based implementation gives

1.09 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

versus

749 µs ± 6.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

using POT. The performance gap is of course even more noticeable for larger documents, and especially if the distance matrix and bag-of-words vectors are pre-computed:

543 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

versus

223 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Pre-computing these is not possible with the current gensim implementation, but that's an option that could easily be given to the user (although that's for another issue).
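For reference, here's a minimal sketch of how the same distance can be computed with POT's emd2 (my own illustrative code, not gensim's implementation; the nbow_wmd helper and the simple whitespace tokenization are assumptions):

import numpy as np
import ot  # POT
from scipy.spatial.distance import cdist

def nbow_wmd(tokens1, tokens2, kv):
    # keep only in-vocabulary tokens
    tokens1 = [t for t in tokens1 if t in kv]
    tokens2 = [t for t in tokens2 if t in kv]
    vocab = sorted(set(tokens1) | set(tokens2))
    idx = {w: i for i, w in enumerate(vocab)}
    # normalized bag-of-words histograms (nBOW)
    d1 = np.zeros(len(vocab))
    d2 = np.zeros(len(vocab))
    for t in tokens1:
        d1[idx[t]] += 1
    for t in tokens2:
        d2[idx[t]] += 1
    d1 /= d1.sum()
    d2 /= d2.sum()
    # pairwise Euclidean distances between the word vectors
    C = cdist(kv[vocab], kv[vocab])
    # emd2 returns the optimal transport cost, i.e. the WMD
    return ot.emd2(d1, d2, C)

distance = nbow_wmd(sentence_obama.lower().split(),
                    sentence_president.lower().split(), model)

This is also where the pre-computation gains come from: d1, d2 and C can be built once and reused across calls.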

So the way I see it, wmdistance could first try to import POT and, if it's unavailable, fall back to pyemd, adding a note to the current docstring warning that performance is better with POT installed. I can open a PR if that sounds good.
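Concretely, the import logic could look something like this (a sketch of the idea, not the actual PR; the _emd and _USING_POT names are placeholders):

try:
    # prefer POT's emd2, which returns the transport cost directly
    from ot import emd2 as _emd
    _USING_POT = True
except ImportError:
    # fall back to pyemd's emd, which gensim currently uses
    from pyemd import emd as _emd
    _USING_POT = False

# both take (histogram_1, histogram_2, distance_matrix) and return a float,
# so the call site stays the same:
# distance = _emd(d1, d2, distance_matrix)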

Versions

Measured with the most recent versions of the packages involved: gensim 4.1.2, pyemd 0.5.1 and POT 0.8.1.0.

gojomo (Collaborator) commented Mar 25, 2022

Good find!

If POT works better, there'd be no need to dynamically fall back to pyemd – unless there are some other limitations/risks with POT. We can just make POT the requirement for the feature.

A PR would be welcome.

As you've deduced, there's quite a bit of potential for further WMD optimization in Gensim – especially via avoiding recalculation of distance matrices over large batches of WMD calculations, or via some reduced form from other research that's quicker at rejecting non-candidates when deducing the top candidates for a "nearest-N" ranking.

Separately, curious if you think any of the other 'optimal transport' algorithms in POT could be interesting if applied to bags-of-word-vectors, in the same way as EMD? (Should EMD/WMD be just one drop-in option for a more general "treat text differences as transport problems" facility?)

TLouf (Contributor, Author) commented Apr 16, 2022

First, I have to say I'm not an expert at all in this area: I had used both pyemd and POT for EMD calculations in some past research unrelated to WMD, and simply found that the latter was faster. So I may not be the best person to answer your questions, but I read up a little to try to provide the beginning of an answer.

or using some reduced form from other research that's quicker at rejecting non-candidates when deducing the top candidates for a "nearest-N" ranking.

Are you referring to, for instance, what's described in Kusner's paper in the context of finding the k nearest neighbors of a given document among a batch of other documents? That would indeed greatly speed up WmdSimilarity, which would need quite a rework.
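For concreteness, the prefetch-and-prune idea from that paper could look roughly like this (my own sketch, using only the cheap Word Centroid Distance as a first pass rather than the paper's full RWMD pruning; names and the cutoff are illustrative):

import numpy as np

def centroid(tokens, kv):
    # mean word vector of the in-vocabulary tokens
    return np.mean([kv[t] for t in tokens if t in kv], axis=0)

def nearest_n(query, docs, n, kv, wmd):
    # 1) rank all documents by Word Centroid Distance, which is cheap
    #    to compute and a lower bound on WMD
    q_c = centroid(query, kv)
    wcd = [np.linalg.norm(q_c - centroid(d, kv)) for d in docs]
    order = np.argsort(wcd)
    # 2) compute the exact (expensive) WMD only for the most promising
    #    candidates; 2 * n is an arbitrary cutoff here, the paper
    #    instead prunes with the tighter RWMD lower bound
    candidates = order[:2 * n]
    scored = sorted((wmd(query, docs[i]), i) for i in candidates)
    return [(i, dist) for dist, i in scored[:n]]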

Separately, curious if you think any of the other 'optimal transport' algorithms in POT could be interesting if applied to bags-of-word-vectors, in the same way as EMD? (Should EMD/WMD be just one drop-in option for a more general "treat text differences as transport problems" facility?)

I guess it wouldn't be too hard to offer the option of using any of the other algorithms; the only thing I'm not sure of is whether it makes sense to use all of them in the context of WMD. However, a few have already been tested: I found that Sinkhorn's algorithm has been, in this paper, as have some unbalanced OT algorithms, like what they call "lazy EMD" in this other paper.
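For what it's worth, in POT swapping the exact solver for the entropically regularized Sinkhorn one is essentially a one-line change (reusing d1, d2 and C from the sketch above; the reg value is an arbitrary illustration):

import ot

# exact optimal transport cost, i.e. the WMD
wmd_exact = ot.emd2(d1, d2, C)

# entropically regularized OT (Sinkhorn); reg controls the smoothing,
# larger values converge faster but blur the optimal plan
wmd_sinkhorn = ot.sinkhorn2(d1, d2, C, reg=0.1)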
