-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: easy speedup of WMD using POT.emd2 #3312
Comments
Good find! If A PR would be welcome. As you've deduced, there's quite a bit of potential for further WMD optimization in Gensim - especially via avoiding recalc of distance matrixes over large batches of WMD calculations, or using some reduced form in other research that's quicker at rejecting deducing the top candidates for a "nearest-N" ranking. Separately, curious if you think any of the other 'optimal transport' algorithms in |
First I have to say I'm not an expert at all in this area, I had used both
Are you referring to, for instance, what's described in Kusner's paper in the context of finding the k nearest neighbors of a given document among a batch of other documents? That would indeed greatly speed up
I guess it wouldn't be too hard to give the possibility to use any of the other algorithms, the only thing I'm not sure of is if it makes sense for all of them to be used in the context of WMD. However I found a few have already been tested. I found Sinkhorn's has in this paper, or some unbalanced OT algorithms, like what they call "lazy EMD" in this other paper. |
Problem description
The WMD calculation can be sped up by using
POT
'semd2
function.Steps/code/corpus to reproduce
Using for instance the pretrained
glove-twitter-50
model, and lenghtened sentences from the tutorial:running it with the current implementation wrapped in a
timeit
I get1.09 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
vs
749 µs ± 6.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
using POT, a performance increase that's of course even more noticeable for larger documents, and especially if the distance matrix and bag-of-words are pre-computed:
543 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
vs
223 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Doing so is not possible with the current implementation in
gensim
but that's an option that could easily be given to the user (although that's for another issue).So the way I see it, the
wmdistance
could first try to importPOT
, and if unavailable default topyemd
, adding to the current warning in the docstring that performance is better ifPOT
is installed. I can open a PR if that sounds good.Versions
With the most recent versions of the involved packages,
gensim
4.1.2,pyemd
0.5.1 andPOT
0.8.1.0The text was updated successfully, but these errors were encountered: