
PERF: easy speedup of WMD using POT.emd2 #3312

Closed · TLouf opened this issue Mar 23, 2022 · 2 comments · Fixed by #3327

Comments

TLouf (Contributor) commented Mar 23, 2022

Problem description

The WMD calculation can be sped up by using POT's emd2 function.

Steps/code/corpus to reproduce

Using, for instance, the pretrained glove-twitter-50 model and lengthened sentences from the tutorial:

import gensim.downloader as api

model = api.load('glove-twitter-50')

sentence_obama = "Obama speaks to the media in Illinois, what a great guy he's so nice."
sentence_president = "The president greets the press in Chicago, the greatness of his personality is without equal, I love him."

# wmdistance expects tokenized documents, as in the tutorial
distance = model.wmdistance(sentence_obama.lower().split(), sentence_president.lower().split())

Running this wrapped in a timeit, the current pyemd-based implementation gives

1.09 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

versus

749 µs ± 6.71 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

using POT. The performance gap is of course even more noticeable for larger documents, and especially if the distance matrix and bag-of-words vectors are pre-computed:

543 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

versus

223 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Pre-computing these is not possible with the current gensim implementation, but that's an option that could easily be given to the user (although that's for another issue).
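For reference, here's a minimal sketch of how the same distance can be computed with POT's emd2 (my own illustrative code, not gensim's implementation; the nbow_wmd helper and the simple whitespace tokenization are assumptions):

import numpy as np
import ot  # POT
from scipy.spatial.distance import cdist

def nbow_wmd(tokens1, tokens2, kv):
    # keep only in-vocabulary tokens
    tokens1 = [t for t in tokens1 if t in kv]
    tokens2 = [t for t in tokens2 if t in kv]
    vocab = sorted(set(tokens1) | set(tokens2))
    idx = {w: i for i, w in enumerate(vocab)}
    # normalized bag-of-words histograms (nBOW)
    d1 = np.zeros(len(vocab))
    d2 = np.zeros(len(vocab))
    for t in tokens1:
        d1[idx[t]] += 1
    for t in tokens2:
        d2[idx[t]] += 1
    d1 /= d1.sum()
    d2 /= d2.sum()
    # pairwise Euclidean distances between the word vectors
    C = cdist(kv[vocab], kv[vocab])
    # emd2 returns the optimal transport cost, i.e. the WMD
    return ot.emd2(d1, d2, C)

distance = nbow_wmd(sentence_obama.lower().split(),
                    sentence_president.lower().split(), model)

This is also where the pre-computation gains come from: d1, d2 and C can be built once and reused across calls.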

So the way I see it, wmdistance could first try to import POT and, if it's unavailable, fall back to pyemd, adding a note to the current docstring warning that performance is better with POT installed. I can open a PR if that sounds good.
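Concretely, the import logic could look something like this (a sketch of the idea, not the actual PR; the _emd and _USING_POT names are placeholders):

try:
    # prefer POT's emd2, which returns the transport cost directly
    from ot import emd2 as _emd
    _USING_POT = True
except ImportError:
    # fall back to pyemd's emd, which gensim currently uses
    from pyemd import emd as _emd
    _USING_POT = False

# both take (histogram_1, histogram_2, distance_matrix) and return a float,
# so the call site stays the same:
# distance = _emd(d1, d2, distance_matrix)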

Versions

Measured with the most recent versions of the packages involved: gensim 4.1.2, pyemd 0.5.1 and POT 0.8.1.0.

gojomo (Collaborator) commented Mar 25, 2022

Good find!

If POT works better, there'd be no need to dynamically fall back to pyemd – unless there are some other limitations/risks with POT. We can just make POT the requirement for the feature.

A PR would be welcome.

As you've deduced, there's quite a bit of potential for further WMD optimization in Gensim – especially via avoiding recalculation of distance matrices over large batches of WMD calculations, or via some reduced form from other research that's quicker at rejecting non-candidates when deducing the top candidates for a "nearest-N" ranking.

Separately, curious if you think any of the other 'optimal transport' algorithms in POT could be interesting if applied to bags-of-word-vectors, in the same way as EMD? (Should EMD/WMD be just one drop-in option for a more general "treat text differences as transport problems" facility?)

TLouf (Contributor, Author) commented Apr 16, 2022

First, I have to say I'm not an expert at all in this area: I had used both pyemd and POT for EMD calculations in some past research unrelated to WMD, and simply found that the latter was faster. So I may not be the best person to answer your questions, but I read up a little to try to provide the beginning of an answer.

or using some reduced form from other research that's quicker at rejecting non-candidates when deducing the top candidates for a "nearest-N" ranking.

Are you referring to, for instance, what's described in Kusner's paper in the context of finding the k nearest neighbors of a given document among a batch of other documents? That would indeed greatly speed up WmdSimilarity, which would need quite a rework.
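For concreteness, the prefetch-and-prune idea from that paper could look roughly like this (my own sketch, using only the cheap Word Centroid Distance as a first pass rather than the paper's full RWMD pruning; names and the cutoff are illustrative):

import numpy as np

def centroid(tokens, kv):
    # mean word vector of the in-vocabulary tokens
    return np.mean([kv[t] for t in tokens if t in kv], axis=0)

def nearest_n(query, docs, n, kv, wmd):
    # 1) rank all documents by Word Centroid Distance, which is cheap
    #    to compute and a lower bound on WMD
    q_c = centroid(query, kv)
    wcd = [np.linalg.norm(q_c - centroid(d, kv)) for d in docs]
    order = np.argsort(wcd)
    # 2) compute the exact (expensive) WMD only for the most promising
    #    candidates; 2 * n is an arbitrary cutoff here, the paper
    #    instead prunes with the tighter RWMD lower bound
    candidates = order[:2 * n]
    scored = sorted((wmd(query, docs[i]), i) for i in candidates)
    return [(i, dist) for dist, i in scored[:n]]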

Separately, curious if you think any of the other 'optimal transport' algorithms in POT could be interesting if applied to bags-of-word-vectors, in the same way as EMD? (Should EMD/WMD be just one drop-in option for a more general "treat text differences as transport problems" facility?)

I guess it wouldn't be too hard to offer the option of using any of the other algorithms; the only thing I'm not sure of is whether it makes sense to use all of them in the context of WMD. However, a few have already been tested: I found that Sinkhorn's algorithm has been, in this paper, as have some unbalanced OT algorithms, like what they call "lazy EMD" in this other paper.
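For what it's worth, in POT swapping the exact solver for the entropically regularized Sinkhorn one is essentially a one-line change (reusing d1, d2 and C from the sketch above; the reg value is an arbitrary illustration):

import ot

# exact optimal transport cost, i.e. the WMD
wmd_exact = ot.emd2(d1, d2, C)

# entropically regularized OT (Sinkhorn); reg controls the smoothing,
# larger values converge faster but blur the optimal plan
wmd_sinkhorn = ot.sinkhorn2(d1, d2, C, reg=0.1)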
