Speeding up similarity calculations #126

Closed
vsraptor opened this issue Jun 18, 2021 · 1 comment

vsraptor commented Jun 18, 2021

Is there a systematic way to loop through all the synsets, i.e. a synset iterator?

Is it wn.synsets()?

Any idea how I can speed up similarity calculations?

I'm testing sentence comparisons. To give an example: comparing two words requires computing ~10-20 synset similarities. If a sentence has 10 words on average, that is 100 word comparisons for every pair of sentences, or ~1000 similarity calculations, which takes ~2-25 s. Comparing ~1000++ sentences then makes the numbers enormous. It should be ~1000++! but it's not, because words repeat.
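A rough back-of-the-envelope version of these numbers, as a sketch only. The per-word-pair, per-sentence, and sentence-count figures are the approximate values quoted above (taking the lower end of ~10-20). Comparing every sentence with every other sentence is an assumption, and the vocabulary size of 2000 is a purely hypothetical number chosen for illustration.

```python
from math import comb

SIMS_PER_WORD_PAIR = 10   # ~10-20 synset similarities per word pair (lower end)
WORDS_PER_SENTENCE = 10   # average sentence length quoted above
NUM_SENTENCES = 1000      # ~1000 sentences, assumed to be compared all-against-all
VOCAB_SIZE = 2000         # hypothetical number of distinct words in the corpus

# Every word of one sentence is compared with every word of the other.
word_pairs_per_sentence_pair = WORDS_PER_SENTENCE ** 2
sims_per_sentence_pair = word_pairs_per_sentence_pair * SIMS_PER_WORD_PAIR

# Unordered sentence pairs, and the total work without any caching.
sentence_pairs = comb(NUM_SENTENCES, 2)
uncached_sims = sentence_pairs * sims_per_sentence_pair

# With a word-pair cache, each unordered word pair (including a word
# paired with itself) is computed at most once.
max_unique_word_pairs = comb(VOCAB_SIZE, 2) + VOCAB_SIZE
cached_sims_upper_bound = max_unique_word_pairs * SIMS_PER_WORD_PAIR

print(f"similarities per sentence pair: {sims_per_sentence_pair:,}")
print(f"sentence pairs:                 {sentence_pairs:,}")
print(f"uncached similarities:          {uncached_sims:,}")
print(f"cached upper bound:             {cached_sims_upper_bound:,}")
```

Under these assumptions the cache bounds the work by the vocabulary size rather than by the number of sentence pairs, which is why repeated words help so much.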

I cache the word-to-word similarities, which helps, but any extra juice I can squeeze out would be good.
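For reference, a minimal sketch of the word-pair caching described above. It is not part of Wn: `make_cached_similarity` and `toy_similarity` are hypothetical names, and the toy function merely stands in for an expensive synset-based similarity. The sketch assumes the similarity is symmetric, so (a, b) and (b, a) can share one cache entry.

```python
from functools import lru_cache
from typing import Callable, List, Tuple


def make_cached_similarity(
    sim: Callable[[str, str], float]
) -> Callable[[str, str], float]:
    """Wrap a symmetric word-similarity function with an order-insensitive cache."""

    @lru_cache(maxsize=None)
    def _cached(a: str, b: str) -> float:
        return sim(a, b)

    def cached_sim(w1: str, w2: str) -> float:
        # Normalize argument order so both orderings hit the same cache entry.
        a, b = (w1, w2) if w1 <= w2 else (w2, w1)
        return _cached(a, b)

    return cached_sim


# Records every call that actually reaches the underlying function.
call_log: List[Tuple[str, str]] = []


def toy_similarity(w1: str, w2: str) -> float:
    """Hypothetical stand-in for an expensive synset-based word similarity."""
    call_log.append((w1, w2))
    return 1.0 if w1 == w2 else 1.0 / (1 + abs(len(w1) - len(w2)))


cached = make_cached_similarity(toy_similarity)
first = cached("dog", "cat")
second = cached("cat", "dog")  # same unordered pair, served from the cache
print(first, second, len(call_log))
```

Normalizing the key order roughly halves the number of distinct entries compared with caching ordered pairs.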

@vsraptor vsraptor changed the title Speeding up similarity calculations Synset iterator and Speeding up similarity calculations Jun 18, 2021
@vsraptor vsraptor changed the title Synset iterator and Speeding up similarity calculations Speeding up similarity calculations Jun 18, 2021
@goodmami
Owner

Hi, I see you've already closed this before I could respond. Were you able to find a solution?

While I do put in a bit of effort to optimize the performance of parts of this codebase, in general I'm currently more concerned about correctness than performance. Also, before doing further optimizations we should set up some benchmarks (see #98).

For the similarity metrics, caching the results for word pairs is a good idea, but at that level (processing corpora) it seems more like part of a research project or application and less like a feature to be added to Wn. However, all the similarity metrics use hypernym path calculations, and currently each hop of such a path requires one or more hits to the database. I've thought about ways to speed this up (#38, #110).
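To illustrate the per-hop cost mentioned above, here is a sketch that loads hypernym edges into a plain dictionary once and then computes a path-based similarity entirely in memory. This is not how Wn implements its metrics and not necessarily what #38 or #110 propose: the `HYPERNYMS` mapping, the toy synset IDs, and both helper functions are hypothetical, and the score 1 / (1 + distance) is one common definition of path similarity that is assumed here.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: synset ID -> list of hypernym IDs.
# In a real setting this would be filled once with a bulk query,
# instead of querying the database on every hop.
HYPERNYMS: Dict[str, List[str]] = {
    "dog": ["canine"],
    "canine": ["carnivore"],
    "cat": ["feline"],
    "feline": ["carnivore"],
    "carnivore": ["animal"],
    "animal": [],
}


def ancestor_depths(synset: str) -> Dict[str, int]:
    """Map the synset and each of its ancestors to the minimum hop count."""
    depths = {synset: 0}
    frontier = [synset]
    while frontier:
        next_frontier = []
        for node in frontier:
            for parent in HYPERNYMS.get(node, []):
                # Breadth-first order guarantees the first visit is the shortest.
                if parent not in depths:
                    depths[parent] = depths[node] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return depths


def path_similarity(s1: str, s2: str) -> Optional[float]:
    """Return 1 / (1 + shortest hypernym-path distance), or None if unconnected."""
    d1, d2 = ancestor_depths(s1), ancestor_depths(s2)
    common = d1.keys() & d2.keys()
    if not common:
        return None
    distance = min(d1[c] + d2[c] for c in common)
    return 1.0 / (1 + distance)


print(path_similarity("dog", "cat"))
print(path_similarity("dog", "canine"))
```

The trade-off in a design like this is memory for speed: the whole hypernym graph sits in memory, but no similarity call touches the database afterwards.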

You might follow the issues linked above if you're interested in performance. Also, I'm happy to receive pull requests :)
