
Calculate Similarity of Distinct LDA Models #1328

Closed
HarryBaker opened this issue May 16, 2017 · 12 comments

@HarryBaker

This is a slightly modified version of topic2topic_difference found here: #1243

Rather than comparing the similarity of a single LDA model across training iterations, I want to compare the similarity of two distinct LDA models after training. The idea behind this is to calculate the similarity of two distinct LDA models trained on the exact same data with the exact same parameters. If their similarity is very high, this should indicate that the models are reproducible, and that another person could train a new model on the same data with the same parameters and be confident that their model is the same as ours.

However, I think this has an application beyond testing identical models trained under different seeds. If you use the Jaccard distance of the top N words of each topic, then I think you can compare topics across models trained under different datasets. For instance, if you have two models trained on similar datasets over different periods of time, you can match topics across models and study how they've changed over time. I'm going to do more research into this when I'm confident that the topic matching works.
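For concreteness, the top-N Jaccard comparison described above can be sketched like this (`jaccard_distance` is an illustrative helper, not part of gensim; topics are assumed to be `(word, probability)` lists as returned by `LdaModel.show_topic`):

```python
def jaccard_distance(topic_a, topic_b, topn=10):
    """Jaccard distance between the top-N word sets of two topics.

    topic_a, topic_b: iterables of (word, probability) pairs, e.g.
    the output of LdaModel.show_topic() in gensim.
    """
    top_a = {word for word, _ in sorted(topic_a, key=lambda x: -x[1])[:topn]}
    top_b = {word for word, _ in sorted(topic_b, key=lambda x: -x[1])[:topn]}
    union = top_a | top_b
    if not union:
        return 0.0
    # 1 - |intersection| / |union|: 0.0 for identical word sets,
    # 1.0 for fully disjoint ones
    return 1.0 - len(top_a & top_b) / len(union)
```

Because this uses only word identities, not probabilities, it can compare topics from models trained on different corpora, as long as the vocabularies overlap.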

A quick warning: this is my first time contributing open-source code, so I apologize in advance if I do anything wrong in terms of style or workflow. I'm currently working on the problem of model reproducibility for my company, and thought my code might be useful to the gensim community.

@tmylk
Contributor

tmylk commented May 16, 2017

Hi, visualisations are the top priority on our roadmap, so this would be very welcome.
Plotting the all-vs-all distance matrix is a good start, but a single "best alignment distance" number would also be nice. It's a very similar idea to Word Mover's Distance.
How do you suggest matching the topics between models?

@HarryBaker
Author

I am going to write a script that uses a modified topic2topic_difference to return the all-vs-all distance matrix of two topic models. I will use this to pair each topic in model 1 with its most similar topic in model 2, along with its distance score. I sort this list by distance and start assigning the most similar topics as matches. If I run into a collision (that is, a topic that has already been matched), I find the next most similar pairing for the topic in model 1, re-sort the list, and continue until I have a 1-to-1 relationship between all topics. This gives me a bijection between topics, and I can then average each pair's distance score to assign an overall similarity score to the two models. From my tests so far the matching appears to work reasonably well, though I need to do further testing of the average similarity score.
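The greedy matching procedure described above could be sketched like this (a hypothetical standalone implementation, not the actual fork; processing all pairs in globally sorted order and skipping already-matched topics is equivalent to the assign-best, on-collision-take-next-best, re-sort loop, and it assumes both models have the same number of topics):

```python
import numpy as np

def greedy_topic_match(dist):
    """Greedily pair topics of model 1 with topics of model 2.

    dist: (K, K) all-vs-all matrix where dist[i][j] is the distance
    between topic i of model 1 and topic j of model 2.
    Returns (matches, avg) where matches maps i -> j and avg is the
    mean distance over the matched pairs.
    """
    dist = np.asarray(dist, dtype=float)
    k, m = dist.shape
    # All candidate pairs, closest first.
    candidates = sorted((dist[i, j], i, j) for i in range(k) for j in range(m))
    matched_i, matched_j, matches = set(), set(), {}
    for d, i, j in candidates:
        if i in matched_i or j in matched_j:
            continue  # collision: this topic is already matched
        matches[i] = j
        matched_i.add(i)
        matched_j.add(j)
    avg = float(np.mean([dist[i, j] for i, j in matches.items()]))
    return matches, avg
```

Note that a greedy matching is not always the globally optimal alignment: committing to the single closest pair first can force much worse pairings later, so a transport- or assignment-based formulation can yield a lower average distance.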

@tmylk
Contributor

tmylk commented May 17, 2017

Please have a look at the Word Mover's Distance code in gensim referenced above; the "minimal transport search" algorithm can be reused from that package.
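As a point of comparison: when every topic carries equal weight, a 1-to-1 topic alignment is a special case of the transport problem behind Word Mover's Distance and reduces to plain linear assignment, which SciPy solves directly. This is a sketch under that equal-weight assumption, not gensim's WMD code itself (which uses a full earth-mover solver):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def optimal_topic_match(dist):
    """Optimal 1-to-1 topic alignment minimizing total distance.

    Uses the Hungarian algorithm on the (K, K) distance matrix.
    Returns (matches, avg) with matches mapping model-1 topic index
    to its assigned model-2 topic index.
    """
    dist = np.asarray(dist, dtype=float)
    rows, cols = linear_sum_assignment(dist)
    avg = float(dist[rows, cols].mean())
    return dict(zip(rows.tolist(), cols.tolist())), avg
```

On `[[0.1, 0.2], [0.15, 0.9]]` this picks the cross pairing with average distance 0.175, whereas a greedy pass that first commits to the 0.1 pair ends up averaging 0.5.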

@HarryBaker
Author

Ok, I will check it out. Thanks!

@HarryBaker
Author

HarryBaker commented May 18, 2017

Word Mover's Distance does seem to work better than the other metrics so far. For the most similar topics it aligns identically with Jaccard and KL divergence, but for the "fuzzier" topics I think it does a better job of matching them.

Here is my fork of the project. It's still a work in progress. I need to add in sanity checks and make it fit with gensim's style, but it shows how my code works. It's in branch topic2topic_seperate_models

https://github.com/HarryBaker/gensim

@tmylk
Contributor

tmylk commented May 18, 2017

Thanks for looking into it. Could you please add an ipynb illustrating this point?

@HarryBaker
Author

I can't publish the data I'm studying now, but tomorrow I will try to find public data to demonstrate with.

@HarryBaker
Author

Do you know if any papers have been written on measuring the reproducibility of LDA models? I've tried to find papers on the subject, and it doesn't seem to have been studied. This is surprising, because I would think that guaranteeing reproducibility would be a major part of academic research. If it hasn't been studied, my department might look into putting out a research paper on the subject.

@HarryBaker
Author

https://pdfs.semanticscholar.org/d6d4/3ee873e40c3186f6313028ef1a4c08225c96.pdf

Seems like it's covering a similar issue. Weirdly, this is the only paper I've found on measuring the stability of LDA topics.

@tmylk
Contributor

tmylk commented May 30, 2017

@HarryBaker Stability under different random seeds is indeed an important issue. There is also inherent non-determinism in multithreading in MulticoreLDA vs single core LDAModel that would be nice to measure.

BTW see two model comparison graph in #1374

@macks22
Contributor

macks22 commented Jun 15, 2017

@tmylk @HarryBaker
This paper evaluates a variety of techniques for topic similarity, ranking average PMI between top-N representations of the topics as one of the best techniques, along with Explicit Semantic Analysis (ESA). The average PMI approach can easily be implemented using the same code as the CoherenceModel.

The Jaccard similarity between topic-term distributions technique is analyzed and shown to have low agreement with human annotators. I believe this is the metric used by the topic2topic_difference code being referred to. That isn't to say it's not worth having; just that it may be an inferior technique to others that are also easy to integrate.
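For reference, the average-PMI idea could be sketched roughly like this (`avg_pmi_similarity` and its count inputs are illustrative names, not gensim API; in practice the document and co-document frequencies would come from a reference corpus, e.g. via the same counting machinery CoherenceModel uses):

```python
import math
from itertools import product

def avg_pmi_similarity(top_a, top_b, doc_freq, co_doc_freq, num_docs, eps=1e-12):
    """Average PMI over all cross-topic word pairs of two top-N lists.

    doc_freq maps word -> number of documents containing it;
    co_doc_freq maps frozenset({w1, w2}) -> number of documents
    containing both. Both are counted over a reference corpus.
    """
    pmis = []
    for w1, w2 in product(top_a, top_b):
        p1 = doc_freq.get(w1, 0) / num_docs
        p2 = doc_freq.get(w2, 0) / num_docs
        p12 = co_doc_freq.get(frozenset((w1, w2)), 0) / num_docs
        # eps guards against log(0) for unseen pairs
        pmis.append(math.log((p12 + eps) / (p1 * p2 + eps)))
    return sum(pmis) / len(pmis) if pmis else 0.0
```

Higher values mean the two topics' top words co-occur more often than chance, so this gives a similarity (to be maximized) rather than a distance.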

@menshikh-iv
Contributor

Resolved in #1374.
