Fix computation of topic coherence #3197
Conversation
Hello, do you have any plans to consider this PR and the related issue? Many people use gensim to estimate the coherence of topics and I believe it's important to get accurate and reliable scores. Thanks :)
Thanks for the clear and articulate PR! Sorry we've been swamped with life, and to be honest I forgot about this PR. It's great you pinged us! I'll try to review. CC @mpenkov WDYT?
I don't really understand the relevant_words business, and the existing code looks a bit shady to me overall. So I only did a surface review – as long as your changes don't break the corner case you're replacing ("no relevant_words at all") and help your cause, I think we're good to merge. Thanks!
Hello, if a topic is composed only of words that are not in the dictionary, that section of code is not even reached: CoherenceModel(topics, ...) would raise an exception before the coherence is computed. I added an additional test for this case. Apologies for the late reply. I will have more time to work on this issue in the coming days, in case you request other changes.
Before, empty texts were skipped; with this PR they yield an empty numpy array. This changes the behaviour but solves @silviatti's original issue (and is also conceptually cleaner), so I'm merging.
Fixes #3181.
Problem description and how I solved it
Before this PR, the accumulator used for the computation of topic coherence discarded documents that contain no words from the input topics, called "not relevant" in the code. Although this may seem computationally convenient, it affects the computation of topic coherence (c_v, c_npmi and c_uci): the number of documents counted by the accumulator is part of the formula of the direct confirmation measures.
As a result, discarding "irrelevant" documents makes the resulting topic coherence not comparable with coherences computed over topics containing different words (which can lead to a different total number of documents). To solve this, I count all the documents of the corpus when computing the number of documents. This PR should solve issue #3181.
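To make the dependence on the document count explicit, here is a rough sketch of the quantities involved (using the usual definitions: D(w) is the number of documents, or sliding windows, containing w, and N is the total number counted by the accumulator; NPMI is the direct confirmation measure behind c_npmi):

$$P(w_i) = \frac{D(w_i)}{N}, \qquad P(w_i, w_j) = \frac{D(w_i, w_j)}{N}$$

$$\mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j) + \epsilon}{P(w_i)\, P(w_j)}}{-\log \left( P(w_i, w_j) + \epsilon \right)}$$

If N excludes "irrelevant" documents, every probability, and hence every confirmation score, shifts depending on which topics are being evaluated.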
Motivating example
I'll explain the fairness issue with a simple example. Suppose we have two topic models, A and B. Each topic model outputs two topics, and the first topic of each model (A1 and B1) happens to be the same. We want to compare their coherences in a fair way, so we expect the coherence of A1 and B1 to be the same.
Topic A1: ['human', 'computer']
Topic A2: ['system', 'interface']
Topic B1: ['human', 'computer']
Topic B2: ['graph', 'minors']
Corpus of documents to compute the occurrences:
d1: 'human computer system'
d2: 'human interface'
d3: 'graph minors trees'
Before the PR:
Considering topic model A:
Document d3 ('graph minors trees') is discarded by the accumulator because it doesn't contain any of the top words of A1 or A2, so the total number of documents (num_docs) is 2. Let us compute the probability of occurrence of the word "human" (one of the quantities that feed the direct confirmation measure; I show only "human", but the same applies to the other words, and it is enough to show that the results differ even though they shouldn't). P(human) would be:
P(human) = number of documents containing "human" / total number of "relevant" docs = 2/2
Considering topic model B:
Now document d3 is not discarded from the counts because it contains words that are among the top words of B2, so the total number of documents is 3. Then P(human) is:
P(human) = number of documents containing "human" / total number of "relevant" docs = 2/3
P(human), along with all the other word and co-occurrence probabilities, will then be used in the computation of topic coherence (in particular, in the direct confirmation measure). We therefore get two different coherences for topics A1 and B1, even though they are in fact the same topic. This is not what we expect: the coherence of the same topic should be the same.
Code for comparing the coherences of A1 and B1:
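A minimal sketch of such a comparison with CoherenceModel (the choice of the c_v measure and the variable names here are just illustrative):

```python
from gensim.corpora.dictionary import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# The toy corpus and topics from the example above.
texts = [
    ['human', 'computer', 'system'],   # d1
    ['human', 'interface'],            # d2
    ['graph', 'minors', 'trees'],      # d3
]
dictionary = Dictionary(texts)

topics_a = [['human', 'computer'], ['system', 'interface']]  # A1, A2
topics_b = [['human', 'computer'], ['graph', 'minors']]      # B1, B2

cm_a = CoherenceModel(topics=topics_a, texts=texts, dictionary=dictionary, coherence='c_v')
cm_b = CoherenceModel(topics=topics_b, texts=texts, dictionary=dictionary, coherence='c_v')

# A1 and B1 are the same topic, so their per-topic scores should be equal.
print(cm_a.get_coherence_per_topic()[0])  # coherence of A1
print(cm_b.get_coherence_per_topic()[0])  # coherence of B1
```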
After the PR:
Considering topic model A:
Document d3 is not discarded from the counts because all documents are considered, so P(human) is:
P(human) = number of documents containing "human" / total number of docs = 2/3
Considering topic model B:
Document d3 is not discarded, so P(human) is:
P(human) = number of documents containing "human" / total number of docs = 2/3
Now the coherences of A1 and B1 are the same (the same code above can be used to verify it).
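If useful, the document count that each model actually sees can also be inspected through the accumulator. This is only a sketch; it assumes the object returned by estimate_probabilities() exposes num_docs, as in gensim's topic_coherence internals:

```python
# Using cm_a and cm_b from the snippet above.
acc_a = cm_a.estimate_probabilities()
acc_b = cm_b.estimate_probabilities()

# After this PR both accumulators count every document (or window),
# so the two numbers should match.
print(acc_a.num_docs, acc_b.num_docs)
```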