Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LDA show_topic() might show wordids, not words #354

Closed
pshvechikov opened this issue Jun 2, 2015 · 2 comments
Closed

LDA show_topic() might show wordids, not words #354

pshvechikov opened this issue Jun 2, 2015 · 2 comments
Labels
difficulty easy Easy issue: required small fix feature Issue described a new feature

Comments

@pshvechikov
Copy link

I mean this line:
https://github.com/piskvorky/gensim/blob/develop/gensim/models/ldamodel.py#L757
should be

beststr = [(topic[id], id) for id in bestn]

Otherwise it is not possible to get similar output to tfidf model, which shows list of tuples such as this

 (98801, 0.008706198772450975),

including id of the word and its tfidf metric.
Maybe such a behavior should rely on some function parameter, so it dont ruin any other related code which uses default word representation of show_topic.
I mean something like this:

    def show_topic(self, topicid, topn=10, show_ids=False):                         
        """                                                                         
        Return a list of `(words_probability, word)` 2-tuples for the most probable
        words in topic `topicid`.                                                   

        Only return 2-tuples for the topn most probable words (ignore the rest). 

        """                                                                         
        topic = self.state.get_lambda()[topicid]                                    
        topic = topic / topic.sum() # normalize to probability dist                 
        bestn = numpy.argsort(topic)[::-1][:topn]                                   
        if show_ids:                                                                
            beststr = [(topic[id], id) for id in bestn]                             
        else:                                                                       
            beststr = [(topic[id], self.id2word[id]) for id in bestn]               
        return beststr       

Nevertheless, even this approach is contradictory in the order of id, metric inside a tuple - tfidf shows id as a first parameter and lda.show_topics() does it as a second.

@cscorley
Copy link
Contributor

I think that is something desirable, especially making sure that the (id, metric) is consistent across modules. Mind opening a PR?

BTW, there's no need to change the linked line. It isn't returned in the end for that method, anyway. I think you meant a few lines up, in show_topic: https://github.com/piskvorky/gensim/blob/develop/gensim/models/ldamodel.py#L734

@menshikh-iv
Copy link
Contributor

Resolved in #448.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty easy Easy issue: required small fix feature Issue described a new feature
Projects
None yet
Development

No branches or pull requests

4 participants