[WIP] Topic model visualization #1932

Closed
wants to merge 25 commits into from

parulsethi
Contributor

This PR adds the visualization for interactive exploration of doc/topic/word entities in topic models (as discussed).

@menshikh-iv menshikh-iv added the incubator project PR is RaRe incubator project label Feb 25, 2018
@menshikh-iv menshikh-iv mentioned this pull request Mar 16, 2018
2 tasks
@menshikh-iv
Contributor

Feature review

Preparation

I'm using this script:

from gensim.corpora import Dictionary
from gensim.models import ldamodel
import gensim.downloader as api
import logging

logging.basicConfig(level=logging.INFO)

texts = [
    ['bank', 'river', 'shore', 'water'],
    ['river', 'water', 'flow', 'fast', 'tree'],
    ['bank', 'water', 'fall', 'flow'],
    ['bank', 'bank', 'water', 'rain', 'river'],
    ['river', 'water', 'mud', 'tree'],
    ['money', 'transaction', 'bank', 'finance'],
    ['bank', 'borrow', 'money'],
    ['bank', 'finance'],
    ['finance', 'money', 'sell', 'bank'],
    ['borrow', 'sell'],
    ['bank', 'loan', 'sell'],
]

# texts = api.load("text8")
   
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
doc_texts = [' '.join(i) for i in texts]

model = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=30)


from gensim.viz.topic_viz.gensim_wrap import prepare   
import gensim.viz.topic_viz as viz

print("LDA MODEL READY")
vis = prepare(model, corpus, dictionary, doc_texts)
viz.show(vis, local=True)

I also started with py2, but it doesn't work there; it works only with py3.

Installation

  • Missing dependencies (they need to be added to setup.py as a custom extra, see how "distributed" is handled in setup.py):
    https://github.com/RaRe-Technologies/gensim/blob/49e6abdde6bca55b7a5b263c01ecdc94a75fb1ab/setup.py#L226.
    About the dependencies:

    • jinja2 - OK
    • past (future), joblib, pandas, funcy (I'm not sure these deps are really needed)
  • Hardcoded and missing d3js: IOError: [Errno 2] No such file or directory: '/home/ivan/Desktop/student_repo/parul/gensim/gensim/viz/topic_viz/js/d3.v4.min.js'

  • Problems with py2

     (parul-vis) ivan@P50:~/Desktop/student_repo/parul/gensim/parul_src$ python ldavis_run.py 
     Serving to http://127.0.0.1:8889/    [Ctrl-C to exit]
     Gtk-Message: Failed to load module "overlay-scrollbar"
     127.0.0.1 - - [10/Apr/2018 12:22:44] "GET / HTTP/1.1" 200 -
     Created new window in existing browser session.
     127.0.0.1 - - [10/Apr/2018 12:22:44] "GET /LDAvis.css HTTP/1.1" 200 -
     127.0.0.1 - - [10/Apr/2018 12:22:44] "GET /d3.js HTTP/1.1" 200 -
     ----------------------------------------
     Exception happened during processing of request from ('127.0.0.1', 60542)
     Traceback (most recent call last):
       File "/usr/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
         self.process_request(request, client_address)
       File "/usr/lib/python2.7/SocketServer.py", line 318, in process_request
         self.finish_request(request, client_address)
       File "/usr/lib/python2.7/SocketServer.py", line 331, in finish_request
         self.RequestHandlerClass(request, client_address, self)
       File "/usr/lib/python2.7/SocketServer.py", line 652, in __init__
         self.handle()
       File "/usr/lib/python2.7/BaseHTTPServer.py", line 340, in handle
         self.handle_one_request()
       File "/usr/lib/python2.7/BaseHTTPServer.py", line 328, in handle_one_request
         method()
       File "/home/ivan/Desktop/student_repo/parul/gensim/gensim/viz/topic_viz/server.py", line 50, in do_GET
         self.wfile.write(content.encode())
     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 133429: ordinal not in range(128)
     ----------------------------------------
     127.0.0.1 - - [10/Apr/2018 12:22:45] "GET /LDAvis.js HTTP/1.1" 200 -
     127.0.0.1 - - [10/Apr/2018 12:22:45] code 404, message Not Found
     127.0.0.1 - - [10/Apr/2018 12:22:45] "GET /favicon.ico HTTP/1.1" 404 -
    
  • Missing instructions (how to run it)
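
A minimal sketch of one possible fix for the py2 traceback above. In Python 2, calling `.encode()` on a byte `str` first decodes it implicitly as ASCII, which fails on non-ASCII bytes such as 0xc2. The helper name `to_bytes` and the UTF-8 default are assumptions for illustration, not code from this PR.

```python
def to_bytes(content, encoding="utf-8"):
    """Return `content` as bytes, encoding it only if it is text.

    On Python 2, `bytes` is an alias of `str`, so byte strings pass
    through untouched and no implicit ASCII decode is triggered.
    """
    if isinstance(content, bytes):
        return content
    return content.encode(encoding)


# Hypothetical usage inside a request handler:
#     self.wfile.write(to_bytes(content))
print(to_bytes(u"caf\xe9"))
```

With a helper like this, the `do_GET` handler could write the same payload on py2 and py3, as long as the served files are read consistently (either always as bytes or always as text).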

Usage

  • Better to resize it to fullscreen (2/3 of my screen is blank)
  • Better to number from 0 (to be consistent with Gensim)
  • Can't select a word by number (even though the field shows 0 by default)
  • I want to see the words (string representation) when I select a topic/document. Otherwise I see some big circles, but I have no idea what a big circle means (same for documents)
  • Error handling (if I query a non-existent document or word, it currently fails silently)
  • Missing instructions (directly in the app + how to run the app). Need to add an .rst to docs/
  • [CRITICAL] I replaced the small data with text8 (also a small corpus), but in the prepare function my RAM ran out: 13GB immediately, and it looks like that wasn't the end. If this happens with text8 (a really small corpus, 32MB gzipped), I'm really worried about how this will work with real datasets (several gigabytes or more)
  • Need a list of suggestions when I type a word in the field (like the hints a search engine shows when you type a query)
  • The documents are shown in a strange manner, I'm slightly confused. Better to allow users to attach the original text (instead of the list of tokens), because I can already see the most important tokens when I select the document.
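
To illustrate the word-suggestion request above, here is a rough sketch of prefix matching over a sorted vocabulary using binary search. The function name `suggest` and the `limit` parameter are hypothetical; the actual app would most likely do this on the JS side. An empty result can double as the signal for the "non-existent word" error case.

```python
from bisect import bisect_left


def suggest(prefix, sorted_vocab, limit=5):
    """Return up to `limit` words from `sorted_vocab` that start with `prefix`."""
    # Binary search for the first word that is >= prefix.
    start = bisect_left(sorted_vocab, prefix)
    matches = []
    for word in sorted_vocab[start:start + limit]:
        # Words sharing the prefix are contiguous in sorted order.
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


vocab = sorted(["bank", "borrow", "finance", "flow", "fall", "fast", "money", "mud"])
print(suggest("f", vocab))
print(suggest("x", vocab))  # empty list: the word is not in the vocabulary
```

For a Gensim model, the vocabulary could presumably be built once with `sorted(dictionary.token2id)` and reused for every keystroke.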

@menshikh-iv
Contributor

The major part of the work is done; the first step now is to address the review in #1932 (comment). This needs a good understanding of Python and JS, plus a basic understanding of topic modeling, so we need contributors here (but the project isn't easy and needs a lot of time).

Thanks for your work @parulsethi! I hope that you'll return to Gensim later 👍
