This is a Python 3 version of the original implementation by atefm, with a number of improvements, mainly to speed and clarity. A full list of changes is given below.
Applying topic models to short texts (e.g. tweets) is more challenging because of data sparsity and the limited context in such texts. One approach is to combine short texts into long pseudo-documents before training LDA. Another is to assume that there is only one topic per document [3]. pDMM provides an implementation of the one-topic-per-document Dirichlet Multinomial Mixture (DMM) model (i.e. a mixture of unigrams) [1][4]. For further reading, see Manning et al. [6] and Lu et al. [7].
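To make the model concrete, here is a minimal, self-contained sketch of a collapsed Gibbs sampler for DMM, implementing the document-to-topic update of Yin and Wang [1]. This is for illustration only and is not pdmm's own implementation, which is vectorised with NumPy and numba.

import numpy as np

def dmm_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.1, iterations=200, seed=0):
    """Illustrative collapsed Gibbs sampling for the one-topic-per-document DMM [1].

    `docs` is a list of documents, each a list of integer word ids in
    [0, vocab_size). Returns the sampled topic of every document.
    """
    rng = np.random.default_rng(seed)
    z = rng.integers(num_topics, size=len(docs))      # topic of each document
    doc_counts = np.zeros(num_topics)                 # documents per topic (m_k)
    word_totals = np.zeros(num_topics)                # words per topic (n_k)
    word_counts = np.zeros((num_topics, vocab_size))  # word-in-topic counts (n_k^w)

    def move(d, k, sign):
        """Add (sign=+1) or remove (sign=-1) document d from topic k's counts."""
        doc_counts[k] += sign
        word_totals[k] += sign * len(docs[d])
        for w in docs[d]:
            word_counts[k, w] += sign

    for d in range(len(docs)):
        move(d, z[d], +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            move(d, z[d], -1)  # take d out of its current topic
            # log p(z_d = k | everything else), following Eq. 4 in [1]
            log_p = np.log(doc_counts + alpha)
            for k in range(num_topics):
                seen = {}  # occurrences of each word so far within this document
                for i, w in enumerate(doc):
                    j = seen.get(w, 0)
                    log_p[k] += np.log(word_counts[k, w] + beta + j)
                    log_p[k] -= np.log(word_totals[k] + vocab_size * beta + i)
                    seen[w] = j + 1
            p = np.exp(log_p - log_p.max())  # normalise in log space for stability
            z[d] = rng.choice(num_topics, p=p / p.sum())
            move(d, z[d], +1)  # put d back under the sampled topic
    return z

With a toy corpus such as docs = [[0, 1, 2], [0, 1], [3, 4, 5], [3, 5]], calling dmm_gibbs(docs, num_topics=2, vocab_size=6) should separate the two word groups into two topics.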
Bug reports, comments and suggestions about pDMM are highly appreciated. As a free open-source package, pDMM is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
pdmm can be run without installation from within the repository directory, but the package can be installed locally with pip:
$ pip install .
From the command line:
$ python3 -m pdmm [-h] -c <path> [-n <integer>] [-a <double>] [-b <double>] [--output-path <path>] [--iterations <integer>] [--num-words <integer>]
where parameters in [ ] are optional.
-c, --corpus
Specify the path to the input corpus file.
-n, --num-topics
Specify the number of topics. The default is 20.
-a, --alpha
Specify the hyper-parameter alpha, the Dirichlet prior on the mixture weights (i.e. the probability of a document choosing a given topic). The default value is 0.1.
-b, --beta
Specify the hyper-parameter beta, the Dirichlet prior on each topic's distribution over words; smaller values concentrate each topic on fewer words. The default value is 0.01, which is a common setting in the literature [5]. Following [1], users may consider a beta value of 0.1 for short texts. The sampling update shown after this parameter list makes precise where alpha and beta enter.
--output-path
Specify the output path for the results, which are saved in a folder at that path containing the files topWords and topicAssignments. If no path is given, output will not be saved.
--iterations
Specify the number of Gibbs sampling iterations. The default value is 2,000.
--num-words
Specify the number of top words to be presented and/or saved for each topic. The default value is 20.
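For reference, alpha and beta enter the sampler through the DMM update of Yin and Wang [1], the same update the sketch near the top of this document implements: document d is assigned to topic k with probability proportional to

p(z_d = k \mid \mathbf{z}_{\neg d}) \;\propto\; (m_{k,\neg d} + \alpha)\,\frac{\prod_{w \in d} \prod_{j=1}^{N_d^w} \left( n_{k,\neg d}^{w} + \beta + j - 1 \right)}{\prod_{i=1}^{N_d} \left( n_{k,\neg d} + V\beta + i - 1 \right)}

where m_{k,\neg d} is the number of documents in topic k, n_{k,\neg d}^{w} the count of word w in topic k and n_{k,\neg d} the total word count of topic k (each excluding document d), N_d^w the count of word w in document d, N_d the document length, and V the vocabulary size.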
Consider the following example:
$ python3 -m pdmm --corpus tests/data/sample_data --iterations 100
From the interpreter:
>>> import pdmm
>>> corpus = pdmm.Corpus.from_document_file("/path/to/corpus/file")
>>> model = pdmm.GibbsSamplingDMM(corpus)
>>> model.randomly_initialise_topic_assignment()
>>> model.inference(number_of_iterations=100)
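The snippet below is an end-to-end sketch using only the calls shown above. The corpus format is an assumption here (one document per line, tokens separated by whitespace); compare with tests/data/sample_data if in doubt.
>>> import tempfile
>>> import pdmm
>>> documents = ["football league goal referee", "election vote senate ballot"]
>>> with tempfile.NamedTemporaryFile("w", delete=False) as handle:  # assumed format: one document per line
...     _ = handle.write("\n".join(documents))
>>> corpus = pdmm.Corpus.from_document_file(handle.name)
>>> model = pdmm.GibbsSamplingDMM(corpus)
>>> model.randomly_initialise_topic_assignment()
>>> model.inference(number_of_iterations=100)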
Tests can be run from the command line:
$ python3 -m tests
The tests can be slow, as they verify that the inference produces the expected results. Alternatively, a test module can be run individually:
$ python3 -m tests corpus
This will attempt to run the tests in the file tests/test_corpus.py.
Python 3.7 is required. All package requirements can be found in requirements.txt, but the main dependencies are coverage, numba and numpy.
All changes can be tracked on GitHub, but the broad changes are as follows:
- The addition of tests to ensure that the algorithm was not affected during refactoring.
- Creation of a Python module rather than standalone scripts, allowing code to be run properly within the interpreter.
- Cleaner code and PEP8 compliance.
- Code rewritten to use NumPy arrays and to run more steps in parallel, leading to a substantial speed-up.
- Code to generate new documents after inference has been completed.
[1] Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233-242. ACM.
[2] David M. Blei. 2012. Probabilistic Topic Models. Communications of the ACM, 55(4):77–84.
[3] Dat Quoc Nguyen, Richard Billingsley, Lan Du and Mark Johnson. 2015. Improving Topic Models with Latent Feature Word Representations. Transactions of the Association for Computational Linguistics, vol. 3, pp. 299-313.
[4] Kamal Nigam, Andrew K. McCallum, Sebastian Thrun, and Tom Mitchell. 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39:103–134.
[5] Thomas L. Griffiths and Mark Steyvers. 2004. Finding Scientific Topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235.
[6] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
[7] Yue Lu, Qiaozhu Mei, and ChengXiang Zhai. 2011. Investigating Task Performance of Probabilistic Topic Models: An Empirical Study of PLSA and LDA. Information Retrieval, 14:178–203.