expgram is an ngram toolkit that can efficiently handle large ngram data:
- A succinct data structure for compactly representing ngram data [1]. Among the ngram compression methods mentioned in [1], we do not implement block-wise compression (zlib over every 8K bytes) for reasons of computational efficiency.
- Language models are estimated with the MapReduce approach proposed in [2], implemented using pthreads and/or MPI.
- Better rest cost estimation for chart-based decoding in machine translation, which estimates lower-order ngram language model parameters [3].
- A transducer-like interface motivated by [4] and an efficient prefix/suffix ngram context computation [3].
Note that this toolkit is primarily developed to handle large ngram count data, which is why it is not named something like xxxlm.
The expgram toolkit is mainly developed by
Taro Watanabe
at Multilingual Translation Laboratory, Universal Communication
Institute, National Institute of Information and Communications
Technology (NICT).
If you have any questions about expgram, you can send them to
taro.watanabe at nict dot go dot jp.
The stable version is 0.2.1. The latest code is also available from github.com.
For details, see BUILD.rst.
./autogen.sh (required when you get the code by git clone)
./configure
make
make install (optional)
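For example, assuming a standard autotools setup, installing into a custom prefix would look like:

./autogen.sh
./configure --prefix=$HOME/local
make
make install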
Basically, you only have to use expgram.py (found in
<build dir>/scripts
or <install prefix>/bin
), which encapsulates all
the steps required to estimate an LM. For instance, you can run:
expgram.py \
--corpus <corpus> or --corpus-list <list of corpus> \
--output <prefix of lm name> \
--order <order of ngram lm> \
--temporary-dir <temporary disk space>
Here, --corpus <corpus> specifies a corpus, a newline-delimited set of sentences, and --corpus-list <list of corpus> specifies a newline-delimited list of corpus files (a brief sketch of both input formats follows the file list below). This will produce six files:
<prefix>.counts       extracted ngram counts
<prefix>.index        indexed ngram counts
<prefix>.modified     indexed modified counts for modified-KN smoothing
<prefix>.estimated    temporarily estimated LM (do not use this!)
<prefix>.lm           LM with efficient indexing
<prefix>.lm.quantize  8-bit quantized LM
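As a rough sketch of the input formats (the file names here are hypothetical), a corpus file contains one sentence per line, and a corpus list contains one corpus file path per line:

corpus.txt:
  this is a sentence
  this is another sentence

corpus-list.txt:
  /path/to/corpus-part1.txt
  /path/to/corpus-part2.txt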
Or, if you already have count data organized in the Google format, simply run:
expgram.py \
--counts <counts in Google format> \
--output <prefix of lm name> \
--order <order of ngram lm> \
--temporary-dir <temporary disk space>
This will produce five files:
<prefix>.index        indexed ngram counts
<prefix>.modified     indexed modified counts for modified-KN smoothing
<prefix>.estimated    temporarily estimated LM (do not use this!)
<prefix>.lm           LM with efficient indexing
<prefix>.lm.quantize  8-bit quantized LM
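For reference, count data in the Google format (as in the Web 1T corpus) is typically organized into one directory per ngram order, each holding files of tab-separated ngram/count entries; a rough sketch, with exact file names possibly differing:

<counts dir>/1gms/vocab.gz
<counts dir>/2gms/2gm-0000.gz
<counts dir>/3gms/3gm-0000.gz
...

where each line looks like:

the quick fox<TAB>1842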
To see the indexed counts, use expgram_counts_dump (found in <build dir>/progs
or <install prefix>/bin
):
expgram_counts_dump --ngram <prefix>.index
which writes the indexed counts in plain text. The language model probabilities are stored as natural logarithms (base e), not as base-10 logarithms. If you want to see the LM, use:
expgram_dump --ngram <prefix>.lm (or <prefix>.lm.quantize)
which writes the LM in ARPA format using the common (base-10) logarithm.
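Note that the two representations differ only in the base of the logarithm, so converting between them is a simple change of base. A minimal sketch in Python, with a made-up value for illustration:

import math

# A log probability as stored internally (natural logarithm, base e).
logprob_e = -2.3026

# The same value as it would appear in the ARPA dump (base-10 logarithm).
logprob_10 = logprob_e / math.log(10)

print(logprob_10)  # roughly -1.0, since e**-2.3026 is about 0.1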
expgram_perplexity --ngram <prefix>.lm (or <prefix>.lm.quantize) < [text-file]
computes the perplexity of the text file.
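For reference, perplexity is the exponentiated average negative log probability of the words in the text. A minimal sketch of the computation in Python, using hypothetical per-word natural-log probabilities:

import math

# Hypothetical per-word log probabilities (natural logarithm) for a short text.
logprobs = [-1.2, -0.7, -2.5, -0.9]

# perplexity = exp(-(1/N) * sum of log probabilities)
perplexity = math.exp(-sum(logprobs) / len(logprobs))

print(perplexity)  # about 3.8 for these made-up numbers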
It has been successfully compiled on x86_64 under Linux, OS X, and Cygwin, and is regularly tested on Linux and OS X.
[1] Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. A succinct n-gram language model. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 341-344, Suntec, Singapore, August 2009. Association for Computational Linguistics.
[2] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858-867, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[3] Kenneth Heafield, Philipp Koehn, and Alon Lavie. Language model rest costs and space-efficient storage. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1169-1178, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
[4] Jeffrey Sorensen and Cyril Allauzen. Unary data structures for language models. In Proceedings of Interspeech 2011, pages 1425-1428, 2011.