expgram is an ngram toolkit that can efficiently handle large ngram data:
- A succinct data structure for compactly representing ngram data [1]. Among the ngram compression methods mentioned in [1], we do not implement block-wise compression (zlib over every 8K bytes) for reasons of computational efficiency.
- Language models are estimated with the MapReduce approach proposed in [2], implemented using pthreads and/or MPI.
- Better rest cost estimation for chart-based decoding in machine translation, which estimates lower-order ngram language model parameters [3].
- A transducer-like interface motivated by [4] and an efficient prefix/suffix ngram context computation [3].
Note that this toolkit is primarily developed to handle large ngram count data, which is why it is not named something like xxxlm.
The expgram toolkit is mainly developed by
Taro Watanabe
at Multilingual Translation Laboratory, Universal Communication
Institute, National Institute of Information and Communications
Technology (NICT).
If you have any questions about expgram, you can send them to
taro.watanabe at nict dot go dot jp.
The stable version is 0.2.1. The latest code is also available from github.com.
For details, see BUILD.rst.
./autogen.sh (required when you get the code by git clone)
./configure
make
make install (optional)
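For example, assuming a standard autotools setup, installing into a custom prefix would look like:

./autogen.sh
./configure --prefix=$HOME/local
make
make install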
Basically, you only have to use expgram.py (found in
<build dir>/scripts
or <install prefix>/bin
), which encapsulates all
the steps required to estimate an LM. For instance, you can run:
expgram.py \
--corpus <corpus> or --corpus-list <list of corpus> \
--output <prefix of lm name> \
--order <order of ngram lm> \
--temporary-dir <temporary disk space>
Here, --corpus <corpus> specifies a corpus, a newline-delimited set of sentences, and --corpus-list <list of corpus> specifies a newline-delimited list of corpus files (a brief sketch of both input formats follows the file list below). This will produce six files:
<prefix>.counts       extracted ngram counts
<prefix>.index        indexed ngram counts
<prefix>.modified     indexed modified counts for modified-KN smoothing
<prefix>.estimated    temporarily estimated LM (do not use this!)
<prefix>.lm           LM with efficient indexing
<prefix>.lm.quantize  8-bit quantized LM
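As a rough sketch of the input formats (the file names here are hypothetical), a corpus file contains one sentence per line, and a corpus list contains one corpus file path per line:

corpus.txt:
  this is a sentence
  this is another sentence

corpus-list.txt:
  /path/to/corpus-part1.txt
  /path/to/corpus-part2.txt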
Or, if you already have count data organized in the Google format, simply run:
expgram.py \
--counts <counts in Google format> \
--output <prefix of lm name> \
--order <order of ngram lm> \
--temporary-dir <temporary disk space>
This will produce five files:
<prefix>.index        indexed ngram counts
<prefix>.modified     indexed modified counts for modified-KN smoothing
<prefix>.estimated    temporarily estimated LM (do not use this!)
<prefix>.lm           LM with efficient indexing
<prefix>.lm.quantize  8-bit quantized LM
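For reference, count data in the Google format (as in the Web 1T corpus) is typically organized into one directory per ngram order, each holding files of tab-separated ngram/count entries; a rough sketch, with exact file names possibly differing:

<counts dir>/1gms/vocab.gz
<counts dir>/2gms/2gm-0000.gz
<counts dir>/3gms/3gm-0000.gz
...

where each line looks like:

the quick fox<TAB>1842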
To see the indexed counts, use expgram_counts_dump (found in <build dir>/progs
or <install prefix>/bin
):
expgram_counts_dump --ngram <prefix>.index
which writes the indexed counts in plain text. The language model probabilities are stored as natural logarithms (base e), not as base-10 logarithms. If you want to see the LM, use:
expgram_dump --ngram <prefix>.lm (or <prefix>.lm.quantize)
which writes the LM in ARPA format using the common (base-10) logarithm.
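Note that the two representations differ only in the base of the logarithm, so converting between them is a simple change of base. A minimal sketch in Python, with a made-up value for illustration:

import math

# A log probability as stored internally (natural logarithm, base e).
logprob_e = -2.3026

# The same value as it would appear in the ARPA dump (base-10 logarithm).
logprob_10 = logprob_e / math.log(10)

print(logprob_10)  # roughly -1.0, since e**-2.3026 is about 0.1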
expgram_perplexity --ngram <prefix>.lm (or <prefix>.lm.quantize) < [text-file]
computes the perplexity of the text file.
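For reference, perplexity is the exponentiated average negative log probability of the words in the text. A minimal sketch of the computation in Python, using hypothetical per-word natural-log probabilities:

import math

# Hypothetical per-word log probabilities (natural logarithm) for a short text.
logprobs = [-1.2, -0.7, -2.5, -0.9]

# perplexity = exp(-(1/N) * sum of log probabilities)
perplexity = math.exp(-sum(logprobs) / len(logprobs))

print(perplexity)  # about 3.8 for these made-up numbers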
It has been successfully compiled on x86_64 under Linux, OS X, and Cygwin, and is regularly tested on Linux and OS X.
[1] Taro Watanabe, Hajime Tsukada, and Hideki Isozaki. A succinct n-gram language model. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 341-344, Suntec, Singapore, August 2009. Association for Computational Linguistics.
[2] Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. Large language models in machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 858-867, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[3] Kenneth Heafield, Philipp Koehn, and Alon Lavie. Language model rest costs and space-efficient storage. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1169-1178, Jeju Island, Korea, July 2012. Association for Computational Linguistics.
[4] Jeffrey Sorensen and Cyril Allauzen. Unary data structures for language models. In Proceedings of Interspeech 2011, pages 1425-1428, 2011.