scripts for training a transliterator using a list of transliteration pairs.
- m2m-aligner
- python v2.7 (+ modules: argparse)
- cdec decoder
- ducttape v2.1 https://github.com/jhclark/ducttape
- KenLM
an example configuration file is provided: `ruen-config.tape`. The following variables are mandatory:

- `ducttape_output`: output directory
- `transliterator_home`: root of the transliterator's repository
- `all_oovs`: source-language words that need to be transliterated (e.g. a test set)
- `char_lm`: KenLM-compiled language model of target-language characters. An English character language model is provided
- `transliteration_pairs`: src-tgt transliterations, one per line, formatted as `SOURCE LANGUAGE ||| CEURSE LAUNJE`
- `m2m_maxX`: maximum length of a source-language character sequence that corresponds to one character in the target language
- `m2m_maxY`: maximum length of a target-language character sequence that corresponds to one character in the source language
- `nprocs`: number of processors to use for training
- `wammar_utils_dir`: root of this repository
- `m2m_aligner`: path to m2m aligner
- `cdec_dir`: path to cdec decoder
- `DelX: yes`: means that some characters in the source language may be deleted
- `DelY: yes`: means that some characters in the target language may be deleted

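
A malformed `transliteration_pairs` file is easy to miss before a long training run, so a quick format check can help. The sketch below is not part of this repository. The names `check_pair_line`, `find_bad_lines` and `check_pairs_file` are hypothetical, and it assumes a UTF-8 file in which every line holds exactly one `|||` separator with non-empty text on both sides.

```python
# -*- coding: utf-8 -*-
# Sanity check for a transliteration_pairs file (illustrative sketch only).
# Assumptions: UTF-8 encoding, exactly one "|||" per line, both sides non-empty.
from __future__ import print_function, unicode_literals

import io

SEPARATOR = "|||"


def check_pair_line(line):
    """Return (source, target) for a well-formed line, or None otherwise."""
    parts = line.rstrip("\n").split(SEPARATOR)
    if len(parts) != 2:
        return None
    source, target = parts[0].strip(), parts[1].strip()
    if not source or not target:
        return None
    return source, target


def find_bad_lines(lines):
    """Return the 1-based numbers of lines that are not 'SRC ||| TGT'."""
    return [number for number, line in enumerate(lines, 1)
            if check_pair_line(line) is None]


def check_pairs_file(path):
    """Scan a pairs file and return the numbers of its malformed lines."""
    with io.open(path, encoding="utf-8") as handle:
        return find_bad_lines(handle)
```

Calling `check_pairs_file` on the file named by `transliteration_pairs` returns a list of suspicious line numbers; an empty list means every line matched the assumed format.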
to train the transliterator, run: `ducttape translit.tape -C ruen-config.tape -p Full -y`

- use `mpi_adagrad_optimize` instead of `mpi_flex_optimize`
- rewrite `convert-alignments-to-cdec-format.py`

## disclaimer:
scripts are still under development and may be unstable. please do contact me if anything does not work.
if you use this software, consider citing our ACL 2012 workshop paper: http://www.cs.cmu.edu/~wammar/pubs/translit-acl12.pdf