MtJaenSmt

Japanese English Statistical Machine Translation

Disclaimer: This page is for notes and discussion of work in progress on SMT between Japanese and English. It is unlikely to be understandable or useful to anyone outside the project.

Results (no MERT):


Model	Pair	Test 1 BLEU	Test 2 BLEU	Test 3 BLEU	Average BLEU	Time taken	Comments
1 (Mecab; No Punctuation)	JE	19.76	19.36	20.44	19.85		JST data
2 (Mecab; Punctuation)	JE	21.11	21.39	21.84	21.45
3 (Mecab Tokenization & Chasen POS)	JE	19.14	19.56	20.14	19.58
4 (Juman Tok & No Punctuation)	JE	18.98	17.55	17.66	18.71
5 (Model 4 but with Lemmas too)	JE	20.72	21.44	21.72	21.29


Model	Pair	Test 1 BLEU	Test 2 BLEU	Test 3 BLEU	Average BLEU	Time taken	Comments
1	EJ						JST data
2	EJ
3 (Mecab)	EJ	24.67
4 (Juman)	EJ
3 (reversed)	EJ

[http://feast.atr.jp/nonverbal/ JST data] 100,000 sentence pairs

Eric's systems:


Model	Factors	Corpus	Pair	MERT	BLEU	Comments
punctuation; lowercase	none	IWSLT06	JE	yes	--	tokenization: Mecab; Moses baseline script
punctuation; lowercase	none	Tanaka	JE	yes	14.39	tokenization: Mecab; Moses baseline script
punctuation; lowercase	none	Tanaka	EJ	yes	26.87	tokenization: Moses baseline script; Mecab
punctuation; lowercase	surface->surface+pos	Tanaka	JE	no	11.39	EN factors: tree tagger
punctuation; lowercase	surface->surface+pos	Tanaka	JE	yes	19.06	EN factors: tree tagger

Models Under Construction

Model 1: (Lowercase English, No Punctuation; Mecab Tokenized Japanese, POS from MeCab (is it really chasen??) & no punctuation) sentence.lc.np.pose.en sentence.np.tokm.posm.ja

Model 2: (Lowercase English; Punctuation; Mecab Tokenized Japanese, POS from MeCab (is it really chasen??) & Punctuation) sentence.lc.p.pose.en sentence.p.tokm.posm.ja

Model 3: (Lowercase English, No Punctuation; Mecab Tokenized Japanese, POS from Chasen & no punctuation) sentence.lc.np.pose.en sentence.np.tokc.posm.ja

Model 4: (Lowercase English, No Punctuation; Juman Tokenized Japanese, POS from Juman & no punctuation) sentence.lc.np.pose.en sentence.np.tokj.posm.ja

Model 5: (Lowercase English, No Punctuation; lemmatized in both languages & no punctuation) sentence.lc.np.dicm.pose.en sentence.np.tokj.dicj.posm.ja
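The lc and np steps shared by Models 1-5 can be sketched roughly as below. This is only an illustration, assuming the input is already tokenized into space-separated tokens and that "punctuation" means tokens made up entirely of Unicode punctuation characters; the function names are hypothetical and the scripts actually used for these models may differ.

```python
import unicodedata


def is_punct(token):
    """True if every character in the token is Unicode punctuation (category P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def preprocess(line, lowercase=False, keep_punct=True):
    """Apply the lc / np steps to one whitespace-tokenized line.

    Assumption: tokenization (Mecab, Juman, ...) has already been run,
    so tokens are separated by spaces.
    """
    tokens = line.split()
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if not keep_punct:
        tokens = [t for t in tokens if not is_punct(t)]
    return " ".join(tokens)


# English side of Model 1 (lc, np) and Japanese side (np only).
print(preprocess("Hello , World !", lowercase=True, keep_punct=False))
print(preprocess("これ は ペン です 。", keep_punct=False))
```

Removing punctuation after tokenization keeps the token boundaries chosen by the tokenizer; stripping it beforehand could change how the Japanese side is segmented.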

Model 6: (best of 1-5 + NE) Do NE (named entity) tagging on both languages and add the NE tag as a factor, e.g. Francis|n|name-B Bond|n|name-M was|v|O here|n|O (or here|n|place-B, depending on your tagger)

Sort of inspired by work at ATR: "Introducing Translation Dictionary Into Phrase-based SMT".
Which NE? Try [http://nlp.cs.nyu.edu/ene Sekine's]
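A minimal sketch of building the Model 6 factored input, assuming the words, POS tags and NE tags are already available as parallel lists (the function name to_factored is hypothetical). It joins the factors with "|", as in the example above:

```python
def to_factored(words, pos_tags, ne_tags):
    """Join parallel word / POS / NE lists into one factored line (word|pos|ne)."""
    if not (len(words) == len(pos_tags) == len(ne_tags)):
        raise ValueError("words, pos_tags and ne_tags must have the same length")
    tokens = []
    for factors in zip(words, pos_tags, ne_tags):
        for factor in factors:
            # "|" separates factors and whitespace separates tokens,
            # so neither may appear inside a factor.
            if "|" in factor or any(ch.isspace() for ch in factor) or not factor:
                raise ValueError(f"invalid factor: {factor!r}")
        tokens.append("|".join(factors))
    return " ".join(tokens)


print(to_factored(
    ["Francis", "Bond", "was", "here"],
    ["n", "n", "v", "n"],
    ["name-B", "name-M", "O", "O"],
))
```

Words that contain a literal "|" would need to be escaped or replaced before this step; the sketch simply rejects them.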

Model 7: (best of 1-5 + NE variant) Do NE on both languages and filter out NEs that don't align in preprocessing, e.g. compare the results (maybe taking only the intersection). This may give better results, as the cues must be different in the two languages.
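One possible reading of the Model 7 filter, sketched under loud assumptions: NE tags look like type-B / type-M or O (as in the Model 6 example), and "intersection" is taken at the level of NE types per sentence pair rather than via word alignment. The helper names are hypothetical.

```python
def ne_type(tag):
    """Strip the position suffix: 'name-B' -> 'name'; 'O' (no entity) -> None."""
    return None if tag == "O" else tag.rsplit("-", 1)[0]


def filter_unmatched(src_tags, tgt_tags):
    """Keep only NE tags whose type occurs on both sides of the sentence pair.

    Assumption: a type-level intersection per sentence pair. Tags whose
    type is missing on the other side are reset to 'O'.
    """
    shared = {ne_type(t) for t in src_tags} & {ne_type(t) for t in tgt_tags}
    shared.discard(None)

    def keep(tags):
        return [t if ne_type(t) in shared else "O" for t in tags]

    return keep(src_tags), keep(tgt_tags)


# English tagger found a name and a place; Japanese tagger found only a name.
en, ja = filter_unmatched(
    ["name-B", "name-M", "O", "place-B"],
    ["name-B", "O", "O"],
)
print(en)
print(ja)
```

A stricter variant could require the tagged words themselves to be aligned, but that needs word alignments, which are not available at preprocessing time.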

lc = lowercase
np = no punctuation
p = punctuation
tokc = Chasen Tokenized
tokj = Juman Tokenized
tokm = Mecab Tokenized
dicj = Root Form from Juman
dicm = Root Form from MORPH English Morphological Tagger
pose = POS Adwait Maximum Entropy Tagger
posc = POS Chasen
posj = POS Juman
posm = POS MeCab
en = english
ja = japanese

Sample File Names: sentences.lc.p.pose.en sentences.p.tokm.posm.ja
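The naming scheme can be decoded mechanically. The sketch below is just a convenience, assuming every dot-separated piece after the base name is one of the suffix codes listed above; the describe helper is hypothetical and not part of the project scripts.

```python
# Suffix meanings copied from the glossary on this page.
SUFFIXES = {
    "lc": "lowercase",
    "np": "no punctuation",
    "p": "punctuation",
    "tokc": "Chasen tokenized",
    "tokj": "Juman tokenized",
    "tokm": "Mecab tokenized",
    "dicj": "root form from Juman",
    "dicm": "root form from MORPH",
    "pose": "POS from maximum entropy tagger",
    "posc": "POS from Chasen",
    "posj": "POS from Juman",
    "posm": "POS from MeCab",
    "en": "english",
    "ja": "japanese",
}


def describe(filename):
    """Return (suffix, meaning) pairs for each dot-separated suffix after the base name."""
    _base, *codes = filename.split(".")
    return [(code, SUFFIXES.get(code, "unknown")) for code in codes]


for name in ("sentences.lc.p.pose.en", "sentences.p.tokm.posm.ja"):
    print(name)
    for code, meaning in describe(name):
        print(f"  {code:5} {meaning}")
```

Running this over the file names in the model list is a quick way to spot a suffix that does not match the model's description.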

Other Ideas

Use French from http://wwwcyg.utc.fr/tatoeba/ to cross align on Tanaka Corpus
Parse and generate both sides and train off the expanded corpus.

Data Sources
