
MtJaenSmt

FrancisBond edited this page Jun 20, 2008 · 63 revisions

Japanese English Statistical Machine Translation

Disclaimer: This page is for notes and discussion of work in progress on SMT between Japanese and English. It is unlikely to be understandable or useful to anyone outside the project.

Results (no MERT):

| Model | Factors | Pair | Test 1 BLEU | Test 2 BLEU | Test 3 BLEU | Average BLEU | Time taken | Comments |
|---|---|---|---|---|---|---|---|---|
| (Mecab; No Punctuation) | surface-->surface | JE | 19.76 | 19.36 | 20.44 | 19.85 | | JST data |
| (Mecab; Punctuation) | surface-->surface | JE | 21.11 | 21.39 | 21.84 | 21.45 | | |
| (Mecab Tokenization & Chasen POS) | surface-->surface pos-->pos | JE | 19.14 | 19.56 | 20.14 | 19.58 | | |
| (Juman No Punctuation) | surface-->surface pos-->pos | JE | 18.98 | 17.55 | 17.66 | 18.71 | | |
| (Juman & Punctuation, Lemmas too) | surface-->surface pos-->pos lemma,pos-->lemma lemma,pos-->surface | JE | 20.72 | 21.44 | 21.72 | 21.29 | | |
| Mecab Punctuation, POS, Lemmas, Morph | t2:surface-->surface, t0:lemma-->lemma, g1:lemma-->pos, t1: morph-->pos, g2: pos,lemma-->surface | JE | 19.68 | 19.87 | 19.59 | 19.71 | | |
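
For the first row, the Average BLEU figure matches the arithmetic mean of the three test-set scores. A minimal sketch of that calculation, assuming a plain unweighted mean rounded to two decimals (the function name `average_bleu` is made up for illustration):

```python
def average_bleu(scores):
    """Return the unweighted mean of per-test-set BLEU scores, rounded to 2 places."""
    return round(sum(scores) / len(scores), 2)


# Test 1-3 BLEU for the first row (Mecab; No Punctuation).
first_row = [19.76, 19.36, 20.44]
print(average_bleu(first_row))
```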

Next:

Model Test 1 BLEU Test 2 BLEU Test 3 BLEU Average BLEU Time taken Comments
1 EJ JST data
2 EJ
3 (Mecab) EJ 24.67
4 (Juman) EJ
3 (reversed) EJ

Eric's systems:

| Model | Factors | Corpus | Pair | MERT | BLEU | Comments | Time |
|---|---|---|---|---|---|---|---|
| punctuation; lowercase | none | IWSLT06 | JE | yes | -- | tokenization: Mecab; Moses baseline script | |
| punctuation; lowercase | none | Tanaka | JE | yes | 14.39 | tokenization: Mecab; Moses baseline script | |
| punctuation; lowercase | surface->surface+pos | Tanaka | JE | no | 11.39 | EN factors: tree tagger | < 24h |
| punctuation; lowercase | surface->surface+pos | Tanaka | JE | yes | 19.06 | EN factors: tree tagger | |
| punctuation; lowercase | t: lemma->lemma; morph->pos g: lemma->pos; lemma+pos->surface | Tanaka | JE | yes | 18.67 | JA factors: Mecab; EN factors: tree tagger | |
| punctuation; lowercase | none | Tanaka | EJ | yes | 26.87 | tokenization: Moses baseline script; Mecab | |
| punctuation; lowercase | surface->surface+pos | Tanaka | EJ | yes | 26.10 | JA factors: Mecab | |

Models Under Construction

Model 1: (Lowercase English, No Punctuation; Mecab Tokenized Japanese, POS from MeCab (is it really chasen??) & no punctuation) sentence.lc.np.pose.en sentence.np.tokm.posm.ja

Model 2: (Lowercase English; Punctuation; Mecab Tokenized Japanese, POS from MeCab (is it really chasen??) & Punctuation) sentence.lc.p.pose.en sentence.p.tokm.posm.ja

Model 3: (Lowercase English, No Punctuation; Mecab Tokenized Japanese, POS from Chasen & no punctuation) sentence.lc.np.pose.en sentence.np.tokc.posm.ja

Model 4: (Lowercase English, No Punctuation; Juman Tokenized Japanese, POS from Juman & no punctuation) sentence.lc.np.pose.en sentence.np.tokj.posm.ja

Model 5: (Lowercase English, No Punctuation; lemmatized in both languages & no punctuation) sentence.lc.np.dicm.pose.en sentence.np.tokj.dicj.posm.ja

Model 6: (best of 1-5 + NE) Do NE tagging on both languages and add the NE tag as a factor, e.g. Francis|n|name-B Bond|n|name-M was|v|O here|n|O (or here|n|place-B, depending on your tagger).

  • Sort of inspired by work at ATR: "Introducing Translation Dictionary Into Phrase-based SMT".

  • Which NE? Try Sekine's: http://nlp.cs.nyu.edu/ene
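
The factored token format in the Model 6 example (surface|pos|NE) can be built from parallel annotation layers. The following is only a sketch: the helper name `to_factored` is hypothetical, and it assumes each tagger returns exactly one value per token.

```python
def to_factored(tokens, *factor_layers, sep="|"):
    """Join parallel annotation layers into factored tokens such as surface|pos|ne."""
    for layer in factor_layers:
        if len(layer) != len(tokens):
            raise ValueError("every factor layer needs exactly one value per token")
    # zip pairs each surface token with its factors, in layer order.
    rows = zip(tokens, *factor_layers)
    return " ".join(sep.join(row) for row in rows)


surface = ["Francis", "Bond", "was", "here"]
pos = ["n", "n", "v", "n"]
ne = ["name-B", "name-M", "O", "O"]
print(to_factored(surface, pos, ne))
# Francis|n|name-B Bond|n|name-M was|v|O here|n|O
```

Checking the layer lengths up front catches tokenization mismatches between the POS tagger and the NE tagger before they reach training.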

Model 7: (best of 1-5 + NE variant) Do NE on both languages and filter out NEs that don't align in preprocessing, e.g. compare the results (maybe taking only the intersection). You can then get better results, as the cues must be different in the two languages.
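
One possible reading of "taking only the intersection" is to keep an NE tag only when its entity type was found on both sides of a sentence pair. The sketch below assumes that reading and the type-B / type-M / O tag shape from the Model 6 example; it compares types per sentence pair and does not use word alignments. The function names and the Japanese tag sequence are invented for illustration.

```python
def ne_type(tag):
    """Return the entity type of a tag such as 'name-B', or None for 'O'."""
    return None if tag == "O" else tag.rsplit("-", 1)[0]


def filter_unmatched(src_tags, tgt_tags):
    """Reset to 'O' every NE tag whose type does not occur on both sides."""
    shared = {ne_type(t) for t in src_tags} & {ne_type(t) for t in tgt_tags}
    shared.discard(None)  # 'O' is not an entity type

    def keep(tags):
        return [t if ne_type(t) in shared else "O" for t in tags]

    return keep(src_tags), keep(tgt_tags)


# Hypothetical tagger output: the English tagger found a place, the Japanese one did not.
en = ["name-B", "name-M", "O", "place-B"]
ja = ["name-B", "name-M", "O", "O", "O"]
print(filter_unmatched(en, ja))
# (['name-B', 'name-M', 'O', 'O'], ['name-B', 'name-M', 'O', 'O', 'O'])
```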

  • lc = lowercase
  • np = no punctuation
  • p = punctuation
  • tokc = Chasen Tokenized
  • tokj = Juman Tokenized
  • tokm = Mecab Tokenized
  • dicj = Root Form from Juman
  • dicm = Root Form from MORPH English Morphological Tagger
  • pose = POS from Adwait Ratnaparkhi's Maximum Entropy Tagger
  • posc = POS Chasen
  • posj = POS Juman
  • posm = POS MeCab
  • en = English
  • ja = Japanese

Sample File Names: sentences.lc.p.pose.en sentences.p.tokm.posm.ja
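
The dot-separated codes in these file names can be expanded mechanically. A small sketch, assuming the first component is a free-form stem and every later component is one of the codes defined on this page; the `describe` helper and the wording of the labels are illustrative only.

```python
# Codes as defined on this page; the label wording is paraphrased.
LEGEND = {
    "lc": "lowercase",
    "np": "no punctuation",
    "p": "punctuation",
    "tokc": "Chasen tokenized",
    "tokj": "Juman tokenized",
    "tokm": "MeCab tokenized",
    "dicj": "root form from Juman",
    "dicm": "root form from MORPH",
    "pose": "POS from the English maximum entropy tagger",
    "posc": "POS from Chasen",
    "posj": "POS from Juman",
    "posm": "POS from MeCab",
    "en": "English",
    "ja": "Japanese",
}


def describe(filename):
    """Expand each code after the stem; raises KeyError on an unknown code."""
    _stem, *codes = filename.split(".")
    return [LEGEND[code] for code in codes]


print(describe("sentences.p.tokm.posm.ja"))
# ['punctuation', 'MeCab tokenized', 'POS from MeCab', 'Japanese']
```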

Other Ideas

Data Sources
