
MtJaenSmt

EricNichols edited this page Jun 3, 2008 · 63 revisions

Japanese English Statistical Machine Translation

Disclaimer: This page is for notes and discussion of work in progress on SMT between Japanese and English. It is unlikely to be understandable or useful to anyone outside the project.

Results (no MERT):

| Model | Test 1 BLEU | Test 2 BLEU | Test 3 BLEU | Average BLEU | Time taken | Comments |
|---|---|---|---|---|---|---|
| 1 (Mecab; No Punctuation) JE | 19.76 | 19.36 | 20.44 | 19.85 | | JST data |
| 2 (Mecab; Punctuation) JE | 21.11 | 21.39 | 21.84 | 21.45 | | |
| 3 (Mecab Tokenization & Chasen POS) JE | 19.14 | 19.56 | 20.14 | 19.58 | | |
| 4 (Juman Tok & No Punctuation) JE | 18.98 | 17.55 | 17.66 | 18.71 | | |
| 5 (Model 4 but with Lemmas too) JE | 20.72 | 21.44 | 21.72 | 21.29 | | |
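
A minimal sketch of how an Average BLEU figure can be recomputed from the per-test scores, assuming the average is the plain arithmetic mean of the three test sets rounded to two decimals. The function name `average_bleu` is hypothetical and not part of any project script.

```python
def average_bleu(scores):
    """Return the arithmetic mean of per-test BLEU scores, rounded to 2 decimals.

    Assumption: the Average BLEU column is an unweighted mean of the test sets.
    """
    if not scores:
        raise ValueError("need at least one BLEU score")
    return round(sum(scores) / len(scores), 2)


# Model 1 (Mecab; No Punctuation) JE, scores for Test 1 to Test 3
model_1_average = average_bleu([19.76, 19.36, 20.44])
```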

Next:

Model Test 1 BLEU Test 2 BLEU Test 3 BLEU Average BLEU Time taken Comments
1 EJ JST data
2 EJ
3 (Mecab) EJ 24.67
4 (Juman) EJ
3 (reversed) EJ

Eric's systems:

| Model | Factors | Corpus | Pair | MERT | BLEU | Comments |
|---|---|---|---|---|---|---|
| punctuation; lowercase | none | IWSLT06 | JE | yes | -- | tokenization: Mecab; Moses baseline script |
| punctuation; lowercase | none | Tanaka | JE | yes | 14.39 | tokenization: Mecab; Moses baseline script |
| punctuation; lowercase | none | Tanaka | EJ | yes | 26.87 | tokenization: Moses baseline script; Mecab |
| punctuation; lowercase | surface->surface+pos | Tanaka | JE | no | 11.39 | EN factors: tree tagger |
| punctuation; lowercase | surface->surface+pos | Tanaka | JE | yes | 19.06 | EN factors: tree tagger |

Models Under Construction

Model 1: (Lowercase English, no punctuation; Mecab-tokenized Japanese, POS from MeCab (or is it really Chasen?), no punctuation) Files: sentence.lc.np.pose.en sentence.np.tokm.posm.ja

Model 2: (Lowercase English, punctuation; Mecab-tokenized Japanese, POS from MeCab (or is it really Chasen?), punctuation) Files: sentence.lc.p.pose.en sentence.p.tokm.posm.ja

Model 3: (Lowercase English, no punctuation; Mecab-tokenized Japanese, POS from Chasen, no punctuation) Files: sentence.lc.np.pose.en sentence.np.tokc.posm.ja

Model 4: (Lowercase English, no punctuation; Juman-tokenized Japanese, POS from Juman, no punctuation) Files: sentence.lc.np.pose.en sentence.np.tokj.posm.ja

Model 5: (Lowercase English, no punctuation; lemmatized in both languages, no punctuation) Files: sentence.lc.np.dicm.pose.en sentence.np.tokj.dicj.posm.ja

Model 6: (best of 1-5 + NE) Run named entity (NE) recognition on both languages and add the NE tag as a factor, e.g. Francis|n|name-B Bond|n|name-M was|v|O here|n|O (or here|n|place-B, depending on your tagger)

  • Loosely inspired by the ATR work "Introducing Translation Dictionary Into Phrase-based SMT".

  • Which NE tag set? Try Sekine's extended NE hierarchy (http://nlp.cs.nyu.edu/ene).
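
The word|POS|NE notation in the Model 6 example can be produced from parallel tag lists. The following is a minimal sketch, assuming the tokenizer, POS tagger, and NE tagger each return one tag per token; `to_factored` is a hypothetical helper, not a Moses or project function.

```python
def to_factored(words, pos_tags, ne_tags, sep="|"):
    """Join parallel token-level annotations into one factored line.

    Each output token has the form word|pos|ne, and tokens are separated
    by single spaces.
    """
    if not (len(words) == len(pos_tags) == len(ne_tags)):
        raise ValueError("all annotation layers must have one entry per token")
    tokens = []
    for factors in zip(words, pos_tags, ne_tags):
        # A separator or space inside a factor would corrupt the format.
        if any(sep in f or " " in f for f in factors):
            raise ValueError(f"factor contains separator or space: {factors!r}")
        tokens.append(sep.join(factors))
    return " ".join(tokens)


line = to_factored(
    ["Francis", "Bond", "was", "here"],
    ["n", "n", "v", "n"],
    ["name-B", "name-M", "O", "O"],
)
print(line)  # Francis|n|name-B Bond|n|name-M was|v|O here|n|O
```

Rejecting factors that contain the separator is a design choice for this sketch; escaping them would be an alternative.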

Model 7: (best of 1-5 + NE variant) Run NE recognition on both languages and, in preprocessing, filter out NEs that do not align. For example, compare the NE results for the two languages (maybe keeping only the intersection); this should give better results, as the cues must be different in the two languages.
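
One very rough way to sketch the Model 7 filter: keep an NE tag only if the same NE type was found on both sides of the sentence pair, and reset everything else to O. This assumes tags of the form type-B / type-M with O for non-entities, as in the Model 6 example, and it approximates "align" by type intersection only, without any word alignment. The function names are hypothetical.

```python
def ne_types(tags):
    """Return the set of NE types in a tag sequence, e.g. {'name', 'place'}."""
    return {tag.rsplit("-", 1)[0] for tag in tags if tag != "O"}


def filter_unmatched(src_tags, tgt_tags):
    """Reset NE tags whose type does not occur on both sides to 'O'."""
    shared = ne_types(src_tags) & ne_types(tgt_tags)

    def keep(tag):
        if tag == "O" or tag.rsplit("-", 1)[0] in shared:
            return tag
        return "O"

    return [keep(t) for t in src_tags], [keep(t) for t in tgt_tags]


# English side found a name and a place; Japanese side found only a name.
en_tags = ["name-B", "name-M", "O", "place-B"]
ja_tags = ["name-B", "O", "O"]
en_filtered, ja_filtered = filter_unmatched(en_tags, ja_tags)
```

Here the unmatched place tag on the English side is dropped, while the name tags on both sides are kept.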

Key to file name tags:

  • lc = lowercase
  • np = no punctuation
  • p = punctuation
  • tokc = Chasen Tokenized
  • tokj = Juman Tokenized
  • tokm = Mecab Tokenized
  • dicj = Root Form from Juman
  • dicm = Root Form from MORPH English Morphological Tagger
  • pose = POS Adawati Maximum Entropy Tagger
  • posc = POS Chasen
  • posj = POS Juman
  • posm = POS MeCab
  • en = english
  • ja = japanese

Sample File Names: sentences.lc.p.pose.en sentences.p.tokm.posm.ja
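
The naming scheme can be decoded mechanically. This is a minimal sketch, assuming every dot-separated part after the base name is one of the tags in the key above; `describe_file` and the `TAGS` table are hypothetical helpers written for illustration.

```python
# Tag meanings copied from the key on this page.
TAGS = {
    "lc": "lowercase",
    "np": "no punctuation",
    "p": "punctuation",
    "tokc": "Chasen Tokenized",
    "tokj": "Juman Tokenized",
    "tokm": "Mecab Tokenized",
    "dicj": "Root Form from Juman",
    "dicm": "Root Form from MORPH English Morphological Tagger",
    "pose": "POS Adawati Maximum Entropy Tagger",
    "posc": "POS Chasen",
    "posj": "POS Juman",
    "posm": "POS MeCab",
    "en": "english",
    "ja": "japanese",
}


def describe_file(filename):
    """Expand the dot-separated tags of a corpus file name into descriptions."""
    _base, *tags = filename.split(".")
    unknown = [t for t in tags if t not in TAGS]
    if unknown:
        raise ValueError(f"unknown tag(s) in {filename!r}: {unknown}")
    return [TAGS[t] for t in tags]


for name in ("sentences.lc.p.pose.en", "sentences.p.tokm.posm.ja"):
    print(name, "->", "; ".join(describe_file(name)))
```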

Other Ideas

Data Sources
