
en_zh_transformer

[Figure: translator structure diagram]

File Introduction

data.py: preprocesses datasets, including loading, embedding, padding, and batching (see the sketch after this list)

model.py: implementation of the Transformer

train.py: several training methods

loadData.ipynb: uses torchtext to load datasets from disk, numericalize words, generate the vocabulary, and batch them
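
The exact preprocessing lives in data.py and loadData.ipynb; as a rough illustration, the padding/batching step described above can be sketched as follows (function and constant names here are hypothetical, not taken from the repo):

```python
# Minimal padding/batching sketch (hypothetical names; see data.py
# and loadData.ipynb for the actual implementation).
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # assumed padding index

def collate_batch(examples):
    """Pad a list of (src, tgt) token-index sequences to uniform length."""
    srcs = [torch.tensor(src, dtype=torch.long) for src, _ in examples]
    tgts = [torch.tensor(tgt, dtype=torch.long) for _, tgt in examples]
    src_batch = pad_sequence(srcs, batch_first=True, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgts, batch_first=True, padding_value=PAD_IDX)
    return src_batch, tgt_batch
```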

Training Examples

characterCopy.py: a simple copying experiment to sanity-check the model (a data-generation sketch follows below)

IWSLTGeEnTranslation.py: IWSLT German-English (De-En) translation experiment
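
The copy task typically generates random token sequences whose target equals the source, so a model that learns anything useful should reach near-zero loss quickly. A minimal sketch (names are illustrative, not taken from characterCopy.py):

```python
# Copy-task data: target is an exact copy of the source sequence.
import torch

def copy_task_batch(batch_size=30, seq_len=10, vocab_size=11, bos_idx=1):
    data = torch.randint(2, vocab_size, (batch_size, seq_len))
    data[:, 0] = bos_idx          # begin-of-sequence token
    return data, data.clone()     # source and target are identical

src, tgt = copy_task_batch()
print(src.shape, torch.equal(src, tgt))  # torch.Size([30, 10]) True
```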

Pretrained Chinese Word Embedding

Several pretrained Chinese word embeddings are available, but the word segmentation method can strongly affect both embedding quality and downstream-task performance.
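
To make this concrete, the segmenter determines which tokens an embedding table can even look up (jieba is used here purely as an example segmenter; it is not part of this repo, whose corpus was tokenized with the Stanford Word Segmenter):

```python
# Different segmentation modes produce different token inventories,
# so the same sentence maps to different embedding lookups.
import jieba

sentence = "南京市长江大桥"  # "Nanjing Yangtze River Bridge"
print(jieba.lcut(sentence))                # e.g. ['南京市', '长江大桥']
print(jieba.lcut(sentence, cut_all=True))  # all possible word candidates
```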

| Name | Format | Algorithm | Dimension |
| --- | --- | --- | --- |
| Tencent AI Lab Embedding Corpus for Chinese Words and Phrases | text (.txt) | DSG (directional skip-gram) | 200 |
| fastText | text (.txt) & binary (.bin) | CBOW (n=5, window=5, negative=10) | 300 |
| Wikipedia2Vec | text (.txt) & binary (.bin) | skip-gram, word-based (window=5, iteration=10, negative=15) | 100 & 300 |
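
All three corpora ship a text (.txt) format; assuming the standard word2vec text layout, they can be loaded with gensim (an assumption for illustration; any word2vec-format reader works, and the path is a placeholder):

```python
# Load a text-format embedding such as the Tencent or fastText
# vectors above. The file path is a placeholder.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "path/to/embedding.txt", binary=False)
print(vectors["中国"].shape)  # e.g. (200,) for the Tencent vectors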

fastText uses the Stanford Word Segmenter for Chinese, the same toolkit I used to tokenize the InfoQ corpus. fastText also provides an easy-to-use tool (both skip-gram and CBOW are available) to generate your own embeddings.
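
A hedged sketch of training your own embedding with the official fastText Python bindings, mirroring the CBOW hyperparameters in the table above (the corpus path and output name are placeholders):

```python
# Train a CBOW embedding on a pre-segmented corpus
# (one segmented sentence per line).
import fasttext

model = fasttext.train_unsupervised(
    "path/to/segmented_corpus.txt",
    model="cbow", dim=300, ws=5, neg=10)
model.save_model("my_embedding.bin")
```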

Web Development

Machine Translation Web Interface for OpenNMT

First start the server:

```
cd website
/bin/bash start_server.sh path/to/OpenNMT-tf
```

Then start the website.
