Pre-trained word vectors of 30+ languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has received far more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors without effort. I think it's time to turn our eyes to a multilingual version of this.

Nearing the end of the work, I found out that a similar project named polyglot already exists. How embarrassing! I strongly encourage you to check out that great project. Nevertheless, I decided to release this project anyway. You will find that my work has its own flavor, after all.

Requirements

nltk >= 1.11.1
regex >= 2016.6.24
lxml >= 3.3.3
numpy >= 1.11.2
konlpy >= 0.4.4 (Only for Korean)
mecab (Only for Japanese)
pythai >= 0.1.3 (Only for Thai)
pyvi >= 0.0.7.2 (Only for Vietnamese)
jieba >= 0.38 (Only for Chinese)
gensim >= 0.13.1

Background / References

Check this to learn what word embedding is.
Check this to quickly get a picture of Word2vec.
Watch this to really understand what's happening under the hood of Word2vec.
Go get various English word vectors here if needed.
Check out this more ambitious project here.

Workflow

STEP 1. Download the Wikipedia database backup dump for the language you want.
STEP 2. Extract running text from the downloaded file to build a corpus.
STEP 3. Preprocess the corpus.
STEP 4. Run Word2Vec.
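
As a rough illustration of STEP 2 and STEP 3, the sketch below strips a few common wiki markup patterns from raw article text, tokenizes it, and works out which frequency cutoff would keep roughly a target vocabulary size. It is not the code in build_corpus.py or make_wordvectors.py; the function names (clean_text, tokenize, min_count_for_vocab) and the regular expressions are assumptions made for this example, and the whitespace-style tokenizer would need to be swapped for a proper segmenter (such as the language-specific packages listed under Requirements) for languages like Chinese, Japanese, Korean, Thai, or Vietnamese.

```python
import re
from collections import Counter


def clean_text(text):
    """Strip a few common wiki markup patterns and lowercase the text.

    Hypothetical helper for illustration; real dumps contain far more
    markup (nested templates, tables, HTML) than is handled here.
    """
    # Drop footnotes such as <ref>...</ref>.
    text = re.sub(r"<ref[^>]*>.*?</ref>", " ", text, flags=re.DOTALL)
    # Drop simple, non-nested templates such as {{citation needed}}.
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)
    # Replace [[target|label]] with label and [[target]] with target.
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs ('' and ''').
    text = re.sub(r"'{2,}", "", text)
    return text.lower()


def tokenize(text):
    """Split text into runs of word characters (Unicode-aware in Python 3)."""
    return re.findall(r"\w+", text)


def min_count_for_vocab(sentences, vocab_size):
    """Return the frequency of the vocab_size-th most common token.

    Passing this value as a minimum-count threshold to a trainer keeps
    about vocab_size tokens (more if several tokens tie at the cutoff).
    """
    counts = Counter(tok for sent in sentences for tok in sent)
    most_common = counts.most_common(vocab_size)
    return most_common[-1][1] if most_common else 1


if __name__ == "__main__":
    raw = ("'''Seoul''' is the [[capital city|capital]] of "
           "[[South Korea]].{{citation needed}}<ref>x</ref>")
    print(tokenize(clean_text(raw)))
```

The resulting token lists are the kind of input a Word2Vec trainer such as gensim expects in STEP 4, with the computed cutoff serving as its minimum word count.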

Pre-trained models

Click the name of a language to download its pre-trained word vectors. Each zip file contains two files: a .bin file (the word2vec model) and a .txt file (the word vectors). Contributions are welcome.

Language	ISO 639-1	Vector Size	Corpus Size	Vocabulary Size	Training Algorithm
Bengali	bn	300	147M	10059	negative sampling
Catalan	ca	300	967M	50013	negative sampling
Chinese	zh	300	1G	50101	negative sampling
Danish	da	300	295M	30134	negative sampling
Dutch	nl	300	1G	50160	negative sampling
Esperanto	eo	300	1G	50597	negative sampling
Finnish	fi	300	467M	30029	negative sampling
French	fr	300	1G	50130	negative sampling
German	de	300	1G	50006	negative sampling
Hindi	hi	300	323M	30393	negative sampling
Hungarian	hu	300	692M	40122	negative sampling
Indonesian	id	300	402M	30048	negative sampling
Italian	it	300	1G	50031	negative sampling
Japanese	ja	300	1G	50108	negative sampling
Javanese	jv	100	31M	10019	negative sampling
Korean	ko	200	339M	30185	negative sampling
Malay	ms	100	173M	10010	negative sampling
Norwegian	no	300	1G	50209	negative sampling
Norwegian Nynorsk	nn	100	114M	10036	negative sampling
Polish	pl	300	1G	50035	negative sampling
Portuguese	pt	300	1G	50246	negative sampling
Russian	ru	300	1G	50102	negative sampling
Spanish	es	300	1G	50003	negative sampling
Swahili	sw	100	24M	10222	negative sampling
Swedish	sv	300	1G	50052	negative sampling
Tagalog	tl	100	38M	10068	negative sampling
Thai	th	300	696M	30225	negative sampling
Turkish	tr	200	370M	30036	negative sampling
Vietnamese	vi	100	74M	10087	negative sampling
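
Once a zip file is unpacked, the .txt file can in principle be read without any third-party library. The sketch below assumes each line holds a token followed by whitespace-separated floats, optionally preceded by a header line with the vocabulary size and vector size. That layout is an assumption, not something stated above, so inspect the downloaded file first and adjust the parsing if it differs. The names parse_vectors and cosine are made up for this example.

```python
import math


def parse_vectors(lines):
    """Parse lines of the assumed form 'token v1 v2 ... vn' into a dict.

    An optional first line of two integers (vocabulary size and vector
    size) is skipped. This layout is an assumption about the .txt file.
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split()
        # Skip a "count dim" header if one is present.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        # Skip blank or malformed lines.
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Toy data standing in for the contents of a downloaded .txt file.
    sample = ["2 3", "king 1.0 0.0 0.0", "queen 1.0 1.0 0.0"]
    vecs = parse_vectors(sample)
    print(cosine(vecs["king"], vecs["queen"]))
```

For a real file, pass an open file handle instead of the toy list, for example parse_vectors(open(path, encoding="utf-8")), where path is wherever you extracted the .txt file.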
