Released by menshikh-iv on 16 Mar, 12:50 (1 commit to master since this release)
Pre-trained FastText word vectors: 1 million vectors trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset (16B tokens).
| Feature | Description |
|---|---|
| File size | 959 MB |
| Number of vectors | 999,999 |
| Dimension | 300 |
| License | [CC BY-SA 3.0](https://creativecommons.org/licenses/by-sa/3.0/) |
Read more:
- https://fasttext.cc/docs/en/english-vectors.html
- Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin: "Advances in Pre-Training Distributed Word Representations"
- Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov: "Bag of Tricks for Efficient Text Classification"
Example

```python
import gensim.downloader as api

# Downloads the vectors (~1 GB) on first use, then loads them into memory
model = api.load("fasttext-wiki-news-subwords-300")
model.most_similar(positive=["russia", "river"])
```

Output:

```
[(u'russias', 0.6939424276351929),
 (u'danube', 0.6881916522979736),
 (u'river.', 0.6683923006057739),
 (u'crimea', 0.6638611555099487),
 (u'rhine', 0.6632323861122131),
 (u'rivermouth', 0.6602864265441895),
 (u'wester', 0.6586191058158875),
 (u'finland', 0.6585439443588257),
 (u'volga', 0.6576792001724243),
 (u'ukraine', 0.6569074392318726)]
```