Add section headings for parameters #1348

Merged: 2 commits, May 22, 2017
14 changes: 8 additions & 6 deletions docs/notebooks/word2vec.ipynb
@@ -22,7 +22,7 @@
"metadata": {},
"source": [
"## Preparing the Input\n",
"Starting from the beginning, gensim’s `word2vec` expects a sequence of sentences as its input. Each sentence a list of words (utf8 strings):"
"Starting from the beginning, gensim’s `word2vec` expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):"
]
},
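As an illustrative sketch of this input format (using the gensim API contemporary with this PR; the toy corpus is my own, not from the diff):

```python
from gensim.models import Word2Vec

# Each sentence is a list of utf8 word tokens.
sentences = [
    ["first", "sentence"],
    ["second", "sentence"],
]

# min_count=1 keeps every word; the toy corpus is far too small for the default of 5.
model = Word2Vec(sentences, min_count=1)
```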
{
@@ -276,7 +276,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## More data would be nice\n",
"### More data would be nice\n",
"For the following examples, we'll use the [Lee Corpus](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_data/lee_background.cor) (which you already have if you've installed gensim):"
]
},
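For corpora too large to hold in memory, a sketch of streaming input using gensim's `LineSentence` helper (the file path here is illustrative, not taken from the diff):

```python
import gensim

# LineSentence streams one whitespace-tokenized sentence per line from disk,
# so the full corpus never has to fit in RAM.
sentences = gensim.models.word2vec.LineSentence('lee_background.cor')
model = gensim.models.Word2Vec(sentences)
```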
@@ -324,8 +324,8 @@
"source": [
"## Training\n",
"`Word2Vec` accepts several parameters that affect both training speed and quality.\n",
"\n",
"One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to make any meaningful training on those words, so it’s best to ignore them:"
"\n### min_count\n",
"`min_count` is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to make any meaningful training on those words, so it’s best to ignore them:"
]
},
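A minimal sketch of the `min_count` parameter introduced under this new heading (the value is illustrative):

```python
# Ignore all words seen fewer than 10 times; the gensim default is 5.
model = gensim.models.Word2Vec(sentences, min_count=10)
```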
{
@@ -365,6 +365,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### size\n",
"`size` is the number of dimensions (N) of the N-dimensional space that gensim Word2Vec maps the words onto.\n",
"\n",
"Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds."
@@ -407,7 +408,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The last of the major parameters (full list [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is for training parallelization, to speed up training:"
"### workers\n",
"`workers`, the last of the major parameters (full list [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is for training parallelization, to speed up training:"
]
},
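A sketch of the `workers` parameter (the thread count is illustrative):

```python
# Train with 4 worker threads; this only speeds things up when gensim's
# optimized (Cython) training routines are available.
model = gensim.models.Word2Vec(sentences, workers=4)
```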
{
@@ -471,7 +473,7 @@
"## Evaluating\n",
"`Word2Vec` training is an unsupervised task, there’s no good way to objectively evaluate the result. Evaluation depends on your end application.\n",
"\n",
"Google have released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the 'datasets' folder.\n",
"Google has released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task. It is provided in the 'datasets' folder.\n",
"\n",
"For example a syntactic analogy of comparative type is bad:worse;good:?. There are total of 9 types of syntactic comparisons in the dataset like plural nouns and nouns of opposite meaning.\n",
"\n",
Expand Down