diff --git a/docs/src/auto_examples/tutorials/run_doc2vec_lee.ipynb b/docs/src/auto_examples/tutorials/run_doc2vec_lee.ipynb index 5314bea335..a886f2f526 100644 --- a/docs/src/auto_examples/tutorials/run_doc2vec_lee.ipynb +++ b/docs/src/auto_examples/tutorials/run_doc2vec_lee.ipynb @@ -15,7 +15,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "\nDoc2Vec Model\n=============\n\nIntroduces Gensim's Doc2Vec model and demonstrates its use on the\n`Lee Corpus `__.\n\n\n" + "\n# Doc2Vec Model\n\nIntroduces Gensim's Doc2Vec model and demonstrates its use on the\n[Lee Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf)_.\n" ] }, { @@ -33,7 +33,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Doc2Vec is a `core_concepts_model` that represents each\n`core_concepts_document` as a `core_concepts_vector`. This\ntutorial introduces the model and demonstrates how to train and assess it.\n\nHere's a list of what we'll be doing:\n\n0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec\n1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)\n2. Train a Doc2Vec `core_concepts_model` model using the training corpus\n3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`\n4. Assess the model\n5. Test the model on the test corpus\n\nReview: Bag-of-words\n--------------------\n\n.. Note:: Feel free to skip these review sections if you're already familiar with the models.\n\nYou may be familiar with the `bag-of-words model\n`_ from the\n`core_concepts_vector` section.\nThis model transforms each document to a fixed-length vector of integers.\nFor example, given the sentences:\n\n- ``John likes to watch movies. Mary likes movies too.``\n- ``John also likes to watch football games. Mary hates football.``\n\nThe model outputs the vectors:\n\n- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``\n- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``\n\nEach vector has 10 elements, where each element counts the number of times a\nparticular word occurred in the document.\nThe order of elements is arbitrary.\nIn the example above, the order of the elements corresponds to the words:\n``[\"John\", \"likes\", \"to\", \"watch\", \"movies\", \"Mary\", \"too\", \"also\", \"football\", \"games\", \"hates\"]``.\n\nBag-of-words models are surprisingly effective, but have several weaknesses.\n\nFirst, they lose all information about word order: \"John likes Mary\" and\n\"Mary likes John\" correspond to identical vectors. There is a solution: bag\nof `n-grams `__\nmodels consider word phrases of length n to represent documents as\nfixed-length vectors to capture local word order but suffer from data\nsparsity and high dimensionality.\n\nSecond, the model does not attempt to learn the meaning of the underlying\nwords, and as a consequence, the distance between vectors doesn't always\nreflect the difference in meaning. The ``Word2Vec`` model addresses this\nsecond problem.\n\nReview: ``Word2Vec`` Model\n--------------------------\n\n``Word2Vec`` is a more recent model that embeds words in a lower-dimensional\nvector space using a shallow neural network. The result is a set of\nword-vectors where vectors close together in vector space have similar\nmeanings based on context, and word-vectors distant to each other have\ndiffering meanings. 
For example, ``strong`` and ``powerful`` would be close\ntogether and ``strong`` and ``Paris`` would be relatively far.\n\nGensim's :py:class:`~gensim.models.word2vec.Word2Vec` class implements this model.\n\nWith the ``Word2Vec`` model, we can calculate the vectors for each **word** in a document.\nBut what if we want to calculate a vector for the **entire document**\\ ?\nWe could average the vectors for each word in the document - while this is quick and crude, it can often be useful.\nHowever, there is a better way...\n\nIntroducing: Paragraph Vector\n-----------------------------\n\n.. Important:: In Gensim, we refer to the Paragraph Vector model as ``Doc2Vec``.\n\nLe and Mikolov in 2014 introduced the `Doc2Vec algorithm `__,\nwhich usually outperforms such simple-averaging of ``Word2Vec`` vectors.\n\nThe basic idea is: act as if a document has another floating word-like\nvector, which contributes to all training predictions, and is updated like\nother word-vectors, but we will call it a doc-vector. Gensim's\n:py:class:`~gensim.models.doc2vec.Doc2Vec` class implements this algorithm.\n\nThere are two implementations:\n\n1. Paragraph Vector - Distributed Memory (PV-DM)\n2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)\n\n.. Important::\n Don't let the implementation details below scare you.\n They're advanced material: if it's too much, then move on to the next section.\n\nPV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training\na neural network on the synthetic task of predicting a center word based an\naverage of both context word-vectors and the full document's doc-vector.\n\nPV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training\na neural network on the synthetic task of predicting a target word just from\nthe full document's doc-vector. (It is also common to combine this with\nskip-gram testing, using both the doc-vector and nearby word-vectors to\npredict a single target word, but only one at a time.)\n\nPrepare the Training and Test Data\n----------------------------------\n\nFor this tutorial, we'll be training our model using the `Lee Background\nCorpus\n`_\nincluded in gensim. This corpus contains 314 documents selected from the\nAustralian Broadcasting Corporation\u2019s news mail service, which provides text\ne-mails of headline stories and covers a number of broad topics.\n\nAnd we'll test our model by eye using the much shorter `Lee Corpus\n`_\nwhich contains 50 documents.\n\n\n" + "Doc2Vec is a `core_concepts_model` that represents each\n`core_concepts_document` as a `core_concepts_vector`. This\ntutorial introduces the model and demonstrates how to train and assess it.\n\nHere's a list of what we'll be doing:\n\n0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec\n1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)\n2. Train a Doc2Vec `core_concepts_model` model using the training corpus\n3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`\n4. Assess the model\n5. Test the model on the test corpus\n\n## Review: Bag-of-words\n\n.. Note:: Feel free to skip these review sections if you're already familiar with the models.\n\nYou may be familiar with the [bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model) from the\n`core_concepts_vector` section.\nThis model transforms each document to a fixed-length vector of integers.\nFor example, given the sentences:\n\n- ``John likes to watch movies. 
Mary likes movies too.``\n- ``John also likes to watch football games. Mary hates football.``\n\nThe model outputs the vectors:\n\n- ``[1, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0]``\n- ``[1, 1, 1, 1, 0, 1, 0, 1, 2, 1, 1]``\n\nEach vector has 10 elements, where each element counts the number of times a\nparticular word occurred in the document.\nThe order of elements is arbitrary.\nIn the example above, the order of the elements corresponds to the words:\n``[\"John\", \"likes\", \"to\", \"watch\", \"movies\", \"Mary\", \"too\", \"also\", \"football\", \"games\", \"hates\"]``.\n\nBag-of-words models are surprisingly effective, but have several weaknesses.\n\nFirst, they lose all information about word order: \"John likes Mary\" and\n\"Mary likes John\" correspond to identical vectors. There is a solution: bag\nof [n-grams](https://en.wikipedia.org/wiki/N-gram)_\nmodels consider word phrases of length n to represent documents as\nfixed-length vectors to capture local word order but suffer from data\nsparsity and high dimensionality.\n\nSecond, the model does not attempt to learn the meaning of the underlying\nwords, and as a consequence, the distance between vectors doesn't always\nreflect the difference in meaning. The ``Word2Vec`` model addresses this\nsecond problem.\n\n## Review: ``Word2Vec`` Model\n\n``Word2Vec`` is a more recent model that embeds words in a lower-dimensional\nvector space using a shallow neural network. The result is a set of\nword-vectors where vectors close together in vector space have similar\nmeanings based on context, and word-vectors distant to each other have\ndiffering meanings. For example, ``strong`` and ``powerful`` would be close\ntogether and ``strong`` and ``Paris`` would be relatively far.\n\nGensim's :py:class:`~gensim.models.word2vec.Word2Vec` class implements this model.\n\nWith the ``Word2Vec`` model, we can calculate the vectors for each **word** in a document.\nBut what if we want to calculate a vector for the **entire document**?\nWe could average the vectors for each word in the document - while this is quick and crude, it can often be useful.\nHowever, there is a better way...\n\n## Introducing: Paragraph Vector\n\n.. Important:: In Gensim, we refer to the Paragraph Vector model as ``Doc2Vec``.\n\nLe and Mikolov in 2014 introduced the [Doc2Vec algorithm](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)_,\nwhich usually outperforms such simple-averaging of ``Word2Vec`` vectors.\n\nThe basic idea is: act as if a document has another floating word-like\nvector, which contributes to all training predictions, and is updated like\nother word-vectors, but we will call it a doc-vector. Gensim's\n:py:class:`~gensim.models.doc2vec.Doc2Vec` class implements this algorithm.\n\nThere are two implementations:\n\n1. Paragraph Vector - Distributed Memory (PV-DM)\n2. Paragraph Vector - Distributed Bag of Words (PV-DBOW)\n\n.. Important::\n Don't let the implementation details below scare you.\n They're advanced material: if it's too much, then move on to the next section.\n\nPV-DM is analogous to Word2Vec CBOW. The doc-vectors are obtained by training\na neural network on the synthetic task of predicting a center word based on an\naverage of both context word-vectors and the full document's doc-vector.\n\nPV-DBOW is analogous to Word2Vec SG. The doc-vectors are obtained by training\na neural network on the synthetic task of predicting a target word just from\nthe full document's doc-vector. 
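To make the distinction concrete, here is a minimal sketch of how each variant is selected when constructing Gensim's :py:class:`~gensim.models.doc2vec.Doc2Vec`; the ``dm`` and ``dbow_words`` parameters choose the algorithm, while the other settings shown are merely illustrative:

```python
from gensim.models.doc2vec import Doc2Vec

# PV-DM is the default (dm=1): a center word is predicted from an average
# of the context word-vectors and the document's doc-vector.
pv_dm_model = Doc2Vec(dm=1, vector_size=50, min_count=2, epochs=40)

# PV-DBOW (dm=0): target words are predicted from the doc-vector alone;
# dbow_words=1 optionally adds interleaved skip-gram word-vector training.
pv_dbow_model = Doc2Vec(dm=0, dbow_words=1, vector_size=50, min_count=2, epochs=40)
```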
(It is also common to combine this with\nskip-gram testing, using both the doc-vector and nearby word-vectors to\npredict a single target word, but only one at a time.)\n\n## Prepare the Training and Test Data\n\nFor this tutorial, we'll be training our model using the [Lee Background\nCorpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf)\nincluded in gensim. This corpus contains 314 documents selected from the\nAustralian Broadcasting Corporation\u2019s news mail service, which provides text\ne-mails of headline stories and covers a number of broad topics.\n\nAnd we'll test our model by eye using the much shorter [Lee Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf)\nwhich contains 50 documents.\n\n\n" ] }, { @@ -51,7 +51,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Define a Function to Read and Preprocess Text\n---------------------------------------------\n\nBelow, we define a function to:\n\n- open the train/test file (with latin encoding)\n- read the file line-by-line\n- pre-process each line (tokenize text into individual words, remove punctuation, set to lowercase, etc)\n\nThe file we're reading is a **corpus**.\nEach line of the file is a **document**.\n\n.. Important::\n To train the model, we'll need to associate a tag/number with each document\n of the training corpus. In our case, the tag is simply the zero-based line\n number.\n\n\n" + "## Define a Function to Read and Preprocess Text\n\nBelow, we define a function to:\n\n- open the train/test file (with latin encoding)\n- read the file line-by-line\n- pre-process each line (tokenize text into individual words, remove punctuation, set to lowercase, etc.)\n\nThe file we're reading is a **corpus**.\nEach line of the file is a **document**.\n\n.. Important::\n To train the model, we'll need to associate a tag/number with each document\n of the training corpus. In our case, the tag is simply the zero-based line\n number.\n\n\n" ] }, { @@ -112,7 +112,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Training the Model\n------------------\n\nNow, we'll instantiate a Doc2Vec model with a vector size with 50 dimensions and\niterating over the training corpus 40 times. We set the minimum word count to\n2 in order to discard words with very few occurrences. (Without a variety of\nrepresentative examples, retaining such infrequent words can often make a\nmodel worse!) Typical iteration counts in the published `Paragraph Vector paper `__\nresults, using 10s-of-thousands to millions of docs, are 10-20. More\niterations take more time and eventually reach a point of diminishing\nreturns.\n\nHowever, this is a very very small dataset (300 documents) with shortish\ndocuments (a few hundred words). Adding training passes can sometimes help\nwith such small datasets.\n\n\n" + "## Training the Model\n\nNow, we'll instantiate a Doc2Vec model with a vector size of 50 dimensions,\niterating over the training corpus 40 times. We set the minimum word count to\n2 in order to discard words with very few occurrences. (Without a variety of\nrepresentative examples, retaining such infrequent words can often make a\nmodel worse!) Typical iteration counts in the published [Paragraph Vector paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)_\nresults, using 10s-of-thousands to millions of docs, are 10-20. 
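Concretely, these settings flow into the model as in the following minimal sketch, which mirrors the notebook's own code cells (``train_corpus`` is the tagged training corpus prepared earlier):

```python
import gensim

model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
model.build_vocab(train_corpus)  # scan the corpus once to build the vocabulary
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
```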
More\niterations take more time and eventually reach a point of diminishing\nreturns.\n\nHowever, this is a very very small dataset (300 documents) with shortish\ndocuments (a few hundred words). Adding training passes can sometimes help\nwith such small datasets.\n\n\n" ] }, { @@ -166,7 +166,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Next, train the model on the corpus.\nIf optimized Gensim (with BLAS library) is being used, this should take no more than 3 seconds.\nIf the BLAS library is not being used, this should take no more than 2\nminutes, so use optimized Gensim with BLAS if you value your time.\n\n\n" + "Next, train the model on the corpus.\nIn the usual case, where your Gensim installation has found a BLAS library for\noptimized bulk vector operations, training on this tiny corpus of 300 documents\n(~60k words) should take just a few seconds. (More realistic datasets of tens of\nmillions of words or more take proportionately longer.) If for some reason a BLAS\nlibrary isn't available, training uses a fallback approach that takes 60x-120x\nlonger, so even this tiny training will take minutes rather than seconds. (And, in\nthat case, you should also notice a warning in the logging letting you know there's\nsomething worth fixing.) So, be sure your installation uses the BLAS-optimized\nGensim if you value your time.\n\n\n" ] }, { @@ -209,7 +209,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Assessing the Model\n-------------------\n\nTo assess our new model, we'll first infer new vectors for each document of\nthe training corpus, compare the inferred vectors with the training corpus,\nand then returning the rank of the document based on self-similarity.\nBasically, we're pretending as if the training corpus is some new unseen data\nand then seeing how they compare with the trained model. The expectation is\nthat we've likely overfit our model (i.e., all of the ranks will be less than\n2) and so we should be able to find similar documents very easily.\nAdditionally, we'll keep track of the second ranks for a comparison of less\nsimilar documents.\n\n\n" + "## Assessing the Model\n\nTo assess our new model, we'll first infer new vectors for each document of\nthe training corpus, compare the inferred vectors with the training corpus,\nand then return the rank of the document based on self-similarity.\nBasically, we're pretending as if the training corpus is some new unseen data\nand then seeing how it compares with the trained model. The expectation is\nthat we've likely overfit our model (i.e., all of the ranks will be less than\n2) and so we should be able to find similar documents very easily.\nAdditionally, we'll keep track of the second ranks for a comparison of less\nsimilar documents.\n\n\n" ] }, { @@ -281,7 +281,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Testing the Model\n-----------------\n\nUsing the same approach above, we'll infer the vector for a randomly chosen\ntest document, and compare the document to our model by eye.\n\n\n" + "## Testing the Model\n\nUsing the same approach above, we'll infer the vector for a randomly chosen\ntest document, and compare the document to our model by eye.\n\n\n" ] }, { @@ -299,7 +299,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Conclusion\n----------\n\nLet's review what we've seen in this tutorial:\n\n0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec\n1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)\n2. 
Train a Doc2Vec `core_concepts_model` model using the training corpus\n3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`\n4. Assess the model\n5. Test the model on the test corpus\n\nThat's it! Doc2Vec is a great way to explore relationships between documents.\n\nAdditional Resources\n--------------------\n\nIf you'd like to know more about the subject matter of this tutorial, check out the links below.\n\n* `Word2Vec Paper `_\n* `Doc2Vec Paper `_\n* `Dr. Michael D. Lee's Website `_\n* `Lee Corpus `__\n* `IMDB Doc2Vec Tutorial `_\n\n\n" + "## Conclusion\n\nLet's review what we've seen in this tutorial:\n\n0. Review the relevant models: bag-of-words, Word2Vec, Doc2Vec\n1. Load and preprocess the training and test corpora (see `core_concepts_corpus`)\n2. Train a Doc2Vec `core_concepts_model` model using the training corpus\n3. Demonstrate how the trained model can be used to infer a `core_concepts_vector`\n4. Assess the model\n5. Test the model on the test corpus\n\nThat's it! Doc2Vec is a great way to explore relationships between documents.\n\n## Additional Resources\n\nIf you'd like to know more about the subject matter of this tutorial, check out the links below.\n\n* [Word2Vec Paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)\n* [Doc2Vec Paper](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)\n* [Dr. Michael D. Lee's Website](http://faculty.sites.uci.edu/mdlee)\n* [Lee Corpus](http://faculty.sites.uci.edu/mdlee/similarity-data/)_\n* [IMDB Doc2Vec Tutorial](doc2vec-IMDB.ipynb)\n\n\n" ] } ], @@ -319,7 +319,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.5" + "version": "3.8.10" } }, "nbformat": 4, diff --git a/docs/src/auto_examples/tutorials/run_doc2vec_lee.py b/docs/src/auto_examples/tutorials/run_doc2vec_lee.py index 7012d38f66..18f4ee7b16 100644 --- a/docs/src/auto_examples/tutorials/run_doc2vec_lee.py +++ b/docs/src/auto_examples/tutorials/run_doc2vec_lee.py @@ -215,9 +215,15 @@ def read_corpus(fname, tokens_only=False): ############################################################################### # Next, train the model on the corpus. -# If optimized Gensim (with BLAS library) is being used, this should take no more than 3 seconds. -# If the BLAS library is not being used, this should take no more than 2 -# minutes, so use optimized Gensim with BLAS if you value your time. +# In the usual case, where your Gensim installation has found a BLAS library for +# optimized bulk vector operations, training on this tiny corpus of 300 documents +# (~60k words) should take just a few seconds. (More realistic datasets of tens of +# millions of words or more take proportionately longer.) If for some reason a BLAS +# library isn't available, training uses a fallback approach that takes 60x-120x +# longer, so even this tiny training will take minutes rather than seconds. (And, in +# that case, you should also notice a warning in the logging letting you know there's +# something worth fixing.) So, be sure your installation uses the BLAS-optimized +# Gensim if you value your time. 
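+#
+# One quick, hedged sanity check of which case applies: print the BLAS build
+# info (assuming NumPy's linked BLAS is representative of what Gensim's
+# compiled routines use) and wall-clock the training call, e.g.:
+#
+#     import time
+#     import numpy
+#     numpy.show_config()  # reports the BLAS/LAPACK libraries NumPy was built against
+#     start = time.time()
+#     model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
+#     print('training took %.1fs' % (time.time() - start))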
# model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs) diff --git a/docs/src/auto_examples/tutorials/run_doc2vec_lee.py.md5 b/docs/src/auto_examples/tutorials/run_doc2vec_lee.py.md5 index f1b58e756c..5c0d021557 100644 --- a/docs/src/auto_examples/tutorials/run_doc2vec_lee.py.md5 +++ b/docs/src/auto_examples/tutorials/run_doc2vec_lee.py.md5 @@ -1 +1 @@ -7d0ee86f6eb9d1e2f55b9f295eec3060 \ No newline at end of file +581caa67e8496a210a030c2886fb8bbc \ No newline at end of file diff --git a/docs/src/auto_examples/tutorials/run_doc2vec_lee.rst b/docs/src/auto_examples/tutorials/run_doc2vec_lee.rst index 6e99a47a13..68a6fc7d3f 100644 --- a/docs/src/auto_examples/tutorials/run_doc2vec_lee.rst +++ b/docs/src/auto_examples/tutorials/run_doc2vec_lee.rst @@ -1,12 +1,21 @@ + +.. DO NOT EDIT. +.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. +.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: +.. "auto_examples/tutorials/run_doc2vec_lee.py" +.. LINE NUMBERS ARE GIVEN BELOW. + .. only:: html .. note:: :class: sphx-glr-download-link-note - Click :ref:`here ` to download the full example code - .. rst-class:: sphx-glr-example-title + Click :ref:`here ` + to download the full example code + +.. rst-class:: sphx-glr-example-title - .. _sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py: +.. _sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py: Doc2Vec Model @@ -15,7 +24,7 @@ Doc2Vec Model Introduces Gensim's Doc2Vec model and demonstrates its use on the `Lee Corpus `__. - +.. GENERATED FROM PYTHON SOURCE LINES 9-13 .. code-block:: default @@ -30,6 +39,8 @@ Introduces Gensim's Doc2Vec model and demonstrates its use on the +.. GENERATED FROM PYTHON SOURCE LINES 14-129 + Doc2Vec is a :ref:`core_concepts_model` that represents each :ref:`core_concepts_document` as a :ref:`core_concepts_vector`. This tutorial introduces the model and demonstrates how to train and assess it. @@ -146,6 +157,7 @@ And we'll test our model by eye using the much shorter `Lee Corpus which contains 50 documents. +.. GENERATED FROM PYTHON SOURCE LINES 129-137 .. code-block:: default @@ -164,6 +176,8 @@ which contains 50 documents. +.. GENERATED FROM PYTHON SOURCE LINES 138-155 + Define a Function to Read and Preprocess Text --------------------------------------------- @@ -182,6 +196,7 @@ Each line of the file is a **document**. number. +.. GENERATED FROM PYTHON SOURCE LINES 155-170 .. code-block:: default @@ -207,9 +222,12 @@ Each line of the file is a **document**. +.. GENERATED FROM PYTHON SOURCE LINES 171-173 + Let's take a look at the training corpus +.. GENERATED FROM PYTHON SOURCE LINES 173-175 .. code-block:: default @@ -221,8 +239,6 @@ Let's take a look at the training corpus .. rst-class:: sphx-glr-script-out - Out: - .. 
code-block:: none [TaggedDocument(words=['hundreds', 'of', 'people', 'have', 'been', 'forced', 'to', 'vacate', 'their', 'homes', 'in', 'the', 'southern', 'highlands', 'of', 'new', 'south', 'wales', 'as', 'strong', 'winds', 'today', 'pushed', 'huge', 'bushfire', 'towards', 'the', 'town', 'of', 'hill', 'top', 'new', 'blaze', 'near', 'goulburn', 'south', 'west', 'of', 'sydney', 'has', 'forced', 'the', 'closure', 'of', 'the', 'hume', 'highway', 'at', 'about', 'pm', 'aedt', 'marked', 'deterioration', 'in', 'the', 'weather', 'as', 'storm', 'cell', 'moved', 'east', 'across', 'the', 'blue', 'mountains', 'forced', 'authorities', 'to', 'make', 'decision', 'to', 'evacuate', 'people', 'from', 'homes', 'in', 'outlying', 'streets', 'at', 'hill', 'top', 'in', 'the', 'new', 'south', 'wales', 'southern', 'highlands', 'an', 'estimated', 'residents', 'have', 'left', 'their', 'homes', 'for', 'nearby', 'mittagong', 'the', 'new', 'south', 'wales', 'rural', 'fire', 'service', 'says', 'the', 'weather', 'conditions', 'which', 'caused', 'the', 'fire', 'to', 'burn', 'in', 'finger', 'formation', 'have', 'now', 'eased', 'and', 'about', 'fire', 'units', 'in', 'and', 'around', 'hill', 'top', 'are', 'optimistic', 'of', 'defending', 'all', 'properties', 'as', 'more', 'than', 'blazes', 'burn', 'on', 'new', 'year', 'eve', 'in', 'new', 'south', 'wales', 'fire', 'crews', 'have', 'been', 'called', 'to', 'new', 'fire', 'at', 'gunning', 'south', 'of', 'goulburn', 'while', 'few', 'details', 'are', 'available', 'at', 'this', 'stage', 'fire', 'authorities', 'says', 'it', 'has', 'closed', 'the', 'hume', 'highway', 'in', 'both', 'directions', 'meanwhile', 'new', 'fire', 'in', 'sydney', 'west', 'is', 'no', 'longer', 'threatening', 'properties', 'in', 'the', 'cranebrook', 'area', 'rain', 'has', 'fallen', 'in', 'some', 'parts', 'of', 'the', 'illawarra', 'sydney', 'the', 'hunter', 'valley', 'and', 'the', 'north', 'coast', 'but', 'the', 'bureau', 'of', 'meteorology', 'claire', 'richards', 'says', 'the', 'rain', 'has', 'done', 'little', 'to', 'ease', 'any', 'of', 'the', 'hundred', 'fires', 'still', 'burning', 'across', 'the', 'state', 'the', 'falls', 'have', 'been', 'quite', 'isolated', 'in', 'those', 'areas', 'and', 'generally', 'the', 'falls', 'have', 'been', 'less', 'than', 'about', 'five', 'millimetres', 'she', 'said', 'in', 'some', 'places', 'really', 'not', 'significant', 'at', 'all', 'less', 'than', 'millimetre', 'so', 'there', 'hasn', 'been', 'much', 'relief', 'as', 'far', 'as', 'rain', 'is', 'concerned', 'in', 'fact', 'they', 've', 'probably', 'hampered', 'the', 'efforts', 'of', 'the', 'firefighters', 'more', 'because', 'of', 'the', 'wind', 'gusts', 'that', 'are', 'associated', 'with', 'those', 'thunderstorms'], tags=[0]), TaggedDocument(words=['indian', 'security', 'forces', 'have', 'shot', 'dead', 'eight', 'suspected', 'militants', 'in', 'night', 'long', 'encounter', 'in', 'southern', 'kashmir', 'the', 'shootout', 'took', 'place', 'at', 'dora', 'village', 'some', 'kilometers', 'south', 'of', 'the', 'kashmiri', 'summer', 'capital', 'srinagar', 'the', 'deaths', 'came', 'as', 'pakistani', 'police', 'arrested', 'more', 'than', 'two', 'dozen', 'militants', 'from', 'extremist', 'groups', 'accused', 'of', 'staging', 'an', 'attack', 'on', 'india', 'parliament', 'india', 'has', 'accused', 'pakistan', 'based', 'lashkar', 'taiba', 'and', 'jaish', 'mohammad', 'of', 'carrying', 'out', 'the', 'attack', 'on', 'december', 'at', 'the', 'behest', 'of', 'pakistani', 'military', 'intelligence', 'military', 'tensions', 'have', 'soared', 'since', 
'the', 'raid', 'with', 'both', 'sides', 'massing', 'troops', 'along', 'their', 'border', 'and', 'trading', 'tit', 'for', 'tat', 'diplomatic', 'sanctions', 'yesterday', 'pakistan', 'announced', 'it', 'had', 'arrested', 'lashkar', 'taiba', 'chief', 'hafiz', 'mohammed', 'saeed', 'police', 'in', 'karachi', 'say', 'it', 'is', 'likely', 'more', 'raids', 'will', 'be', 'launched', 'against', 'the', 'two', 'groups', 'as', 'well', 'as', 'other', 'militant', 'organisations', 'accused', 'of', 'targetting', 'india', 'military', 'tensions', 'between', 'india', 'and', 'pakistan', 'have', 'escalated', 'to', 'level', 'not', 'seen', 'since', 'their', 'war'], tags=[1])] @@ -230,9 +246,12 @@ Let's take a look at the training corpus +.. GENERATED FROM PYTHON SOURCE LINES 176-178 + And the testing corpus looks like this: +.. GENERATED FROM PYTHON SOURCE LINES 178-180 .. code-block:: default @@ -244,8 +263,6 @@ And the testing corpus looks like this: .. rst-class:: sphx-glr-script-out - Out: - .. code-block:: none [['the', 'national', 'executive', 'of', 'the', 'strife', 'torn', 'democrats', 'last', 'night', 'appointed', 'little', 'known', 'west', 'australian', 'senator', 'brian', 'greig', 'as', 'interim', 'leader', 'shock', 'move', 'likely', 'to', 'provoke', 'further', 'conflict', 'between', 'the', 'party', 'senators', 'and', 'its', 'organisation', 'in', 'move', 'to', 'reassert', 'control', 'over', 'the', 'party', 'seven', 'senators', 'the', 'national', 'executive', 'last', 'night', 'rejected', 'aden', 'ridgeway', 'bid', 'to', 'become', 'interim', 'leader', 'in', 'favour', 'of', 'senator', 'greig', 'supporter', 'of', 'deposed', 'leader', 'natasha', 'stott', 'despoja', 'and', 'an', 'outspoken', 'gay', 'rights', 'activist'], ['cash', 'strapped', 'financial', 'services', 'group', 'amp', 'has', 'shelved', 'million', 'plan', 'to', 'buy', 'shares', 'back', 'from', 'investors', 'and', 'will', 'raise', 'million', 'in', 'fresh', 'capital', 'after', 'profits', 'crashed', 'in', 'the', 'six', 'months', 'to', 'june', 'chief', 'executive', 'paul', 'batchelor', 'said', 'the', 'result', 'was', 'solid', 'in', 'what', 'he', 'described', 'as', 'the', 'worst', 'conditions', 'for', 'stock', 'markets', 'in', 'years', 'amp', 'half', 'year', 'profit', 'sank', 'per', 'cent', 'to', 'million', 'or', 'share', 'as', 'australia', 'largest', 'investor', 'and', 'fund', 'manager', 'failed', 'to', 'hit', 'projected', 'per', 'cent', 'earnings', 'growth', 'targets', 'and', 'was', 'battered', 'by', 'falling', 'returns', 'on', 'share', 'markets']] @@ -253,10 +270,14 @@ And the testing corpus looks like this: +.. GENERATED FROM PYTHON SOURCE LINES 181-184 + Notice that the testing corpus is just a list of lists and does not contain any tags. +.. GENERATED FROM PYTHON SOURCE LINES 186-202 + Training the Model ------------------ @@ -274,6 +295,7 @@ documents (a few hundred words). Adding training passes can sometimes help with such small datasets. +.. GENERATED FROM PYTHON SOURCE LINES 202-204 .. code-block:: default @@ -283,11 +305,20 @@ with such small datasets. +.. rst-class:: sphx-glr-script-out + + .. code-block:: none + + 2022-12-07 10:59:00,578 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec', 'datetime': '2022-12-07T10:59:00.540082', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'created'} + +.. GENERATED FROM PYTHON SOURCE LINES 205-206 + Build a vocabulary +.. GENERATED FROM PYTHON SOURCE LINES 206-208 .. 
code-block:: default @@ -299,24 +330,24 @@ Build a vocabulary .. rst-class:: sphx-glr-script-out - Out: - .. code-block:: none - 2020-09-30 21:08:55,026 : INFO : collecting all words and their counts - 2020-09-30 21:08:55,027 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags - 2020-09-30 21:08:55,043 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words - 2020-09-30 21:08:55,043 : INFO : Loading a fresh vocabulary - 2020-09-30 21:08:55,064 : INFO : effective_min_count=2 retains 3955 unique words (56% of original 6981, drops 3026) - 2020-09-30 21:08:55,064 : INFO : effective_min_count=2 leaves 55126 word corpus (94% of original 58152, drops 3026) - 2020-09-30 21:08:55,098 : INFO : deleting the raw counts dictionary of 6981 items - 2020-09-30 21:08:55,100 : INFO : sample=0.001 downsamples 46 most-common words - 2020-09-30 21:08:55,100 : INFO : downsampling leaves estimated 42390 word corpus (76.9% of prior 55126) - 2020-09-30 21:08:55,149 : INFO : estimated required memory for 3955 words and 50 dimensions: 3679500 bytes - 2020-09-30 21:08:55,149 : INFO : resetting layer weights + 2022-12-07 10:59:00,806 : INFO : collecting all words and their counts + 2022-12-07 10:59:00,808 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags + 2022-12-07 10:59:00,850 : INFO : collected 6981 word types and 300 unique tags from a corpus of 300 examples and 58152 words + 2022-12-07 10:59:00,850 : INFO : Creating a fresh vocabulary + 2022-12-07 10:59:00,887 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 retains 3955 unique words (56.65% of original 6981, drops 3026)', 'datetime': '2022-12-07T10:59:00.886953', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'} + 2022-12-07 10:59:00,887 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=2 leaves 55126 word corpus (94.80% of original 58152, drops 3026)', 'datetime': '2022-12-07T10:59:00.887466', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'} + 2022-12-07 10:59:00,917 : INFO : deleting the raw counts dictionary of 6981 items + 2022-12-07 10:59:00,918 : INFO : sample=0.001 downsamples 46 most-common words + 2022-12-07 10:59:00,918 : INFO : Doc2Vec lifecycle event {'msg': 'downsampling leaves estimated 42390.98914085061 word corpus (76.9%% of prior 55126)', 'datetime': '2022-12-07T10:59:00.918276', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'} + 2022-12-07 10:59:00,965 : INFO : estimated required memory for 3955 words and 50 dimensions: 3679500 bytes + 2022-12-07 10:59:00,965 : INFO : resetting layer weights + +.. GENERATED FROM PYTHON SOURCE LINES 209-214 Essentially, the vocabulary is a list (accessible via ``model.wv.index_to_key``) of all of the unique words extracted from the training corpus. Additional attributes for each word are available using the ``model.wv.get_vecattr()`` method. For example, to see how many times ``penalty`` appeared in the training corpus: +.. GENERATED FROM PYTHON SOURCE LINES 214-216 .. code-block:: default @@ -335,8 +367,6 @@ For example, to see how many times ``penalty`` appeared in the training corpus: .. 
rst-class:: sphx-glr-script-out - Out: - .. code-block:: none Word 'penalty' appeared 4 times in the training corpus. @@ -344,12 +374,21 @@ For example, to see how many times ``penalty`` appeared in the training corpus: +.. GENERATED FROM PYTHON SOURCE LINES 217-228 + Next, train the model on the corpus. -If optimized Gensim (with BLAS library) is being used, this should take no more than 3 seconds. -If the BLAS library is not being used, this should take no more than 2 -minutes, so use optimized Gensim with BLAS if you value your time. +In the usual case, where Gensim installation found a BLAS library for optimized +bulk vector operations, this training on this tiny 300 document, ~60k word corpus +should take just a few seconds. (More realistic datasets of tens-of-millions +of words or more take proportionately longer.) If for some reason a BLAS library +isn't available, training uses a fallback approach that takes 60x-120x longer, +so even this tiny training will take minutes rather than seconds. (And, in that +case, you should also notice a warning in the logging letting you know there's +something worth fixing.) So, be sure your installation uses the BLAS-optimized +Gensim if you value your time. +.. GENERATED FROM PYTHON SOURCE LINES 228-230 .. code-block:: default @@ -361,181 +400,62 @@ minutes, so use optimized Gensim with BLAS if you value your time. .. rst-class:: sphx-glr-script-out - Out: - .. code-block:: none - 2020-09-30 21:08:55,553 : INFO : training model with 3 workers on 3955 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 - 2020-09-30 21:08:55,613 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:55,614 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:55,614 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:55,614 : INFO : EPOCH - 1 : training on 58152 raw words (42784 effective words) took 0.1s, 751479 effective words/s - 2020-09-30 21:08:55,664 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:55,666 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:55,666 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:55,666 : INFO : EPOCH - 2 : training on 58152 raw words (42745 effective words) took 0.1s, 845101 effective words/s - 2020-09-30 21:08:55,718 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:55,719 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:55,720 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:55,720 : INFO : EPOCH - 3 : training on 58152 raw words (42605 effective words) took 0.1s, 810845 effective words/s - 2020-09-30 21:08:55,781 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:55,783 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:55,784 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:55,784 : INFO : EPOCH - 4 : training on 58152 raw words (42723 effective words) took 0.1s, 677810 effective words/s - 2020-09-30 21:08:55,846 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:55,847 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:55,848 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 
21:08:55,848 : INFO : EPOCH - 5 : training on 58152 raw words (42641 effective words) took 0.1s, 682513 effective words/s - 2020-09-30 21:08:55,903 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:55,905 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:55,905 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:55,905 : INFO : EPOCH - 6 : training on 58152 raw words (42654 effective words) took 0.1s, 760381 effective words/s - 2020-09-30 21:08:55,960 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:55,962 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:55,964 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:55,964 : INFO : EPOCH - 7 : training on 58152 raw words (42751 effective words) took 0.1s, 741994 effective words/s - 2020-09-30 21:08:56,018 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,020 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,020 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,020 : INFO : EPOCH - 8 : training on 58152 raw words (42692 effective words) took 0.1s, 773631 effective words/s - 2020-09-30 21:08:56,076 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,078 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,081 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,081 : INFO : EPOCH - 9 : training on 58152 raw words (42745 effective words) took 0.1s, 719453 effective words/s - 2020-09-30 21:08:56,137 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,137 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,137 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,138 : INFO : EPOCH - 10 : training on 58152 raw words (42733 effective words) took 0.1s, 770082 effective words/s - 2020-09-30 21:08:56,195 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,196 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,197 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,197 : INFO : EPOCH - 11 : training on 58152 raw words (42791 effective words) took 0.1s, 734171 effective words/s - 2020-09-30 21:08:56,253 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,255 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,255 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,255 : INFO : EPOCH - 12 : training on 58152 raw words (42773 effective words) took 0.1s, 745248 effective words/s - 2020-09-30 21:08:56,316 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,318 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,318 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,318 : INFO : EPOCH - 13 : training on 58152 raw words (42793 effective words) took 0.1s, 702300 effective words/s - 2020-09-30 21:08:56,369 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 
21:08:56,371 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,373 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,373 : INFO : EPOCH - 14 : training on 58152 raw words (42637 effective words) took 0.1s, 802259 effective words/s - 2020-09-30 21:08:56,421 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,425 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,426 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,426 : INFO : EPOCH - 15 : training on 58152 raw words (42686 effective words) took 0.1s, 820787 effective words/s - 2020-09-30 21:08:56,475 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,478 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,479 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,479 : INFO : EPOCH - 16 : training on 58152 raw words (42799 effective words) took 0.1s, 829690 effective words/s - 2020-09-30 21:08:56,530 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,530 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,533 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,534 : INFO : EPOCH - 17 : training on 58152 raw words (42733 effective words) took 0.1s, 794744 effective words/s - 2020-09-30 21:08:56,583 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,585 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,587 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,587 : INFO : EPOCH - 18 : training on 58152 raw words (42703 effective words) took 0.1s, 813146 effective words/s - 2020-09-30 21:08:56,638 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,640 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,640 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,641 : INFO : EPOCH - 19 : training on 58152 raw words (42763 effective words) took 0.1s, 822300 effective words/s - 2020-09-30 21:08:56,696 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,700 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,700 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,700 : INFO : EPOCH - 20 : training on 58152 raw words (42649 effective words) took 0.1s, 733047 effective words/s - 2020-09-30 21:08:56,752 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,753 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,754 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,754 : INFO : EPOCH - 21 : training on 58152 raw words (42701 effective words) took 0.1s, 822006 effective words/s - 2020-09-30 21:08:56,803 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,805 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,805 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,805 : INFO : EPOCH - 22 : training on 
58152 raw words (42714 effective words) took 0.1s, 848390 effective words/s - 2020-09-30 21:08:56,857 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,857 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,859 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,860 : INFO : EPOCH - 23 : training on 58152 raw words (42740 effective words) took 0.1s, 811758 effective words/s - 2020-09-30 21:08:56,907 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,909 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,910 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,910 : INFO : EPOCH - 24 : training on 58152 raw words (42754 effective words) took 0.0s, 873741 effective words/s - 2020-09-30 21:08:56,959 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:56,960 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:56,960 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:56,960 : INFO : EPOCH - 25 : training on 58152 raw words (42704 effective words) took 0.0s, 862291 effective words/s - 2020-09-30 21:08:57,009 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,010 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,011 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,011 : INFO : EPOCH - 26 : training on 58152 raw words (42741 effective words) took 0.0s, 868076 effective words/s - 2020-09-30 21:08:57,059 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,062 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,063 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,063 : INFO : EPOCH - 27 : training on 58152 raw words (42610 effective words) took 0.1s, 830699 effective words/s - 2020-09-30 21:08:57,112 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,114 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,115 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,116 : INFO : EPOCH - 28 : training on 58152 raw words (42747 effective words) took 0.1s, 835959 effective words/s - 2020-09-30 21:08:57,164 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,169 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,170 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,170 : INFO : EPOCH - 29 : training on 58152 raw words (42755 effective words) took 0.1s, 804348 effective words/s - 2020-09-30 21:08:57,219 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,222 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,224 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,224 : INFO : EPOCH - 30 : training on 58152 raw words (42760 effective words) took 0.1s, 808636 effective words/s - 2020-09-30 21:08:57,271 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,273 : INFO : worker thread finished; 
awaiting finish of 1 more threads - 2020-09-30 21:08:57,273 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,273 : INFO : EPOCH - 31 : training on 58152 raw words (42727 effective words) took 0.0s, 889118 effective words/s - 2020-09-30 21:08:57,323 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,326 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,327 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,327 : INFO : EPOCH - 32 : training on 58152 raw words (42786 effective words) took 0.1s, 819149 effective words/s - 2020-09-30 21:08:57,377 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,378 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,379 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,379 : INFO : EPOCH - 33 : training on 58152 raw words (42614 effective words) took 0.1s, 828217 effective words/s - 2020-09-30 21:08:57,427 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,430 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,431 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,431 : INFO : EPOCH - 34 : training on 58152 raw words (42757 effective words) took 0.1s, 848700 effective words/s - 2020-09-30 21:08:57,476 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,479 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,481 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,481 : INFO : EPOCH - 35 : training on 58152 raw words (42713 effective words) took 0.0s, 881912 effective words/s - 2020-09-30 21:08:57,530 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,530 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,532 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,532 : INFO : EPOCH - 36 : training on 58152 raw words (42632 effective words) took 0.1s, 843930 effective words/s - 2020-09-30 21:08:57,580 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,583 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,584 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,584 : INFO : EPOCH - 37 : training on 58152 raw words (42691 effective words) took 0.1s, 851268 effective words/s - 2020-09-30 21:08:57,632 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,634 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,635 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,635 : INFO : EPOCH - 38 : training on 58152 raw words (42667 effective words) took 0.1s, 850589 effective words/s - 2020-09-30 21:08:57,685 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,686 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,687 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,687 : INFO : EPOCH - 39 : training on 58152 raw words (42641 effective words) took 
0.1s, 843857 effective words/s - 2020-09-30 21:08:57,736 : INFO : worker thread finished; awaiting finish of 2 more threads - 2020-09-30 21:08:57,737 : INFO : worker thread finished; awaiting finish of 1 more threads - 2020-09-30 21:08:57,741 : INFO : worker thread finished; awaiting finish of 0 more threads - 2020-09-30 21:08:57,741 : INFO : EPOCH - 40 : training on 58152 raw words (42721 effective words) took 0.1s, 807691 effective words/s - 2020-09-30 21:08:57,741 : INFO : training on a 2326080 raw words (1708575 effective words) took 2.2s, 781245 effective words/s - - - + 2022-12-07 10:59:01,272 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 3 workers on 3955 vocabulary and 50 features, using sg=0 hs=0 sample=0.001 negative=5 window=5 shrink_windows=True', 'datetime': '2022-12-07T10:59:01.271863', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'train'} + 2022-12-07 10:59:01,408 : INFO : EPOCH 0: training on 58152 raw words (42665 effective words) took 0.1s, 335294 effective words/s + 2022-12-07 10:59:01,462 : INFO : EPOCH 1: training on 58152 raw words (42755 effective words) took 0.1s, 816420 effective words/s + 2022-12-07 10:59:01,521 : INFO : EPOCH 2: training on 58152 raw words (42692 effective words) took 0.1s, 745004 effective words/s + 2022-12-07 10:59:01,573 : INFO : EPOCH 3: training on 58152 raw words (42670 effective words) took 0.1s, 841368 effective words/s + 2022-12-07 10:59:01,627 : INFO : EPOCH 4: training on 58152 raw words (42685 effective words) took 0.1s, 815442 effective words/s + 2022-12-07 10:59:01,703 : INFO : EPOCH 5: training on 58152 raw words (42709 effective words) took 0.1s, 578402 effective words/s + 2022-12-07 10:59:01,753 : INFO : EPOCH 6: training on 58152 raw words (42594 effective words) took 0.0s, 864899 effective words/s + 2022-12-07 10:59:01,804 : INFO : EPOCH 7: training on 58152 raw words (42721 effective words) took 0.0s, 864073 effective words/s + 2022-12-07 10:59:01,881 : INFO : EPOCH 8: training on 58152 raw words (42622 effective words) took 0.1s, 566867 effective words/s + 2022-12-07 10:59:01,932 : INFO : EPOCH 9: training on 58152 raw words (42770 effective words) took 0.0s, 862066 effective words/s + 2022-12-07 10:59:02,006 : INFO : EPOCH 10: training on 58152 raw words (42739 effective words) took 0.1s, 587035 effective words/s + 2022-12-07 10:59:02,058 : INFO : EPOCH 11: training on 58152 raw words (42612 effective words) took 0.1s, 850879 effective words/s + 2022-12-07 10:59:02,135 : INFO : EPOCH 12: training on 58152 raw words (42655 effective words) took 0.1s, 566216 effective words/s + 2022-12-07 10:59:02,187 : INFO : EPOCH 13: training on 58152 raw words (42749 effective words) took 0.1s, 844125 effective words/s + 2022-12-07 10:59:02,265 : INFO : EPOCH 14: training on 58152 raw words (42748 effective words) took 0.1s, 556136 effective words/s + 2022-12-07 10:59:02,347 : INFO : EPOCH 15: training on 58152 raw words (42748 effective words) took 0.1s, 530528 effective words/s + 2022-12-07 10:59:02,398 : INFO : EPOCH 16: training on 58152 raw words (42737 effective words) took 0.0s, 871200 effective words/s + 2022-12-07 10:59:02,485 : INFO : EPOCH 17: training on 58152 raw words (42697 effective words) took 0.1s, 499981 effective words/s + 2022-12-07 10:59:02,584 : INFO : EPOCH 18: training on 58152 raw words (42747 effective words) took 0.1s, 440730 effective words/s + 2022-12-07 10:59:02,672 : INFO : 
EPOCH 19: training on 58152 raw words (42739 effective words) took 0.1s, 497651 effective words/s + 2022-12-07 10:59:02,761 : INFO : EPOCH 20: training on 58152 raw words (42782 effective words) took 0.1s, 499103 effective words/s + 2022-12-07 10:59:02,851 : INFO : EPOCH 21: training on 58152 raw words (42580 effective words) took 0.1s, 489515 effective words/s + 2022-12-07 10:59:02,939 : INFO : EPOCH 22: training on 58152 raw words (42687 effective words) took 0.1s, 496560 effective words/s + 2022-12-07 10:59:03,023 : INFO : EPOCH 23: training on 58152 raw words (42667 effective words) took 0.1s, 517527 effective words/s + 2022-12-07 10:59:03,156 : INFO : EPOCH 24: training on 58152 raw words (42678 effective words) took 0.1s, 328575 effective words/s + 2022-12-07 10:59:03,322 : INFO : EPOCH 25: training on 58152 raw words (42743 effective words) took 0.2s, 261440 effective words/s + 2022-12-07 10:59:03,486 : INFO : EPOCH 26: training on 58152 raw words (42692 effective words) took 0.2s, 266564 effective words/s + 2022-12-07 10:59:03,627 : INFO : EPOCH 27: training on 58152 raw words (42774 effective words) took 0.1s, 310530 effective words/s + 2022-12-07 10:59:03,770 : INFO : EPOCH 28: training on 58152 raw words (42706 effective words) took 0.1s, 305665 effective words/s + 2022-12-07 10:59:03,901 : INFO : EPOCH 29: training on 58152 raw words (42658 effective words) took 0.1s, 334228 effective words/s + 2022-12-07 10:59:04,028 : INFO : EPOCH 30: training on 58152 raw words (42746 effective words) took 0.1s, 344379 effective words/s + 2022-12-07 10:59:04,159 : INFO : EPOCH 31: training on 58152 raw words (42676 effective words) took 0.1s, 334291 effective words/s + 2022-12-07 10:59:04,295 : INFO : EPOCH 32: training on 58152 raw words (42763 effective words) took 0.1s, 322886 effective words/s + 2022-12-07 10:59:04,488 : INFO : EPOCH 33: training on 58152 raw words (42647 effective words) took 0.2s, 224522 effective words/s + 2022-12-07 10:59:04,629 : INFO : EPOCH 34: training on 58152 raw words (42720 effective words) took 0.1s, 310616 effective words/s + 2022-12-07 10:59:04,764 : INFO : EPOCH 35: training on 58152 raw words (42775 effective words) took 0.1s, 323299 effective words/s + 2022-12-07 10:59:04,899 : INFO : EPOCH 36: training on 58152 raw words (42662 effective words) took 0.1s, 322458 effective words/s + 2022-12-07 10:59:05,032 : INFO : EPOCH 37: training on 58152 raw words (42656 effective words) took 0.1s, 329126 effective words/s + 2022-12-07 10:59:05,162 : INFO : EPOCH 38: training on 58152 raw words (42720 effective words) took 0.1s, 337238 effective words/s + 2022-12-07 10:59:05,308 : INFO : EPOCH 39: training on 58152 raw words (42688 effective words) took 0.1s, 299620 effective words/s + 2022-12-07 10:59:05,308 : INFO : Doc2Vec lifecycle event {'msg': 'training on 2326080 raw words (1708074 effective words) took 4.0s, 423332 effective words/s', 'datetime': '2022-12-07T10:59:05.308684', 'gensim': '4.2.1.dev0', 'python': '3.8.10 (default, Jun 22 2022, 20:18:18) \n[GCC 9.4.0]', 'platform': 'Linux-5.4.0-135-generic-x86_64-with-glibc2.29', 'event': 'train'} + + + + +.. GENERATED FROM PYTHON SOURCE LINES 231-235 Now, we can use the trained model to infer a vector for any piece of text by passing a list of words to the ``model.infer_vector`` function. This vector can then be compared with other vectors via cosine similarity. +.. GENERATED FROM PYTHON SOURCE LINES 235-238 .. 
.. code-block:: default

@@ -548,22 +468,22 @@ vector can then be compared with other vectors via cosine similarity.

.. rst-class:: sphx-glr-script-out

- Out:
-
.. code-block:: none

- [-0.08478509 0.05011684 0.0675064 -0.19926868 -0.1235586 0.01768214
- -0.12645927 0.01062329 0.06113973 0.35424358 0.01320948 0.07561274
- -0.01645093 0.0692549 0.08346193 -0.01599065 0.08287009 -0.0139379
- -0.17772709 -0.26271465 0.0442089 -0.04659882 -0.12873884 0.28799203
- -0.13040264 0.12478471 -0.14091878 -0.09698066 -0.07903259 -0.10124907
- -0.28239366 0.13270256 0.04445919 -0.24210942 -0.1907376 -0.07264525
- -0.14167067 -0.22816683 -0.00663796 0.23165748 -0.10436232 -0.01028251
- -0.04064698 0.08813146 0.01072008 -0.149789 0.05923386 0.16301566
- 0.05815683 0.1258063 ]
+ [-0.10196274 -0.36020595 -0.10973375 0.28432116 -0.00792601 0.01950991
+ 0.01309869 0.1045896 -0.2011485 -0.12135196 0.15298457 0.05421316
+ -0.06486023 -0.00131951 -0.2237759 -0.08489189 0.05889525 0.27961093
+ 0.08121023 -0.06200862 -0.00651888 -0.06831821 0.13001564 0.04539844
+ -0.01659351 -0.02359444 -0.22276032 0.06692155 -0.11293832 -0.08056813
+ 0.38737044 0.05470002 0.19902836 0.19122775 0.17020799 0.10668964
+ 0.01216549 -0.3049222 -0.05198798 0.00130251 0.04994885 -0.0069596
+ -0.06367141 -0.11740001 0.14623125 0.10109582 -0.06466878 -0.06512908
+ 0.17817481 -0.00934212]
+
+.. GENERATED FROM PYTHON SOURCE LINES 239-247

Note that ``infer_vector()`` does *not* take a string, but rather a list of
string tokens, which should have already been tokenized the same way as the
@@ -574,6 +494,8 @@ iterative approximation problem that makes use of internal randomization,
repeated inferences of the same text will return slightly different vectors.

+.. GENERATED FROM PYTHON SOURCE LINES 249-262
+
Assessing the Model
-------------------
@@ -588,6 +510,7 @@ Additionally, we'll keep track of the second ranks for a comparison of less
similar documents.

+.. GENERATED FROM PYTHON SOURCE LINES 262-272

.. code-block:: default
@@ -608,10 +531,13 @@ similar documents.

+.. GENERATED FROM PYTHON SOURCE LINES 273-276
+
Let's count how each document ranks with respect to the training corpus.

NB: results vary between runs due to random seeding and the very small corpus.

+.. GENERATED FROM PYTHON SOURCE LINES 276-281

.. code-block:: default
@@ -626,8 +552,6 @@ NB. Results vary between runs due to random seeding and very small corpus

.. rst-class:: sphx-glr-script-out

- Out:
-
.. code-block:: none

Counter({0: 292, 1: 8})

+.. GENERATED FROM PYTHON SOURCE LINES 282-290
+
Basically, greater than 95% of the inferred documents are found to be most
similar to themselves, and about 5% of the time a document is mistakenly found
most similar to another document. Checking the inferred-vector against a
@@ -644,6 +570,7 @@ behaving in a usefully consistent manner, though not a real 'accuracy' value.
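The ranking loop itself is elided from this hunk; as a sketch (assuming
``train_corpus`` is the list of ``TaggedDocument`` objects built earlier,
tagged with their integer positions), it is along these lines:

.. code-block:: default

    import collections

    ranks = []
    second_ranks = []
    for doc_id in range(len(train_corpus)):
        # Re-infer a vector for a document the model was trained on ...
        inferred_vector = model.infer_vector(train_corpus[doc_id].words)
        # ... then rank every training doc-vector by similarity to it.
        sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))
        rank = [docid for docid, sim in sims].index(doc_id)
        ranks.append(rank)
        second_ranks.append(sims[1])  # runner-up, for the later comparison

    print(collections.Counter(ranks))  # rank 0 means 'most similar to itself'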
This is great and not entirely surprising. We can take a look at an example:

+.. GENERATED FROM PYTHON SOURCE LINES 290-295

.. code-block:: default
@@ -658,26 +585,26 @@ This is great and not entirely surprising. We can take a look at an example:

.. rst-class:: sphx-glr-script-out

- Out:
-
.. code-block:: none

Document (299): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

- SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):
+ SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec:

- MOST (299, 0.9482713341712952): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»
+ MOST (299, 0.9564058780670166): «australia will take on france in the doubles rubber of the davis cup tennis final today with the tie levelled at wayne arthurs and todd woodbridge are scheduled to lead australia in the doubles against cedric pioline and fabrice santoro however changes can be made to the line up up to an hour before the match and australian team captain john fitzgerald suggested he might do just that we ll make team appraisal of the whole situation go over the pros and cons and make decision french team captain guy forget says he will not make changes but does not know what to expect from australia todd is the best doubles player in the world right now so expect him to play he said would probably use wayne arthurs but don know what to expect really pat rafter salvaged australia davis cup campaign yesterday with win in the second singles match rafter overcame an arm injury to defeat french number one sebastien grosjean in three sets the australian says he is happy with his form it not very pretty tennis there isn too many consistent bounces you are playing like said bit of classic old grass court rafter said rafter levelled the score after lleyton hewitt shock five set loss to nicholas escude in the first singles rubber but rafter says he felt no added pressure after hewitt defeat knew had good team to back me up even if we were down he said knew could win on the last day know the boys can win doubles so even if we were down still feel we are good enough team to win and vice versa they are good enough team to beat us as well»

- SECOND-MOST (104, 0.8029672503471375): «australian cricket captain steve waugh has supported fast bowler brett lee after criticism of his intimidatory bowling to the south african tailenders in the first test in adelaide earlier this month lee was fined for giving new zealand tailender shane bond an unsportsmanlike send off during the third test in perth waugh says tailenders should not be protected from short pitched bowling these days you re earning big money you ve got responsibility to learn how to bat he said mean there no times like years ago when it was not professional and sort of bowlers code these days you re professional our batsmen work very hard at their batting and expect other tailenders to do likewise meanwhile waugh says his side will need to guard against complacency after convincingly winning the first test by runs waugh says despite the dominance of his side in the first test south africa can never be taken lightly it only one test match out of three or six whichever way you want to look at it so there lot of work to go he said but it nice to win the first battle definitely it gives us lot of confidence going into melbourne you know the big crowd there we love playing in front of the boxing day crowd so that will be to our advantage as well south africa begins four day match against new south wales in sydney on thursday in the lead up to the boxing day test veteran fast bowler allan donald will play in the warm up match and is likely to take his place in the team for the second test south african captain shaun pollock expects much better performance from his side in the melbourne test we still believe that we didn play to our full potential so if we can improve on our aspects the output we put out on the field will be lot better and we still believe we have side that is good enough to beat australia on our day he said»
+ SECOND-MOST (104, 0.7868924140930176): «australian cricket captain steve waugh has supported fast bowler brett lee after criticism of his intimidatory bowling to the south african tailenders in the first test in adelaide earlier this month lee was fined for giving new zealand tailender shane bond an unsportsmanlike send off during the third test in perth waugh says tailenders should not be protected from short pitched bowling these days you re earning big money you ve got responsibility to learn how to bat he said mean there no times like years ago when it was not professional and sort of bowlers code these days you re professional our batsmen work very hard at their batting and expect other tailenders to do likewise meanwhile waugh says his side will need to guard against complacency after convincingly winning the first test by runs waugh says despite the dominance of his side in the first test south africa can never be taken lightly it only one test match out of three or six whichever way you want to look at it so there lot of work to go he said but it nice to win the first battle definitely it gives us lot of confidence going into melbourne you know the big crowd there we love playing in front of the boxing day crowd so that will be to our advantage as well south africa begins four day match against new south wales in sydney on thursday in the lead up to the boxing day test veteran fast bowler allan donald will play in the warm up match and is likely to take his place in the team for the second test south african captain shaun pollock expects much better performance from his side in the melbourne test we still believe that we didn play to our full potential so if we can improve on our aspects the output we put out on the field will be lot better and we still believe we have side that is good enough to beat australia on our day he said»

- MEDIAN (238, 0.2635717988014221): «centrelink is urging people affected by job cuts at regional pay tv operator austar and travel company traveland to seek information about their income support options traveland has announced it is shedding more than jobs around australia and austar is letting employees go centrelink finance information officer peter murray says those facing uncertain futures should head to centrelink in the next few days centrelink is the shopfront now for commonwealth services for income support and the employment network so that it is important if people haven been to us before they might get pleasant surprise at the range of services that we do offer to try and help them through situations where things might have changed for them mr murray said»
+ MEDIAN (119, 0.24808582663536072): «australia is continuing to negotiate with the united states government in an effort to interview the australian david hicks who was captured fighting alongside taliban forces in afghanistan mr hicks is being held by the united states on board ship in the afghanistan region where the australian federal police and australian security intelligence organisation asio officials are trying to gain access foreign affairs minister alexander downer has also confirmed that the australian government is investigating reports that another australian has been fighting for taliban forces in afghanistan we often get reports of people going to different parts of the world and asking us to investigate them he said we always investigate sometimes it is impossible to find out we just don know in this case but it is not to say that we think there are lot of australians in afghanistan the only case we know is hicks mr downer says it is unclear when mr hicks will be back on australian soil but he is hopeful the americans will facilitate australian authorities interviewing him»

- LEAST (243, -0.13247375190258026): «four afghan factions have reached agreement on an interim cabinet during talks in germany the united nations says the administration which will take over from december will be headed by the royalist anti taliban commander hamed karzai it concludes more than week of negotiations outside bonn and is aimed at restoring peace and stability to the war ravaged country the year old former deputy foreign minister who is currently battling the taliban around the southern city of kandahar is an ally of the exiled afghan king mohammed zahir shah he will serve as chairman of an interim authority that will govern afghanistan for six month period before loya jirga or grand traditional assembly of elders in turn appoints an month transitional government meanwhile united states marines are now reported to have been deployed in eastern afghanistan where opposition forces are closing in on al qaeda soldiers reports from the area say there has been gun battle between the opposition and al qaeda close to the tora bora cave complex where osama bin laden is thought to be hiding in the south of the country american marines are taking part in patrols around the air base they have secured near kandahar but are unlikely to take part in any assault on the city however the chairman of the joint chiefs of staff general richard myers says they are prepared for anything they are prepared for engagements they re robust fighting force and they re absolutely ready to engage if that required he said»
+ LEAST (216, -0.11085141450166702): «senior taliban official confirmed the islamic militia would begin handing over its last bastion of kandahar to pashtun tribal leaders on friday this agreement was that taliban should surrender kandahar peacefully to the elders of these areas and we should guarantee the lives and the safety of taliban authorities and all the taliban from tomorrow should start this program former taliban ambassador to pakistan abdul salam zaeef told cnn in telephone interview he insisted that the taliban would not surrender to hamid karzai the new afghan interim leader and pashtun elder who has been cooperating with the united states to calm unrest among the southern tribes the taliban will surrender to elders not to karzai karzai and other persons which they want to enter kandahar by the support of america they don allow to enter kandahar city he said the taliban will surrender the weapons the ammunition to elders»

+.. GENERATED FROM PYTHON SOURCE LINES 296-305
+
Notice above that the most similar document (usually the same text) has a
similarity score approaching 1.0. However, the similarity score for the
second-ranked documents should be significantly lower (assuming the documents
@@ -688,6 +615,7 @@
We can run the next cell repeatedly to see a sampling of other target-document
comparisons.

+.. GENERATED FROM PYTHON SOURCE LINES 305-315
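One way to write such a cell (a sketch under the same assumptions as above;
the names are illustrative):

.. code-block:: default

    import random

    # Pick a random training document and freshly re-infer its vector.
    doc_id = random.randint(0, len(train_corpus) - 1)
    inferred_vector = model.infer_vector(train_corpus[doc_id].words)
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

    # Print the target document alongside its runner-up match.
    print('Train Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
    print('Similar Document {}: «{}»\n'.format(sims[1], ' '.join(train_corpus[sims[1][0]].words)))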
.. code-block:: default
@@ -707,17 +635,17 @@ comparisons.

.. rst-class:: sphx-glr-script-out

- Out:
-
.. code-block:: none

- Train Document (292): «rival afghan factions are deadlocked over the shape of future government the northern alliance has demanded day adjournment of power sharing talks in germany after its president burhanuddin rabbani objected to the appointment system for an interim administration president rabbani has objected to the plans for an interim government to be drawn up by appointment as discussed in bonn saying the interim leaders should be voted in by afghans themselves he also says there is no real need for sizeable international security force president rabbani says he would prefer local afghan factions drew up their own internal security forces of around personnel but if the world insisted there should be an international security presence there should be no more than or personnel in their security forces he says president rabbani objections are likely to cast doubt on his delegation ability to commit the northern alliance to any course of action decided upon in bonn he now threatens to undermine the very process he claims to support in the quest for stable government in afghanistan»
+ Train Document (198): «authorities are trying to track down the crew of vessel that landed undetected at cocos islands carrying asylum seekers the group of sri lankan men was found aboard their boat moored to the south of the islands yesterday afternoon shire president ron grant says investigations are underway as to the whereabouts of the crew after the asylum seekers told authorities they had left in another boat after dropping them off unfortunately for them there two aircraft the royal australian air force here at the moment and one getting prepared to fly off and obviously they will be looking to see if there is another boat he said mr grant says the sri lankans have not yet been brought ashore»
+
+ Similar Document (89, 0.7137947082519531): «after the torching of more than buildings over the past three days the situation at the woomera detention centre overnight appeared relatively calm there was however tension inside the south australian facility with up to detainees breaking into prohibited zone the group became problem for staff after breaching fence within the centre at one point staff considered using water cannon to control the detainees it is not known if they actually resorted to any tough action but group of men wearing riot gear possibly star force police officers brought in on standby could be seen in one of the compounds late yesterday government authorities confirmed that two detainees had committed acts of self harm one of them needed stitches and is believed to have been taken away in an ambulance no other details have been released»
- Similar Document (13, 0.7867921590805054): «talks between afghan and british officials in kabul have ended without final agreement on the deployment of international security force the lack of suitable translation of the document meant further delay authorities in kabul have been giving conflicting signals for weeks now over the number of peacekeepers they would allow and the role the international force would play the foreign minister dr abdullah appeared to be ending the confusion saying an agreement was about to be signed there is already the agreement so it was finalised he said but spokesman for the interior minister yunis kanooni emerged soon after to say there was no agreement and nothing to sign scores of british peacekeepers are already patrolling the streets of kabul in tandem with afghan police but proposals to enlarge the force to as many as international peacekeepers have been criticised by some commanders as tantamount to foreign occupation»

+.. GENERATED FROM PYTHON SOURCE LINES 316-322

Testing the Model
-----------------

Using the same approach above, we'll infer the vector for a randomly chosen
test document, and compare the document to our model by eye.

+.. GENERATED FROM PYTHON SOURCE LINES 322-334
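A sketch of this step (assuming ``test_corpus`` holds plain token lists, read
earlier with ``tokens_only=True``, i.e. documents the model has never seen):

.. code-block:: default

    import random

    # Infer a vector for a random held-out test document ...
    doc_id = random.randint(0, len(test_corpus) - 1)
    inferred_vector = model.infer_vector(test_corpus[doc_id])
    sims = model.dv.most_similar([inferred_vector], topn=len(model.dv))

    # ... and eyeball its most, median, and least similar training documents.
    print('Test Document ({}): «{}»\n'.format(doc_id, ' '.join(test_corpus[doc_id])))
    for label, index in [('MOST', 0), ('MEDIAN', len(sims) // 2), ('LEAST', len(sims) - 1)]:
        print('%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))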
.. code-block:: default
@@ -747,23 +676,23 @@ test document, and compare the document to our model by eye.

.. rst-class:: sphx-glr-script-out

- Out:
-
.. code-block:: none

- Test Document (49): «labor needed to distinguish itself from the government on the issue of asylum seekers greens leader bob brown has said his senate colleague kerry nettle intends to move motion today on the first anniversary of the tampa crisis condemning the government over its refugee policy and calling for an end to mandatory detention we greens want to bring the government to book over its serial breach of international obligations as far as asylum seekers in this country are concerned senator brown said today»
+ Test Document (17): «the united nations world food program estimates that up to million people in seven countries malawi mozambique zambia angola swaziland lesotho and zimbabwe face death by starvation unless there is massive international response in malawi as many as people may have already died the signs of malnutrition swollen stomachs stick thin arms light coloured hair are everywhere»
+
+ SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec:
- SIMILAR/DISSIMILAR DOCS PER MODEL Doc2Vec(dm/m,d50,n5,w5,mc2,s0.001,t3):

+ MOST (86, 0.8239533305168152): «argentina economy minister domingo cavallo is reported to have resigned in the face of mounting unrest over the country crumbling economy the reports in number of local media outlets could not be officially confirmed the news comes as police used teargas to disperse tens of thousands of people who had massed near the presidential palace in buenos aires and in other parts of the city to protest against the declaration of state of emergency it was declared after mounting popular discontent and widespread looting in the past few days with people over the state of the economy which has been in recession for four years»
- MOST (218, 0.8016394376754761): «refugee support groups are strongly critical of federal government claims that the pacific solution program is working well the immigration minister philip ruddock says he is pleased with the program which uses pacific island nations to process asylum seekers wanting to come to australia president of the hazara ethnic society of australia hassan ghulam says the australian government is bullying smaller nations into accepting asylum seekers if the pacific countries wanted refugees they can clearly raise their voice in the united nations and say yes we are accepting refugees and why australia who gives this authority to the australian government to force the pacific countries to accept refugees in this form or in the other form he asked»

+ MEDIAN (221, 0.40627941489219666): «reserve bank governor ian macfarlane says he is confident australia will ride through the current world economic slump largely brought on by the united states mr macfarlane told gathering in sydney last night australia growth is remarkably good by world standards and inflation should come down in the next months he predicts the united states economy will show signs of recovery from mid year and that as result it is highly unlikely that the reserve bank will raise interest rates in the next six months calendar year has been difficult one for the world economy and the first half of looks like remaining weak before recovery gets underway therefore this period will be classified as world recession like those of the mid the early and the early mr macfarlane said the australian economy has got through the first half of it in reasonably good shape»
- MEDIAN (204, 0.3319269120693207): «an iraqi doctor being held at sydney villawood detention centre claims he was prevented from receiving human rights award dr aamer sultan had been awarded special commendation at yesterday human rights and equal opportunity commission awards in sydney but was not able to receive the honour in person dr sultan says he had been hoping to attend the ceremony but says the management at villawood stopped him from going submitted formal request to the centre manager who promised me that he will present the matter to migration management here who are the main authority here they also came back that unfortunately we can not fulfill this request for you but they didn give any explanation dr sultan says he was disappointed by the decision the immigration minister philip ruddock has written letter of complaint to the medical journal of australia about an article penned by dr sultan on the psychological state of detainees at villawood the journal has published research dr sultan conducted with former visiting psychologist to the centre kevin sullivan their survey of detainees over nine months found all but one displayed symptoms of psychological distress at some time the article says per cent acknowledged chronic depressive symptoms and close to half of the group had reached severe stages of depression»

+ LEAST (37, -0.06813289225101471): «australia quicks and opening batsmen have put the side in dominant position going into day three of the boxing day test match against south africa at the mcg australia is no wicket for only runs shy of south africa after andy bichel earlier starred as the tourists fell for when play was abandoned due to rain few overs short of scheduled stumps yesterday justin langer was not out and matthew hayden the openers went on the attack from the start with langer innings including six fours and hayden eight earlier shaun pollock and nantie haywood launched vital rearguard action to help south africa to respectable first innings total the pair put on runs for the final wicket to help the tourists to the south africans had slumped to for through combination of australia good bowling good fielding and good luck after resuming at for yesterday morning the tourists looked to be cruising as jacques kallis and neil mckenzie added without loss but then bichel suddenly had them reeling after snatching two wickets in two balls first he had jacques kallis caught behind for although kallis could consider himself very unlucky as replays showed his bat was long way from the ball on the next ball bichel snatched sharp return catch to dismiss lance klusener first ball and have shot at hat trick bichel missed out on the hat trick and mark boucher and neil mckenzie again steadied the south african innings adding before the introduction of part timer mark waugh to the attack paid off for australia waugh removed boucher for caught by bichel brett lee then chipped in trapping mckenzie leg before for with perfect inswinger bichel continued his good day in the field running out claude henderson for with direct hit from the in field lee roared in to allan donald bouncing him and then catching the edge with rising delivery which ricky ponting happily swallowed at third slip to remove the returning paceman for duck bichel did not get his hat trick but ended with the best figures of the australian bowlers after also picking up the final wicket of nantie haywood for lee took for and glenn mcgrath for»
- LEAST (157, -0.10524928569793701): «british man has been found guilty by unanimous verdict of the kidnap and murder of an eight year old schoolgirl whose death in july shocked britain and set off rampage of anti paedophile vigilantes roy whiting was sentenced to life imprisonment for the abduction and murder of eight year old sarah payne with recommendation by trial judge justice richard curtis that he never be released you are indeed an evil man you are in no way mentally unwell have seen you for month and in my view you are glib and cunning liar justice curtis said there were cheers of delight as the verdicts were read out by the foreman at lewes crown court the jury of nine men and three women had been deliberating for nine hours as soon as the verdicts were declared the court heard details of whiting previous conviction for the kidnap and indecent assault of nine year old girl in prosecutor timothy langdale told the jury how the defendant threw the child into the back of his dirty red ford sierra and locked the doors he had driven her somewhere she didn know where when she asked where they were going he said shut up because he had knife mr langdale said the defendant told the girl to take off her clothes when she refused he produced rope from his pocket and threatened to tie her up what he actually threatened was that he would tie her mouth up she took her clothes off as he had ordered her to do mr langdale then gave graphic details of the abuse to which whiting subjected the terrified child whiting was given four year jail sentence in june after admitting carrying out the attack in march that year but he was released in november despite warnings from probation officers who were convinced there was danger he would attack another child they set out their warnings in pre sentence report prepared after the first assault and in the parole report before he was released from prison he was kept under supervision for four months after his release but was not being monitored by july last year when eight year old sarah was abducted and killed whiting has been arrested three times in connection with the case but the first and second times was released without being charged sarah disappeared on july last year prompting massive police search her partially buried naked body was found days later in field and police believe she was strangled or suffocated»

+.. GENERATED FROM PYTHON SOURCE LINES 335-360

Conclusion
----------
@@ -794,30 +723,25 @@ If you'd like to know more about the subject matter of this tutorial, check out

.. rst-class:: sphx-glr-timing

- **Total running time of the script:** ( 0 minutes 7.863 seconds)
+ **Total running time of the script:** ( 0 minutes 16.509 seconds)

-**Estimated memory usage:** 37 MB
+**Estimated memory usage:** 48 MB

.. _sphx_glr_download_auto_examples_tutorials_run_doc2vec_lee.py:

+.. only:: html
-.. only :: html
-
-  .. container:: sphx-glr-footer
-    :class: sphx-glr-footer-example
-
-
-
-  .. container:: sphx-glr-download sphx-glr-download-python
+  .. container:: sphx-glr-footer sphx-glr-footer-example
-    :download:`Download Python source code: run_doc2vec_lee.py `
+  .. container:: sphx-glr-download sphx-glr-download-python
+    :download:`Download Python source code: run_doc2vec_lee.py `
-  .. container:: sphx-glr-download sphx-glr-download-jupyter
+  .. container:: sphx-glr-download sphx-glr-download-jupyter
-    :download:`Download Jupyter notebook: run_doc2vec_lee.ipynb `
+    :download:`Download Jupyter notebook: run_doc2vec_lee.ipynb `

.. only:: html

diff --git a/docs/src/auto_examples/tutorials/sg_execution_times.rst b/docs/src/auto_examples/tutorials/sg_execution_times.rst
index 0b10d0ae69..eb3d2c3c8c 100644
--- a/docs/src/auto_examples/tutorials/sg_execution_times.rst
+++ b/docs/src/auto_examples/tutorials/sg_execution_times.rst
@@ -5,22 +5,21 @@
Computation times
=================
-**00:36.418** total execution time for **auto_examples_tutorials** files:
-+-------------------------------------------------------------------------------------+-----------+-----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_wmd.py` (``run_wmd.py``) | 00:36.418 | 7551.3 MB |
-+-------------------------------------------------------------------------------------+-----------+-----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_annoy.py` (``run_annoy.py``) | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------------+-----------+-----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` (``run_doc2vec_lee.py``) | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------------+-----------+-----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_ensemblelda.py` (``run_ensemblelda.py``) | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------------+-----------+-----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_fasttext.py` (``run_fasttext.py``) | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------------+-----------+-----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_lda.py` (``run_lda.py``) | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------------+-----------+-----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_scm.py` (``run_scm.py``) | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------------+-----------+-----------+
-| :ref:`sphx_glr_auto_examples_tutorials_run_word2vec.py` (``run_word2vec.py``) | 00:00.000 | 0.0 MB |
-+-------------------------------------------------------------------------------------+-----------+-----------+
++-------------------------------------------------------------------------------------+-----------+---------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_doc2vec_lee.py` (``run_doc2vec_lee.py``) | 00:16.509 | 48.4 MB |
++-------------------------------------------------------------------------------------+-----------+---------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_annoy.py` (``run_annoy.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+---------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_ensemblelda.py` (``run_ensemblelda.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+---------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_fasttext.py` (``run_fasttext.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+---------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_lda.py` (``run_lda.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+---------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_scm.py` (``run_scm.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+---------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_wmd.py` (``run_wmd.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+---------+
+| :ref:`sphx_glr_auto_examples_tutorials_run_word2vec.py` (``run_word2vec.py``) | 00:00.000 | 0.0 MB |
++-------------------------------------------------------------------------------------+-----------+---------+

diff --git a/docs/src/gallery/tutorials/run_doc2vec_lee.py b/docs/src/gallery/tutorials/run_doc2vec_lee.py
index 7012d38f66..18f4ee7b16 100644
--- a/docs/src/gallery/tutorials/run_doc2vec_lee.py
+++ b/docs/src/gallery/tutorials/run_doc2vec_lee.py
@@ -215,9 +215,15 @@ def read_corpus(fname, tokens_only=False):
###############################################################################
# Next, train the model on the corpus.
-# If optimized Gensim (with BLAS library) is being used, this should take no more than 3 seconds.
-# If the BLAS library is not being used, this should take no more than 2
-# minutes, so use optimized Gensim with BLAS if you value your time.
+# In the usual case, where your Gensim installation has found a BLAS library
+# for optimized bulk vector operations, training on this tiny 300-document,
+# ~60k-word corpus should take just a few seconds. (More realistic datasets of
+# tens of millions of words or more take proportionately longer.) If for some
+# reason a BLAS library isn't available, training uses a fallback approach that
+# takes 60x-120x longer, so even this tiny training run will take minutes
+# rather than seconds. (In that case, you should also notice a warning in the
+# logging that tells you there's something worth fixing.) So, be sure your
+# installation uses the BLAS-optimized Gensim if you value your time.
#
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
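A quick editorial self-check for the claim above (illustrative, not part of the
patch): time the training call, and treat minutes on this corpus as a sign of
the slow non-BLAS fallback path.

.. code-block:: default

    import time

    start = time.time()
    model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)
    elapsed = time.time() - start
    # A few seconds here is consistent with BLAS-optimized routines; minutes
    # suggest the fallback path (and a corresponding warning in the log).
    print('training took %.1fs' % elapsed)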