Fix docstrings for gensim.sklearn_api. Fix #1667 #1895

Merged
52 commits merged, Mar 15, 2018

Changes from 4 commits

Commits
4cee8fa
fixed docstring for `sklearn_api.lsimodel`
steremma Feb 10, 2018
ab0303c
removed duplicated comment
steremma Feb 10, 2018
4dc001f
Fixed docstring for `sklearn_api.text2bow`
steremma Feb 10, 2018
69faf41
Fixed docstrings for `sklearn_api.phrases`
steremma Feb 10, 2018
5052dfb
Applied code review corrections in sklearn wrappers for:
steremma Feb 12, 2018
c027203
constructor docstrings now only mention the type of each argument. Fo…
steremma Feb 12, 2018
3815605
Brought back parameter explanation in the wrappers for easier lookup
steremma Feb 13, 2018
c1e05df
added examples to __doc__, work still in progress
steremma Feb 15, 2018
4cfbf5c
added simple and executable examples to `__doc__`
steremma Feb 15, 2018
f2615ef
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
steremma Feb 19, 2018
3581a46
temp work on some more wrappers
steremma Feb 19, 2018
8ef1105
finished docstrings for LDA wrapper, examples pending
steremma Feb 19, 2018
add7420
finished doc2vec wrapper with example
steremma Feb 20, 2018
38a610f
completed LDA wrapper including example
steremma Feb 20, 2018
5f00f34
finished the tfidf wrapper including example
steremma Feb 20, 2018
1d8c63c
PEP-8 corrections
steremma Feb 20, 2018
f8fffd6
w2v documentation - example result pending
steremma Feb 21, 2018
c866af0
Merge branch 'sklearn-api-docs' of https://github.com/steremma/gensim…
steremma Feb 21, 2018
3cf28a3
fixed w2v example
steremma Feb 21, 2018
b55a2a2
added documentation for the lda sequential model - examples pending
steremma Feb 22, 2018
6c1aeb8
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
steremma Feb 24, 2018
b0600cd
added documentation for the author topic sklearn wrapper including ex…
steremma Feb 24, 2018
e2ca72f
improved example by presenting a way to get a pipeline score
steremma Feb 24, 2018
f66abbb
improved example using similarities
steremma Feb 24, 2018
e4dc868
added documentation and examples for the rp and hdp models
steremma Feb 24, 2018
8df7ce5
minor example improvements
steremma Feb 25, 2018
dc33b91
fixed reference
steremma Feb 25, 2018
836af6f
removed reference
steremma Feb 25, 2018
4a3ce08
fix doc building
menshikh-iv Feb 27, 2018
ef5d7ab
Merge branch 'sklearn-api-docs' of https://github.com/steremma/gensim…
steremma Feb 27, 2018
4285741
unidented examples and fixed paper references
steremma Feb 27, 2018
2f02cfe
Merge branch 'sklearn-api-docs' of https://github.com/steremma/gensim…
steremma Feb 28, 2018
0c56ae9
finalized ldaseq wrapper
steremma Feb 28, 2018
64f8d4f
fix __init__
menshikh-iv Mar 13, 2018
9b4c375
Merge remote-tracking branch 'upstream/develop' into sklearn-api-docs
menshikh-iv Mar 13, 2018
7a204e1
resolve merge-conflict with pivot norm
menshikh-iv Mar 13, 2018
39bbe31
fix atmodel
menshikh-iv Mar 15, 2018
20ea33e
fix atmodel[2]
menshikh-iv Mar 15, 2018
31fb94e
fix d2vmodel
menshikh-iv Mar 15, 2018
4432b77
fix hdp + small fixes
menshikh-iv Mar 15, 2018
e729a26
fix ldamodel + small fixes
menshikh-iv Mar 15, 2018
14fcf22
small fixes
menshikh-iv Mar 15, 2018
07a8cba
fix ldaseqmodel
menshikh-iv Mar 15, 2018
5325d05
small fixes (again)
menshikh-iv Mar 15, 2018
b250ca4
fix lsimodel
menshikh-iv Mar 15, 2018
3fc3bef
fix phrases
menshikh-iv Mar 15, 2018
dc9f659
fix rpmodel
menshikh-iv Mar 15, 2018
4ec4619
fix text2bow
menshikh-iv Mar 15, 2018
36a263a
fix tfidf
menshikh-iv Mar 15, 2018
ae4a5b4
fix word2vec
menshikh-iv Mar 15, 2018
0ad6580
cleanup
menshikh-iv Mar 15, 2018
8a45bef
cleanup[2]
menshikh-iv Mar 15, 2018
84 changes: 65 additions & 19 deletions gensim/sklearn_api/lsimodel.py
@@ -5,11 +5,6 @@
# Copyright (C) 2017 Radim Rehurek <[email protected]>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Scikit learn interface for gensim for easy use of gensim with scikit-learn
Contributor: Some __doc__ definitely needed

Follows scikit-learn API conventions
"""

import numpy as np
from scipy import sparse
from sklearn.base import TransformerMixin, BaseEstimator
@@ -20,14 +15,36 @@


class LsiTransformer(TransformerMixin, BaseEstimator):
"""
Base LSI module
"""Base LSI module.

Scikit learn interface for `gensim.models.lsimodel` for easy use of gensim with scikit-learn.
Contributor: please use links

:class:`~gensim.model.lsimodel.LsiModel`

here and everywhere

Contributor: Also, explicit mention "if you want to read more about it, please look into original class :class:..."

Follows scikit-learn API conventions.

"""

def __init__(self, num_topics=200, id2word=None, chunksize=20000,
decay=1.0, onepass=True, power_iters=2, extra_samples=100):
"""
Sklearn wrapper for LSI model. See gensim.model.LsiModel for parameter details.
"""Sklearn wrapper for LSI model.

Parameters
----------
num_topics : int, optional
Contributor: Wdyt about the link to original method only (for avoiding duplication)?

Contributor Author: I also thought about that and I am not sure what is better. On one hand we now have duplication, but on the other hand it's easier for the developer and user to see the documentation in one tab. Because not all parameters are propagated to the inner model, some of the parameters would be visible in the wrapper and some in the original model (you would need 2 tabs open). I am a bit in favor of duplicating, but not 100% sure, so if you prefer I will remove the duplication.

Contributor: So, maybe combine both approaches: mention the parameter & type here, but for the description - send the user to the parameter of the original class?

Contributor Author: Ok, that sounds reasonable, I will apply asap.

Contributor: @steremma we discussed this question again and this isn't a good idea, because it's OK if the user looks into the documentation online (and has a link), but if the user uses python/jupyter, he will call something like help(model) or model? and in this case links don't work :( (and this is the main problem). For this reason - can you return the descriptions for the parameters? Copy-paste is the lesser evil than a docstring that exists but is useless if you can't read it in your interpreter.

Also, the link to the original class must be in any case too.

Number of requested factors (latent dimensions).
id2word : dict of {int: str}, optional
ID to word mapping.
chunksize : int, optional
Number of documents to be used in each training chunk.
decay : float, optional
Weight of existing observations relative to new ones.
onepass : bool, optional
Whether the one-pass algorithm should be used for training.
Pass `False` to force a multi-pass stochastic algorithm.
power_iters : int, optional
Number of power iteration steps to be used.
Increasing the number of power iterations improves accuracy, but lowers performance.
extra_samples : int, optional
Extra samples to be used besides the rank `k`. Can improve accuracy.

"""
self.gensim_model = None
self.num_topics = num_topics
@@ -42,6 +59,17 @@ def fit(self, X, y=None):
"""
Fit the model according to the given training data.
Calls gensim.models.LsiModel.

Parameters
----------
X : iterable of iterable of (int, float)
Stream of document vectors or sparse matrix of shape: [num_terms, num_documents].

Returns
-------
LsiTransformer
The trained model.

"""
if sparse.issparse(X):
corpus = matutils.Sparse2Corpus(sparse=X, documents_columns=False)
@@ -55,14 +83,18 @@
return self

def transform(self, docs):
"""
Takes a list of documents as input ('docs').
Returns a matrix of topic distribution for the given document bow, where a_ij
indicates (topic_i, topic_probability_j).
The input `docs` should be in BOW format and can be a list of documents like
[[(4, 1), (7, 1)],
[(9, 1), (13, 1)], [(2, 1), (6, 1)]]
or a single document like : [(4, 1), (7, 1)]
"""Computes the topic distribution matrix

Parameters
----------
docs : iterable of iterable of (int, float)
Stream of document vectors or sparse matrix of shape: [num_terms, num_documents].

Returns
-------
numpy.ndarray of shape [num_docs, num_topics]
Topic distribution matrix.

"""
if self.gensim_model is None:
raise NotFittedError(
@@ -78,8 +110,22 @@ def transform(self, docs):
return np.reshape(np.array(distribution), (len(docs), self.num_topics))

def partial_fit(self, X):
"""
Train model over X.
"""Train model over a potentially incomplete set of documents.

This method can be used in two ways:
1. On an unfitted model in which case the model is initialized and trained on `X`.
2. On an already fitted model in which case the model is **further** trained on `X`.

Parameters
----------
X : iterable of iterable of (int, float)
Stream of document vectors or sparse matrix of shape: [num_terms, num_documents].

Returns
-------
LsiTransformer
The trained model.

"""
if sparse.issparse(X):
X = matutils.Sparse2Corpus(sparse=X, documents_columns=False)
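For quick reference, a minimal usage sketch of the LsiTransformer documented above (a sketch only: it assumes gensim 3.x, where gensim.sklearn_api ships with the library, and the toy corpus, topic count and resulting shapes are illustrative):

from gensim.corpora import Dictionary
from gensim.sklearn_api import LsiTransformer

# Toy corpus converted to the BOW format that fit/transform expect.
texts = [
    ["human", "computer", "interface"],
    ["graph", "trees", "minors"],
    ["graph", "minors", "survey"],
]
id2word = Dictionary(texts)
corpus = [id2word.doc2bow(text) for text in texts]

model = LsiTransformer(num_topics=2, id2word=id2word)
model.fit(corpus)                  # train on the full corpus ...
vectors = model.transform(corpus)  # ... then project it: ndarray of shape (3, 2)

# partial_fit trains incrementally, e.g. when documents arrive in batches.
model.partial_fit([id2word.doc2bow(["human", "graph", "survey"])])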
105 changes: 97 additions & 8 deletions gensim/sklearn_api/phrases.py
@@ -17,14 +17,62 @@


class PhrasesTransformer(TransformerMixin, BaseEstimator):
"""
Base Phrases module
"""Base Phrases module

Scikit learn interface for `gensim.models.phrases` for easy use of gensim with scikit-learn.
Follows scikit-learn API conventions.

"""

def __init__(self, min_count=5, threshold=10.0, max_vocab_size=40000000,
delimiter=b'_', progress_per=10000, scoring='default'):
"""
Sklearn wrapper for Phrases model.
"""Sklearn wrapper for Phrases model.

Parameters
----------
min_count : int
Terms with a count lower than this will be ignored.
threshold : float
Only phrases scoring above this will be accepted, see `scoring` below.
max_vocab_size : int
Maximum size of the vocabulary.
Used to control pruning of less common words, to keep memory under control.
The default of 40M needs about 3.6GB of RAM.
delimiter : str
Character used to join collocation tokens. Should be a byte string (e.g. b'_').
progress_per : int
Training will report to the logger every time that many phrases have been learned.
scoring : str or callable
Specifies how potential phrases are scored for comparison to the `threshold`
setting. `scoring` can be set with either a string that refers to a built-in scoring function,
or with a function with the expected parameter names. Two built-in scoring functions are available
by setting `scoring` to a string:

'default': from [1]_.
'npmi': normalized pointwise mutual information, from [2]_.

'npmi' is more robust when dealing with common words that form part of common bigrams, and
ranges from -1 to 1, but is slower to calculate than the default.

To use a custom scoring function, create a function with the following parameters and set the `scoring`
parameter to the custom function. The function must accept all of these parameters, even if it
does not use them all.

worda_count: number of occurrences in `sentences` of the first token in the phrase being scored
wordb_count: number of occurrences in `sentences` of the second token in the phrase being scored
bigram_count: number of occurrences in `sentences` of the phrase being scored
len_vocab: the number of unique tokens in `sentences`
min_count: the `min_count` setting of the Phrases class
corpus_word_count: the total number of (non-unique) tokens in `sentences`

A scoring function missing any of these parameters (even if the parameters are not used) will
raise a ValueError on initialization of the Phrases class. The scoring function must be pickleable.

References
----------
.. [1] "Efficient Estimaton of Word Representations in Vector Space" by Mikolov, et. al.
.. [2] "Normalized (Pointwise) Mutual Information in Colocation Extraction" by Gerlof Bouma.

"""
self.gensim_model = None
self.min_count = min_count
@@ -35,8 +83,18 @@ def __init__(self, min_count=5, threshold=10.0, max_vocab_size=40000000,
self.scoring = scoring

def fit(self, X, y=None):
"""
Fit the model according to the given training data.
"""Fit the model according to the given training data.

Parameters
----------
X : iterable of list of str
Sequence of sentences to be used for training the model.

Returns
-------
PhrasesTransformer
The trained model.

"""
self.gensim_model = models.Phrases(
sentences=X, min_count=self.min_count, threshold=self.threshold,
@@ -46,9 +104,22 @@
return self

def transform(self, docs):
"""Transform the input documents into phrase tokens.

Words forming a detected phrase will be joined by the `delimiter` character (`_` by default).

Parameters
----------
docs : iterable of list of str
Sequence of sentences to be transformed.

Returns
-------
list of list of str
Phrase representation for each of the input sentences.

"""
Return the input documents to return phrase tokens.
"""

if self.gensim_model is None:
raise NotFittedError(
"This model has not been fitted yet. Call 'fit' with appropriate arguments before using this method."
@@ -60,6 +131,24 @@ def transform(self, docs):
return [self.gensim_model[doc] for doc in docs]

def partial_fit(self, X):
"""Train model over a potentially incomplete set of sentences.

This method can be used in two ways:
1. On an unfitted model in which case the model is initialized and trained on `X`.
2. On an already fitted model in which case the X sentences are **added** to the vocabulary.

Parameters
----------
X : iterable of list of str
Sequence of sentences to be used for training the model.

Returns
-------
PhrasesTransformer
The trained model.

"""

if self.gensim_model is None:
self.gensim_model = models.Phrases(
sentences=X, min_count=self.min_count, threshold=self.threshold,
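A minimal usage sketch for the PhrasesTransformer documented above, including a custom scoring function with the six-parameter signature described in the docstring (a sketch only: the tiny corpus, the lowered thresholds and the detected bigram are illustrative, not guaranteed output):

from gensim.sklearn_api.phrases import PhrasesTransformer

sentences = [
    ["new", "york", "is", "big"],
    ["new", "york", "was", "founded", "long", "ago"],
    ["i", "love", "new", "york"],
]

# The defaults (min_count=5, threshold=10.0) would reject everything
# in a corpus this small, so both are lowered for the example.
model = PhrasesTransformer(min_count=1, threshold=0.5)
model.fit(sentences)
print(model.transform([["new", "york", "is", "big"]]))
# e.g. [['new_york', 'is', 'big']]

# A custom scorer must accept all six parameters, even unused ones,
# and must be pickleable.
def frequency_scorer(worda_count, wordb_count, bigram_count,
                     len_vocab, min_count, corpus_word_count):
    return bigram_count  # rank bigrams by raw corpus frequency

custom = PhrasesTransformer(min_count=1, threshold=1.0, scoring=frequency_scorer)
custom.fit(sentences)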
68 changes: 55 additions & 13 deletions gensim/sklearn_api/text2bow.py
@@ -4,11 +4,6 @@
# Copyright (C) 2011 Radim Rehurek <[email protected]>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Scikit learn interface for gensim for easy use of gensim with scikit-learn
Follows scikit-learn API conventions
"""

from six import string_types
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.exceptions import NotFittedError
@@ -18,29 +13,59 @@


class Text2BowTransformer(TransformerMixin, BaseEstimator):
"""
Base Text2Bow module
"""Base Text2Bow module

Scikit learn interface for `gensim.corpora.Dictionary` for easy use of gensim with scikit-learn.
Follows scikit-learn API conventions.

"""

def __init__(self, prune_at=2000000, tokenizer=tokenize):
"""
Sklearn wrapper for Text2Bow model.
"""Sklearn wrapper for Text2Bow model.

Parameters
----------
prune_at : int, optional
Total number of unique words; the dictionary will keep no more than `prune_at` words.
tokenizer : callable (str -> list of str), optional
A callable to split a document into a list of terms.

"""
self.gensim_model = None
self.prune_at = prune_at
self.tokenizer = tokenizer

def fit(self, X, y=None):
"""
Fit the model according to the given training data.
"""Fit the model according to the given training data.

Parameters
----------
X : iterable of str
A collection of documents used for training the model.

Returns
-------
Text2BowTransformer
The trained model.

"""
tokenized_docs = [list(self.tokenizer(x)) for x in X]
self.gensim_model = Dictionary(documents=tokenized_docs, prune_at=self.prune_at)
return self

def transform(self, docs):
"""
Return the BOW format for the input documents.
"""Return the BOW format for the input documents.

Parameters
----------
docs : iterable of str
A collection of documents to be transformed.

Returns
-------
iterable of list of (int, int)
The BOW representation of each document.

"""
if self.gensim_model is None:
raise NotFittedError(
@@ -54,6 +79,23 @@ def transform(self, docs):
return [self.gensim_model.doc2bow(doc) for doc in tokenized_docs]

def partial_fit(self, X):
"""Train model over a potentially incomplete set of documents.

This method can be used in two ways:
1. On an unfitted model in which case the dictionary is initialized and trained on `X`.
2. On an already fitted model in which case the dictionary is **expanded** by `X`.

Parameters
----------
X : iterable of str
A collection of documents used to train the model.

Returns
-------
Text2BowTransformer
The trained model.

"""
if self.gensim_model is None:
self.gensim_model = Dictionary(prune_at=self.prune_at)

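A minimal usage sketch for the Text2BowTransformer documented above (a sketch only: token IDs depend on dictionary construction order, so the printed pairs are illustrative):

from gensim.sklearn_api.text2bow import Text2BowTransformer

docs = [
    "the quick brown fox jumps",
    "the lazy dog sleeps",
    "a quick brown dog barks",
]

model = Text2BowTransformer()
model.fit(docs)
print(model.transform(["quick brown dog"]))
# e.g. [[(1, 1), (2, 1), (7, 1)]] -- a list of (token_id, count) pairs per document

# partial_fit expands the dictionary with previously unseen documents.
model.partial_fit(["a quick red fox"])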