Fix docstrings for gensim.sklearn_api. Fix #1667 #1895
Changes from 4 commits
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,11 +5,6 @@ | |
# Copyright (C) 2017 Radim Rehurek <[email protected]> | ||
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | ||
|
||
""" | ||
Scikit learn interface for gensim for easy use of gensim with scikit-learn | ||
Follows scikit-learn API conventions | ||
""" | ||
|
||
import numpy as np | ||
from scipy import sparse | ||
from sklearn.base import TransformerMixin, BaseEstimator | ||
|
@@ -20,14 +15,36 @@ | |
|
||
|
||
class LsiTransformer(TransformerMixin, BaseEstimator): | ||
""" | ||
Base LSI module | ||
"""Base LSI module. | ||
|
||
Scikit learn interface for `gensim.models.lsimodel` for easy use of gensim with scikit-learn. | ||
Review comment: please use links here and everywhere.
Review comment: Also, explicitly mention "if you want to read more about it, please look into original class :class: |
||
Follows scikit-learn API conventions. | ||
|
||
""" | ||
|
||
def __init__(self, num_topics=200, id2word=None, chunksize=20000, | ||
decay=1.0, onepass=True, power_iters=2, extra_samples=100): | ||
""" | ||
Sklearn wrapper for LSI model. See gensim.model.LsiModel for parameter details. | ||
"""Sklearn wrapper for LSI model. | ||
|
||
Parameters | ||
---------- | ||
num_topics : int, optional | ||
Review comment: Wdyt about linking only to the original method (to avoid duplication)?
Review comment: I also thought about that and I am not sure which is better. On one hand we now have duplication, but on the other hand it's easier for the developer and the user to see the documentation in one tab. Because not all parameters are propagated to the inner model, some of the parameters will be visible in the wrapper and some in the original model (you would need 2 tabs open). I am a bit in favor of duplicating, but not 100% sure, so if you prefer I will remove the duplication.
Review comment: So, maybe combine both approaches: mention the parameter and type here, but for the description send the user to the parameter in the original class?
Review comment: Ok, that sounds reasonable, I will apply it asap.
Review comment: @steremma we discussed this question again and this isn't a good idea, because it's OK if the user looks at the documentation online (and has a link), but if user use
Also, the link to the original class must be there in any case too. |
||
Number of requested factors (latent dimensions). | ||
id2word : dict of {int: str}, optional | ||
ID to word mapping. | ||
chunksize : int, optional | ||
Number of documents to be used in each training chunk. | ||
decay : float, optional | ||
Weight of existing observations relative to new ones. | ||
onepass : bool, optional | ||
Whether the one-pass algorithm should be used for training. | ||
Pass `False` to force a multi-pass stochastic algorithm. | ||
power_iters : int, optional | ||
Number of power iteration steps to be used. | ||
Increasing the number of power iterations improves accuracy, but lowers performance. | ||
extra_samples : int, optional | ||
Extra samples to be used besides the rank `k`. Can improve accuracy. | ||
|
||
""" | ||
self.gensim_model = None | ||
self.num_topics = num_topics | ||
|
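The review thread above appears to lean toward keeping each parameter's name, type, and description in the wrapper docstring, with a link to the original class in any case. A minimal sketch of that layout follows. `ToyModel` and `ToyWrapper` are hypothetical names used only for illustration, and the Sphinx `:class:` target is a placeholder rather than a path taken from this PR.

```python
class ToyModel:
    """Hypothetical wrapped model, standing in for the original class."""

    def __init__(self, num_topics=200):
        self.num_topics = num_topics


class ToyWrapper:
    """Hypothetical scikit-learn style wrapper around :class:`ToyModel`.

    For more information, see the original class :class:`ToyModel`.

    Parameters
    ----------
    num_topics : int, optional
        Number of requested factors (latent dimensions).

    """

    def __init__(self, num_topics=200):
        self.num_topics = num_topics
        self.model = None  # no wrapped model until the wrapper is fitted


# The description stays readable without a browser, e.g. through __doc__.
print(ToyWrapper.__doc__)
```

With this layout the parameter description is available wherever the docstring is shown, and the link still points readers to the wrapped class for further detail.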
@@ -42,6 +59,17 @@ def fit(self, X, y=None): | |
""" | ||
Fit the model according to the given training data. | ||
Calls gensim.models.LsiModel | ||
|
||
Parameters | ||
---------- | ||
X : iterable of iterable of (int, float) | ||
Stream of document vectors or sparse matrix of shape: [num_documents, num_terms]. | ||
|
||
Returns | ||
------- | ||
LsiTransformer | ||
The trained model. | ||
|
||
""" | ||
if sparse.issparse(X): | ||
corpus = matutils.Sparse2Corpus(sparse=X, documents_columns=False) | ||
|
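The `fit` docstring above accepts either a stream of bag-of-words documents or a sparse matrix. As a rough illustration of how the two relate, the sketch below turns a small row-per-document count matrix into lists of `(term_id, weight)` pairs using only the standard library. The helper `rows_to_bow` is hypothetical and is not gensim's `matutils.Sparse2Corpus`; it only mimics the idea of treating each row as one document, which is what `documents_columns=False` requests.

```python
def rows_to_bow(matrix):
    """Yield one bag-of-words document per row of a dense count matrix.

    Hypothetical helper for illustration only: each row is a document,
    each column index is a term id, and zero entries are skipped.
    """
    for row in matrix:
        yield [(term_id, float(count)) for term_id, count in enumerate(row) if count != 0]


# Three documents over a vocabulary of four terms (rows are documents).
counts = [
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 3],
]

corpus = list(rows_to_bow(counts))
for doc in corpus:
    print(doc)
```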
@@ -55,14 +83,18 @@ def fit(self, X, y=None): | |
return self | ||
|
||
def transform(self, docs): | ||
""" | ||
Takes a list of documents as input ('docs'). | ||
Returns a matrix of topic distribution for the given document bow, where a_ij | ||
indicates (topic_i, topic_probability_j). | ||
The input `docs` should be in BOW format and can be a list of documents like | ||
[[(4, 1), (7, 1)], | ||
[(9, 1), (13, 1)], [(2, 1), (6, 1)]] | ||
or a single document like : [(4, 1), (7, 1)] | ||
"""Computes the topic distribution matrix. | ||
|
||
Parameters | ||
---------- | ||
docs : iterable of iterable of (int, float) | ||
Stream of document vectors or sparse matrix of shape: [num_terms, num_documents]. | ||
|
||
Returns | ||
------- | ||
numpy.ndarray | ||
Topic distribution matrix of shape [num_docs, num_topics]. | ||
|
||
""" | ||
if self.gensim_model is None: | ||
raise NotFittedError( | ||
|
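The `transform` docstring above describes a result of shape [num_docs, num_topics]. The sketch below shows, under simplifying assumptions, how per-document lists of `(topic_id, weight)` pairs can be expanded into such a dense matrix. `to_dense` is a hypothetical stand-in written with plain lists, and the topic weights are made up; the actual method relies on gensim and NumPy, as the surrounding diff shows.

```python
def to_dense(sparse_doc, length):
    """Expand [(index, value), ...] into a dense list of the given length."""
    dense = [0.0] * length
    for index, value in sparse_doc:
        dense[index] = float(value)
    return dense


num_topics = 3
# Hypothetical per-document topic weights, as (topic_id, weight) pairs.
topic_weights = [
    [(0, 0.5), (2, -0.25)],
    [(1, 1.0)],
]

# One row per document, one column per topic.
matrix = [to_dense(doc, num_topics) for doc in topic_weights]
print(len(matrix), len(matrix[0]))
```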
@@ -78,8 +110,22 @@ def transform(self, docs): | |
return np.reshape(np.array(distribution), (len(docs), self.num_topics)) | ||
|
||
def partial_fit(self, X): | ||
""" | ||
Train model over X. | ||
"""Train model over a potentially incomplete set of documents. | ||
|
||
This method can be used in two ways: | ||
1. On an unfitted model in which case the model is initialized and trained on `X`. | ||
2. On an already fitted model in which case the model is **further** trained on `X`. | ||
|
||
Parameters | ||
---------- | ||
X : iterable of iterable of (int, float) | ||
Stream of document vectors or sparse matrix of shape: [num_documents, num_terms]. | ||
|
||
Returns | ||
------- | ||
LsiTransformer | ||
The trained model. | ||
|
||
""" | ||
if sparse.issparse(X): | ||
X = matutils.Sparse2Corpus(sparse=X, documents_columns=False) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,11 +4,6 @@ | |
# Copyright (C) 2011 Radim Rehurek <[email protected]> | ||
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | ||
|
||
""" | ||
Scikit learn interface for gensim for easy use of gensim with scikit-learn | ||
Follows scikit-learn API conventions | ||
""" | ||
|
||
from six import string_types | ||
from sklearn.base import TransformerMixin, BaseEstimator | ||
from sklearn.exceptions import NotFittedError | ||
|
@@ -18,29 +13,59 @@ | |
|
||
|
||
class Text2BowTransformer(TransformerMixin, BaseEstimator): | ||
""" | ||
Base Text2Bow module | ||
"""Base Text2Bow module. | ||
|
||
Scikit learn interface for `gensim.corpora.Dictionary` for easy use of gensim with scikit-learn. | ||
Follows scikit-learn API conventions. | ||
|
||
""" | ||
|
||
def __init__(self, prune_at=2000000, tokenizer=tokenize): | ||
""" | ||
Sklearn wrapper for Text2Bow model. | ||
"""Sklearn wrapper for Text2Bow model. | ||
|
||
Parameters | ||
---------- | ||
prune_at : int, optional | ||
Total number of unique words. Dictionary will keep no more than `prune_at` words. | ||
tokenizer : callable (str -> list of str), optional | ||
A callable that splits a document into a list of terms. | ||
|
||
""" | ||
self.gensim_model = None | ||
self.prune_at = prune_at | ||
self.tokenizer = tokenizer | ||
|
||
def fit(self, X, y=None): | ||
""" | ||
Fit the model according to the given training data. | ||
"""Fit the model according to the given training data. | ||
|
||
Parameters | ||
---------- | ||
X : iterable of str | ||
A collection of documents used for training the model. | ||
|
||
Returns | ||
------- | ||
Text2BowTransformer | ||
The trained model. | ||
|
||
""" | ||
tokenized_docs = [list(self.tokenizer(x)) for x in X] | ||
self.gensim_model = Dictionary(documents=tokenized_docs, prune_at=self.prune_at) | ||
return self | ||
|
||
def transform(self, docs): | ||
""" | ||
Return the BOW format for the input documents. | ||
"""Return the BOW format for the input documents. | ||
|
||
Parameters | ||
---------- | ||
docs : iterable of str | ||
A collection of documents to be transformed. | ||
|
||
Returns | ||
------- | ||
iterable of list of (int, int) | ||
The BOW representation of each document. | ||
|
||
""" | ||
if self.gensim_model is None: | ||
raise NotFittedError( | ||
|
@@ -54,6 +79,23 @@ def transform(self, docs): | |
return [self.gensim_model.doc2bow(doc) for doc in tokenized_docs] | ||
|
||
def partial_fit(self, X): | ||
"""Train model over a potentially incomplete set of documents. | ||
|
||
This method can be used in two ways: | ||
1. On an unfitted model in which case the dictionary is initialized and trained on `X`. | ||
2. On an already fitted model in which case the dictionary is **expanded** by `X`. | ||
|
||
Parameters | ||
---------- | ||
X : iterable of str | ||
A collection of documents used to train the model. | ||
|
||
Returns | ||
------- | ||
Text2BowTransformer | ||
The trained model. | ||
|
||
""" | ||
if self.gensim_model is None: | ||
self.gensim_model = Dictionary(prune_at=self.prune_at) | ||
|
||
|
Review comment: Some `__doc__` definitely needed.
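The `Text2BowTransformer` docstrings in the diff above describe `fit`, `transform`, and the two ways `partial_fit` can be used. The following dependency-free sketch imitates that behaviour so the two cases are easy to see. `ToyText2Bow` is a hypothetical class, not the gensim implementation: it uses `str.split` as its tokenizer, ignores `prune_at`, and raises a plain `RuntimeError` where the real wrapper raises `NotFittedError`.

```python
from collections import Counter


class ToyText2Bow:
    """Hypothetical, dependency-free imitation of a text-to-BOW transformer."""

    def __init__(self, tokenizer=str.split):
        self.tokenizer = tokenizer
        self.token2id = None  # None marks the unfitted state

    def partial_fit(self, X):
        # Case 1: unfitted model, so start from an empty vocabulary.
        if self.token2id is None:
            self.token2id = {}
        # Case 2 (and the rest of case 1): expand the vocabulary with new tokens.
        for doc in X:
            for token in self.tokenizer(doc):
                self.token2id.setdefault(token, len(self.token2id))
        return self

    def fit(self, X, y=None):
        self.token2id = None  # discard any previous vocabulary
        return self.partial_fit(X)

    def transform(self, docs):
        if self.token2id is None:
            raise RuntimeError("This model has not been fitted yet.")
        result = []
        for doc in docs:
            # Count known tokens by id; unknown tokens are ignored.
            counts = Counter(
                self.token2id[t] for t in self.tokenizer(doc) if t in self.token2id
            )
            result.append(sorted(counts.items()))
        return result


model = ToyText2Bow().fit(["human computer interface"])
model.partial_fit(["computer graph graph"])  # expands the fitted vocabulary
bow = model.transform(["graph computer graph"])
print(bow)
```

Calling `partial_fit` on a fresh `ToyText2Bow()` would follow the first case instead, creating the vocabulary before adding tokens to it.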