Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix docstrings for gensim.sklearn_api. Fix #1667 #1895

Merged
merged 52 commits into from
Mar 15, 2018

Conversation

steremma
Copy link
Contributor

In this PR I am working on the source documentation for the sklearn API.
More wrappers will be submitted in subsequent pushes to the same PR until they are all complete.

@@ -5,11 +5,6 @@
# Copyright (C) 2017 Radim Rehurek <[email protected]>
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html

"""
Scikit learn interface for gensim for easy use of gensim with scikit-learn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some __doc__ definitely needed

Base LSI module
"""Base LSI module.

Scikit learn interface for `gensim.models.lsimodel` for easy use of gensim with scikit-learn.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use links

:class:`~gensim.model.lsimodel.LsiModel`

here and everywhere

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, explicit mention "if you want to read more about it, please look into original class :class:..."


Parameters
----------
num_topics : int, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wdyt about the link to original method only (for avoiding duplication)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also thought about that and I am not sure what is better. On one hand now we have duplication but on the other hand its easier for the developer and user to see the documentation in one tab. Because not all parameters are propagated to the inner model, some of the parameters will be visible in the wrapper and some in the original model (you would need 2 tabs open). I am a bit in favor of duplicating but not 100% sure so if you prefer I will remove duplication.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, maybe combine both approaches: mentioned parameter & type here, but for description - sent the user to the parameter from original class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok that sounds reasonable, I will apply asap

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steremma we discuss this questions again and this isn't good idea, because it's OK if user look into documentation online (and have a link), but if user use python/jupyter, he will call something like help(model) or model? and for this case, links don't work :( (and this is the main problem). For this reason - can you return descriptions for parameters, copy-paste is the lesser evil than docstring, that exists, but useless, if you can't read it in your interpreter.

Also, the link to original class must be in any case too.

- `lsimodel`
- `text2bow`
- `phrases`

* Added `doc` in every file

* Provided sphinx style links to parameter types referencing gensim classes.

* Propagated arguments are still duplicated for readability - maybe remove?
…r explanation of their meaning the reader is redirected to the original models documentation

Parameters
----------
num_topics : int, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@steremma we discuss this questions again and this isn't good idea, because it's OK if user look into documentation online (and have a link), but if user use python/jupyter, he will call something like help(model) or model? and for this case, links don't work :( (and this is the main problem). For this reason - can you return descriptions for parameters, copy-paste is the lesser evil than docstring, that exists, but useless, if you can't read it in your interpreter.

Also, the link to original class must be in any case too.

--------
Integrate with sklearn Pipelines:

>>> model = LsiTransformer(num_topics=15, id2word=id2word)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great (examples is the really nice idea), but please be sure, that example executable (i.e. you write all needed imports & define data).

You can check that all works easily with python -m doctest path/to/file/with/examples.py

@menshikh-iv
Copy link
Contributor

menshikh-iv commented Feb 20, 2018

@steremma please be accurate with PEP8 checks, look to https://travis-ci.org/RaRe-Technologies/gensim/jobs/343473932#L509 and resolve it.
You can easily check it locally before push, like tox -e flake8

@steremma
Copy link
Contributor Author

@menshikh-iv this PR is a work in progress I only push to sync between different workstations (I opened it so that you can easily track the status). Sometimes I have to leave one workstation quickly so I just push to continue from another one. In the end the commits will be squashed so I hope it won't be a problem. We can close the PR for now and re-open when its ready for review and merge.

>>> # Lets represent each document using a 50 dimensional vector
>>> model = D2VTransformer(min_count=1, size=50)
>>> docvecs = model.fit_transform(common_texts)
>>> assert docvecs.shape == (len(common_texts), 50)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe something more interesting (like calculation similarity between documents)?


>>> # Fit our pipeline to some corpus
>>> corpus = [id2word.doc2bow(i.split()) for i in data.data]
>>> fitted_pipeline = pipe.fit(corpus, data.target)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add some evaluation here (what's score here?)

>>>
>>> # Create an ID to word mapping using some corpus included in sklearn.
>>> cats = ['rec.sport.baseball', 'sci.crypt']
>>> data = fetch_20newsgroups(subset='train', categories=cats, shuffle=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use gensim-data here (20-news available)

>>>
>>> # What is the vector representation of the word 'graph'?
>>> wordvecs = model.fit(common_texts).transform(['graph', 'system'])
>>> assert wordvecs.shape == (2, 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call most_similar?

Copy link
Contributor

@menshikh-iv menshikh-iv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, you missed ldaseqmodel, please add this too


For more information on the inner workings please take a look at the original class. The model's internal workings
are heavily based on `"The Author-Topic Model for Authors and Documents", Osen-Zvi et. al 2004
<https://mimno.infosci.cornell.edu/info6150/readings/398.pdf>`_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention paper only in original class (not here), wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will replace references with some text <http://...>_

Examples
--------

>>> from gensim.test.utils import common_dictionary, common_corpus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unindent example please (here and everywhere)

@menshikh-iv menshikh-iv added RFM incubator project PR is RaRe incubator project labels Mar 5, 2018
@menshikh-iv menshikh-iv changed the title Sklearn API docstrings Fix docstrings for gensim.sklearn_api. Fix #1667 Mar 9, 2018
@menshikh-iv
Copy link
Contributor

Current PR will fix #1895

@menshikh-iv menshikh-iv merged commit 75a2309 into piskvorky:develop Mar 15, 2018
@steremma steremma deleted the sklearn-api-docs branch March 15, 2018 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
incubator project PR is RaRe incubator project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants