Fix documentation for gensim.corpora. Partial fix #1671 #1729

Merged 54 commits into piskvorky:develop on Jan 22, 2018

Contributor

@anotherbugmaster anotherbugmaster commented Nov 21, 2017

Fix #1671

Docs formally comply with numpy style now but not all type annotations and descriptions are there.

Contributor

@menshikh-iv menshikh-iv left a comment

Please continue your work. What a voluminous PR 👍

Parameters
----------
fname : str
Serialized corpus's filename
Contributor

Put a dot at the end of each sentence (everywhere).

corpus : iterable
Iterable of documents
id2word : dict of (str, str), optional
Transforms id to word (Default value = None)
Contributor

No default values in docstrings (everywhere).

----------
fname : str
Serialized corpus's filename
fname_vocab : str or None, optional
Contributor

Need to understand how to:

  • Document multiple types of argument (i.e. when the parameter can be type X or Y)
  • Document multiple types for "Return" section
  • Correctly specify the parent class (if there are many subclasses)
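
For the first two points, one common numpy-style convention is to join the alternative types with "or", in the Parameters section and in the Returns section alike. A minimal sketch of that convention follows. `load_vocab` and its arguments are hypothetical names chosen for illustration and are not part of gensim; the third point (parent classes) is not covered here.

```python
def load_vocab(fname, fname_vocab=None):
    """Load the vocabulary that belongs to a serialized corpus.

    Hypothetical helper used only to illustrate the docstring layout.

    Parameters
    ----------
    fname : str
        Path to the serialized corpus (unused in this sketch).
    fname_vocab : str or None, optional
        Path to the vocabulary file, one word per line.

    Returns
    -------
    dict of (int, str) or None
        Mapping from word id to word, or None if `fname_vocab` is None.

    """
    if fname_vocab is None:
        return None
    # The line number is used as the word id (an assumption of this sketch).
    with open(fname_vocab, encoding="utf-8") as fin:
        return {idx: line.strip() for idx, line in enumerate(fin)}
```

Both alternatives are named in the type line, and the description says when each one applies, so the type line stays short.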

----------
fname : str
Filename
corpus : iterable
Contributor

iterable of ... ? (here and everywhere)


Returns
-------
list of (int, float)
Contributor

Missing parameter description (here and everywhere)

>>> corpus_with_random_access = gensim.corpora.SvmLightCorpus('tstfile.svmlight')
>>> print(corpus_with_random_access[1])
[(0, 1.0), (1, 2.0)]
>>> corpus = [[(1, 0.5)], [(0, 1.0), (1, 2.0)]]
Contributor

Examples should be executable and split into three sections: imports, data preparation, and direct functionality.

>>> from .. import ...
>>> import ...
>>>
>>> data = ...
>>> makesomething(data)
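
As a concrete instance of that layout, here is a sketch of a docstring whose example runs under `doctest`. The function `counts_to_bow` is a made-up stand-in so that the snippet needs only the standard library; a real gensim docstring would import from `gensim` instead.

```python
import doctest


def counts_to_bow(counts, token2id):
    """Convert token counts to a sorted bag-of-words document.

    Illustrative helper, not part of gensim.

    Parameters
    ----------
    counts : dict of (str, int)
        Number of occurrences of each token.
    token2id : dict of (str, int)
        Mapping from token to integer id; unknown tokens are skipped.

    Returns
    -------
    list of (int, float)
        Document in bag-of-words format, sorted by id.

    Examples
    --------
    >>> from collections import Counter
    >>>
    >>> tokens = ["human", "computer", "human"]
    >>> counts = Counter(tokens)
    >>> counts_to_bow(counts, {"computer": 0, "human": 1})
    [(0, 1.0), (1, 2.0)]

    """
    return sorted(
        (token2id[token], float(count))
        for token, count in counts.items()
        if token in token2id
    )


if __name__ == "__main__":
    # Runs every example found in this module's docstrings.
    doctest.testmod()
```

The empty `>>>` line separates the import from the data preparation, and the last statement is the only one whose output is checked.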

return [word for word in utils.to_unicode(s).strip().split(' ') if word]


class LowCorpus(IndexedCorpus):
-"""
-List_Of_Words corpus handles input in GibbsLda++ format.
+"""List_Of_Words corpus handles input in GibbsLda++ format.

Quoting http://gibbslda.sourceforge.net/#3.2_Input_Data_Format::
Contributor

Use a different format for the link.


Parameters
----------
s :
Contributor

???? (empty descriptions here and everywhere)

Contributor Author

Docs formally comply with numpy style now but not all type annotations and descriptions are there.

:)

@menshikh-iv menshikh-iv changed the title from "Convert corpora docs to numpy style" to "Fix documentation for gensim.corpora. Partial fix #1671" Jan 18, 2018
Contributor Author

@anotherbugmaster anotherbugmaster left a comment

You're right. It was one of the first files in corpora, and I didn't know about some of the specification features.

vocab/vocab.txt file.
File path to Serialized corpus.
fname_vocab : str, optional
Vocabulary file. If `fname_vocab` is None, searching for the vocab.txt or `fname_vocab`.vocab file.
Contributor Author

Are you sure it's `fname_vocab`.vocab? `fname_vocab` is None, isn't it?

Contributor

Yep

Contributor

Not quite, I added correct description

Contributor Author

@anotherbugmaster anotherbugmaster Jan 25, 2018

Still don't get it. It should be `fname`.vocab, `fname_vocab`.vocab is undefined!

Contributor

Not quite :) I went through the code with ipdb for this case; the lookup is significantly "wider" than what we discussed here (I have already fixed it).
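
To make the point of the disagreement concrete: when `fname_vocab` is None there is nothing to append ".vocab" to, so any fallback has to be derived from `fname`. The sketch below shows a fallback of that kind. `find_vocab_file` and its two candidate paths are assumptions made for illustration; per the comment above, the lookup in the actual code is wider than this.

```python
import os


def find_vocab_file(fname, fname_vocab=None):
    """Return the vocabulary path to use for the corpus stored at `fname`.

    Hypothetical helper, not gensim's implementation.

    Parameters
    ----------
    fname : str
        Path to the serialized corpus.
    fname_vocab : str or None, optional
        Explicit vocabulary path; returned unchanged when given.

    Returns
    -------
    str
        Path to the vocabulary file.

    """
    if fname_vocab is not None:
        return fname_vocab
    # Assumed candidates: `fname`.vocab first, then vocab.txt next to the corpus.
    candidates = [
        fname + ".vocab",
        os.path.join(os.path.dirname(fname), "vocab.txt"),
    ]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    raise IOError("could not find a vocabulary file for %s" % fname)
```

Note that every candidate is built from `fname`, which matches the wording "`fname`.vocab" rather than "`fname_vocab`.vocab".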

Filename.
corpus : iterable
Iterable of documents.
Path to output filename.
Contributor Author

To output file

Iterable of documents.
Path to output filename.
corpus : iterable of iterable of (int, float)
Input corpus
Contributor Author

Obvious, no additional information provided. There's no need to have descriptions for all arguments. :)

@@ -153,16 +160,18 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
return offsets

def docbyoffset(self, offset):
-"""Return document corresponding to `offset`.
+"""Get document corresponding to `offset`,
Contributor Author

First line of docstring should always end with a dot.
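
Since this rule comes up repeatedly in the review, it could also be checked mechanically. The following is only a sketch using the standard `ast` module; `summaries_without_dot` is a hypothetical helper, not an existing gensim or numpydoc tool.

```python
import ast


def summaries_without_dot(source):
    """Find docstrings whose first line does not end with a dot.

    Parameters
    ----------
    source : str
        Python source code to inspect.

    Returns
    -------
    list of (str, str)
        Pairs of (function or class name, offending summary line).

    """
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            docstring = ast.get_docstring(node)  # cleaned; None if there is no docstring
            if not docstring:
                continue
            summary = docstring.splitlines()[0].strip()
            if not summary.endswith("."):
                offenders.append((node.name, summary))
    return offenders


example = '''
def docbyoffset(self, offset):
    """Get document corresponding to `offset`,

    Longer description.
    """
'''
print(summaries_without_dot(example))
# prints [('docbyoffset', 'Get document corresponding to `offset`,')]
```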

Parameters
----------
fname : str
File path to Serialized corpus.
Contributor Author

Maybe "Path to corpus" here and in the other corpora?

fname : str
File path to Serialized corpus.
fname_vocab : str, optional
Vocabulary file. If `fname_vocab` is None, searching for the vocab.txt or `fname_vocab`.vocab file.
Contributor Author

@anotherbugmaster anotherbugmaster Jan 21, 2018

Vocabulary file. If fname_vocab is None, searching for the vocab.txt or fname.vocab file.

fname : str
Path to output filename.
corpus : iterable of iterable of (int, float)
Input corpus
Contributor Author

Still think that it's not necessary. Also, there's a dot missing at the end of the line.

@@ -121,8 +160,19 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
return offsets

def docbyoffset(self, offset):
-"""
-Return the document stored at file position `offset`.
+"""Get document corresponding to `offset`,
Contributor Author

The first line should end with a dot.

Parameters
----------
fname : str
Path to output filename
Contributor Author

Dots at the end of the line. Did I miss these? O_o


def line2doc(self, line):
-"""
-Create a document from a single line (string) in SVMlight format
+"""Get a document from a single line in SVMlight format,
Contributor Author

The first line should end with a dot.
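
For readers unfamiliar with the format this docstring describes, here is a rough sketch of what turning one SVMlight line into a document can look like. `svmlight_line_to_doc` is a hypothetical function, not gensim's `line2doc`; shifting the feature ids (which conventionally start at 1 in SVMlight files) to 0-based ids and skipping `qid` fields are assumptions of this sketch.

```python
def svmlight_line_to_doc(line):
    """Parse one SVMlight line into a document and its target.

    Parameters
    ----------
    line : str
        Line of the form ``<target> <feature>:<value> ... # comment``.

    Returns
    -------
    (list of (int, float), str) or None
        Document in bag-of-words format and the target label,
        or None if the line is empty or contains only a comment.

    """
    line = line.split("#", 1)[0].strip()  # drop a trailing comment, if any
    if not line:
        return None
    parts = line.split()
    target = parts[0]
    doc = []
    for field in parts[1:]:
        feature, value = field.rsplit(":", 1)
        if feature == "qid":
            continue  # query id, not a feature (assumption: ignored)
        doc.append((int(feature) - 1, float(value)))  # 1-based id -> 0-based id
    return doc, target
```

With these assumptions, the line `1 1:1.0 2:2.0` yields the document `[(0, 1.0), (1, 2.0)]` and the target `'1'`.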

Parameters
----------
s : str
String containing markup template
Contributor Author

A dot at the EOL.

token_min_len : int
Minimal token length.
token_max_len : int
Maximal token length
Contributor Author

Dot

f : file
File-like object.
filter_namespaces : list of str or bool
Namespaces that will be extracted
Contributor Author

Dot

the standard corpus interface instead of this function::
Notes
-----
This iterates over the **texts**. If you want vectors, just use the standard corpus interface
Contributor Author

Dot

@menshikh-iv menshikh-iv merged commit c5f487d into piskvorky:develop Jan 22, 2018
sj29-innovate pushed a commit to sj29-innovate/gensim that referenced this pull request Feb 21, 2018
…iskvorky#1729)

* Fix typo

* Make `save_corpus` private

* Annotate `bleicorpus.py`

* Make __save_corpus weakly private

* Fix _save_corpus in tests

* Fix _save_corpus[2]

* Document bleicorpus in Numpy style

* Document indexedcorpus

* Annotate csvcorpus

* Add "Yields" section

* Make `_save_corpus` public

* Annotate bleicorpus

* Fix indentation in bleicorpus

* `_save_corpus` -> `save_corpus`

* Annotate bleicorpus

* Convert dictionary docs to numpy style

* Convert hashdictionary docs to numpy style

* Convert indexedcorpus docs to numpy style

* Convert lowcorpus docs to numpy style

* Convert malletcorpus docs to numpy style

* Convert mmcorpus docs to numpy style

* Convert sharded_corpus docs to numpy style

* Convert svmlightcorpus docs to numpy style

* Convert textcorpus docs to numpy style

* Convert ucicorpus docs to numpy style

* Convert wikicorpus docs to numpy style

* Add sphinx tweaks

* Remove trailing whitespaces

* Annotate wikicorpus

* SVMLight Corpus annotated

* Fix TODO

* Fix grammar mistake

* Undo changes to dictionary

* Undo changes to hashdictionary

* Document indexedcorpus

* Document indexedcorpus[2]

Fix identation

* Remove redundant files

* Add more dots. :)

* Fix monospace

* remove useless method

* fix bleicorpus

* fix csvcorpus

* fix indexedcorpus

* fix svmlightcorpus

* fix wikicorpus[1]

* fix wikicorpus[2]

* fix wikicorpus[3]

* fix review comments
Labels
incubator project PR is RaRe incubator project
2 participants