Fix documentation for gensim.corpora. Partial fix #1671 #1729

Merged: 54 commits, Jan 22, 2018
b260d4b
Fix typo
anotherbugmaster Sep 30, 2017
36d98d1
Make `save_corpus` private
anotherbugmaster Oct 2, 2017
981ebbb
Annotate `bleicorpus.py`
anotherbugmaster Oct 2, 2017
3428113
Make __save_corpus weakly private
anotherbugmaster Oct 2, 2017
69fc7e0
Fix _save_corpus in tests
anotherbugmaster Oct 2, 2017
b65a69a
Fix _save_corpus[2]
anotherbugmaster Oct 3, 2017
6fa92f3
Merge remote-tracking branch 'upstream/develop' into develop
anotherbugmaster Oct 15, 2017
78e207d
Document bleicorpus in Numpy style
anotherbugmaster Oct 24, 2017
7519382
Document indexedcorpus
anotherbugmaster Oct 24, 2017
ae69867
Annotate csvcorpus
anotherbugmaster Nov 3, 2017
c2765ed
Add "Yields" section
anotherbugmaster Nov 3, 2017
40add21
Make `_save_corpus` public
anotherbugmaster Nov 3, 2017
e044c3a
Annotate bleicorpus
anotherbugmaster Nov 3, 2017
123327d
Fix indentation in bleicorpus
anotherbugmaster Nov 3, 2017
2382d01
`_save_corpus` -> `save_corpus`
anotherbugmaster Nov 21, 2017
42409bf
Annotate bleicorpus
anotherbugmaster Nov 21, 2017
7cb5bbf
Convert dictionary docs to numpy style
anotherbugmaster Nov 21, 2017
56f19e6
Convert hashdictionary docs to numpy style
anotherbugmaster Nov 21, 2017
9162a7e
Convert indexedcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
5eaaac4
Convert lowcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
3b6b076
Convert malletcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
d7f3fc8
Convert mmcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
c46bff4
Convert sharded_corpus docs to numpy style
anotherbugmaster Nov 21, 2017
7823546
Convert svmlightcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
9878133
Convert textcorpus docs to numpy style
anotherbugmaster Nov 21, 2017
dba4429
Convert ucicorpus docs to numpy style
anotherbugmaster Nov 21, 2017
6a95c94
Convert wikicorpus docs to numpy style
anotherbugmaster Nov 21, 2017
6dcfb07
Add sphinx tweaks
anotherbugmaster Nov 21, 2017
2f61fc3
Merge remote-tracking branch 'upstream/develop' into develop
anotherbugmaster Nov 21, 2017
ac01abb
Merge branch 'develop' into fix_1605
anotherbugmaster Nov 21, 2017
833ec64
Remove trailing whitespaces
anotherbugmaster Nov 21, 2017
e656609
Merge branch 'develop' into fix_1605
anotherbugmaster Nov 23, 2017
3e597fe
Annotate wikicorpus
anotherbugmaster Nov 28, 2017
da1d5c2
SVMLight Corpus annotated
anotherbugmaster Dec 5, 2017
89f6098
Fix TODO
anotherbugmaster Dec 5, 2017
9eeea21
Fix grammar mistake
anotherbugmaster Dec 6, 2017
2b6aeaf
Undo changes to dictionary
anotherbugmaster Dec 7, 2017
9b17057
Undo changes to hashdictionary
anotherbugmaster Dec 7, 2017
de3ea0f
Document indexedcorpus
anotherbugmaster Dec 9, 2017
dafc373
Document indexedcorpus[2]
anotherbugmaster Dec 10, 2017
ff980bc
Merge upstream
anotherbugmaster Jan 9, 2018
0189d8d
Remove redundant files
anotherbugmaster Jan 11, 2018
943406c
Merge upstream
anotherbugmaster Jan 16, 2018
57cb5a3
Add more dots. :)
anotherbugmaster Jan 16, 2018
08ca492
Fix monospace
anotherbugmaster Jan 16, 2018
381fb97
remove useless method
menshikh-iv Jan 18, 2018
5b5701a
fix bleicorpus
menshikh-iv Jan 18, 2018
0e5c0cf
fix csvcorpus
menshikh-iv Jan 18, 2018
627c0e5
fix indexedcorpus
menshikh-iv Jan 18, 2018
b771bb5
fix svmlightcorpus
menshikh-iv Jan 18, 2018
d76af8d
fix wikicorpus[1]
menshikh-iv Jan 18, 2018
7fe753f
fix wikicorpus[2]
menshikh-iv Jan 18, 2018
a9eb1a3
fix wikicorpus[3]
menshikh-iv Jan 18, 2018
e3a8ebf
fix review comments
menshikh-iv Jan 22, 2018
57 changes: 33 additions & 24 deletions gensim/corpora/bleicorpus.py
@@ -5,7 +5,7 @@
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html


-"""Blei's LDA-C format."""
+"""Corpus in Blei's LDA-C format."""

from __future__ import with_statement

@@ -17,7 +17,7 @@
from six.moves import xrange


-logger = logging.getLogger('gensim.corpora.bleicorpus')
+logger = logging.getLogger(__name__)


class BleiCorpus(IndexedCorpus):
@@ -28,24 +28,22 @@ class BleiCorpus(IndexedCorpus):

Each document is one line::

-N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN
+N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN

-The vocabulary is a file with words, one word per line; word at line K has an
-implicit ``id=K``.

+The vocabulary is a file with words, one word per line; word at line K has an implicit `id=K`.

"""

def __init__(self, fname, fname_vocab=None):
"""
Initialize the corpus from a file.

Parameters
----------
fname : str
-Serialized corpus's filename.
-fname_vocab : str or None, optional
-Vocabulary file. If not given, searching for the
-vocab/vocab.txt file.
+File path to Serialized corpus.
Contributor Author

Path to corpus here and in other corpora maybe?

+fname_vocab : str, optional
+Vocabulary file. If `fname_vocab` is None, searching for the vocab.txt or `fname_vocab`.vocab file.
Contributor Author

Are you sure it's fname_vocab.vocab? fname_vocab is None, isn't it?

Contributor

Yep

Contributor

Not quite, I added the correct description.

Contributor Author (@anotherbugmaster, Jan 25, 2018)

Still don't get it. It should be `fname`.vocab, `fname_vocab`.vocab is undefined!

Contributor

Not quite :) I stepped through the code with ipdb for this case; this is significantly "wider" than what we discuss here (I have already fixed it).

Contributor Author (@anotherbugmaster, Jan 21, 2018)

Vocabulary file. If fname_vocab is None, searching for the vocab.txt or fname.vocab file.
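
To make the thread above concrete, here is a minimal sketch of a vocabulary lookup that tries several candidate paths when `fname_vocab` is None. The helper name `find_vocab` and the candidate list are assumptions made for illustration; the diff shown here does not include the lookup code, so the actual search in `BleiCorpus.__init__` may cover more or different locations.

```python
import os


def find_vocab(fname, candidates=None):
    """Return the first existing vocabulary path for corpus `fname`.

    Hypothetical helper: the default search order below is an assumption,
    not a copy of what BleiCorpus does.
    """
    if candidates is None:
        base, _ = os.path.splitext(fname)
        directory = os.path.dirname(fname)
        candidates = [
            fname + ".vocab",                      # e.g. corpus.lda-c.vocab
            os.path.join(directory, "vocab.txt"),  # vocab.txt next to the corpus
            base + ".vocab",                       # e.g. corpus.vocab
        ]
    for candidate in candidates:
        if os.path.exists(candidate):
            return candidate
    raise IOError("could not find vocabulary file for %s" % fname)
```

Whatever the real candidate list is, the docstring should name the paths relative to `fname`, since `fname_vocab` is None on this code path.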


Raises
------
@@ -76,25 +74,32 @@ def __init__(self, fname, fname_vocab=None):
self.id2word = dict(enumerate(words))

def __iter__(self):
-"""Iterate over the corpus, returning one sparse vector at a time."""
+"""Iterate over the corpus, returning one sparse (BoW) vector at a time.
+
+Yields
+------
+list of (int, float)
+Document's BoW representation.
+
+"""
lineno = -1
with utils.smart_open(self.fname) as fin:
for lineno, line in enumerate(fin):
yield self.line2doc(line)
self.length = lineno + 1

def line2doc(self, line):
-"""Convert line to document.
+"""Convert line in Blei LDA-C format to document (BoW representation).

Parameters
----------
line : str
-Document's string representation.
+Line in Blei's LDA-C format.

Returns
-------
list of (int, float)
-Document's list representation.
+Document's BoW representation.

"""
parts = utils.to_unicode(line).split()
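
The rest of `line2doc` is collapsed in this view. As a rough illustration of the format described in the class docstring (`N fieldId1:fieldValue1 ... fieldIdN:fieldValueN`), a standalone parser could look like the sketch below. `parse_ldac_line` is a hypothetical name, and the count check is an assumption about how a malformed line should be handled, not a quote from gensim.

```python
def parse_ldac_line(line):
    """Parse 'N id1:val1 ... idN:valN' into a list of (int, float) pairs."""
    parts = line.split()
    declared = int(parts[0])  # leading N: number of id:value fields that follow
    if declared != len(parts) - 1:
        raise ValueError("expected %d fields, got %d" % (declared, len(parts) - 1))
    doc = []
    for part in parts[1:]:
        # Split on the last colon so the value is always the final component.
        field_id, value = part.rsplit(":", 1)
        doc.append((int(field_id), float(value)))
    return doc
```

For example, the line `3 0:1 4:2 7:1` parses to `[(0, 1.0), (4, 2.0), (7, 1.0)]`, which matches the `list of (int, float)` return type in the docstring.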
@@ -108,23 +113,25 @@ def line2doc(self, line):
def save_corpus(fname, corpus, id2word=None, metadata=False):
"""Save a corpus in the LDA-C format.

-There are actually two files saved: `fname` and `fname.vocab`, where
-`fname.vocab` is the vocabulary file.
+Notes
+-----
+There are actually two files saved: `fname` and `fname.vocab`, where `fname.vocab` is the vocabulary file.

Parameters
----------
fname : str
-Filename.
-corpus : iterable
-Iterable of documents.
+Path to output filename.
Contributor Author

To output file

+corpus : iterable of iterable of (int, float)
+Input corpus
Contributor Author

Obvious, no additional information provided. There's no need to have descriptions for all arguments. :)

Contributor Author

Still think that it's not necessary. Also, there's a dot missing at the end of the line.

id2word : dict of (str, str), optional
-Transforms id to word.
-metadata : bool
-Any additional info.
+Mapping id -> word for `corpus`.
+metadata : bool, optional
+THIS PARAMETER WILL BE IGNORED.

Returns
-------
list of int
Offsets for each line in file (in bytes).

"""
if id2word is None:
@@ -153,16 +160,18 @@ def save_corpus(fname, corpus, id2word=None, metadata=False):
return offsets

def docbyoffset(self, offset):
-"""Return document corresponding to `offset`.
+"""Get document corresponding to `offset`,
Contributor Author

First line of docstring should always end with a dot.

Contributor Author

The first line should end with a dot.

+offset can be given from :meth:`~gensim.corpora.bleicorpus.BleiCorpus.save_corpus`.

Parameters
----------
offset : int
-Position of the document in the file.
+Position of the document in the file (in bytes).

Returns
-------
list of (int, float)
Contributor

Missing parameter description (here and everywhere)

Document in BoW format.

"""
with utils.smart_open(self.fname) as f:
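
The docstrings above say that `save_corpus` returns per-line byte offsets and that `docbyoffset` takes such an offset. A minimal sketch of that round trip follows, using plain files instead of `utils.smart_open`; `save_lines` and `line_by_offset` are hypothetical names, and the real methods do more work (vocabulary file, parsing the line into BoW).

```python
def save_lines(fname, lines):
    """Write lines to `fname`; return the byte offset at which each line starts."""
    offsets = []
    with open(fname, "wb") as fout:
        for line in lines:
            offsets.append(fout.tell())  # position before writing this line
            fout.write(line.encode("utf8") + b"\n")
    return offsets


def line_by_offset(fname, offset):
    """Seek to `offset` (in bytes) and return the line that starts there."""
    with open(fname, "rb") as fin:
        fin.seek(offset)
        return fin.readline().decode("utf8").rstrip("\n")
```

Opening the file in binary mode is what makes the offsets byte positions rather than character positions, which is why the "(in bytes)" note in both docstrings matters.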