corpus: why not update self.length after iterating all #3

Dieterbe · 2011-02-21T09:40:38Z

Hi,
why not do something like this in every corpus:

def __iter__(self):
    (...)
    length = 0
    for lineNo, line in enumerate(...):
        (....)
        length += 1
        yield doc
    self.length = length

This reduces the chance of having to run the very expensive full iteration solely so that len() can return the length.
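A minimal, self-contained sketch of the idea, for illustration only. `LineCorpus` and its `lines` argument are hypothetical names, not gensim classes; the real corpus classes parse their own file formats. It assumes the source can be iterated more than once.

```python
class LineCorpus:
    """Hypothetical streaming corpus: one whitespace-tokenised document per line."""

    def __init__(self, lines):
        self.lines = lines    # any re-iterable source of text lines (assumption)
        self.length = None    # unknown until a full pass has completed

    def __iter__(self):
        length = 0
        for line in self.lines:
            length += 1
            yield line.split()
        # Reached only when the consumer exhausts the generator,
        # so an abandoned (partial) pass never stores a wrong count.
        self.length = length

    def __len__(self):
        if self.length is None:
            # Fallback: the expensive full pass, done only for counting.
            self.length = sum(1 for _ in self.lines)
        return self.length


corpus = LineCorpus(["human interface", "graph minors survey", "trees"])
docs = list(corpus)     # one full pass; the length is cached as a side effect
n_docs = len(corpus)    # answered from the cache, no second pass
```

Because the assignment sits after the loop, a consumer that stops early leaves `self.length` untouched and `len()` falls back to counting.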


piskvorky · 2011-02-21T12:08:15Z

Usually len() is needed earlier than iter(), so caching the length in iter wouldn't help.

But I'll add length caching to IndexedCorpus (see our Google groups discussion), so it doesn't matter anyway :) Killing two flies at once...

Dieterbe · 2011-02-21T13:17:43Z

> Usually len() is needed earlier than iter(), so caching the length in iter wouldn't help.

Not in my case :)

> But I'll add length caching to IndexedCorpus (see our Google groups discussion), so it doesn't matter anyway :)

It does. Your codebase explicitly supports "the old way" of just having a streaming corpus without an index.
AFAICT, when the user does not need corpus[123456]-style document retrieval (only streaming), iterates over the corpus first, and calls len() afterwards, there are two options for a fast len():
A) tell the user to use an index (for the sole purpose of speeding up len())
B) add the code I suggested

I think A is quite expensive (building and storing the index structure but only using it for len()), so I would do B. But of course, it's your decision.
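To make the iterate-first, len()-afterwards case concrete, here is a rough sketch that counts how many passes over the data each variant needs. All names (`CountingSource`, `PlainCorpus`, `CachingCorpus`, `passes_for_iterate_then_len`) are made up for this example and are not part of gensim.

```python
class CountingSource:
    """Hypothetical line source that records how many passes were started."""

    def __init__(self, lines):
        self.lines = lines
        self.passes = 0

    def __iter__(self):
        self.passes += 1
        return iter(self.lines)


class PlainCorpus:
    """Streaming corpus without caching: len() always re-reads the source."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        for line in self.source:
            yield line.split()

    def __len__(self):
        return sum(1 for _ in self.source)


class CachingCorpus(PlainCorpus):
    """Same corpus, but a completed iteration remembers the document count."""

    length = None

    def __iter__(self):
        length = 0
        for line in self.source:
            length += 1
            yield line.split()
        self.length = length  # set only after a complete pass

    def __len__(self):
        if self.length is None:
            self.length = super().__len__()
        return self.length


def passes_for_iterate_then_len(corpus_cls):
    """Iterate once, then call len(); return (length, passes over the data)."""
    source = CountingSource(["a b", "c", "d e f"])
    corpus = corpus_cls(source)
    for _ in corpus:
        pass
    return len(corpus), source.passes


plain_result = passes_for_iterate_then_len(PlainCorpus)      # (3, 2)
caching_result = passes_for_iterate_then_len(CachingCorpus)  # (3, 1)
```

If len() is called before any iteration, the caching variant still needs a counting pass followed by the iteration pass, which matches the objection that caching in iter does not help in that order.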

piskvorky · 2011-02-22T15:02:34Z

OK. I still think determining the length of your input data conceptually belongs elsewhere (i.e., not in gensim at all), but on the other hand, it's just 3 lines of code and I finally want to see how pull requests work on GitHub :) Can you please initiate a pull request?

EDIT: (to develop branch)

Dieterbe · 2011-02-22T15:39:52Z

#4
there you go.
I went over all the corpus classes and found 2 of them that benefit from this tweak. So it's 6 lines ;)

Dieterbe · 2011-02-22T15:41:03Z

Note that GitHub automatically generates an issue for a pull request.
In this case that's issue 4:
#4

This issue was closed.
