Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

corpus: why not update self.length after iterating all #3

Closed
Dieterbe opened this issue Feb 21, 2011 · 5 comments
Closed

corpus: why not update self.length after iterating all #3

Dieterbe opened this issue Feb 21, 2011 · 5 comments

Comments

@Dieterbe
Copy link
Contributor

Hi,
why not do in every corpus, something like:

def __iter__(self):
     (...)
     length = 0   
     for lineNo, line in enumerate(...):
          (....)
          length += 1
          yield doc
     self.length = length

this reduces the chance of needing to run the highly expensive iteration for the sole sake of returning the length, in the len function.

@piskvorky
Copy link
Owner

Usually len() is needed earlier than iter(), so caching the length in iter wouldn't help.

But I'll add length caching to IndexedCorpus (see our Google groups discussion), so it doesn't matter anyway :) Killing two flies at once...

@Dieterbe
Copy link
Contributor Author

Usually len() is needed earlier than iter(), so caching the length in iter wouldn't help.

not in my case :)

But I'll add length caching to IndexedCorpus (see our Google groups discussion), so it doesn't matter anyway :)

it does. your codebase explicitly supports "the old way" of just having the streaming corpus without an index.
AFAICT, in the case where the user does not need corpus[123456]-style document retrieval (only streaming) and where the user iterates corpus first, calls len() afterwards, there are two options for fast len():
A) tell user to use an index (for the sole purpose of speeding up len())
B) add the code I suggested

I think A is quite expensive (building and storing the index structure but only using it for len()), so I would do B. But of course, it's your decision.

@piskvorky
Copy link
Owner

Ok. I still think determining your input data length belongs conceptually elsewhere (i.e., not in gensim at all), but on the other hand, it's just 3 lines of code and i finally want to see how the pulls work on github :) Can you please initiate a pull request?

EDIT: (to develop branch)

@Dieterbe
Copy link
Contributor Author

#4
there you go.
I went over all the corpus classes and found 2 of them that benefit from this tweak. So it's 6 lines ;)

@Dieterbe
Copy link
Contributor Author

note that github automatically generates an issue on a pull request.
in this case that's issue 4:
#4

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants