Phrases and Phraser allow a generator corpus #1099
Conversation
Allow Phrases and Phraser models to take a generator function/expression as input to the transformation method. Previously, only indexable iterables could be used, which is problematic for large corpora. Add additional tests to test_phrases using a generator as input.
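For context, a rough usage sketch of what this enables (the corpus file name and tokenisation below are made up for illustration). Training already consumes its input in a single pass, so a generator works there; with this change the transformation side accepts one as well:

from gensim.models.phrases import Phrases

def gen_sentences(path='corpus.txt'):  # hypothetical file, one document per line
    with open(path) as fin:
        for line in fin:
            yield line.lower().split()

bigram = Phrases(gen_sentences(), min_count=5, threshold=10.0)

# Previously the transformation required an indexable corpus (e.g. a list of
# token lists); with this change a fresh generator can be passed straight in.
for phrased_sentence in bigram[gen_sentences()]:
    print(phrased_sentence)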
What is the error that the code produces when given a generator? The call to _apply used here is pretty standard in Gensim. Interested to know what is particular about Phrases.
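For reference, the properties of a generator that an indexable-corpus assumption runs into; this is a generic Python illustration, not gensim code:

gen = (tokens for tokens in [['human', 'interface'], ['graph', 'minors']])

try:
    len(gen)          # generators have no length
except TypeError as err:
    print(err)

try:
    gen[0]            # and cannot be indexed
except TypeError as err:
    print(err)

print(list(gen))      # one full pass works: [['human', 'interface'], ['graph', 'minors']]
print(list(gen))      # but a second pass yields [] because the generator is exhausted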
bigram2_seen = False

for s in self.bigram[gen_sentences()]:
    if not bigram1_seen and 'response_time' in s:
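Pieced together from the hunk above, the added test presumably looks roughly like the following; the method name and the fixtures (gen_sentences, self.bigram, the response_time and graph_minors phrases) are taken from the visible lines and the existing tests, so details may differ from the actual diff:

def testBigramConstructionFromGenerator(self):
    """Phrases bigram construction should also work when the corpus is a generator."""
    bigram1_seen = False
    bigram2_seen = False

    for s in self.bigram[gen_sentences()]:
        if not bigram1_seen and 'response_time' in s:
            bigram1_seen = True
        if not bigram2_seen and 'graph_minors' in s:
            bigram2_seen = True
        if bigram1_seen and bigram2_seen:
            break

    self.assertTrue(bigram1_seen and bigram2_seen)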
What would be lost by just using assert 'response_time' in self.bigram[gen_sentences()]?
This is a cop-out, but I just copied the format of the existing test :P
To streamline that section you could do:
assert len(set(['response_time', 'graph_minors']).intersection(set(it.chain.from_iterable(self.bigram[gen_sentences()])))) == 2
However, the current formulation in both my test and testBigramConstruction short-circuits so that it doesn't have to go through the entire input. This would allow longer test corpora in the future.
In either case, I didn't want to question or rethink the work of whoever designed the tests originally; I just wanted to make sure my changes were non-breaking. Do you want me to change the tests?
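For what it's worth, a middle ground (sketched with the same fixtures as above) is a pair of any() checks, which also stop consuming the stream once a match is found, at the cost of re-running the generator function for the second phrase:

assert any('response_time' in s for s in self.bigram[gen_sentences()])
assert any('graph_minors' in s for s in self.bigram[gen_sentences()])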
Would it be easier to modify _apply?
I like the goal and thank you for including tests! But I'm concerned this seems a bit of a convoluted approach. It's got twisty conditionals, a catchall exception handler, and is named like a predicate function even though it does more than answer a yes/no question. Maybe it has to be, to continue to offer Phrase-ification via __getitem__ of both individual sentences and whole corpora.

In the interest of explicitness/simplicity, what if the two cases were handled by clearly separate code paths? If a peek at the first element is still needed to tell them apart, something like:

def __getitem__(self, sentence_or_sentences):
sent_iter = iter(sentence_or_sentences)
try:
peek = sent_iter.next()
sentence_or_sentences = it.chain([peek], sent_iter)
except StopIteration:
return []
if isinstance(peek, string_types):
return phrasify_one(sentence_or_sentences)
else:
        return phrasify_many(sentence_or_sentences)
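The it.chain([peek], sent_iter) step is what keeps the peeked element from being lost; a standalone illustration of the trick:

import itertools as it

sentences = iter([['graph', 'minors'], ['human', 'interface']])

peek = next(sentences)                  # consume the first element to inspect it
restored = it.chain([peek], sentences)  # glue it back onto the front of the stream

print(list(restored))  # [['graph', 'minors'], ['human', 'interface']], nothing lost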
@tmylk I don't think there's any problem with _apply. @gojomo I think you're right, it's convoluted; I was more concerned with changing as little of the existing code as possible than with making the new approach clear. I think your proposed __getitem__ is cleaner. Also, should the docstrings for Phrases and Phraser be updated to mention that a generator is accepted? Last question: should I merge my branch back into my fork's develop branch?
@ELind77 This PR won't be merged as it is. Waiting for an improvement to __getitem__ along the lines discussed above.
My suggestion would be to only have one __getitem__ entry point. And yes, we could offer a similar function (under another name) for transforming an entire corpus. Questions around separating model creation vs. model use (model static but more efficient in memory/CPU) seem to crop up a lot. We did something similar with word2vec, where we introduced a separate, leaner object for using a finished model.
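As a concrete example of that creation/use split in this very module, a trained Phrases model can already be frozen into the lighter Phraser object; a sketch with a toy corpus:

from gensim.models.phrases import Phrases, Phraser

sentences = [
    ['human', 'interface', 'computer'],
    ['graph', 'minors', 'survey'],
    # ... or a streamed corpus of token lists
]

phrases = Phrases(sentences, min_count=1, threshold=1)  # full model, still trainable
bigram = Phraser(phrases)                               # slimmed down, transformation only

print(bigram[['graph', 'minors', 'survey']])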
After going back and looking at the code again I don't think it makes sense to change the public API. I think it's better and more consistent to keep the API exactly as it is. I think the main issue @gojomo pointed out is the complexity of _is_single. A simple modification to @gojomo's code, with some documentation inspired by utils.is_corpus:

def _is_single(obj):
"""
Check whether `obj` is a single document or an entire corpus.
Returns (is_single, new) 2-tuple, where `new` yields the same
sequence as `obj`.
`obj` is a single document if it is an iterable of strings. It
is a corpus if it is an iterable of documents.
"""
obj_iter = iter(obj)
try:
peek = obj_iter.next()
obj_iter = it.chain([peek], obj_iter)
except StopIteration:
# An empty object is a single document
return True, obj
if isinstance(peek, string_types):
# It's a document, return
return True, obj_iter
else:
        return False, obj_iter

What do you guys think? I think the naming could use some tweaking, but it is exactly the same problem and solution as the is_corpus function in utils.
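For illustration, the contract of that helper on a few representative inputs, assuming the snippet above is defined as _is_single with itertools imported as it and string_types available (gensim takes it from six), and run under Python 2 where the .next() spelling works:

print(_is_single(['graph', 'minors'])[0])      # True:  an iterable of strings is one sentence
print(_is_single([['graph'], ['minors']])[0])  # False: an iterable of token lists is a corpus
print(_is_single(s for s in [])[0])            # True:  empty input is treated as a single document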
@ELind77 Looks concise. Would you like to update the PR with that version? Adding a test for it would be good as well.
The _is_single function in phrases.py is now simpler and has the same contract as the is_corpus function in utils.
@tmylk I've pushed the changes to _is_single. Also, what did I break in the compatibility tests? I gather I did something that py 3.x doesn't like but I'm not sure what.
I took a look at the error from Travis and it looks like one of my tests is failing for Python 3.x but I'm not sure why. Something different in the way 3.x handles generator expressions perhaps?
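One plausible culprit, offered as a guess rather than a diagnosis: the peek above is spelled obj_iter.next(), and the .next() method was removed from iterators in Python 3 (it became __next__); the version-independent spelling is the next() built-in:

gen = (s for s in [['graph', 'minors']])

# gen.next()      # Python 2 only; raises AttributeError on Python 3
print(next(gen))  # ['graph', 'minors'], works on both Python 2 and 3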
And an explicit test that no output and no exception is received for an empty input.
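A sketch of what such a check might look like; the test name is illustrative and it assumes empty input comes back as an empty result, as in the _is_single snippet above:

def testEmptyInputsOnBigramConstruction(self):
    """An empty corpus or sentence should produce no output and no exception."""
    self.assertEqual(list(self.bigram[[]]), [])
    self.assertEqual(list(self.bigram[iter([])]), [])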
Ok, changes made and test added. I also removed the TODO comments in phrases.py.
@ELind77 Thanks for the PR!
I was trying to build a Phrases model with a corpus bigger than 1GB. My corpora are always generators and usually that's fine with gensim but it didn't work this time so I fixed it. The fix is basically just adding a function in phrases.py to do some more checking as to the type of the input. It also accommodates a generator of generators in case each document/token is undergoing some kind of transformation as it's pulled through.
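A sketch of the "generator of generators" shape mentioned here, where each document's tokens are also produced lazily; the documents and the token_stream helper are made up for illustration:

from gensim.models.phrases import Phrases

raw_docs = ['Human machine interface', 'Graph minors survey']

def token_stream(doc):
    # Tokens for one document, produced lazily (e.g. lowercased on the fly).
    for token in doc.split():
        yield token.lower()

bigram = Phrases([doc.lower().split() for doc in raw_docs], min_count=1, threshold=1)

# Neither the corpus nor the individual documents are materialised up front.
for phrased in bigram[(token_stream(doc) for doc in raw_docs)]:
    print(phrased)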
This PR does still need a small modification to the docstrings in Phraser and Phrases in order to indicate that the input type can be a generator and not just a list, but my rst isn't all that great. Maybe if someone reviews this I could get a pointer on that?