Phrases and Phraser allow a generator corpus #1099
Conversation
Allow Phrases and Phraser models to take a generator function/expression as input to the transformation method. Previously, only indexable iterables could be used, which is problematic for large corpora. Add additional tests to test_phrases using a generator as input.
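For context, a rough usage sketch of what this enables (the corpus file name and tokenisation below are made up for illustration). Training already consumes its input in a single pass, so a generator works there; with this change the transformation side accepts one as well:

from gensim.models.phrases import Phrases

def gen_sentences(path='corpus.txt'):  # hypothetical file, one document per line
    with open(path) as fin:
        for line in fin:
            yield line.lower().split()

bigram = Phrases(gen_sentences(), min_count=5, threshold=10.0)

# Previously the transformation required an indexable corpus (e.g. a list of
# token lists); with this change a fresh generator can be passed straight in.
for phrased_sentence in bigram[gen_sentences()]:
    print(phrased_sentence)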
What is the error that the code produces when given a generator? The call to _apply used here is pretty standard in Gensim. Interested to know what is particular about Phrases.
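For reference, the properties of a generator that an indexable-corpus assumption runs into; this is a generic Python illustration, not gensim code:

gen = (tokens for tokens in [['human', 'interface'], ['graph', 'minors']])

try:
    len(gen)          # generators have no length
except TypeError as err:
    print(err)

try:
    gen[0]            # and cannot be indexed
except TypeError as err:
    print(err)

print(list(gen))      # one full pass works: [['human', 'interface'], ['graph', 'minors']]
print(list(gen))      # but a second pass yields [] because the generator is exhausted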
bigram2_seen = False

for s in self.bigram[gen_sentences()]:
    if not bigram1_seen and 'response_time' in s:
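Pieced together from the hunk above, the added test presumably looks roughly like the following; the method name and the fixtures (gen_sentences, self.bigram, the response_time and graph_minors phrases) are taken from the visible lines and the existing tests, so details may differ from the actual diff:

def testBigramConstructionFromGenerator(self):
    """Phrases bigram construction should also work when the corpus is a generator."""
    bigram1_seen = False
    bigram2_seen = False

    for s in self.bigram[gen_sentences()]:
        if not bigram1_seen and 'response_time' in s:
            bigram1_seen = True
        if not bigram2_seen and 'graph_minors' in s:
            bigram2_seen = True
        if bigram1_seen and bigram2_seen:
            break

    self.assertTrue(bigram1_seen and bigram2_seen)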
What would be lost by just using assert 'response_time' in self.bigram[gen_sentences()]?
This is a cop-out, but I just copied the format of the existing test :P
To streamline that section you could do:
assert len(set(['response_time', 'graph_minors']).intersection(set(it.chain.from_iterable(self.bigram[gen_sentences()])))) == 2
However, the current formulation in both my test and testBigramConstruction short-circuits so that it doesn't have to go through the entire input. This would allow longer test corpora in the future.
In either case, I didn't want to question or rethink the work of whoever designed the tests originally; I just wanted to make sure my changes were non-breaking. Do you want me to change the tests?
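For what it's worth, a middle ground (sketched with the same fixtures as above) is a pair of any() checks, which also stop consuming the stream once a match is found, at the cost of re-running the generator function for the second phrase:

assert any('response_time' in s for s in self.bigram[gen_sentences()])
assert any('graph_minors' in s for s in self.bigram[gen_sentences()])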
Would it be easier to modify _apply?
I like the goal and thank you for including tests! But I'm concerned this seems a bit of a convoluted approach. It's got twisty conditionals, a catchall exception handler, and is named like a predicate function even though it does more than answer a yes/no question. Maybe it has to be, to continue to offer Phrase-ification via __getitem__ of both individual sentences and whole corpora.

In the interest of explicitness/simplicity, what if the two cases were handled by clearly separate code paths? If a peek at the first element is still needed to tell them apart, something like:

def __getitem__(self, sentence_or_sentences):
sent_iter = iter(sentence_or_sentences)
try:
peek = sent_iter.next()
sentence_or_sentences = it.chain([peek], sent_iter)
except StopIteration:
return []
if isinstance(peek, string_types):
return phrasify_one(sentence_or_sentences)
else:
        return phrasify_many(sentence_or_sentences)
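The it.chain([peek], sent_iter) step is what keeps the peeked element from being lost; a standalone illustration of the trick:

import itertools as it

sentences = iter([['graph', 'minors'], ['human', 'interface']])

peek = next(sentences)                  # consume the first element to inspect it
restored = it.chain([peek], sentences)  # glue it back onto the front of the stream

print(list(restored))  # [['graph', 'minors'], ['human', 'interface']], nothing lost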
@tmylk I don't think there's any problem with _apply. @gojomo I think you're right, it's convoluted; I was more concerned with changing as little of the existing code as possible than with making the new approach clear. I think your proposed __getitem__ is cleaner. Also, should the docstrings for Phrases and Phraser be updated to mention that a generator is accepted? Last question: should I merge my branch back into my fork's develop branch?
@ELind77 This PR won't be merged as it is. Waiting for an improvement to __getitem__ along the lines discussed above.
My suggestion would be to only have one __getitem__ entry point. And yes, we could offer a similar function (under another name) for transforming an entire corpus. Questions around separating model creation vs. model use (model static but more efficient in memory/CPU) seem to crop up a lot. We did something similar with word2vec, where we introduced a separate, leaner object for using a finished model.
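As a concrete example of that creation/use split in this very module, a trained Phrases model can already be frozen into the lighter Phraser object; a sketch with a toy corpus:

from gensim.models.phrases import Phrases, Phraser

sentences = [
    ['human', 'interface', 'computer'],
    ['graph', 'minors', 'survey'],
    # ... or a streamed corpus of token lists
]

phrases = Phrases(sentences, min_count=1, threshold=1)  # full model, still trainable
bigram = Phraser(phrases)                               # slimmed down, transformation only

print(bigram[['graph', 'minors', 'survey']])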
After going back and looking at the code again I don't think it makes sense to change the public API. I think it's better and more consistent to keep the API exactly as it is. I think the main issue @gojomo pointed out is the complexity of _is_single. A simple modification to @gojomo's code, with some documentation inspired by utils.is_corpus:

def _is_single(obj):
"""
Check whether `obj` is a single document or an entire corpus.
Returns (is_single, new) 2-tuple, where `new` yields the same
sequence as `obj`.
`obj` is a single document if it is an iterable of strings. It
is a corpus if it is an iterable of documents.
"""
obj_iter = iter(obj)
try:
peek = obj_iter.next()
obj_iter = it.chain([peek], obj_iter)
except StopIteration:
# An empty object is a single document
return True, obj
if isinstance(peek, string_types):
# It's a document, return
return True, obj_iter
else:
        return False, obj_iter

What do you guys think? I think the naming could use some tweaking, but it is exactly the same problem and solution as the is_corpus function in utils.
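For illustration, the contract of that helper on a few representative inputs, assuming the snippet above is defined as _is_single with itertools imported as it and string_types available (gensim takes it from six), and run under Python 2 where the .next() spelling works:

print(_is_single(['graph', 'minors'])[0])      # True:  an iterable of strings is one sentence
print(_is_single([['graph'], ['minors']])[0])  # False: an iterable of token lists is a corpus
print(_is_single(s for s in [])[0])            # True:  empty input is treated as a single document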
@ELind77 Looks concise. Would you like to update the PR with that version? Adding a test for it would be good as well.
The _is_single function in phrases.py is now simpler and has the same contract as the is_corpus function in utils.
@tmylk I've pushed the changes to _is_single. Also, what did I break in the compatibility tests? I gather I did something that py 3.x doesn't like but I'm not sure what.
I took a look at the error from Travis and it looks like one of my tests is failing for Python 3.x but I'm not sure why. Something different in the way 3.x handles generator expressions perhaps?
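One plausible culprit, offered as a guess rather than a diagnosis: the peek above is spelled obj_iter.next(), and the .next() method was removed from iterators in Python 3 (it became __next__); the version-independent spelling is the next() built-in:

gen = (s for s in [['graph', 'minors']])

# gen.next()      # Python 2 only; raises AttributeError on Python 3
print(next(gen))  # ['graph', 'minors'], works on both Python 2 and 3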
And an explicit test that no output and no exception is received for an empty input.
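A sketch of what such a check might look like; the test name is illustrative and it assumes empty input comes back as an empty result, as in the _is_single snippet above:

def testEmptyInputsOnBigramConstruction(self):
    """An empty corpus or sentence should produce no output and no exception."""
    self.assertEqual(list(self.bigram[[]]), [])
    self.assertEqual(list(self.bigram[iter([])]), [])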
Ok, changes made and test added. I also removed the TODO comments in phrases.py.
@ELind77 Thanks for the PR!
I was trying to build a Phrases model with a corpus bigger than 1GB. My corpora are always generators and usually that's fine with gensim but it didn't work this time so I fixed it. The fix is basically just adding a function in phrases.py to do some more checking as to the type of the input. It also accommodates a generator of generators in case each document/token is undergoing some kind of transformation as it's pulled through.
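A sketch of the "generator of generators" shape mentioned here, where each document's tokens are also produced lazily; the documents and the token_stream helper are made up for illustration:

from gensim.models.phrases import Phrases

raw_docs = ['Human machine interface', 'Graph minors survey']

def token_stream(doc):
    # Tokens for one document, produced lazily (e.g. lowercased on the fly).
    for token in doc.split():
        yield token.lower()

bigram = Phrases([doc.lower().split() for doc in raw_docs], min_count=1, threshold=1)

# Neither the corpus nor the individual documents are materialised up front.
for phrased in bigram[(token_stream(doc) for doc in raw_docs)]:
    print(phrased)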
This PR does still need a small modification to the docstrings in Phraser and Phrases in order to indicate that the input type can be a generator and not just a list, but my rst isn't all that great. Maybe if someone reviews this I could get a pointer on that?