Lemmatizer exceptions: `lemma_strings` get sorted anyway #1387

mdcclv · 2017-10-05T00:51:20Z

Lemmatization exceptions have been working inconsistently, even when they are added directly in corpora/en/wordnet/dict/verb.exc

The minimal test case of coping at #389 reveals that, at
https://github.com/explosion/spaCy/blob/master/spacy/lemmatizer.py#L94, the lemmatizer's list of potential forms (where the thing in lemmatizer.exceptions is item [0]) is cast to a set (and so loses ordering); then in https://github.com/explosion/spaCy/blob/master/spacy/morphology.pyx#L149, that set gets sorted. So lemmatizer exceptions only work if they also come first alphabetically!

I've implemented the fix for this, but I need this issue in order to submit the PR!

One question, though, for @honnibal: this can be fixed either as I did it locally -- return the whole list, with exceptions and then anything that comes back from the lemmatizer's rules -- or skip the rules altogether if we used an exception. I think it's more useful downstream if we keep all lemma candidates, even if we're not using them in the default pipeline. But it also seems only destructive to do sorted(set()) on them!

The text was updated successfully, but these errors were encountered:

honnibal · 2017-10-05T01:27:45Z

Thanks for bringing this up. I do think something is off in the lemmatization logic here.

However, if we want the exceptions list to have priority, shouldn't we just return if an exception is found?

Once we've generated multiple analyses in lemmatize(), I think it's actually reasonable to use lexical order as a decision criterion. I don't see a more principled way to do it, and at least lexical order is predictable?

mdcclv · 2017-10-05T01:45:09Z

I do think it's fine, and saves useless rule-application, if we just use the exception if it's present. The exception list also seems to have tuples as values.

I haven't looked at the ways the analyzer generate candidate lemmas, but I can imagine plenty of implementations where there's no principle to apply for choosing (I guess some rulebases could make the last rule to be applied the 'best' one). In the case of coping, hating and others, the real problem is that Wordnet has verbs entered for 'hat' and 'cop' which will always win.

Maybe an alternative is to prune the Wordnet-sourced exceptions for cases like cop and hat

Oops, sorry for hitting the close button

…ion#1387

…explosion#1387

azar923 · 2018-01-28T16:48:28Z

Hi @honnibal
Is it possible to add this fix to 1.x versions of spacy ?
We are experiencing some problems with lemmatization using spacy 1.9, like:
"ring" is lemmatized as "r"
"need" is lemmatized as "ne"
"rating" is lemmatized as "rat"
but, unfortunately, cannot allow migration to 2.x due to current runtime measurements.

Would be great if this fix could be merged to 1.9.
Thanks.

mdcclv · 2018-01-29T16:47:49Z

@azar923 this fix was indeed merged to 1.9! At first glance, I have a slight suspicion that the errors you're seeing are related to a different part of the lemmatization logic, but I don't have the time to confirm that right now.

azar923 · 2018-01-31T15:19:35Z

Hi @mdcclv
Thank you for responding. In 1.9 code I see the following lemmatize function in lemmatizer.py:
`def lemmatize(string, index, exceptions, rules):

string = string.lower()
forms = []
# TODO: Is this correct? See discussion in Issue #435.
#if string in index:
#    forms.append(string)
forms.extend(exceptions.get(string, []))
oov_forms = []
for old, new in rules:
    if string.endswith(old):
        form = string[:len(string) - len(old)] + new
        if not form:
            pass
        elif form in index or not form.isalpha():
            forms.append(form)
        else:
            oov_forms.append(form)
if not forms:
    forms.extend(oov_forms)
if not forms:
    forms.append(string)
return set(forms)`

I believe that you changed it by adding condition on forms in your fix ?
Please correct me if I misunderstood something.

mdcclv · 2018-02-02T16:52:47Z

Aha, I'm sorry — I forgot the which release had my PR. There has been a 1.10 release: https://github.com/explosion/spaCy/releases/tag/v1.10.0 which is the one that has this lemmatization fix, and a number of other improvements that applied to both 1.x and 2.x spaCy. I highly recommend installing 1.10 — it's the version I'm using for my projects, too!

azar923 · 2018-02-02T20:09:21Z

@mdcclv thank you for clarification. I would certainly follow your advice and switch to 1.10, however, there is a bug in this version, that prevents me doing this. Since I rely on document serialization in some part of the system, this is a blocker for me #1925. In 1.9 I am not experiencing such a problem.

lock · 2018-05-08T00:55:19Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

mdcclv closed this as completed Oct 5, 2017

mdcclv reopened this Oct 5, 2017

mdcclv added a commit to mdcclv/spaCy that referenced this issue Oct 5, 2017

Regression test for lemmatizer exceptions -- demonstrate issue explos…

e81a608

…ion#1387

mdcclv added a commit to mdcclv/spaCy that referenced this issue Oct 5, 2017

Lemmatizer honors exceptions: Fix explosion#1387

ffb50d2

mdcclv added a commit to mdcclv/spaCy that referenced this issue Oct 5, 2017

Unit test for lemmatizer exceptions -- copied from regression test for …

b0d2718

…explosion#1387

mdcclv mentioned this issue Oct 5, 2017

Lemmatizer obey exceptions #1389

Merged

8 tasks

honnibal closed this as completed in #1389 Oct 5, 2017

lock bot locked as resolved and limited conversation to collaborators May 8, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lemmatizer exceptions: `lemma_strings` get sorted anyway #1387

Lemmatizer exceptions: `lemma_strings` get sorted anyway #1387

mdcclv commented Oct 5, 2017

honnibal commented Oct 5, 2017

mdcclv commented Oct 5, 2017 •

edited

Loading

azar923 commented Jan 28, 2018

mdcclv commented Jan 29, 2018

azar923 commented Jan 31, 2018

mdcclv commented Feb 2, 2018

azar923 commented Feb 2, 2018

lock bot commented May 8, 2018

Lemmatizer exceptions: lemma_strings get sorted anyway #1387

Lemmatizer exceptions: lemma_strings get sorted anyway #1387

Comments

mdcclv commented Oct 5, 2017

honnibal commented Oct 5, 2017

mdcclv commented Oct 5, 2017 • edited Loading

azar923 commented Jan 28, 2018

mdcclv commented Jan 29, 2018

azar923 commented Jan 31, 2018

mdcclv commented Feb 2, 2018

azar923 commented Feb 2, 2018

lock bot commented May 8, 2018

Lemmatizer exceptions: `lemma_strings` get sorted anyway #1387

Lemmatizer exceptions: `lemma_strings` get sorted anyway #1387

mdcclv commented Oct 5, 2017 •

edited

Loading