Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lemmatizer exceptions: lemma_strings get sorted anyway #1387

Closed
mdcclv opened this issue Oct 5, 2017 · 8 comments · Fixed by #1389
Closed

Lemmatizer exceptions: lemma_strings get sorted anyway #1387

mdcclv opened this issue Oct 5, 2017 · 8 comments · Fixed by #1389

Comments

@mdcclv
Copy link
Contributor

mdcclv commented Oct 5, 2017

Lemmatization exceptions have been working inconsistently, even when they are added directly in corpora/en/wordnet/dict/verb.exc

The minimal test case of coping at #389 reveals that, at
https://github.com/explosion/spaCy/blob/master/spacy/lemmatizer.py#L94, the lemmatizer's list of potential forms (where the thing in lemmatizer.exceptions is item [0]) is cast to a set (and so loses ordering); then in https://github.com/explosion/spaCy/blob/master/spacy/morphology.pyx#L149, that set gets sorted. So lemmatizer exceptions only work if they also come first alphabetically!

I've implemented the fix for this, but I need this issue in order to submit the PR!

One question, though, for @honnibal: this can be fixed either as I did it locally -- return the whole list, with exceptions and then anything that comes back from the lemmatizer's rules -- or skip the rules altogether if we used an exception. I think it's more useful downstream if we keep all lemma candidates, even if we're not using them in the default pipeline. But it also seems only destructive to do sorted(set()) on them!

@honnibal
Copy link
Member

honnibal commented Oct 5, 2017

Thanks for bringing this up. I do think something is off in the lemmatization logic here.

However, if we want the exceptions list to have priority, shouldn't we just return if an exception is found?

Once we've generated multiple analyses in lemmatize(), I think it's actually reasonable to use lexical order as a decision criterion. I don't see a more principled way to do it, and at least lexical order is predictable?

@mdcclv
Copy link
Contributor Author

mdcclv commented Oct 5, 2017

I do think it's fine, and saves useless rule-application, if we just use the exception if it's present. The exception list also seems to have tuples as values.

I haven't looked at the ways the analyzer generate candidate lemmas, but I can imagine plenty of implementations where there's no principle to apply for choosing (I guess some rulebases could make the last rule to be applied the 'best' one). In the case of coping, hating and others, the real problem is that Wordnet has verbs entered for 'hat' and 'cop' which will always win.

Maybe an alternative is to prune the Wordnet-sourced exceptions for cases like cop and hat

Oops, sorry for hitting the close button

@azar923
Copy link

azar923 commented Jan 28, 2018

Hi @honnibal
Is it possible to add this fix to 1.x versions of spacy ?
We are experiencing some problems with lemmatization using spacy 1.9, like:
"ring" is lemmatized as "r"
"need" is lemmatized as "ne"
"rating" is lemmatized as "rat"
but, unfortunately, cannot allow migration to 2.x due to current runtime measurements.

Would be great if this fix could be merged to 1.9.
Thanks.

@mdcclv
Copy link
Contributor Author

mdcclv commented Jan 29, 2018

@azar923 this fix was indeed merged to 1.9! At first glance, I have a slight suspicion that the errors you're seeing are related to a different part of the lemmatization logic, but I don't have the time to confirm that right now.

@azar923
Copy link

azar923 commented Jan 31, 2018

Hi @mdcclv
Thank you for responding. In 1.9 code I see the following lemmatize function in lemmatizer.py:
`def lemmatize(string, index, exceptions, rules):

string = string.lower()
forms = []
# TODO: Is this correct? See discussion in Issue #435.
#if string in index:
#    forms.append(string)
forms.extend(exceptions.get(string, []))
oov_forms = []
for old, new in rules:
    if string.endswith(old):
        form = string[:len(string) - len(old)] + new
        if not form:
            pass
        elif form in index or not form.isalpha():
            forms.append(form)
        else:
            oov_forms.append(form)
if not forms:
    forms.extend(oov_forms)
if not forms:
    forms.append(string)
return set(forms)`

I believe that you changed it by adding condition on forms in your fix ?
Please correct me if I misunderstood something.

@mdcclv
Copy link
Contributor Author

mdcclv commented Feb 2, 2018

Aha, I'm sorry — I forgot the which release had my PR. There has been a 1.10 release: https://github.com/explosion/spaCy/releases/tag/v1.10.0 which is the one that has this lemmatization fix, and a number of other improvements that applied to both 1.x and 2.x spaCy. I highly recommend installing 1.10 — it's the version I'm using for my projects, too!

@azar923
Copy link

azar923 commented Feb 2, 2018

@mdcclv thank you for clarification. I would certainly follow your advice and switch to 1.10, however, there is a bug in this version, that prevents me doing this. Since I rely on document serialization in some part of the system, this is a blocker for me #1925. In 1.9 I am not experiencing such a problem.

@lock
Copy link

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants