-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lemmatizer exceptions: lemma_strings
get sorted anyway
#1387
Comments
Thanks for bringing this up. I do think something is off in the lemmatization logic here. However, if we want the exceptions list to have priority, shouldn't we just return if an exception is found? Once we've generated multiple analyses in |
I do think it's fine, and saves useless rule-application, if we just use the exception if it's present. The exception list also seems to have tuples as values. I haven't looked at the ways the analyzer generate candidate lemmas, but I can imagine plenty of implementations where there's no principle to apply for choosing (I guess some rulebases could make the last rule to be applied the 'best' one). In the case of Maybe an alternative is to prune the Wordnet-sourced exceptions for cases like Oops, sorry for hitting the |
Hi @honnibal Would be great if this fix could be merged to 1.9. |
@azar923 this fix was indeed merged to 1.9! At first glance, I have a slight suspicion that the errors you're seeing are related to a different part of the lemmatization logic, but I don't have the time to confirm that right now. |
Hi @mdcclv
I believe that you changed it by adding condition on forms in your fix ? |
Aha, I'm sorry — I forgot the which release had my PR. There has been a 1.10 release: https://github.com/explosion/spaCy/releases/tag/v1.10.0 which is the one that has this lemmatization fix, and a number of other improvements that applied to both 1.x and 2.x spaCy. I highly recommend installing 1.10 — it's the version I'm using for my projects, too! |
@mdcclv thank you for clarification. I would certainly follow your advice and switch to 1.10, however, there is a bug in this version, that prevents me doing this. Since I rely on document serialization in some part of the system, this is a blocker for me #1925. In 1.9 I am not experiencing such a problem. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Lemmatization exceptions have been working inconsistently, even when they are added directly in
corpora/en/wordnet/dict/verb.exc
The minimal test case of
coping
at #389 reveals that, athttps://github.com/explosion/spaCy/blob/master/spacy/lemmatizer.py#L94, the lemmatizer's list of potential forms (where the thing in lemmatizer.exceptions is item [0]) is cast to a
set
(and so loses ordering); then in https://github.com/explosion/spaCy/blob/master/spacy/morphology.pyx#L149, thatset
gets sorted. So lemmatizer exceptions only work if they also come first alphabetically!I've implemented the fix for this, but I need this issue in order to submit the PR!
One question, though, for @honnibal: this can be fixed either as I did it locally -- return the whole list, with exceptions and then anything that comes back from the lemmatizer's
rules
-- or skip the rules altogether if we used an exception. I think it's more useful downstream if we keep all lemma candidates, even if we're not using them in the default pipeline. But it also seems only destructive to dosorted(set())
on them!The text was updated successfully, but these errors were encountered: