Words that match more than one lemma #94

juanjoDiaz · 2023-05-26T14:43:11Z

Hi,

I noticed a problem with the behaviour for the word 'schulen' in German (as a noun vs as a verb) when using all capitals:

>>> simplemma.lemmatize("Schulen", "de")
'Schule'  # plural form lemmatized as noun "Schule"
>>> simplemma.lemmatize("schulen", "de")
'schulen'  # infinitive form lemmatized as verb "schulen"

The all-caps version appears to result in a match with the lowercase form:

>>> simplemma.lemmatize("SCHULEN", "de")
'schulen'

Of course, it is impossible to know which version to prefer in this case.

I also noticed that some of the words in the dictionaries contain more than one lemma, for example:

>>> simplemma.lemmatize("Sie", "de")
'Sie|sie'

Why are there words with multiple lemmas in the dictionaries?
Should we consider supporting this on the simplemma side? I mean changing the strategies so that they can produce multiple matches and return them somehow.

Wdyt?

adbar · 2023-05-26T16:14:09Z

Tough one. This is an absolute borderline case, since multiple matches are usually not present in lists and they may be annotated differently.

Concerning the "noun vs. verb" issue, this is indeed one of the main limitations of simplemma: it does not operate with syntactic information.

1over137 · 2024-03-23T15:40:17Z

Personally, I think the best way to handle this is to actually capture all examples (Schulen -> [Schule, schulen]) present in the corpus, arrange them by frequency, and make a new API (let's call it simplemma.lemmatize_all), which returns a list rather than a single word. This doesn't require drastically complicating the architecture but would still be useful for many situations.
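A minimal sketch of what such an API might look like. `lemmatize_all` is the hypothetical name from the comment above, not an existing simplemma function, and the candidate table with its frequency counts is invented toy data, not taken from any real dictionary.

```python
from typing import Dict, List

# Toy data (invented for illustration): surface form -> {lemma: corpus frequency}.
CANDIDATES: Dict[str, Dict[str, int]] = {
    "Schulen": {"Schule": 120, "schulen": 15},
    "schulen": {"schulen": 15, "Schule": 3},
}


def lemmatize_all(token: str, candidates: Dict[str, Dict[str, int]] = CANDIDATES) -> List[str]:
    """Return every known lemma for `token`, most frequent first.

    Hypothetical API: the token is looked up as given, lowercased, and
    capitalized, so an all-caps input can reach both noun and verb entries.
    """
    best: Dict[str, int] = {}
    for form in (token, token.lower(), token.capitalize()):
        for lemma, freq in candidates.get(form, {}).items():
            # Keep the highest frequency seen for each lemma across case variants.
            best[lemma] = max(best.get(lemma, 0), freq)
    if not best:
        # Unknown word: fall back to returning the token itself.
        return [token]
    # Sort by descending frequency, then alphabetically for stable output.
    return sorted(best, key=lambda lemma: (-best[lemma], lemma))


print(lemmatize_all("SCHULEN"))
```

Callers that only want one answer could take the first element, while callers with their own context (for example a part-of-speech tagger) could pick from the full list.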

adbar · 2024-03-25T11:43:22Z

The approach you suggest would probably give better results, but memory is already a concern for the available dictionaries. One way or another, there is always a tradeoff between precision, memory and processing time.

juanjoDiaz · 2024-03-25T21:43:59Z

Regarding the issue of having multiple words separated by |, it only happens in German and only in a few words:

('Sie', 'Sie|sie')
('Sich', 'er|es|sie')
('er|es|sie', 'er|es|sie')
('sich', 'er|es|sie')

So, I think that we should just modify the training script to correct these.
Unfortunately, that script is not public yet (requested in #102) so I can't do a PR.
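As a rough illustration of how such entries could be located before deciding how to correct them, here is a small sketch. It assumes the dictionary can be viewed as a plain form-to-lemma mapping; the helper name `find_multi_lemma_entries` and the toy mapping are made up for this example and say nothing about how simplemma actually stores or builds its data.

```python
from typing import Dict, List, Tuple


def find_multi_lemma_entries(
    dictionary: Dict[str, str], sep: str = "|"
) -> List[Tuple[str, List[str]]]:
    """Return (form, [lemmas]) for every entry whose lemma contains `sep`."""
    return sorted(
        (form, lemma.split(sep))
        for form, lemma in dictionary.items()
        if sep in lemma
    )


# Toy mapping using entries quoted in this thread plus one regular entry.
toy_dictionary = {
    "Sie": "Sie|sie",
    "Sich": "er|es|sie",
    "sich": "er|es|sie",
    "Schulen": "Schule",
}

for form, lemmas in find_multi_lemma_entries(toy_dictionary):
    print(form, lemmas)
```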

Regarding the proposal of returning a list of potential lemmas instead of a single lemma, that is what I proposed in the original issue. I guess that it's a matter of having both options so the user can control memory.
This can be easily done with the strategies framework that I did.
Once again, if you could publish here how dictionaries are trained, I'm happy to give it a go and give you some numbers and a proposal in the form of a PR.

1over137 · 2024-03-25T22:53:31Z

It would be great to publish the dictionary creation scripts. In particular, I feel that it would be nice to augment the existing data by passing some corpus through an LLM, which should be at least decent at the job. My users (https://github.com/FreeLanguageTools/vocabsieve) often complain that the lemmatizer coverage is poor for some languages, severely hindering usability. I know this isn't very elegant and may increase disk space use, but sometimes practicality is more important. A good first step in this direction would be to produce better eval datasets, though. I think the current eval datasets are too small to be representative, as they contain very few unique lemmas. For this kind of primarily dictionary-based lemmatization, I don't think there is a real need to separate training and validation sets, because you are primarily just memorizing stuff anyways, not generalizing.
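To make the point about eval dataset size concrete, a sketch like the following could report how many unique lemmas a gold set covers alongside accuracy. The function name `summarize_eval`, the gold pairs, and the stand-in lemmatizer are all invented for illustration; a real run would pass something like `lambda t: simplemma.lemmatize(t, "de")` and a real annotated dataset.

```python
from typing import Callable, Dict, List, Tuple


def summarize_eval(
    pairs: List[Tuple[str, str]], lemmatizer: Callable[[str], str]
) -> Dict[str, float]:
    """Report dataset size, unique gold lemmas, and exact-match accuracy."""
    correct = sum(1 for form, gold in pairs if lemmatizer(form) == gold)
    return {
        "tokens": len(pairs),
        "unique_lemmas": len({gold for _, gold in pairs}),
        "accuracy": correct / len(pairs) if pairs else 0.0,
    }


# Invented gold data: (surface form, expected lemma).
gold_pairs = [("Schulen", "Schule"), ("Häuser", "Haus"), ("Schule", "Schule")]

# Stand-in lemmatizer: a tiny lookup table that returns the token when unknown.
toy_table = {"Schulen": "Schule"}


def toy_lemmatizer(token: str) -> str:
    return toy_table.get(token, token)


print(summarize_eval(gold_pairs, toy_lemmatizer))
```

A low `unique_lemmas` count relative to `tokens` would suggest the accuracy figure is dominated by a handful of frequent words.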

adbar added the "question" (Further information is requested) label on Jun 1, 2023
