
Commit

prepare version 0.7.0
adbar committed Jun 16, 2022
1 parent 85d90a8 commit 6205180
Showing 5 changed files with 112 additions and 37 deletions.
8 changes: 8 additions & 0 deletions HISTORY.rst
@@ -2,6 +2,14 @@
History
=======

0.7.0
-----

* **breaking change**: language data pre-loading now occurs internally; language codes are passed directly in the ``lemmatize()`` call, e.g. ``simplemma.lemmatize("test", lang="en")`` (see the sketch below)
* faster lemmatization, result cache
* sentence-aware ``text_lemmatizer()``
* optional iterators for tokenization and lemmatization
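
A minimal before/after sketch of the breaking change, reusing the German example from the README diff below (``load_data()`` is the pre-0.7.0 interface):

.. code-block:: python

    # up to 0.6.0: language data had to be loaded explicitly
    >>> langdata = simplemma.load_data('de')
    >>> simplemma.lemmatize('angekündigten', langdata)
    'ankündigen'
    # from 0.7.0 on: pass the language code directly
    >>> simplemma.lemmatize('angekündigten', lang='de')
    'ankündigen'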


0.6.0
-----
64 changes: 38 additions & 26 deletions README.rst
@@ -58,39 +58,34 @@ Simplemma is used by selecting a language of interest and then applying the data
>>> import simplemma
# get a word
myword = 'masks'
# decide which language data to load
>>> langdata = simplemma.load_data('en')
# apply it on a word form
>>> simplemma.lemmatize(myword, langdata)
# decide which language to use and apply it to a word form
>>> simplemma.lemmatize(myword, lang='en')
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> langdata = simplemma.load_data('de')
>>> for token in mytokens:
>>> simplemma.lemmatize(token, langdata)
>>> simplemma.lemmatize(token, lang='de')
'hier'
'sein'
'Vaccines'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, langdata) for t in mytokens]
>>> [simplemma.lemmatize(t, lang='de') for t in mytokens]
['hier', 'sein', 'Vaccines']
Chaining several languages can improve coverage:
Chaining several languages can improve coverage; they are used in sequence:


.. code-block:: python
>>> langdata = simplemma.load_data('de', 'en')
>>> simplemma.lemmatize('Vaccines', langdata)
>>> from simplemma import lemmatize
>>> lemmatize('Vaccines', lang=('de', 'en'))
'vaccine'
>>> langdata = simplemma.load_data('it')
>>> simplemma.lemmatize('spaghettis', langdata)
>>> lemmatize('spaghettis', lang='it')
'spaghettis'
>>> langdata = simplemma.load_data('it', 'fr')
>>> simplemma.lemmatize('spaghettis', langdata)
>>> lemmatize('spaghettis', lang=('it', 'fr'))
'spaghetti'
>>> simplemma.lemmatize('spaghetti', langdata)
>>> lemmatize('spaghetti', lang=('it', 'fr'))
'spaghetto'
@@ -99,16 +94,23 @@ There are cases in which a greedier decomposition and lemmatization algorithm is
.. code-block:: python
# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', mydata, greedy=True)
>>> simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True)
'spaghetto'
# a German case
>>> langdata = simplemma.load_data('de')
>>> simplemma.lemmatize('angekündigten', langdata)
>>> simplemma.lemmatize('angekündigten', lang='de')
'ankündigen' # infinitive verb
>>> simplemma.lemmatize('angekündigten', langdata, greedy=False)
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=False)
'angekündigt' # past participle
Additional functions:

.. code-block:: python
# check whether a given word form is part of the language data
>>> simplemma.is_known('spaghetti', lang='it')
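# is_known() presumably returns a boolean, so it can also be used as a filter (sketch)
>>> [w for w in ['spaghetti', 'spaghettini'] if simplemma.is_known(w, lang='it')]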
Tokenization
~~~~~~~~~~~~

Expand All @@ -119,17 +121,20 @@ A simple tokenization function is included for convenience:
>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
# use iterator instead
>>> simple_tokenizer('Lorem ipsum dolor sit amet', iterate=True)
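# list() materializes the lazy variant, which presumably yields the same tokens
>>> list(simple_tokenizer('Lorem ipsum dolor sit amet', iterate=True))
['Lorem', 'ipsum', 'dolor', 'sit', 'amet']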
The function ``text_lemmatizer()`` chains tokenization and lemmatization. It can take ``greedy`` (affecting lemmatization) and ``silent`` (affecting errors and logging) as arguments:
The functions ``text_lemmatizer()`` and ``lemma_iterator()`` chain tokenization and lemmatization. They can take ``greedy`` (affecting lemmatization) and ``silent`` (affecting errors and logging) as arguments:

.. code-block:: python
>>> from simplemma import text_lemmatizer
>>> langdata = simplemma.load_data('pt')
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', langdata)
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', lang='pt')
# caveat: desejo is also a noun, should be desejar here
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
# same principle, returns an iterator and not a list
>>> from simplemma import lemma_iterator
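# hypothetical usage, assuming lemma_iterator() takes the same arguments as text_lemmatizer()
>>> list(lemma_iterator('Sou o intervalo entre o que desejo ser e os outros me fizeram.', lang='pt'))
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']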
Caveats
@@ -138,13 +143,11 @@ Caveats
.. code-block:: python
# don't expect too much though
>>> langdata = simplemma.load_data('it')
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', langdata)
>>> simplemma.lemmatize('spaghettini', lang='it')
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> langdata = simplemma.load_data('es')
>>> simplemma.lemmatize('son', langdata)
>>> simplemma.lemmatize('son', lang='es')
'son' # valid common noun, but what about the verb form?
@@ -216,6 +219,15 @@ The scores are calculated on `Universal Dependencies <https://universaldependenc
This library is particularly relevant for the lemmatization of less frequent words. Its performance in this case is only incidentally captured by the benchmark above.


Speed
-----

Measured on an old laptop to give a lower bound:

- Tokenization: > 1 million tokens/sec
- Lemmatization: > 250,000 words/sec
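
A rough way to reproduce this kind of measurement, as a sketch (the sample text, its repetition factor and the language are arbitrary placeholders):

.. code-block:: python

    import time
    from simplemma import lemmatize, simple_tokenizer

    text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. ' * 10000
    start = time.perf_counter()
    tokens = simple_tokenizer(text)
    print('tokens/sec: %d' % (len(tokens) / (time.perf_counter() - start)))

    start = time.perf_counter()
    lemmata = [lemmatize(t, lang='en') for t in tokens]
    print('words/sec: %d' % (len(lemmata) / (time.perf_counter() - start)))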


Roadmap
-------

2 changes: 1 addition & 1 deletion simplemma/__init__.py
@@ -7,5 +7,5 @@
__version__ = '0.7.0'


from .simplemma import lemmatize, text_lemmatizer, is_known
from .simplemma import lemmatize, lemma_iterator, text_lemmatizer, is_known
from .tokenizer import simple_tokenizer
2 changes: 1 addition & 1 deletion tests/test_simplemma.py
@@ -4,7 +4,7 @@
import pytest

import simplemma
from simplemma import lemmatize
from simplemma import lemmatize, lemma_iterator, simple_tokenizer, text_lemmatizer


TEST_DIR = os.path.abspath(os.path.dirname(__file__))
73 changes: 64 additions & 9 deletions tests/udscore.py
@@ -1,46 +1,101 @@

import time

from collections import Counter

from conllu import parse_incr
from simplemma import load_data, lemmatize
from simplemma import lemmatize


data_files = [
('bg', 'tests/UD/bg-btb-all.conllu'),
# ('cs', 'tests/UD/cs-pdt-all.conllu'),
('da', 'tests/UD/da-ddt-all.conllu'),
('de', 'tests/UD/de-gsd-all.conllu'),
('el', 'tests/UD/el-gdt-all.conllu'),
('en', 'tests/UD/en-gum-all.conllu'),
('es', 'tests/UD/es-gsd-all.conllu'),
('et', 'tests/UD/et-edt-all.conllu'),
('fi', 'tests/UD/fi-tdt-all.conllu'),
('fr', 'tests/UD/fr-gsd-all.conllu'),
('ga', 'tests/UD/ga-idt-all.conllu'),
('hu', 'tests/UD/hu-szeged-all.conllu'),
('hy', 'tests/UD/hy-armtdp-all.conllu'),
('id', 'tests/UD/id-csui-all.conllu'),
('it', 'tests/UD/it-isdt-all.conllu'),
('la', 'tests/UD/la-proiel-all.conllu'),
('lt', 'tests/UD/lt-alksnis-all.conllu'),
('lv', 'tests/UD/lv-lvtb-all.conllu'),
('nb', 'tests/UD/nb-bokmaal-all.conllu'),
('nl', 'tests/UD/nl-alpino-all.conllu'),
('pl', 'tests/UD/pl-pdb-all.conllu'),
('pt', 'tests/UD/pt-gsd-all.conllu'),
('ru', 'tests/UD/ru-gsd-all.conllu'),
('sk', 'tests/UD/sk-snk-all.conllu'),
]
('tr', 'tests/UD/tr-boun-all.conllu'),
]

# doesn't work: right-to-left?
#data_files = [
# ('he', 'tests/UD/he-htb-all.conllu'),
# ('hi', 'tests/UD/hi-hdtb-all.conllu'),
# ('ur', 'tests/UD/ur-udtb-all.conllu'),
#]

#data_files = [
# ('de', 'tests/UD/de-gsd-all.conllu'),
#]


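# evaluate greedy and non-greedy lemmatization against the UD gold lemmas;
# the 'nonpro' / '-PRO' figures restrict scoring to tokens tagged ADJ or NOUN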
for filedata in data_files:
    total, greedy, nongreedy, zero = 0, 0, 0, 0
    total, nonprototal, greedy, nongreedy, zero, zerononpro, nonpro, nongreedynonpro = 0, 0, 0, 0, 0, 0, 0, 0
    errors, flag = [], False
    langdata = load_data(filedata[0])
    language = filedata[0]
    data_file = open(filedata[1], 'r', encoding='utf-8')
    start = time.time()
    print('==', filedata, '==')
    for tokenlist in parse_incr(data_file):
        for token in tokenlist:
            if token['lemma'] == '_':
            if token['lemma'] == '_': # or token['upos'] in ('PUNCT', 'SYM')
                # flag = True
                continue
            greedy_candidate = lemmatize(token['form'], langdata, greedy=True)
            candidate = lemmatize(token['form'], langdata, greedy=False)

            if token['id'] == 1:
                initial = True
            else:
                initial = False

            greedy_candidate = lemmatize(token['form'], lang=language, greedy=True, initial=initial)
            candidate = lemmatize(token['form'], lang=language, greedy=False, initial=initial)

            if token['upos'] in ('ADJ', 'NOUN'):
                nonprototal += 1
                if token['form'] == token['lemma']:
                    zerononpro += 1
                if greedy_candidate == token['lemma']:
                    nonpro += 1
                if candidate == token['lemma']:
                    nongreedynonpro += 1
            #if len(token['lemma']) < 3:
            #    print(token['form'], token['lemma'], greedy_candidate)
            #else:
            #    errors.append((token['form'], token['lemma'], candidate))
            total += 1
            if token['form'] == token['lemma']:
                zero += 1
            if greedy_candidate == token['lemma']:
                greedy += 1
            else:
                errors.append((token['form'], token['lemma'], candidate))
            if candidate == token['lemma']:
                nongreedy += 1
            else:
                errors.append((token['form'], token['lemma'], candidate))
    print('exec time:\t %.3f' % (time.time() - start))
    print('greedy:\t\t %.3f' % (greedy/total))
    print('non-greedy:\t %.3f' % (nongreedy/total))
    print('baseline:\t %.3f' % (zero/total))
    print('-PRO greedy:\t\t %.3f' % (nonpro/nonprototal))
    print('-PRO non-greedy:\t %.3f' % (nongreedynonpro/nonprototal))
    print('-PRO baseline:\t\t %.3f' % (zerononpro/nonprototal))
    mycounter = Counter(errors)
    print(mycounter.most_common(20))
