
Commit

prepare version 0.7.0
adbar committed Jun 16, 2022
1 parent 85d90a8 commit 6205180
Showing 5 changed files with 112 additions and 37 deletions.
8 changes: 8 additions & 0 deletions HISTORY.rst
@@ -2,6 +2,14 @@
History
=======

0.7.0
-----

* **breaking change**: language data pre-loading now occurs internally; language codes are passed directly in the ``lemmatize()`` call, e.g. ``simplemma.lemmatize("test", lang="en")`` (see the sketch below)
* faster lemmatization, result cache
* sentence-aware ``text_lemmatizer()``
* optional iterators for tokenization and lemmatization
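
A minimal before/after sketch of the breaking change, reusing the German example from the README diff below (``load_data()`` is the pre-0.7.0 interface):

.. code-block:: python

    # up to 0.6.0: language data had to be loaded explicitly
    >>> langdata = simplemma.load_data('de')
    >>> simplemma.lemmatize('angekündigten', langdata)
    'ankündigen'
    # from 0.7.0 on: pass the language code directly
    >>> simplemma.lemmatize('angekündigten', lang='de')
    'ankündigen'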


0.6.0
-----
64 changes: 38 additions & 26 deletions README.rst
@@ -58,39 +58,34 @@ Simplemma is used by selecting a language of interest and then applying the data
>>> import simplemma
# get a word
myword = 'masks'
# decide which language data to load
>>> langdata = simplemma.load_data('en')
# apply it on a word form
>>> simplemma.lemmatize(myword, langdata)
# decide which language to use and apply it to a word form
>>> simplemma.lemmatize(myword, lang='en')
'mask'
# grab a list of tokens
>>> mytokens = ['Hier', 'sind', 'Vaccines']
>>> langdata = simplemma.load_data('de')
>>> for token in mytokens:
>>> simplemma.lemmatize(token, langdata)
>>> simplemma.lemmatize(token, lang='de')
'hier'
'sein'
'Vaccines'
# list comprehensions can be faster
>>> [simplemma.lemmatize(t, langdata) for t in mytokens]
>>> [simplemma.lemmatize(t, lang='de') for t in mytokens]
['hier', 'sein', 'Vaccines']
Chaining several languages can improve coverage:
Chaining several languages can improve coverage; they are used in sequence:


.. code-block:: python
>>> langdata = simplemma.load_data('de', 'en')
>>> simplemma.lemmatize('Vaccines', langdata)
>>> from simplemma import lemmatize
>>> lemmatize('Vaccines', lang=('de', 'en'))
'vaccine'
>>> langdata = simplemma.load_data('it')
>>> simplemma.lemmatize('spaghettis', langdata)
>>> lemmatize('spaghettis', lang='it')
'spaghettis'
>>> langdata = simplemma.load_data('it', 'fr')
>>> simplemma.lemmatize('spaghettis', langdata)
>>> lemmatize('spaghettis', lang=('it', 'fr'))
'spaghetti'
>>> simplemma.lemmatize('spaghetti', langdata)
>>> lemmatize('spaghetti', lang=('it', 'fr'))
'spaghetto'
@@ -99,16 +94,23 @@ There are cases in which a greedier decomposition and lemmatization algorithm is
.. code-block:: python
# same example as before, comes to this result in one step
>>> simplemma.lemmatize('spaghettis', mydata, greedy=True)
>>> simplemma.lemmatize('spaghettis', lang=('it', 'fr'), greedy=True)
'spaghetto'
# a German case
>>> langdata = simplemma.load_data('de')
>>> simplemma.lemmatize('angekündigten', langdata)
>>> simplemma.lemmatize('angekündigten', lang='de')
'ankündigen' # infinitive verb
>>> simplemma.lemmatize('angekündigten', langdata, greedy=False)
>>> simplemma.lemmatize('angekündigten', lang='de', greedy=False)
'angekündigt' # past participle
Additional functions:

.. code-block:: python
# check whether a given word form is part of the language data
>>> simplemma.is_known('spaghetti', lang='it')
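# is_known() presumably returns a boolean, so it can also be used as a filter (sketch)
>>> [w for w in ['spaghetti', 'spaghettini'] if simplemma.is_known(w, lang='it')]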
Tokenization
~~~~~~~~~~~~

Expand All @@ -119,17 +121,20 @@ A simple tokenization function is included for convenience:
>>> from simplemma import simple_tokenizer
>>> simple_tokenizer('Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.')
['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'consectetur', 'adipiscing', 'elit', ',', 'sed', 'do', 'eiusmod', 'tempor', 'incididunt', 'ut', 'labore', 'et', 'dolore', 'magna', 'aliqua', '.']
# use iterator instead
>>> simple_tokenizer('Lorem ipsum dolor sit amet', iterate=True)
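# list() materializes the lazy variant, which presumably yields the same tokens
>>> list(simple_tokenizer('Lorem ipsum dolor sit amet', iterate=True))
['Lorem', 'ipsum', 'dolor', 'sit', 'amet']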
The function ``text_lemmatizer()`` chains tokenization and lemmatization. It can take ``greedy`` (affecting lemmatization) and ``silent`` (affecting errors and logging) as arguments:
The functions ``text_lemmatizer()`` and ``lemma_iterator()`` chain tokenization and lemmatization. They can take ``greedy`` (affecting lemmatization) and ``silent`` (affecting errors and logging) as arguments:

.. code-block:: python
>>> from simplemma import text_lemmatizer
>>> langdata = simplemma.load_data('pt')
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', langdata)
>>> text_lemmatizer('Sou o intervalo entre o que desejo ser e os outros me fizeram.', lang='pt')
# caveat: desejo is also a noun, should be desejar here
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']
# same principle, returns an iterator and not a list
>>> from simplemma import lemma_iterator
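# hypothetical usage, assuming lemma_iterator() takes the same arguments as text_lemmatizer()
>>> list(lemma_iterator('Sou o intervalo entre o que desejo ser e os outros me fizeram.', lang='pt'))
['ser', 'o', 'intervalo', 'entre', 'o', 'que', 'desejo', 'ser', 'e', 'o', 'outro', 'me', 'fazer', '.']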
Caveats
@@ -138,13 +143,11 @@ Caveats
.. code-block:: python
# don't expect too much though
>>> langdata = simplemma.load_data('it')
# this diminutive form isn't in the model data
>>> simplemma.lemmatize('spaghettini', langdata)
>>> simplemma.lemmatize('spaghettini', lang='it')
'spaghettini' # should read 'spaghettino'
# the algorithm cannot choose between valid alternatives yet
>>> langdata = simplemma.load_data('es')
>>> simplemma.lemmatize('son', langdata)
>>> simplemma.lemmatize('son', lang='es')
'son' # valid common noun, but what about the verb form?
@@ -216,6 +219,15 @@ The scores are calculated on `Universal Dependencies <https://universaldependenc
This library is particularly relevant for the lemmatization of less frequent words. Its performance in this case is only incidentally captured by the benchmark above.


Speed
-----

Measured on an old laptop to give a lower bound:

- Tokenization: > 1 million tokens/sec
- Lemmatization: > 250,000 words/sec
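
A rough way to reproduce this kind of measurement, as a sketch (the sample text, its repetition factor and the language are arbitrary placeholders):

.. code-block:: python

    import time
    from simplemma import lemmatize, simple_tokenizer

    text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. ' * 10000
    start = time.perf_counter()
    tokens = simple_tokenizer(text)
    print('tokens/sec: %d' % (len(tokens) / (time.perf_counter() - start)))

    start = time.perf_counter()
    lemmata = [lemmatize(t, lang='en') for t in tokens]
    print('words/sec: %d' % (len(lemmata) / (time.perf_counter() - start)))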


Roadmap
-------

2 changes: 1 addition & 1 deletion simplemma/__init__.py
@@ -7,5 +7,5 @@
__version__ = '0.7.0'


from .simplemma import lemmatize, text_lemmatizer, is_known
from .simplemma import lemmatize, lemma_iterator, text_lemmatizer, is_known
from .tokenizer import simple_tokenizer
2 changes: 1 addition & 1 deletion tests/test_simplemma.py
@@ -4,7 +4,7 @@
import pytest

import simplemma
from simplemma import lemmatize
from simplemma import lemmatize, lemma_iterator, simple_tokenizer, text_lemmatizer


TEST_DIR = os.path.abspath(os.path.dirname(__file__))
73 changes: 64 additions & 9 deletions tests/udscore.py
@@ -1,46 +1,101 @@

import time

from collections import Counter

from conllu import parse_incr
from simplemma import load_data, lemmatize
from simplemma import lemmatize


data_files = [
('bg', 'tests/UD/bg-btb-all.conllu'),
# ('cs', 'tests/UD/cs-pdt-all.conllu'),
('da', 'tests/UD/da-ddt-all.conllu'),
('de', 'tests/UD/de-gsd-all.conllu'),
('el', 'tests/UD/el-gdt-all.conllu'),
('en', 'tests/UD/en-gum-all.conllu'),
('es', 'tests/UD/es-gsd-all.conllu'),
('et', 'tests/UD/et-edt-all.conllu'),
('fi', 'tests/UD/fi-tdt-all.conllu'),
('fr', 'tests/UD/fr-gsd-all.conllu'),
('ga', 'tests/UD/ga-idt-all.conllu'),
('hu', 'tests/UD/hu-szeged-all.conllu'),
('hy', 'tests/UD/hy-armtdp-all.conllu'),
('id', 'tests/UD/id-csui-all.conllu'),
('it', 'tests/UD/it-isdt-all.conllu'),
('la', 'tests/UD/la-proiel-all.conllu'),
('lt', 'tests/UD/lt-alksnis-all.conllu'),
('lv', 'tests/UD/lv-lvtb-all.conllu'),
('nb', 'tests/UD/nb-bokmaal-all.conllu'),
('nl', 'tests/UD/nl-alpino-all.conllu'),
('pl', 'tests/UD/pl-pdb-all.conllu'),
('pt', 'tests/UD/pt-gsd-all.conllu'),
('ru', 'tests/UD/ru-gsd-all.conllu'),
('sk', 'tests/UD/sk-snk-all.conllu'),
]
('tr', 'tests/UD/tr-boun-all.conllu'),
]

# doesn't work: right-to-left?
#data_files = [
# ('he', 'tests/UD/he-htb-all.conllu'),
# ('hi', 'tests/UD/hi-hdtb-all.conllu'),
# ('ur', 'tests/UD/ur-udtb-all.conllu'),
#]

#data_files = [
# ('de', 'tests/UD/de-gsd-all.conllu'),
#]


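# evaluate greedy and non-greedy lemmatization against the UD gold lemmas;
# the 'nonpro' / '-PRO' figures restrict scoring to tokens tagged ADJ or NOUN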
for filedata in data_files:
    total, greedy, nongreedy, zero = 0, 0, 0, 0
    total, nonprototal, greedy, nongreedy, zero, zerononpro, nonpro, nongreedynonpro = 0, 0, 0, 0, 0, 0, 0, 0
    errors, flag = [], False
    langdata = load_data(filedata[0])
    language = filedata[0]
    data_file = open(filedata[1], 'r', encoding='utf-8')
    start = time.time()
    print('==', filedata, '==')
    for tokenlist in parse_incr(data_file):
        for token in tokenlist:
            if token['lemma'] == '_':
            if token['lemma'] == '_': # or token['upos'] in ('PUNCT', 'SYM')
                # flag = True
                continue
            greedy_candidate = lemmatize(token['form'], langdata, greedy=True)
            candidate = lemmatize(token['form'], langdata, greedy=False)

            if token['id'] == 1:
                initial = True
            else:
                initial = False

            greedy_candidate = lemmatize(token['form'], lang=language, greedy=True, initial=initial)
            candidate = lemmatize(token['form'], lang=language, greedy=False, initial=initial)

            if token['upos'] in ('ADJ', 'NOUN'):
                nonprototal += 1
                if token['form'] == token['lemma']:
                    zerononpro += 1
                if greedy_candidate == token['lemma']:
                    nonpro += 1
                if candidate == token['lemma']:
                    nongreedynonpro += 1
            #if len(token['lemma']) < 3:
            #    print(token['form'], token['lemma'], greedy_candidate)
            #else:
            #    errors.append((token['form'], token['lemma'], candidate))
            total += 1
            if token['form'] == token['lemma']:
                zero += 1
            if greedy_candidate == token['lemma']:
                greedy += 1
            else:
                errors.append((token['form'], token['lemma'], candidate))
            if candidate == token['lemma']:
                nongreedy += 1
            else:
                errors.append((token['form'], token['lemma'], candidate))
    print('exec time:\t %.3f' % (time.time() - start))
    print('greedy:\t\t %.3f' % (greedy/total))
    print('non-greedy:\t %.3f' % (nongreedy/total))
    print('baseline:\t %.3f' % (zero/total))
    print('-PRO greedy:\t\t %.3f' % (nonpro/nonprototal))
    print('-PRO non-greedy:\t %.3f' % (nongreedynonpro/nonprototal))
    print('-PRO baseline:\t\t %.3f' % (zerononpro/nonprototal))
    mycounter = Counter(errors)
    print(mycounter.most_common(20))
