Add a dictionary factory backed by MARISA-tries #133
Conversation
> using the `TrieDictionaryFactory` for the first time for a language and
> will take a few seconds and use as much memory as loading the Python
> dicts for the language requires. For further invocations the trie
> dictionaries get cached on disk.
Would it be helpful to explain where the cache is located on disk?
I tried not to document every little detail in the README to avoid blowing it up too much. IMO there should be a separate API-documentation for Simplemma to cover stuff like that. However, if it's desired, please let me know and I'll happily add it.
Awesome work @Dunedan!

Thanks for sharing your insights and for contributing this functionality! Everything looks good at first sight, but we need to update the installation process on GitHub Actions.
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff            @@
##             main     #133      +/-  ##
=========================================
+ Coverage   97.15%   97.25%   +0.09%
=========================================
  Files          33       34       +1
  Lines         563      619      +56
=========================================
+ Hits          547      602      +55
- Misses         16       17       +1
```

☔ View full report in Codecov by Sentry.
@Dunedan Your PR only works on Windows, there seems to be an issue with the

There are also lines not covered by the tests; it should be easy to add further tests.

I guess you're talking about the tests, which only succeed on the Windows runners so far. At first glance that looks like a list ordering issue affecting just the test assertions. Should be easy to fix. I'll probably do so tomorrow. If you're not talking about the tests, please let me know and provide a few more details about what functionality doesn't work and how that manifests. I'd be surprised about that though, as I did develop and test the code on Linux and didn't encounter any problems there.
Sure, no big deal, I can add tests for that as well.
I've dropped a bunch of comments.
I like the idea of using tries but I'm not fully convinced of the implementation.
I think that we should have a `TrieDictionaryFactory` which:
- has a method to convert all (or the selected) pickled dictionaries into pickled tries and save them on disk
- has a method to load a dictionary, which loads the dict from disk if there or from the original dictionary as needed
- doesn't do all that hashing logic, because dictionaries are not likely to change unless the library is updated
- reconsiders whether the current bytestring usage is right
```python
logger = logging.getLogger(__name__)


class TrieWrapDict(MutableMapping):
```
Should we simply modify the `DictionaryFactory` protocol to return a `Mapping` instead of a `Dict`, instead of having this wrapper?

Why `MutableMapping` instead of `Mapping`?
One of the constraints I gave myself was to implement this functionality without requiring changes to the existing code of Simplemma. That's why I didn't modify the expected return types of the `DictionaryFactory`, for example. Doing so would certainly simplify things, however that's something I didn't want to decide on my own.

It's a `MutableMapping` right now because a `dict`, which is what the `DefaultDictionaryFactory` uses, is one as well. No further reason beyond that.
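To make the wrapper idea concrete, here is a minimal sketch (a hypothetical simplification, not the PR's exact code): the real class wraps a `marisa_trie.BytesTrie`, whose lookups take `str` keys and return a list of byte-string values, so the wrapper translates at the key boundary.

```python
from collections.abc import MutableMapping


class TrieWrapDict(MutableMapping):
    """Expose a str -> list-of-bytes trie as a bytes -> bytes mapping.

    Sketch only: the real implementation wraps a marisa_trie.BytesTrie.
    """

    def __init__(self, trie):
        self._trie = trie

    def __getitem__(self, item):
        # Byte-string keys have to be decoded, as the trie stores str keys;
        # BytesTrie returns a list of values, so take the first one.
        return self._trie[item.decode()][0]

    def __setitem__(self, key, value):
        raise NotImplementedError("tries are immutable once built")

    def __delitem__(self, key):
        raise NotImplementedError("tries are immutable once built")

    def __iter__(self):
        for key in self._trie.keys():
            yield key.encode()

    def __len__(self):
        return len(self._trie)
```

Because it never mutates the trie, a read-only `Mapping` would arguably fit just as well, which is exactly the question raised above.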
```python
def __init__(
    self,
    cache_max_size: int = 8,
    use_disk_cache: bool = True,
```
`use_disk_cache` is not needed. If `disk_cache_dir` is `None`, then you don't use disk caching.
Right now if `disk_cache_dir` is `None`, a subdirectory of the user's platform-specific cache directory is used to store the cache. So `use_disk_cache` is currently used to distinguish between disabling the disk cache and using the default cache location.
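The resolution logic described here can be sketched roughly like this (names are illustrative; the real code obtains the default location via `platformdirs.user_cache_dir`, which is hard-coded to a Linux-style path here only to keep the sketch dependency-free):

```python
from pathlib import Path


def resolve_cache_dir(disk_cache_dir, use_disk_cache=True):
    """Illustrative helper mirroring the behavior described above."""
    if not use_disk_cache:
        return None  # disk caching disabled entirely
    if disk_cache_dir is not None:
        return Path(disk_cache_dir)  # an explicit location wins
    # Default: a subdirectory of the platform-specific user cache
    # directory, e.g. ~/.cache/simplemma on Linux.
    return Path.home() / ".cache" / "simplemma"
```

With only `disk_cache_dir`, the "disabled" and "default location" cases would collapse into one, which is the trade-off being discussed.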
```python
from typing import ByteString, Dict, List, Optional, cast

from marisa_trie import BytesTrie, HUGE_CACHE  # type: ignore[import-not-found]
from platformdirs import user_cache_dir
```
`platformdirs` is also an external dependency. So why doesn't it need the type ignore like `marisa_trie`?
While I'm not entirely sure, my guess is that mypy can't get any type information for `marisa_trie`, because it isn't Python code but a C extension and doesn't provide type information.

```python
def __init__(
    self,
    cache_max_size: int = 8,
```
We are doing double caching: in-memory and on-disk. This parameter makes it unclear which cache it refers to.
I completely agree. I just kept that parameter from `DefaultDictionaryFactory` to be able to use `TrieDictionaryFactory` as a drop-in replacement for it.
```python
        hasher.update(chunk)
    return hasher.hexdigest()


def _cache_is_current(self, lang: str) -> bool:
```
What's the scenario in which the on-disk trie won't match the in-memory one?
This method is for checking whether the tries cached on disk still match the pickled bytestring dictionaries. They don't match anymore whenever a new version of Simplemma gets released which includes updated pickled bytestring dictionaries, resulting in the need to regenerate the cached tries.
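A rough sketch of such a check, assuming the factory records a hash of the pickled dictionary next to the cached trie (file names and function names here are made up for illustration):

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks to avoid loading it into memory at once."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def cache_is_current(dict_path: Path, hash_file: Path) -> bool:
    """The cached trie is current iff the hash recorded when it was
    built still matches the pickled dictionary shipped with the
    installed Simplemma version."""
    if not hash_file.exists():
        return False
    return hash_file.read_text() == sha256_of(dict_path)
```

After a Simplemma upgrade with updated dictionaries the hashes no longer match, so the trie gets regenerated; whether that robustness is worth the extra complexity is the question debated later in this thread.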
```python
    self._trie = trie

def __getitem__(self, item):
    return self._trie[item.decode()][0]
```
Having to do this decoding right after encoding feels a bit wasteful every time, and makes me wonder whether changing this to bytestrings was the right call...
While I somewhat agree, it wasn't much better before, as `BytesTrie` expects the keys to be strings and the values to be bytestrings. So while now the keys have to be encoded/decoded, before the values had to be.
```python
from collections.abc import ItemsView, KeysView
```
Could we somehow reuse generic `DictionaryFactory` tests for all the implementations?
For common functionality that would be helpful and I considered implementing dictionary agnostic tests, but concluded that this would be out-of-scope for this PR.
That happens right now when loading the dictionary. Having it as a separate method IMO wouldn't provide much benefit, but wouldn't be a big change either.

```python
for lang in SUPPORTED_LANGUAGES:
    TrieDictionaryFactory().get_dictionary(lang)
```

That's currently done when loading a dictionary:

```python
dictionary = TrieDictionaryFactory().get_dictionary("en")
```

How would you handle regenerating the cached tries after dictionary updates then?
I just pushed another commit fixing the syntax errors in the tests for Python <3.9. The failure to run the tests with Python 3.13.0b1 is caused by

Aside from those failed GitHub actions, the only remaining one is with Python 3.10 on the macOS runner and seems to be unrelated to my changes, as it's failing because of a missing Codecov token.

Python 3.6 and 3.7 are now deprecated and the token error with 3.10 on macOS is a Codecov problem, nothing serious. Thanks for your work @Dunedan! @juanjoDiaz Can we merge the PR or do you have other questions?
I'm all in for having trie-backed dictionaries. However, I'm still not convinced of the implementation.
Thanks for that alternative proposal. 👍 As I'd prefer to keep the discussion in a single place, I'm going to comment on your PR here:
That's what I'd prefer as well, but as that requires changes to the
The way you implemented that has several disadvantages from my point of view:
All of that is why I didn't use a temp directory, but a user-specific cache directory instead.
In the end that's balancing between minimizing the number of times the trie-based dictionaries have to be regenerated and implementation complexity. I (obviously) opted for minimizing the number of regenerations, as I expect Simplemma not to update dictionaries with every release and regenerating them is pretty costly in terms of CPU and memory usage. If Simplemma is going to update dictionaries more frequently than I anticipate, then the hashing wouldn't make much sense of course.
I like that. 👍

I guess it would now be best to integrate some of the changes discussed into this PR? We could first start with the changes you both agree on. Besides, as far as I'm concerned we can make changes in the
We agree on modifying the

I see 3 aspects that we need to align:

How to cache on disk

The more I think about this, the more I think that we should simply store the trie dictionaries in

I think that we need to think of how Simplemma is used. That's also why I propose to have a method that allows pre-generating all tries on disk, which can be a simple documented for loop.

Is hashing needed to notice if dictionaries changed?

I think that Simplemma never updates the dictionaries, because it doesn't have a fully automated mechanism to do so. So, for me the question is: are we going to release sooo many versions of Simplemma that regenerating the tries with every release would become a problem? I don't think so.

And I have the same remark as before. How will Simplemma be used? Should this be part of
If you don't mind, I could also change the
That path might not even be writable. What's the reason for your resistance to just using the user's cache directory as the default?
We can simply go on without the hashing for now. Hashing can easily be added later on, if the dictionaries change more frequently than anticipated.
I believe you're limiting yourself too much with this stance on the use of dependencies. While I'm a fan of as few dependencies as necessary, sometimes adding a third-party library can massively improve the utility of a package. The use of

While I'd prefer to have it as part of Simplemma, as it would be easier to use and maintain, a separate package would be fine for me as well. That's something you guys have to decide on, how you want to handle that.
I would also prefer to integrate it into Simplemma; as long as the installation remains optional, the package is still usable without dependencies. As for the rest of the questions I'll keep on following the discussion, as I have no fixed opinion on this. Maybe keep hashing on the side for now for the sake of simplicity.
Alright, we agreed on:
The only thing left is to agree on the disk caching.
All of that is already implemented in this PR. 🙂
IMO the simplest and most sensible solution is to use the user's cache directory as the default and get its location using

I would use https://docs.python.org/3/library/tempfile.html instead of
So this is the same approach you implemented in your PR, and I already explained above the downsides this approach has. To sum it up: You prefer to have a default configuration which keeps all generated tries in memory and regenerates them on every reboot (or potentially more often), just to avoid having to use an additional, tiny, well maintained, pure-Python dependency. That approach negates the lower-memory benefit of using tries and slows down users' applications, as the tries have to be regenerated regularly. For long-running applications using Simplemma which get started on boot, not caching tries at all would even be better than this approach, as it'd mean lower memory utilization and the tries wouldn't have to be regenerated more often. In my opinion that's not a user-friendly default and I still don't understand the insistence on not using
Since this PR entails substantial work I'd be in favor of moving on with the integration. We can always amend it later to improve on the issues discussed here. @Dunedan The tests do not pass for Python 3.8, could you please have a look at it?

Sorry, I missed that. This should be fixed now. If you're fine with the code as-is, I suggest I rebase & squash the commits in this branch to have a single commit for all the changes and promote the pull request from "Draft" to "Ready for review" afterwards.
Using

Maybe you are using Simplemma very differently.
On Linux
I don't know how you install Python packages, but usually the user doesn't need to know about the dependencies of a package. In this case the user just has to follow the instructions for using tries and install

As explained already, there are more than just this single difference. Regarding race conditions: There is one possible race condition right now: If two processes start in parallel, one doesn't find a cached trie, starts generating it and is still in the process of writing it to disk when the second process tries to read the unfinished file from the cache, then the second process will get an exception, because the not-yet-finished file isn't valid. I expect that case to be very rare though, and fixing it would be as easy as writing the cached trie to a temporary file and moving it to its final position after writing.
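The fix described here, writing to a temporary file and then renaming it into place, could look roughly like this (the function name is illustrative, not from the PR):

```python
import os
import tempfile
from pathlib import Path


def write_cache_atomically(target: Path, data: bytes) -> None:
    # Write to a temp file in the same directory, then rename it into
    # place. os.replace is atomic on both POSIX and Windows, so a
    # concurrent reader sees either the old file or the complete new
    # one, never a partially written cache entry.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the temp file if writing or renaming failed.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

The temp file must live in the same directory as the target, since a rename is only guaranteed to be atomic within one filesystem.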
@Dunedan Thanks for fixing the PR, I believe you can prepare the PR so that it can be merged. (My understanding is that merging the PR is not the end of the discussion and that work on this optional function can continue beyond this point.)
This adds an additional dictionary factory backed by MARISA-tries. This dictionary factory on average offers 20x lower memory usage and 100x faster initialization time, in exchange for reduced lemmatization and language detection performance. The first time loading a dictionary with the `TrieDictionaryFactory` requires more memory and will take a few seconds, as the trie-backed dictionary has to be generated on-the-fly from the pickled dict-based dictionary first.
This changes the format of the dictionary returned by `DictionaryFactory().get_dictionary()` from `Dict[ByteString, ByteString]` to `Mapping[str, str]` to accommodate alternative dictionary factory implementations better and to ease the dictionary handling again. This keeps the storage of pickled dictionaries with byte strings though, as they're smaller than when using strings.

Thanks!