Add Xtransformer to backend #798
base: main
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #798      +/-   ##
==========================================
- Coverage   99.65%   97.21%   -2.44%
==========================================
  Files          91       95       +4
  Lines        6886     7210     +324
==========================================
+ Hits         6862     7009     +147
- Misses         24      201     +177

☔ View full report in Codecov by Sentry.
tests/test_backend.py
Outdated
@@ -95,6 +95,16 @@ def test_get_backend_yake_not_installed():
    assert "YAKE not available" in str(excinfo.value)


@pytest.mark.skipif(
    importlib.util.find_spec("pecos") is not None,
    reason="test requires that YAKE is NOT installed",
PECOS, not YAKE, right?
Oops, yes. Thanks for catching it.
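For reference, a minimal sketch of the decorator with the corrected reason string; the test function name and body below are placeholders, not necessarily what the PR ends up using:

import importlib.util

import pytest


@pytest.mark.skipif(
    importlib.util.find_spec("pecos") is not None,
    reason="test requires that PECOS is NOT installed",
)
def test_get_backend_xtransformer_not_installed():
    # The actual test presumably asserts that requesting the backend raises an
    # error mentioning the missing dependency, mirroring the YAKE test above.
    ...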
Thanks a lot for this new PR @Lakshmi-bashyam! It really helps to have a clean starting point based on the current code.

We've now tested this briefly. We used the PLC (YKL) classification task, because it seemed simpler than predicting YSO subjects, and the current classification quality there (mainly using Omikuji Parabel and Bonsai) is not that good, so it seemed likely that a new algorithm could achieve better results. (And it did!) I set this up in the University of Helsinki HPC environment. We got access to an A100 GPU (which is way overkill for this...) so it was possible to train and evaluate models in a reasonable time. Here are some notes, comments and observations:

Default BERT model missing
Training a model without setting

Documentation and advice
There was some advice and a suggested config in this comment from Moritz. I think we would need something like this to guide users (including us at NLF!) on how to use the backend and what configuration settings to use. Eventually this could be a wiki page for the backend like the others we have already, but for now just a comment in this PR would be helpful for testing. Here is the config I currently use for the YKL classification task in Finnish:
Using the Finnish BERT model improved results a bit compared to the multilingual BERT model. It's a little slower and takes slightly more VRAM (7GB instead of 6GB in this task), probably because it's not a DistilBERT model. This configuration achieves a Precision@1 score of 0.59 on the Finnish YKL classification task, which is slightly higher than what we get with Parabel and Bonsai (0.56-0.57).

If you have any insight into how to choose appropriate configuration settings based on e.g. the training data size, vocabulary size, task type, available hardware etc., that would be very valuable to include in the documentation. Pecos has tons of hyperparameters! Example questions that I wonder about:
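As a side note on the VRAM difference: one quick way to see why a full BERT is heavier to fine-tune than a DistilBERT is to compare layer counts, since more transformer layers mean more activations and gradients held in GPU memory. The checkpoint names below are assumptions, since the exact models aren't spelled out above:

from transformers import AutoConfig

for name in (
    "TurkuNLP/bert-base-finnish-cased-v1",  # full 12-layer BERT
    "distilbert-base-multilingual-cased",   # 6-layer DistilBERT
):
    config = AutoConfig.from_pretrained(name)
    # DistilBERT configs name this attribute differently, so fall back if needed
    layers = getattr(config, "num_hidden_layers", None) or getattr(config, "n_layers", None)
    print(f"{name}: {layers} transformer layers")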
Pecos FutureWarning
I saw this warning a lot:
However, I think this is a problem in Pecos and probably not something we can easily fix ourselves. Maybe it will be fixed in a later release of Pecos. (I used libpecos 1.25, which is currently the most recent release on PyPI.)

Not working under Python 3.11
I first tried Python 3.11, but it seemed that there was no

Unit tests not run under CI
The current tests seem to do a lot of mocking to avoid actually training models. This is probably sensible, since actually training a model could require lots of resources. However, the end result is that test coverage is quite low, with less than 10% of lines covered. Looking more closely, it seems like most of the tests aren't currently executed at all under GitHub Actions CI. I suspect this is because this is an optional dependency and it's not installed at all in the CI environment, so the tests are skipped. Fixing this in the CI config (

Code style and QA issues
There are some complaints from QA tools about the current code. These should be easy to fix. Not super urgent, but they should be fixed before we can consider merging this. (If some things are hard to fix, we can reconsider them case by case.)
Dependency on PyTorch
Installing this optional dependency brings in a lot of dependencies, including PyTorch and CUDA. The virtualenv in my case (using

Also, the NN ensemble backend is implemented using TensorFlow. It seems a bit wasteful to depend on both TensorFlow and PyTorch. Do you think it would make sense to try to reimplement the NN ensemble in PyTorch? That way we could at least drop the dependency on TensorFlow.

Again, thanks a lot for this, and apologies for the long silence and the long comments! We can of course do some of the remaining work to get this integrated and merged on our side, because this seems like a very useful addition to the Annif backends. Even if you don't have any time to work on the code, just providing some advice on the configuration side would help a lot! For example, the configurations you've used at ZBW would be nice to see.
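As a rough illustration of the PyTorch reimplementation idea mentioned above, the core of such a model could be quite small. This is only a toy sketch with arbitrary layer sizes, not the actual Annif NN ensemble architecture; a real backend would also need the training loop, early stopping and model (de)serialization that the TensorFlow version provides.

import torch
from torch import nn


class ScoreEnsemble(nn.Module):
    """Toy sketch only: merge suggestion score vectors from several base
    projects into a single score vector."""

    def __init__(self, n_sources: int, n_subjects: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                             # (batch, n_sources * n_subjects)
            nn.Linear(n_sources * n_subjects, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, n_subjects),
            nn.Sigmoid(),                             # merged scores in [0, 1]
        )

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_sources, n_subjects)
        return self.net(scores)


# Merge scores from 2 base projects over a toy 10-subject vocabulary.
model = ScoreEnsemble(n_sources=2, n_subjects=10)
merged = model(torch.rand(4, 2, 10))
print(merged.shape)  # torch.Size([4, 10])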
Quality Gate failed
Failed conditions
I built a Docker image from this branch, and its size is 7.21 GB, which is quite a bit bigger than the Annif 1.1 image (2.07 GB). Not all users and use cases will need Xtransformer or other optional dependencies, so we could build different variants of the image and push them to quay.io (just by setting different build args in the GitHub Actions build step and tagging the images appropriately). But that can be done in a separate PR; I'll create an issue for this now.
Hello,

A good starting point might be the hyperparameters used in the original paper. They can be found here. Different settings were used for different datasets. We also observed that the choice of transformer model can have an impact on the results. In the original paper and in our experiments, the RoBERTa model performed well. We used xlm-roberta-base, a multilingual model trained on 100 languages.
We found that tuning the hyperparameters associated with the Partitioned Label Tree (known as Indexer in XR-Transformer) and the hyperparameters of the OVA classifiers (known as Ranker in XR-Transformer) led to notable improvements in our results. In particular:
As far as I can tell, some of these are not currently integrated in the PR here.
The maximum length of the transformer model limits this. For instance, for BERT this is 512. The authors noted that there was no significant performance increase when using 512, and we observed the same thing.
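To make the limit concrete, this is where truncation is applied when texts are tokenized for a BERT-family model. A minimal sketch; the checkpoint name is only an example:

from transformers import AutoTokenizer

# BERT-family encoders accept at most 512 tokens per input (including the
# special [CLS] and [SEP] tokens); anything longer is simply cut off.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoded = tokenizer("A very long document ... " * 500, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512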
This also depends on how big a batch fits into the memory of the GPUs/CPUs being used. Generally, starting out with a value like 32 or 64 works well, then increasing it (if possible) to see if this leads to improvements. I also found this forum exchange where it's stated that:
I have attached the hyperparameter configuration file that we currently use. Even though we don't use Annif in our experiments, I hope this can still provide some helpful insights. params.txt

I am happy to answer any questions and contribute to the Wiki if needed!
This PR adds xtransformer as an optional dependency, incorporating minor changes and updating the backend implementation to align with the latest Annif version, building on the previous xtransformer PR #540.