
Add Xtransformer to backend #798

Draft · wants to merge 21 commits into main
Conversation

Lakshmi-bashyam

This PR adds xtransformer as an optional dependency, incorporating minor changes and updating the backend implementation to align with the latest Annif version, building on the previous xtransformer PR #540.

Code scanning alerts marked as fixed:
  • annif/backend/xtransformer.py (5 alerts)
  • annif/util.py (1 alert)

codecov bot commented Sep 17, 2024

Codecov Report

Attention: Patch coverage is 36.10108% with 177 lines in your changes missing coverage. Please review.

Project coverage is 97.21%. Comparing base (125565e) to head (4c33a31).
Report is 49 commits behind head on main.

Files with missing lines             Patch %   Lines
annif/backend/xtransformer.py          7.36%   88 Missing ⚠️
tests/test_backend_xtransformer.py     9.27%   88 Missing ⚠️
annif/backend/__init__.py             83.33%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #798      +/-   ##
==========================================
- Coverage   99.65%   97.21%   -2.44%     
==========================================
  Files          91       95       +4     
  Lines        6886     7210     +324     
==========================================
+ Hits         6862     7009     +147     
- Misses         24      201     +177     

☔ View full report in Codecov by Sentry.

@@ -95,6 +95,16 @@ def test_get_backend_yake_not_installed():
assert "YAKE not available" in str(excinfo.value)


@pytest.mark.skipif(
importlib.util.find_spec("pecos") is not None,
reason="test requires that YAKE is NOT installed",
Member
PECOS, not YAKE, right?

Author
Oops, yes. Thanks for catching it.
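
For reference, the corrected guard would read something like the sketch below (the test function name is illustrative, not necessarily the one in the PR):

import importlib.util

import pytest


@pytest.mark.skipif(
    importlib.util.find_spec("pecos") is not None,
    reason="test requires that PECOS is NOT installed",
)
def test_get_backend_xtransformer_not_installed():
    # Like the YAKE test above: requesting the xtransformer backend without
    # pecos installed should raise a clear "not available" error.
    ...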

@osma (Member) commented Sep 25, 2024

Thanks a lot for this new PR @Lakshmi-bashyam ! It really helps to have a clean starting point based on the current code.

We've now tested this briefly. We used the PLC (YKL) classification task, because it seemed simpler than predicting YSO subjects and the current classification quality (mainly using Omikuji Parabel and Bonsai) is not that good, so it seems likely that a new algorithm could achieve better results. (And it did!)

I set this up in the University of Helsinki HPC environment. We got access to an A100 GPU (which is way overkill for this...) so it was possible to train and evaluate models in a reasonable time.

Here are some notes, comments and observations:

Default BERT model missing

Training a model without setting model_shortcut didn't work for me. Apparently the model distilbert-base-multilingual-uncased cannot be found on HuggingFace Hub (maybe it has been deleted?). I set model_shortcut="distilbert-base-multilingual-cased" and it started working. (Later I changed to another BERT model, see below)
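
As an aside, one way to sanity-check a model_shortcut value before starting a long training run is to look it up on the Hub first; a minimal sketch using huggingface_hub (just an illustration, not part of this PR):

from huggingface_hub import model_info

# model_info() raises an error if the repo id cannot be found on the Hub;
# here the "-cased" variant resolves while the "-uncased" one does not.
for repo_id in (
    "distilbert-base-multilingual-uncased",
    "distilbert-base-multilingual-cased",
):
    try:
        model_info(repo_id)
        print(repo_id, "found on the Hub")
    except Exception:
        print(repo_id, "NOT found on the Hub")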

Documentation and advice

There was some advice and a suggested config in this comment from Moritz. I think we would need something like this to guide users (including us at NLF!) on how to use the backend and what configuration settings to use. Eventually this could be a wiki page for the backend like the others we have already, but for now just a comment in this PR would be helpful for testing.

Here is the config I currently use for the YKL classification task in Finnish:

[ykl-xtransformer-fi]
name="YKL XTransformer Finnish"
language="fi"
backend="xtransformer"
analyzer="simplemma(fi)"
vocab="ykl"
batch_size=16
truncate_length=256
learning_rate=0.0001
num_train_epochs=3
max_leaf_size=18000
model_shortcut="TurkuNLP/bert-base-finnish-cased-v1"

Using the Finnish BERT model improved results a bit compared to the multilingual BERT model. It's a little slower and takes slightly more VRAM (7GB instead of 6GB in this task), probably because it's not a DistilBERT model.

This configuration achieves a Precision@1 score of 0.59 on the Finnish YKL classification task, which is slightly higher than what we get with Parabel and Bonsai (0.56-0.57).

If you have any insight into how to choose appropriate configuration settings based on e.g. the training data size, vocabulary size, task type, available hardware etc., that would be very valuable to include in the documentation. Pecos has tons of hyperparameters!

Example questions that I wonder about:

  1. Does the analyzer setting affect what the BERT model sees? I don't think so?
  2. How to select the number of epochs? (so far I've tried 1, 2 and 3 and got the best results with 3 epochs)
  3. How to set truncate_length and what is the maximum value? Can I increase it from 256 if my documents are longer than this?
  4. How to set max_leaf_size?
  5. How to set batch_size?
  6. Are there other important settings/hyperparameters that could be tuned for better results?

Pecos FutureWarning

I saw this warning a lot:

/home/xxx/.cache/pypoetry/virtualenvs/annif-fDHejL2r-py3.10/lib/python3.10/site-packages/pecos/xmc/xtransformer/matcher.py:411: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.

However, I think this is a problem in Pecos and probably not something we can easily fix ourselves. Maybe it will be fixed in a later release of Pecos. (I used libpecos 1.25 which is currently the most recent release on PyPI)
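
Until it is addressed upstream, the warning can at least be muted when running Annif; a minimal sketch using only the standard library (a workaround that assumes the pickle being loaded is Annif's own trained model, not a fix):

import warnings

# Hide the torch.load weights_only FutureWarning raised via pecos's matcher.
# This only silences the message; the loading behaviour itself is unchanged.
warnings.filterwarnings(
    "ignore",
    message=r"You are using `torch.load` with `weights_only=False`",
    category=FutureWarning,
)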

Not working under Python 3.11

I first tried Python 3.11, but it seemed that there was no libpecos wheel for this Python version available on PyPI (and it couldn't be built automatically for some reason). So I switched to Python 3.10 for my tests. Again, this is really a problem with libpecos and not with the backend itself.

Unit tests not run under CI

The current tests seem to do a lot of mocking to avoid actually training models. This is probably sensible since actually training a model could require lots of resources. However, the end result is that test coverage is quite low, with less than 10% of lines covered.

Looking more closely, it seems that most of the tests aren't currently executed at all under GitHub Actions CI. I suspect this is because pecos is an optional dependency that isn't installed in the CI environment, so the tests are skipped. Fixing this in the CI config (.github/workflows/cicd.yml) should substantially improve the test coverage.
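
For context, the usual pattern in these optional-dependency test modules is to skip everything when the dependency can't be imported, which is why coverage stays near zero unless CI installs the extra; roughly (illustrative, the actual test file may differ):

import pytest

# Skip the whole test module when pecos is not importable; under a CI job that
# doesn't install the optional dependency, every test here is reported as skipped.
pecos = pytest.importorskip("pecos")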

Code style and QA issues

There are some complaints from QA tools about the current code. These should be easy to fix. Not super urgent, but they should be fixed before we can consider merging this. (If some things are hard to fix we can reconsider them case by case)

  • Lint with Black fails in the CI run. The code doesn't follow Black style. Easy to fix by running black
  • SonarCloud complains about a few variable names and return types
  • github-advanced-security complains about imports (see previous comment above)

Dependency on PyTorch

Installing this optional dependency brings in a lot of dependencies, including PyTorch and CUDA. The virtualenv in my case (using poetry install --all-extras) is 5.7GB, while another one for the main branch (without pecos) is 2.6GB, an increase of over 3GB. I wonder if there is any way to reduce this? Especially if we want to include this in the Docker images, the huge size could become a problem.

Also, the NN ensemble backend is implemented using TensorFlow. It seems a bit wasteful to depend on both TensorFlow and PyTorch. Do you think it would make sense to try to reimplement the NN ensemble in PyTorch? This way we could at least drop the dependency on TensorFlow.
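
For what it's worth, the network itself is small, so a PyTorch port should be mostly plumbing; a rough, hypothetical sketch of a comparable model (this is not the actual Annif NN ensemble architecture, just an illustration of merging base-project score vectors):

import torch
from torch import nn


class EnsembleNet(nn.Module):
    """Merges score vectors from several base projects into one prediction."""

    def __init__(self, n_sources: int, n_subjects: int, hidden: int = 100):
        super().__init__()
        self.hidden = nn.Linear(n_sources * n_subjects, hidden)
        self.output = nn.Linear(hidden, n_subjects)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (batch, n_sources * n_subjects), flattened base-project scores
        h = torch.relu(self.hidden(scores))
        return torch.sigmoid(self.output(h))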


Again, thanks a lot for this and apologies for the long silence and the long comments! We can of course do some of the remaining work to get this integrated and merged on our side, because this seems like a very useful addition to the Annif backends. Even if you don't have any time to work on the code, just providing some advice on the configuration side would help a lot! For example, example configurations you've used at ZBW would be nice to see.


sonarcloud bot commented Sep 25, 2024

Quality Gate failed

Failed conditions
11.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarCloud

@juhoinkinen (Member)

Especially if we want to include this in the Docker images, the huge size could become a problem.

I built a Docker image from this branch, and its size is 7.21 GB, which is quite a bit bigger than the Annif 1.1 image at 2.07 GB.

Not all users and use cases will need Xtransformer or the other optional dependencies, so we could build different variants of the image and push them to quay.io (just by setting different build args in the GitHub Actions build step and tagging the images appropriately). But that can be done in a separate PR; I'll create an issue for this now.

@katjakon commented Oct 1, 2024

Hello,
Thank you for your work on this PR!
At the German National Library, we are also experimenting with XR-Transformer. We would be glad to contribute, especially with regards to documentation and training advice.

A good starting point might be the hyperparameters used in the original paper. They can be found here. Different settings were used for different datasets.

We also observed that the choice of Transformer model can have an impact on the results. In the original paper and in our experiments, a RoBERTa model performed well. We used xlm-roberta-base, a multilingual model trained on 100 languages.

Are there other important settings/hyperparameters that could be tuned for better results?

We found that tuning the hyperparameters associated with the Partitioned Label Tree (known as Indexer in XR-Transformer) and the hyperparameters of the OVA classifiers (known as Ranker in XR-Transformer) led to notable improvements in our results. In particular:

  • nr_splits (& min_codes): Number of child nodes. This hyperparameter can be compared to cluster_k in Omikuji. For us, bigger values like 256 led to better results.
  • max_leaf_size: We observed that bigger values perform better. We currently use 400.
  • Cp & Cn are the costs for wrongly classified labels used in the OVA classifiers. Cp is the cost for wrongly classified positive labels, Cn is the cost for negative labels. Using different penalities for positive and negative labels is especially helpful when labels are imbalanced, which is probably the case for OVA classifiers. These hyperparameters had a huge influence on our results. Further reading
  • threshold: A regularisation method. Model weights in the OVA classifiers that fall below the threshold are set to zero. Choosing a high value here will reduce model size, but might lead to a model that is underfitting. Choosing a very low value might lead to overfitting. We achieve good performance with 0.015.

As far as I can tell, some of these are not currently integrated in the PR here.
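
To collect the values mentioned above in one place, here is an illustrative summary in XR-Transformer/Pecos terminology (as noted, not all of these are exposed by the backend in this PR yet, and the pairing of min_codes with nr_splits is our assumption):

# Indexer (Partitioned Label Tree)
nr_splits = 256       # child nodes per level; comparable to cluster_k in Omikuji
min_codes = 256       # assumed here to be set alongside nr_splits
max_leaf_size = 400   # larger leaves performed better for us

# Ranker (OVA classifiers)
threshold = 0.015     # zero out OVA weights below this value (regularisation)
# Cp and Cn (costs for misclassified positive/negative labels) are also worth
# tuning, especially with imbalanced labels; good values are dataset-dependent.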

How to set truncate_length and what is the maximum value? Can I increase it from 256 if my documents are longer than this?

The maximum length of the transformer model limits this. For instance, for BERT this is 512. The authors noted that there was no significant performance increase when using 512, and we observed the same thing.
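
One way to check that ceiling for a particular model_shortcut is to ask the tokenizer; a small sketch using the transformers library (which the xtransformer stack pulls in anyway):

from transformers import AutoTokenizer

# BERT-style models report 512 here; a truncate_length above this has no effect.
tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
print(tokenizer.model_max_length)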

How to set batch_size?

This also depends on how big a batch fits into the memory of the GPU/CPU being used. Generally, starting with a value like 32 or 64 works well, then increasing it (if possible) to see if this leads to improvements. I also found this forum exchange where it's stated that:

Batch size is a slider on the learning process.
Small values give a learning process that converges quickly at the cost of noise in the training process.
Large values give a learning process that converges slowly with accurate estimates of the error gradient.

I have attached the hyperparameter configuration file that we currently use. Even though we don't use Annif in our experiments, I hope this can still provide some helpful insights. params.txt

I am happy to answer any questions and contribute to the Wiki if needed!
