Squashed commit with changes since version 3.2
Fix resume_from_checkpoint (adapter-hub#514)

Add initialization of a variable so that invalid checkpoints throw an understandable error.
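
A minimal sketch of what such a guard might look like (the variable name and error message are assumptions for illustration, not taken from the actual patch):

```python
# Initialize before searching for checkpoints so that a missing or invalid
# checkpoint path fails with a clear message instead of an UnboundLocalError.
best_adapter_checkpoint = None  # hypothetical variable name

# ... checkpoint discovery would assign best_adapter_checkpoint here ...

if best_adapter_checkpoint is None:
    raise ValueError("Can't find a valid checkpoint at the given resume_from_checkpoint path.")
```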

Fix LoRA & (IA)³ implementation for Bart & MBart (adapter-hub#518)

Fixes a critical issue in the LoRA & (IA)³ implementation of Bart & MBart, where LoRA & (IA)³ weights were not added to the intermediate and output linear layers of the model's decoder blocks.

I.e., adapter configs with intermediate_lora=True or output_lora=True were added incorrectly to (M)Bart models. For LoRA, this does not affect the default config; for (IA)³, it does (intermediate_lora=True).

To ensure correct addition of weights in the future, get_adapter() tests are updated to count the number of modules added per adapter.
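
For illustration only (these snippets are assumptions based on the description above, not part of the commit): the default (IA)³ config already sets intermediate_lora=True and was therefore affected on (M)Bart, while LoRA was only affected when these flags were enabled explicitly.

```python
from transformers.adapters import IA3Config, LoRAConfig

ia3_config = IA3Config()  # default already has intermediate_lora=True -> affected on (M)Bart
lora_config = LoRAConfig(intermediate_lora=True, output_lora=True)  # LoRA affected only with explicit opt-in
```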

Fix python3.7 Compatibility (adapter-hub#510)

Compatibility with Python 3.8+ and PyTorch 1.12.1+.

Restore compatibility in GPT-2 LoRALinear bias init (adapter-hub#525)

Fix compacter init weights (adapter-hub#516)

Update doc chapter "Getting Started" (adapter-hub#527)

Update version to 3.2.1

Fix Notebook01 Dataset column_rename (adapter-hub#543)

Update doc chapter "Adapter Methods" (adapter-hub#535)

Do not stale issues labeled as bugs (adapter-hub#550)
hSterz committed Aug 5, 2023
1 parent 0d95a52 commit 970c5d1
Showing 44 changed files with 590 additions and 383 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/stale.yml
@@ -12,10 +12,10 @@ jobs:
issues: write

steps:
- uses: actions/stale@v6
- uses: actions/stale@v8
with:
repo-token: "${{ secrets.BOT_TOKEN }}"
exempt-issue-labels: 'do-not-stale,enhancement'
exempt-issue-labels: 'do-not-stale,enhancement,bug'
stale-issue-message: 'This issue has been automatically marked as stale because it has been without activity for 90 days. This issue will be closed in 14 days unless you comment or remove the stale label.'
close-issue-message: 'This issue was closed because it was stale for 14 days without any activity.'
days-before-issue-stale: 90
18 changes: 9 additions & 9 deletions .github/workflows/tests_torch.yml
@@ -25,11 +25,11 @@ jobs:
check_code_quality:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: 3.8
- uses: actions/cache@v2
- uses: actions/cache@v3
with:
path: ~/.cache/pip
key: ${{ runner.os }}-pip-${{ hashFiles('setup.py') }}
@@ -45,11 +45,11 @@
timeout-minutes: 60
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: 3.8
- uses: actions/cache@v2
- uses: actions/cache@v3
with:
path: ~/.cache/pip
key: ${{ runner.os }}-pip-${{ hashFiles('setup.py') }}
@@ -67,11 +67,11 @@
timeout-minutes: 60
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/setup-python@v2
- uses: actions/checkout@v3
- uses: actions/setup-python@v4
with:
python-version: 3.8
- uses: actions/cache@v2
- uses: actions/cache@v3
with:
path: ~/.cache/pip
key: ${{ runner.os }}-pip-${{ hashFiles('setup.py') }}
2 changes: 1 addition & 1 deletion README.md
@@ -36,7 +36,7 @@ Thus, most files in this repository are direct copies from the HuggingFace Trans

## Installation

`adapter-transformers` currently supports **Python 3.7+** and **PyTorch 1.3.1+**.
`adapter-transformers` currently supports **Python 3.8+** and **PyTorch 1.12.1+**.
After [installing PyTorch](https://pytorch.org/get-started/locally/), you can install `adapter-transformers` from PyPI ...

```
2 changes: 1 addition & 1 deletion adapter_docs/conf.py
@@ -26,7 +26,7 @@
docs_versions = [
"adapters1.1.1",
"adapters2.3.0",
"adapters3.2.0",
"adapters3.2.1",
]


6 changes: 3 additions & 3 deletions adapter_docs/installation.md
@@ -1,13 +1,13 @@
# Installation

Our *adapter-transformers* package is a drop-in replacement for Huggingface's *transformers* library.
It currently supports Python 3.7+ and PyTorch 1.3.1+. You will have to [install PyTorch](https://pytorch.org/get-started/locally/) first.
It currently supports Python 3.8+ and PyTorch 1.12.1+. You will have to [install PyTorch](https://pytorch.org/get-started/locally/) first.

```{eval-rst}
.. important::
``adapter-transformers`` is a direct fork of ``transformers``.
This means our package includes all the awesome features of HuggingFace's original package plus the adapter implementation.
As both packages share the same namespace, they ideally should not installed in the same environment.
This means our package includes all the awesome features of HuggingFace's original package, plus the adapter implementation.
As both packages share the same namespace, they ideally should not be installed in the same environment.
```

## Using pip
12 changes: 6 additions & 6 deletions adapter_docs/method_combinations.md
@@ -2,8 +2,8 @@

_Configuration class_: [`ConfigUnion`](transformers.ConfigUnion)

While different efficient fine-tuning methods and configurations have often been proposed as standalone, it might be beneficial to combine them for joint training.
To make this process easier, adapter-transformers provides the possibility to group multiple configuration instances together using the `ConfigUnion` class.
While different efficient fine-tuning methods and configurations have often been proposed as standalone, combining them for joint training might be beneficial.
To make this process easier, `adapter-transformers` provides the possibility to group multiple configuration instances using the [`ConfigUnion`](transformers.ConfigUnion) class.

For example, this could be used to define different reduction factors for the adapter modules placed after the multi-head attention and the feed-forward blocks:
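
The code example itself is elided by the diff view; a sketch of what such a union might look like (the reduction factors are assumed values, and `model` is assumed to be an adapter-enabled model):

```python
from transformers.adapters import AdapterConfig, ConfigUnion

# One bottleneck adapter after the multi-head attention block, another after
# the feed-forward block, each with its own reduction factor.
config = ConfigUnion(
    AdapterConfig(mh_adapter=True, output_adapter=False, reduction_factor=16, non_linearity="relu"),
    AdapterConfig(mh_adapter=False, output_adapter=True, reduction_factor=2, non_linearity="relu"),
)
model.add_adapter("union_adapter", config=config)
```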

@@ -22,8 +22,8 @@ model.add_adapter("union_adapter", config=config)
_Configuration class_: [`MAMConfig`](transformers.MAMConfig)

[He et al. (2021)](https://arxiv.org/pdf/2110.04366.pdf) study various variants and combinations of efficient fine-tuning methods.
Among others, they propose _Mix-and-Match Adapters_ as a combination of Prefix Tuning and parallel bottleneck adapters.
This configuration is supported by adapter-transformers out-of-the-box:
They propose _Mix-and-Match Adapters_ as a combination of Prefix Tuning and parallel bottleneck adapters.
This configuration is supported by `adapter-transformers` out-of-the-box:

```python
from transformers.adapters import MAMConfig
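# The diff truncates this example here; it presumably continues roughly as follows
# (adapter name chosen for illustration, `model` assumed to be adapter-enabled).
config = MAMConfig()
model.add_adapter("mam_adapter", config=config)
```
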
@@ -68,7 +68,7 @@ Concretely, for each adapted module $m$, UniPELT adds a trainable gating value $

$$\mathcal{G}_m \leftarrow \sigma(W_{\mathcal{G}_m} \cdot x)$$

These gating values are then used to scale the output activations of the injected adapter modules, e.g. for a LoRA layer:
These gating values are then used to scale the output activations of the injected adapter modules, e.g., for a LoRA layer:

$$
h \leftarrow W_0 x + \mathcal{G}_{LoRA} B A x
@@ -77,7 +77,7 @@
In the configuration classes of `adapter-transformers`, these gating mechanisms can be activated via `use_gating=True`.
The full UniPELT setup can be instantiated using `UniPELTConfig`[^unipelt]:

[^unipelt]: Note that the implementation of UniPELT in `adapter-transformers` follows the implementation in the original code, which is slighlty different from the description in the paper. See [here](https://github.com/morningmoni/UniPELT/issues/1) for more.
[^unipelt]: Note that the implementation of UniPELT in `adapter-transformers` follows the implementation in the original code, which is slightly different from the description in the paper. See [here](https://github.com/morningmoni/UniPELT/issues/1) for more.

```python
from transformers.adapters import UniPELTConfig
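# The diff truncates this example here; it presumably continues roughly as follows
# (adapter name chosen for illustration).
config = UniPELTConfig()
model.add_adapter("unipelt_adapter", config=config)
```
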
50 changes: 25 additions & 25 deletions adapter_docs/methods.md
@@ -1,7 +1,7 @@
# Adapter Methods

On this page, we present all adapter methods currently integrated into the `adapter-transformers` library.
A tabulary overview of adapter methods is provided [here](overview.html#table-of-adapter-methods)
A tabular overview of adapter methods is provided [here](overview.html#table-of-adapter-methods).
Additionally, options to combine multiple adapter methods in a single setup are presented [on the next page](method_combinations.md).

## Bottleneck Adapters
@@ -15,7 +15,7 @@ $$
h \leftarrow W_{up} \cdot f(W_{down} \cdot h) + r
$$

Depending on the concrete adapter configuration, these layers can be introduced at different locations within a Transformer block. Further, residual connections, layer norms, activation functions and bottleneck sizes etc. can be configured.
Depending on the concrete adapter configuration, these layers can be introduced at different locations within a Transformer block. Further, residual connections, layer norms, activation functions, bottleneck sizes, etc., can be configured.

The most important configuration hyperparameter to be highlighted here is the bottleneck dimension $d_{bottleneck}$.
In adapter-transformers, this bottleneck dimension is specified indirectly via the `reduction_factor` attribute of a configuration.
@@ -25,7 +25,7 @@ $$
\text{reduction_factor} = \frac{d_{hidden}}{d_{bottleneck}}
$$
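
As a quick numeric illustration (the hidden size is an assumed example value, not from the original text):

```python
hidden_size = 768                      # e.g., BERT-base hidden dimension (assumed)
reduction_factor = 16                  # value chosen in the adapter configuration
d_bottleneck = hidden_size // reduction_factor
print(d_bottleneck)                    # 48
```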

A visualization of further configuration options related to the adapter structure is given in the figure below. For more details, refer to the documentation of [`AdapterConfig`](transformers.AdapterConfig).
A visualization of further configuration options related to the adapter structure is given in the figure below. For more details, we refer to the documentation of [`AdapterConfig`](transformers.AdapterConfig).


```{eval-rst}
@@ -37,11 +37,11 @@ A visualization of further configuration options related to the adapter structur
Visualization of possible adapter configurations with corresponding dictionary keys.
```

adapter-transformers comes with pre-defined configurations for some bottleneck adapter architectures proposed in literature:
`adapter-transformers` comes with pre-defined configurations for some bottleneck adapter architectures proposed in literature:

- [`HoulsbyConfig`](transformers.HoulsbyConfig) as proposed by [Houlsby et al. (2019)](https://arxiv.org/pdf/1902.00751.pdf) places adapter layers after both the multi-head attention and feed-forward block in each Transformer layer.
- [`PfeifferConfig`](transformers.PfeifferConfig) as proposed by [Pfeiffer et al. (2020)](https://arxiv.org/pdf/2005.00052.pdf) places an adapter layer only after the feed-forward block in each Transformer layer.
- [`ParallelConfig`](transformers.ParallelConfig) as proposed by [He et al. (2021)](https://arxiv.org/pdf/2110.04366.pdf) places adapter layers in parallel to the original Transformer layers.
- [`HoulsbyConfig`](transformers.HoulsbyConfig), as proposed by [Houlsby et al. (2019)](https://arxiv.org/pdf/1902.00751.pdf), places adapter layers after both the multi-head attention and feed-forward block in each Transformer layer.
- [`PfeifferConfig`](transformers.PfeifferConfig), as proposed by [Pfeiffer et al. (2020)](https://arxiv.org/pdf/2005.00052.pdf), places an adapter layer only after the feed-forward block in each Transformer layer.
- [`ParallelConfig`](transformers.ParallelConfig), as proposed by [He et al. (2021)](https://arxiv.org/pdf/2110.04366.pdf), places adapter layers in parallel to the original Transformer layers.

_Example_:
```python
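# The diff truncates this example here; a typical bottleneck setup presumably looks
# roughly like this (flag values are assumptions, `model` assumed to be adapter-enabled).
from transformers.adapters import AdapterConfig

config = AdapterConfig(mh_adapter=True, output_adapter=True, reduction_factor=16, non_linearity="relu")
model.add_adapter("bottleneck_adapter", config=config)
```
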
@@ -68,7 +68,7 @@ To perform zero-shot cross-lingual transfer, one language adapter can simply be

In terms of architecture, language adapters are largely similar to regular bottleneck adapters, except for an additional _invertible adapter_ layer after the LM embedding layer.
Embedding outputs are passed through this invertible adapter in the forward direction before entering the first Transformer layer and in the inverse direction after leaving the last Transformer layer.
Invertible adapter architectures are further detailed in [Pfeiffer et al. (2020)](https://arxiv.org/pdf/2005.00052.pdf) and can be configured via the `inv_adapter` attribute of the `AdapterConfig` class.
Invertible adapter architectures are further detailed in [Pfeiffer et al. (2020)](https://arxiv.org/pdf/2005.00052.pdf) and can be configured via the `inv_adapter` attribute of the [`AdapterConfig`](transformers.AdapterConfig) class.

_Example_:
```python
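# The diff truncates this example here; it presumably uses one of the invertible-adapter
# configs, roughly along these lines (adapter name chosen for illustration).
from transformers.adapters import PfeifferInvConfig

config = PfeifferInvConfig()
model.add_adapter("lang_adapter", config=config)
```
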
@@ -101,13 +101,13 @@ _Configuration class_: [`PrefixTuningConfig`](transformers.PrefixTuningConfig)
```

Prefix Tuning ([Li and Liang, 2021](https://aclanthology.org/2021.acl-long.353.pdf)) introduces new parameters in the multi-head attention blocks in each Transformer layer.
More, specifically, it prepends trainable prefix vectors $P^K$ and $P^V$ to the keys and values of the attention head input, each of a configurable prefix length $l$ (`prefix_length` attribute):
More specifically, it prepends trainable prefix vectors $P^K$ and $P^V$ to the keys and values of the attention head input, each of a configurable prefix length $l$ (`prefix_length` attribute):

$$
head_i = \text{Attention}(Q W_i^Q, [P_i^K, K W_i^K], [P_i^V, V W_i^V])
$$

Following the original authors, the prefix vectors in $P^K$ and $P^V$ are note optimized directly, but reparameterized via a bottleneck MLP.
Following the original authors, the prefix vectors in $P^K$ and $P^V$ are not optimized directly but reparameterized via a bottleneck MLP.
This behavior is controlled via the `flat` attribute of the configuration.
Using `PrefixTuningConfig(flat=True)` will create prefix tuning vectors that are optimized without reparameterization.

@@ -119,7 +119,7 @@ config = PrefixTuningConfig(flat=False, prefix_length=30)
model.add_adapter("prefix_tuning", config=config)
```

As reparameterization using the bottleneck MLP is not necessary for performing inference on an already trained Prefix Tuning module, adapter-transformers includes a function to "eject" a reparameterized Prefix Tuning into a flat one:
As reparameterization using the bottleneck MLP is not necessary for performing inference on an already trained Prefix Tuning module, `adapter-transformers` includes a function to "eject" a reparameterized Prefix Tuning into a flat one:
```python
model.eject_prefix_tuning("prefix_tuning")
```
@@ -150,9 +150,9 @@ for a PHM layer by specifying `use_phm=True` in the config.
The PHM layer has the following additional properties: `phm_dim`, `shared_phm_rule`, `factorized_phm_rule`, `learn_phm`,
`factorized_phm_W`, `shared_W_phm`, `phm_c_init`, `phm_init_range`, `hypercomplex_nonlinearity`

For more information check out the [`AdapterConfig`](transformers.AdapterConfig) class.
For more information, check out the [`AdapterConfig`](transformers.AdapterConfig) class.

To add a Compacter to your model you can use the predefined configs:
To add a Compacter to your model, you can use the predefined configs:
```python
from transformers.adapters import CompacterConfig
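# The diff truncates this example here; it presumably continues roughly as follows
# (adapter name chosen for illustration).
config = CompacterConfig()
model.add_adapter("compacter_adapter", config=config)
```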

@@ -177,15 +177,15 @@ _Configuration class_: [`LoRAConfig`](transformers.LoRAConfig)

Low-Rank Adaptation (LoRA) is an efficient fine-tuning technique proposed by [Hu et al. (2021)](https://arxiv.org/pdf/2106.09685.pdf).
LoRA injects trainable low-rank decomposition matrices into the layers of a pre-trained model.
For any model layer expressed as a matrix multiplication of the form $h = W_0 x$, it therefore performs a reparameterization, such that:
For any model layer expressed as a matrix multiplication of the form $h = W_0 x$, it performs a reparameterization such that:

$$
h = W_0 x + \frac{\alpha}{r} B A x
$$

Here, $A \in \mathbb{R}^{r\times k}$ and $B \in \mathbb{R}^{d\times r}$ are the decomposition matrices and $r$, the low-dimensional rank of the decomposition, is the most important hyperparameter.

While, in principle, this reparameterization can be applied to any weights matrix in a model, the original paper only adapts the attention weights of the Transformer self-attention sub-layer with LoRA.
While, in principle, this reparameterization can be applied to any weight matrix in a model, the original paper only adapts the attention weights of the Transformer self-attention sub-layer with LoRA.
`adapter-transformers` additionally allows injecting LoRA into the dense feed-forward layers in the intermediate and output components of a Transformer block.
You can configure the locations where LoRA weights should be injected using the attributes in the [`LoRAConfig`](transformers.LoRAConfig) class.
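
A sketch of how these location attributes might be combined (the flag and rank values are assumptions for illustration, `model` assumed to be adapter-enabled):

```python
from transformers.adapters import LoRAConfig

# Inject LoRA into the self-attention weights as well as the intermediate and
# output feed-forward layers of each Transformer block.
config = LoRAConfig(r=8, alpha=16, selfattn_lora=True, intermediate_lora=True, output_lora=True)
model.add_adapter("lora_all_locations", config=config)
```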

@@ -200,14 +200,14 @@ model.add_adapter("lora_adapter", config=config)
In the design of LoRA, Hu et al. (2021) also pay special attention to keeping the inference latency overhead compared to full fine-tuning at a minimum.
To accomplish this, the LoRA reparameterization can be merged with the original pre-trained weights of a model for inference.
Thus, the adapted weights are directly used in every forward pass without passing activations through an additional module.
In `adapter-transformers`, this can be realized using the built-in `merge_adapter()` method:
In `adapter-transformers`, this can be realized using the built-in [`merge_adapter()`](transformers.ModelAdaptersMixin.merge_adapter) method:
```python
model.merge_adapter("lora_adapter")
```

To continue training on this LoRA adapter or to deactivate it entirely, the merged weights first have to be reset again:
```python
model.reset_adapter("lora_adapter")
model.reset_adapter()
```

_Papers:_
@@ -227,7 +227,7 @@ _Configuration class_: [`IA3Config`](transformers.IA3Config)
```

_Infused Adapter by Inhibiting and Amplifying Inner Activations ((IA)^3)_ is an efficient fine-tuning method proposed within the _T-Few_ fine-tuning approach by [Liu et al. (2022)](https://arxiv.org/pdf/2205.05638.pdf).
(IA)^3 introduces trainable vectors $l_W$ into different components of a Transformer model which perform element-wise rescaling of inner model activations.
(IA)^3 introduces trainable vectors $l_W$ into different components of a Transformer model, which perform element-wise rescaling of inner model activations.
For any model layer expressed as a matrix multiplication of the form $h = W x$, it therefore performs an element-wise multiplication with $l_W$, such that:

$$
@@ -244,16 +244,16 @@ config = IA3Config()
model.add_adapter("ia3_adapter", config=config)
```

The implementation of (IA)^3, as well as the `IA3Config` class, are derived from the implementation of [LoRA](#lora), with a few main modifications.
First, (IA)^3 uses multiplicative composition of weights instead of additive composition as in LoRA.
The implementation of (IA)^3, as well as the [`IA3Config`](transformers.IA3Config) class, are derived from the implementation of [LoRA](#lora), with a few main modifications.
First, (IA)^3 uses multiplicative composition of weights instead of additive composition, as in LoRA.
Second, the added weights are not further decomposed into low-rank matrices.
Both of these modifications are controlled via the `composition_mode` configuration attribute by setting `composition_mode="scale"`.
These modifications are controlled via the `composition_mode` configuration attribute by setting `composition_mode="scale"`.
Additionally, as the added weights are already of rank 1, `r=1` is set.

Beyond that, both methods share the same configuration attributes that allow you to specify in which Transformer components rescaling vectors will be injected.
Following the original implementation, `IA3Config` adds rescaling vectors to the self-attention weights (`selfattn_lora=True`) and the final feed-forward layer (`output_lora=True`).
Beyond that, both methods share the same configuration attributes that allow you to specify which Transformer components rescaling vectors will be injected.
Following the original implementation, [`IA3Config`](transformers.IA3Config) adds rescaling vectors to the self-attention weights (`selfattn_lora=True`) and the final feed-forward layer (`output_lora=True`).
Further, you can modify which matrices of the attention mechanism to rescale by leveraging the `attn_matrices` attribute.
By default, (IA)^3 injects weights into the key ('k') and value ('v') matrices, but not in the query ('q') matrix.
By default, (IA)^3 injects weights into the key ('k') and value ('v') matrices but not in the query ('q') matrix.
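
For example, to also rescale the query matrix (a hypothetical configuration, not the library default; `model` assumed to be adapter-enabled):

```python
from transformers.adapters import IA3Config

# Add rescaling vectors to the query, key, and value projections instead of the
# default ('k', 'v') selection.
config = IA3Config(attn_matrices=["q", "k", "v"])
model.add_adapter("ia3_qkv", config=config)
```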

Finally, similar to LoRA, (IA)^3 also allows merging the injected parameters with the original weight matrices of the Transformer model.
E.g.:
@@ -262,7 +262,7 @@ E.g.:
model.merge_adapter("ia3_adapter")

# Reset merged weights
model.reset_adapter("ia3_adapter")
model.reset_adapter()
```

_Papers:_