From 8ea031ea9e287fa86ac453bac05c84b46ae1c528 Mon Sep 17 00:00:00 2001 From: Timo Imhof Date: Thu, 13 Apr 2023 18:08:01 +0200 Subject: [PATCH 1/5] fixed typos, fixed broken named anchor links, updated broken code fragments --- adapter_docs/method_combinations.md | 12 ++++----- adapter_docs/methods.md | 42 ++++++++++++++--------------- adapter_docs/overview.md | 37 ++++++++++++++----------- 3 files changed, 48 insertions(+), 43 deletions(-) diff --git a/adapter_docs/method_combinations.md b/adapter_docs/method_combinations.md index 33bf83edcb..56265cc7ec 100644 --- a/adapter_docs/method_combinations.md +++ b/adapter_docs/method_combinations.md @@ -2,8 +2,8 @@ _Configuration class_: [`ConfigUnion`](transformers.ConfigUnion) -While different efficient fine-tuning methods and configurations have often been proposed as standalone, it might be beneficial to combine them for joint training. -To make this process easier, adapter-transformers provides the possibility to group multiple configuration instances together using the `ConfigUnion` class. +While different efficient fine-tuning methods and configurations have often been proposed as standalone, combining them for joint training might be beneficial. +To make this process easier, `adapter-transformers` provides the possibility to group multiple configuration instances using the `ConfigUnion` class. For example, this could be used to define different reduction factors for the adapter modules placed after the multi-head attention and the feed-forward blocks: @@ -22,8 +22,8 @@ model.add_adapter("union_adapter", config=config) _Configuration class_: [`MAMConfig`](transformers.MAMConfig) [He et al. (2021)](https://arxiv.org/pdf/2110.04366.pdf) study various variants and combinations of efficient fine-tuning methods. -Among others, they propose _Mix-and-Match Adapters_ as a combination of Prefix Tuning and parallel bottleneck adapters. -This configuration is supported by adapter-transformers out-of-the-box: +They propose _Mix-and-Match Adapters_ as a combination of Prefix Tuning and parallel bottleneck adapters. +This configuration is supported by `adapter-transformers` out-of-the-box: ```python from transformers.adapters import MAMConfig @@ -68,7 +68,7 @@ Concretely, for each adapted module $m$, UniPELT adds a trainable gating value $ $$\mathcal{G}_m \leftarrow \sigma(W_{\mathcal{G}_m} \cdot x)$$ -These gating values are then used to scale the output activations of the injected adapter modules, e.g. for a LoRA layer: +These gating values are then used to scale the output activations of the injected adapter modules, e.g., for a LoRA layer: $$ h \leftarrow W_0 x + \mathcal{G}_{LoRA} B A x @@ -77,7 +77,7 @@ $$ In the configuration classes of `adapter-transformers`, these gating mechanisms can be activated via `use_gating=True`. The full UniPELT setup can be instantiated using `UniPELTConfig`[^unipelt]: -[^unipelt]: Note that the implementation of UniPELT in `adapter-transformers` follows the implementation in the original code, which is slighlty different from the description in the paper. See [here](https://github.com/morningmoni/UniPELT/issues/1) for more. +[^unipelt]: Note that the implementation of UniPELT in `adapter-transformers` follows the implementation in the original code, which is slightlty different from the description in the paper. See [here](https://github.com/morningmoni/UniPELT/issues/1) for more. 
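For orientation, the union that UniPELT bundles can also be assembled by hand with `ConfigUnion` and the `use_gating` flag described above. This is a rough sketch only; the sub-method settings (`r`, `prefix_length`, `reduction_factor`) are illustrative rather than the exact values used by UniPELT:

```python
from transformers.adapters import ConfigUnion, LoRAConfig, PrefixTuningConfig, PfeifferConfig

# Hand-assembled UniPELT-style union: LoRA, prefix tuning and a bottleneck
# adapter, each with its trainable gate enabled via use_gating=True.
# Sub-method settings are illustrative.
gated_union = ConfigUnion(
    LoRAConfig(r=8, use_gating=True),
    PrefixTuningConfig(prefix_length=10, use_gating=True),
    PfeifferConfig(reduction_factor=16, use_gating=True),
)
model.add_adapter("gated_union_adapter", config=gated_union)
```

The predefined `UniPELTConfig` shown next bundles this setup into a single class.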
```python from transformers.adapters import UniPELTConfig diff --git a/adapter_docs/methods.md b/adapter_docs/methods.md index 666e498862..0e40cad1c7 100644 --- a/adapter_docs/methods.md +++ b/adapter_docs/methods.md @@ -1,7 +1,7 @@ # Adapter Methods On this page, we present all adapter methods currently integrated into the `adapter-transformers` library. -A tabulary overview of adapter methods is provided [here](overview.html#table-of-adapter-methods) +A tabular overview of adapter methods is provided [here](overview.html#table-of-adapter-methods). Additionally, options to combine multiple adapter methods in a single setup are presented [on the next page](method_combinations.md). ## Bottleneck Adapters @@ -15,7 +15,7 @@ $$ h \leftarrow W_{up} \cdot f(W_{down} \cdot h) + r $$ -Depending on the concrete adapter configuration, these layers can be introduced at different locations within a Transformer block. Further, residual connections, layer norms, activation functions and bottleneck sizes etc. can be configured. +Depending on the concrete adapter configuration, these layers can be introduced at different locations within a Transformer block. Further, residual connections, layer norms, activation functions and bottleneck sizes ,etc., can be configured. The most important configuration hyperparameter to be highlighted here is the bottleneck dimension $d_{bottleneck}$. In adapter-transformers, this bottleneck dimension is specified indirectly via the `reduction_factor` attribute of a configuration. @@ -25,7 +25,7 @@ $$ \text{reduction_factor} = \frac{d_{hidden}}{d_{bottleneck}} $$ -A visualization of further configuration options related to the adapter structure is given in the figure below. For more details, refer to the documentation of [`AdapterConfig`](transformers.AdapterConfig). +A visualization of further configuration options related to the adapter structure is given in the figure below. For more details, we refer to the documentation of [`AdapterConfig`](transformers.AdapterConfig). ```{eval-rst} @@ -37,11 +37,11 @@ A visualization of further configuration options related to the adapter structur Visualization of possible adapter configurations with corresponding dictionary keys. ``` -adapter-transformers comes with pre-defined configurations for some bottleneck adapter architectures proposed in literature: +`adapter-transformers` comes with pre-defined configurations for some bottleneck adapter architectures proposed in literature: -- [`HoulsbyConfig`](transformers.HoulsbyConfig) as proposed by [Houlsby et al. (2019)](https://arxiv.org/pdf/1902.00751.pdf) places adapter layers after both the multi-head attention and feed-forward block in each Transformer layer. -- [`PfeifferConfig`](transformers.PfeifferConfig) as proposed by [Pfeiffer et al. (2020)](https://arxiv.org/pdf/2005.00052.pdf) places an adapter layer only after the feed-forward block in each Transformer layer. -- [`ParallelConfig`](transformers.ParallelConfig) as proposed by [He et al. (2021)](https://arxiv.org/pdf/2110.04366.pdf) places adapter layers in parallel to the original Transformer layers. +- [`HoulsbyConfig`](transformers.HoulsbyConfig), as proposed by [Houlsby et al. (2019)](https://arxiv.org/pdf/1902.00751.pdf), places adapter layers after both the multi-head attention and feed-forward block in each Transformer layer. +- [`PfeifferConfig`](transformers.PfeifferConfig), as proposed by [Pfeiffer et al. 
(2020)](https://arxiv.org/pdf/2005.00052.pdf), places an adapter layer only after the feed-forward block in each Transformer layer. +- [`ParallelConfig`](transformers.ParallelConfig), as proposed by [He et al. (2021)](https://arxiv.org/pdf/2110.04366.pdf), places adapter layers in parallel to the original Transformer layers. _Example_: ```python @@ -101,13 +101,13 @@ _Configuration class_: [`PrefixTuningConfig`](transformers.PrefixTuningConfig) ``` Prefix Tuning ([Li and Liang, 2021](https://aclanthology.org/2021.acl-long.353.pdf)) introduces new parameters in the multi-head attention blocks in each Transformer layer. -More, specifically, it prepends trainable prefix vectors $P^K$ and $P^V$ to the keys and values of the attention head input, each of a configurable prefix length $l$ (`prefix_length` attribute): +More specifically, it prepends trainable prefix vectors $P^K$ and $P^V$ to the keys and values of the attention head input, each of a configurable prefix length $l$ (`prefix_length` attribute): $$ head_i = \text{Attention}(Q W_i^Q, [P_i^K, K W_i^K], [P_i^V, V W_i^V]) $$ -Following the original authors, the prefix vectors in $P^K$ and $P^V$ are note optimized directly, but reparameterized via a bottleneck MLP. +Following the original authors, the prefix vectors in $P^K$ and $P^V$ are not optimized directly but reparameterized via a bottleneck MLP. This behavior is controlled via the `flat` attribute of the configuration. Using `PrefixTuningConfig(flat=True)` will create prefix tuning vectors that are optimized without reparameterization. @@ -119,7 +119,7 @@ config = PrefixTuningConfig(flat=False, prefix_length=30) model.add_adapter("prefix_tuning", config=config) ``` -As reparameterization using the bottleneck MLP is not necessary for performing inference on an already trained Prefix Tuning module, adapter-transformers includes a function to "eject" a reparameterized Prefix Tuning into a flat one: +As reparameterization using the bottleneck MLP is not necessary for performing inference on an already trained Prefix Tuning module, `adapter-transformers` includes a function to "eject" a reparameterized Prefix Tuning into a flat one: ```python model.eject_prefix_tuning("prefix_tuning") ``` @@ -150,9 +150,9 @@ for a PHM layer by specifying `use_phm=True` in the config. The PHM layer has the following additional properties: `phm_dim`, `shared_phm_rule`, `factorized_phm_rule`, `learn_phm`, `factorized_phm_W`, `shared_W_phm`, `phm_c_init`, `phm_init_range`, `hypercomplex_nonlinearity` -For more information check out the [`AdapterConfig`](transformers.AdapterConfig) class. +For more information, check out the [`AdapterConfig`](transformers.AdapterConfig) class. -To add a Compacter to your model you can use the predefined configs: +To add a Compacter to your model, you can use the predefined configs: ```python from transformers.adapters import CompacterConfig @@ -177,7 +177,7 @@ _Configuration class_: [`LoRAConfig`](transformers.LoRAConfig) Low-Rank Adaptation (LoRA) is an efficient fine-tuning technique proposed by [Hu et al. (2021)](https://arxiv.org/pdf/2106.09685.pdf). LoRA injects trainable low-rank decomposition matrices into the layers of a pre-trained model. 
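To get a feel for the parameter savings, a quick back-of-the-envelope calculation for a single, hypothetical 768×768 attention projection with rank $r = 8$ (the decomposition matrices $A$ and $B$ are introduced just below):

```python
d, k, r = 768, 768, 8

full_matrix_params = d * k       # updating the full weight matrix: 589,824 parameters
lora_params = r * k + d * r      # A (r x k) plus B (d x r): 12,288 parameters

print(lora_params / full_matrix_params)  # ~0.02, i.e. roughly 2% of the full matrix
```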
-For any model layer expressed as a matrix multiplication of the form $h = W_0 x$, it therefore performs a reparameterization, such that: +For any model layer expressed as a matrix multiplication of the form $h = W_0 x$, it performs a reparameterization such that: $$ h = W_0 x + \frac{\alpha}{r} B A x @@ -185,7 +185,7 @@ $$ Here, $A \in \mathbb{R}^{r\times k}$ and $B \in \mathbb{R}^{d\times r}$ are the decomposition matrices and $r$, the low-dimensional rank of the decomposition, is the most important hyperparameter. -While, in principle, this reparameterization can be applied to any weights matrix in a model, the original paper only adapts the attention weights of the Transformer self-attention sub-layer with LoRA. +While, in principle, this reparameterization can be applied to any weight matrix in a model, the original paper only adapts the attention weights of the Transformer self-attention sub-layer with LoRA. `adapter-transformers` additionally allows injecting LoRA into the dense feed-forward layers in the intermediate and output components of a Transformer block. You can configure the locations where LoRA weights should be injected using the attributes in the [`LoRAConfig`](transformers.LoRAConfig) class. @@ -207,7 +207,7 @@ model.merge_adapter("lora_adapter") To continue training on this LoRA adapter or to deactivate it entirely, the merged weights first have to be reset again: ```python -model.reset_adapter("lora_adapter") +model.reset_adapter() ``` _Papers:_ @@ -227,7 +227,7 @@ _Configuration class_: [`IA3Config`](transformers.IA3Config) ``` _Infused Adapter by Inhibiting and Amplifying Inner Activations ((IA)^3)_ is an efficient fine-tuning method proposed within the _T-Few_ fine-tuning approach by [Liu et al. (2022)](https://arxiv.org/pdf/2205.05638.pdf). -(IA)^3 introduces trainable vectors $l_W$ into different components of a Transformer model which perform element-wise rescaling of inner model activations. +(IA)^3 introduces trainable vectors $l_W$ into different components of a Transformer model, which perform element-wise rescaling of inner model activations. For any model layer expressed as a matrix multiplication of the form $h = W x$, it therefore performs an element-wise multiplication with $l_W$, such that: $$ @@ -245,15 +245,15 @@ model.add_adapter("ia3_adapter", config=config) ``` The implementation of (IA)^3, as well as the `IA3Config` class, are derived from the implementation of [LoRA](#lora), with a few main modifications. -First, (IA)^3 uses multiplicative composition of weights instead of additive composition as in LoRA. +First, (IA)^3 uses multiplicative compositions of weights instead of additive compositions, as in LoRA. Second, the added weights are not further decomposed into low-rank matrices. -Both of these modifications are controlled via the `composition_mode` configuration attribute by setting `composition_mode="scale"`. +These modifications are controlled via the `composition_mode` configuration attribute by setting `composition_mode="scale"`. Additionally, as the added weights are already of rank 1, `r=1` is set. -Beyond that, both methods share the same configuration attributes that allow you to specify in which Transformer components rescaling vectors will be injected. +Beyond that, both methods share the same configuration attributes that allow you to specify which Transformer components rescaling vectors will be injected. 
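As an illustration of these shared location attributes, a hedged sketch of a customized configuration (only attribute names that appear in the surrounding text are used; the chosen values are hypothetical, not recommended defaults):

```python
from transformers.adapters import IA3Config

# Hypothetical customization: rescale all three attention matrices,
# but leave the final feed-forward layer untouched.
config = IA3Config(
    attn_matrices=["q", "k", "v"],
    output_lora=False,
)
model.add_adapter("ia3_custom", config=config)
```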
Following the original implementation, `IA3Config` adds rescaling vectors to the self-attention weights (`selfattn_lora=True`) and the final feed-forward layer (`output_lora=True`). Further, you can modify which matrices of the attention mechanism to rescale by leveraging the `attn_matrices` attribute. -By default, (IA)^3 injects weights into the key ('k') and value ('v') matrices, but not in the query ('q') matrix. +By default, (IA)^3 injects weights into the key ('k') and value ('v') matrices but not in the query ('q') matrix. Finally, similar to LoRA, (IA)^3 also allows merging the injected parameters with the original weight matrices of the Transformer model. E.g.: @@ -262,7 +262,7 @@ E.g.: model.merge_adapter("ia3_adapter") # Reset merged weights -model.reset_adapter("ia3_adapter") +model.reset_adapter() ``` _Papers:_ diff --git a/adapter_docs/overview.md b/adapter_docs/overview.md index 3cd35da6ed..7ce14c6a18 100644 --- a/adapter_docs/overview.md +++ b/adapter_docs/overview.md @@ -1,17 +1,17 @@ # Overview and Configuration Large pre-trained Transformer-based language models (LMs) have become the foundation of NLP in recent years. -While the most prevalent method of using these LMs for transfer learning involves costly *full fine-tuning* of all model parameters, a series of *efficient* and *lightweight* alternatives have been established in recent time. -Instead of updating all parameters of the pre-trained LM towards a downstream target task, these methods commonly introduce a small amount of new parameters and only update these while keeping the pre-trained model weights fixed. +While the most prevalent method of using these LMs for transfer learning involves costly *full fine-tuning* of all model parameters, a series of *efficient* and *lightweight* alternatives have recently been established. +Instead of updating all parameters of the pre-trained LM towards a downstream target task, these methods commonly introduce a small number of new parameters and only update these while keeping the pre-trained model weights fixed. ```{admonition} Why use Efficient Fine-Tuning? -Efficient fine-tuning methods offer multiple benefits over full fine-tuning of LMs: +Efficient fine-tuning methods offer multiple benefits over the full fine-tuning of LMs: -- They are **parameter-efficient**, i.e. they only update a very small subset (often under 1%) of a model's parameters. -- They often are **modular**, i.e. the updated parameters can be extracted and shared independently of the base model parameters. -- They are easy to share and easy to deploy due to their **small file sizes**, e.g. having only ~3MB per task instead of ~440MB for sharing a full model. -- They **speed up training**, i.e. efficient fine-tuning often needs less time for training compared fully fine-tuning LMs. -- They are **composable**, e.g. multiple adapters trained on different tasks can be stacked, fused or mixed to leverage their combined knowledge. +- They are **parameter-efficient**, i.e., they only update a tiny subset (often under 1%) of a model's parameters. +- They often are **modular**, i.e., the updated parameters can be extracted and shared independently of the base model parameters. +- They are easy to share and deploy due to their **small file sizes**, e.g., having only ~3MB per task instead of ~440MB for sharing a full model. +- They **speed up training**, i.e., efficient fine-tuning often requires less training time than fully fine-tuning LMs. 
+- They are **composable**, e.g., multiple adapters trained on different tasks can be stacked, fused, or mixed to leverage their combined knowledge. - They often provide **on-par performance** with full fine-tuning. ``` @@ -30,17 +30,18 @@ While these adapters have laid the foundation of the adapter-transformers librar .. important:: In literature, different terms are used to refer to efficient fine-tuning methods. The term "adapter" is usually only applied to bottleneck adapter modules. - However, most efficient fine-tuning methods follow the same general idea of inserting a small set of new parameters and by this "adapting" the pre-trained LM to a new task. + However, most efficient fine-tuning methods follow the same general idea of inserting a small set of new parameters and, by this, "adapting" the pre-trained LM to a new task. In adapter-transformers, the term "adapter" thus may refer to any efficient fine-tuning method if not specified otherwise. ``` In the remaining sections, we will present how adapter methods can be configured in `adapter-transformers`. -The next two pages will then present the methodological details of all currently supported adapter methods. +The following two pages will offer the methodological details of all currently supported adapter methods. ## Table of Adapter Methods The following table gives an overview of all adapter methods supported by `adapter-transformers`. Identifiers and configuration classes are explained in more detail in the [next section](#configuration). +TODO: update links | Identifier | Configuration class | More information | --- | --- | --- | @@ -48,14 +49,14 @@ Identifiers and configuration classes are explained in more detail in the [next | `houlsby` | `HoulsbyConfig()` | [Bottleneck Adapters](methods.html#bottleneck-adapters) | | `parallel` | `ParallelConfig()` | [Bottleneck Adapters](methods.html#bottleneck-adapters) | | `scaled_parallel` | `ParallelConfig(scaling="learned")` | [Bottleneck Adapters](methods.html#bottleneck-adapters) | -| `pfeiffer+inv` | `PfeifferInvConfig()` | [Invertible Adapters](methods.html#language-adapters---invertible-adapters) | -| `houlsby+inv` | `HoulsbyInvConfig()` | [Invertible Adapters](methods.html#language-adapters---invertible-adapters) | +| `pfeiffer+inv` | `PfeifferInvConfig()` | [Invertible Adapters](methods.html#language-adapters-invertible-adapters) | +| `houlsby+inv` | `HoulsbyInvConfig()` | [Invertible Adapters](methods.html#language-adapters-invertible-adapters) | | `compacter` | `CompacterConfig()` | [Compacter](methods.html#compacter) | | `compacter++` | `CompacterPlusPlusConfig()` | [Compacter](methods.html#compacter) | | `prefix_tuning` | `PrefixTuningConfig()` | [Prefix Tuning](methods.html#prefix-tuning) | | `prefix_tuning_flat` | `PrefixTuningConfig(flat=True)` | [Prefix Tuning](methods.html#prefix-tuning) | | `lora` | `LoRAConfig()` | [LoRA](methods.html#lora) | -| `ia3` | `IA3Config()` | [IA³](methods.html#ia3) | +| `ia3` | `IA3Config()` | [IA³](methods.html#ia-3) | | `mam` | `MAMConfig()` | [Mix-and-Match Adapters](method_combinations.html#mix-and-match-adapters) | | `unipelt` | `UniPELTConfig()` | [UniPELT](method_combinations.html#unipelt) | @@ -83,11 +84,15 @@ Here, `` refers to one of the identifiers listed in [the table above In square brackets after the identifier, you can set specific configuration attributes from the respective configuration class, e.g. `parallel[reduction_factor=2]`. If all attributes remain at their default values, this can be omitted. 
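For instance, such a string can be passed in place of a configuration object, assuming configuration strings are resolved wherever a configuration class instance is accepted (a sketch; the adapter names are illustrative):

```python
from transformers.adapters import ParallelConfig

# The two calls below are intended to be equivalent: one uses the
# configuration class directly, the other the configuration string syntax.
model.add_adapter("parallel_adapter", config=ParallelConfig(reduction_factor=2))
model.add_adapter("parallel_adapter_from_string", config="parallel[reduction_factor=2]")
```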
-Finally, it is also possible to specify a [method combination](method_combinations.md) as a configuration string by joining multiple configuration strings with `|`. -E.g., `prefix_tuning[bottleneck_size=800]|parallel` is identical to the following configuration class instance: +Finally, it is also possible to specify a [method combination](method_combinations.md) as a configuration string by joining multiple configuration strings with `|`, e.g.: +```python +config = "prefix_tuning[bottleneck_size=800]|parallel" +``` + +is identical to the following `ConfigUnion`: ```python -ConfigUnion( +config = ConfigUnion( PrefixTuningConfig(bottleneck_size=800), ParallelConfig(), ) From 16ee765eb2a57f7ff94d77c536b5ea047b8dac9d Mon Sep 17 00:00:00 2001 From: TimoImhof Date: Fri, 14 Apr 2023 11:45:54 +0200 Subject: [PATCH 2/5] create more links to the documentation of classes and methods --- adapter_docs/method_combinations.md | 2 +- adapter_docs/methods.md | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/adapter_docs/method_combinations.md b/adapter_docs/method_combinations.md index 56265cc7ec..f5902498f8 100644 --- a/adapter_docs/method_combinations.md +++ b/adapter_docs/method_combinations.md @@ -3,7 +3,7 @@ _Configuration class_: [`ConfigUnion`](transformers.ConfigUnion) While different efficient fine-tuning methods and configurations have often been proposed as standalone, combining them for joint training might be beneficial. -To make this process easier, `adapter-transformers` provides the possibility to group multiple configuration instances using the `ConfigUnion` class. +To make this process easier, `adapter-transformers` provides the possibility to group multiple configuration instances using the [`ConfigUnion`](transformers.ConfigUnion) class. For example, this could be used to define different reduction factors for the adapter modules placed after the multi-head attention and the feed-forward blocks: diff --git a/adapter_docs/methods.md b/adapter_docs/methods.md index 0e40cad1c7..62790cd5b7 100644 --- a/adapter_docs/methods.md +++ b/adapter_docs/methods.md @@ -68,7 +68,7 @@ To perform zero-shot cross-lingual transfer, one language adapter can simply be In terms of architecture, language adapters are largely similar to regular bottleneck adapters, except for an additional _invertible adapter_ layer after the LM embedding layer. Embedding outputs are passed through this invertible adapter in the forward direction before entering the first Transformer layer and in the inverse direction after leaving the last Transformer layer. -Invertible adapter architectures are further detailed in [Pfeiffer et al. (2020)](https://arxiv.org/pdf/2005.00052.pdf) and can be configured via the `inv_adapter` attribute of the `AdapterConfig` class. +Invertible adapter architectures are further detailed in [Pfeiffer et al. (2020)](https://arxiv.org/pdf/2005.00052.pdf) and can be configured via the `inv_adapter` attribute of the [`AdapterConfig`](transformers.AdapterConfig) class. _Example_: ```python @@ -200,7 +200,7 @@ model.add_adapter("lora_adapter", config=config) In the design of LoRA, Hu et al. (2021) also pay special attention to keeping the inference latency overhead compared to full fine-tuning at a minimum. To accomplish this, the LoRA reparameterization can be merged with the original pre-trained weights of a model for inference. Thus, the adapted weights are directly used in every forward pass without passing activations through an additional module. 
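Conceptually, merging just folds the trained decomposition back into the frozen weight. A standalone sketch in plain PyTorch, with shapes and values chosen purely for illustration:

```python
import torch

d, k, r, alpha = 768, 768, 8, 16
W0 = torch.randn(d, k)          # frozen pre-trained weight
A = torch.randn(r, k) * 0.01    # trained LoRA decomposition matrices
B = torch.randn(d, r) * 0.01

W_merged = W0 + (alpha / r) * (B @ A)

x = torch.randn(k)
h_unmerged = W0 @ x + (alpha / r) * (B @ (A @ x))   # extra computation in every forward pass
h_merged = W_merged @ x                             # a single matmul after merging
assert torch.allclose(h_unmerged, h_merged, atol=1e-4)
```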
-In `adapter-transformers`, this can be realized using the built-in `merge_adapter()` method: +In `adapter-transformers`, this can be realized using the built-in [`merge_adapter()`](transformers.ModelAdaptersMixin.merge_adapter) method: ```python model.merge_adapter("lora_adapter") ``` @@ -244,14 +244,14 @@ config = IA3Config() model.add_adapter("ia3_adapter", config=config) ``` -The implementation of (IA)^3, as well as the `IA3Config` class, are derived from the implementation of [LoRA](#lora), with a few main modifications. +The implementation of (IA)^3, as well as the [`IA3Config`](transformers.IA3Config) class, are derived from the implementation of [LoRA](#lora), with a few main modifications. First, (IA)^3 uses multiplicative compositions of weights instead of additive compositions, as in LoRA. Second, the added weights are not further decomposed into low-rank matrices. These modifications are controlled via the `composition_mode` configuration attribute by setting `composition_mode="scale"`. Additionally, as the added weights are already of rank 1, `r=1` is set. Beyond that, both methods share the same configuration attributes that allow you to specify which Transformer components rescaling vectors will be injected. -Following the original implementation, `IA3Config` adds rescaling vectors to the self-attention weights (`selfattn_lora=True`) and the final feed-forward layer (`output_lora=True`). +Following the original implementation, [`IA3Config`](transformers.IA3Config) adds rescaling vectors to the self-attention weights (`selfattn_lora=True`) and the final feed-forward layer (`output_lora=True`). Further, you can modify which matrices of the attention mechanism to rescale by leveraging the `attn_matrices` attribute. By default, (IA)^3 injects weights into the key ('k') and value ('v') matrices but not in the query ('q') matrix. From cda7a9f4c9422d47689563f92b85f7cb8e758031 Mon Sep 17 00:00:00 2001 From: TimoImhof <62378375+TimoImhof@users.noreply.github.com> Date: Fri, 5 May 2023 09:11:09 +0200 Subject: [PATCH 3/5] Update adapter_docs/method_combinations.md Fix typo Co-authored-by: calpt --- adapter_docs/method_combinations.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/adapter_docs/method_combinations.md b/adapter_docs/method_combinations.md index f5902498f8..80ffe6f77e 100644 --- a/adapter_docs/method_combinations.md +++ b/adapter_docs/method_combinations.md @@ -77,7 +77,7 @@ $$ In the configuration classes of `adapter-transformers`, these gating mechanisms can be activated via `use_gating=True`. The full UniPELT setup can be instantiated using `UniPELTConfig`[^unipelt]: -[^unipelt]: Note that the implementation of UniPELT in `adapter-transformers` follows the implementation in the original code, which is slightlty different from the description in the paper. See [here](https://github.com/morningmoni/UniPELT/issues/1) for more. +[^unipelt]: Note that the implementation of UniPELT in `adapter-transformers` follows the implementation in the original code, which is slightly different from the description in the paper. See [here](https://github.com/morningmoni/UniPELT/issues/1) for more. 
```python from transformers.adapters import UniPELTConfig From 2c35b59197b8916d38a3bfb69e82bbc1fdf3b2cb Mon Sep 17 00:00:00 2001 From: TimoImhof <62378375+TimoImhof@users.noreply.github.com> Date: Fri, 5 May 2023 09:12:30 +0200 Subject: [PATCH 4/5] Update adapter_docs/methods.md fix typo Co-authored-by: calpt --- adapter_docs/methods.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/adapter_docs/methods.md b/adapter_docs/methods.md index 62790cd5b7..b4ec411062 100644 --- a/adapter_docs/methods.md +++ b/adapter_docs/methods.md @@ -245,7 +245,7 @@ model.add_adapter("ia3_adapter", config=config) ``` The implementation of (IA)^3, as well as the [`IA3Config`](transformers.IA3Config) class, are derived from the implementation of [LoRA](#lora), with a few main modifications. -First, (IA)^3 uses multiplicative compositions of weights instead of additive compositions, as in LoRA. +First, (IA)^3 uses multiplicative composition of weights instead of additive composition, as in LoRA. Second, the added weights are not further decomposed into low-rank matrices. These modifications are controlled via the `composition_mode` configuration attribute by setting `composition_mode="scale"`. Additionally, as the added weights are already of rank 1, `r=1` is set. From 9b838c709d86d3b24ec6f01cf02204fd986fe9b2 Mon Sep 17 00:00:00 2001 From: Timo Imhof Date: Fri, 5 May 2023 09:17:00 +0200 Subject: [PATCH 5/5] remove fixed TODO --- adapter_docs/overview.md | 1 - 1 file changed, 1 deletion(-) diff --git a/adapter_docs/overview.md b/adapter_docs/overview.md index 7ce14c6a18..4c1e9c87d4 100644 --- a/adapter_docs/overview.md +++ b/adapter_docs/overview.md @@ -41,7 +41,6 @@ The following two pages will offer the methodological details of all currently s The following table gives an overview of all adapter methods supported by `adapter-transformers`. Identifiers and configuration classes are explained in more detail in the [next section](#configuration). -TODO: update links | Identifier | Configuration class | More information | --- | --- | --- |