This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Transformer Head Pruner #3884

Merged: 67 commits, merged on Jul 28, 2021
Commits (67)

1a1172c
local code sync
xiaowu0162 Jun 29, 2021
5f27c35
graph-based weight grouping
xiaowu0162 Jun 30, 2021
d960426
fix for pipeline
xiaowu0162 Jun 30, 2021
faedb0f
pipeline related
xiaowu0162 Jun 30, 2021
c62b9a1
add activation-based maskers; refactor example
xiaowu0162 Jul 1, 2021
6877b64
minor fix
xiaowu0162 Jul 1, 2021
595864e
change graph-based grouping logic
xiaowu0162 Jul 2, 2021
bd7ff9f
remove redundant code
xiaowu0162 Jul 2, 2021
b28725f
Add taylor masker
xiaowu0162 Jul 6, 2021
80bdf06
debug
xiaowu0162 Jul 6, 2021
d5582dd
debug
xiaowu0162 Jul 6, 2021
0715a70
Add global sorting
xiaowu0162 Jul 6, 2021
d1e5d8d
debug
xiaowu0162 Jul 6, 2021
aece26a
debug
xiaowu0162 Jul 6, 2021
9cf94fe
Add iterative pruning
xiaowu0162 Jul 6, 2021
7c73fc8
debug
xiaowu0162 Jul 6, 2021
79186e2
Simplify API; add doc strings
xiaowu0162 Jul 7, 2021
690969a
debug
xiaowu0162 Jul 7, 2021
9d34493
docstring
xiaowu0162 Jul 7, 2021
1e2329e
example v1
xiaowu0162 Jul 7, 2021
7121051
Merge branch 'microsoft:master' into bertpruner
xiaowu0162 Jul 8, 2021
94e4804
doc skeleton
xiaowu0162 Jul 9, 2021
a5a92d9
doc update
xiaowu0162 Jul 9, 2021
5399dd8
doc update
xiaowu0162 Jul 9, 2021
ec5cdf2
doc update
xiaowu0162 Jul 9, 2021
f747a9a
doc update
xiaowu0162 Jul 9, 2021
f70343c
update
xiaowu0162 Jul 9, 2021
e6b6d84
doc debug
xiaowu0162 Jul 11, 2021
5aa63d2
update examples
xiaowu0162 Jul 11, 2021
b0b01fc
debug
xiaowu0162 Jul 11, 2021
af51872
debug
xiaowu0162 Jul 11, 2021
521b4c8
debug
xiaowu0162 Jul 11, 2021
42b3d5a
debug
xiaowu0162 Jul 11, 2021
59b8adb
fix ungrouped module removing logic
xiaowu0162 Jul 16, 2021
af2144a
Update shape dependency to align with master
xiaowu0162 Jul 16, 2021
d8a11c2
Merge branch 'master' into bertpruner
xiaowu0162 Jul 16, 2021
3e446ed
doc string debug
xiaowu0162 Jul 16, 2021
8fa6263
resolve comments
xiaowu0162 Jul 16, 2021
f45865f
update docs
xiaowu0162 Jul 16, 2021
ec35267
redo example
xiaowu0162 Jul 18, 2021
b46bcee
docstring
xiaowu0162 Jul 18, 2021
8f83131
debug
xiaowu0162 Jul 18, 2021
941a301
doc
xiaowu0162 Jul 18, 2021
be5f38b
debug
xiaowu0162 Jul 18, 2021
f193106
debug
xiaowu0162 Jul 19, 2021
b9837e8
unit test
xiaowu0162 Jul 19, 2021
53ab047
ut debug
xiaowu0162 Jul 19, 2021
416f680
replace torch.linalg.norm with torch.norm
xiaowu0162 Jul 19, 2021
4d91a1e
update ut
xiaowu0162 Jul 20, 2021
c0efcd5
improve docs
xiaowu0162 Jul 26, 2021
bb8f437
debug
xiaowu0162 Jul 26, 2021
61679f1
example fix
xiaowu0162 Jul 26, 2021
c0b93ed
handle empty groups caused by config
xiaowu0162 Jul 26, 2021
856ee2a
sanity check
xiaowu0162 Jul 26, 2021
32bf76a
head indexing
xiaowu0162 Jul 27, 2021
ed11f42
default args
xiaowu0162 Jul 27, 2021
50d9cb9
improve docs
xiaowu0162 Jul 27, 2021
a6727d5
debug
xiaowu0162 Jul 27, 2021
9411998
debug
xiaowu0162 Jul 27, 2021
ae31c18
epoch parameter to trainer
xiaowu0162 Jul 27, 2021
266ec1c
update example
xiaowu0162 Jul 27, 2021
83cfd4a
Merge branch 'microsoft:master' into bertpruner
xiaowu0162 Jul 27, 2021
a512ef3
forward_runner API v1
xiaowu0162 Jul 27, 2021
5e5b962
Merge branch 'bertpruner' of https://github.com/xiaowu0162/nni into b…
xiaowu0162 Jul 27, 2021
00ad0ef
delete some usages
xiaowu0162 Jul 27, 2021
a62f9de
update docs
xiaowu0162 Jul 27, 2021
80383af
update ut
xiaowu0162 Jul 27, 2021

Files changed

2 changes: 2 additions & 0 deletions docs/en_US/Compression/CompressionReference.rst
@@ -91,6 +91,8 @@ Pruners
.. autoclass:: nni.algorithms.compression.pytorch.pruning.lottery_ticket.LotteryTicketPruner
    :members:

.. autoclass:: nni.algorithms.compression.pytorch.pruning.transformer_pruner.TransformerHeadPruner
    :members:

Quantizers
^^^^^^^^^^
4 changes: 3 additions & 1 deletion docs/en_US/Compression/Overview.rst
@@ -35,7 +35,7 @@ The algorithms include pruning algorithms and quantization algorithms.
Pruning Algorithms
^^^^^^^^^^^^^^^^^^

Pruning algorithms compress the original network by removing redundant weights or channels of layers, which can reduce model complexity and mitigate the over-fitting issue.

.. list-table::
:header-rows: 1
@@ -73,6 +73,8 @@ Pruning algorithms compress the original network by removing redundant weights o
- Automatic pruning by iteratively call SimulatedAnnealing Pruner and ADMM Pruner `Reference Paper <https://arxiv.org/abs/1907.03141>`__
* - `AMC Pruner <../Compression/Pruner.rst#amc-pruner>`__
- AMC: AutoML for Model Compression and Acceleration on Mobile Devices `Reference Paper <https://arxiv.org/pdf/1802.03494.pdf>`__
* - `Transformer Head Pruner <../Compression/Pruner.rst#transformer-head-pruner>`__
- Pruning attention heads from transformer models either in one shot or iteratively.


You can refer to this `benchmark <../CommunitySharings/ModelCompressionComparison.rst>`__ for the performance of these pruners on some benchmark problems.
126 changes: 126 additions & 0 deletions docs/en_US/Compression/Pruner.rst
@@ -28,6 +28,7 @@ We provide several pruning algorithms that support fine-grained weight pruning a
**Others**

* `Lottery Ticket Hypothesis <#lottery-ticket-hypothesis>`__
* `Transformer Head Pruner <#transformer-head-pruner>`__

Level Pruner
------------
@@ -722,3 +723,128 @@ User configuration for Sensitivity Pruner
**PyTorch**

.. autoclass:: nni.algorithms.compression.pytorch.pruning.SensitivityPruner

Transformer Head Pruner
-----------------------

Transformer Head Pruner is a tool for pruning attention heads from models belonging to the `Transformer family <https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf>`__. The following image from `Efficient Transformers: A Survey <https://arxiv.org/pdf/2009.06732.pdf>`__ gives a good overview of the general structure of the Transformer.

.. image:: ../../img/transformer_structure.png
    :target: ../../img/transformer_structure.png
    :alt:

Typically, each attention layer in a Transformer model contains four weight matrices: three projection matrices for query, key, and value, plus an output projection matrix. The outputs of the first three matrices contain the projected results for all heads; they are reshaped so that each head performs its attention computation independently, and the per-head results are concatenated back together before being fed into the output projection. Therefore, when an attention head is pruned, the weights corresponding to that head in the three projection matrices are pruned, and so are the weights in the output projection that correspond to the head's output. In our implementation, we calculate and apply masks to the four matrices together.
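
To make this concrete, here is a minimal sketch of what pruning a single head means for the four matrices. It is an illustration only, not the pruner's internal code; the module and variable names (``q_proj``, ``head_hidden_dim``, etc.) are assumptions for a standard multi-head attention layer.

.. code-block:: python

    import torch

    num_heads, head_hidden_dim, hidden_size = 12, 64, 768
    q_proj = torch.nn.Linear(hidden_size, hidden_size)
    k_proj = torch.nn.Linear(hidden_size, hidden_size)
    v_proj = torch.nn.Linear(hidden_size, hidden_size)
    out_proj = torch.nn.Linear(hidden_size, hidden_size)

    head = 3  # hypothetical head to prune
    start, end = head * head_hidden_dim, (head + 1) * head_hidden_dim

    # Rows [start:end] of the Q/K/V projection weights produce this head's queries/keys/values
    # (nn.Linear stores weights as (out_features, in_features)).
    qkv_mask = torch.ones_like(q_proj.weight)
    qkv_mask[start:end, :] = 0
    # Columns [start:end] of the output projection consume this head's attention output.
    out_mask = torch.ones_like(out_proj.weight)
    out_mask[:, start:end] = 0

    with torch.no_grad():
        for proj in (q_proj, k_proj, v_proj):
            proj.weight.mul_(qkv_mask)
        out_proj.weight.mul_(out_mask)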

The pruner implements the following algorithm:

.. code-block:: text

    Repeat for each pruning iteration (1 for one-shot pruning):
        1. Calculate an importance score for each head in each specified layer using the chosen criterion.
        2. Sort the heads locally or globally, and prune the heads with the lowest scores. The number of heads to prune is determined by the sparsity specified in the config.
        3. If the number of pruning iterations is larger than 1 (iterative pruning), fine-tune the model before the next pruning iteration.

Currently, the following head sorting criteria are supported:

* "l1_weight": rank heads by the L1-norm of weights of the query, key, and value projection matrices.
* "l2_weight": rank heads by the L2-norm of weights of the query, key, and value projection matrices.
* "l1_activation": rank heads by the L1-norm of their attention computation output.
* "l2_activation": rank heads by the L2-norm of their attention computation output.
* "taylorfo": rank heads by l1 norm of the output of attention computation * gradient for this output. Check more details in `this paper <https://arxiv.org/abs/1905.10650>`__ and `this one <https://arxiv.org/abs/1611.06440>`__.

We support both local sorting (i.e., sorting heads within each layer) and global sorting (sorting all heads together), controlled by the ``global_sort`` parameter. Note that if ``global_sort=True`` is passed, all weights must have the same sparsity in the config list. However, this does not mean that each layer will be pruned to exactly that sparsity: the value is interpreted as a global sparsity, and individual layers will likely end up with different sparsities after pruning by global sort.
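
To make the distinction concrete, the sketch below selects heads to prune under the two modes using purely hypothetical scores; it is an illustration, not the pruner's code.

.. code-block:: python

    import torch

    # Hypothetical per-head importance scores for a 2-layer, 4-head model
    scores = {0: torch.tensor([0.9, 0.2, 0.85, 0.7]),
              1: torch.tensor([0.1, 0.3, 0.15, 0.6])}
    sparsity = 0.5

    # Local sorting: prune the lowest-scoring half of the heads in *each* layer.
    local_pruned = {layer: torch.argsort(s)[:int(sparsity * len(s))].tolist()
                    for layer, s in scores.items()}
    # -> {0: [1, 3], 1: [0, 2]}: exactly half of each layer is pruned

    # Global sorting: pool all heads and prune the lowest-scoring half overall.
    all_heads = [(layer, h, s[h].item()) for layer, s in scores.items() for h in range(len(s))]
    n_prune = int(sparsity * len(all_heads))
    global_pruned = sorted(all_heads, key=lambda x: x[2])[:n_prune]
    # -> three heads from layer 1 but only one from layer 0: per-layer sparsity differs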

In our implementation, we support two ways of grouping the four weights of the same layer together. You can either pass a dummy input so that the pruner runs ``torch.jit.trace`` to group the weights automatically (usage 1 below), or pass a nested list containing the names of these modules (usage 2 below).

However, if you would like to assign different sparsities to different layers, you can currently only use the name-based option, i.e., passing names of the weights to the pruner (usage 3 below). Also note that weights belonging to the same layer must have the same sparsity.

In addition to the following usage guide, we provide a more detailed example of pruning BERT on tasks from the GLUE benchmark. Please find it on this :githublink:`page <examples/model_compress/pruning/transformers>`.

Usage
^^^^^

Usage 1: one-shot pruning, same sparsity for all the layers (PyTorch code)

.. code-block:: python

    from nni.algorithms.compression.pytorch.pruning import TransformerHeadPruner

    kwargs = {'ranking_criterion': "l1_weight",
              'global_sort': False,
              'num_iterations': 1,
              'epochs_per_iteration': 1,  # this is ignored when num_iterations = 1
              'head_hidden_dim': 64,
              'dummy_input': dummy_input,
              'trainer': trainer,
              'optimizer': optimizer
              }
    config_list = [{
        'sparsity': 0.5,
        'op_types': ["Linear"]
    }]
    pruner = TransformerHeadPruner(model, config_list, **kwargs)
    pruner.compress()

Usage 2: same effect as usage 1; the only change is passing module names to the pruner instead of a dummy input (PyTorch code)

.. code-block:: python

    from nni.algorithms.compression.pytorch.pruning import TransformerHeadPruner

    attention_name_groups = list(zip(['encoder.layer.{}.attention.self.query'.format(i) for i in range(12)],
                                     ['encoder.layer.{}.attention.self.key'.format(i) for i in range(12)],
                                     ['encoder.layer.{}.attention.self.value'.format(i) for i in range(12)],
                                     ['encoder.layer.{}.attention.output.dense'.format(i) for i in range(12)]))
    kwargs = {'ranking_criterion': "l1_weight",
              'global_sort': False,
              'num_iterations': 1,
              'epochs_per_iteration': 1,  # this is ignored when num_iterations = 1
              'head_hidden_dim': 64,
              'attention_name_groups': attention_name_groups,
              'trainer': trainer,
              'optimizer': optimizer
              }
    config_list = [{
        'sparsity': 0.5,
        'op_types': ["Linear"]
    }]
    pruner = TransformerHeadPruner(model, config_list, **kwargs)
    pruner.compress()

Usage 3: one-shot pruning, setting different sparsities for different layers (PyTorch code)

.. code-block:: python

    from nni.algorithms.compression.pytorch.pruning import TransformerHeadPruner

    attention_name_groups = list(zip(['encoder.layer.{}.attention.self.query'.format(i) for i in range(12)],
                                     ['encoder.layer.{}.attention.self.key'.format(i) for i in range(12)],
                                     ['encoder.layer.{}.attention.self.value'.format(i) for i in range(12)],
                                     ['encoder.layer.{}.attention.output.dense'.format(i) for i in range(12)]))
    kwargs = {'ranking_criterion': "l1_weight",
              'global_sort': False,
              'num_iterations': 1,
              'epochs_per_iteration': 1,  # this is ignored when num_iterations = 1
              'head_hidden_dim': 64,
              'attention_name_groups': attention_name_groups,  # can change to dummy_input here
              'trainer': trainer,
              'optimizer': optimizer
              }
    config_list = [{
        'sparsity': 0.5,
        'op_types': ["Linear"],
        'op_names': [x for layer in attention_name_groups[:6] for x in layer]   # first six layers
    },
    {
        'sparsity': 0.25,
        'op_types': ["Linear"],
        'op_names': [x for layer in attention_name_groups[6:] for x in layer]   # last six layers
    }]
    pruner = TransformerHeadPruner(model, config_list, **kwargs)
    pruner.compress()
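
After ``compress()`` returns, the computed masks have been applied to the grouped weights. A possible follow-up, sketched below under the assumption of the standard NNI compression workflow (the file names are placeholders, and ``dummy_input`` stands for a representative input used for tracing), is to export the masked model and masks, and optionally call ``ModelSpeedup`` to physically remove the pruned heads (the example's ``run.sh`` passes a ``--speed_up`` flag for this purpose):

.. code-block:: python

    # Export the state dict with masks applied, plus the masks themselves.
    pruner.export_model(model_path='pruned_model.pth', mask_path='head_mask.pth')

    # Optionally shrink the weights so the pruned model actually runs faster
    # (depending on the NNI version, the pruner's module wrappers may need to be removed first).
    from nni.compression.pytorch import ModelSpeedup
    ModelSpeedup(model, dummy_input, 'head_mask.pth').speedup_model()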


User configuration for Transformer Head Pruner
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**PyTorch**

.. autoclass:: nni.algorithms.compression.pytorch.pruning.TransformerHeadPruner
Binary file added docs/img/transformer_structure.png
45 changes: 45 additions & 0 deletions examples/model_compress/pruning/transformers/run.sh
@@ -0,0 +1,45 @@
#!/bin/bash

# Usage: ./run.sh gpu_id glue_task

export CUDA_VISIBLE_DEVICES=$1
TASK_NAME=$2 # "cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli"
PRETRAINED_MODEL="bert-base-uncased" # "distilbert-base-uncased", "roberta-base", "bert-base-cased", ...

# parameters for pruning
# change USAGE to different numbers (1, 2, 3) to run examples with different configs
USAGE=2
SPARSITY=0.5
RANKING_CRITERION=l1_weight # "l1_weight", "l2_weight", "l1_activation", "l2_activation", "taylorfo"
NUM_ITERATIONS=1 # 1 for one-shot pruning
EPOCHS_PER_ITERATION=1

# other training parameters, no need to change
MAX_LENGTH=128
BATCH_SIZE=32
LR=2e-5
N_EPOCHS=3

time=$(date "+%Y%m%d%H%M%S")
OUTDIR="models_${PRETRAINED_MODEL}_${TASK_NAME}_$time/"

TASK_LIST=("cola" "sst2" "mrpc" "stsb" "qqp" "mnli" "qnli" "rte" "wnli")
if [[ ${TASK_LIST[*]} =~ (^|[[:space:]])$TASK_NAME($|[[:space:]]) ]]; then
    mkdir $OUTDIR
    python transformer_pruning.py \
        --sparsity $SPARSITY \
        --ranking_criterion $RANKING_CRITERION \
        --num_iterations $NUM_ITERATIONS \
        --epochs_per_iteration $EPOCHS_PER_ITERATION \
        --speed_up \
        --model_name $PRETRAINED_MODEL \
        --task_name $TASK_NAME \
        --max_length $MAX_LENGTH \
        --batch_size $BATCH_SIZE \
        --learning_rate $LR \
        --num_train_epochs $N_EPOCHS \
        --output_dir $OUTDIR \
        2>&1 | tee "$OUTDIR/output.log"
else
    echo "Unsupported task $TASK_NAME."
fi