For a more general pipeline, please check out the compressors library and its BERT distillation example. This project is no longer maintained.
This project is about BERT distillation.
The goal is to distill any BERT, for any language, with a convenient high-level API, reproducibility, and support for all the new GPU features:
- various losses
- distributed training
- fp16
- logging with tensorboard, wandb etc
- catalyst framework
Not long ago the Hugging Face team published a paper about the DistilBERT model. The idea is to transfer knowledge from a big teacher model to a smaller student model.
First of all, we need a well-trained teacher model.
Let's take 6 encoder layers instead of 12! We initialize our small model's layers with the teacher's layers.
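Here is a minimal sketch of that initialization, using a student that shares the teacher's architecture but has half the layers (the project's DistilbertStudentModel, shown below, does a similar mapping internally):

from transformers import AutoConfig, BertForMaskedLM

# well-trained 12-layer teacher
teacher = BertForMaskedLM.from_pretrained("bert-base-uncased")

# 6-layer student with the same architecture
config = AutoConfig.from_pretrained("bert-base-uncased", num_hidden_layers=6)
student = BertForMaskedLM(config)

# reuse the teacher's embeddings as-is
student.bert.embeddings.load_state_dict(teacher.bert.embeddings.state_dict())

# copy a subset of the teacher's 12 encoder layers into the student's 6
for student_idx, teacher_idx in enumerate([0, 2, 4, 7, 9, 11]):
    student.bert.encoder.layer[student_idx].load_state_dict(
        teacher.bert.encoder.layer[teacher_idx].state_dict()
    )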
Instead of training our model on the masked language modeling task from scratch for a long time, we can add KL divergence and cosine losses between the student and the teacher to the usual MLM loss, since we know the teacher is already well trained.
As shown in the paper, this method leads to only a small drop in quality while reducing the model size and speeding up inference, especially on mobile devices.
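To make the objective concrete, here is a rough sketch of the combined loss; the function name and signature are illustrative, not this project's API (the actual losses live in the callbacks shown below). Here s_logits/t_logits are student and teacher predictions on the masked positions, and s_hidden/t_hidden are the matching hidden states, both flattened to (n_tokens, dim):

import torch
import torch.nn.functional as F

def distillation_loss(mlm_loss, s_logits, t_logits, s_hidden, t_hidden, T=2.0):
    # KL divergence between softened student and teacher distributions
    kl_loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * T ** 2
    # cosine loss pulls the student's hidden states towards the teacher's
    target = torch.ones(s_hidden.size(0), device=s_hidden.device)
    cosine_loss = F.cosine_embedding_loss(s_hidden, t_hidden, target)
    # the final objective is a weighted sum of all terms (weights of 1.0 here)
    return mlm_loss + kl_loss + cosine_loss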
In the Catalyst framework there are two ways to run your experiment: the Notebook API and the Config API. If you want to run a quick, flexible experiment, use the Notebook API; if you want a production-ready solution, use the Config API.
Let's briefly take a look at the Notebook API. First of all, we do all the necessary imports:
from catalyst import dl
from catalyst.contrib.data.nlp import LanguageModelingDataset
from catalyst.contrib.nn.optimizers import RAdam
from catalyst.core import MetricAggregationCallback
import pandas as pd
import torch
from torch.utils.data import DataLoader
from transformers import (
    AutoConfig,
    AutoTokenizer,
    BertForMaskedLM,
    DistilBertForMaskedLM,
)
from transformers.data.data_collator import DataCollatorForLanguageModeling
from src.callbacks import (
    CosineLossCallback,
    KLDivLossCallback,
    MaskedLanguageModelCallback,
    MSELossCallback,
    PerplexityMetricCallbackDistillation,
)
from src.data import MLMDataset
from src.runners import DistilMLMRunner
from src.models import DistilbertStudentModel, BertForMLM
Then we should load our training data, for example:
train_df = pd.read_csv("data/train.csv")
valid_df = pd.read_csv("data/valid.csv")
Next, we initialize our data loaders:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") # your teacher's model tokenizer
train_dataset = LanguageModelingDataset(train_df["text"], tokenizer)
valid_dataset = LanguageModelingDataset(valid_df["text"], tokenizer)
collate_fn = DataCollatorForLanguageModeling(tokenizer).collate_batch
train_dataloader = DataLoader(
    train_dataset, collate_fn=collate_fn, batch_size=2
)
valid_dataloader = DataLoader(
    valid_dataset, collate_fn=collate_fn, batch_size=2
)
loaders = {"train": train_dataloader, "valid": valid_dataloader}
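The collator takes care of the MLM masking, so every batch is ready for both models. A quick sanity check (assuming the collator's standard output keys):

batch = next(iter(train_dataloader))
print(batch["input_ids"].shape)  # (batch_size, max_seq_len_in_batch)
print(batch["labels"].shape)     # -100 everywhere except the masked positions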
The most important thing is to define our models.
teacher = BertForMLM("bert-base-uncased")
student = DistilbertStudentModel(
    teacher_model_name="bert-base-uncased",
    layers=[0, 2, 4, 7, 9, 11],  # which teacher layers are transferred to the student
)
model = torch.nn.ModuleDict({"teacher": teacher, "student": student})
The next thing is callbacks:
callbacks = {
"masked_lm_loss": MaskedLanguageModelCallback(), # standard MLM loss
"mse_loss": MSELossCallback(), # MSE loss between student and student distributions on masked positions
"cosine_loss": CosineLossCallback(), # cosine loss between hidden states
"kl_div_loss": KLDivLossCallback(), # KL divergence between student and student distributions on masked positions
"loss": MetricAggregationCallback(
prefix="loss",
mode="weighted_sum",
metrics={ # weights for final loss
"cosine_loss": 1.0,
"masked_lm_loss": 1.0,
"kl_div_loss": 1.0,
"mse_loss": 1.0,
},
),
"optimizer": dl.OptimizerCallback(), # optim.step() and loss.backward() is here
"perplexity": PerplexityMetricCallbackDistillation(), # perplexity metric
}
Finally, run an experiment!
runner = DistilMLMRunner()
optimizer = RAdam(model.parameters(), lr=5e-5)
runner.train(
    model=model,
    optimizer=optimizer,
    loaders=loaders,
    verbose=True,
    num_epochs=10,  # number of epochs
    callbacks=callbacks,
)
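After training you will usually want the student on its own. Assuming DistilbertStudentModel keeps the underlying Hugging Face model in a distilbert attribute (an assumption; check src/models.py for the real attribute name), exporting it could look like this:

# `distilbert` is an assumed attribute name, see src/models.py
model["student"].distilbert.save_pretrained("logs/distilbert-student")
tokenizer.save_pretrained("logs/distilbert-student")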
But what about a more production-ready solution?
Here is a minimal example for the Config API. All you need to do is write a config.yml file:
model_params:  # defining our models
  _key_value: true
  teacher:
    model: BertForMLM
    model_name: "bert-base-uncased"  # Hugging Face hub model name
  student:
    model: DistilbertStudentModel
    teacher_model_name: "bert-base-uncased"

args:
  # where to look for the __init__.py file
  expdir: "src"
  # store logs in this subfolder
  baselogdir: "./logs/distilbert"

# common settings for all stages
stages:
  # PyTorch loader params
  data_params:
    batch_size: 2
    num_workers: 0
    path_to_data: "./data"
    train_filename: "train.csv"
    valid_filename: "valid.csv"
    text_field: "text"
    model_name: "bert-base-uncased"
    max_sequence_length: 300
    shuffle: True

  state_params:
    main_metric: &reduced_metric loss
    minimize_metric: True

  # the scheduler controls the learning rate during training
  scheduler_params:
    scheduler: ReduceLROnPlateau

  # callbacks serve to calculate losses and metrics,
  # update model weights, save checkpoints, etc.
  callbacks_params:
    loss_aggregator:
      callback: MetricAggregationCallback
      mode: weighted_sum
      metrics:
        cosine_loss: 1.0
        masked_lm_loss: 1.0
        kl_div_loss: 1.0
        mse_loss: 1.0
      prefix: loss
    cosine_loss:
      callback: CosineLossCallback
      prefix: cosine_loss
    masked_lm_loss:
      callback: MaskedLanguageModelCallback
      prefix: masked_lm_loss
    kl_div_loss:
      callback: KLDivLossCallback
      prefix: kl_div_loss
    mse_loss:
      callback: MSELossCallback
      prefix: mse_loss
    perplexity:
      callback: PerplexityMetricCallbackDistillation
    optimizer:
      callback: OptimizerCallback
    scheduler:
      callback: SchedulerCallback
      reduced_metric: *reduced_metric

  # params specific to stage 1, called "train_val"
  train_val:
    state_params:
      num_epochs: 1
    optimizer_params:
      optimizer: RAdam
      lr: 0.00005
And then run it with Catalyst:
catalyst-dl run -C config.yml --verbose
We can also add distributed training and fp16:
catalyst-dl run -C config.yml --verbose --distributed --fp16
bin
- bash files for running pipelines

configs
- just place configs here

docker
- project Docker files for pure reproducibility

examples
- examples of using this project

requirements
- different project python requirements for docker, tests, CI, etc

scripts
- data preprocessing scripts, utils, everything like python scripts/.py

src
- model, experiment, etc - research
First, let's discuss the feature you want to see in this project. You can use the feature request issue template.
After that you can write your code following these simple steps:
- Clone the repository
- run
pip install -r requirements/requirements-dev.txt -r requirements/requirements.txt
- write some code
- run
catalyst-make-codestyle
- run
catalyst-check-codestyle
- if the exit code is not 0, refactor your code
- commit!