Lucas Georges Gabriel Charpentier and David Samuel
University of Oslo
Language Technology Group
Paper
HuggingFace 100M model
HuggingFace 10M model
This paper introduces a novel modification of the transformer architecture, tailored for the data-efficient pretraining of language models. This aspect is evaluated by participating in the BabyLM challenge, where our solution won both the STRICT and STRICT-SMALL tracks. Our approach allows each transformer layer to select which outputs of previous layers to process. The empirical results verify the potential of this simple modification and show that not all layers are equally as important.
This is the official repository for our BabyLM 2023 submission: ELC-BERT.
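To make the core idea concrete, below is a minimal, simplified sketch of an encoder in which each layer receives a learned weighted combination of all previous layers' outputs (including the embedding output), rather than only the output of the layer directly below it. The class name, the softmax normalization, and the scalar per-layer weights are illustrative assumptions; the actual variants (`base`, `normalized`, `weighted_output`, `zero`) implemented in `./models/` parameterize these connections differently.

```python
import torch
import torch.nn as nn

class ELCStyleEncoder(nn.Module):
    """Sketch only: every layer i combines the outputs of layers 0..i-1
    (index 0 being the embedding output) via learned scalar weights."""

    def __init__(self, layers: nn.ModuleList):
        super().__init__()
        self.layers = layers
        # One learnable scalar per (layer, earlier output) pair.
        self.layer_weights = nn.ParameterList(
            [nn.Parameter(torch.ones(i + 1)) for i in range(len(layers))]
        )

    def forward(self, embedding_output: torch.Tensor) -> torch.Tensor:
        outputs = [embedding_output]
        for i, layer in enumerate(self.layers):
            # Softmax normalization is an illustrative choice here; the
            # paper's variants differ in how (and whether) they normalize.
            weights = torch.softmax(self.layer_weights[i], dim=0)
            layer_input = sum(w * o for w, o in zip(weights, outputs))
            outputs.append(layer(layer_input))
        return outputs[-1]
```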
- `./train_elc_bert_*.py`: Scripts to train an ELC-BERT model (replace `*` with `base`, `normalized`, `weighted_output`, or `zero`).
- `./preprocess/`: Scripts for processing the BabyLM 2023 datasets.
- `./tokenizers/`: Script for creating a tokenizer, as well as where the tokenizers are saved.
- `./configs/`: Folder containing model configs.
- `./pre_training/`: Scripts for the dataset, optimizer, and utilities used in pretraining.
- `./models/`: Folder containing the training models.
After having preprocessed your data and created your tokenizer, you are ready to train your ELC-BERT model. To do so, you can run:
python train_elc_bert_*.py \
--input_path="PATH_TO_CACHED_DATA" \
--config_file="PATH_TO_CONFIG_FILE" \
--output_dir="PATH_TO_OUTPUT_DIR" \
--vocab_path="PATH_TO_TOKENIZER_FILE" \
--checkpoint_path="PATH_TO_MODEL_CHECKPOINT" \ # (Optional, to continue training)
--optimizer="NAME_OF_OPTIMIZER" \ # Options: lamb, adamw
--scheduler="NAME_OF_SCHEDULER" \ # (Not implemented) Options: cosine
--seq_length=MAX_SEQUENCE_LENGTH \
--batch_size=TRAINING_BATCH_SIZE \
--learning_rate=MAX_TRAINING_LEARNING_RATE \
--max_steps=NUMBER_OF_TRAINING_STEPS \
--long_after=FRACTION_AFTER_WHICH_TO_4x_SEQUENCE_LENGTH \
--warmup_proportion=FRACTION_OF_TRAINING_STEPS_FOR_WARMUP \
--seed=RANDOMIZATION_SEED \
--log_freq=LOSS_LOGGING_FREQUENCY \ # For WANDB, unused
--mask_p=TOKEN_MASKING_PROBABILITY \
--short_p=PROBABILITY_OF_SHORTENING_SEQUENCE \
--weight_decay=FRACTION_OF_WEIGHT_DECAY \
--max_gradient=MAX_GRADIENT_BEFORE_CLIPPING \
--gradient_accumulation=NUMBER_GRADIENT_ACCUMULATION_STEPS \
--label_smoothing=CROSS_ENTROPY_LABEL_SMOOTHING \
--wandb_entity="WANDB_ENTITY_NAME" \
--wandb_name="WANDB_RUN_NAME" \
--wandb_project="WANDB_PROJECT_NAME"
A few things to note:
- In the dataset (see `pre_training/dataset.py`) you can pass `random_p` and `keep_p`, representing the probabilities of replacing a masked token with either a random token or the original token. In the code they are both set to 0.1 by default, but this can be changed (see the sketch after this list).
- Our code assumes the usage of wandb, but this can be removed. In general, before calling wandb we check `is_main_process()` (when running on multiple GPUs/CPUs, it makes sure only one process, the main one, executes the code) so that a single model does not create multiple wandb runs.
- We assume the usage of SLURM at the start of the code (to import wandb) (lines 31-32); if you do not use SLURM, remove line 31 (and line 32 if you do not use wandb).
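As a point of reference, the `random_p`/`keep_p` behaviour described above follows the standard BERT masking recipe: a fraction `mask_p` of tokens is chosen as prediction targets, and each target is then replaced by the mask token, replaced by a random token (probability `random_p`), or kept unchanged (probability `keep_p`). The snippet below is a self-contained sketch of that logic, not the code in `pre_training/dataset.py`; the function name and tensor handling are illustrative.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size,
                mask_p=0.15, random_p=0.1, keep_p=0.1):
    """Sketch of BERT-style masking with the three replacement outcomes."""
    labels = input_ids.clone()
    targets = torch.rand(input_ids.shape) < mask_p
    labels[~targets] = -100  # non-target positions are ignored by the loss

    roll = torch.rand(input_ids.shape)
    masked = input_ids.clone()
    use_random = targets & (roll < random_p)          # replace with random token
    use_mask = targets & (roll >= random_p + keep_p)  # replace with mask token
    # targets falling in between keep their original token (probability keep_p)
    masked[use_mask] = mask_token_id
    random_ids = torch.randint(vocab_size, input_ids.shape)
    masked[use_random] = random_ids[use_random]
    return masked, labels
```

Similarly, the `is_main_process()` guard mentioned above typically reduces to a check of the distributed rank; this generic sketch is not the repository's exact helper:

```python
import torch.distributed as dist

def is_main_process() -> bool:
    # Without an initialized process group (single-GPU/CPU run), treat the
    # process as main; otherwise only global rank 0 is the main process.
    return not dist.is_initialized() or dist.get_rank() == 0
```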
@inproceedings{georges-gabriel-charpentier-samuel-2023-layers,
title = "Not all layers are equally as important: Every Layer Counts {BERT}",
author = "Georges Gabriel Charpentier, Lucas and
Samuel, David",
editor = "Warstadt, Alex and
Mueller, Aaron and
Choshen, Leshem and
Wilcox, Ethan and
Zhuang, Chengxu and
Ciro, Juan and
Mosquera, Rafael and
Paranjabe, Bhargavi and
Williams, Adina and
Linzen, Tal and
Cotterell, Ryan",
booktitle = "Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.conll-babylm.20",
doi = "10.18653/v1/2023.conll-babylm.20",
pages = "238--252",
}