Mu-scaling: Loss Prediction via Maximal Update Parametrization

We show that models of different widths trained with Maximal Update Parametrization (Mup) form a sequence whose losses fit a modified scaling law, which enables accurate loss prediction.

Mu-scaling paper: https://arxiv.org/abs/2304.06875

This implementation is based on Hugging Face and MuTransformers, with modifications to improve stability and support DeepSpeed.

Quick Start

1. Environment Setting

You can manage your Python environment with conda or another tool. We recommend conda for simplicity.

conda create -n mu_scaling python=3.8
conda activate mu_scaling
pip install -r requirements.txt

If you are in China, you can speed up installation by using the Tsinghua PyPI mirror: run pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple instead of pip install -r requirements.txt.

2. Data Preparation

Preprocess your datasets for causal language modeling following the Hugging Face instructions. An example of processed data is provided in res/final_data/test.
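
As a rough illustration of what this preprocessing usually involves, the sketch below concatenates tokenized documents and cuts them into fixed-length blocks, which is the chunking pattern used in the Hugging Face causal language modeling examples. The function name group_texts, the block_size value, and the toy token ids are assumptions for illustration only; the exact format this repository expects is best checked against the example in res/final_data/test.

```python
def group_texts(tokenized_docs, block_size):
    """Concatenate token-id lists and split them into equal-length blocks."""
    # Flatten all documents into one long stream of token ids.
    stream = [tok for doc in tokenized_docs for tok in doc]
    # Drop the tail that does not fill a complete block.
    usable = (len(stream) // block_size) * block_size
    blocks = [stream[i:i + block_size] for i in range(0, usable, block_size)]
    # For causal language modeling the labels are a copy of the inputs;
    # Hugging Face causal LM models shift them internally.
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}


# Toy token ids standing in for real tokenizer output (hypothetical values).
docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
grouped = group_texts(docs, block_size=4)
print(grouped["input_ids"])
```

Here the nine toy tokens yield two blocks of four, and the ninth token is discarded because it cannot fill a block.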

3. Train GPT-2 with Mup

sh run_grid_search_pair_wise_mup.sh
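
For intuition about why a learning rate tuned at one width can be reused at another, the sketch below shows the Adam learning-rate rule commonly described for Mup: hidden weight matrices have their learning rate divided by the width multiplier, while other parameters keep the base value. This is a simplified illustration, not the logic in mup_trainer.py; the function name mup_adam_lr, the matrix_like flag, and all numbers are hypothetical.

```python
def mup_adam_lr(base_lr, base_width, width, matrix_like):
    """Return a per-parameter learning rate under a simplified Mup rule for Adam."""
    # Width multiplier relative to the base (proxy) model.
    width_mult = width / base_width
    if matrix_like:
        # Hidden weight matrices: both dimensions grow with width,
        # so the learning rate shrinks proportionally.
        return base_lr / width_mult
    # Parameters with at most one width-dependent dimension
    # (for example biases) keep the base learning rate.
    return base_lr


# Hypothetical sweep: one base learning rate reused across widths.
for width in (256, 512, 1024):
    hidden_lr = mup_adam_lr(1e-3, base_width=256, width=width, matrix_like=True)
    bias_lr = mup_adam_lr(1e-3, base_width=256, width=width, matrix_like=False)
    print(width, hidden_lr, bias_lr)
```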

4. Plot Loss Landscape

If Mup works correctly, loss basins for different widths should be aligned.

python visualize_lr_landscape.py
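
Alongside the plot, the alignment can also be checked numerically. The sketch below is one possible check, assuming the grid search results have been collected into a nested dictionary of width, learning rate, and loss; this layout, the function names, and the loss values are hypothetical and are not the format written by the scripts in this repository.

```python
def best_lr_per_width(losses):
    """Map each width to the learning rate with the lowest recorded loss."""
    return {width: min(by_lr, key=by_lr.get) for width, by_lr in losses.items()}


def basins_aligned(losses):
    """Return True if every width reaches its lowest loss at the same learning rate."""
    return len(set(best_lr_per_width(losses).values())) == 1


# Hypothetical grid search results: width -> {learning rate: training loss}.
losses = {
    256: {1e-4: 3.90, 1e-3: 3.62, 1e-2: 3.85},
    512: {1e-4: 3.71, 1e-3: 3.40, 1e-2: 3.66},
    1024: {1e-4: 3.55, 1e-3: 3.21, 1e-2: 3.50},
}
print(best_lr_per_width(losses), basins_aligned(losses))
```

Requiring an identical optimum is a strict criterion. On a coarse grid it may be more reasonable to accept optima that differ by at most one grid step.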

5. Fit Scaling Laws

Record the training loss of each model at the same training step on the same data, then run

python fit_scale_loss_prediction.py
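
To make the fitting step concrete, the sketch below fits a power law with a constant offset, L(N) = a * N^b + c, to losses from small models and extrapolates to a larger one. This functional form is an assumption chosen for illustration; the exact modified scaling law is defined in the paper and in fit_scale_loss_prediction.py. The model sizes and losses are synthetic, generated from known coefficients so that the fit can be checked, and the grid search over c is just one simple way to do the fit using only the standard library.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def fit_power_law(sizes, losses, step=0.01):
    """Fit L = a * N**b + c by scanning c and fitting a line in log-log space."""
    xs = [math.log(n) for n in sizes]
    best = None
    # The offset c must stay below the smallest observed loss.
    for i in range(max(1, int(min(losses) / step))):
        c = i * step
        ys = [math.log(loss - c) for loss in losses]
        b, log_a, sse = fit_line(xs, ys)
        if best is None or sse < best[0]:
            best = (sse, math.exp(log_a), b, c)
    _, a, b, c = best
    return a, b, c


def predict_loss(n, a, b, c):
    """Evaluate the fitted law at model size n."""
    return a * n ** b + c


# Synthetic data from known coefficients (not results from the paper).
true_a, true_b, true_c = 20.0, -0.3, 2.0
sizes = [1e6, 2e6, 4e6, 8e6, 1.6e7]
losses = [predict_loss(n, true_a, true_b, true_c) for n in sizes]

a, b, c = fit_power_law(sizes, losses)
predicted = predict_loss(1e8, a, b, c)
print(a, b, c, predicted)
```

With real measurements the losses are noisy, so a nonlinear least-squares routine and a held-out model size for validation would likely be more appropriate than this grid scan.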

6. Evaluation

To run on evaluation data, we suggest first training all the models for more steps, and then running

sh run_eval_ppl_loss_pred.sh
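
The script name suggests that evaluation reports perplexity together with loss. Assuming the loss is the mean per-token cross-entropy in nats, the two are related by a simple exponential, as sketched below. The function name perplexity and the sample values are hypothetical and are not taken from run_eval_ppl_mup.py.

```python
import math


def perplexity(token_losses):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    mean_loss = sum(token_losses) / len(token_losses)
    return math.exp(mean_loss)


# Hypothetical per-token losses: a model that is uniformly unsure among
# four choices has cross-entropy log(4) per token.
ppl = perplexity([math.log(4.0)] * 3)
print(ppl)
```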

References

If this project helps you, please star the repository and cite us. Thanks!

@article{DBLP:journals/corr/abs-2304-06875,
  author       = {Yiqun Yao and Yequan Wang},
  title        = {Research without Re-search: Maximal Update Parametrization Yields Accurate Loss Prediction across Scales},
  journal      = {CoRR},
  volume       = {abs/2304.06875},
  year         = {2023}
}
