In this package, you can use MT-bench questions and prompts to evaluate your models with LLM-as-a-judge. MT-bench is a set of challenging multi-turn open-ended questions for evaluating chat assistants. To automate the evaluation process, we prompt strong LLMs like GPT-4 to act as judges and assess the quality of the models' responses.
- Install
- Review Pre-Generated Model Answers and Judgments
- MT-Bench
- Agreement Computation
- Release Plan
- Citation
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e .
pip install openai anthropic ray
We provide pre-generated model answers and judgments for some models. You can view them at this demo.
To download the pre-generated data, use
python3 download_mt_bench_pregenerated.py
After downloading the data, you can view it locally with
python3 qa_browser.py --share
You can also use this QA browser later to view the answers you generate yourself.
python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
Arguments:
- [MODEL-PATH] is the path to the weights, which can be a local folder or a Hugging Face repo ID.
- [MODEL-ID] is a name you give to the model.
e.g.,
python gen_model_answer.py --model-path lmsys/vicuna-7b-v1.3 --model-id vicuna-7b-v1.3
The answers will be saved to data/mt_bench/model_answer/[MODEL-ID].jsonl.
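To inspect a generated answer file outside the QA browser, a generic JSON Lines reader is enough. The following is a minimal sketch, not part of FastChat: `load_jsonl` is a hypothetical helper, and it makes no assumption about which fields each record contains.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file and return one dict per non-empty line.

    Hypothetical helper for illustration; not a FastChat function.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `load_jsonl("data/mt_bench/model_answer/vicuna-7b-v1.3.jsonl")` would return the records written by the command above, assuming that file exists.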
To make sure FastChat loads the correct prompt template, see the supported models and how to add a new model here.
You can also specify --num-gpus-per-model for model parallelism (needed for large 65B models) and --num-gpus-total to parallelize answer generation with multiple GPUs.
There are several ways to use GPT-4 as a judge, such as pairwise win rate and single-answer grading. In MT-bench, we recommend single-answer grading as the default mode. This mode asks GPT-4 to grade a model's answer directly and assign it a score, without pairwise comparison. For each turn, GPT-4 gives a score on a 10-point scale. We then compute the average score over all turns.
python gen_judgment.py --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
e.g.,
python gen_judgment.py --model-list vicuna-13b-v1.3 alpaca-13b llama-13b claude-v1 gpt-3.5-turbo gpt-4 --parallel 2
The judgments will be saved to data/mt_bench/model_judgment/gpt-4_single.jsonl
- Show the scores for selected models
python show_result.py --model-list vicuna-13b-v1.3 alpaca-13b llama-13b claude-v1 gpt-3.5-turbo gpt-4
- Show all scores
python show_result.py
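The averaging step used in single-answer grading can be sketched as follows. This is an illustration under stated assumptions, not the logic of show_result.py: the record keys "model", "turn", and "score" are hypothetical, and the scores in `example` are invented for the demonstration.

```python
from collections import defaultdict


def average_scores(judgments):
    """Return {model: mean score over all judged turns}.

    Each judgment is assumed to be a dict with the hypothetical keys
    "model" and "score", one record per question turn.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for j in judgments:
        totals[j["model"]] += j["score"]
        counts[j["model"]] += 1
    return {m: totals[m] / counts[m] for m in totals}


# Made-up scores, for illustration only.
example = [
    {"model": "model-a", "turn": 1, "score": 8},
    {"model": "model-a", "turn": 2, "score": 6},
    {"model": "model-b", "turn": 1, "score": 9},
]
print(average_scores(example))  # {'model-a': 7.0, 'model-b': 9.0}
```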
Besides score-based single-answer grading, we also support two additional grading options based on win rates:
- pairwise-baseline: run pairwise comparison against a baseline model.
- pairwise-all: run pairwise comparison between all model pairs on all questions.
- Generate GPT-4 judgments
python gen_judgment.py --mode pairwise-baseline --model-list vicuna-13b-v1.3 alpaca-13b llama-13b --parallel 2
The judgments will be saved to data/mt_bench/model_judgment/gpt-4_pair.jsonl
- Show results
python show_result.py --mode pairwise-baseline
Another option is to run pairwise comparisons on all possible pairs. This becomes more expensive as the number of models increases, but it gives you more comprehensive information.
python gen_judgment.py --mode pairwise-all --model-list [LIST-OF-MODEL-ID] --parallel [num-concurrent-api-call]
python show_result.py --mode pairwise-all
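To get a feel for how the cost of the two pairwise modes scales, the number of model match-ups can be counted directly. This is a rough sketch under assumptions: it counts one match-up per model pair per question, ignores per-turn judgments and any repeated calls the judge script may make, and picks the baseline arbitrarily. The function names are hypothetical.

```python
from itertools import combinations


def baseline_matchups(models, baseline, num_questions):
    """Match-ups when every other model is compared against one baseline."""
    others = [m for m in models if m != baseline]
    return len(others) * num_questions


def all_pairs_matchups(models, num_questions):
    """Match-ups when every unordered pair of models is compared."""
    return len(list(combinations(models, 2))) * num_questions


models = ["vicuna-13b-v1.3", "alpaca-13b", "llama-13b", "gpt-3.5-turbo"]
# 80 is the number of MT-bench questions; the baseline choice is illustrative.
print(baseline_matchups(models, "gpt-3.5-turbo", 80))  # 240
print(all_pairs_matchups(models, 80))                  # 480
```

The baseline count grows linearly with the number of models, while the all-pairs count grows quadratically.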
Run python gen_api_answer.py --model [MODEL-NAME] to generate GPT-3.5/4 and Claude's answers.
We released 3.3K human annotations for model responses generated by 6 models in response to 80 MT-bench questions. The dataset is available at lmsys/mt_bench_human_judgments. You can use this data to compute the agreement between human judges and GPT-4.
wget https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/resolve/main/human_judgments.json
wget https://huggingface.co/datasets/lmsys/mt_bench_human_judgments/resolve/main/gpt4_pair_judgments.json
python compute_agreement.py --judges gpt4-pair human --votefiles human_judgments.json gpt4_pair_judgments.json
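Conceptually, agreement is the fraction of shared comparisons on which two judges pick the same winner. The sketch below illustrates that idea only: it is not the implementation of compute_agreement.py, the vote keys and winner labels are hypothetical, and the votes are invented.

```python
def agreement_rate(votes_a, votes_b):
    """Fraction of shared items on which two judges picked the same winner.

    votes_a and votes_b map a hypothetical key (question_id, model_1, model_2)
    to a winner label such as "model_1", "model_2", or "tie".
    Returns None when the two judges share no items.
    """
    shared = votes_a.keys() & votes_b.keys()
    if not shared:
        return None
    agree = sum(votes_a[k] == votes_b[k] for k in shared)
    return agree / len(shared)


# Made-up votes, for illustration only: the judges agree on 2 of 3 items.
human = {(81, "a", "b"): "model_1", (82, "a", "b"): "tie", (83, "a", "b"): "model_2"}
gpt4 = {(81, "a", "b"): "model_1", (82, "a", "b"): "model_2", (83, "a", "b"): "model_2"}
print(agreement_rate(human, gpt4))
```

How ties and inconsistent votes are treated changes the resulting number, so the actual script may report different values than this simplified definition.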
Our current release contains:
- The MT-bench questions, prompts, pre-generated answers, and pre-generated judgments.
- The 3K expert-level human annotations.
The next release will include:
- 30K arena conversations with human votes
If you find the repository helpful for your study, please consider citing the following paper: "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena":
@misc{zheng2023judging,
title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
year={2023},
eprint={2306.05685},
archivePrefix={arXiv},
primaryClass={cs.CL}
}