MERA: Multimodal Evaluation for Russian-language Architectures
LM-Evaluation-Harness support for the MERA benchmark datasets. This project provides a unified framework for evaluating generative language models on the MERA benchmark and its tasks.
To install `lm-eval` from the repository main branch, run:

```bash
pip install -e .
```

To support loading GPTQ quantized models, install the package with the `auto-gptq` extra:

```bash
pip install -e ".[auto-gptq]"
```
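For reference, a minimal end-to-end installation sketch is shown below; the clone URL placeholder and the assumption that the commands are run from the repository root are illustrative:

```bash
# Clone the benchmark code first (URL placeholder; substitute the actual repository address)
git clone <repository-url> mera-harness
cd mera-harness

# Install the harness in editable mode from the main branch checkout
pip install -e .
```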
Below is a sample command to run the benchmark with the `ai-forever/rugpt3large_based_on_gpt2` model (compatible with the `AutoModelForCausalLM` class) from the Hugging Face Hub:

```bash
CUDA_VISIBLE_DEVICES=0 MERA_FOLDER="$PWD/mera_results/rugpt3large_760m_defaults" MERA_MODEL_STRING="pretrained=ai-forever/rugpt3large_based_on_gpt2,dtype=auto" bash run_mera.sh
```
Below is a sample command to run the benchmark with the `ai-forever/FRED-T5-large` model (compatible with the `AutoModelForSeq2SeqLM` class) from the Hugging Face Hub:

```bash
CUDA_VISIBLE_DEVICES=0 MERA_FOLDER="$PWD/mera_results/FRED-T5-large_defaults" MERA_MODEL_STRING="pretrained=ai-forever/FRED-T5-large,dtype=auto" bash run_mera_seq2seq.sh
```
Use `CUDA_VISIBLE_DEVICES` to set CUDA device visibility, `MERA_FOLDER` to set the path where outputs are stored, and `MERA_MODEL_STRING` to set the `model_args` parameter of lm-evaluation-harness's `main.py`.
Use `MERA_COMMON_SETUP` to change the default parameters for model inference with `main.py` (the defaults are `--model hf-causal-experimental --device cuda --max_batch_size=64 --batch_size=auto --inference`).
See the next section for more details on these parameters.
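For example, to lower the cap of the automatic batch-size search for a full run, `MERA_COMMON_SETUP` can be overridden together with the other variables; the values below are illustrative only:

```bash
CUDA_VISIBLE_DEVICES=0 \
MERA_FOLDER="$PWD/mera_results/rugpt3large_760m_bs16" \
MERA_MODEL_STRING="pretrained=ai-forever/rugpt3large_based_on_gpt2,dtype=auto" \
MERA_COMMON_SETUP="--model hf-causal-experimental --device cuda --max_batch_size=16 --batch_size=auto --inference" \
bash run_mera.sh
```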
Note the two different bash scripts: `run_mera.sh` for models compatible with transformers' `AutoModelForCausalLM` class, and `run_mera_seq2seq.sh` for models compatible with transformers' `AutoModelForSeq2SeqLM` class.
Specific benchmark tasks can be run with the `main.py` script. Example:

```bash
CUDA_VISIBLE_DEVICES=3 python main.py --model hf-causal-experimental --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto,max_length=11500 \
--device cuda --output_base_path="$PWD/mera_results/Mistral-7B-v0.1_defaults" --batch_size=1 \
--inference --write_out --no_cache --tasks rummlu --num_fewshot=5 \
--output_path="$PWD/mera_results/Mistral-7B-v0.1_defaults/rummlu_result.json"
```
Use `--tasks` to provide a comma-separated list of tasks to run. The available options are: `bps`, `chegeka`, `lcs`, `mathlogicqa`, `multiq`, `parus`, `rcb`, `rudetox`, `ruethics`, `ruhatespeech`, `ruhhh`, `ruhumaneval`, `rummlu`, `rumodar`, `rumultiar`, `ruopenbookqa`, `rutie`, `ruworldtree`, `rwsd`, `simplear`, `use`. Omitting this argument runs all tasks with the same provided settings.
`--num_fewshot` sets the few-shot count. MERA expects tasks to be run with the following few-shot counts (see the example command after this list):

- `--num_fewshot=0` (zero-shot) for `multiq`, `parus`, `rcb`, `rumodar`, `rwsd`, `use`, `rudetox`, `ruethics`, `ruhatespeech`, `ruhhh`, `rutie`, and `ruhumaneval`;
- `--num_fewshot=2` for `bps` and `lcs`;
- `--num_fewshot=4` for `chegeka`;
- `--num_fewshot=5` for `mathlogicqa`, `ruworldtree`, `ruopenbookqa`, `simplear`, `rumultiar`, and `rummlu`.
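For example, the two 2-shot tasks `bps` and `lcs` can be evaluated in a single call; the model, device, and paths below are illustrative:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-causal-experimental \
  --model_args pretrained=ai-forever/rugpt3large_based_on_gpt2,dtype=auto \
  --device cuda --batch_size=1 --inference --write_out --no_cache \
  --tasks bps,lcs --num_fewshot=2 \
  --output_base_path="$PWD/mera_results/rugpt3large_760m_defaults" \
  --output_path="$PWD/mera_results/rugpt3large_760m_defaults/bps_lcs_result.json"
```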
Use `CUDA_VISIBLE_DEVICES` to set CUDA device visibility (setting `--device cuda:3` works inconsistently).
`--model hf-causal-experimental` is used for models compatible with transformers' `AutoModelForCausalLM` class, and `hf-seq2seq` is used for models compatible with transformers' `AutoModelForSeq2SeqLM` class.
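For a seq2seq model, a single-task run would look roughly like the sketch below; the model choice, task, and paths are illustrative:

```bash
CUDA_VISIBLE_DEVICES=0 python main.py --model hf-seq2seq \
  --model_args pretrained=ai-forever/FRED-T5-large,dtype=auto \
  --device cuda --batch_size=1 --inference --write_out --no_cache \
  --tasks parus --num_fewshot=0 \
  --output_base_path="$PWD/mera_results/FRED-T5-large_defaults" \
  --output_path="$PWD/mera_results/FRED-T5-large_defaults/parus_result.json"
```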
`--model_args` takes a comma-separated list of parameters for the auto class's `from_pretrained` method. Be aware of the hardware requirements for running large models, and limit the maximum input length with the `max_length` parameter to avoid out-of-memory errors during a run.
`--batch_size=1` sets a batch size of 1 to maximize the reproducibility of benchmark results. `--batch_size=auto` may be set to determine the batch size automatically based on the evaluated tasks and inputs; the maximum value from which the search works downwards is set with `--max_batch_size`. Larger batches may speed up running the whole MERA benchmark, but results may become irreproducible, so this is not the default suggestion.
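For instance, the Mistral example above could switch to automatic batching with a cap of 32; the cap value is an illustrative assumption, not a recommendation:

```bash
CUDA_VISIBLE_DEVICES=3 python main.py --model hf-causal-experimental \
  --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto,max_length=11500 \
  --device cuda --batch_size=auto --max_batch_size=32 --inference --write_out --no_cache \
  --tasks rummlu --num_fewshot=5 \
  --output_base_path="$PWD/mera_results/Mistral-7B-v0.1_defaults" \
  --output_path="$PWD/mera_results/Mistral-7B-v0.1_defaults/rummlu_result.json"
```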
`--output_base_path` is the path to a directory (created if it does not exist) where data for submission preparation and logs are stored.
`--inference` should always be used. It allows running on datasets for which no gold answers are provided (a score of 0 will be reported for them).
The `--write_out` flag turns on the extra logging required for public submissions.
`--no_cache` turns off caching of tokenized inputs and model files (datasets are not cached).
`--output_path` is the path to an extra log file with the run parameters and task results. It should preferably be placed inside the `output_base_path` directory.
The bash scripts above already run the submission zip-packing routine. Below is how to run the packing manually.

To convert the logs, run:

```bash
python scripts/log_to_submission.py
```

Command-line arguments:

- `--outputs_dir`: path to the directory with outputs (`MERA_FOLDER` from the bash scripts above);
- `--dst_dir`: directory where the submission zip is stored;
- `--dataset_dir`: path to `lm_eval/datasets/`;
- `--logs_public_submit` (`--no-logs_public_submit`): pack logs for public submission into a separate file (true by default).
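A typical manual invocation, with illustrative paths, might look like:

```bash
python scripts/log_to_submission.py \
  --outputs_dir "$PWD/mera_results/rugpt3large_760m_defaults" \
  --dst_dir "$PWD/mera_submissions" \
  --dataset_dir lm_eval/datasets/
```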