On the Advance of Making Language Models Better Reasoners

[Paper]

News

  • [August, 2022] Data release: GSM8K and StrategyQA, generated by code-davinci-002.

Dataset Details

Each subfolder in the /data folder corresponds to a reasoning benchmark. You can find more details about each benchmark in that subfolder's README.md file.

Generally, these subfolders contain train.jsonl and test.jsonl files. Each line of these files follows the same format:

{
    // context: the prompt sequence provided to the language model.
    // {Qi}/{Ei}/{Ai} represent the question/chain-of-thought/answer of the i-th exemplar.
    // {Q} represents the question for inference.
    "context": "Question:\n{Q1}\n{E1}\n#### {A1}\n\n{...}Question:\n{Qk}\n{Ek}\n#### {Ak}\n\nQuestion:\n{Q}\nAnswer:\n",

    // samples: multiple output sequences sampled from the language model, given the prompt sequence as input.
    "samples": [
        "{E}\n#### {A}\n\n",
        "{E'}\n#### {A'}\n\n",
        "{E''}\n#### {A''}\n\n"
        ...
    ],

    // {E*}/{A*} represent the ground-truth chain-of-thought/answer of {Q}.
    // if the dataset doesn't provide ground-truth chains of thought, {E*} will be "No chain-of-thought provided.".
    "metadata": {
        "question": "{Q}",
        "ground_truth": "{E*}#### {A*}"
    }
}
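As a minimal sketch of how such a file can be consumed (the helper names here are hypothetical, not part of this repository; only the field names follow the schema above), each line can be parsed as JSON and the final answer recovered from the text after the `####` delimiter:

```python
import json


def final_answer(sample: str) -> str:
    """Extract the answer that follows the '####' delimiter in a sampled output."""
    return sample.split("####", 1)[1].strip()


def load_samples(path: str):
    """Yield (question, ground-truth answer, sampled answers) per line of a .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            meta = record["metadata"]
            yield (
                meta["question"],
                final_answer(meta["ground_truth"]),
                [final_answer(s) for s in record["samples"]],
            )
```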

Currently, all data released in this repository were generated by the code-davinci-002 model provided by OpenAI.

Usage

Given train.jsonl and test.jsonl files generated by large-scale pretrained language models, you can use the code provided in the code folder to reproduce our results. Here we take the gsm8k dataset as an example.

Prerequisites

  • Install the dependencies listed in code/verifier_data_prepare.yaml and code/verifier_train.yaml.
  • Register a wandb account and get a wandb API key.
  • Create a new folder (denoted as {EXEC_DIR}) and initialize this folder as follows:
{EXEC_DIR}
├── train_dir
│   └── train.jsonl
├── test_dir
│   └── test.jsonl
├── train_preprocessed    // this is an empty folder
├── test_preprocessed     // this is an empty folder
└── exec                  // this is an empty folder
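This layout can also be created programmatically; the following is a small sketch (the function name `init_exec_dir` is our own, not part of the repository's code):

```python
from pathlib import Path


def init_exec_dir(exec_dir: str) -> None:
    """Create the folder layout expected by the preprocessing and training scripts."""
    root = Path(exec_dir)
    for sub in ("train_dir", "test_dir", "train_preprocessed",
                "test_preprocessed", "exec"):
        (root / sub).mkdir(parents=True, exist_ok=True)
```

After creating the layout, place your generated train.jsonl under train_dir and test.jsonl under test_dir.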

Data Pre-Processing

In the code/src folder, run these two commands:

python verifier_data_prepare.py \
    --generator_result_file {EXEC_DIR}/train_dir \
    --output_dir {EXEC_DIR}/train_preprocessed \
    --split train \
    --random_seed 233 \
    --dataset_name GSM8K

python verifier_data_prepare.py \
    --generator_result_file {EXEC_DIR}/test_dir \
    --output_dir {EXEC_DIR}/test_preprocessed \
    --split dev \
    --random_seed 233 \
    --dataset_name GSM8K

You can find the detailed parameter specifications in code/verifier_data_prepare.yaml.

Training and Evaluation

In the code/src folder, run these commands:

export WANDB_API_KEY={your_wandb_api_key_here}
export WANDB_PROJECT=deberta-verifier
export WANDB_RUN_ID=gsm8k-codedavinci002
export WANDB_TAGS=deberta_verifier
export NCCL_DEBUG=INFO

deepspeed --num_gpus=8 run_ner.py \
    --task_type NER \
    --dataset_name GSM8K \
    --train_data {EXEC_DIR}/train_preprocessed \
    --test_data {EXEC_DIR}/test_preprocessed \
    --output_dir {EXEC_DIR}/exec \
    --max_seq_length 512 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 64 \
    --lr_scheduler_type constant \
    --seed 233 \
    --logging_steps 10 \
    --overwrite_output_dir \
    --alpha 0.1 \
    --deepspeed ds_config.json

You can find the detailed parameter specifications in code/verifier_train.yaml.

Logs

All the training/evaluation logs will be uploaded to your wandb account.

Key logged metrics include:

  • eval_weighted_voting_top1_accuracy@100: solve rate of DIVERSE (our approach);
  • eval_voting_top1_accuracy@100: solve rate of DIVERSE w/o verifier (i.e., each candidate is weighted equally);
  • eval_verifier_top1_accuracy@100: solve rate of DIVERSE w/o voting (i.e., selecting the candidate with the highest verifier score).
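The difference between these three metrics can be illustrated with a small sketch (based only on the descriptions above, not on the repository's implementation): given candidate answers and per-candidate verifier scores, the three selection rules are weighted voting, plain majority voting, and picking the single highest-scoring candidate.

```python
from collections import defaultdict


def select_answer(answers, scores, mode="weighted_voting"):
    """Pick a final answer from sampled candidates.

    answers: candidate answers, one per sample
    scores:  verifier scores, aligned with answers
    mode:    'weighted_voting' (DIVERSE), 'voting' (each candidate weighted
             equally), or 'verifier' (highest-scoring single candidate)
    """
    if mode == "verifier":
        # No voting: return the candidate with the highest verifier score.
        return answers[max(range(len(answers)), key=scores.__getitem__)]
    # Aggregate weight per distinct answer, then pick the heaviest answer.
    weights = defaultdict(float)
    for ans, score in zip(answers, scores):
        weights[ans] += score if mode == "weighted_voting" else 1.0
    return max(weights, key=weights.get)
```

For example, with candidates ["8", "8", "9"] and scores [0.1, 0.2, 0.9], plain voting picks "8" (two votes), while both weighted voting and pure verifier selection pick "9".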

Citation

If our work is useful for you, please consider citing our paper:

@article{li2022advance,
  title={On the Advance of Making Language Models Better Reasoners},
  author={Li, Yifei and Lin, Zeqi and Zhang, Shizhuo and Fu, Qiang and Chen, Bei and Lou, Jian-Guang and Chen, Weizhu},
  journal={arXiv preprint arXiv:2206.02336},
  year={2022}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

License

This repository is released under the MIT License.