CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations


Figure: Construction of CHARM

Comparison of commonsense reasoning benchmarks

| Benchmarks | CN-Lang | CSR | CN-specifics | Dual-Domain | Rea-Mem |
|---|---|---|---|---|---|
| Most benchmarks in Davis (2023) | ✘ | ✔ | ✘ | ✘ | ✘ |
| XNLI, XCOPA, XStoryCloze | ✔ | ✔ | ✘ | ✘ | ✘ |
| LogiQA, CLUE, CMMLU | ✔ | ✘ | ✔ | ✘ | ✘ |
| CORECODE | ✔ | ✔ | ✘ | ✘ | ✘ |
| CHARM (ours) | ✔ | ✔ | ✔ | ✔ | ✔ |

"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and memorization tasks.

🚀 What's New

  • [2024.7.26] All inference and evaluation of CHARM are supported by OpenCompass. 🔥🔥🔥
  • [2024.6.06] Leaderboard updated! LLaMA-3, GPT-4o, Gemini-1.5, Yi1.5, Qwen1.5, etc. are evaluated.
  • [2024.5.24] CHARM has been open-sourced!!! 🔥🔥🔥
  • [2024.5.15] CHARM has been accepted to the main conference of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)!!! 🔥🔥🔥
  • [2024.3.21] The paper is available on arXiv.

🛠️ Inference and Evaluation with OpenCompass

Below are the steps to quickly download CHARM and run inference and evaluation with OpenCompass.

1. OpenCompass Environment Setup

Refer to the installation steps for OpenCompass.
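If OpenCompass is not installed yet, a minimal setup is sketched below. This is only an assumed, typical conda-based installation; follow the official OpenCompass installation guide for the authoritative steps and the Python/PyTorch versions it requires.

# minimal sketch of an OpenCompass setup (assumed conda environment and editable install)
conda create -n opencompass python=3.10 -y
conda activate opencompass
git clone https://github.com/open-compass/opencompass ${path_to_opencompass}
cd ${path_to_opencompass}
pip install -e .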

2. Download CHARM

git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}

cd ${path_to_opencompass}
mkdir data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
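
As a quick sanity check, you can confirm that the symlink resolves to the CHARM data before running anything:

# optional: verify the dataset is visible from the OpenCompass side
ls ./data/CHARM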

3. Run Inference and Evaluation

cd ${path_to_opencompass}

# modify config file `configs/eval_charm_rea.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_rea.py -r --dump-eval-details

# modify config file `configs/eval_charm_mem.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_mem.py -r --dump-eval-details
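
Note that the memorization evaluation is scored by an LLM judge (the result files below are named judged-by--GPT-3.5-turbo-0125), so the judge model configured in configs/eval_charm_mem.py needs valid API credentials. A minimal sketch, assuming the default judge is an OpenAI model whose key is read from the environment:

# assumption: the judge model reads its key from OPENAI_API_KEY; adjust to match
# however the judge is actually configured in configs/eval_charm_mem.py
export OPENAI_API_KEY=<your_key>   # placeholder value
python run.py configs/eval_charm_mem.py -r --dump-eval-details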

The inference and evaluation results will be written to ${path_to_opencompass}/outputs, with a layout like this:

outputs
├── CHARM_mem
│   └── chat
│       └── 20240605_151442
│           ├── predictions
│           │   ├── internlm2-chat-1.8b-turbomind
│           │   ├── llama-3-8b-instruct-lmdeploy
│           │   └── qwen1.5-1.8b-chat-hf
│           ├── results
│           │   ├── internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
│           │   ├── llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
│           │   └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
│           └── summary
│               └── 20240605_205020 # MEMORY_SUMMARY_DIR
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
│                   ├── judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
│                   └── judged-by--GPT-3.5-turbo-0125.csv # MEMORY_SUMMARY_CSV
└── CHARM_rea
    └── chat
        └── 20240605_152359
            ├── predictions
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            ├── results # REASON_RESULTS_DIR
            │   ├── internlm2-chat-1.8b-turbomind
            │   ├── llama-3-8b-instruct-lmdeploy
            │   └── qwen1.5-1.8b-chat-hf
            └── summary
                ├── summary_20240605_205328.csv # REASON_SUMMARY_CSV
                └── summary_20240605_205328.txt

4. Generate Analysis Results

cd ${path_to_CHARM_repo}

# generate Table5, Table6, Table9 and Table10 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}

# generate Figure3 and Figure9 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}

# generate Table7, Table12, Table13 and Figure11 in https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
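
Using the example output tree above, the placeholders map to concrete paths as follows (the timestamps come from the example run and will differ on your machine):

# placeholder values taken from the example output tree above
REASON_RESULTS_DIR=${path_to_opencompass}/outputs/CHARM_rea/chat/20240605_152359/results
REASON_SUMMARY_CSV=${path_to_opencompass}/outputs/CHARM_rea/chat/20240605_152359/summary/summary_20240605_205328.csv
MEMORY_SUMMARY_DIR=${path_to_opencompass}/outputs/CHARM_mem/chat/20240605_151442/summary/20240605_205020
MEMORY_SUMMARY_CSV=${MEMORY_SUMMARY_DIR}/judged-by--GPT-3.5-turbo-0125.csv

PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}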

🖊️ Citation

@misc{sun2024benchmarking,
      title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations}, 
      author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
      year={2024},
      eprint={2403.14112},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

💳 License

This project is released under the Apache 2.0 license.
