JTruthfulQA is a Japanese version of TruthfulQA (Lin+, 2022). The dataset is not a translation of the original TruthfulQA; it was built from scratch.
The full set of benchmark questions and reference answers is available at data/JTruthfulQA.csv. The benchmark questions are divided into three types: Fact, Knowledge, and Uncategorized.
The task is to answer the given questions. To make the generated answers easier to evaluate, large language models (LLMs) are instructed to answer each question within 50 characters.
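As a rough illustration (not part of the repository scripts), the sketch below loads the question file with pandas and builds a prompt that asks for an answer within 50 characters. The Japanese prompt wording is an assumption for illustration only, not the official instruction text.

```python
# Minimal sketch: load the benchmark questions and build a prompt.
# The prompt wording below is an illustrative assumption, not the
# repository's official instruction; inspect the CSV header for the
# actual column names before relying on them.
import pandas as pd

df = pd.read_csv("data/JTruthfulQA.csv")
print(df.columns.tolist())   # inspect the real column names
print(df.shape)

def build_prompt(question: str) -> str:
    # Ask the model to answer within 50 characters, as the benchmark requires.
    return f"次の質問に50文字以内で答えてください。\n質問: {question}\n答え:"
```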
The table below shows the performance of humans and recent LLMs on each type of question. For human performance, we asked people to answer the questions in two settings: with and without web search. To generate the answers, we set the temperature to 0 for "GPT-3.5-turbo" and "GPT-4", and to 0.1 for the other LLMs.
| | Fact | Knowledge | Uncategorized | All | ft-GPT-3.5-Turbo | ft-waseda RoBERTa | BLEU | ROUGE1 | BERTScore | MC1 | MC2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Human (with search) | 0.741 | 0.762 | 0.647 | 0.750 | 0.629 | 0.753 | 6.99 | 0.28 | 0.14 | - | - |
| Human (without search) | 0.753 | 0.579 | 0.588 | 0.654 | 0.586 | 0.702 | 5.30 | 0.25 | 0.11 | - | - |
| GPT-3.5-turbo | 0.78 | 0.177 | 0.235 | 0.437 | 0.512 | 0.543 | 6.01 | 0.04 | -0.02 | - | - |
| GPT-4 | 0.869 | 0.409 | 0.529 | 0.609 | 0.601 | 0.611 | -0.673 | 0.03 | -0.01 | - | - |
| stabilityai/japanese-stablelm-instruct-alpha-7b | 0.212 | 0.271 | 0.235 | 0.245 | 0.207 | 0.232 | -7.26 | -0.05 | -0.09 | 0.129 | 0.130 |
| elyza/ELYZA-japanese-Llama-2-7b-instruct | 0.564 | 0.146 | 0.176 | 0.326 | 0.290 | 0.421 | -8.65 | -0.06 | -0.10 | 0.126 | 0.129 |
| matsuo-lab/weblab-10b-instruction-sft | 0.174 | 0.201 | 0.353 | 0.194 | 0.172 | 0.151 | -4.50 | -0.05 | -0.08 | 0.156 | 0.146 |
| line-corporation/japanese-large-lm-3.6b-instruction-sft | 0.378 | 0.165 | 0.294 | 0.260 | 0.192 | 0.320 | -1.52 | -0.01 | -0.04 | 0.152 | 0.152 |
We provide sample code, "SampleGeneration.py", which uses "elyza/ELYZA-japanese-Llama-2-7b-instruct" to generate the answers. Rewrite the code for the model that you want to use.
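"SampleGeneration.py" in the repository is the authoritative example. The snippet below is only a rough sketch of the same idea using Hugging Face transformers; the prompt template and generation settings are illustrative assumptions, not the contents of the actual script.

```python
# Rough sketch of answer generation with Hugging Face transformers.
# This is NOT the contents of SampleGeneration.py; the prompt template
# and generation settings are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "elyza/ELYZA-japanese-Llama-2-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def generate_answer(question: str) -> str:
    # Ask for an answer within 50 characters, following the benchmark setup.
    prompt = f"次の質問に50文字以内で答えてください。\n質問: {question}\n答え:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs, max_new_tokens=64, do_sample=True, temperature=0.1
        )
    # Strip the prompt tokens and return only the generated continuation.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
```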
Run "RobertaEvaluation.py" to evaluate the generated answers:

`python3 script/RobertaEvaluation.py "input_file_name" "output_file_name_1"`

The result is written to a "label" column (1: correct, 0: incorrect). "input_file_name" and "output_file_name_1" have to end with ".csv".
Run "ResultAnalysys.py" to see an analysis of the answers:

`python3 script/ResultAnalysys.py "output_file_name_1" "output_file_name_2"`

"output_file_name_2" has to end with ".json".
Each question has an original answer created by a human. The dataset also includes correct and incorrect answers generated by four LLMs ("GPT-3.5-turbo", "stabilityai/japanese-stablelm-instruct-alpha-7b", "elyza/ELYZA-japanese-Llama-2-7b-instruct", "matsuo-lab/weblab-10b-instruction-sft"). The original human answers are included among the correct answers.
JTruthfulQA.csv
This dataset has 3,078 correct answers and 3,281 incorrect answers (6,359 answers in total) over 582 questions.
JTruthfulQA_without_gpt.csv
This dataset has 2,125 correct answers and 2,267 incorrect answers (4,392 answers in total) over 551 questions. Answers generated by GPT-3.5-turbo are excluded from this dataset.
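As a quick sanity check of the answer-set sizes described above, the following sketch simply counts the rows in each file. The file paths and the lack of any column handling are assumptions; adjust them to where the CSVs live in your checkout.

```python
# Quick sanity check of the answer-set sizes described above.
# Paths are assumptions; adjust to the location of the files in the repo.
import pandas as pd

for path in ["data/JTruthfulQA.csv", "data/JTruthfulQA_without_gpt.csv"]:
    df = pd.read_csv(path)
    print(path, len(df), "rows")   # expected: 6,359 and 4,392 answers respectively
```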
@InProceedings{Nakamura_nlp2024,
author = "中村友亮 and 河原大輔",
title = "日本語TruthfulQAの構築",
booktitle = "言語処理学会第30回年次大会",
year = "2024",
url = "https://anlp.jp/proceedings/annual_meeting/2024/pdf_dir/P6-15.pdf",
pages = "1709--1714",
note= "in Japanese"
}
This dataset is distributed under CC BY-NC-SA 4.0.
This dataset was created in collaboration with SB Intuitions Corp. and Waseda University.