GPTScore: A Novel Evaluation Framework for Text Generation Models #811

Open
ShellLM opened this issue Apr 22, 2024 · 1 comment
Labels
  • code-generation: code generation models and tools like copilot and aider
  • llm-evaluation: evaluating Large Language Model performance and behavior through human-written evaluation sets
  • Papers: research papers

ShellLM commented Apr 22, 2024

GPTScore: A Novel Evaluation Framework for Text Generation Models

GPTScore: Evaluate as You Desire

This is the source code for the paper GPTScore: Evaluate as You Desire.

What is GPTScore?

GPTScore is a novel evaluation framework that uses the emergent abilities (e.g., zero-shot instruction following) of generative pre-trained models to score generated texts.

The GPTScore evaluation framework supports:

  • Customizable. Customized instructions and demonstrations enable the evaluation of new aspects without labeled datasets;
  • Multifaceted. One evaluator performs multifaceted evaluations;
  • Training-free.
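
A minimal sketch of the idea described above: the score of a hypothesis is the average log-probability its tokens receive from a pre-trained LM conditioned on an instruction-formatted prompt. The model choice, prompt wording, and helper function below are illustrative assumptions, not the repository's actual code.

```python
# Sketch of the GPTScore idea: average log p(hypothesis | instruction-formatted prompt).
# Illustrative only; not the repository's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gpt_score(prompt: str, hypothesis: str) -> float:
    """Average log-probability of each hypothesis token given the prompt and its prefix."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    hyp_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, hyp_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position t of the logits predicts token t+1, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict hypothesis tokens.
    hyp_lp = token_lp[:, prompt_ids.shape[1] - 1:]
    return hyp_lp.mean().item()

# Hypothetical BAGEL-style example: higher scores indicate text the LM finds more likely.
prompt = ("Generate a fluent utterance for the dialogue act.\n"
          "Dialogue act: inform(name=Bar Crudo; food=Seafood)\nUtterance: ")
print(gpt_score(prompt, "Bar Crudo is a seafood restaurant."))
```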

What PLMs does GPTScore support?

We explored 19 Pre-trained Language Models (PLMs) ranging in size from 80M (FLAN-T5-Small) to 175B (GPT3) to design GPTScore. The PLMs studied in this paper are listed as follows:

Family   Model             Parameter  Evaluator Name
GPT3     text-ada-001      350M       gpt3_score
GPT3     text-babbage-001  1.3B       gpt3_score
GPT3     text-curie-001    6.7B       gpt3_score
GPT3     text-davinci-001  175B       gpt3_score
GPT3     text-davinci-003  175B       gpt3_score
OPT      OPT350M           350M       opt350m_score
OPT      OPT-1.3B          1.3B       opt1_3B_score
OPT      OPT-6.7B          6.7B       opt6_7B_score
OPT      OPT-13B           13B        opt13B_score
OPT      OPT-66B           66B        opt66B_score
FLAN-T5  FT5-small         80M        flan_small_score
FLAN-T5  FT5-base          250M       flan_base_score
FLAN-T5  FT5-L             770M       flan_large_score
FLAN-T5  FT5-XL            3B         flan_xl_score
FLAN-T5  FT5-XXL           11B        flan_xxl_score
GPT2     GPT2-M            355M       gpt2_medium_score
GPT2     GPT2-L            774M       gpt2_large_score
GPT2     GPT2-XL           1.5B       gpt2_xl_score
GPT2     GPT-J-6B          6B         gptJ6B_score

The Evaluator Name column gives the name of the evaluator corresponding to each model.
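
The GPT-3 evaluators are accessed through the OpenAI API, while the open models correspond to publicly released checkpoints. The mapping below is a hypothetical illustration of that correspondence (Hugging Face model IDs chosen by us), not the repository's actual configuration.

```python
# Hypothetical mapping from a few evaluator names to public Hugging Face checkpoints.
# Illustrative assumption, not the repository's configuration.
EVALUATOR_CHECKPOINTS = {
    "flan_small_score":  "google/flan-t5-small",   # 80M
    "flan_xxl_score":    "google/flan-t5-xxl",     # 11B
    "opt350m_score":     "facebook/opt-350m",      # 350M
    "opt66B_score":      "facebook/opt-66b",       # 66B
    "gpt2_medium_score": "gpt2-medium",            # 355M
    "gptJ6B_score":      "EleutherAI/gpt-j-6b",    # 6B
}
```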

Usage

Use a GPT3-based model as the evaluator

Take evaluation with the GPT-3 text-curie-001 model as an example.

  • gpt3_score: set to True so that the GPTScore evaluator uses a GPT3-based PLM (see the sketch after this list).
  • gpt3model: set to curie to use the text-curie-001 model.
  • out_dir_name: the folder where scoring results are saved.
  • dataname: the dataset to evaluate (e.g., BAGEL).
  • aspect: the aspect to evaluate (e.g., quality).
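
When gpt3_score is enabled, the likelihood comes from the OpenAI API rather than a local model: the prompt and hypothesis are submitted together and the API is asked to return log-probabilities for the submitted tokens. The sketch below assumes the legacy openai<1.0 Completion API; the prompt handling is a simplified illustration, and text-curie-001 has since been deprecated.

```python
# Sketch of the GPT-3-based scoring pattern: echo the input, generate nothing,
# and average the log-probabilities of the tokens that make up the hypothesis.
# Assumes the legacy openai<1.0 Completion API; illustrative, not the repo's code.
import openai

def gpt3_score(prompt: str, hypothesis: str, engine: str = "text-curie-001") -> float:
    full_text = prompt + hypothesis
    resp = openai.Completion.create(
        engine=engine,
        prompt=full_text,
        max_tokens=0,   # score the input only, generate nothing
        echo=True,      # return logprobs for the submitted tokens
        logprobs=0,
    )
    lp = resp["choices"][0]["logprobs"]
    start = len(prompt)
    # Approximate span matching by character offset; the first token has no logprob.
    token_lps = [
        t for t, off in zip(lp["token_logprobs"], lp["text_offset"])
        if off >= start and t is not None
    ]
    return sum(token_lps) / max(len(token_lps), 1)
```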

1. GPTScore with Instruction and Demonstration

Set both use_demo and use_ist to True.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo True \
    --use_ist True \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'

2. GPTScore with only Instruction

Set use_ist to True and use_demo to False.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist True \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'

3. GPTScore without both Instruction and Demonstration

Set both use_ist and use_demo to False.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist False \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'
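
The three variants above differ only in how the scoring prompt is assembled before it is handed to the evaluator. The template below is a hypothetical illustration of that difference; the actual wording used by the repository may differ.

```python
# Hypothetical illustration of how use_ist / use_demo change the scoring prompt.
def build_prompt(source, use_ist, use_demo, instruction="", demos=None):
    parts = []
    if use_ist:
        # Variants 1 and 2: prepend an aspect-specific instruction.
        parts.append(instruction)
    if use_demo and demos:
        # Variant 1 only: add demonstration pairs before the sample to score.
        parts.extend(f"Source: {src}\nTarget: {tgt}" for src, tgt in demos)
    # Variant 3: with both flags off, the prompt is just the source text.
    parts.append(f"Source: {source}\nTarget: ")
    return "\n\n".join(parts)
```

The resulting prompt is then scored together with the hypothesis by the chosen evaluator (for example, the gpt3_score sketch above).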

For more information, visit the GitHub repository.

Suggested labels

None


ShellLM commented Apr 22, 2024

Related content

#498 similarity score: 0.9
#499 similarity score: 0.89
#309 similarity score: 0.89
#383 similarity score: 0.89
#762 similarity score: 0.89
#456 similarity score: 0.89
