GPTScore: A Novel Evaluation Framework for Text Generation Models #811

Open
ShellLM opened this issue Apr 22, 2024 · 1 comment
Labels
  • code-generation: code generation models and tools like copilot and aider
  • llm-evaluation: evaluating Large Language Model performance and behavior through human-written evaluation sets
  • Papers: research papers

ShellLM commented Apr 22, 2024

GPTScore: A Novel Evaluation Framework for Text Generation Models

GPTScore: Evaluate as You Desire

This is the source code for the paper GPTScore: Evaluate as You Desire.

What is GPTScore?

GPTScore is a novel evaluation framework that uses the emergent abilities (e.g., zero-shot instruction following) of generative pre-trained models to score generated texts.

The GPTScore evaluation framework supports:

  • Customizable. Customized instructions and demonstrations enable the evaluation of new aspects without labeled datasets;
  • Multifaceted. One evaluator performs multifaceted evaluations;
  • Training-free.
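
A minimal sketch of the idea described above: the score of a hypothesis is the average log-probability its tokens receive from a pre-trained LM conditioned on an instruction-formatted prompt. The model choice, prompt wording, and helper function below are illustrative assumptions, not the repository's actual code.

```python
# Sketch of the GPTScore idea: average log p(hypothesis | instruction-formatted prompt).
# Illustrative only; not the repository's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gpt_score(prompt: str, hypothesis: str) -> float:
    """Average log-probability of each hypothesis token given the prompt and its prefix."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    hyp_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, hyp_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Position t of the logits predicts token t+1, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict hypothesis tokens.
    hyp_lp = token_lp[:, prompt_ids.shape[1] - 1:]
    return hyp_lp.mean().item()

# Hypothetical BAGEL-style example: higher scores indicate text the LM finds more likely.
prompt = ("Generate a fluent utterance for the dialogue act.\n"
          "Dialogue act: inform(name=Bar Crudo; food=Seafood)\nUtterance: ")
print(gpt_score(prompt, "Bar Crudo is a seafood restaurant."))
```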

What PLMs does GPTScore support?

We explored 19 Pre-trained Language Models (PLMs) ranging in size from 80M (FLAN-T5-Small) to 175B (GPT3) to design GPTScore. The PLMs studied in this paper are listed as follows:

Family   Model             Parameter  Evaluator Name
GPT3     text-ada-001      350M       gpt3_score
GPT3     text-babbage-001  1.3B       gpt3_score
GPT3     text-curie-001    6.7B       gpt3_score
GPT3     text-davinci-001  175B       gpt3_score
GPT3     text-davinci-003  175B       gpt3_score
OPT      OPT350M           350M       opt350m_score
OPT      OPT-1.3B          1.3B       opt1_3B_score
OPT      OPT-6.7B          6.7B       opt6_7B_score
OPT      OPT-13B           13B        opt13B_score
OPT      OPT-66B           66B        opt66B_score
FLAN-T5  FT5-small         80M        flan_small_score
FLAN-T5  FT5-base          250M       flan_base_score
FLAN-T5  FT5-L             770M       flan_large_score
FLAN-T5  FT5-XL            3B         flan_xl_score
FLAN-T5  FT5-XXL           11B        flan_xxl_score
GPT2     GPT2-M            355M       gpt2_medium_score
GPT2     GPT2-L            774M       gpt2_large_score
GPT2     GPT2-XL           1.5B       gpt2_xl_score
GPT2     GPT-J-6B          6B         gptJ6B_score

The Evaluator Name column gives the name of the evaluator corresponding to each model.
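
The GPT-3 evaluators are accessed through the OpenAI API, while the open models correspond to publicly released checkpoints. The mapping below is a hypothetical illustration of that correspondence (Hugging Face model IDs chosen by us), not the repository's actual configuration.

```python
# Hypothetical mapping from a few evaluator names to public Hugging Face checkpoints.
# Illustrative assumption, not the repository's configuration.
EVALUATOR_CHECKPOINTS = {
    "flan_small_score":  "google/flan-t5-small",   # 80M
    "flan_xxl_score":    "google/flan-t5-xxl",     # 11B
    "opt350m_score":     "facebook/opt-350m",      # 350M
    "opt66B_score":      "facebook/opt-66b",       # 66B
    "gpt2_medium_score": "gpt2-medium",            # 355M
    "gptJ6B_score":      "EleutherAI/gpt-j-6b",    # 6B
}
```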

Usage

Use a GPT3-based model as the evaluator

Take evaluation with the GPT-3 text-curie-001 model as an example.

  • gpt3_score: set to True so that the GPTScore evaluator uses a GPT3-based PLM (see the sketch after this list).
  • gpt3model: set to curie to use the text-curie-001 model.
  • out_dir_name: the folder where scoring results are saved.
  • dataname: the dataset to evaluate (e.g., BAGEL).
  • aspect: the aspect to evaluate (e.g., quality).
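
When gpt3_score is enabled, the likelihood comes from the OpenAI API rather than a local model: the prompt and hypothesis are submitted together and the API is asked to return log-probabilities for the submitted tokens. The sketch below assumes the legacy openai<1.0 Completion API; the prompt handling is a simplified illustration, and text-curie-001 has since been deprecated.

```python
# Sketch of the GPT-3-based scoring pattern: echo the input, generate nothing,
# and average the log-probabilities of the tokens that make up the hypothesis.
# Assumes the legacy openai<1.0 Completion API; illustrative, not the repo's code.
import openai

def gpt3_score(prompt: str, hypothesis: str, engine: str = "text-curie-001") -> float:
    full_text = prompt + hypothesis
    resp = openai.Completion.create(
        engine=engine,
        prompt=full_text,
        max_tokens=0,   # score the input only, generate nothing
        echo=True,      # return logprobs for the submitted tokens
        logprobs=0,
    )
    lp = resp["choices"][0]["logprobs"]
    start = len(prompt)
    # Approximate span matching by character offset; the first token has no logprob.
    token_lps = [
        t for t, off in zip(lp["token_logprobs"], lp["text_offset"])
        if off >= start and t is not None
    ]
    return sum(token_lps) / max(len(token_lps), 1)
```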

1. GPTScore with Instruction and Demonstration

Set both use_demo and use_ist to True.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo True \
    --use_ist True \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'

2. GPTScore with only Instruction

Set use_ist to True and use_demo to False.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist True \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'

3. GPTScore without both Instruction and Demonstration

Set both use_ist and use_demo to False.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist False \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'
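
The three variants above differ only in how the scoring prompt is assembled before it is handed to the evaluator. The template below is a hypothetical illustration of that difference; the actual wording used by the repository may differ.

```python
# Hypothetical illustration of how use_ist / use_demo change the scoring prompt.
def build_prompt(source, use_ist, use_demo, instruction="", demos=None):
    parts = []
    if use_ist:
        # Variants 1 and 2: prepend an aspect-specific instruction.
        parts.append(instruction)
    if use_demo and demos:
        # Variant 1 only: add demonstration pairs before the sample to score.
        parts.extend(f"Source: {src}\nTarget: {tgt}" for src, tgt in demos)
    # Variant 3: with both flags off, the prompt is just the source text.
    parts.append(f"Source: {source}\nTarget: ")
    return "\n\n".join(parts)
```

The resulting prompt is then scored together with the hypothesis by the chosen evaluator (for example, the gpt3_score sketch above).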

For more information, visit the GitHub repository.

Suggested labels

None


ShellLM commented Apr 22, 2024

Related content

#498 similarity score: 0.9
#499 similarity score: 0.89
#309 similarity score: 0.89
#383 similarity score: 0.89
#762 similarity score: 0.89
#456 similarity score: 0.89
