This is the source code for the paper "GPTScore: Evaluate as You Desire".
GPTScore is a novel evaluation framework that utilizes the emergent abilities (e.g., zero-shot instruction) of generative pre-trained models to score generated texts.
The GPTScore evaluation framework is:
- Customizable: customized instructions and demonstrations enable the evaluation of new aspects without labeled datasets;
- Multifaceted: a single evaluator performs multifaceted evaluations;
- Training-free: no additional training or fine-tuning is required.
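Concretely, GPTScore casts evaluation as conditional generation: an aspect-specific task description (the instruction), optional demonstrations, and the source text form a prompt, and a hypothesis is scored by the average log-probability the PLM assigns to its tokens given that prompt. The sketch below illustrates how the `use_ist` / `use_demo` switches used later in this README map onto such a prompt; the function name and template wording are placeholders, not the repository's exact code.

```python
# Illustrative sketch only: how the use_ist / use_demo switches map onto the
# conditioning prompt. Names and template wording are placeholders, not the
# repository's exact implementation.

def build_prompt(instruction: str,
                 demos: list[tuple[str, str]],
                 source: str,
                 use_ist: bool = True,
                 use_demo: bool = True) -> str:
    """Compose the conditioning text for a GPTScore-style evaluator.

    instruction -- aspect-specific task description (used when use_ist=True)
    demos       -- (source, reference) pairs serving as in-context examples
                   (used when use_demo=True)
    source      -- the input of the sample being evaluated
    """
    parts = []
    if use_ist:
        parts.append(instruction)
    if use_demo:
        parts.extend(f"{demo_src}\n{demo_ref}" for demo_src, demo_ref in demos)
    parts.append(source)
    return "\n\n".join(parts)

# The hypothesis is then scored by the PLM's average token log-probability
# given this prompt; see the GPT3- and OPT-based sketches further below.
```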
We explored 19 Pre-trained Language Models (PLMs) ranging in size from 80M (FLAN-T5-Small) to 175B (GPT3) to design GPTScore.
The PLMs studied in this paper are listed as follows:
| Model | Parameter | Evaluator Name | Model | Parameter | Evaluator Name |
|---|---|---|---|---|---|
| GPT3 | | | OPT | | |
| text-ada-001 | 350M | gpt3_score | OPT350M | 350M | opt350m_score |
| text-babbage-001 | 1.3B | gpt3_score | OPT-1.3B | 1.3B | opt1_3B_score |
| text-curie-001 | 6.7B | gpt3_score | OPT-6.7B | 6.7B | opt6_7B_score |
| text-davinci-001 | 175B | gpt3_score | OPT-13B | 13B | opt13B_score |
| text-davinci-003 | 175B | gpt3_score | OPT-66B | 66B | opt66B_score |
| FLAN-T5 | | | GPT2 | | |
| FT5-small | 80M | flan_small_score | GPT2-M | 355M | gpt2_medium_score |
| FT5-base | 250M | flan_base_score | GPT2-L | 774M | gpt2_large_score |
| FT5-L | 770M | flan_large_score | GPT2-XL | 1.5B | gpt2_xl_score |
| FT5-XL | 3B | flan_xl_score | GPT-J-6B | 6B | gptJ6B_score |
| FT5-XXL | 11B | flan_xxl_score | | | |
- Evaluator Name is the name of the GPTScore evaluator built on the Model listed to its left in the same row.
Here, we take evaluation with the GPT3 text-curie-001 model as an example.
- Setting `gpt3_score` to `True`: the GPTScore evaluator uses a GPT3-based PLM.
- Setting `gpt3model` to `curie`: the `text-curie-001` model is utilized.
- `out_dir_name`: set the folder for saving scoring results.
- `dataname`: set the dataset name for evaluation (e.g., `BAGEL`).
- `aspect`: set the aspect name to be evaluated (e.g., `quality`).
Set both `use_demo` and `use_ist` to `True` (use both the instruction and demonstrations):

```bash
python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo True \
  --use_ist True \
  --gpt3_score True \
  --gpt3model "curie" \
  --out_dir_name "gpt3Score_based" \
  --aspect 'quality'
```
Set `use_ist` to `True` and `use_demo` to `False` (instruction only, no demonstrations):

```bash
python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo False \
  --use_ist True \
  --gpt3_score True \
  --gpt3model "curie" \
  --out_dir_name "gpt3Score_based" \
  --aspect 'quality'
```
Set both `use_ist` and `use_demo` to `False` (neither instruction nor demonstrations):

```bash
python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo False \
  --use_ist False \
  --gpt3_score True \
  --gpt3model "curie" \
  --out_dir_name "gpt3Score_based" \
  --aspect 'quality'
```
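Under the hood, the GPT3-based evaluator needs the token-level log-probabilities of the hypothesis conditioned on the evaluation prompt. The sketch below shows one way to obtain them, assuming the legacy `openai` Python package (pre-1.0) and the Completions endpoint exposed by the `text-*` models; the function name and scoring details are illustrative, not the repository's exact code.

```python
# Hedged sketch (not the repository's exact code): obtain token log-probs of a
# hypothesis given a prompt via the legacy openai (<1.0) Completions endpoint.
# Assumes OPENAI_API_KEY is set in the environment.
import openai

def gpt3_logprob_score(prompt: str, hypothesis: str,
                       model: str = "text-curie-001") -> float:
    """Average log-probability of `hypothesis` tokens conditioned on `prompt`."""
    response = openai.Completion.create(
        model=model,
        prompt=prompt + hypothesis,
        max_tokens=0,   # generate nothing; we only score the given text
        echo=True,      # return log-probs for the input tokens themselves
        logprobs=0,
    )
    lp = response["choices"][0]["logprobs"]
    start = len(prompt)  # hypothesis begins at this character offset
    hyp_logprobs = [
        token_lp
        for token_lp, offset in zip(lp["token_logprobs"], lp["text_offset"])
        if offset >= start and token_lp is not None
    ]
    return sum(hyp_logprobs) / max(len(hyp_logprobs), 1)
```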
Here, we take evaluation with the OPT350M model as an example.
- Setting `opt350m_score` to `True`: use the evaluator named `opt350m_score`.
- `out_dir_name`: set the folder for saving scoring results.
- `dataname`: set the dataset name for evaluation (e.g., `BAGEL`).
- `aspect`: set the aspect name to be evaluated (e.g., `quality`).
Set both `use_demo` and `use_ist` to `True` (use both the instruction and demonstrations):

```bash
python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo True \
  --use_ist True \
  --opt350m_score True \
  --out_dir_name "optScore_based" \
  --aspect 'quality'
```
Set `use_ist` to `True` and `use_demo` to `False` (instruction only, no demonstrations):

```bash
python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo False \
  --use_ist True \
  --opt350m_score True \
  --out_dir_name "optScore_based" \
  --aspect 'quality'
```
Set both `use_ist` and `use_demo` to `False` (neither instruction nor demonstrations):

```bash
python score_d2t.py \
  --dataname "BAGEL" \
  --use_demo False \
  --use_ist False \
  --opt350m_score True \
  --out_dir_name "optScore_based" \
  --aspect 'quality'
```
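The OPT-family evaluators (like the other open-weight PLMs in the table) compute the same quantity locally. Below is a minimal sketch using the Hugging Face `facebook/opt-350m` checkpoint; it illustrates the scoring idea only and is not the repository's exact implementation.

```python
# Hedged sketch (not the repository's exact code): score a hypothesis with a
# local OPT-350M model as the average log-probability of its tokens given a prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
model.eval()

def opt_logprob_score(prompt: str, hypothesis: str) -> float:
    """Average log-probability of `hypothesis` tokens conditioned on `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab)
    # Position t predicts token t+1, so shift logits and targets by one.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    # Keep only the positions that predict hypothesis tokens (this simple slice
    # assumes the prompt's tokenization is unchanged by the concatenation).
    hyp_lp = token_lp[prompt_len - 1:]
    return hyp_lp.mean().item()
```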
If you find this work helpful, please cite the paper:

```bibtex
@article{fu2023gptscore,
  title={GPTScore: Evaluate as You Desire},
  author={Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei},
  journal={arXiv preprint arXiv:2302.04166},
  year={2023}
}
```
