# Evaluations
This directory contains end-to-end pipelines for AI-enhanced evaluation. This document introduces the evaluation pipeline and the data format it uses.
## Generate Answers
### ChatGPT (gpt-3.5-turbo)
Make sure you have set up the OpenAI API key in your environment (the OpenAI Python client typically reads it from the `OPENAI_API_KEY` environment variable). Then run:
```bash
python qa_baseline_gpt35.py --question table/question.jsonl --output table/answer/answer_gpt35.jsonl
```
### Bard
Unfortunately, Bard has not released a public API so far. You may have to enter the answers manually, or you could use a third-party project that interfaces with Bard.
### Vicuna and others
To generate answers with Vicuna or other models, specify the path to the model checkpoint. Then run:
```bash
python model_qa.py --model-name /model/path --question-file table/question.jsonl --answer-file table/answer/answer.jsonl
```
## Evaluate Answers Automatically
### Generate Reviews with GPT-4
PS: If you do not currently have access to the GPT-4 API but do have access to the GPT-4 chatbot, you can evaluate the answers manually, following the instructions in the **Data Format** section. `table/review/*.jsonl` contains some example reviews.

TODO: add instructions
## Visualize Results
You can generate the data for the webpage by running:
```bash
python eval/generate_webpage_data_from_table.py
```
Then you can serve a static website in `webpage` to see the results.
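For example, one simple way to serve it locally (any static file server works):

```bash
cd webpage
python -m http.server 8000
# then open http://localhost:8000 in a browser
```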
## Data Format
If you want a deeper understanding of our evaluation pipeline, or want to contribute to the evaluation process, you need to learn the data format we use for evaluation.

Our evaluation data are encoded with [JSON Lines](https://jsonlines.org/): each file contains one JSON record per line.
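For example, a minimal sketch of loading one of these files with the standard `json` module (the path matches the question file used in the commands above):

```python
import json

# Each line of a .jsonl file is a standalone JSON object.
with open("table/question.jsonl") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(records[0])
```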
### Random ID Generation
We use the `shortuuid` Python library for generating short random UUIDs.
```python
import shortuuid

# shortuuid.uuid() returns a short, URL-safe random ID as a string.
new_id = shortuuid.uuid()  # e.g. a 22-character string
```
### Models
`model.jsonl` contains the model information we used for generating answers. Each row is a record of a model with the following fields:
* `model_id` (str): A unique ID for a model. Models with different IDs are supposed to have different performance. This ID is generated as `{model_name}:{model_version}`.
* `model_name` (str): The name of a model. This is not unique, because a model can be trained and updated continuously; it is still considered the same model, just with different versions.
* `model_version` (str): The version of a model.
* `model_metadata` (Any): Any metadata of a model (descriptions, etc.). This is optional.

For example:
```json
{
  "model_id": "vicuna-13b:v1",
  "model_name": "vicuna-13b",
  "model_version": "v1",
  "model_metadata": "learning rate 1e-5, 3 epochs, 13b"
}
```
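As noted above, a `model_id` is simply the model name and version joined by a colon; a one-line sketch:

```python
model_name, model_version = "vicuna-13b", "v1"
model_id = f"{model_name}:{model_version}"  # -> "vicuna-13b:v1"
```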
### Prompts
We store prompts in `prompt.jsonl`. Each row is a record of a prompt with the following fields:
* `prompt_id` (int): A unique integer ID for a prompt. Prompts with different IDs are supposed to serve different purposes.
* `system_prompt` (str): The system prompt given to a model. This is the prompt that the model sees first.
* `prompt_template` (str): The prompt body. This is the user prompt that the model sees after the system prompt. It is a Python format-string template (with placeholders such as `{question}`), so that we can fill in the inputs later.
* `defaults` (dict): A dictionary of default values for the prompt template. It can be empty.
* `description` (str): A description of the functionality of the prompt.

For example:
```json
{
  "prompt_id": 1,
  "system_prompt": "You are a helpful assistant.",
  "prompt_template": "[Question]\n{question}\n\n[Assistant 1]\n{answer_1}\n\n[End of Assistant 1]\n\n[Assistant 2]\n{answer_2}\n\n[End of Assistant 2]\n\n[System]\n{prompt}\n\n",
  "defaults": {"prompt": "Which assistant is more helpful?"},
  "description": "Compare two assistants' answers to a question."
}
```
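A minimal sketch of how the template and its defaults can be combined to build the reviewer's user prompt (the question and answer texts below are placeholders):

```python
# The record mirrors the example above; only the template and defaults are needed here.
prompt_record = {
    "prompt_template": (
        "[Question]\n{question}\n\n[Assistant 1]\n{answer_1}\n\n[End of Assistant 1]\n\n"
        "[Assistant 2]\n{answer_2}\n\n[End of Assistant 2]\n\n[System]\n{prompt}\n\n"
    ),
    "defaults": {"prompt": "Which assistant is more helpful?"},
}

fields = dict(prompt_record["defaults"])   # start from the defaults
fields.update(
    question="How can I improve my time management skills?",
    answer_1="Answer from the first assistant...",
    answer_2="Answer from the second assistant...",
)
user_prompt = prompt_record["prompt_template"].format(**fields)
print(user_prompt)
```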
### Reviewers
`reviewer.jsonl` contains the reviewer information we used for reviewing answers generated by different models. Each row is a record of a reviewer with the following fields:
* `reviewer_id` (str): A unique ID for a reviewer. Reviewers with different IDs are supposed to have different reviewing performance.
* `prompt_id` (int): The ID of the prompt given to the reviewer (e.g., an AI assistant). Different prompts could result in different reviewing performance.
* `metadata` (dict): Metadata about a reviewer's configuration (e.g., decoding parameters).
* `description` (str): A description of the reviewer.

For example:
```json
{
  "reviewer_id": "gpt-4-0328-default",
  "prompt_id": 1,
  "metadata": {"temperature": 0.2, "max_tokens": 8192},
  "description": "GPT-4 for generic questions."
}
```
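A small sketch of resolving a reviewer record to its prompt and generation settings; the `table/` file paths here are assumptions based on the commands above:

```python
import json

def load_jsonl(path):
    # Read a JSON Lines file into a list of dicts.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

reviewers = load_jsonl("table/reviewer.jsonl")
prompts = {p["prompt_id"]: p for p in load_jsonl("table/prompt.jsonl")}

reviewer = reviewers[0]
prompt = prompts[reviewer["prompt_id"]]     # the prompt this reviewer uses
gen_config = reviewer.get("metadata", {})   # e.g. temperature, max_tokens
```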
### Questions
`question.jsonl` contains the questions we used for evaluation. Each row is a record of a question with the following fields:
* `question_id` (int): A unique integer for a question. Questions with different IDs are supposed to be different.
* `text` (str): The question text.
* `category` (str): The category of the question. Questions with the same category are supposed to be similar or originate from the same source.
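
For example (an illustrative record; see `table/question.jsonl` for the actual questions):

```json
{
  "question_id": 1,
  "text": "How can I improve my time management skills?",
  "category": "generic"
}
```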
### Answers
`answer/xxx.jsonl` contains answers generated by different models. Each row is a record of an answer with the following fields:
* `answer_id` (str): A unique UUID for an answer. Answers with different IDs are supposed to be different.
* `question_id` (int): The ID of the question the answer is generated for.
* `model_id` (str): The ID of the model the answer is generated by.
* `text` (str): The answer text.
* `metadata` (dict): Any metadata of the answer.

Example:
```json
{
  "answer_id": "[short uuid]",
  "question_id": 1,
  "model_id": "vicuna-13b:v1",
  "text": "Here are five tips...",
  "metadata": {}
}
```
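Putting the format and the ID generation together, a minimal sketch of writing one answer record (the output path and the answer text are placeholders):

```python
import json
import shortuuid

answer = {
    "answer_id": shortuuid.uuid(),   # short random UUID, as described above
    "question_id": 1,
    "model_id": "vicuna-13b:v1",
    "text": "Here are five tips...",
    "metadata": {},
}

# Append the record as one line of JSON (JSON Lines format).
with open("table/answer/answer_example.jsonl", "a") as f:
    f.write(json.dumps(answer) + "\n")
```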
### Reviews
`review/xxx.jsonl` contains reviews given by reviewers, comparing the performance of a pair of models. Each row is a record of a review with the following fields:
* `review_id` (str): A unique UUID for a review. Reviews with different IDs are supposed to be different.
* `question_id` (int): The ID of the question the review is given for.
* `answer1_id` (str): The ID of the first answer.
* `answer2_id` (str): The ID of the second answer.
* `text` (str): The review text.
* `score` (list): A list of scores given by the reviewer. The first score is for the first answer, and the second score is for the second answer.
* `reviewer_id` (str): The ID of the reviewer.
* `metadata` (dict): Any metadata of the review.

For example:

```json
{
  "review_id": "[short uuid]",
  "question_id": 1,
  "answer1_id": "[answer1_id]",
  "answer2_id": "[answer2_id]",
  "text": "Assistant 2 is better...",
  "score": [9.0, 7.5],
  "reviewer_id": "gpt-4-0328-default",
  "metadata": {}
}
```
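Since `score[0]` belongs to the first answer and `score[1]` to the second, aggregating the results of a review file is straightforward; a minimal sketch (the file path is a placeholder):

```python
import json

wins_1 = wins_2 = ties = 0
with open("table/review/review_example.jsonl") as f:  # placeholder path
    for line in f:
        review = json.loads(line)
        s1, s2 = review["score"]  # [score for answer 1, score for answer 2]
        if s1 > s2:
            wins_1 += 1
        elif s2 > s1:
            wins_2 += 1
        else:
            ties += 1

print(f"answer 1 wins: {wins_1}, answer 2 wins: {wins_2}, ties: {ties}")
```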