This repository requires Python >= 3.10.
Be sure to run in a virtual Python environment (e.g. conda, venv, mkvirtualenv).
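For example, a minimal venv setup (the ".venv" directory name is just a convention; pick any):

python3 -m venv .venv
source .venv/bin/activate    # on Windows: .venv\Scripts\activate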
In the root directory of this repo, run:
pip install -r requirements.txt
To run and evaluate the baseline:
python baseline.py -i "data/dev.jsonl" -o "predictions/baseline.pred.jsonl"
python evaluate.py -p "predictions/baseline.pred.jsonl" -g "data/dev.jsonl"
To run and evaluate our proposed GPT-3 approach, make sure you set your OPENAI_API_KEY in your environment variables.
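If the key is not set yet, export it in your shell first (the value below is a placeholder for your own key):

export OPENAI_API_KEY="your-key-here"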
This will use the default values, i.e. the text-davinci-002 model, data/dev.jsonl as input, and predictions/gpt3.pred.jsonl as output. Run:
python gpt3_baseline.py
python evaluate.py -p "predictions/gpt3.pred.jsonl" -g "data/dev.jsonl"
For the scaling experiment, set the --model flag to the desired model. The options include: ['text-davinci-002', 'text-curie-001', 'text-babbage-001', 'text-ada-001']. For example:
python gpt3_baseline -i "data/dev.jsonl" -o "predictions/gpt3-ada.pred.jsonl" --model "text-ada-001"
python evaluate.py -p "predictions/gpt3-ada.pred.jsonl" -g "data/dev.jsonl"
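To sweep all four models in one go, a loop along these lines should work (a sketch; the gpt3-${m}.pred.jsonl output naming is just an illustration):

for m in text-davinci-002 text-curie-001 text-babbage-001 text-ada-001; do
    # generate predictions with this model, then score them
    python gpt3_baseline.py -i "data/dev.jsonl" -o "predictions/gpt3-${m}.pred.jsonl" --model "${m}"
    python evaluate.py -p "predictions/gpt3-${m}.pred.jsonl" -g "data/dev.jsonl"
done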
TODO:
- Make changes that the competition organisers suggest [priority]
- Pull the changes from their repo
- Check our performance on the updated train/val dataset
- Dataset statistics (nice to include in the paper; see the sketch after this list)
  - The number of answers per relation
  - Count the number of 'None' answers per relation
- Logic integrity
  - Run for all prompts.
  - Report on performance differences.
- Submit current version to the leaderboard
- Look at failure cases
  - Wrong formatting? :: We tried different formatting - no significant improvement.
- Improve recall via:
  - Reduce temperature and generate multiple samples (k=3?)
  - Rephrase prompts? :: link to colab
- General improvements
  - Can we use the logprob?
  - Are we using other models?
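For the dataset statistics above, something along these lines could work (a sketch: the "Relation" and "ObjectEntities" field names are assumptions about the JSONL schema, and empty answer sets are assumed to be an empty list or null rather than a literal 'None'):

# total number of answers per relation (assumed field names)
jq -r '[.Relation, (.ObjectEntities | length)] | @tsv' data/dev.jsonl | awk '{sum[$1] += $2} END {for (r in sum) print r, sum[r]}'
# number of instances with no answer ('None') per relation
jq -r 'select(.ObjectEntities == [] or .ObjectEntities == null) | .Relation' data/dev.jsonl | sort | uniq -c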
Distributed under the MIT License. See LICENSE for more information.