This repository provides code for various baseline models reported in the EMNLP-2018 paper introducing the OpenBookQA dataset: Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
@inproceedings{OpenBookQA2018,
title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
booktitle={EMNLP},
year={2018}
}
Please visit the OpenBookQA Leaderboard for the latest on this challenge!
-
Create the
obqa
environment using Anacondaconda create -n obqa python=3.6
-
Activate the environment
source activate obqa
-
Install the requirements in the environment:
Note: The script below installs Pytorch 0.4.0 for CUDA 8 only. If you are using a different CUDA version, please visit http://pytorch.org/ and install the relevant version.
bash scripts/install_requirements.sh
Download the OpenBookQA dataset and embeddings using the script below.
Note that this includes downloading glove.840B.300d.txt.gz
, a 2GB file
containing 300-dimensional GloVe word embeddings
trained on 840B tokens, which can take several minutes.
If you already have this file, you might consider altering the script.
bash scripts/download_and_prepare_data.sh
If you are interested in using the pre-trained models from the paper, you can download them using the command below.
Note: Some of the models that use ELMo are more than 700MB. If you do not plan to use them or have a slow internet connection, you might want to modify the download script and exclude them from downloading.
Note: The downloaded models are for best performing run on Dev.
bash scripts/download_trained_models.sh
If you use the script below, you might want to first look at
scripts/experiments/qa/run_experiment_openbookqa.sh
and set the
EXPERIMENTS_OUTPUT_DIR_BASE
environment variable to a directory
where you want to save the output of the experiments.
Default is _experiments
.
Note: If you want to use GPU for the experiments, make sure to change the
trainer.cuda_device
setting to the desired CUDA device id. Default is -1
(no GPU).
You can also use scripts/experiments/qa/run_experiment_openbookqa_gpu.sh
(automatically sets trainer.cuda_device
to CUDA device 0
)
instead of scripts/experiments/qa/run_experiment_openbookqa.sh
in the experiments commands below.
Table: Comparison between models with Glove (default) and ELMo. The comparison is mentioned in the text of the paper. The results displayed here are avg accuracy (equivalent to exam score) and the Std across 5 runs with different random seeds and the result for the best run on Dev.
Model | Dev (5 runs) | Test (5 runs) | Dev (Best run) | Test |
---|---|---|---|---|
Question-to-Choice (Question Match) | 54.6±1.2 | 50.2±0.9 | 56.8 | 49.8 |
Question-to-Choice + ELMo | 57.1±1.1 | 50.6±1.2 | 58.4 | 50.0 |
ESIM | 53.9±0.4 | 48.9±1.1 | 54.4 | 47.4 |
ESIM + ELMo | 55.5±0.6 | 50.7±0.7 | 56.4 | 49.6 |
Experiments with pre-trained GloVe embedding vectors:
Train a new model
config_file=training_config/qa/multi_choice/openbookqa/reader_mc_qa_question_to_choice.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Evaluate on the pre-trained model
MODEL_ARCHIVE=data/trained_models/model_q2ch_best_run.tar.gz
EVALUATION_DATA_FILE=data/OpenBookQA-V1-Sep2018/Data/Main/test.jsonl
python obqa/run.py evaluate_predictions_qa_mc \
--archive_file ${MODEL_ARCHIVE} \
--evaluation_data_file ${EVALUATION_DATA_FILE} \
--output_file ${MODEL_ARCHIVE##*/}_pred_${EVALUATION_DATA_FILE##*/}
Experiments with ELMo contextual word representations:
Train a new model
config_file=training_config/qa/multi_choice/openbookqa/reader_mc_qa_question_to_choice_elmo.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Evaluate on the pre-trained model
MODEL_ARCHIVE=data/trained_models/model_q2ch_elmo_best_run.tar.gz
EVALUATION_DATA_FILE=data/OpenBookQA-V1-Sep2018/Data/Main/test.jsonl
python obqa/run.py evaluate_predictions_qa_mc \
--archive_file ${MODEL_ARCHIVE} \
--evaluation_data_file ${EVALUATION_DATA_FILE} \
--output_file ${MODEL_ARCHIVE##*/}_pred_${EVALUATION_DATA_FILE##*/}
Experiments with Glove pre-trained embeddings vectors:
Train a new model
config_file=training_config/qa/multi_choice/openbookqa/reader_mc_qa_esim.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Evaluate on the pre-trained model
MODEL_ARCHIVE=data/trained_models/model_esim_best_run.tar.gz
EVALUATION_DATA_FILE=data/OpenBookQA-V1-Sep2018/Data/Main/test.jsonl
python obqa/run.py evaluate_predictions_qa_mc \
--archive_file ${MODEL_ARCHIVE} \
--evaluation_data_file ${EVALUATION_DATA_FILE} \
--output_file ${MODEL_ARCHIVE##*/}_pred_${EVALUATION_DATA_FILE##*/}
Experiments with ELMo contextual representations:
Train a new model
config_file=training_config/qa/multi_choice/openbookqa/reader_mc_qa_esim_elmo.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Evaluate on the pre-trained model
MODEL_ARCHIVE=data/trained_models/model_esim_elmo_best_run.tar.gz
EVALUATION_DATA_FILE=data/OpenBookQA-V1-Sep2018/Data/Main/test.jsonl
python obqa/run.py evaluate_predictions_qa_mc \
--archive_file ${MODEL_ARCHIVE} \
--evaluation_data_file ${EVALUATION_DATA_FILE} \
--output_file ${MODEL_ARCHIVE##*/}_pred_${EVALUATION_DATA_FILE##*/}
Rank OpenBook (Science) knowledge facts for the given question:
DATA_DIR_ROOT=data/
KNOWLEDGE_DIR_ROOT=data/knowledge
OPENBOOKQA_DIR=${DATA_DIR_ROOT}/OpenBookQA-V1-Sep2018
ranking_out_dir=${OPENBOOKQA_DIR}/Data/Main/ranked_knowledge/openbook
mkdir -p ${ranking_out_dir}
data_file=${OPENBOOKQA_DIR}/Data/Main/full.jsonl
know_file=${KNOWLEDGE_DIR_ROOT}/openbook.csv
PYTHONPATH=. python obqa/data/retrieval/knowledge/rank_knowledge_for_mc_qa.py \
-o ${ranking_out_dir} -i ${data_file} \
-k ${know_file} -n tfidf --max_facts_per_choice 100 \
--limit_items 0
DATA_DIR_ROOT=data/
KNOWLEDGE_DIR_ROOT=data/knowledge
OPENBOOKQA_DIR=${DATA_DIR_ROOT}/OpenBookQA-V1-Sep2018
ranking_out_dir=${OPENBOOKQA_DIR}/Data/Main/ranked_knowledge/cn5omcs
mkdir -p ${ranking_out_dir}
data_file=${OPENBOOKQA_DIR}/Data/Main/full.jsonl
know_file=${KNOWLEDGE_DIR_ROOT}/CN5/cn5_omcs.json
PYTHONPATH=. python obqa/data/retrieval/knowledge/rank_knowledge_for_mc_qa.py \
-o ${ranking_out_dir} -i ${data_file} \
-k ${know_file} -n tfidf --max_facts_per_choice 100 \
--limit_items 0
DATA_DIR_ROOT=data/
KNOWLEDGE_DIR_ROOT=data/knowledge
OPENBOOKQA_DIR=${DATA_DIR_ROOT}/OpenBookQA-V1-Sep2018
ranking_out_dir=${OPENBOOKQA_DIR}/Data/Main/ranked_knowledge/cn5wordnet
mkdir -p ${ranking_out_dir}
data_file=${OPENBOOKQA_DIR}/Data/Main/full.jsonl
know_file=${KNOWLEDGE_DIR_ROOT}/CN5/cn5_wordnet.json
PYTHONPATH=. python obqa/data/retrieval/knowledge/rank_knowledge_for_mc_qa.py \
-o ${ranking_out_dir} -i ${data_file} \
-k ${know_file} -n tfidf --max_facts_per_choice 100 \
--limit_items 0
Note: This is Oracle knowledge -- a hypothetical setting that assumes access to the gold science fact. The goal here is to allow research effort to focus on the sub-challenges of retrieving the missing commonsense knowledge, and reasoning with both facts in order to answer the question. A full model for OpenBookQA should, of course, not rely on such Oracle knowledge.
DATA_DIR_ROOT=data/
KNOWLEDGE_DIR_ROOT=data/knowledge
OPENBOOKQA_DIR=${DATA_DIR_ROOT}/OpenBookQA-V1-Sep2018
ranking_out_dir=${OPENBOOKQA_DIR}/Data/Main/ranked_knowledge/openbook_oracle
mkdir -p ${ranking_out_dir}
data_file=${OPENBOOKQA_DIR}/Data/Main/full.jsonl
know_file=${OPENBOOKQA_DIR}/Data/Additional/full_complete.jsonl
PYTHONPATH=. python obqa/data/retrieval/knowledge/rank_knowledge_for_mc_qa.py \
-o ${ranking_out_dir} -i ${data_file} \
-k ${know_file} -n tfidf --max_facts_per_choice 1 \
--limit_items 0 \
--knowledge_reader reader_gold_facts_arc_mc_qa_2 \
--dataset_reader reader_arc_qa_question_choice_facts
Various baselines that adapt and train the Knowledge-Enhanced Reader model from ACL-2018 for the OpenBookQA setting, using various sources of knowledge.
- Oracle Open Book fact + Conceptnet OMCS
(referred to as the
f + ConceptNet
Oracle setup in the paper)
config_file=training_config/qa/multi_choice/openbookqa/knowreader_v1_mc_qa_multi_source_oracle_openbook_plus_cn5omcs.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
- Oracle Open Book fact + WordNet
(referred to as the
f + WordNet
Oracle setup in the paper)
config_file=training_config/qa/multi_choice/openbookqa/knowreader_v1_mc_qa_multi_source_oracle_openbook_plus_cn5wordnet.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Note: These experiments are not reported in the main paper! These are additional baseline models whose Dev and Test scores are listed below for reference.
Table: Additional (Non-Oracle) experiments with external knowledge. The results displayed here are avg accuracy (equivalent to exam score) and the Std across 5 runs with different random seeds and the result for the best run on Dev.
Model | Dev (5 runs) | Test (5 runs) | Dev (Best run) | Test |
---|---|---|---|---|
ConceptNet only (cn5omcs) | 54.0±0.6 | 51.1±2.1 | 54.4 | 52.2 |
Wordnet only (cn5wordnet) | 54.9±0.4 | 49.4±1.5 | 55.6 | 51.4 |
OpenBook + ConceptNet | 53.8±1.0 | 51.2±1.1 | 54.6 | 50.8 |
OpenBook + Wordnet | 53.3±0.7 | 50.6±0.6 | 54.2 | 51.2 |
Below are commands for training new models or evaluating on the pre-trained models from the EMNLP paper. Note that even if you just evaluate on pre-trained models, you still need to run the knowledge retrieval from 2.1. Retrieve external knowledge.
- Open Mind Common Sense part of ConceptNet only (cn5omcs)
Train a new model
config_file=training_config/qa/multi_choice/openbookqa/knowreader_v1_mc_qa_multi_source_cn5omcs.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Evaluate on the pre-trained model
MODEL_ARCHIVE=data/trained_models/model_kn_conceptnet5_best_run.tar.gz
EVALUATION_DATA_FILE=data/OpenBookQA-V1-Sep2018/Data/Main/test.jsonl
python obqa/run.py evaluate_predictions_qa_mc_know_visualize \
--archive_file ${MODEL_ARCHIVE} \
--evaluation_data_file ${EVALUATION_DATA_FILE} \
--output_file ${MODEL_ARCHIVE##*/}_pred_${EVALUATION_DATA_FILE##*/}
- WordNet part of ConceptNet only (cn5wordnet)
Train a new model
config_file=training_config/qa/multi_choice/openbookqa/knowreader_v1_mc_qa_multi_source_cn5wordnet.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Evaluate on the pre-trained model
MODEL_ARCHIVE=data/trained_models/model_kn_wordnet_best_run.tar.gz
EVALUATION_DATA_FILE=data/OpenBookQA-V1-Sep2018/Data/Main/test.jsonl
python obqa/run.py evaluate_predictions_qa_mc_know_visualize \
--archive_file ${MODEL_ARCHIVE} \
--evaluation_data_file ${EVALUATION_DATA_FILE} \
--output_file ${MODEL_ARCHIVE##*/}_pred_${EVALUATION_DATA_FILE##*/}
- Open Book + Open Mind Common Sense part of ConceptNet (Note: this is not the Oracle setup from the paper; instead, science facts from the Open Book are retrieved based on a TF-IDF similarity measure with the question and answer choices)
Train a new model
config_file=training_config/qa/multi_choice/openbookqa/knowreader_v1_mc_qa_multi_source_openbook_plus_cn5omcs.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Evaluate on the pre-trained model
MODEL_ARCHIVE=data/trained_models/model_kn_conceptnet5_and_openbook_best_run.tar.gz
EVALUATION_DATA_FILE=data/OpenBookQA-V1-Sep2018/Data/Main/test.jsonl
python obqa/run.py evaluate_predictions_qa_mc_know_visualize \
--archive_file ${MODEL_ARCHIVE} \
--evaluation_data_file ${EVALUATION_DATA_FILE} \
--output_file ${MODEL_ARCHIVE##*/}_pred_${EVALUATION_DATA_FILE##*/}
- Open Book + WordNet part of ConceptNet (Note: Similar to above, this is not the Oracle setup from the paper)
Train a new model
config_file=training_config/qa/multi_choice/openbookqa/knowreader_v1_mc_qa_multi_source_openbook_plus_cn5wordnet.json
bash scripts/experiments/qa/run_experiment_openbookqa.sh ${config_file}
Evaluate on the pre-trained model
MODEL_ARCHIVE=data/trained_models/model_kn_wordnet_and_openbook_best_run.tar.gz
EVALUATION_DATA_FILE=data/OpenBookQA-V1-Sep2018/Data/Main/test.jsonl
python obqa/run.py evaluate_predictions_qa_mc_know_visualize \
--archive_file ${MODEL_ARCHIVE} \
--evaluation_data_file ${EVALUATION_DATA_FILE} \
--output_file ${MODEL_ARCHIVE##*/}_pred_${EVALUATION_DATA_FILE##*/}
If you are also interested in SciTail entailment task (Khot et. al 2017), here is a simple BiLSTM max-out model that attains an accuracy of 87% and 85% on the Dev and Test sets, resp. (without extensive hyper-parameter tuning).
bash scripts/download_and_prepare_data_scitail.sh
python obqa/run.py train \
-s _experiments/scitail_bilstm_maxout/ \
training_config/entailment/scitail/stacked_nn_aggregate_custom_bilstm_maxout_scitail.json
If you are also interested in the ARC Challenge, our Question-to-Choice BiLSTM max-out model obtains an accuracy of 33.9% on the Test set (without extensive hyper-parameter tuning).
bash scripts/download_and_prepare_data_arc.sh
python obqa/run.py train \
-s _experiments/qa_multi_question_to_choices/ \
training_config/qa/multi_choice/arc/reader_qa_multi_choice_max_att_ARC_Chellenge_full.json
If you have any questions or comments about the code, data, or models, please contact Todor Mihaylov, Ashish Sabharwal, Tushar Khot, or Peter Clark.