This repository contains the code for the 2017 paper "Social Bias in Elicited Natural Language Inferences" by Rachel Rudinger, Chandler May, and Benjamin Van Durme. Rachel Rudinger and Chandler May contributed to this code, which is released under the two-clause BSD license.
Install dependencies with:

```bash
pip install -r requirements.txt
```
Download and unzip the SNLI data:

```bash
wget http://nlp.stanford.edu/projects/snli/snli_1.0.zip
unzip snli_1.0.zip
```
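Each line of `snli_1.0/snli_1.0_train.jsonl` is one JSON-encoded sentence pair. As a quick sanity check on the download, here is a minimal sketch using the standard SNLI field names (`sentence1`, `sentence2`, `gold_label`):

```python
import json

# Read the first premise/hypothesis pair from the SNLI training split.
with open('snli_1.0/snli_1.0_train.jsonl') as f:
    pair = json.loads(next(f))

print(pair['sentence1'])   # premise
print(pair['sentence2'])   # hypothesis
print(pair['gold_label'])  # entailment / neutral / contradiction
```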
Compute counts for unigrams and bigrams, across all inference types, using 7 subprocesses (in addition to the main process) and filtering out hypothesis words that occur in the premise. Read SNLI pairs from `snli_1.0/snli_1.0_train.jsonl` and write counts to `snli_stats/counts.pkl`:

```bash
python snli_cooccur.py \
    between-prem-hypo \
    --max-ngram 2 \
    --num-proc 7 \
    --filter-hypo-by-prem \
    snli_1.0/snli_1.0_train.jsonl snli_stats/counts.pkl
```
Run `python snli_cooccur.py --help` for more options.
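For intuition about what these counts are, here is a minimal sketch (not the repository's implementation) of between-premise-hypothesis n-gram co-occurrence counting with the premise-overlap filter, assuming lowercased whitespace tokenization and the standard SNLI field names:

```python
from collections import Counter
from itertools import product
import json

def ngrams(tokens, max_n):
    """All n-grams of the token list up to length max_n, as tuples."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def count_pairs(jsonl_path, max_ngram=2, filter_hypo_by_prem=True):
    """Count premise/hypothesis n-gram co-occurrences, keyed by gold label."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            pair = json.loads(line)
            prem = pair['sentence1'].lower().split()
            hypo = pair['sentence2'].lower().split()
            if filter_hypo_by_prem:
                # Drop hypothesis tokens that also appear in the premise.
                prem_set = set(prem)
                hypo = [t for t in hypo if t not in prem_set]
            for x, y in product(ngrams(prem, max_ngram),
                                ngrams(hypo, max_ngram)):
                counts[(pair['gold_label'], x, y)] += 1
    return counts
```

The real script additionally supports multiple subprocesses (`--num-proc`) and pickles the resulting counts; treat this sketch only as a guide to the semantics of the options above.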
Alternatively, compute counts for all parameter configurations in a loop:

```bash
bash snli_cooccur_loop.bash
```
To change the default input and output directories, or to change the Python interpreter used to run `snli_cooccur.py`, create a file named `snli_cooccur_loop_include.bash` with the following contents, modify them as desired, and then run `snli_cooccur_loop.bash`:

```bash
snli_dir=snli_1.0
output_dir=snli_stats
big_python=python
little_python=python
```
The `little_python` and `big_python` variables are the Python commands used for the unigram and unigram-and-bigram models, respectively; the latter have higher memory requirements. (Note that `little_python` and `big_python` can be set to job submission scripts that invoke a Python interpreter, parallelizing the computation on a grid; see the example below.)
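For example, on a Sun Grid Engine cluster the include file might look like the following; the `qsub` flags and memory request are illustrative assumptions, not part of this repository:

```bash
# Hypothetical include file for grid execution (flags are illustrative).
snli_dir=snli_1.0
output_dir=snli_stats
# Unigram models are cheap enough to run locally.
little_python=python
# Submit each unigram-and-bigram run as a grid job with extra memory.
big_python='qsub -cwd -b y -l mem_free=32G python'
```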
Query the top-five co-occurrence lists, ranked by PMI, restricting candidates to unigrams (filtering out bigrams) and dropping co-occurrence candidates with count less than five. Run the queries from the YAML specification in `top-y.yaml`, using counts from `snli_stats/counts.pkl`, and write the output to `snli_stats/pmi.txt`:

```bash
python snli_query.py \
    -k 5 \
    --filter-to-unigrams \
    --top-y-score-func pmi \
    --min-count 5 \
    snli_stats/counts.pkl top-y top-y.yaml snli_stats/pmi.txt
```
Run `python snli_query.py --help` for more options.
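The `pmi` score function presumably corresponds to the standard pointwise mutual information between a query term $x$ and a candidate $y$:

$$\operatorname{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

where the probabilities are relative-frequency estimates from the co-occurrence counts. Under that reading, the `--min-count 5` filter drops rare candidates, whose PMI estimates are high-variance.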
Alternatively, run queries for all parameter configurations in a loop:

```bash
bash snli_query_loop.bash
```
To change the default input paths and output directory, change the Python interpreter used to run `snli_query.py`, or change other settings, create a file named `snli_query_loop_include.bash` with the following contents, modify them as desired, and then run `snli_query_loop.bash`:

```bash
min_count=5
output_dir=snli_stats
python=python
extra_args='-k 5 --filter-to-unigrams --top-y-score-func pmi'
query_type=top-y
query_path=top-y.yaml
output_ext=.txt
input_dir=snli_stats
input_paths=`find "$input_dir" -type f -name '*.pkl'`
```
In the definition of the likelihood ratio Λ(C') in the paper (last equation on the second page, or page 75 in the proceedings), the summations should be products. The code and results use the correct definition.