When you are installing this for the first time, run:
conda env create --file environment.yml
After that, and on every new session, run:
conda activate mscan2
During development, add dependencies that you care about to environment.yml. Then run:
conda env update --file environment.yml --prune
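As a convenience, the create and update steps can be folded into one idempotent snippet. This is just a sketch, assuming the environment is named mscan2 as above:
# Create the environment if it doesn't exist yet, otherwise update it in place.
if conda env list | grep -qw mscan2; then
    conda env update --file environment.yml --prune
else
    conda env create --file environment.yml
fi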
The SCAN repo provides the data already split, while the MCD splits are distributed as JSON descriptions of how to split. The following script converts the SCAN splits to the same format as the MCD ones.
./scripts/create_split_files.sh
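If you want to sanity-check the conversion, you can pretty-print one of the split description files it wrote. The exact output location is not documented here, so the find expression is just one way to locate them:
find . -name '*.json' | head      # locate the generated split descriptions
python -m json.tool <one-of-them> # pretty-print one file to check its structure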
The following translates the English data into various languages, then uses the JSON split descriptions to produce datasets for each (language, split) pair.
./scripts/preprocess_all_scan.sh
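A quick way to confirm the preprocessing worked is to spot-check one (language, split) pair; the en/simple paths below are the same ones used in the inference example further down:
# Count examples and peek at the first few lines of one dataset.
wc -l data/output/en/simple/*.txt
head data/output/en/simple/train.txt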
Set an environment variable with your Hugging Face token:
export HF_TOKEN="hf_..."
Then run an experiment, e.g.:
python src/inference.py --model-name "bigscience/bloom" --train data/output/en/simple/train.txt --test data/output/en/simple/test.txt --output data/output/results/playground/results.json --context-size 2 --num-queries 1
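To sweep several (language, split) pairs with the same settings, the call above can be wrapped in a small loop. This is only a sketch: the language and split names are the ones that appear in this README, and the output filename is illustrative, so substitute whatever preprocess_all_scan.sh actually produced:
for lang in en; do
  for split in simple mcd1; do
    python src/inference.py \
      --model-name "bigscience/bloom" \
      --train  "data/output/${lang}/${split}/train.txt" \
      --test   "data/output/${lang}/${split}/test.txt" \
      --output "data/output/results/playground/${lang}-${split}.json" \
      --context-size 2 --num-queries 1
  done
done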
Set an environment variable with your OpenAI token:
export OPENAI_API_KEY="sk-..."
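Before launching anything that bills against the API, it can be worth checking that the key is actually exported in the current shell (a one-line sketch using bash parameter expansion):
# Fails loudly if OPENAI_API_KEY is unset or empty; does nothing otherwise.
: "${OPENAI_API_KEY:?OPENAI_API_KEY is not set}"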
Edit condor-run.sh with whatever you want to do in the Condor task, then run:
condor_submit condor-task.cmd
To view the queue and check whether your job is running:
condor_q
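If a job sits idle or needs to be stopped, HTCondor's standard tooling helps; replace <job_id> with the ID shown by condor_q:
# Explain why an idle job has not matched a machine yet.
condor_q -better-analyze <job_id>
# Remove a job from the queue.
condor_rm <job_id>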
Start by generating all SLURM submission scripts:
python src/generate_slurm.py
To run a single SLURM job, do e.g.:
sbatch scripts/generated/slurm/bloomz/run-bloomz-en-mcd1.slurm
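If you want to validate a script before actually queueing it, sbatch can do a dry run:
# Validate the batch script and print an estimated start time without submitting.
sbatch --test-only scripts/generated/slurm/bloomz/run-bloomz-en-mcd1.slurm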
Then check job status with:
squeue --me
# or, for more information:
squeue --me -o "%22S %.12i %.45j %.10T %.10M %.30R" --sort="M,j"
And check job output with e.g.:
# Replace job ID and task ID below
cat /mmfs1/gscratch/clmbr/amelie/projects/thesis_multiling_compos/data/output/results/bloomz/en/mcd1/<task_id>_<job_id>.out
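For a job that is still running, you can follow the same file live, or cancel the job entirely (standard SLURM commands; fill in the IDs as above):
tail -f /mmfs1/gscratch/clmbr/amelie/projects/thesis_multiling_compos/data/output/results/bloomz/en/mcd1/<task_id>_<job_id>.out
scancel <job_id>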
To submit all generated jobs for a model at once, run e.g.:
./scripts/generated/slurm/bloomz/slurm-submit-all.sh
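That helper presumably just loops sbatch over the generated scripts for the model; a hand-rolled equivalent (sketch) would be:
# Submit every generated bloomz SLURM script in turn.
for f in scripts/generated/slurm/bloomz/*.slurm; do
  sbatch "$f"
done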