When you are installing this for the first time, run:
conda env create --file environment.yml
After that, and on every new session, run:
conda activate mscan2
During development, add dependencies that you care about to environment.yml. Then run:
conda env update --file environment.yml --prune
The SCAN repo gives us the pre-split data, while the MCD splits give a JSON description of how to split. The following script converts the SCAN splits to the same format as MCD.
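A minimal sketch of what such a conversion could look like, assuming the MCD format is a JSON object of index lists (`trainIdxs`, `testIdxs`) that point into the full example list. The helper `splits_to_indices` and the toy examples are hypothetical and are not the repository's script.

```python
import json


def splits_to_indices(all_examples, train_examples, test_examples):
    """Describe a pre-split dataset as index lists into `all_examples`."""
    # Assumption: every example line is unique. If a line were duplicated,
    # this lookup would keep the index of its last occurrence.
    index_of = {example: i for i, example in enumerate(all_examples)}
    return {
        "trainIdxs": [index_of[example] for example in train_examples],
        "testIdxs": [index_of[example] for example in test_examples],
    }


# Toy stand-ins for the full SCAN example list and one pre-split pair.
all_examples = [
    "IN: jump OUT: I_JUMP",
    "IN: walk OUT: I_WALK",
    "IN: run OUT: I_RUN",
]
split = splits_to_indices(
    all_examples,
    train_examples=[all_examples[0], all_examples[2]],
    test_examples=[all_examples[1]],
)
print(json.dumps(split))
```

Storing indices rather than the examples themselves means the same description can be reused for any dataset that is aligned line by line with the original.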
The following translates the English data into various languages, then uses the split JSON descriptions to produce a dataset for each (language, split) pair.
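To illustrate how one split description can be reused across languages, here is a small sketch. It assumes that each split is a JSON object with `trainIdxs` and `testIdxs` index lists and that every translated dataset stays aligned line by line with the English one. The function `apply_split`, the split name `toy`, and the sample data are hypothetical.

```python
def apply_split(examples, split):
    """Select train and test examples using the index lists of a split."""
    return {
        "train": [examples[i] for i in split["trainIdxs"]],
        "test": [examples[i] for i in split["testIdxs"]],
    }


# Toy data: position i in every language refers to the same example.
examples_by_language = {
    "en": ["jump", "walk", "run"],
    "fr": ["saute", "marche", "cours"],
}
# Assumed shape of a split description (key names are an assumption).
splits = {"toy": {"trainIdxs": [0, 2], "testIdxs": [1]}}

# One dataset per (language, split) pair.
datasets = {
    (language, split_name): apply_split(examples, split)
    for language, examples in examples_by_language.items()
    for split_name, split in splits.items()
}
print(datasets[("fr", "toy")])
```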
Set an environment variable with your Hugging Face token:
export HF_TOKEN="hf_..."
Then, to run an experiment, you can run:
python src/inference.py --model-name "bigscience/bloom" --train data/output/en/simple/train.txt --test data/output/en/simple/test.txt --output data/output/results/playground/results.json --context-size 2 --num-queries 1
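If you want to run the same experiment over several (language, split) pairs, a small wrapper can build the command for you. The sketch below only uses the flags shown above. It assumes the data lives under `data/output/<language>/<split>/`, as in the example command, and it writes each result to its own file. The helper `build_inference_command` and the per-pair output file names are hypothetical.

```python
import shlex


def build_inference_command(language, split, model_name="bigscience/bloom",
                            context_size=2, num_queries=1):
    """Build the argument list for one src/inference.py run."""
    data_dir = f"data/output/{language}/{split}"  # assumed directory layout
    return [
        "python", "src/inference.py",
        "--model-name", model_name,
        "--train", f"{data_dir}/train.txt",
        "--test", f"{data_dir}/test.txt",
        # Hypothetical naming scheme so that runs do not overwrite each other.
        "--output", f"data/output/results/playground/{language}-{split}.json",
        "--context-size", str(context_size),
        "--num-queries", str(num_queries),
    ]


# Print the commands instead of running them, so they can be checked first.
for language in ["en"]:
    for split in ["simple", "mcd1"]:
        print(shlex.join(build_inference_command(language, split)))
```

To actually launch a run, pass the returned list to `subprocess.run(command, check=True)`.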
Set an environment variable with your OpenAI token:
export OPENAI_API_KEY="sk-..."
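Before starting a long run, it can help to check that the tokens are actually visible to the process. A minimal sketch, where `missing_tokens` is a hypothetical helper that is not part of the repository:

```python
import os


def missing_tokens(env=None, names=("HF_TOKEN", "OPENAI_API_KEY")):
    """Return the names of token variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_tokens()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```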
Edit condor-run.sh with whatever you want to do in the Condor task, then run:
condor_submit condor-task.cmd
To see the queue and check whether your job is running:
Start by generating all SLURM submission scripts:
python src/generate_slurm.py
To run a single SLURM job, do e.g.:
sbatch scripts/generated/slurm/bloomz/run-bloomz-en-mcd1.slurm
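To submit every generated script instead of a single one, something like the following sketch could be used. It assumes the generated scripts follow the `scripts/generated/slurm/<model>/<name>.slurm` layout seen in the example path. `sbatch_commands` and `submit_all` are hypothetical helpers, and the dry-run default only prints the commands.

```python
import subprocess
from pathlib import Path


def sbatch_commands(root="scripts/generated/slurm", pattern="*/*.slurm"):
    """Return one sbatch argument list per generated script, sorted by path."""
    return [["sbatch", str(path)] for path in sorted(Path(root).glob(pattern))]


def submit_all(root="scripts/generated/slurm", dry_run=True):
    """Print the sbatch commands, or submit them when dry_run is False."""
    for command in sbatch_commands(root):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


submit_all()  # dry run: prints the commands without submitting anything
```

Call `submit_all(dry_run=False)` once the printed list looks right.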
Then check job status with:
squeue --me
# or, for more information:
squeue --me -o "%22S %.12i %.45j %.10T %.10M %.30R" --sort="M,j"
And check job output with e.g.:
# Replace job ID and task ID below
cat /mmfs1/gscratch/clmbr/amelie/projects/thesis_multiling_compos/data/output/results/bloomz/en/mcd1/<task_id>_<job_id>.out