We provide pre-set config files for each experiment setting in the configs folder.
Make sure to configure the dataset paths in the config files and run all commands from the main directory ViFi-CLIP/.
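If you prefer not to edit the YAML files by hand, the --opts flag used later in this guide may also be able to override config entries from the command line. A minimal sketch, assuming a hypothetical dataset key DATA.ROOT (verify the actual key names inside the provided configs):
# launched from the ViFi-CLIP/ main directory; DATA.ROOT is a placeholder key name
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/zero_shot/train/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT \
--opts DATA.ROOT /PATH/TO/DATASET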
Below we provide training and evaluation instructions for ViFi-CLIP and its variants in the following settings:
- Zero-shot setting
- Base-to-novel generalization setting
- Few-shot setting
- Fully-supervised setting
- Vanilla CLIP zero-shot evaluation
For the zero-shot setting, we train all models on Kinetics-400 with 32 frames for 10 epochs and then evaluate them directly on the downstream datasets (HMDB-51, UCF-101 and Kinetics-600). All zero-shot config files are present at configs/zero_shot/*.yaml.
To train the ViFi-CLIP model on Kinetics-400, run the following command:
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/zero_shot/train/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT
Note: If you want to fine-tune only the image or the text encoder of CLIP, set the variable USE in the config file to "image" or "text" respectively.
For example, to train CLIP with text-only fine-tuning, run the following command:
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/zero_shot/train/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT --opts TRAINER.ViFi_CLIP.USE "text"
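Analogously, a sketch for image-only fine-tuning; it reuses the same config and only changes the USE value:
# fine-tune only the CLIP image encoder on Kinetics-400
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/zero_shot/train/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT --opts TRAINER.ViFi_CLIP.USE "image"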
Use the configs at configs/zero_shot/eval to evaluate the trained models on the downstream datasets.
For example, to evaluate the trained ViFi-CLIP model on the first split of Kinetics-600, run the following command:
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/zero_shot/eval/k600/16_32_K600_ZS_split1.yaml --output /PATH/TO/OUTPUT --only_test \
--resume /PATH/TO/TRAINED/VIFI-CLIP-CKPT
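Evaluation on the other downstream datasets follows the same pattern. A sketch for HMDB-51; the config path below is a placeholder, check configs/zero_shot/eval for the exact filename:
# zero-shot evaluation of the K-400 trained model on HMDB-51 (placeholder config name)
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/zero_shot/eval/hmdb51/<HMDB51_ZS_CONFIG>.yaml --output /PATH/TO/OUTPUT --only_test \
--resume /PATH/TO/TRAINED/VIFI-CLIP-CKPT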
For the base-to-novel generalization setting, the default training settings are configured in the config files at configs/base2novel. Below we provide instructions to train and evaluate the ViFi-CLIP model on Kinetics-400.
# seed=1
# trains and evaluates on base classes; this will save weights named "ckpt_epoch_10.pth"
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/base2novel/finetuning_base2novel/k400/16_32_vifi_clip_s1.yaml --output /PATH/TO/OUTPUT
# evaluates on novel classes, using the model weights trained on base classes
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/base2novel/finetuning_base2novel/k400/16_32_vifi_clip_novel_eval.yaml --output /PATH/TO/OUTPUT --only_test \
--resume /PATH/TO/TRAINED/VIFI-CLIP-CKPT
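The _s1 suffix corresponds to seed 1. Assuming matching configs exist for the other seeds (the _s2 name below is inferred, verify it in configs/base2novel/finetuning_base2novel/k400), the base-class run can be repeated as:
# seed=2 (inferred config name)
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/base2novel/finetuning_base2novel/k400/16_32_vifi_clip_s2.yaml --output /PATH/TO/OUTPUT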
To train using other variants of CLIP, modify the config parameter TRAINER.ViFi_CLIP.USE accordingly.
Config files for the VL prompting approach are provided at configs/base2novel/prompting_base2novel. When using the VL prompting approach, use the pretrained ViFi-CLIP model to initialize training.
Below we provide instructions to train the VL prompting method in the base-to-novel setting on HMDB-51.
# seed=1
# use pretrained model (on K-400) for training
# trains and evaluates on base classes;
# this will save weights named "ckpt_epoch_10.pth"
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/base2novel/prompting_base2novel/hmdb/16_32_prompting_s1.yaml --output /PATH/TO/OUTPUT \
--resume /PATH/TO/TRAINED-K400/VIFI-CLIP-CKPT
# evaluates on novel classes, using the model weights trained on base classes
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/base2novel/prompting_base2novel/hmdb/16_32_prompting_novel_eval.yaml --output /PATH/TO/OUTPUT --only_test \
--resume /PATH/TO/TRAINED/VIFI-CLIP-CKPT
This trains only the vision and language prompts on the downstream task while the rest of the CLIP model is kept frozen.
Similarly, using the corresponding config files, models can be trained on other datasets including HMDB-51, UCF-101 and SSv2.
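For instance, a sketch of the VL prompting base-class run on UCF-101; the config path below is a placeholder that mirrors the HMDB-51 naming, check configs/base2novel/prompting_base2novel for the exact filename:
# seed=1, VL prompting on UCF-101 base classes, initialized from the K-400 pretrained ViFi-CLIP (placeholder config name)
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/base2novel/prompting_base2novel/<UCF101_PROMPTING_CONFIG>.yaml --output /PATH/TO/OUTPUT \
--resume /PATH/TO/TRAINED-K400/VIFI-CLIP-CKPT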
For the few-shot setting, use the config files at configs/few_shot to train models.
Below we provide instructions to train ViFi-CLIP on HMDB-51 in a few-shot manner with K=2.
# K=2
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/few_shot/finetuning_few_shot/hmdb51/16_32_vifi_clip_2_shot.yaml --output /PATH/TO/OUTPUT
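Configs for other shot counts should follow the same naming pattern. A sketch for K=4, assuming a matching 4-shot file exists (the name below is inferred from the 2-shot one, verify it in configs/few_shot/finetuning_few_shot/hmdb51):
# K=4 (inferred config name)
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/few_shot/finetuning_few_shot/hmdb51/16_32_vifi_clip_4_shot.yaml --output /PATH/TO/OUTPUT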
To train using other variants of CLIP, modify the config parameter TRAINER.ViFi_CLIP.USE accordingly.
We provide config files for the VL prompting approach at configs/few_shot/prompting_few_shot. When using the VL prompting approach, use the pretrained ViFi-CLIP model to initialize training.
Below we provide instructions to train the VL prompting method on HMDB-51 in a few-shot manner with K=2.
# K=2
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/few_shot/prompting_few_shot/hmdb51/16_32_prompting_2_shot.yaml --output /PATH/TO/OUTPUT \
--resume /PATH/TO/TRAINED-K400/VIFI-CLIP-CKPT
This trains only the vision and language prompts on the downstream task in a few-shot manner while the rest of the CLIP model is kept frozen.
For fully-supervised experiments, we provide config files at configs/fully_supervised/k400.
For example, to train ViFi-CLIP (which tunes both the image and text encoders) on Kinetics-400, run the following command:
python -m torch.distributed.launch --nproc_per_node=8 \
main.py -cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT
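To evaluate a trained fully-supervised checkpoint on the Kinetics-400 validation set, the same --only_test / --resume pattern from the other settings should apply. A sketch, assuming the training config can be reused for evaluation:
# evaluate a trained fully-supervised ViFi-CLIP checkpoint
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT --only_test \
--resume /PATH/TO/TRAINED/VIFI-CLIP-CKPT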
To train using other variants of CLIP, modify the config parameter TRAINER.ViFi_CLIP.USE accordingly.
Here we provide instructions to evaluate vanilla CLIP (without any video training). All experimental settings are supported and all config files can be used directly: just turn on the ZS_EVAL flag while evaluating and vanilla CLIP will be used for evaluation.
For example, to evaluate zero-shot vanilla CLIP on HMDB-51 split 1, run the following command:
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/base2novel/prompting_base2novel/hmdb/16_32_prompting_novel_eval.yaml --output /PATH/TO/OUTPUT --only_test \
--resume "" --opts TRAINER.ViFi_CLIP.ZS_EVAL "True"
This will evaluate vanilla zero-shot CLIP on the given dataset.
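Since any config file can be used this way, here is a sketch of vanilla zero-shot CLIP evaluation on the first Kinetics-600 split, reusing the config from the zero-shot section:
# vanilla CLIP; no video-trained checkpoint is loaded
python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/zero_shot/eval/k600/16_32_K600_ZS_split1.yaml --output /PATH/TO/OUTPUT --only_test \
--resume "" --opts TRAINER.ViFi_CLIP.ZS_EVAL "True"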