
[NAACL-2024] Self-adaptive Sampling for Efficient Video Question Answering on Image-Text Models

🔥 [14/03/2024] This paper has been accepted to NAACL 2024 (Findings)!

Introduction

This repository contains the official implementation of the paper "Self-adaptive Sampling for Efficient Video Question Answering". In this work, we introduce and study two simple sampling strategies, MIF and MDF, for tuning pretrained visual-language models (VLMs) on Video Question Answering tasks.

Specifically, we first systematically test the performance of MIF (Most Implied Frames) with varied backbone models serving as captioner and scorer; the two work together to perform "question-and-vision-aware" sampling. Drawing on those results and analysis, we then propose the more lightweight MDF (Most Dominant Frames), which goes one step further by discarding the dependence on the question and performing "question-agnostic, vision-aware" sampling. This routine significantly boosts efficiency while maintaining, or even improving, performance on the tested datasets.
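As a purely illustrative sketch (not the code in this repository), the MDF idea can be pictured as scoring each frame by how well its visual features represent the rest of the clip and keeping the top-K such frames; the random features below stand in for the output of whatever visual encoder is used:

```python
# Illustrative MDF-style selection: question-agnostic, vision-aware.
# frame_feats is a placeholder for per-frame features from a visual encoder;
# the actual pipeline lives in clip_and_git/src/preprocessing.
import numpy as np

def select_dominant_frames(frame_feats: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k frames most similar to the rest of the clip."""
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    pairwise = feats @ feats.T            # cosine similarity between all frame pairs
    dominance = pairwise.sum(axis=1)      # how representative each frame is
    top_k = np.argsort(-dominance)[:k]
    return np.sort(top_k)                 # keep temporal order

# Toy usage with random features in place of real encoder outputs.
rng = np.random.default_rng(0)
print(select_dominant_frames(rng.normal(size=(32, 512)), k=3))
```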

Once sampling completes, the sampled frames are saved in an HDF5 (.h5) file as a "dataset" for fast loading at training and test time. We test our methods on three models (CLIP, GIT and All-in-one) and four datasets (MSVD-QA, MSRVTT-QA, TGIF-Frame, NExT-QA). The implementations on CLIP (including our refined structure CLIP-Dec, which significantly improves over raw CLIP) and GIT are in the folder clip_and_git, while the implementation on All-in-one is under the folder all_in_one.
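For reference, frames stored this way can be read back with h5py along the lines of the sketch below; the file name and the assumption that each video id maps to one array of frames are illustrative, since the actual layout depends on the dataset and the preprocessing flags:

```python
# Minimal sketch of loading pre-sampled frames from the generated .h5 file.
# "msvd_frames.h5" and the per-video-id layout are assumptions for illustration.
import h5py

with h5py.File("msvd_frames.h5", "r") as f:
    video_ids = list(f.keys())          # one entry per video
    frames = f[video_ids[0]][()]        # e.g. an array of shape (K, H, W, C)
    print(video_ids[0], frames.shape)
```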

Usage

1. Downloading Datasets

Please visit the corresponding dataset repositories and follow the instructions there to download the datasets.

The suggested path to store these datasets is "model/dataset/<dataset_name>"

2. Preprocessing

The sampling code is the same for all three models and lives under the folder "clip_and_git/src/preprocessing".

  • To sample via the MDF method, run the Python script as follows:

    python extract_features.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --model_name=<vlm_model_name> ... (other hps)
    

    If you run into an out-of-memory error, use a smaller chunksize (default: 512) to shrink the input size per computation step.

  • To sample via the MIF method, first run uniform sampling with a large K (e.g., 16 or 32) to obtain a sparse frame sequence:

    python extract_features.py --sampling_strategy='uni' --K 16 ...
    

    Then run gen_sample.py twice: first to generate frame captions, then to compute the sampled frame indices (an illustrative sketch of this caption-and-score step follows the commands):

    python gen_sample.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --vlm_model=<vlm_model_name> --sim_model=<sim_model_name> --task='gen_cap'
    
    python gen_sample.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --vlm_model=<vlm_model_name> --sim_model=<sim_model_name> --task='gen_inds'
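As a rough illustration (not the repository's exact code) of what these two stages amount to: gen_cap captions each coarsely sampled frame with the VLM, and gen_inds embeds the question and the captions, keeping the K frames whose captions best match the question. The random embeddings below stand in for the real --vlm_model / --sim_model outputs:

```python
# Illustrative MIF-style selection: question-and-vision-aware.
# caption_embs / question_emb are placeholders for embeddings produced by the
# captioner and similarity model chosen via --vlm_model and --sim_model.
import numpy as np

def select_implied_frames(caption_embs: np.ndarray, question_emb: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k frames whose captions best match the question."""
    captions = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    question = question_emb / np.linalg.norm(question_emb)
    scores = captions @ question          # cosine similarity per frame caption
    top_k = np.argsort(-scores)[:k]
    return np.sort(top_k)                 # keep temporal order

# Toy usage: 16 uniformly sampled frames, 512-d placeholder embeddings.
rng = np.random.default_rng(0)
print(select_implied_frames(rng.normal(size=(16, 512)), rng.normal(size=512), k=3))
```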
    

3. Training and Inference

For experiments on CLIP and GIT, please modify our provided reference scripts (in src/scripts). For All-in-one, please check the README file under all_in_one for more details.

Results (Partial)

The following results are prediction accuracies, defined and customized for each dataset/model as described in our paper.

CLIP-Dec (3 Frame)

| Sampling | MSVD-QA | MSRVTT-QA | TGIF-Frame |
|----------|---------|-----------|------------|
| noDec    | 27.7    | 30.3      | 42.8       |
| Uniform  | 33.8    | 33.7      | 47.2       |
| MDF      | 35.0    | 35.2      | 63.2       |
| MIF      | 35.0    | 35.4      | 61.8       |

GIT-Base (6 Frame)

| Sampling | MSVD-QA | MSRVTT-QA | TGIF-Frame |
|----------|---------|-----------|------------|
| Reported | 51.2    | 41.0      | 69.1       |
| Uniform  | 52.2    | 41.1      | 67.5       |
| MDF      | 55.3    | 42.0      | 69.9       |
| MIF      | 54.5    | 42.3      | 69.6       |

AIO-Base (3 Frame)

| Sampling   | MSVD-QA | MSRVTT-QA | TGIF-Frame |
|------------|---------|-----------|------------|
| Reported   | 46.5    | 42.9      | 64.2       |
| Reproduced | 46.1    | 42.7      | 64.0       |
| MDF        | 46.9    | 43.8      | 66.2       |
| MIF        | 46.7    | 44.0      | 65.9       |

AIO-Base+ on NExT-QA (3 Frame)

| Method | Val  | Test |
|--------|------|------|
| Base   | 48.4 | 48.1 |
| MIF    | 49.7 | 49.5 |
| MDF    | 50.2 | 49.8 |

BLIP2-T5XXL on NExT-QA (3 Frame)

| Method | Val  | Test |
|--------|------|------|
| Base   | 60.1 | 59.7 |
| MIF    | 61.5 | 61.2 |
| MDF    | 61.8 | 61.1 |

Citation

Please cite our paper if you find this project related to your work:

@misc{han2023sas,
      title={SAS Video-QA: Self-Adaptive Sampling for Efficient Video Question-Answering}, 
      author={Wei Han and Hui Chen and Min-Yen Kan and Soujanya Poria},
      year={2023},
      eprint={2307.04192},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Contact

If you have any enquiries about our code and paper, feel free to contact us at [email protected] or [email protected].