
Weakly Supervised Video Individual Counting

Official PyTorch implementation of "Weakly Supervised Video Individual Counting" as presented at CVPR 2024.

📄 Read the Paper

Authors: Xinyan Liu, Guorong Li, Yuankai Qi, Ziheng Yan, Zhenjun Han, Anton van den Hengel, Ming-Hsuan Yang, Qingming Huang

Overview

The Video Individual Counting (VIC) task focuses on predicting the count of unique individuals in videos. Traditional methods, which rely on costly individual trajectory annotations, are impractical for large-scale applications. This work introduces a novel approach to VIC under a weakly supervised framework, utilizing less restrictive inflow and outflow annotations. We propose a baseline method employing weakly supervised contrastive learning for group-level matching, enhanced by a custom soft contrastive loss, facilitating the distinction between different crowd dynamics. We also contribute two augmented datasets, SenseCrowd and CroHD, and introduce a new dataset, UAVVIC, to foster further research in this area. Our method demonstrates superior performance compared to fully supervised counterparts, making a strong case for its practical applicability.

Inference Pipeline

The CGNet architecture includes:

  • Frame-level Crowd Locator: Detects pedestrian coordinates.
  • Encoder: Generates unique representations for each detected individual.
  • Memory-based Individual Count Predictor (MCP): Estimates inflow counts and maintains a memory of individual templates.

The Weakly Supervised Representation Learning (WSRL) method utilizes both inflow and outflow labels to refine the encoder through a novel Group-Level Matching Loss (GML), integrating soft contrastive and hinge losses to optimize performance.
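For intuition, below is a minimal PyTorch sketch of a group-level matching objective in this spirit: a soft contrastive term for individuals presumed to match the previous frame and a hinge term for the inflow, supervised only by the inflow count. The function name, temperature, margin, and the ranking heuristic are illustrative assumptions; the actual GML loss lives in models/tri_sim_ot_b.py (see the toy example under Usage).

import torch
import torch.nn.functional as F

def group_matching_loss(feat_prev, feat_curr, inflow_count, tau=0.1, margin=0.5):
    # Cosine similarities between current-frame and previous-frame individuals.
    sim = F.normalize(feat_curr, dim=1) @ F.normalize(feat_prev, dim=1).t() / tau  # (N, M)

    # Soft assignment of each current individual to the previous group.
    match_prob = sim.softmax(dim=1).max(dim=1).values  # (N,)

    # Only the inflow COUNT is known, not which individuals are new, so rank
    # individuals by their best match probability (group-level supervision).
    sorted_prob, _ = match_prob.sort(descending=True)
    n_matched = feat_curr.shape[0] - inflow_count

    # Soft contrastive term: presumed-matched individuals should match confidently.
    loss_match = (-torch.log(sorted_prob[:n_matched] + 1e-8)).mean() if n_matched > 0 else sim.new_zeros(())
    # Hinge term: presumed-inflow individuals should stay below a similarity margin.
    loss_inflow = F.relu(sorted_prob[n_matched:] - margin).mean() if inflow_count > 0 else sim.new_zeros(())
    return loss_match + loss_inflow

# Toy usage: 6 detections in the current frame, 2 of them new inflow.
prev = torch.randn(5, 32)
curr = torch.cat([prev[:4] + 0.05 * torch.randn(4, 32), torch.randn(2, 32)])
print(group_matching_loss(prev, curr, inflow_count=2))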

Demo

Our model processes video inputs to predict individual counts, operating over 3-second intervals.


Setup

Installation

Clone and set up the CGNet repository:

git clone https://github.com/streamer-AP/CGNet
cd CGNet
conda create -n CGNet python=3.10
conda activate CGNet
pip install -r requirements.txt

Data Preparation

  • CroHD: Download the CroHD dataset from this link. Unzip HT21.zip and place the HT21 folder under Root/dataset/.
  • SenseCrowd dataset: Download the dataset from Baidu disk or from the original dataset link.
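After extraction, the data is expected to live under Root/dataset/. A rough sketch of the resulting layout (only the HT21 location is stated above; everything else is illustrative):

Root/
└── dataset/
    └── HT21/    (the unzipped CroHD sequences)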

Usage

  1. We provide a toy example of the GML loss so you can understand it quickly; the loss can also be used in other tasks.
cd models/
python tri_sim_ot_b.py

Running it shows the similarity matrix converging over the course of optimization.

  2. Inference.

    • Before inference, you need to obtain crowd localization results from a pre-trained crowd localization model. You can use FIDTM, STEERER, or any other crowd localization model that outputs coordinate results.
    • We also provide crowd localization results generated by FIDTM-HRNet-W48, which you can download from Baidu disk or Google drive. Each frame is stored with one coordinate pair per line (a minimal parsing sketch is given after this list):
    x y
    x y
    
    • Pretrained models can be downloaded from Baidu disk (pwd: jhux) or Google drive. Unzip them into the weight folder and run the following command:
     python inference.py
    • Inference uses less than 2 GB of GPU memory, and with interval = 15 (3 s) it takes under 10 minutes for all datasets. When finished, a JSON file is written to the result folder (a reading sketch follows this list). The data format is:
    {
     "video_name": {
         "video_num": the predicted vic,
         "first_frame_num": the predicted count in the first frame,
         "cnt_list": [the count of inflow in each frame],
         "pos_lists": [the position of each individual in each frame],
         "frame_num": the total frame number,
         "inflow_lists": [the inflow of each individual in each frame],
     },
     ...
    }
    • For the SenseCrowd dataset, the reproduced results of this repo are provided in the results directory. Run the following command to evaluate them:
     python eval.py
    • MAE and WRAE are slightly better than reported in the paper. The metrics are as follows:
    Method   MAE    RMSE    WRAE
    Paper    8.86   17.69   12.6
    Repo     8.64   18.70   11.76
  3. Training.

    • For training, you need to prepare the dataset and the crowd localization results. The data format is the same one coordinate pair per line as above:
    x y
    x y
    
    • The training script is as follows:
    python train.py
    • Multi-GPU training is also supported; set the number of GPUs when launching the script:
    bash dist_train.sh 8
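As a reference for the "x y" localization files used above, here is a minimal sketch for loading one per-frame file into an array; the file path is hypothetical and the exact directory layout depends on how you export the localization results.

import numpy as np

def load_points(txt_path):
    # One "x y" pair per line -> (N, 2) float array; empty files yield (0, 2).
    pts = np.loadtxt(txt_path, ndmin=2)
    return pts.reshape(0, 2) if pts.size == 0 else pts

points = load_points("localization/video_0001/000001.txt")  # hypothetical path
print(points.shape)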
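And a sketch for consuming the JSON written by inference.py, e.g. to compare the predicted per-video counts ("video_num") against ground-truth totals. The JSON path and ground-truth dictionary are placeholders, and eval.py remains the reference evaluation (it also computes WRAE):

import json
import numpy as np

def summarize(result_json, gt_counts):
    # result_json: file produced by inference.py; gt_counts: {video_name: true unique count}.
    with open(result_json) as f:
        results = json.load(f)
    errors = np.array([results[name]["video_num"] - gt for name, gt in gt_counts.items()], dtype=float)
    mae = np.abs(errors).mean()
    rmse = np.sqrt((errors ** 2).mean())
    return mae, rmse

# Hypothetical usage:
# mae, rmse = summarize("result/sensecrowd.json", {"video_0001": 123})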

Citation

If you find this repository helpful, please cite our paper:

@inproceedings{liu2024weakly,
  title={Weakly Supervised Video Individual Counting},
  author={Liu, Xinyan and Li, Guorong and Qi, Yuankai and Yan, Ziheng and Han, Zhenjun and van den Hengel, Anton and Yang, Ming-Hsuan and Huang, Qingming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={19228--19237},
  year={2024}
}

Acknowledgement

We thank the authors of FIDTM and DR.VIC for their excellent work.
