Official PyTorch implementation of "Weakly Supervised Video Individual Counting" as presented at CVPR 2024.
Authors: Xinyan Liu, Guorong Li, Yuankai Qi, Ziheng Yan, Zhenjun Han, Anton van den Hengel, Ming-Hsuan Yang, Qingming Huang
The Video Individual Counting (VIC) task focuses on predicting the count of unique individuals in videos. Traditional methods, which rely on costly individual trajectory annotations, are impractical for large-scale applications. This work introduces a novel approach to VIC under a weakly supervised framework, utilizing less restrictive inflow and outflow annotations. We propose a baseline method employing weakly supervised contrastive learning for group-level matching, enhanced by a custom soft contrastive loss, facilitating the distinction between different crowd dynamics. We also contribute two augmented datasets, SenseCrowd and CroHD, and introduce a new dataset, UAVVIC, to foster further research in this area. Our method demonstrates superior performance compared to fully supervised counterparts, making a strong case for its practical applicability.
The CGNet architecture includes:
- Frame-level Crowd Locator: Detects pedestrian coordinates.
- Encoder: Generates unique representations for each detected individual.
- Memory-based Individual Count Predictor (MCP): Estimates inflow counts and maintains a memory of individual templates.
The Weakly Supervised Representation Learning (WSRL) method utilizes both inflow and outflow labels to refine the encoder through a novel Group-Level Matching Loss (GML), integrating soft contrastive and hinge losses to optimize performance.
Our model processes video inputs to predict individual counts, operating over 3-second intervals.
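To make the counting pipeline concrete, here is a minimal, illustrative sketch of how per-interval inflow estimates accumulate into a video-level count: each individual detected in a sampled frame is encoded, compared against a memory of previously seen templates, and counted as inflow (and added to the memory) if no template matches. The `SimpleEncoder`, the cosine-similarity matching, and the fixed `threshold` are simplifications for illustration only; this is not the repository's MCP implementation.

```python
import torch
import torch.nn.functional as F


class SimpleEncoder(torch.nn.Module):
    """Stand-in for CGNet's encoder: maps per-person features to normalized embeddings."""

    def __init__(self, in_dim=128, emb_dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, emb_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)


@torch.no_grad()
def count_video(frame_features, encoder, threshold=0.7):
    """Accumulate a video-level count from frames sampled once per interval.

    frame_features: list of (N_t, in_dim) tensors, one per sampled frame.
    Individuals whose best similarity to the memory falls below `threshold`
    are treated as inflow and appended to the memory.
    """
    memory = None
    total = 0
    for feats in frame_features:
        emb = encoder(feats)                       # (N_t, emb_dim)
        if memory is None:
            inflow = emb                           # first frame: everyone is new
        else:
            sim = emb @ memory.T                   # cosine similarity to stored templates
            best = sim.max(dim=1).values
            inflow = emb[best < threshold]         # unmatched individuals -> newcomers
        total += inflow.shape[0]
        memory = inflow if memory is None else torch.cat([memory, inflow], dim=0)
    return total


# Toy usage: three sampled frames with random per-person features.
encoder = SimpleEncoder()
frames = [torch.randn(5, 128), torch.randn(6, 128), torch.randn(4, 128)]
print(count_video(frames, encoder))
```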
Clone and set up the CGNet repository:
```bash
git clone https://github.com/streamer-AP/CGNet
cd CGNet
conda create -n CGNet python=3.10
conda activate CGNet
pip install -r requirements.txt
```
Data Preparation
- CroHD: Download the CroHD dataset from this link. Unzip `HT21.zip` and place `HT21` into the `Root/dataset/` folder.
- SenseCrowd: Download the dataset from Baidu disk or from the original dataset link.
- We provide a toy example of the GML loss so you can quickly understand it; the loss can also be used in other tasks.
```bash
cd models/
python tri_sim_ot_b.py
```
Running it visualizes how the similarity matrix converges during optimization.
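The actual loss lives in `models/tri_sim_ot_b.py`; the snippet below is only a simplified illustration of a group-level matching objective that combines a soft contrastive term with a hinge term. The function name, the softmax-based matching, and the margin value are assumptions made for this sketch, not the repository's GML.

```python
import torch
import torch.nn.functional as F


def group_level_matching_loss(feat_prev, feat_cur, num_inflow, tau=0.1, margin=0.5):
    """Illustrative group-level matching loss (not the repository's GML).

    feat_prev: (N, D) embeddings of individuals in the previous sampled frame.
    feat_cur:  (M, D) embeddings in the current frame, of which `num_inflow`
               are newcomers according to the weak inflow label.
    """
    prev = F.normalize(feat_prev, dim=1)
    cur = F.normalize(feat_cur, dim=1)
    sim = cur @ prev.T                                      # (M, N) cosine similarities
    match_prob = torch.softmax(sim / tau, dim=1).max(dim=1).values

    # Group-level weak supervision: M - num_inflow individuals should match
    # someone in the previous frame; the remaining num_inflow should not.
    num_shared = feat_cur.shape[0] - num_inflow
    sorted_prob, _ = torch.sort(match_prob, descending=True)
    shared, newcomers = sorted_prob[:num_shared], sorted_prob[num_shared:]

    loss = sim.new_zeros(())
    if shared.numel() > 0:                                  # soft contrastive: pull matches toward 1
        loss = loss + (1.0 - shared).mean()
    if newcomers.numel() > 0:                               # hinge: push newcomer match prob below margin
        loss = loss + F.relu(newcomers - margin).mean()
    return loss


# Toy usage with random embeddings: 6 people previously, 8 now, 3 of them newcomers.
loss = group_level_matching_loss(torch.randn(6, 32), torch.randn(8, 32), num_inflow=3)
print(float(loss))
```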
Inference
- Before inference, you need to obtain crowd localization results from a pre-trained crowd localization model. You can use FIDTM, STEERER, or any other crowd localization model that outputs coordinate results.
- We also provide crowd localization results produced by FIDTM-HRNet-W48. You can download them from Baidu disk or Google drive. Each line of a result file stores one coordinate pair:
```
x y
x y
```
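For reference, such a file (one `x y` pair per line) can be parsed with a few lines of Python; the helper name is hypothetical:

```python
def load_points(txt_path):
    """Parse a localization file with one `x y` coordinate pair per line."""
    points = []
    with open(txt_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                points.append((float(parts[0]), float(parts[1])))
    return points
```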
- Pretrained models can be downloaded from Baidu disk (pwd: jhux) or Google drive. Unzip them in the weight folder and run the following command.
```bash
python inference.py
```
- Inference requires less than 2 GB of GPU memory. With interval = 15 (3 s), it takes less than 10 minutes for all datasets. When finished, a JSON file is generated in the result folder. The data format is:
{ "video_name": { "video_num": the predicted vic, "first_frame_num": the predicted count in the first frame, "cnt_list": [the count of inflow in each frame], "pos_lists": [the position of each individual in each frame], "frame_num": the total frame number, "inflow_lists": [the inflow of each individual in each frame], }, ... }
- For the SenseCrowd dataset, the results reproduced with this repo are provided in the results dir. Run the following command to evaluate them:
```bash
python eval.py
```
- The reproduced MAE and WRAE are slightly better than those reported in the paper. The metrics are as follows:
| Method | MAE | RMSE | WRAE |
| --- | --- | --- | --- |
| Paper | 8.86 | 17.69 | 12.6 |
| Repo | 8.64 | 18.70 | 11.76 |
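For reference, the metrics can be computed as in the sketch below. This is not `eval.py`; it assumes the commonly used VIC definition of WRAE, in which each video's relative absolute error is weighted by its share of the total frame count.

```python
import numpy as np


def vic_metrics(pred, gt, frames):
    """pred / gt: per-video predicted and ground-truth unique counts; frames: frames per video."""
    pred, gt, frames = (np.asarray(a, dtype=float) for a in (pred, gt, frames))
    err = np.abs(pred - gt)
    mae = err.mean()
    rmse = np.sqrt((err ** 2).mean())
    wrae = (frames / frames.sum() * err / gt).sum() * 100.0  # frame-weighted relative error, in %
    return mae, rmse, wrae


mae, rmse, wrae = vic_metrics(pred=[42, 17, 88], gt=[40, 20, 90], frames=[300, 150, 450])
print(f"MAE {mae:.2f}  RMSE {rmse:.2f}  WRAE {wrae:.2f}%")
```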
Training
- For training, you need to prepare the dataset and the crowd localization results. The localization files use the same format as above:
```
x y
x y
```
- The training script is as follows:
```bash
python train.py
```
- Multi-GPU training is also supported. Set the number of GPUs in the following command:
```bash
bash dist_train.sh 8
```
If you find this repository helpful, please cite our paper:
@inproceedings{liu2024weakly,
title={Weakly Supervised Video Individual Counting},
author={Liu, Xinyan and Li, Guorong and Qi, Yuankai and Yan, Ziheng and Han, Zhenjun and van den Hengel, Anton and Yang, Ming-Hsuan and Huang, Qingming},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={19228--19237},
year={2024}
}
We thank the authors of FIDTM and DR.VIC for their excellent work.