Guided Online Cluster Assignment for Self-Supervised Video Representation Learning.
Official PyTorch implementation of the ECCV 2022 paper. Feel free to contact hcoskun-at-snap.com if you have questions.
We propose a principled way to combine two views. Specifically, we propose a novel clustering strategy where we use the initial cluster
assignment of each modality as a prior to guide the final cluster assignment of the
other modality. This enforces similar cluster structures for both modalities, and the resulting clusters are semantically abstract and robust to noisy
inputs coming from each individual modality.
You can find the implementation of this idea in `sinkhorn_withprior`.
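To give a flavor of the idea, here is a minimal NumPy sketch of Sinkhorn-Knopp where the other modality's soft assignment acts as a prior on the transportation matrix. The function name, the stabilization, the iteration count, and the epsilon value are illustrative assumptions; the repository's `sinkhorn_withprior` is the actual implementation.

```python
import numpy as np

def sinkhorn_with_prior(scores, prior, eps=0.05, n_iters=3):
    """Sinkhorn-Knopp guided by a prior from the other modality (sketch).

    scores: (B, K) similarities of B samples to K prototypes for this modality.
    prior:  (B, K) soft cluster assignment of the *other* modality, used to
            bias the initial transportation matrix toward its cluster structure.
    """
    # subtract the max for numerical stability before exponentiating
    Q = np.exp((scores - scores.max()) / eps) * prior
    Q /= Q.sum()
    B, K = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # equal mass per prototype (column)
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)  # equal mass per sample (row)
        Q /= B
    return Q * B  # each row sums to 1: a soft assignment per sample
```

With a uniform prior this reduces to the standard equipartitioned Sinkhorn assignment; a peaked prior pulls the result toward the other modality's clusters.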
- Python 3.7
- PyTorch 1.4.0, torchvision 0.5.0
- CUDA 10.2
- Apex with CUDA extension (see also: this issue)
- Download dataset
sh datasets/ds_prep/kinetics-400/download.sh
- Extract rar files
sh datasets/ds_prep/kinetics-400/extract.sh
- We use the TVL1 algorithm to compute optical flow. We modified the MemDPC code for efficient GPU utilization when computing optical flow.
- Run this script
python datasets/ds_prep/efficent_optical_flow_with_GPU.py
- If you have more than one GPU to dedicate to computing the optical flow, you can run this script once per GPU.
- Unfortunately, I couldn't find a way to do batch-wise optical-flow computation with OpenCV. If you can manage it, please let me know.
Generate prototypes by running this script:
python prots/prototypes.py
- It will save the prototypes to "prot/fls/", and the model will load them from there. If you save them to another location, please update "helper/opt_aug.py".
- Please make sure that use_precomp_prot is set to true; otherwise, the model will use randomly generated prototypes.
- Trained prototypes should look like Figure 3 (right) in the paper.
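To illustrate what a precomputed prototype matrix is, here is a toy sketch that produces K unit-norm prototype vectors from a feature bank with plain k-means. This is purely hypothetical and is not the recipe in `prots/prototypes.py`; it only shows the shape and normalization a loaded prototype matrix would have.

```python
import numpy as np

def kmeans_prototypes(features, k=100, n_iters=20, seed=0):
    """Toy k-means producing a (k, D) prototype matrix from (N, D) features.
    Illustrative only; the repository's prots/prototypes.py is the real recipe."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(n_iters):
        # assign each feature to its nearest center
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = features[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    # L2-normalize so dot products with normalized features are cosine scores
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)
```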
Run pre-training:
sh scripts/pretrain_on_cluster.sh
- The above script is for multi-node Slurm training; however, the code can be used for single-node training as well.
- Please set your dataset location in the ".sh" file or in "helper/opt_aug.py".
You can use the following script for evaluation. Make sure the "root_dir" argument is set correctly.
sh scripts/knn_on_cluster.sh
Please update "root_dir" to point to your computed features. The model generates features during the evaluation stage; you can set where they are saved in the ".sh" file or in "helper/opt_aug.py".
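Conceptually, the kNN evaluation retrieves each test clip's nearest training clips in feature space and votes on the label. The sketch below shows the idea with cosine similarity; function and variable names are hypothetical, and `scripts/knn_on_cluster.sh` is the actual entry point.

```python
import numpy as np

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=1):
    """Top-1 accuracy of k-NN retrieval with cosine similarity (sketch)."""
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    te = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = te @ tr.T                        # (n_test, n_train) cosine scores
    nn = np.argsort(-sims, axis=1)[:, :k]   # indices of the k nearest neighbors
    votes = train_labels[nn]                # (n_test, k) neighbor labels
    preds = np.array([np.bincount(v).argmax() for v in votes])
    return float((preds == test_labels).mean())
```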
We used code from SeLaVi, SwAV, VICC, and CoCLR.
- We are still cleaning the code, so you might see some unused methods; please ignore them.
- We experimented with float16 training; however, we observed a significant drop in accuracy that we could not resolve. Ideally, accuracy shouldn't change that much. If anyone is interested in this code, we can provide it as well.
- Please be careful with the OpenCV implementation of optical flow. We observed that there can be significant differences in the computed optical flows.
- We extract optical flow at an image size of 256.
- You should use the same parameters to extract optical flow for all datasets.
- We did our best to follow common evaluation strategies; however, there are differences among earlier works. We mostly follow "A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning". Most works follow it, but we observed the following differences:
- For instance, CoCLR and VICC use different learning-rate schedulers during fine-tuning.
- We observed differences in fine-tuning duration.
- SeLaVi uses different features for evaluation than the others (extracted from a different layer, with an embedding size of 4096). In all our experiments, we use 2048. We did not see a significant difference with 4096.
- We also observed that the number of projection layers varies significantly across earlier works.
- We also observed significant differences in optimizers and learning-rate schedulers during pre-training.
@inproceedings{goca,
title={GOCA: Guided Online Cluster Assignment for Self Supervised Video Representation Learning},
author={Coskun, Huseyin and Zareian, Alireza and Moore, Joshua L and Tombari, Federico and Wang, Chen},
booktitle={ECCV},
year={2022}
}