Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation
Jinfeng Liu, Lingtong Kong, Bo Li, Zerong Wang, Hong Gu and Jinwei Chen
vivo Mobile Communication Co., Ltd
ECCV 2024 [arxiv]
- Description
- Setup
- Preparing datasets
- VFI Pre-training
- Mono-ViFI Training
- Evaluation
- Prediction
- Mono-ViFI Weights
- Related Projects
This is the official PyTorch implementation for Mono-ViFI, which is built on the codebase of BDEdepth. If you find our work useful in your research, please consider citing our paper:
@misc{liu2024,
title={Mono-ViFI: A Unified Learning Framework for Self-supervised Single- and Multi-frame Monocular Depth Estimation},
author={Jinfeng Liu and Lingtong Kong and Bo Li and Zerong Wang and Hong Gu and Jinwei Chen},
year={2024},
eprint={2407.14126},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.14126},
}
Install the dependencies with:
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt
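If you want to make sure the CUDA 11.3 wheels were picked up correctly, a quick optional check (not part of the repository's setup) is:
import torch
import torchvision

# Optional sanity check: versions should match the wheels installed above.
print(torch.__version__, torchvision.__version__)   # expected: 1.11.0+cu113  0.12.0+cu113
print("CUDA available:", torch.cuda.is_available())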
For the KITTI dataset, you can prepare it as done in Monodepth2. Note that we directly train with the raw png images and do not convert them to jpgs. You also need to generate the groundtruth depth maps before training, since the code evaluates after each epoch. For the raw KITTI groundtruth (eigen eval split), run the following command. This will generate a gt_depths.npz file in the folder splits/kitti/eigen/.
python export_gt_depth.py --data_path /home/datasets/kitti_raw_data --split eigen
For the improved KITTI groundtruth (eigen_benchmark eval split), please download it directly from this link, and then move the downloaded file (gt_depths.npz) to the folder splits/kitti/eigen_benchmark/.
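If you want to verify an exported file, the snippet below is a minimal sketch that loads it the way Monodepth2-style evaluation code does, assuming the depth maps are stored under the 'data' key:
import numpy as np

# Load the exported ground-truth depths (assumed to be stored under the 'data' key).
gt_depths = np.load("splits/kitti/eigen/gt_depths.npz",
                    fix_imports=True, encoding="latin1", allow_pickle=True)["data"]
print(len(gt_depths))       # number of test frames (697 for the eigen split)
print(gt_depths[0].shape)   # resolution of the first ground-truth depth map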
The Make3D dataset can be downloaded from here.
For the Cityscapes dataset, we follow the instructions in ManyDepth. First download leftImg8bit_sequence_trainvaltest.zip and camera_trainvaltest.zip from its website, and unzip them into a folder /path/to/cityscapes/. Then preprocess the Cityscapes dataset using the following command:
python prepare_cityscapes.py \
--img_height 512 \
--img_width 1024 \
--dataset_dir /path/to/cityscapes \
--dump_root /path/to/cityscapes_preprocessed \
--seq_length 3 \
--num_threads 8
Remember to modify --dataset_dir and --dump_root to your own paths. The ground truth depth files are provided by ManyDepth at this link; they were converted from pixel disparities using the intrinsics and the known baseline. Download and unzip them into splits/cityscapes/.
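You do not need to redo this conversion yourself, but for reference it follows the standard stereo relation depth = f_x * baseline / disparity. The sketch below uses illustrative camera values; the real ones come from the per-image JSON files in camera_trainvaltest.zip:
import numpy as np

# Illustrative disparity-to-depth conversion (the provided GT files are already converted).
fx = 2262.52         # example focal length in pixels; read the real value from the camera JSON
baseline = 0.209313  # example stereo baseline in metres; read the real value from the camera JSON

disparity = np.random.uniform(1.0, 128.0, size=(192, 512))   # stand-in disparity map in pixels
depth = fx * baseline / np.maximum(disparity, 1e-6)           # depth in metres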
Download the following 6 checkpoints related to VFI from this link:
- small IFRNet pretrained on Vimeo90K: IFRNet_S_Vimeo90K.pth
- large IFRNet pretrained on Vimeo90K: IFRNet_L_Vimeo90K.pth
- small IFRNet pretrained on KITTI: IFRNet_S_KITTI.pth
- large IFRNet pretrained on KITTI: IFRNet_L_KITTI.pth
- small IFRNet pretrained on Cityscapes: IFRNet_S_CS.pth
- large IFRNet pretrained on Cityscapes: IFRNet_L_CS.pth
To save time, you can skip VFI pre-training and directly use our provided checkpoints. Just create a folder Mono-ViFI/weights/ and move IFRNet_S_KITTI.pth, IFRNet_L_KITTI.pth, IFRNet_S_CS.pth and IFRNet_L_CS.pth to this folder.
If you want to train the VFI models yourself, move IFRNet_L_Vimeo90K.pth and IFRNet_S_Vimeo90K.pth to the folder Mono-ViFI/weights/. We load the Vimeo90K checkpoints when training on KITTI/Cityscapes. All the VFI training configs are in the folder configs/vfi/. For example, the command for training the large IFRNet on KITTI is:
### Training large IFRNet on KITTI
# single-gpu
CUDA_VISIBLE_DEVICES=0 python train_vfi.py -c configs/vfi/IFRNet_L_KITTI.txt
# multi-gpu
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train_vfi.py -c configs/vfi/IFRNet_L_KITTI.txt
Before training, move the 2 checkpoints downloaded from this link to the folder Mono-ViFI/weights/:
- HRNet18 backbone pretrained on ImageNet: HRNet_W18_C_cosinelr_cutmix_300epoch.pth.tar
- LiteMono backbone pretrained on ImageNet: lite-mono-pretrain.pth
You can refer to the config files for the training settings, parameters and paths. All training configs are in the following folders:
- ResNet18 backbone: configs/resnet18
- LiteMono backbone: configs/litemono
- D-HRNet backbone: configs/dhrnet
Remember to modify the related paths to your own. Taking ResNet18 as an example, the training commands are as follows. Note: you can adjust batch_size in the config files according to your maximum GPU memory.
### Training with ResNet18 backbone (KITTI, 640x192)
# single-gpu
CUDA_VISIBLE_DEVICES=0 python train.py -c configs/resnet18/ResNet18_KITTI_MR.txt
# multi-gpu
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py -c configs/resnet18/ResNet18_KITTI_MR.txt
### Training with ResNet18 backbone (KITTI, 1024x320)
# For 1024x320 resolution, we load the 640x192 model and train for 5 epochs with a learning rate of 1e-5.
CUDA_VISIBLE_DEVICES=0 python train.py -c configs/resnet18/ResNet18_KITTI_HR.txt
### Training with ResNet18 backbone (Cityscapes, 512x192)
CUDA_VISIBLE_DEVICES=0 python train.py -c configs/resnet18/ResNet18_CS.txt
For single-frame evaluation, run evaluate_depth.py:
### KITTI 640x192 model, ResNet18
CUDA_VISIBLE_DEVICES=0 python evaluate_depth.py \
--pretrained_path our_models/ResNet18_KITTI_MR.pth \
--backbone ResNet18 \
--batch_size 16 \
--width 640 \
--height 192 \
--kitti_path /data/juicefs_sharing_data/public_data/Datasets/KITTI/kitti_raw_data \
--make3d_path /data/juicefs_sharing_data/public_data/Datasets/make3d \
--cityscapes_path /data/juicefs_sharing_data/public_data/Datasets/cityscapes \
# --post_process
This script will evaluate on KITTI (both raw and improved GT), Make3D and Cityscapes together. If you don't want to evaluate on some of these datasets, for example KITTI, simply do not specify the corresponding --kitti_path flag; the script only evaluates on the datasets for which you have specified a path flag. If you want to evaluate with post-processing, add the --post_process flag (disabled by default).
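For reference, the post-processing here is the usual flip-augmentation scheme popularized by Monodepth: disparity is predicted for both the image and its horizontal flip, the flipped prediction is flipped back, and the two are blended with ramp masks near the image borders. A minimal sketch, not necessarily the exact implementation in this repository:
import numpy as np

def post_process_disparity(d, d_flipped_back):
    # d: disparity predicted on the original image (H x W)
    # d_flipped_back: disparity predicted on the horizontally flipped image,
    #                 then flipped back to the original orientation (H x W)
    h, w = d.shape
    mean_d = 0.5 * (d + d_flipped_back)
    grid, _ = np.meshgrid(np.linspace(0, 1, w), np.linspace(0, 1, h))
    l_mask = 1.0 - np.clip(20 * (grid - 0.05), 0, 1)   # 1 near the left border, 0 elsewhere
    r_mask = l_mask[:, ::-1]                            # 1 near the right border
    # Use the flipped-back prediction near the left border, the original prediction
    # near the right border, and the average of the two everywhere else.
    return r_mask * d + l_mask * d_flipped_back + (1.0 - l_mask - r_mask) * mean_d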
For multi-frame evaluation, run evaluate_depth_mf.py:
### KITTI 640x192 model, ResNet18
CUDA_VISIBLE_DEVICES=0 python evaluate_depth_mf.py \
--pretrained_path our_models/ResNet18_KITTI_MR.pth \
--backbone ResNet18 \
--vfi_scale small \
--training_data kitti \
--batch_size 16 \
--width 640 \
--height 192 \
--kitti_path /data/juicefs_sharing_data/public_data/Datasets/KITTI/kitti_raw_data \
--cityscapes_path /data/juicefs_sharing_data/public_data/Datasets/cityscapes
You can predict the disparity (inverse depth) for a single image with:
python test_simple.py --image_path folder/test_image.png --pretrained_path our_models/DHRNet_KITTI_MR.pth --backbone DHRNet --height 192 --width 640 --save_npy
The --image_path flag can also be a directory containing several images. In this setting, the script will predict all the images (use --ext to specify png or jpg) in the directory:
python test_simple.py --image_path folder --pretrained_path our_models/DHRNet_KITTI_MR.pth --backbone DHRNet --height 192 --width 640 --ext png --save_npy
To run multi-frame prediction on consecutive video frames, use test_video.py:
python test_video.py --image_path folder --pretrained_path our_models/DHRNet_KITTI_MR.pth --backbone DHRNet --vfi_scale small --training_data kitti --height 192 --width 640 --ext png --save_npy
Here the --image_path flag should be a directory containing several video frames. Note that these frame files should be named in ascending numerical order, e.g. the first frame is 0000.png, the second is 0001.png, and so on. This command will also output a GIF file.
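If you passed --save_npy and want to inspect the raw prediction yourself, a small sketch follows; the output filename below is hypothetical, so use whatever path the script actually wrote:
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical output path; adjust it to the .npy file written by test_simple.py.
disp = np.load("folder/test_image_disp.npy").squeeze()

plt.imshow(disp, cmap="magma")   # larger disparity = closer to the camera
plt.axis("off")
plt.savefig("test_image_disp_vis.png", bbox_inches="tight", pad_inches=0)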
We provide the pretrained weights of our depth models at this link, including 9 checkpoints:
- ResNet18 backbone trained on KITTI with 640x192: ResNet18_KITTI_MR.pth
- ResNet18 backbone trained on KITTI with 1024x320: ResNet18_KITTI_HR.pth
- ResNet18 backbone trained on Cityscapes with 512x192: ResNet18_CS.pth
- Lite-Mono backbone trained on KITTI with 640x192: LiteMono_KITTI_MR.pth
- Lite-Mono backbone trained on KITTI with 1024x320: LiteMono_KITTI_HR.pth
- Lite-Mono backbone trained on Cityscapes with 512x192: LiteMono_CS.pth
- D-HRNet backbone trained on KITTI with 640x192: DHRNet_KITTI_MR.pth
- D-HRNet backbone trained on KITTI with 1024x320: DHRNet_KITTI_HR.pth
- D-HRNet backbone trained on Cityscapes with 512x192: DHRNet_CS.pth
Note that these are newly trained checkpoints whose evaluation metrics differ slightly from those reported in the paper.
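If you want to inspect one of these checkpoints before plugging it into the evaluation scripts, the snippet below is a minimal sketch assuming they are standard torch.save archives; the actual key layout (e.g. separate encoder/decoder state dicts) is defined by the training code, so print the top-level keys first:
import torch

# Load on CPU and list the top-level keys; treat this purely as an inspection aid.
ckpt = torch.load("our_models/ResNet18_KITTI_MR.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))
else:
    print(type(ckpt))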
- Monodepth2 (ICCV 2019)
- ManyDepth (CVPR 2021)
- Lite-Mono (CVPR 2023)
- PlaneDepth (CVPR 2023)
- RA-Depth (ECCV 2022)
- BDEdepth (IEEE RA-L 2023)
- IFRNet (CVPR 2022, the VFI model we employ)