Deep Event Visual Odometry

Simon Klenk1,2*    Marvin Motzet1,2*    Lukas Koestler1,2    Daniel Cremers1,2

*equal contribution

1Technical University of Munich (TUM)    2Munich Center for Machine Learning (MCML)

International Conference on 3D Vision (3DV) 2024, Davos, CH

Paper (arXiv) | Video | Poster | BibTeX

Abstract

Event cameras offer the exciting possibility of tracking the camera's pose during high-speed motion and in adverse lighting conditions. Despite this promise, existing event-based monocular visual odometry (VO) approaches demonstrate limited performance on recent benchmarks. To address this limitation, some methods resort to additional sensors such as IMUs, stereo event cameras, or frame-based cameras. Nonetheless, these additional sensors limit the application of event cameras in real-world devices since they increase cost and complicate system requirements. Moreover, relying on a frame-based camera makes the system susceptible to motion blur and HDR. To remove the dependency on additional sensors and to push the limits of using only a single event camera, we present Deep Event VO (DEVO), the first monocular event-only system with strong performance on a large number of real-world benchmarks. DEVO sparsely tracks selected event patches over time. A key component of DEVO is a novel deep patch selection mechanism tailored to event data. We significantly decrease the pose tracking error on seven real-world benchmarks by up to 97% compared to event-only methods and often surpass or are close to stereo or inertial methods.

Overview

During training, DEVO takes event voxel grids $\{\mathbf{E}_t\}_{t=1}^N$, inverse depths $\{\mathbf{d}_t\}_{t=1}^N$, and camera poses $\{\mathbf{T}_t\}_{t=1}^N$ of a sequence of size $N$ as input. DEVO estimates poses $\{\hat{\mathbf{T}}_t\}_{t=1}^N$ and depths $\{\hat{\mathbf{d}}_t\}_{t=1}^N$ of the sequence. Our novel patch selection network predicts a score map $\mathbf{S}_t$ to highlight optimal 2D coordinates $\mathbf{P}_t$ for optical flow and pose estimation. A recurrent update operator iteratively refines the sparse patch-based optical flow $\hat{\mathbf{f}}$ between event grids by predicting $\Delta\hat{\mathbf{f}}$ and updates poses and depths through a differentiable bundle adjustment (DBA) layer, weighted by $\omega$, for each revision. Ground truth optical flow $\mathbf{f}$ for supervision is computed using poses and depth maps. At inference, DEVO samples from a multinomial distribution based on the pooled score map $\mathbf{S}_t$.
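The inference-time sampling step can be illustrated with a small, dependency-free sketch. It is not the repository's implementation: the function names, the use of average pooling, the window size `k`, non-negative scores, and sampling without replacement are all assumptions made for illustration.

```python
import random


def avg_pool(score, k):
    """Average-pool a 2D score map (list of lists) with a k x k window, stride k."""
    h, w = len(score), len(score[0])
    return [
        [
            sum(score[y + dy][x + dx] for dy in range(k) for dx in range(k)) / (k * k)
            for x in range(0, w - k + 1, k)
        ]
        for y in range(0, h - k + 1, k)
    ]


def sample_patch_coords(score, num_patches, k=4, seed=0):
    """Draw patch centres with probability proportional to the pooled score.

    Hypothetical helper: pooling type, k, and sampling without replacement
    are assumptions, not taken from the DEVO code. Scores must be non-negative.
    """
    pooled = avg_pool(score, k)
    cells = [(y, x) for y in range(len(pooled)) for x in range(len(pooled[0]))]
    # A small epsilon keeps the total weight positive even for all-zero maps.
    weights = [pooled[y][x] + 1e-12 for y, x in cells]
    rng = random.Random(seed)
    chosen = []
    for _ in range(min(num_patches, len(cells))):
        i = rng.choices(range(len(cells)), weights=weights, k=1)[0]
        chosen.append(cells.pop(i))
        weights.pop(i)
    # Map each pooled cell back to the centre pixel of its window.
    return [(y * k + k // 2, x * k + k // 2) for y, x in chosen]


# Toy 8 x 8 score map with one high-scoring 4 x 4 block in the top-left corner.
score_map = [[1.0 if (y < 4 and x < 4) else 0.01 for x in range(8)] for y in range(8)]
coords = sample_patch_coords(score_map, num_patches=3, k=4)
print(coords)
```

Sampling from the score map instead of always taking the top-scoring cells keeps some randomness in the selected patches while still favouring regions the network rates highly.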

Setup

The code was tested on Ubuntu 22.04 and CUDA Toolkit 11.x. We use Anaconda to manage our Python environment.

First, clone the repo

git clone https://github.com/tum-vision/DEVO.git --recursive
cd DEVO

Then, create and activate the Anaconda environment

conda env create -f environment.yml
conda activate devo

Next, install the DEVO package

# download and unzip Eigen source code
wget https://gitlab.com/libeigen/eigen/-/archive/3.4.0/eigen-3.4.0.zip
unzip eigen-3.4.0.zip -d thirdparty

# install DEVO
pip install .

Only for Training

The following steps are only needed if you intend to (re)train DEVO. Please note that the training data are about 1.1TB in size (rgb: 300GB, evs: 370GB).

Otherwise, skip this section and continue with the evaluation setup below.

First, download all RGB images and depth maps of TartanAir from the left camera (~500GB) to <TARTANPATH>

python thirdparty/tartanair_tools/download_training.py --output-dir <TARTANPATH> --rgb --depth --only-left

Next, generate event voxel grids using vid2e.

python scripts/convert_tartan.py --dirsfile <path to .txt file>

dirsfile expects a .txt file containing line-separated paths to dirs with .png images (to generate events for these images).
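A dirsfile of this form could be generated with a short helper like the one below. This is a sketch only: `write_dirsfile` is a hypothetical name, and writing absolute paths is an assumption, so check scripts/convert_tartan.py for what it actually expects.

```python
from pathlib import Path


def write_dirsfile(root, out_txt):
    """Write one line per directory under `root` that directly contains .png files.

    Hypothetical helper, not part of the DEVO repository. Paths are written as
    absolute paths, which is an assumption about what the converter accepts.
    """
    root = Path(root)
    # Collect the parent directory of every .png file, without duplicates.
    dirs = sorted({p.parent.resolve() for p in root.rglob("*.png")})
    Path(out_txt).write_text("".join(f"{d}\n" for d in dirs))
    return dirs


# Example call (replace the placeholder with your own path):
# write_dirsfile("<TARTANPATH>", "dirs.txt")
```

The resulting file can then be passed to the conversion script via `--dirsfile`.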

Only for Evaluation

We provide a pretrained model for our simulated event data.

# download model (~40MB)
./download_model.sh

Data Preprocessing

We evaluate DEVO on seven real-world event-based datasets (FPV, VECtor, HKU, EDS, RPG, MVSEC, TUM-VIE). We provide scripts for data preprocessing (undistortion, ...).

See scripts/pp_DATASETNAME.py for how to preprocess each original dataset. The scripts create the necessary files, e.g. rectify_map.h5, calib_undist.json and t_offset_us.txt.
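Before running an evaluation, it can help to confirm that preprocessing actually produced its output files. The sketch below is a hypothetical check and not part of the repository. It uses the three file names mentioned above as its default list, which may not apply to every dataset, and it assumes the files sit somewhere below the sequence directory.

```python
from pathlib import Path

# File names taken from the text above; the exact set may differ per dataset.
EXPECTED_FILES = ("rectify_map.h5", "calib_undist.json", "t_offset_us.txt")


def missing_preprocessing_files(seq_dir, expected=EXPECTED_FILES):
    """Return the expected file names not found anywhere below `seq_dir`."""
    seq_dir = Path(seq_dir)
    # rglob also matches files located directly inside seq_dir.
    return [name for name in expected if not any(seq_dir.rglob(name))]


# Example call (replace the placeholder with a preprocessed sequence directory):
# print(missing_preprocessing_files("<DATASETPATH>"))
```

An empty list means all listed files were found; any names returned point to a preprocessing step that may need to be rerun.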

Training

Make sure you have completed the steps in the Only for Training section above. Your dataset directory structure should look as follows

├── <TARTANPATH>
    ├── abandonedfactory
    ├── abandonedfactory_night
    ├── ...
    ├── westerndesert

To train DEVO with the default configuration, run

python train.py -c="config/DEVO_base.conf" --name=<your name>

The log files will be written to runs/<your name>. Please check train.py for more options.

Evaluation

Make sure you have completed the following steps: downloading the pretrained model, downloading the data, and preprocessing the data.

python evals/eval_evs/eval_DATASETNAME_evs.py --datapath=<DATASETPATH> --weights="DEVO.pth" --stride=1 --trials=1 --expname=<your name>

The qualitative and quantitative results will be written to results/DATASETNAME/<your name>. Check eval_rpg_evs.py for more options.
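When evaluating several datasets or experiment names, the command above can be assembled programmatically. The following sketch only builds and prints the command. `build_eval_command` is a hypothetical helper, `/data/rpg` and `baseline` are placeholder values, and the flags are exactly those shown in the command above.

```python
import shlex


def build_eval_command(dataset, datapath, expname, weights="DEVO.pth", stride=1, trials=1):
    """Assemble the evaluation command as an argument list.

    Hypothetical helper: `dataset` replaces DATASETNAME in the script path,
    e.g. "rpg" gives evals/eval_evs/eval_rpg_evs.py.
    """
    script = f"evals/eval_evs/eval_{dataset}_evs.py"
    return [
        "python",
        script,
        f"--datapath={datapath}",
        f"--weights={weights}",
        f"--stride={stride}",
        f"--trials={trials}",
        f"--expname={expname}",
    ]


# Placeholder path and experiment name, for illustration only.
cmd = build_eval_command("rpg", "/data/rpg", "baseline")
print(shlex.join(cmd))
# To launch it from the repository root: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems if the dataset path contains spaces.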

News

  • Code and model are released.
  • Code for simulation is released.

Citation

If you find our work useful, please cite our paper:

@inproceedings{klenk2023devo,
  title     = {Deep Event Visual Odometry},
  author    = {Klenk, Simon and Motzet, Marvin and Koestler, Lukas and Cremers, Daniel},
  booktitle = {International Conference on 3D Vision, 3DV 2024, Davos, Switzerland,
               March 18-21, 2024},
  pages     = {739--749},
  publisher = {{IEEE}},
  year      = {2024},
}

Acknowledgments

We thank the authors of the following repositories for publicly releasing their work:

This work was supported by the ERC Advanced Grant SIMULACRON.