This is a pure Python project that allows users to navigate the latent space of a pretrained RAVE model with gestures in real time.
The gesture encoder is designed so that its latent codes follow the RAVE prior (a 4-dimensional Gaussian distribution); at each step, RAVE decodes the gesture embeddings into audio. More details are provided in the training notebook.
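As a rough sketch of the decoding step (this is not the project's actual code; it assumes a TorchScript RAVE export with a `decode()` method and a 4-dimensional latent space, and the file name is hypothetical):

```python
import torch

# Hypothetical file name; any scripted RAVE export with a decode() method should work.
rave = torch.jit.load("models/my_rave_model.ts").eval()

# In the real app the latent vector comes from the gesture encoder; here we simply
# sample from the prior it is trained to match, a standard 4-dimensional Gaussian.
z = torch.randn(1, 4, 1)  # (batch, latent_dims, latent_frames)

with torch.no_grad():
    audio = rave.decode(z)  # decoded audio, roughly (batch, channels, samples)

print(audio.shape)
```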
- Clone the repository
- Install the required packages via `pip install -r requirements.txt` (tested with Python 3.10.12)
- Download a pretrained gesture encoder and unzip it in the root directory of the project
- Download the MediaPipe HandLandmarker model and place it in the `models` directory (a minimal loading check is sketched after this list)
- Move a pretrained RAVE model to the `models` directory (you can download some here or train your own custom model)
- Connect a webcam
- Run `python generate.py --rave_model [PATH TO RAVE MODEL]`
Optional arguments:
- `--gesture_encoder`: path to the gesture encoder; change this to point to a custom location or to a custom encoder you have trained
- `--num_channels`: number of output audio channels; depends on the RAVE model (default: 1)
- `--num_blocks`: number of streaming blocks; a smaller number gives a smaller delay (default: 4)
- `--temperature`: variance multiplier for the encoder, controlling the randomness of sampling; values from 1 to 4 work well (default: 2.0)
- `--cam_device`: index of the camera device (default: 0)
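To check that the HandLandmarker model is in place before launching `generate.py`, a minimal standalone test could look like the following (the file name `models/hand_landmarker.task` is an assumption; adjust it to match your download):

```python
import numpy as np
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# Assumed file name for the downloaded HandLandmarker model.
MODEL_PATH = "models/hand_landmarker.task"

options = vision.HandLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path=MODEL_PATH),
    num_hands=1,
)
landmarker = vision.HandLandmarker.create_from_options(options)

# Detect on a dummy black frame just to confirm the model loads and runs.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
result = landmarker.detect(mp.Image(image_format=mp.ImageFormat.SRGB, data=frame))
print("hands detected:", len(result.hand_landmarks))
```

If this runs without errors, the landmarker model is in the expected location and the main script should be able to find it.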
- Antoine Caillon and IRCAM for RAVE
- Google for MediaPipe solutions
- Matthias Geier and other contributors for sounddevice
- Kapitanov et al. for the HaGRID dataset