parallel_MPC.mp4
Evaluating MPC in parallel for thousands of instances of the MIT Humanoid. CusADi efficiently parallelizes functions for applications such as reinforcement learning, parallel simulation, and sensitivity analysis.
Overview and more videos: https://www.youtube.com/watch?v=NxeujmgcEL4
ArXiv: https://arxiv.org/abs/2408.09662
If you use this work, please use the following citation:
@misc{jeon2024cusadigpuparallelizationframework,
title={CusADi: A GPU Parallelization Framework for Symbolic Expressions and Optimal Control},
author={Se Hwan Jeon and Seungwoo Hong and Ho Jae Lee and Charles Khazoom and Sangbae Kim},
year={2024},
eprint={2408.09662},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2408.09662},
}
Details of our work can be found in the paper:
Table of Contents
cusadi
is a framework for parallelizing arbitrary symbolic functions on the GPU.
It extends the graph expression structure of casadi
by generating and compiling CUDA
kernels for efficient evaluation of thousands of function instances.
casadi
is particularly well-suited for this because of it is able to exploit sparsity when building its expression graphs, ensuring efficient computation.
The CasADi expression graph is evaluated by iterating through instructions and performing each operation on scalar values. CusADi can exploit this same structure and evaluate the function in parallel with the GPU by vectorizing each operation to act on tensors of data instead of scalars.
We demonstrate using cusadi
for several robotics applications, including parallelizing MPC, augmenting RL with parallelized dynamics algorithms, creating parallel simulations, and running parameters sweeps.
Benchmarks show that cusadi
offers significant speedups, especially for reinforcement learning applications where significant overhead is incurred from CPU-GPU data transfer.
(Left): Speedups from CusADi relative to serial CPU evaluation of five functions, each increasing in complexity.
(Right): Speedups from CusADi relative to serial CPU evaluation, including overhead time from CPU-GPU data transfer.
cusadi
was built on the following. Other versions may work, but are untested.
- Ubuntu 20.04
- Python 3.8+
casadi
3.6.5: https://web.casadi.org/get/CUDA
Toolkit: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64- Version 12.3 was used, but 12.X should work
- Add the installation to the
PATH
variable to be able to compile and findCUDA
libraries. Make sure to replacecuda-12.3
with the installed version:export PATH="/usr/local/cuda-12.3/bin:$PATH" export LD_LIBRARY_PATH="/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH"
- (Optional, for running benchmarks) Eigen
- Clone this repository (standalone, or into a larger project)
git clone https://github.com/se-hwan/cusadi
- (Optional) Setup a virtual environment for required Python dependencies
python -m venv .cusadi_venv # Python virtual environment source .cusadi_venv/bin/activate
- Install
cusadi
. From the root of the cloned repository, run:pip install -e .
- Compile the test function for parallelization.
python run_codegen.py --fn=test
- Evaluate the parallelized function for accuracy. The error should be ~1e-10 or smaller. If successful, then
cusadi
is ready for use!python run_cusadi_function_test.py --fn=test
-
Define some symbolic
casadi
function for parallelization. This function could be the dynamics of a system, a value iteration update, a controller, etc., but do not need to be limited to optimal control applications. There are many examples and tutorials available online:- https://web.casadi.org/docs/
- https://folk.ntnu.no/vladimim/#/6
- https://www.syscop.de/files/2022ws/numopt/ex1.pdf
- In our case, we'll parallelize the dynamics of a pendulum as a trivial example.
casadi
is available in C++, MATLAB, and Python, but we'll do this example in MATLAB% Add casadi to MATLAB path. Do this for wherever the casadi folder is downloaded from https://web.casadi.org/get/ addpath(genpath('[CASADI_FOLDER_LOCATION]')); import casadi.* % Symbolic expressions x_pend = casadi.SX.sym('x_pend', 2, 1); % pendulum state g = casadi.SX.sym('g', 1, 1); % pendulum parameters, gravity and length l = casadi.SX.sym('l', 1, 1); dt = casadi.SX.sym('dt', 1, 1); % simulation timestep f_pend = [x_pend(2); -g*sin(x_pend(1))/l]; % pendulum dynamics J_pend = jacobian(f_pend, x_pend); % Jacobian of pendulum dynamics w.r.t the state omega_next = x_pend(2) - (g*sin(x_pend(1))/l)*dt % Semi-implicit Euler integration of dynamics theta_next = x_pend(1) + omega_next*dt x_next_pend = [theta_next; omega_next]; % Export and save as casadi functions % [casadi_expr] = casadi.Function('[fn_name]', {[input1, input2, ...]}, {[output1, output2, ...]}) fn_dynamics = casadi.Function('fn_dynamics', {x_pend, g, l}, {f_pend}); fn_jacobian = casadi.Function('fn_jacobian', {x_pend, g, l}, {J_pend}); fn_sim_step = casadi.Function('fn_sim_step', {x_pend, g, l, dt}, {x_next_pend}); fn_dynamics.save('fn_dynamics.casadi') fn_sim_step.save('fn_sim_step.casadi') fn_jacobian.save('fn_jacobian.casadi')
-
Move the saved functions to
src/casadi_functions
of thecusadi
directory. -
Compile the functions for parallelization. From the root directory of
cusadi
:python run_codegen.py --fn=fn_dynamics python run_codegen.py --fn=fn_sim_step python run_codegen.py --fn=fn_jacobian
-
Evaluate the parallelized functions with
cusadi
in PyTorchimport torch from cusadi import * from casadi import * BATCH_SIZE = 10000 x0 = torch.rand((BATCH_SIZE, 2), device='cuda', dtype=torch.double) # Random initial states g = 9.81 * torch.ones((BATCH_SIZE, 1), device='cuda', dtype=torch.double) # Gravity for each env. l = torch.rand((BATCH_SIZE, 1), device='cuda', dtype=torch.double) # Random lengths for each env. dt = torch.linspace(0.001, 0.1, BATCH_SIZE, device='cuda', dtype=torch.double) # Varying timestep for each env. fn_casadi_sim_step = casadi.Function.load(os.path.join(CUSADI_FUNCTION_DIR, "fn_sim_step.casadi")) fn_cusadi_sim_step = CusadiFunction(fn_casadi_sim_step, BATCH_SIZE) fn_cusadi_sim_step.evaluate(x0, g, l, dt) # Evaluate fn. with CUDA kernel x_next = fn_cusadi_sim_step.outputs_sparse[0] # Access results.
-
With this example, by putting the
evaluate()
call in a for loop, a parallel simulator can be created. You can quickly sweep the effect of the parameters (l
andg
) or timestep (dt
) on the system as well.
- Release 1.0
- Support for
JAX
- Interface with
cuBLAS
,cuSPARSE
- Explore CPU parallelism opportunities
- Streamline exporting, saving, and compilation flow
- IsaacGym/Orbit/IsaacLab examples
- Public examples for parallelized MPC, other optimal controllers (coming soon!)
Distributed under the MIT License. See LICENSE.txt
for more information.
Please feel free to reach out with any questions!
Se Hwan Jeon - [email protected]