Methodology for ROS 2 Hardware Acceleration #20
The following proposes a methodology for ROS 2 Hardware Acceleration and demonstrates it with a practical use case studying the computational graph of a simple perception pipeline.

*Figures: Methodology for ROS 2 Hardware Acceleration | case study: accelerating ROS 2 perception*
(this methodology aligns with REP-2008's Pull Request proposal)
Methodology for ROS 2 Hardware Acceleration
A. Trace computational graph
About tracing and benchmarking
Benchmarking is the act of running a computer program to assess its relative performance, whereas tracing is a technique used to understand what goes on in a running system. In the context of hardware acceleration in robotics, it's fundamental to be able to do both. Tracing helps determine which pieces of a Node are consuming more compute cycles or generating indeterminism, and are thereby good candidates for hardware acceleration. Benchmarking instead helps investigate the relative performance of an acceleration kernel versus its CPU scalar computing baseline. Similarly, benchmarking also helps compare acceleration kernels across hardware acceleration technology solutions (e.g. Kria KV260 vs Jetson Nano) and across kernel implementations (within the same hardware acceleration technology solution).
The first step is to instrument and trace the ROS 2 computational graph with LTTng probes. Reusing past work and probes allows us to easily get a grasp of the dataflow interactions within the `rmw`, `rcl` and `rclcpp` ROS 2 layers. But to trace the complete computational graph appropriately, besides these tracepoints, we also need to instrument our userland code. Particularly, as depicted for the publication path in the figure below, we need to add instrumentation to the `image_pipeline` package and, more specifically, to the ROS Components that we're using.
| ROS 2 Layer | Trace point | Desired transition |
|---|---|---|
| userland | `image_proc_rectify_init` | CPU-FPGA |
| | `image_proc_rectify_fini` | FPGA-CPU |
| | `image_proc_rectify_cb_fini` | |
| | `image_proc_resize_cb_init` | |
| | `image_proc_resize_init` | CPU-FPGA |
| | `image_proc_resize_fini` | FPGA-CPU |
| | `image_proc_resize_cb_fini` | |
| rclcpp | `callback_start` | |
| | `callback_end` | |
| | `rclcpp_publish` | |
| rcl | `rcl_publish` | |
| rmw | `rmw_publish` | |
This is illustrated in the Table above and implemented at ros-perception/image_pipeline#717, including the instrumentation of the `ResizeNode` and `RectifyNode` ROS 2 Components. Further instrumentation could be added to these Components if necessary, obtaining more granularity in the tracing efforts.
Below, we depict the results obtained after instrumenting the complete ROS 2 computational graph being studied. A closer inspection shows, in grey, that the ROS 2 message-passing system across abstraction layers consumes a considerable portion of the CPU time. In comparison, in light red, taking only a small portion of each Node's execution time, we depict the computations that interact with the data flowing across nodes. Both the core logic of each one of the Nodes (rectify and resize operations) as well as the ROS 2 message-passing plumbing will be subject to acceleration.
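As a back-of-the-envelope illustration of how such per-segment costs can be read out of a trace, the sketch below computes the time spent between each pair of consecutive tracepoint events (event names follow the table above, but the timestamps are hypothetical, not actual trace data):

```python
# Sketch: derive per-segment durations from an ordered list of trace events.
# Timestamps below are hypothetical, for illustration only.

def segment_durations(events):
    """Given (name, timestamp_ms) pairs ordered in time, return the
    duration elapsed between each consecutive pair of tracepoints."""
    durations = {}
    for (prev_name, prev_ts), (name, ts) in zip(events, events[1:]):
        durations[f"{prev_name} -> {name}"] = ts - prev_ts
    return durations

events = [
    ("callback_start",          0.00),
    ("image_proc_resize_init",  0.35),  # CPU work before the offload
    ("image_proc_resize_fini",  5.10),  # time spent in the FPGA kernel
    ("callback_end",            5.60),  # CPU work after the offload
]

for segment, ms in segment_durations(events).items():
    print(f"{segment}: {ms:.2f} ms")
```

Plotting these deltas per segment is what reveals which stretches (kernel compute vs. message-passing plumbing) dominate each Node's execution time.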
B. Benchmark CPU baseline
After tracing the graph and obtaining a good understanding of the dataflow, we can proceed to produce a CPU baseline benchmark while running on the Xilinx Kria® KV260 Vision AI Starter Kit's quad-core Processing System (the CPU).
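The baseline (and each subsequent measurement) is summarized in the tables below as mean and RMS latencies. These statistics can be reproduced from raw samples in a few lines; a minimal sketch, assuming per-message latencies were collected in milliseconds (the sample values here are illustrative, not the actual benchmark data):

```python
import math

def mean(samples):
    return sum(samples) / len(samples)

def rms(samples):
    # Root-mean-square of the latency samples; slightly above the mean
    # whenever the samples have non-zero variance.
    return math.sqrt(sum(x * x for x in samples) / len(samples))

# Hypothetical per-message latencies (ms) from a 60 s benchmark run
latencies = [91.2, 90.8, 92.3, 91.9, 91.1]
print(f"mean: {mean(latencies):.2f} ms, RMS: {rms(latencies):.2f} ms")
```

RMS is reported alongside the mean because it penalizes outliers more heavily, which matters when assessing determinism, not just average throughput.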
C. Hardware acceleration
The third step in the methodology for ROS 2 hardware acceleration is to introduce custom compute architectures by using specialized hardware (FPGAs or GPUs). This is done in two steps: first, creating acceleration kernels for individual ROS 2 Nodes and Components and, second, accelerating the computational graph by tracing and optimizing dataflow interactions. The whole process can take several iterations until results are satisfactory.
Accelerate ROS 2 Nodes and Components
We first accelerate the computations at each one of the graph nodes. The `/rectify_node_fpga` and `/resize/resize_node_fpga` Components of the use case above are accelerated using Xilinx's HLS, XRT and OpenCL targeting the Kria KV260. The changes in the ROS 2 Components of `image_pipeline` to leverage hardware acceleration in the FPGA are available in `rectify_fpga` and `resize_fpga`, respectively. Each ROS 2 Component has an associated acceleration kernel that leverages the Vitis Vision Library, a computer vision library optimized for Xilinx silicon solutions and based on the OpenCV APIs. Source code of the acceleration kernels is available here. It's relevant to note how the implementation of these accelerated Components and their kernels co-exists well with the rest of the ROS meta-package. Thanks to the work of the WG, building accelerators is abstracted away from roboticists and takes no significant additional effort beyond the usual build of `image_pipeline`.
The figure above depicts the results obtained after benchmarking these accelerated Components using the trace points. We observe an average 6.22% speedup in the total computation time of the perception pipeline after offloading perception tasks to the FPGA.
| | Accel. Mean | Accel. RMS | Mean | RMS |
|---|---|---|---|---|
| CPU baseline | 24.36 ms (0.00 %) | 24.50 ms (0.00 %) | 91.48 ms (0.00 %) | 92.05 ms (0.00 %) |
| FPGA @ 250 MHz | 24.46 ms (:small_red_triangle_down: 0.41 %) | 24.66 ms (:small_red_triangle_down: 0.63 %) | 85.80 ms (6.22 %) | 87.87 ms (4.54 %) |
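For reference, the speedup percentages in these tables follow from comparing each measurement against the CPU baseline. A quick sketch (plugging in the rounded mean values from the tables, so results match the reported figures up to rounding):

```python
def speedup_pct(baseline_ms, accelerated_ms):
    """Relative improvement over the CPU baseline, as a percentage.
    Negative values mean the accelerated variant was slower."""
    return (baseline_ms - accelerated_ms) / baseline_ms * 100

# Total graph runtime, mean values taken from the benchmark tables
print(f"{speedup_pct(91.48, 85.80):.2f} %")  # FPGA offloading only
print(f"{speedup_pct(91.48, 66.82):.2f} %")  # integrated Component
print(f"{speedup_pct(91.48, 69.15):.2f} %")  # streamlined (AXI4-Stream)
```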
Accelerate Graph
As illustrated before through tracing, inter-Node exchanges using the ROS 2 message-passing system across its abstraction layers outweigh other operations by far, regardless of the compute substrate. This confirms the CPU-centric approach in ROS and hints at an important opportunity where hardware acceleration can hasten ROS 2 computational graphs. By optimizing inter-Node dataflows, ROS 2 intra-process and inter-process communications can be made more time-efficient, leading to faster resolution of the graph computations and, ultimately, to faster robots. This step is thereby focused on optimizing the dataflow within the computational graph and across ROS 2 Nodes and Components. The figures below depict two attempts to accelerate the graph dataflow.
*Figures: integrated approach | streamlining approach*
The first one integrates both ROS Components into a new one. The benefit of doing so is two-fold: first, we avoid the ROS 2 message-passing system between the `RectifyNode` and `ResizeNode` Components. Second, we avoid the compute cycles wasted while memory-mapping data back and forth between the host CPU and the FPGA, achieving an overall faster acceleration which totals an average 26.96% speedup while benchmarking the graph for 60 seconds.
| | Accel. Mean | Accel. RMS | Mean | RMS |
|---|---|---|---|---|
| CPU baseline | 24.36 ms (0.00 %) | 24.50 ms (0.00 %) | 91.48 ms (0.00 %) | 92.05 ms (0.00 %) |
| FPGA, integrated @ 250 MHz | 23.90 ms (1.88 %) | 24.05 ms (1.84 %) | 66.82 ms (26.96 %) | 67.82 ms (26.32 %) |
The second attempt results from using the accelerated Components `RectifyNodeFPGAStreamlined` and `ResizeNodeFPGAStreamlined`. These ROS Components are redesigned to leverage hardware acceleration; besides offloading perception tasks to the FPGA, each leverages an AXI4-Stream interface to create an intra-FPGA ROS 2 communication queue, which is then used to pass data across nodes through the FPGA. This completely avoids the ROS 2 message-passing system and optimizes the dataflow, achieving a 24.42% total speedup resulting from averaging the measurements collected while benchmarking the graph for 60 seconds.
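Conceptually, the streamlining approach replaces a publish/subscribe hop between the two Components with a direct stream between kernels. A rough software analogy in plain Python (hypothetical stage functions, not the actual HLS implementation) of chaining stages through a queue instead of serializing messages between them:

```python
from queue import Queue

def rectify(frame):
    # Stand-in for the rectify kernel (hypothetical)
    return f"rectified({frame})"

def resize(frame):
    # Stand-in for the resize kernel (hypothetical)
    return f"resized({frame})"

stream = Queue()  # plays the role of the intra-FPGA AXI4-Stream queue

def rectify_stage(frames):
    for frame in frames:
        stream.put(rectify(frame))  # no ROS 2 message-passing hop
    stream.put(None)  # end-of-stream marker

def resize_stage():
    results = []
    while (frame := stream.get()) is not None:
        results.append(resize(frame))
    return results

rectify_stage(["img0", "img1"])
print(resize_stage())
```

In the actual design, the data never leaves the Programmable Logic between the two kernels, which is what removes both the serialization cost and the CPU-FPGA memory round-trips.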
| | Accel. Mean | Accel. RMS | Mean | RMS |
|---|---|---|---|---|
| CPU baseline | 24.36 ms (0.00 %) | 24.50 ms (0.00 %) | 91.48 ms (0.00 %) | 92.05 ms (0.00 %) |
| FPGA, streams (resize) @ 250 MHz | 19.14 ms (21.42 %) | 19.28 ms (21.33 %) | 69.15 ms (24.42 %) | 70.18 ms (23.75 %) |
D. Benchmark acceleration
The last step in the methodology for ROS 2 hardware acceleration is to continuously benchmark the acceleration results of each custom compute architecture against the CPU baseline. The figures above present results obtained iteratively while building custom hardware interfaces for the Xilinx Kria KV260 FPGA SoC.
| | Accel. Mean | Accel. RMS | Mean | RMS |
|---|---|---|---|---|
| CPU baseline | 24.36 ms (0.00 %) | 24.50 ms (0.00 %) | 91.48 ms (0.00 %) | 92.05 ms (0.00 %) |
| FPGA @ 250 MHz | 24.46 ms (:small_red_triangle_down: 0.41 %) | 24.66 ms (:small_red_triangle_down: 0.63 %) | 85.80 ms (6.22 %) | 87.87 ms (4.54 %) |
| FPGA, integrated @ 250 MHz | 23.90 ms (1.88 %) | 24.05 ms (1.84 %) | 66.82 ms (26.96 %) | 67.82 ms (26.32 %) |
| FPGA, streams (resize) @ 250 MHz | 19.14 ms (21.42 %) | 19.28 ms (21.33 %) | 69.15 ms (24.42 %) | 70.18 ms (23.75 %) |
Discussion
The previous analysis shows, for a simple robotics perception task, how by leveraging the ROS 2 Hardware Acceleration open architecture and following the proposed methodology, we are able to use hardware acceleration easily, without changing the development flow, while obtaining faster ROS 2 responses. We demonstrated how:
- pure perception FPGA offloading leads to a 6.22% speedup for our application,
- re-architecting and integrating the ROS Components into a single FPGA-accelerated and optimized Component leads to a 26.96% speedup. This comes at the cost of having to re-architect the ROS computational graph, merging Components as appropriate and breaking the ROS modularity and granularity assumptions conveyed in the default perception stack. To avoid doing so and lower the entry barrier for roboticists, finally,
- we designed two new Components which offload perception tasks to the FPGA and leverage an AXI4-Stream interface to create an intra-FPGA ROS 2 Node communication queue. Using this queue, our new ROS Components deliver faster dataflows and achieve an inter-Node performance speedup of 24.42%. We believe that, using this intra-FPGA ROS 2 Node communication queue, the acceleration speedup can also be exploited in subsequent Nodes of the computational graph dataflow, compounding the acceleration gains. Best of all, our intra-FPGA ROS 2 Node communication queue aligns well with modern ROS 2 composition capabilities and allows ROS 2 Components and Nodes to exploit this communication pattern for inter- and intra-process ROS 2 communications.
@christophebedard and @iluetkeb, you might be interested in this and I'd love to hear or read your thoughts about it (a formal complete paper is coming out soon delivering additional details). Especially on the methodology. @SteveMacenski, connecting this with the ros-navigation/navigation2#2788 discussion, is the discourse aligned with what you'd expect (thought of as a blueprint that would need to be transposed to costmap updates, planners and controllers, as discussed)?
@vmayoral, very interesting work on accelerating ROS 2 perception. Looking forward to the formal complete paper.
Happy to share an early draft if you provide me with a personal contact.
I'd definitely be interested in exploring this. Could you elaborate on what new capabilities you have in mind specifically?
I'll of course check out the full paper once it's available, but this looks good! The figures are a bit confusing (stacked bar chart implies time durations, but the colours are linked to time events which have no duration), but I do understand the comparisons of course. I'm wondering what the next optimization/acceleration step is after this, since this is an easy-ish first step. Does the existing tracing instrumentation have enough information to allow you to dig deeper or try to accelerate other parts?
Fair enough, there's definitely ground for improvements in the plots. I built them with the following interpretation in mind: each color represents the time duration up until the specific time event, counting from the previous event. That way, I was able to identify bottlenecks.
From my experience, there are three avenues to explore for better tracing capabilities:
The advantage of the streamlining approach I see is that you are no longer constrained to custom-constructed nodes. Developers can pick and choose and construct their own graph with individual nodes and still leverage graph-level optimization. The only caveat is that it is incumbent upon the ROS developer to choose the right set of nodes for the specific accelerator. However, I am wondering what happens if there are multiple nodes subscribing to the topics published by the intermediate nodes, not just other FPGA nodes. That would require a copy back to the CPU memory. At the moment I can't think of that affecting the performance of the graph itself, but it is of interest to see how it affects the CPU power consumption and utilization vs the non-accelerated graph.
There's instrumentation in This is all you need for normal pub/sub over the network. I'm working on an analysis + a paper to extend/improve what I did with ROS 1; it's progressing well and I should be able to share both in about 2 months.
You should look into babeltrace2 if you haven't already: https://babeltrace.org/. You could use its C API to convert other traces to CTF traces to be able to easily read them with
Agreed, and moreover it's currently not-that-simple for software engineers to engage with the streamlining approach since it requires some hardware skills. The methodology described above aims to shed some light on how to systematically help ROS developers identify which Nodes/Components should be considered for acceleration. At the end of the day, a roboticist spends a significant amount of time putting together a computational graph and optimizing things (in a functional and non-functional manner) to solve the given task, so it's not that far from the tree.
This is a great open question and something we're currently pondering. We have a few ideas that need some prototyping time. Shortly, we believe we can build FPGA constructs that duplicate the dataflow on each kernel through additional intra-FPGA queues, as many times as needed to serve additional publishers/subscribers. An extra queue could also be allocated to account for host (CPU) dynamic data requests (e.g. new intra-network endpoints). In principle this sounds feasible, but it needs to be prototyped to evaluate how many extra resources it requires in the Programmable Logic. I have the feeling that adding this feature "by default" to all kernels is not going to scale due to resource limitations (which we always have in embedded). Happy to chat more about this over a call if it's of interest to you.
It sounds like: add tracing for benchmarks, do the hardware acceleration, and then show that that acceleration helps via the benchmarks. Yeah, that makes sense, but the second step within Nav2 requires more discussion. We need to chat about what kinds of acceleration we want to have added and how they're added to be cross-platform -- assuming (almost certainly) that there are areas where Nav2 would benefit from acceleration. The burning question I have is around what are the target platform(s) that have the FPGA/GPU/etc to use as a basis. I don't want to make any features that strictly require a specific compute architecture. The point of ROS to me is that we have a set of tools available for every practical platform. That doesn't mean that some platforms can't be better supported than others, but I would not like to support, for instance, one GPU manufacturer's ecosystem only and have features that essentially require that new capability only available on that GPU vendor. I don't want vendor lock-in.
This is not a ROS-only issue. It is an issue for all open source graph acceleration efforts. Leveraging graph-level optimization without a vendor-specific feature is the right way to approach it rather than have custom implementations. This is something that needs more attention from all industry participants. In terms of cross-platform standards to write portable code, OpenCL seems appropriate for writing individual kernels to be offloaded to GPU, FPGA etc. without vendor lock-in. SYCL could be a great alternative if only Nvidia were officially supporting it.
OK, totally agreed. I just wanted to make the point since Victor asked my thoughts. It's certainly not a Nav2 specific (or robotics specific) request 😄 |
@SteveMacenski that's addressed by our vendor-agnostic architecture for hardware acceleration. See REP-2008 PR for more details. In a nutshell, this should provide an abstraction layer for accelerators so that you, as a package maintainer, can remain agnostic to the underlying accelerator hardware solution. Responsibility of building the right kernels is up to the silicon vendors that provide ROS support. Expect support for the most popular platforms for starters, including Xilinx's Kria and Nvidia's Jetson boards.
We are totally on the same page. This was widely discussed here, and I think you'd like the paper that's coming out.
Well said 👍.
@saratpoluri have a look at the discussion at https://discourse.ros.org/t/rep-2008-rfc-ros-2-hardware-acceleration-architecture-and-conventions/22026 and let me know your thoughts. I'd be interested to hear them and discuss things.
Fulfilled the second item and disclosed results at https://news.accelerationrobotics.com/hardware-accelerated-ros2-pipelines/.
Paper released: https://arxiv.org/pdf/2205.03929.pdf!