perf(nebula): optimize AT128 decoder performance #37
c00658e to 5f1fc29
Get rid of the temporary decoding pointcloud and decode packets directly into scan_pc_ and overflow_pc_. Further, get rid of the unused copy of the incoming packets. This results in fewer copy operations and around a 23% speedup of ReceiveScanMsgCallback() with no subscribers on the published pointclouds. Additionally, scan_pc_ and overflow_pc_ are now std::swap'd instead of de-/reallocated upon scan completion. This eliminates heap allocations. The duplicate sin/cos lookup tables for elevation/azimuth have been combined, which improved cache performance slightly. Signed-off-by: Maximilian Schmeller <[email protected]>
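To illustrate the lookup-table change, here is a minimal sketch of one shared table used for both azimuth and elevation. It is not the code from this PR: `SinCos`, `make_sincos_table` and the interleaved sin/cos layout are assumptions made for illustration, and the table resolution is a free parameter.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical entry type: sine and cosine of the same angle sit next to
// each other, so one lookup touches one cache line instead of two tables.
struct SinCos
{
  float sin_v;
  float cos_v;
};

// Build a single table with `steps` entries covering [0, 2*pi).
// Both azimuth and elevation indices can be looked up in this one table.
std::vector<SinCos> make_sincos_table(std::size_t steps)
{
  assert(steps > 0);
  std::vector<SinCos> table(steps);
  const double two_pi = 2.0 * std::acos(-1.0);
  for (std::size_t i = 0; i < steps; ++i) {
    const double angle = two_pi * static_cast<double>(i) / static_cast<double>(steps);
    table[i].sin_v = static_cast<float>(std::sin(angle));
    table[i].cos_v = static_cast<float>(std::cos(angle));
  }
  return table;
}
```

A decoder would build the table once and index it with the integer angle of each point, for example `table[azimuth_index].cos_v * table[elevation_index].cos_v`.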
This commit eases performance validation of Hesai decoders by logging the duration of ReceiveScanMsgCallback() and adding measurement runner and plotting scripts. `profiling_runner.bash` is configurable via the command line to use an arbitrary set of CPU cores, CPU frequency, sensor model, rosbag, number of repetitions, etc. The `plot_times.py` tool takes scenario names as console arguments. Usage of those tools is now documented in `README.md`. Signed-off-by: Maximilian Schmeller <[email protected]>
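As a rough illustration of how a callback duration can be measured, the following sketch wraps an arbitrary callable in a monotonic-clock measurement. The helper name `time_ns` is hypothetical and not part of this PR; how the value is logged is left out.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical helper: runs `callback` once and returns the elapsed
// wall-clock time in nanoseconds, measured with a monotonic clock.
template <typename F>
std::int64_t time_ns(F && callback)
{
  const auto start = std::chrono::steady_clock::now();
  std::forward<F>(callback)();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
```

`std::chrono::steady_clock` is used because it never jumps backwards, which matters when many short durations are collected and compared across runs.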
To support on-the-fly switching to dual return mode, the scan_pc_ and overflow_pc_ buffers have to reserve that capacity beforehand. This commit allocates the maximum number of points (2 * 1200 * 128) on construction of the decoder. Moved performance measurement explanation to the end of README.md.
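A minimal sketch of the up-front reservation described above, with `std::vector` standing in for the point cloud type. `Point`, `kMaxPoints` and `make_scan_buffer` are hypothetical names; only the product 2 * 1200 * 128 comes from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Maximum number of points per scan as stated in the commit message.
constexpr std::size_t kMaxPoints = 2 * 1200 * 128;  // 307200

// Hypothetical stand-in for the decoder's point type.
struct Point
{
  float x, y, z, intensity;
};

// Reserve the worst-case capacity once, so that later push_back calls
// (including after a switch to dual return mode) never reallocate.
std::vector<Point> make_scan_buffer()
{
  std::vector<Point> buffer;
  buffer.reserve(kMaxPoints);
  assert(buffer.capacity() >= kMaxPoints);
  return buffer;
}
```

The buffer starts empty; only its capacity is set, so the reported point count is unaffected.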
9b07a9f to f20d7f4
Codecov Report
@@            Coverage Diff             @@
##             main      #37      +/-   ##
==========================================
+ Coverage   13.38%   21.55%   +8.17%
==========================================
  Files         111       48      -63
  Lines       10991     6313    -4678
  Branches     1727     1664      -63
==========================================
- Hits         1471     1361     -110
+ Misses       8339     3800    -4539
+ Partials     1181     1152      -29
@mojomex can you please attend to the spelling errors on the spell-check job?
@amc-nu Most of the warnings were the names of Bash commands, added them to the dictionary.
PR Type
Related Links
--
Description
This PR improves performance of the Hesai AT128 decoder by approx. 23% without changing the logical output.
To achieve this, three changes have been made:
Decoding directly into `scan_pc_` and `overflow_pc_` instead of temporary buffers and re-using the buffers (~11 percentage point improvement)

Points were decoded in `convert(block_id)` to a temporary PointCloud, the contents of which were then copied to `scan_pc_` or `overflow_pc_` respectively. This caused avoidable latency. Now, depending on whether the current block being decoded is in a new frame/scan or not, `convert(block_id, pcl)` decodes those points directly into the right buffer.

Further, `overflow_pc_` was reallocated on each full scan, causing avoidable heap allocations.

Because each incoming `PandarScan` message always contains a full scan (albeit not necessarily aligned to our phase setting), `scan_decoder->get_pointcloud()` will be called exactly once in `ConvertScanToPointCloud`. Thus, it is guaranteed that one of `scan_pc_` and `overflow_pc_` can always be used for decoding, and the other as the output buffer.

Once a scan is complete, the buffers are swapped and `overflow_pc_` is `clear`ed (not reallocated).

No changes have been made to the base class; the new function is an overloaded version of the existing one (which has been changed to throw a "not implemented" error).
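The buffer handling described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the decoder's actual code: `std::vector` stands in for the point cloud type, and `DoubleBufferedDecoder`, `decode`, `completed_scan`, `begin_next_scan` and `overflow_capacity` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for the decoder's point type.
struct Point
{
  float x, y, z;
};
using Cloud = std::vector<Point>;

class DoubleBufferedDecoder
{
public:
  explicit DoubleBufferedDecoder(std::size_t max_points)
  {
    scan_pc_.reserve(max_points);
    overflow_pc_.reserve(max_points);
  }

  // Write a decoded point straight into the buffer it belongs to,
  // without going through a temporary cloud.
  void decode(const Point & point, bool belongs_to_next_scan)
  {
    Cloud & target = belongs_to_next_scan ? overflow_pc_ : scan_pc_;
    assert(target.size() < target.capacity());  // push_back must not reallocate
    target.push_back(point);
  }

  // The finished scan; valid until begin_next_scan() is called.
  const Cloud & completed_scan() const { return scan_pc_; }

  // After the finished scan has been consumed: the overflow points become
  // the start of the new scan, and the old scan's storage is kept for reuse.
  void begin_next_scan()
  {
    std::swap(scan_pc_, overflow_pc_);
    overflow_pc_.clear();  // size becomes 0, capacity is retained
  }

  std::size_t overflow_capacity() const { return overflow_pc_.capacity(); }

private:
  Cloud scan_pc_;
  Cloud overflow_pc_;
};
```

Swapping two vectors only exchanges their internal pointers, and `clear()` keeps the allocated storage, so no heap allocation happens at a scan boundary in this sketch.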
All measurements have been made on an Intel Core i7-12700H with `taskset` to core 2.
Measurements were repeated 3 times and ran for 37s each (~370 iterations).
Review Procedure
To confirm logical equality with the previous implementation, a rosbag of recorded AT128 packets can be used.
To confirm performance improvement, I provide the scripts and logging that yielded the above charts:
The test runner locks/unlocks core frequencies and `taskset`s nebula onto the specified core(s). Ideally, the taskset cores are isolated (cf. Autoware performance measurement guide).
Remarks
Pre-Review Checklist for the PR Author
PR Author should check the checkboxes below when creating the PR.
Checklist for the PR Reviewer
Reviewers should check the checkboxes below before approval.
Post-Review Checklist for the PR Author
PR Author should check the checkboxes below before merging.
CI Checks