Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(hesai_hw_interface): add error handling #131

Merged
merged 12 commits into from
May 21, 2024

Conversation

mojomex
Copy link
Collaborator

@mojomex mojomex commented Mar 28, 2024

PR Type

  • Improvement
  • Bug Fix

Related Links

This PR will not fix this issue but at least throw an exception with a detailed error message instead of a segfault, making future similar bugs much easier to spot:

Description

This PR introduces error checking and handling in hesai_hw_interface.cpp. The following features have been added:

  • check PTC command error codes and pass them to the calling function instead of the payload in case of error
  • throw std::runtime_errors in case an error occurs
  • perform size checks before parsing TCP responses
  • emit errors on too-large payload size
  • remove all fire-and-forget type requests, ALWAYS check the error code

The following other things have changed as a result of the fix:

  • structs in hesai_cmd_response.hpp are now parsed via memcpy, and use boost::endian buffers in the sizes specified by the datasheet(s)
  • a utility class expected modeled after C++23 std::expected was added to make error handling code more compact
  • added FIXMEs in places that are wrong but not high priority (do not cause crashes etc.)

Behavior for TCP commands is now as follows:

  • if any error occurs, a std::runtime_error is thrown with a detailed list of error messages
  • for some structs that are wrongly parsed in some cases (e.g. HesaiPtpConfig):
    • a std::runtime_error is thrown if the payload is too short to memcpy into the struct
    • an error is PrintErrored if the payload is too large. This is done because the first few fields of the struct are typically correct, even if the format of later fields differs

Review Procedure

Test TCP with real sensors (at least AT128, OT128 and an older sensor).

Remarks

We should refactor TCP communication in a similar way that was done with mixins in #104. This would make the sensor definition much more readable: one could read the sensor definition alongside the TCP documentation, without boilerplate code. This would provide much better maintainability, and, for new sensors, forces the developer to check every TCP command for changes when implementing.

Pre-Review Checklist for the PR Author

PR Author should check the checkboxes below when creating the PR.

  • Assign PR to reviewer

Checklist for the PR Reviewer

Reviewers should check the checkboxes below before approval.

  • Commits are properly organized and messages are according to the guideline
  • (Optional) Unit tests have been written for new behavior
  • PR title describes the changes

Post-Review Checklist for the PR Author

PR Author should check the checkboxes below before merging.

  • All open points are addressed and tracked via issues or tickets

CI Checks

  • Build and test for PR: Required to pass before the merge.

* check PTC command error codes, throw exception if necessary
* perform size checks before parsing responses
* emit errors on too-large payload size
@mojomex
Copy link
Collaborator Author

mojomex commented Mar 28, 2024

(I still have to test this on real sensors)

@mojomex mojomex self-assigned this Mar 28, 2024
@mojomex mojomex marked this pull request as ready for review March 28, 2024 11:37
@mojomex mojomex requested review from amc-nu and drwnz March 28, 2024 11:37
@amc-nu
Copy link
Contributor

amc-nu commented Apr 2, 2024

Test Result Pandar128E4X

@mojomex @drwnz

The driver works with an actual sensor. Point cloud looks normal

Issues:

  • I see some output marked as error:

    • [hesai_hw_monitor_hesai]: HesaiInventory from Sensor has unknown format. Will parse anyway.
    • sn: XXXX, date_of_manufacture: XXXX-XX-XX, mac: ec:9f:XX:XX:XX:XX, sw_ver: 1.3.19, hw_ver: , control_fw_ver: 1.3.10, sensor_fw_ver: 1.3.13, angle_offset: 8169, model: *, motor_type: , num_of_lines: �, 4f 54 31 32 38 2d 42 30 31 00 00
      (SN and MAC hidden on purpose)
  • The diagnostics page reports 1800V input, might need some check.

[hesai_hw_interface_ros_wrapper_node-2] Starting UDP server on: SensorModel: Pandar128_E4X_OT, FrameID: hesai, HostIP: 255.255.255.255, SensorIP: 192.168.1.201, DataPort: 940, ReturnMode: Dual, Frequency: 0, MTU: 5dc, Use sensor time: 0, GnssPort: 941, ScanPhase:0, RotationSpeed:258, FOV(Start):0, FOV(End):168, DualReturnDistanceThreshold:0.1, PtpProfile:IEEE_1588v2, PtpDomain:0, PtpTransportType:UDP/IP, PtpSwitchType:TSN
[hesai_hw_monitor_ros_wrapper_node-3] 192.168.1.201:9347
[hesai_hw_monitor_ros_wrapper_node-3] [INFO 1712038423.862968919] [TcpSocket::open]: connected (open() at /home/ne0/nebula_ws/src/transport_drivers/tcp_driver/src/tcp_socket.cpp:531)
[hesai_hw_monitor_ros_wrapper_node-3] [ERROR 1712038423.863656262] [hesai_hw_monitor_hesai]: HesaiInventory from Sensor has unknown format. Will parse anyway. (PrintError() at /home/ne0/nebula_ws/src/nebula/nebula_hw_interfaces/src/nebula_hesai_hw_interfaces/hesai_hw_interface.cpp:1331)
[hesai_hw_monitor_ros_wrapper_node-3] HesaiInventory
Parameter Screenshot
Voltage image
Temperature image
RPM image
No PTP image
PTP image

@amc-nu
Copy link
Contributor

amc-nu commented Apr 2, 2024

Test PandarXT32

The driver works with an actual sensor. Pointcloud looks normal.

  • Input current/power 0.0
[hesai_hw_interface_ros_wrapper_node-2] Starting UDP server on: SensorModel: PandarXT32, FrameID: hesai, HostIP: 255.255.255.255, SensorIP: 192.168.1.201, DataPort: 940, ReturnMode: Dual, Frequency: 0, MTU: 5dc, Use sensor time: 0, GnssPort: 941, ScanPhase:0, RotationSpeed:258, FOV(Start):0, FOV(End):168, DualReturnDistanceThreshold:0.1, PtpProfile:IEEE_1588v2, PtpDomain:0, PtpTransportType:UDP/IP, PtpSwitchType:TSN
[hesai_hw_monitor_ros_wrapper_node-3] 192.168.1.201:9347
[hesai_hw_monitor_ros_wrapper_node-3] [INFO 1712039923.485050914] [TcpSocket::open]: connected (open() at /home/ne0/nebula_ws/src/transport_drivers/tcp_driver/src/tcp_socket.cpp:531)
[hesai_hw_monitor_ros_wrapper_node-3] HesaiInventory
[hesai_hw_monitor_ros_wrapper_node-3] sn: XT3XXXXXXXX, date_of_manufacture: XXXX-XX-XX, mac: ec:9f:XX:XX:XX:XX, sw_ver: 0.1.25, hw_ver: 1.0.0, control_fw_ver: 1.1.13, sensor_fw_ver:   1.2.25, angle_offset: 1048, model: , motor_type: , num_of_lines:  , 00 00 00 00 00 00 00 00 00 00 00
Parameter Screenshot
No PTP image
PTP image
RPM image
Temperature image
Voltage image

@mojomex
Copy link
Collaborator Author

mojomex commented Apr 8, 2024

@amc-nu

I see some output marked as error:
[hesai_hw_monitor_hesai]: HesaiInventory from Sensor has unknown format. Will parse anyway.
sn: XXXX, date_of_manufacture: XXXX-XX-XX, mac: ec:9f:XX:XX:XX:XX, sw_ver: 1.3.19, hw_ver: , control_fw_ver: 1.3.10, sensor_fw_ver: 1.3.13, angle_offset: 8169, model: *, motor_type: , num_of_lines: �, 4f 54 31 32 38 2d 42 30 31 00 00
(SN and MAC hidden on purpose)

The actual parsing is the same in this PR as it was before. However, OT128 (and to some extent AT128, possibly others) have slightly different struct definitions from e.g. XT32. As it was not an issue until now, I opted to print an error that we are parsing the wrong data for some fields, rather than doing a proper refactoring at this point.

What has changed however, is that payload size is now checked before parsing into the structs is attempted.

@mojomex
Copy link
Collaborator Author

mojomex commented Apr 8, 2024

@amc-nu

The diagnostics page reports 1800V input, might need some check.

OT128's TCP API doc lists the input_voltage and input_current fields in a different order than the HesaiLidarMonitor struct at hesai_cmd_response.hpp:480. I think this issue already existed, as I did not change that struct. I do not have access to all TCP API docs of Hesai sensors so I cannot confirm at this point if it is in fact only different for OT128.

I have added FIXMEs to all structs where I found the API docs I do have (OT128 and AT128) to differ from the implementation.

Do you think it is best to add an exception if (sensor_model == OT128) to the parser?

@mojomex
Copy link
Collaborator Author

mojomex commented Apr 8, 2024

sn: XXXX, date_of_manufacture: XXXX-XX-XX, mac: ec:9f:XX:XX:XX:XX, sw_ver: 1.3.19, hw_ver: , control_fw_ver: 1.3.10, sensor_fw_ver: 1.3.13, angle_offset: 8169, model: *, motor_type: , num_of_lines: �, 4f 54 31 32 38 2d 42 30 31 00 00

Indeed, the output had some other errors as well: I forgot to reset the output modifiers (std::hex etc.) after printing MAC addresses etc. and printed some uint8 numbers as characters instead of numbers. Fixed in e5d0a9c.

Output now:

sn: XXXX, date_of_manufacture: 2023-11-23, mac: ec:xx:xx:xx:xx, sw_ver: 1.4.01, hw_ver: , control_fw_ver: 1.4.1, sensor_fw_ver: 1.4.1, angle_offset: 19817, model: 42, motor_type: 1, num_of_lines: 128, 4f 54 31 32 38 2d 42 30 31 00 00

@mojomex mojomex force-pushed the hesai-hw-interface-error-handling branch from e5d0a9c to decf1ac Compare April 8, 2024 07:32
@mojomex mojomex force-pushed the hesai-hw-interface-error-handling branch from 19e8101 to 4f022cd Compare April 18, 2024 07:50
@drwnz drwnz changed the base branch from main to develop May 17, 2024 04:01
Copy link
Collaborator

@drwnz drwnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@mojomex mojomex merged commit b8b25a6 into tier4:develop May 21, 2024
6 checks passed
@mojomex mojomex deleted the hesai-hw-interface-error-handling branch May 21, 2024 07:52
@mojomex mojomex mentioned this pull request Jul 18, 2024
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants