
Issue with using gl-based processors #9334

Closed
dgrnbrg opened this issue Jul 4, 2021 · 13 comments

Comments

@dgrnbrg

dgrnbrg commented Jul 4, 2021


Required Info
Camera Model: D435i
Firmware Version: 5.12.12.100
Operating System & Version: Raspbian 10 (Buster)
Kernel Version (Linux Only): 5.10.17-v71+
Platform: Raspberry Pi
SDK Version: 2.45.0
Language: C++
Segment: Robot

Issue Description

Hi, I'm using realsense-ros with a D435i on a Raspberry Pi for SLAM, object detection, and mapping. I've been running into major performance issues with realsense-ros (see IntelRealSense/realsense-ros#1929 for the debugging so far), which I've determined are actually due to rs2::align being too slow on the ARMv7 architecture. I tried changing realsense-ros to use rs2::gl::align instead to improve performance; however, the OpenGL-based processing block is failing to initialize correctly. I'm looking for help in understanding what the issue is, or even how to debug it. At this point, I believe I've got the system to a point where I can see in a debugger where the GLSL-based aligner segfaults (the backtrace is included below). Here's what I've done so far:

First, I rebuilt librealsense2 with -DBUILD_GLSL_EXTENSIONS=true. Then, in realsense-ros, I added #include <librealsense2-gl/rs_processing_gl.hpp> to realsense_node_factory.h and changed base_realsense_node.cpp to use rs2::gl::align (a rough sketch of that substitution appears after the patch below). I didn't see a performance increase, but I determined that this is probably due to the "backup" node structure in rs-gl.cpp in librealsense that automatically falls back to the CPU version, so I hacked that out with this patch:

--- a/src/gl/rs-gl.cpp
+++ b/src/gl/rs-gl.cpp
@@ -135,11 +135,12 @@ rs2_processing_block* rs2_gl_create_align(int api_version, rs2_stream to, rs2_er
 {
     verify_version_compatibility(api_version);
     auto block = std::make_shared<librealsense::gl::align_gl>(to);
-    auto backup = std::make_shared<librealsense::align>(to);
-    auto dual = std::make_shared<librealsense::gl::dual_processing_block>();
-    dual->add(block);
-    dual->add(backup);
-    return new rs2_processing_block { dual };
+    //auto backup = std::make_shared<librealsense::align>(to);
+    //auto dual = std::make_shared<librealsense::gl::dual_processing_block>();
+    //dual->add(block);
+    //dual->add(backup);
+    //return new rs2_processing_block { dual };
+    return new rs2_processing_block { block };
 }
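
For reference, the substitution in base_realsense_node.cpp amounts to roughly the following. This is only a sketch with illustrative names, not the actual realsense-ros code, and I believe (but am not certain) that rs2::gl::init_processing must be called once before the block is used:

#include <librealsense2/rs.hpp>
#include <librealsense2-gl/rs_processing_gl.hpp>

// Sketch only: names are illustrative, not the actual realsense-ros members.
void setup_align_block()
{
    // Assumption: the GL processing context has to be initialized once before
    // any rs2::gl::* block is constructed; without it the block's GPU
    // resources may never be created.
    rs2::gl::init_processing(true); // true = use GLSL

    // CPU version being replaced:
    // rs2::align align_to_color(RS2_STREAM_COLOR);

    // GLSL version; it keeps the rs2::align interface, so
    // align_to_color.process(frameset) is used the same way downstream.
    rs2::gl::align align_to_color(RS2_STREAM_COLOR);
    (void)align_to_color;
}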

I validated that rs-gl could run, which required some coaxing because the RPi 4 exposes OpenGL ES, which wasn't passing the OpenGL version check. I addressed this by exporting MESA_GL_VERSION_OVERRIDE=3.0 and MESA_GLSL_VERSION_OVERRIDE=130, after which rs-gl ran and displayed an image.

Next, I convinced roslaunch realsense2_camera rs_camera.launch align_depth:=true to start by adding those environment variables, as well as export LD_PRELOAD=/usr/local/lib/librealsense2-gl.so.2.45.

At this point, I was seeing an unexpected crash at startup in the nodelet manager, so I added launch-prefix="xterm -e gdb --args" to the crashing nodelet manager in order to get a backtrace of the crash site. I'm pretty sure that the RPi 4's GPU is capable of running the GLSL-based aligner, but I need help understanding why rs2::pointcloud::map_to (and the rs2::options::set_option it calls) is being invoked on a null object (this=0x0) from the align code.

#0  0xb6dec2b8 in rs2::options::set_option(rs2_option, float) const (this=0x0, value=30, option=RS2_OPTION_STREAM_FILTER)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_options.hpp:101
#1  0xb6dec2b8 in rs2::pointcloud::map_to(rs2::frame) (this=this@entry=0x0, mapped=...)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_processing.hpp:454
#2  0xb6de95b4 in librealsense::gl::align_gl::align_z_to_other(rs2::video_frame&, rs2::video_frame const&, rs2::video_stream_profile const&, float)
    (this=this@entry=0xa221f52c, aligned=..., depth=..., other_profile=..., z_scale=<optimized out>)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_frame.hpp:403
#3  0xb4f4a138 in librealsense::align::align_frames(rs2::video_frame&, rs2::video_frame const&, rs2::video_frame const&)
    (this=this@entry=0xa221f52c, aligned=..., from=..., to=...) at /home/pi/librealsense/src/proc/align.cpp:246
#4  0xb4f4c6ac in librealsense::align::process_frame(rs2::frame_source const&, rs2::frame const&) (this=0xa221f52c, source=..., f=...)
    at /home/pi/librealsense/src/proc/align.cpp:279
#5  0xb4f63b84 in librealsense::generic_processing_block::<lambda(rs2::frame, const rs2::frame_source&)>::operator() (__closure=0xa22ad9e4, source=..., f=...)
    at /home/pi/librealsense/src/proc/synthetic-stream.cpp:75
#6  0xb4f63b84 in rs2::frame_processor_callback<librealsense::generic_processing_block::generic_processing_block(char const*)::<lambda(rs2::frame, const rs2::frame_source&)> >::on_frame(rs2_frame *, rs2_source *) (this=0xa22ad9e0, f=<optimized out>, source=<optimized out>)
    at /home/pi/librealsense/build/../include/librealsense2/hpp/rs_processing.hpp:128
#7  0xb4f603f8 in librealsense::processing_block::invoke(librealsense::frame_holder) (this=0xa221f52c, f=...) at /home/pi/librealsense/src/proc/synthetic-stream.h:41
#8  0xb50c335c in rs2_process_frame(rs2_processing_block*, rs2_frame*, rs2_error**) (block=<optimized out>, frame=<optimized out>, error=error@entry=0x95bbd02c)
    at /home/pi/librealsense/src/core/streaming.h:139
#9  0xb6debbbc in rs2::processing_block::invoke(rs2::frame) const (f=..., this=0xa22b70d4)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_processing.hpp:303
#10 0xb6debbbc in rs2::filter::process(rs2::frame) const (this=0xa22b70d4, frame=...)
    at /home/pi/librealsense/src/gl/../../include/librealsense2/hpp/rs_processing.hpp:354
#11 0xaf5a6d10 in realsense2_camera::BaseRealSenseNode::frame_callback(rs2::frame) () at /home/pi/catkin_ws/devel/lib//librealsense2_camera.so
#12 0xaf5a8620 in rs2::frame_callback<realsense2_camera::BaseRealSenseNode::setupDevice()::{lambda(rs2::frame)#1}>::on_frame(rs2_frame*) ()
    at /home/pi/catkin_ws/devel/lib//librealsense2_camera.so
#13 0xb5111db0 in librealsense::frame_source::invoke_callback(librealsense::frame_holder) const (this=0xa2203c64, frame=...) at /home/pi/librealsense/src/source.cpp:125
#14 0xb4f5f7b8 in librealsense::synthetic_source::frame_ready(librealsense::frame_holder) (this=<optimized out>, result=...)
    at /home/pi/librealsense/src/core/streaming.h:147
#15 0xb4f6982c in librealsense::syncer_process_unit::<lambda(librealsense::frame_holder, librealsense::synthetic_source_interface*)>::operator()
    (__closure=0xa2223e64, __closure=0xa2223e64, frame=..., source=0xb4f6982c <librealsense::internal_frame_processor_callback<librealsense::syncer_process_unit::syncer_process_unit(std::initializer_list<std::shared_ptr<librealsense::bool_option> >, bool)::<lambda(librealsense::frame_holder, librealsense::synthetic_source_interface*)> >::on_frame(rs2_frame *, rs2_source *)+1172>) at /home/pi/librealsense/src/core/streaming.h:147
#16 0xb4f6982c in librealsense::internal_frame_processor_callback<librealsense::syncer_process_unit::syncer_process_unit(std::initializer_list<std::shared_ptr<librealsense::bool_option> >, bool)::<lambda(librealsense::frame_holder, librealsense::synthetic_source_interface*)> >::on_frame(rs2_frame *, rs2_source *)
    (this=0xa2223e60, f=<optimized out>, source=<optimized out>) at /home/pi/librealsense/src/core/processing.h:67
#17 0xb4f603f8 in librealsense::processing_block::invoke(librealsense::frame_holder) (this=0xa2203c18, f=...) at /home/pi/librealsense/src/proc/synthetic-stream.h:41
#18 0xb50c335c in rs2_process_frame(rs2_processing_block*, rs2_frame*, rs2_error**) (block=<optimized out>, frame=<optimized out>, error=0x95bbd61c)
    at /home/pi/librealsense/src/core/streaming.h:139
#19 0xaf5b568c in std::_Function_handler<void (rs2::frame), realsense2_camera::PipelineSyncer>::_M_invoke(std::_Any_data const&, rs2::frame&&) ()
    at /home/pi/catkin_ws/devel/lib//librealsense2_camera.so
#20 0xaf5b16c8 in rs2::frame_callback<std::function<void (rs2::frame)> >::on_frame(rs2_frame*) () at /home/pi/catkin_ws/devel/lib//librealsense2_camera.so
#21 0xb50f4d4c in librealsense::synthetic_sensor::<lambda(librealsense::frame_holder)>::operator() (__closure=0xa235ec34, f=...)
    at /usr/include/c++/8/bits/shared_ptr_base.h:1018
#22 0xb50f4d4c in librealsense::internal_frame_callback<librealsense::synthetic_sensor::start(librealsense::frame_callback_ptr)::<lambda(librealsense::frame_holder)> >::on_frame(rs2_frame *) (this=0xa235ec30, fref=<optimized out>) at /home/pi/librealsense/src/types.h:969
#23 0xb5111db0 in librealsense::frame_source::invoke_callback(librealsense::frame_holder) const (this=0x99a397b0, frame=...) at /home/pi/librealsense/src/source.cpp:125
#24 0xb4f5f7b8 in librealsense::synthetic_source::frame_ready(librealsense::frame_holder) (this=<optimized out>, result=...)
    at /home/pi/librealsense/src/core/streaming.h:147
#25 0xb50c15f8 in rs2_synthetic_frame_ready(rs2_source*, rs2_frame*, rs2_error**) (source=<optimized out>, frame=<optimized out>,
    frame@entry=0x99ab41a0, error=error@entry=0x95bbd890) at /home/pi/librealsense/src/core/streaming.h:147
#26 0xb4f63e34 in rs2::frame_source::frame_ready(rs2::frame) const (this=0x95bbd858, result=...)
    at /home/pi/librealsense/build/../include/librealsense2/hpp/rs_frame.hpp:590
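
For anyone trying to isolate this from ROS, I'd expect a minimal standalone harness along these lines to exercise the same code path. This is a sketch; in particular I'm not certain which overload of init_processing is available in 2.45, so treat that call as an assumption:

#include <librealsense2/rs.hpp>
#include <librealsense2-gl/rs_processing_gl.hpp>
#include <iostream>

int main() try
{
    // Assumption: an init_processing overload that does not need an existing
    // GLFW window; if that is not available, an offscreen GL context would
    // have to be created and passed in instead.
    rs2::gl::init_processing(true);

    rs2::config cfg;
    cfg.enable_stream(RS2_STREAM_DEPTH, 640, 480, RS2_FORMAT_Z16, 30);
    cfg.enable_stream(RS2_STREAM_COLOR, 640, 480, RS2_FORMAT_RGB8, 30);

    rs2::pipeline pipe;
    pipe.start(cfg);

    // Same block the nodelet would use after my patch (no CPU fallback).
    rs2::gl::align align_to_color(RS2_STREAM_COLOR);

    for (int i = 0; i < 100; ++i)
    {
        rs2::frameset frames = pipe.wait_for_frames();
        rs2::frameset aligned = align_to_color.process(frames);
        std::cout << "aligned depth timestamp: "
                  << aligned.get_depth_frame().get_timestamp() << std::endl;
    }
    return 0;
}
catch (const rs2::error& e)
{
    std::cerr << "RealSense error: " << e.what() << std::endl;
    return 1;
}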
@MartyG-RealSense
Collaborator

Hi @dgrnbrg The GLSL system does not provide a noticeable performance advantage when used on low-end computing devices. This is described in the documentation linked below, which covers the pros and cons of GLSL and when to use it.

#3654

It is therefore unlikely to be worthwhile to continue debugging it on your Raspberry Pi.

It can be less processing-intensive to use the rs2_project_color_pixel_to_depth_pixel function to convert a single pixel in the color frame to a depth pixel, instead of aligning the entire image.

#2948 (comment)
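
A rough sketch of that approach, with assumed names for the frames and for how the intrinsics / extrinsics and depth scale are gathered, might look like this:

#include <librealsense2/rs.hpp>
#include <librealsense2/rsutil.h>

// Sketch: 'frames' is a frameset from pipe.wait_for_frames(), and
// 'depth_scale' would come from rs2::depth_sensor::get_depth_scale().
void color_pixel_to_depth_pixel(const rs2::frameset& frames, float depth_scale)
{
    rs2::depth_frame depth = frames.get_depth_frame();
    rs2::video_frame color = frames.get_color_frame();

    auto depth_profile = depth.get_profile().as<rs2::video_stream_profile>();
    auto color_profile = color.get_profile().as<rs2::video_stream_profile>();

    rs2_intrinsics depth_intrin = depth_profile.get_intrinsics();
    rs2_intrinsics color_intrin = color_profile.get_intrinsics();
    rs2_extrinsics color_to_depth = color_profile.get_extrinsics_to(depth_profile);
    rs2_extrinsics depth_to_color = depth_profile.get_extrinsics_to(color_profile);

    float color_pixel[2] = { 320.f, 240.f }; // pixel of interest in the color image
    float depth_pixel[2] = { 0.f, 0.f };     // result: matching pixel in the depth image

    rs2_project_color_pixel_to_depth_pixel(
        depth_pixel,
        reinterpret_cast<const uint16_t*>(depth.get_data()),
        depth_scale,
        0.1f, 10.0f,                 // min / max depth to search, in meters
        &depth_intrin, &color_intrin,
        &color_to_depth, &depth_to_color,
        color_pixel);

    float distance = depth.get_distance(
        static_cast<int>(depth_pixel[0]), static_cast<int>(depth_pixel[1]));
    (void)distance;
}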

@dgrnbrg
Author

dgrnbrg commented Jul 5, 2021

Hi Marty, thank you for that advice about glsl on low end devices.

For the individual pixel projection, would that be useful if I need the entire aligned pair of images? Or is that specifically useful when I may only need specific depths? The software package I’m feeding this to downstream requires aligned depth and color images.

Alternatively, a few other ideas/questions:

  • Can I configure the D435i to downscale the color image in hardware if I don't require the full resolution?
  • Can I use a scaling approach instead of projection, and if so, do you have references on the downsides?
  • Do you think there's an opportunity to implement basic NEON enhancements to get more performance?
  • Is there a way to enable parallelism with OpenMP for the align loop on the CPU?

I know this is a lot of questions, but I am highly motivated to get the RealSense usable on a Raspberry Pi for real-time localization, using a nearby computer over wireless for the heavier computations. I just want the Pi to be able to stream enough information to accomplish that :)

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Jul 5, 2021

If you are aiming to transfer the camera data to another computer to perform the heavy computation, Intel's open-source networking system may be an appropriate solution, as it is based around using a Pi 4 as the remote computing device that the camera is attached to and a more powerful computer such as a laptop as the central host machine.

https://dev.intelrealsense.com/docs/open-source-ethernet-networking-for-intel-realsense-depth-cameras

The paper describing the networking system in the above link is based on ethernet cabling, but it states that a wi-fi connection could also be used.

Bear in mind though that there will be some limitations in supported resolution / FPS modes over a networking connection compared to accessing a camera directly.

https://dev.intelrealsense.com/docs/open-source-ethernet-networking-for-intel-realsense-depth-cameras#section-3-6-software-limitations
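
On the host side, connecting to a remote camera served by rs-server is, as far as I am aware, a small change to normal pipeline setup; something along these lines, with a placeholder IP address:

#include <librealsense2/rs.hpp>
#include <librealsense2-net/rs_net.hpp>
#include <iostream>

int main() try
{
    // Placeholder address of the Pi running rs-server.
    rs2::net_device dev("192.168.0.100");

    rs2::context ctx;
    dev.add_to(ctx);           // makes the remote camera visible to this context

    rs2::pipeline pipe(ctx);   // from here on, the pipeline is used as usual
    pipe.start();

    rs2::frameset frames = pipe.wait_for_frames();
    std::cout << "received " << frames.size() << " frames" << std::endl;
    return 0;
}
catch (const rs2::error& e)
{
    std::cerr << "RealSense error: " << e.what() << std::endl;
    return 1;
}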


In the recent RealSense SDK version 2.48.0 an example program for using GLSL with data from a networked remote camera was also described in the release notes for that version.

#8884

@dgrnbrg
Author

dgrnbrg commented Jul 5, 2021

Using the networking approach would be really good, except that I am also using a couple of cores on the Raspberry Pi to do some local processing, for when the WiFi is spotty. Can I use the networking system and simultaneously use the camera frames locally, or is it an either/or situation?

@MartyG-RealSense
Collaborator

A chart in the paper shows the CPU utilization across all cores for a selection of different configurations. It estimates around 50% utilization when running at VGA resolution (640x480) at 30 FPS, whilst using a lower resolution would reduce utilization further.

https://dev.intelrealsense.com/docs/open-source-ethernet-networking-for-intel-realsense-depth-cameras#section-5-2-power-consumption

@dgrnbrg
Author

dgrnbrg commented Jul 5, 2021

Thank you, trying VGA seems like the better choice.

I'd also like to know whether I can use rs-server and also connect locally. I need to get the IMU stream locally, and I also need to do some local processing on camera frames.

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Jul 5, 2021

I'm not certain about this question. My expectation with a non-networked application would normally be that if a particular stream type is accessed by a process ('claimed'), then another process could not access the same stream. This is in accordance with the rules of the SDK's multi-streaming model described in the link below.

https://github.com/IntelRealSense/librealsense/blob/master/doc/rs400_support.md#multi-streaming-model

A practical example of these principles is that if you enable the depth stream in the RealSense Viewer, then another application launched afterwards that requested the depth stream could not access it. The reverse is also true: if the Viewer was launched second, it could not access the depth stream if another application was already using it.

But if you had two cameras, a stream claimed on one camera would still be available on the other, because it is a specific stream on a specific camera that is claimed when that stream is enabled.

@dgrnbrg
Author

dgrnbrg commented Jul 5, 2021

In that case, I don't think that I'll be able to use the rs-server approach.

Given that, what about these following approaches to improve the performance:

  • Can I use a scaling approach instead of projection, and if so, do you have references on the downsides?
  • Do you think it would be feasible to implement basic NEON enhancements to get more performance?
  • Is there a way to enable parallelism with OpenMP for the align loop on the CPU?

If none of those are straightforward, do you know of another small SBC with documented success running RTAB-Map and object detection/tracking (for instance, the Up2, Jetson Nano, or a particular NUC)? My other concerns, if I change platforms, are being able to purchase it at all (thanks, chippageddon) and being able to fit it onto the robot chassis (i.e. it must be very small).

@MartyG-RealSense
Collaborator

MartyG-RealSense commented Jul 6, 2021

If you need the entire image to be aligned, then using alignment instead of converting a single pixel with rs2_project_color_pixel_to_depth_pixel is the appropriate approach. Alignment is a processing-intensive operation, so unless you have access to graphics acceleration (other than GLSL), it may be difficult to avoid slowdown on your Pi when aligning.

If you were able to replace the Raspberry Pi with an Nvidia Jetson Nano board, you would have access to the librealsense SDK's CUDA acceleration of alignment, pointclouds, and color conversion, thanks to the Nvidia GPU on Jetson boards (CUDA is an Nvidia-only feature). A Nano board is affordable in price and small in size at 70 x 45 mm, and Jetson boards are especially suited to vision computing and AI applications.

In regard to scaling down depth resolution, depth scene complexity can be reduced in post-processing by using a Decimation Filter. Post-processing takes place on the computing hardware instead of in the camera hardware, so there can be a CPU usage cost to doing so.

https://dev.intelrealsense.com/docs/post-processing-filters#section-decimation-filter
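
As a rough illustration, applying the Decimation Filter to each depth frame before alignment would look something like this (the magnitude value here is just an example):

#include <librealsense2/rs.hpp>

// Sketch: 'pipe' is an already-started rs2::pipeline.
void process_one_frame(rs2::pipeline& pipe)
{
    rs2::decimation_filter dec;
    dec.set_option(RS2_OPTION_FILTER_MAGNITUDE, 2); // 2 halves width and height

    rs2::frameset frames = pipe.wait_for_frames();
    rs2::depth_frame depth = frames.get_depth_frame();

    // The decimated frame is what would then be passed on to alignment,
    // reducing the number of pixels the align block has to process.
    rs2::frame decimated = dec.process(depth);
    (void)decimated;
}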

I do not have knowledge about achieving enhancement using the NEON architecture of the Pi's Arm CPU. The OpenVINO Toolkit vision computing platform, which is compatible with Raspberry Pi, does have optimizations for NEON though.

https://medium.com/sclable/intel-openvino-with-opencv-f5ad03363a38

A guide about RealSense installation on Raspberry Pi at the link below affirms that the CMake build flag -DBUILD_WITH_OPENMP=ON can be used on Pi 4 to enable usage of multiple cores in librealsense.

https://github.com/NobuoTsukamoto/realsense_examples/blob/master/doc/installation_raspberry_pi_64.md#download-source-and-build

The official SDK notes about the OpenMP flag state: "When enabled, YUY to RGB conversion and Depth-Color spatial alignment will take advantage of multiple-cores using OpenMP. This can reduce latency at expense of greater CPU utilization".

https://dev.intelrealsense.com/docs/build-configuration

@dgrnbrg
Author

dgrnbrg commented Jul 6, 2021

@MartyG-RealSense, thank you for this wealth of information.

One last question before I buy some more hardware and modify some builds: do you have any references for the framerates folks have achieved with realsense-ros on the Nano? I'm also looking at the LattePanda Alpha, which seems leagues faster and may be a more surefire choice.

@MartyG-RealSense
Collaborator

You should be able to achieve 30 FPS on a Nano, as demonstrated in the tutorial article in the link below.

https://rahulvishwakarma.wordpress.com/2020/01/25/visualize-rgbd-using-realsense-d435i-rosmelodic-on-jetson-nano-dev-kit/


I say 'should' because of the number of factors in ROS that can affect performance.

@MartyG-RealSense
Collaborator

Hi @dgrnbrg Do you require further assistance with this case, please? Thanks!

@MartyG-RealSense
Collaborator

Case closed due to no further comments received.
