Implement RTMDet to Perception Pipeline #7235

Open
8 of 10 tasks
StepTurtle opened this issue Jun 3, 2024 · 18 comments

@StepTurtle (Contributor) commented Jun 3, 2024

Checklist

  • I've read the contribution guidelines.
  • I've searched other issues and no duplicate issues were found.
  • I've agreed with the maintainers that I can plan this task.

Description

We plan to add the RTMDet model alongside the existing YOLOX model in Autoware Universe. While the YOLOX model performs well on the bounding box task, its instance segmentation head is weak. We aim to improve instance segmentation results by adding the RTMDet model.


Purpose

Our goal is to enhance the lidar-image fusion pipeline by adding the RTMDet model to Autoware for image segmentation.

Possible approaches

We can convert the pre-trained PyTorch models to ONNX and TensorRT formats, and then create a ROS 2 package in the Autoware Universe perception stack that runs the TensorRT models.
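
For illustration, the ONNX-to-TensorRT conversion step could look roughly like the sketch below, which uses the TensorRT C++ API to build an engine from an exported ONNX file. File names and the fp16 choice are placeholders, not the final design; mmdeploy tools or `trtexec` could be used instead.

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <memory>

// Minimal logger required by the TensorRT builder.
class Logger : public nvinfer1::ILogger
{
  void log(Severity severity, const char * msg) noexcept override
  {
    if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
  }
};

int main()
{
  Logger logger;
  // Note: RTMDet's custom TensorRT plugin may need to be loaded/registered before parsing.
  auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
  auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(
    1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
  auto parser =
    std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));

  // Parse the exported ONNX model (the path is a placeholder).
  if (!parser->parseFromFile(
        "rtmdet-ins-s.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
    std::cerr << "Failed to parse ONNX model" << std::endl;
    return 1;
  }

  auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
  config->setFlag(nvinfer1::BuilderFlag::kFP16);  // fp16 is one of the precision options

  // Build and serialize the engine, then write it to disk.
  auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
    builder->buildSerializedNetwork(*network, *config));
  std::ofstream out("rtmdet-ins-s.engine", std::ios::binary);
  out.write(static_cast<const char *>(serialized->data()), serialized->size());
  return 0;
}
```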

Definition of done

  • Run detection with the pre-trained PyTorch models
  • Convert the pre-trained PyTorch models to ONNX and TensorRT formats
  • Deploy the TensorRT model with Python
  • Deploy the TensorRT model with C++
  • Create a ROS 2 package in Autoware Universe which uses the ONNX models
  • Compare RTMDet results with YOLOX segmentation results
  • Decide how to use the RTMDet segmentation results in the camera-lidar fusion pipeline
StepTurtle added the component:perception label Jun 3, 2024
StepTurtle self-assigned this Jun 3, 2024
@StepTurtle (Contributor, Author) commented Jun 3, 2024

Here are the results of the pre-trained models shared in this link from mmdetection.

Results:

  • I used the PyTorch models (.pth files) to get these results.

| Model | Score Threshold | NMS Threshold | Detection Time Per Image | Video Link |
|---|---|---|---|---|
| RTMDet-Ins-s | 0.3 | 0.3 | ~20 ms | Video Link |
| RTMDet-Ins-x | 0.3 | 0.3 | ~33 ms | Video Link |

So far, I have tested the pre-trained models shared by mmdetection using the mmdetection tools. They also provide tools to convert the .pth models to .onnx and .engine formats. I checked the outputs of both converted models and they look the same, so I think we can say their tools convert the models correctly.

Right now I am working out how to use the TensorRT models in C++ with the TensorRT libraries.
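
As a rough sketch of that direction (not final code), deserializing an engine and running inference with the TensorRT C++ API looks roughly like this; the tensor shapes, buffer layout, and file names below are placeholders:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <vector>

class Logger : public nvinfer1::ILogger
{
  void log(Severity severity, const char * msg) noexcept override
  {
    if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
  }
};

int main()
{
  // Read the serialized engine from disk (path is a placeholder).
  std::ifstream file("rtmdet-ins-s.engine", std::ios::binary);
  std::vector<char> engine_data(
    (std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

  Logger logger;
  // Note: RTMDet's custom TensorRT plugin must be loaded (e.g. with dlopen)
  // before deserializing the engine.
  auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
  auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
    runtime->deserializeCudaEngine(engine_data.data(), engine_data.size()));
  auto context =
    std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext());

  // Allocate device buffers for one input and one output (sizes are placeholders;
  // the real model has several output tensors for boxes, labels and masks).
  void * buffers[2];
  cudaMalloc(&buffers[0], 1 * 3 * 640 * 640 * sizeof(float));
  cudaMalloc(&buffers[1], 1 * 100 * 6 * sizeof(float));

  // ... copy the preprocessed image into buffers[0] with cudaMemcpy ...

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  context->enqueueV2(buffers, stream, nullptr);  // run inference asynchronously
  cudaStreamSynchronize(stream);

  // ... copy the results back with cudaMemcpy and post-process on the host ...

  cudaFree(buffers[0]);
  cudaFree(buffers[1]);
  cudaStreamDestroy(stream);
  return 0;
}
```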

@StepTurtle (Contributor, Author) commented Jun 5, 2024

I deployed the TensorRT engine in Python and got consistent results.

| Model | Score Threshold | NMS Threshold | Detection Time Per Image | Video Link |
|---|---|---|---|---|
| RTMDet-Ins-s (TensorRT) | 0.3 | NO NMS | ~20 ms | Video Link |
| RTMDet-Ins-x (TensorRT) | 0.3 | NO NMS | ~36 ms | Video Link |

Warning

In some parts of the video, you may see incorrect class names. For example, you might see both truck and car class names assigned to a vehicle. This is because I didn't run NMS when I deployed it in Python. I plan to fix this when I deploy it in C++.


So, my next step is to do the same thing in C++.

Also, the detection times look a bit higher than expected. I don't understand the reason yet, but I am working on it.

xmfcx changed the title from "Implament RTMDet to Perception Pipeline" to "Implement RTMDet to Perception Pipeline" Jun 11, 2024
@StepTurtle (Contributor, Author) commented Jun 11, 2024

I am sharing the results from the TensorRT deployment in C++, but currently there are some issues with the results.

I cannot see exactly the same results as with Python deployment. When I check the bounding box and score results, everything appears to be the same as in Python. However, when I check the labels, the results are very different. I am currently trying to resolve this issue.

| Model | Score Threshold | NMS Threshold | Detection Time Per Image | Video Link |
|---|---|---|---|---|
| RTMDet-Ins-s (TensorRT) | 0.3 | NO NMS | ~8 ms | Video Link |
| RTMDet-Ins-x (TensorRT) | 0.3 | NO NMS | ~22 ms | Video Link |

You can find the scripts I used for deployment at these links:

@StepTurtle (Contributor, Author) commented:

Fatih suggested that I:

  • check the YOLOX detection time and compare it with RTMDet.
  • test with datasets and compare scores, bounding boxes, and so on against YOLOX.

@StepTurtle (Contributor, Author) commented Jun 25, 2024

I just started to deploy RTMDet to ROS 2. Here is the repository: https://github.com/leo-drive/tensorrt_rtmdet/

  • Right now I get results similar to my previous work.
  • Unlike my previous work, the package converts the ONNX model to a TensorRT model on the first run, so you do not need to provide a TensorRT model; the ONNX model is enough.

| Model | Score Threshold | NMS Threshold | Video Link |
|---|---|---|---|
| RTMDet-Ins-s | 0.3 | NO NMS | Video Link |
| RTMDet-Ins-x | 0.3 | NO NMS | Video Link |

I plan to complete the porting to ROS 2 by performing the following steps:

  • RTMDet uses a custom TensorRT plugin, which is currently loaded with dlopen(); find a better way.
  • For now, preprocessing runs on the CPU (with OpenCV); move it to CUDA.
  • Post-processing also runs on the CPU; try to move it to CUDA.
  • There is no NMS right now; add it.
  • There are a lot of hard-coded parts right now; remove them.
  • There are three precision options for converting the ONNX model to a TensorRT model (fp16, fp32 and int8); int8 is not working.

@StepTurtle (Contributor, Author) commented Jul 8, 2024

CUDA preprocessing is OK

Preprocess Time Comparison per Image:

| Device | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| CPU | 6.87286 | 1.58603 |
| CUDA | 1.61101 | 0.470295 |
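
For illustration only, GPU preprocessing can be done along these lines with OpenCV's CUDA module (a sketch that assumes OpenCV is built with CUDA support; this is not necessarily how the package implements it):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>

// Resize and convert an image on the GPU instead of the CPU.
cv::cuda::GpuMat preprocess_gpu(const cv::Mat & bgr, int width, int height)
{
  cv::cuda::GpuMat gpu_in, gpu_resized, gpu_float;
  gpu_in.upload(bgr);                                // host -> device copy
  cv::cuda::resize(gpu_in, gpu_resized, cv::Size(width, height));
  gpu_resized.convertTo(gpu_float, CV_32FC3, 1.0);   // convert to float
  // Mean/std normalization would follow here (e.g. cv::cuda::subtract / divide),
  // using the values the model was trained with.
  return gpu_float;
}
```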

NMS is added

When I checked the model, I saw that the NMS algorithm built into the model was already working. Despite NMS working, the reason we saw overlapping boxes is that the detected objects belonged to different classes. To prevent this, I added a simple NMS step that, for overlapping detections, keeps only the one with the higher score regardless of class.

Video Link https://youtu.be/nBAshLqQ1-k
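
For reference, the added NMS step works along these lines (a simplified sketch with a hypothetical Detection struct, not the exact code from the package):

```cpp
#include <algorithm>
#include <vector>
#include <opencv2/core.hpp>

struct Detection
{
  cv::Rect box;
  float score;
  int label;
};

// Keep only the higher-scoring detection when two detections overlap,
// regardless of their class labels.
std::vector<Detection> cross_class_nms(std::vector<Detection> dets, float iou_threshold)
{
  std::sort(dets.begin(), dets.end(),
            [](const Detection & a, const Detection & b) { return a.score > b.score; });

  std::vector<Detection> kept;
  for (const auto & det : dets) {
    bool suppressed = false;
    for (const auto & k : kept) {
      const float inter = (det.box & k.box).area();
      const float uni = det.box.area() + k.box.area() - inter;
      if (uni > 0.f && inter / uni > iou_threshold) {
        suppressed = true;  // overlaps a higher-scoring detection
        break;
      }
    }
    if (!suppressed) kept.push_back(det);
  }
  return kept;
}
```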

NMS Time per Image

| | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| NMS | 0.00172242 | 0.000780426 |

Batch sizes greater than 1 are OK

  • Batching wasn't used before, but it can be used now.
  • Documentation is still needed.
Video Link https://youtu.be/hUfK5M4S7Qo

Hard-Coded Parts, Launch and Config Files

  • I removed the hard-coded parts and created example launch and config files.
  • A multi-camera launch file is still needed.

🟨 Plugin Loading

Autoware has the same logic that I used to load the TensorRT plugin.

```cpp
for (const auto & plugin_path : plugin_paths) {
    int32_t flags{RTLD_LAZY};
    void * handle = dlopen(plugin_path.c_str(), flags);  // plugin_path is '/path/to/plugin.so'
    if (!handle) {
      logger_.log(nvinfer1::ILogger::Severity::kERROR, "Could not load plugin library");
    }
}
```

After compilation, a file with the '.so' extension is created. This file is stored in the build folder, and its path must be passed to dlopen().

Is there a way to handle this in CMake? If not, how can I provide the path to the file located inside the 'build' folder? (One possible approach is sketched after the list of paths below.)

I was able to load the plugin using the file paths below:

  • ./build/tensorrt_rtmdet/libtensorrt_rtmdet_plugin.so (relative path from workspace)
  • /home/user/projects/workspace/build/tensorrt_rtmdet/libtensorrt_rtmdet_plugin.so (absolute path)
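
One possible direction (an untested assumption on my side): install the plugin library into the package's lib directory in CMakeLists.txt, e.g. with `install(TARGETS tensorrt_rtmdet_plugin LIBRARY DESTINATION lib)`, and resolve the path at runtime through ament_index_cpp instead of hard-coding the build path:

```cpp
#include <string>
#include <dlfcn.h>
#include <ament_index_cpp/get_package_prefix.hpp>

// Resolve the plugin library from the package's install prefix instead of the
// build directory. This assumes CMakeLists.txt installs the plugin target to
// the package's lib directory.
void * load_rtmdet_plugin()
{
  const std::string plugin_path =
    ament_index_cpp::get_package_prefix("tensorrt_rtmdet") +
    "/lib/libtensorrt_rtmdet_plugin.so";
  return dlopen(plugin_path.c_str(), RTLD_LAZY);
}
```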

🟨 int8 Precision Option

There are three precision options (fp16, fp32 and int8), and one of them (int8) was not working. In the current state it runs, but the results are not entirely correct. Watch the video to see the problem:

Video Link https://youtu.be/3YlY3a9Xnpk

🟨 Post-processing and Visualization

In the current implementation, the visualization and post-processing steps run on the CPU. I haven't figured out how to run them on the GPU yet.

| Step | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| Preprocess | 1.61101 | 0.470295 |
| Inference | 1.71506 | 0.757723 |
| Postprocess | 13.4203 | 0.628055 |
| Visualization | 23.4338 | 6.98011 |

The following table shows the total time for preprocessing, inference and post-processing. It does not include visualization.

| | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| Total | 15.791 | 1.58901 |

  • The part that fills the output message and publishes the result is not implemented yet. I don't think it will add much time.

Message Type for Outputs

YOLOX semantic segmentation uses the following message types:

  • For objects: tier4_perception_msgs::msg::DetectedObjectsWithFeature
  • For semantic mask: sensor_msgs::msg::Image

There is no message definition for instance segmentation, so my plan is to create a new message type that combines the current detection message with the new instance segmentation information.

Should I create it under autoware_msgs or tier4_autoware_msgs?


📝 An RTX 3090 GPU was used for the time calculations and benchmarking

@StepTurtle (Contributor, Author) commented:

When I ran 8 separate RTMDet nodes, I obtained the results in the table below; the numbers represent the processing times per image:

| Node | Average Time (ms) | Standard Deviation (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| node-0 | 43.5821 | 17.0037 | 16.2328 | 102.521 |
| node-1 | 42.7286 | 15.8802 | 22.3733 | 98.7796 |
| node-2 | 41.4004 | 15.0862 | 15.0169 | 87.0995 |
| node-3 | 42.3738 | 15.7069 | 21.1178 | 105.505 |
| node-4 | 36.4766 | 13.2308 | 20.3997 | 81.181 |
| node-5 | 42.2531 | 16.1687 | 16.265 | 91.1761 |
| node-6 | 35.9258 | 13.5001 | 19.7352 | 76.6217 |
| node-7 | 36.4776 | 15.4815 | 21.2165 | 80.9634 |

Computer Specifications

| Device | Model |
|---|---|
| GPU | GeForce RTX 3090 (24 GB VRAM) |
| CPU | AMD Ryzen 7 2700X eight-core processor × 16 |
| Memory | 32 GB |

@StepTurtle (Contributor, Author) commented Jul 30, 2024

The following tables show the processing times for the RTMDet and YOLOX models with single- and multiple-camera configurations.

Computer Specifications

| Device | Model |
|---|---|
| GPU | GeForce RTX 3090 (24 GB VRAM) |
| CPU | AMD Ryzen 7 2700X eight-core processor × 16 |
| Memory | 32 GB |

RTMDet

Single Camera

| Step | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| Preprocess | 0.488408 | 0.364109 |
| Inference | 2.93087 | 0.757723 |
| Postprocess | 12.0955 | 0.912801 |
| Visualization | 24.4686 | 7.0474 |
| Fill Message | 8.88453 | 2.69375 |
| Total | 26.1742 | 5.0135 |

Multiple Camera

| Node | Average Time (ms) | Standard Deviation (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| node-0 | 72.119 | 27.7871 | 23.6683 | 170.952 |
| node-1 | 74.4599 | 28.5641 | 36.4258 | 189.597 |
| node-2 | 86.9789 | 25.7673 | 36.9122 | 191.112 |
| node-3 | 68.1571 | 25.7322 | 32.5606 | 202.151 |
| node-4 | 76.4576 | 24.6437 | 38.4349 | 147.174 |
| node-5 | 86.094 | 26.974 | 32.8472 | 184.179 |
| node-6 | 79.927 | 28.3585 | 39.9874 | 179.886 |
| node-7 | 78.0745 | 29.8809 | 23.6683 | 162.812 |

YOLOX

Single Camera

| Step | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| Preprocess | 0.645836 | 1.06275 |
| Inference | 1.35646 | 0.860102 |
| Postprocess | 4.16757 | 3.2164 |
| Visualization | 6.27042 | 1.73397 |
| Fill Message | 2.73657 | 0.390568 |
| Total | 10.7695 | 1.61752 |

Multiple Camera

| Node | Average Time (ms) | Standard Deviation (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| node-0 | 65.3484 | 14.5957 | 33.0228 | 132.028 |
| node-1 | 68.2993 | 16.6715 | 40.3608 | 133.458 |
| node-2 | 69.3783 | 18.46 | 33.4685 | 123.697 |
| node-3 | 66.2178 | 17.1871 | 38.1942 | 130.991 |
| node-4 | 66.7007 | 16.8195 | 36.598 | 116.047 |
| node-5 | 67.7668 | 17.9907 | 18.2643 | 160.636 |
| node-6 | 67.4877 | 16.112 | 36.5241 | 147.61 |
| node-7 | 68.5824 | 17.7861 | 22.1659 | 126.883 |

@StepTurtle (Contributor, Author) commented:

Discussion on new message type for instance segmentation results: https://github.com/orgs/autowarefoundation/discussions/5047

@StepTurtle (Contributor, Author) commented:

Latest updates: https://www.youtube.com/watch?v=N8qrGAxzSJM

@StepTurtle (Contributor, Author) commented Oct 1, 2024

Some options for using the RTMDet results in the camera-lidar pipeline:

1) Only run RTMDet and use the ROI outputs with the current pipeline. Since RTMDet produces the same kind of bounding box results as YOLOX, you can use RTMDet directly with the current camera-lidar pipeline.

2) Fuse the instance segmentation masks with the clusters from Euclidean clustering and assign labels to the clusters.

The outputs of Euclidean clustering are only point clouds, and it is UNKNOWN which objects they belong to. roi_cluster_fusion assigns labels to the clusters using bounding boxes; mask_cluster_fusion uses instance segmentation masks instead of bounding boxes (a rough sketch of the idea is included after the links below).

It may perform better with objects that are close to each other and overlapping. I haven't tested that yet.

code: https://github.com/StepTurtle/autoware.universe/tree/feat/mask_cluster_fusion
video: https://youtu.be/nq7WJUAzpXE
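
Roughly, the fusion logic behind option 2 works like the following sketch (hypothetical types and names, assuming a 3x3 CV_64F intrinsic matrix; the actual implementation is in the branch linked above):

```cpp
#include <map>
#include <vector>
#include <opencv2/core.hpp>

struct Cluster
{
  std::vector<cv::Point3f> points;  // points of one Euclidean cluster (camera frame)
  int label = -1;                   // UNKNOWN until fusion assigns a class
};

// Assign a class label to each cluster by projecting its points into the image
// and looking up which instance mask (if any) most of them fall into.
// `mask` holds one instance id per pixel (0 = background), and `instance_labels`
// maps an instance id to its class label.
void fuse_masks_with_clusters(
  const cv::Mat & mask, const std::map<int, int> & instance_labels,
  const cv::Mat & camera_matrix, std::vector<Cluster> & clusters)
{
  for (auto & cluster : clusters) {
    std::map<int, int> votes;  // instance id -> number of projected points inside it
    for (const auto & p : cluster.points) {
      if (p.z <= 0.f) continue;  // behind the camera
      // Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy
      const int u = static_cast<int>(
        camera_matrix.at<double>(0, 0) * p.x / p.z + camera_matrix.at<double>(0, 2));
      const int v = static_cast<int>(
        camera_matrix.at<double>(1, 1) * p.y / p.z + camera_matrix.at<double>(1, 2));
      if (u < 0 || v < 0 || u >= mask.cols || v >= mask.rows) continue;
      const int instance_id = mask.at<uchar>(v, u);
      if (instance_id > 0) ++votes[instance_id];
    }
    // Pick the instance that most points fall into and take its class label.
    int best_id = 0, best_count = 0;
    for (const auto & [id, count] : votes) {
      if (count > best_count) { best_id = id; best_count = count; }
    }
    if (best_id > 0) cluster.label = instance_labels.at(best_id);
  }
}
```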

3) Fuse the point cloud with the instance segmentation masks and take the points that fall within a mask as belonging to that object.

code: https://github.com/StepTurtle/autoware.universe/tree/feat/mask_pointcloud_fusion
video: https://youtu.be/MvUnQ120IWE

4) Fuse the point cloud with the instance segmentation mask: filter out the points that do not correspond to objects within the mask and create a filtered point cloud.

I couldn't find a useful place where we can use the filtered point cloud yet.

PR: #8167
video: https://www.youtube.com/watch?v=N8qrGAxzSJM

@kminoda (Contributor) commented Oct 11, 2024

@StepTurtle Hi, first of all, thank you for your contribution to Autoware 🙏

However, I am not sure whether it would be better to merge this into Autoware at this moment (sorry for bringing this up after all of the reviews 🙏 )

I have several questions to ask:

  1. What is the specific use case that you want to solve with this instance segmentation? Can't it be solved with semantic segmentation e.g. YOLOX in current Autoware?
  2. What do you think about making RTMDet (as well as trt_nms_batch) as a separate repository for now, and add that repository to, say, pilot-auto.leodrive/autoware.repos? That way, you do not need to go through all the reviewing process, worry about breaking Autoware, care about some dependency version change (e.g. TensorRT), etc. We can merge to Autoware once it is clear that this is beneficial to AWF community.

Let me know your thoughts.

@xmfcx (Contributor) commented Oct 11, 2024

It also has competitive speed.

Why project boxes when you can project instance segmentation masks?

❌ Poor bounding box to point cloud performance

only_allow_inside_cluster = false

final-yolox.mp4

ROI cluster fusion code has very complicated rules:

❌ Reduced performance with only_allow_inside_cluster

only_allow_inside_cluster = true

Probably to solve this issue, you've added the only_allow_inside_cluster parameter recently. But even that leads to mis-labeling of the wrong objects, and it still misses the pedestrians in front. (Green is the prediction.)

2024-10-11_14-45-41.mp4

✅ Project instance segmentation masks

final-rtmdet.mp4

The projection code is much simpler and works all the time.

No need for weird workarounds.

Why not semantic segmentation?

> Can't it be solved with semantic segmentation e.g. YOLOX in current Autoware?

Why would we use the outdated semantic segmentation technology when we could achieve instance segmentation with similar performance?


https://blog.roboflow.com/difference-semantic-segmentation-instance-segmentation/


You cannot achieve this level of granularity with semantic segmentation.

@xmfcx (Contributor) commented Oct 11, 2024

@kminoda

> That way, you do not need to go through all the reviewing process, worry about breaking Autoware, care about some dependency version change (e.g. TensorRT), etc. We can merge to Autoware once it is clear that this is beneficial to AWF community.

I think this is a very backwards way of thinking. The universe is supposed to be for community contributions.

A lot of effort has already been put into this integration, and the author was very cooperative during the process.

The benefits are very clear, and I don't see a reason for refusing to accept a feature like this.

It doesn't even affect the existing repositories. I am very confused about your proposal.

@armaganarsln commented:

> @StepTurtle Hi, first of all, thank you for your contribution to Autoware 🙏
>
> However, I am not sure whether it would be better to merge this into Autoware at this moment (sorry for bringing this up after all of the reviews 🙏 )
>
> I have several questions to ask:
>
> 1. What is the specific use case that you want to solve with this instance segmentation? Can't it be solved with semantic segmentation e.g. YOLOX in current Autoware?
> 2. What do you think about making RTMDet (as well as trt_nms_batch) as a separate repository for now, and add that repository to, say, pilot-auto.leodrive/autoware.repos? That way, you do not need to go through all the reviewing process, worry about breaking Autoware, care about some dependency version change (e.g. TensorRT), etc. We can merge to Autoware once it is clear that this is beneficial to AWF community.
>
> Let me know your thoughts.

This is too late to suggest. All the work has been done, and it has to be merged into main now. I am sorry for not giving any other options, but these comments should have been made earlier.

@kminoda (Contributor) commented Oct 15, 2024

@xmfcx @armaganarsln @StepTurtle
Hi, thank you for the comment.

First, I would like to apologize if I have been disrespectful in any way. My primary concern was to understand "specifically which use cases posed challenges." However, it now makes sense to me with Fatih's comment here. Let me pass this feedback on to our team too 🙏
Regarding my second point, I initially suggested managing it in a separate repository to potentially increase development speed, especially if specific use cases were not yet fully defined. I thought this approach might benefit Leo Drive as well. However, if there is value in using this solution, let me withdraw my previous suggestion.

@armaganarsln commented:

@kminoda san, thank you, and I am sorry for misunderstanding your approach. I thought you didn't want it to be used or included in main. Anyway, there is still a lot of work to be done for it to be useful, so let's focus on that for now and see whether it improves the overall false detections in the end.
Thank you for your support.

@StepTurtle (Contributor, Author) commented:

Hey @kminoda , could you update me on your latest decision? Are the reviews ongoing, or are there other concerns you're currently considering? It would be helpful to know the latest status so I can continue working.
