Implement RTMDet to Perception Pipeline #7235

Open
8 of 10 tasks
StepTurtle opened this issue Jun 3, 2024 · 18 comments

@StepTurtle (Contributor) commented Jun 3, 2024

Checklist

  • I've read the contribution guidelines.
  • I've searched other issues and no duplicate issues were found.
  • I've agreed with the maintainers that I can plan this task.

Description

We plan to add the RTMDet model alongside the existing YOLOX model in Autoware Universe. While the YOLOX model performs well on the bounding box task, its instance segmentation head is weak. We aim to improve instance segmentation results by adding the RTMDet model.


Purpose

Our goal is to enhance the lidar-image fusion pipeline by adding the RTMDet model to Autoware for image segmentation.

Possible approaches

We can convert the pre-trained PyTorch models to ONNX and TensorRT formats, and then create a ROS 2 package in the Autoware Universe perception stack that runs the TensorRT models.
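
For illustration, the ONNX-to-TensorRT conversion step could look roughly like the sketch below, which uses the TensorRT C++ API to build an engine from an exported ONNX file. File names and the fp16 choice are placeholders, not the final design; mmdeploy tools or `trtexec` could be used instead.

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <memory>

// Minimal logger required by the TensorRT builder.
class Logger : public nvinfer1::ILogger
{
  void log(Severity severity, const char * msg) noexcept override
  {
    if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
  }
};

int main()
{
  Logger logger;
  // Note: RTMDet's custom TensorRT plugin may need to be loaded/registered before parsing.
  auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
  auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(
    1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH)));
  auto parser =
    std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));

  // Parse the exported ONNX model (the path is a placeholder).
  if (!parser->parseFromFile(
        "rtmdet-ins-s.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
    std::cerr << "Failed to parse ONNX model" << std::endl;
    return 1;
  }

  auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
  config->setFlag(nvinfer1::BuilderFlag::kFP16);  // fp16 is one of the precision options

  // Build and serialize the engine, then write it to disk.
  auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
    builder->buildSerializedNetwork(*network, *config));
  std::ofstream out("rtmdet-ins-s.engine", std::ios::binary);
  out.write(static_cast<const char *>(serialized->data()), serialized->size());
  return 0;
}
```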

Definition of done

  • Run detection with the pre-trained PyTorch models
  • Convert the pre-trained PyTorch models to ONNX and TensorRT formats
  • Deploy the TensorRT model with Python
  • Deploy the TensorRT model with C++
  • Create a ROS 2 package in Autoware Universe which uses the ONNX models
  • Compare RTMDet results with YOLOX segmentation results
  • Decide how to use the RTMDet segmentation results in the camera-lidar fusion pipeline
StepTurtle added the component:perception label Jun 3, 2024
StepTurtle self-assigned this Jun 3, 2024
@StepTurtle (Contributor, Author) commented Jun 3, 2024

Here are the results of the pre-trained models shared in this link from mmdetection.

Results:

  • I used the PyTorch models (.pth files) to get these results.

| Model | Score Threshold | NMS Threshold | Detection Time Per Image | Video Link |
|---|---|---|---|---|
| RTMDet-Ins-s | 0.3 | 0.3 | ~20 ms | Video Link |
| RTMDet-Ins-x | 0.3 | 0.3 | ~33 ms | Video Link |

So far, I have tested the pre-trained models shared by mmdetection using the mmdetection tools. They also provide tools to convert the .pth models to .onnx and .engine formats. I checked the outputs of both converted models and they look the same, so I think we can say their tools convert the models correctly.

Right now I am working out how to use the TensorRT models in C++ with the TensorRT libraries.
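
As a rough sketch of that direction (not final code), deserializing an engine and running inference with the TensorRT C++ API looks roughly like this; the tensor shapes, buffer layout, and file names below are placeholders:

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iostream>
#include <memory>
#include <vector>

class Logger : public nvinfer1::ILogger
{
  void log(Severity severity, const char * msg) noexcept override
  {
    if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
  }
};

int main()
{
  // Read the serialized engine from disk (path is a placeholder).
  std::ifstream file("rtmdet-ins-s.engine", std::ios::binary);
  std::vector<char> engine_data(
    (std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

  Logger logger;
  // Note: RTMDet's custom TensorRT plugin must be loaded (e.g. with dlopen)
  // before deserializing the engine.
  auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
  auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
    runtime->deserializeCudaEngine(engine_data.data(), engine_data.size()));
  auto context =
    std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext());

  // Allocate device buffers for one input and one output (sizes are placeholders;
  // the real model has several output tensors for boxes, labels and masks).
  void * buffers[2];
  cudaMalloc(&buffers[0], 1 * 3 * 640 * 640 * sizeof(float));
  cudaMalloc(&buffers[1], 1 * 100 * 6 * sizeof(float));

  // ... copy the preprocessed image into buffers[0] with cudaMemcpy ...

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  context->enqueueV2(buffers, stream, nullptr);  // run inference asynchronously
  cudaStreamSynchronize(stream);

  // ... copy the results back with cudaMemcpy and post-process on the host ...

  cudaFree(buffers[0]);
  cudaFree(buffers[1]);
  cudaStreamDestroy(stream);
  return 0;
}
```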

@StepTurtle (Contributor, Author) commented Jun 5, 2024

I deployed the TensorRT engine in Python and got consistent results.

| Model | Score Threshold | NMS Threshold | Detection Time Per Image | Video Link |
|---|---|---|---|---|
| RTMDet-Ins-s (TensorRT) | 0.3 | NO NMS | ~20 ms | Video Link |
| RTMDet-Ins-x (TensorRT) | 0.3 | NO NMS | ~36 ms | Video Link |

Warning

In some parts of the video, you may see incorrect class names. For example, you might see both truck and car class names assigned to a vehicle. This is because I didn't run NMS when I deployed it in Python. I plan to fix this when I deploy it in C++.


So, my next step is to do the same thing in C++.

Also, the detection times look a bit higher than expected. I don't understand the reason yet, but I am working on it.

xmfcx changed the title from "Implament RTMDet to Perception Pipeline" to "Implement RTMDet to Perception Pipeline" Jun 11, 2024
@StepTurtle (Contributor, Author) commented Jun 11, 2024

I am sharing the results from the TensorRT deployment in C++, but currently there are some issues with the results.

I cannot see exactly the same results as with Python deployment. When I check the bounding box and score results, everything appears to be the same as in Python. However, when I check the labels, the results are very different. I am currently trying to resolve this issue.

| Model | Score Threshold | NMS Threshold | Detection Time Per Image | Video Link |
|---|---|---|---|---|
| RTMDet-Ins-s (TensorRT) | 0.3 | NO NMS | ~8 ms | Video Link |
| RTMDet-Ins-x (TensorRT) | 0.3 | NO NMS | ~22 ms | Video Link |

You can find the scripts I used for deployment at these links:

@StepTurtle (Contributor, Author) commented:

Fatih suggested that I:

  • check the YOLOX detection time and compare it with RTMDet.
  • test with datasets and compare scores, bounding boxes, and so on against YOLOX.

@StepTurtle (Contributor, Author) commented Jun 25, 2024

I just started to deploy RTMDet to ROS 2. Here is the repository: https://github.com/leo-drive/tensorrt_rtmdet/

  • Right now I get results similar to my previous work.
  • Unlike my previous work, the package converts the ONNX model to a TensorRT model on the first run, so you do not need to provide a TensorRT model; the ONNX model is enough.

| Model | Score Threshold | NMS Threshold | Video Link |
|---|---|---|---|
| RTMDet-Ins-s | 0.3 | NO NMS | Video Link |
| RTMDet-Ins-x | 0.3 | NO NMS | Video Link |

I plan to complete the porting to ROS 2 by performing the following steps:

  • RTMDet uses a custom TensorRT plugin, which is currently loaded with dlopen(); find a better way.
  • For now, preprocessing runs on the CPU (with OpenCV); move it to CUDA.
  • Post-processing also runs on the CPU; try to move it to CUDA.
  • There is no NMS right now; add it.
  • There are a lot of hard-coded parts right now; remove them.
  • There are three precision options for converting the ONNX model to a TensorRT model (fp16, fp32 and int8); int8 is not working.

@StepTurtle (Contributor, Author) commented Jul 8, 2024

CUDA preprocessing is OK

Preprocess Time Comparison per Image:

| Device | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| CPU | 6.87286 | 1.58603 |
| CUDA | 1.61101 | 0.470295 |
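
For illustration only, GPU preprocessing can be done along these lines with OpenCV's CUDA module (a sketch that assumes OpenCV is built with CUDA support; this is not necessarily how the package implements it):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudawarping.hpp>

// Resize and convert an image on the GPU instead of the CPU.
cv::cuda::GpuMat preprocess_gpu(const cv::Mat & bgr, int width, int height)
{
  cv::cuda::GpuMat gpu_in, gpu_resized, gpu_float;
  gpu_in.upload(bgr);                                // host -> device copy
  cv::cuda::resize(gpu_in, gpu_resized, cv::Size(width, height));
  gpu_resized.convertTo(gpu_float, CV_32FC3, 1.0);   // convert to float
  // Mean/std normalization would follow here (e.g. cv::cuda::subtract / divide),
  // using the values the model was trained with.
  return gpu_float;
}
```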

NMS is added

When I checked the model, I saw that the NMS algorithm built into the model was already working. Despite NMS working, the reason we saw overlapping boxes is that the detected objects belonged to different classes. To prevent this, I added a simple NMS step that, for overlapping detections, keeps only the one with the higher score regardless of class.

Video Link https://youtu.be/nBAshLqQ1-k
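
For reference, the added NMS step works along these lines (a simplified sketch with a hypothetical Detection struct, not the exact code from the package):

```cpp
#include <algorithm>
#include <vector>
#include <opencv2/core.hpp>

struct Detection
{
  cv::Rect box;
  float score;
  int label;
};

// Keep only the higher-scoring detection when two detections overlap,
// regardless of their class labels.
std::vector<Detection> cross_class_nms(std::vector<Detection> dets, float iou_threshold)
{
  std::sort(dets.begin(), dets.end(),
            [](const Detection & a, const Detection & b) { return a.score > b.score; });

  std::vector<Detection> kept;
  for (const auto & det : dets) {
    bool suppressed = false;
    for (const auto & k : kept) {
      const float inter = (det.box & k.box).area();
      const float uni = det.box.area() + k.box.area() - inter;
      if (uni > 0.f && inter / uni > iou_threshold) {
        suppressed = true;  // overlaps a higher-scoring detection
        break;
      }
    }
    if (!suppressed) kept.push_back(det);
  }
  return kept;
}
```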

NMS Time per Image

| | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| NMS | 0.00172242 | 0.000780426 |

Batch sizes greater than 1 are OK

  • Batching wasn't used before, but it can be used now.
  • Documentation is still needed.
Video Link https://youtu.be/hUfK5M4S7Qo

Hard-Coded Parts, Launch and Config Files

  • I removed the hard-coded parts and created example launch and config files.
  • A multi-camera launch file is still needed.

🟨 Plugin Loading

Autoware has the same logic that I used to load the TensorRT plugin.

```cpp
for (const auto & plugin_path : plugin_paths) {
    int32_t flags{RTLD_LAZY};
    void * handle = dlopen(plugin_path.c_str(), flags);  // plugin_path is '/path/to/plugin.so'
    if (!handle) {
      logger_.log(nvinfer1::ILogger::Severity::kERROR, "Could not load plugin library");
    }
}
```

After compilation, a file with the '.so' extension is created. This file is stored in the build folder, and its path must be passed to dlopen().

Is there a way to handle this in CMake? If not, how can I provide the path to the file located inside the 'build' folder? (One possible approach is sketched after the list of paths below.)

I was able to load the plugin using the file paths below:

  • ./build/tensorrt_rtmdet/libtensorrt_rtmdet_plugin.so (relative path from workspace)
  • /home/user/projects/workspace/build/tensorrt_rtmdet/libtensorrt_rtmdet_plugin.so (absolute path)
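
One possible direction (an untested assumption on my side): install the plugin library into the package's lib directory in CMakeLists.txt, e.g. with `install(TARGETS tensorrt_rtmdet_plugin LIBRARY DESTINATION lib)`, and resolve the path at runtime through ament_index_cpp instead of hard-coding the build path:

```cpp
#include <string>
#include <dlfcn.h>
#include <ament_index_cpp/get_package_prefix.hpp>

// Resolve the plugin library from the package's install prefix instead of the
// build directory. This assumes CMakeLists.txt installs the plugin target to
// the package's lib directory.
void * load_rtmdet_plugin()
{
  const std::string plugin_path =
    ament_index_cpp::get_package_prefix("tensorrt_rtmdet") +
    "/lib/libtensorrt_rtmdet_plugin.so";
  return dlopen(plugin_path.c_str(), RTLD_LAZY);
}
```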

🟨 int8 Precision Option

There are three precision options (fp16, fp32 and int8), and one of them (int8) was not working. In the current state it runs, but the results are not entirely correct. Watch the video to see the problem:

Video Link https://youtu.be/3YlY3a9Xnpk

🟨 Post-processing and Visualization

In the current implementation, the visualization and post-processing steps run on the CPU. I haven't figured out how to run them on the GPU yet.

| Step | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| Preprocess | 1.61101 | 0.470295 |
| Inference | 1.71506 | 0.757723 |
| Postprocess | 13.4203 | 0.628055 |
| Visualization | 23.4338 | 6.98011 |

The following table shows the total time for preprocessing, inference and post-processing. It does not include visualization.

| | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| Total | 15.791 | 1.58901 |

  • The part that fills the output message and publishes the result is not implemented yet. I don't think it will add much time.

Message Type for Outputs

YOLOX semantic segmentation uses the following message types:

  • For objects: tier4_perception_msgs::msg::DetectedObjectsWithFeature
  • For semantic mask: sensor_msgs::msg::Image

There is no message definition for instance segmentation, so my plan is to create a new message type that combines the current detection message with the new instance segmentation information.

Should I create it under autoware_msgs or tier4_autoware_msgs?


📝 An RTX 3090 GPU was used for the time calculations and benchmarking

@StepTurtle (Contributor, Author) commented:

When I ran 8 separate RTMDet nodes, I obtained the results in the table below; the numbers represent the processing times per image:

| Node | Average Time (ms) | Standard Deviation (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| node-0 | 43.5821 | 17.0037 | 16.2328 | 102.521 |
| node-1 | 42.7286 | 15.8802 | 22.3733 | 98.7796 |
| node-2 | 41.4004 | 15.0862 | 15.0169 | 87.0995 |
| node-3 | 42.3738 | 15.7069 | 21.1178 | 105.505 |
| node-4 | 36.4766 | 13.2308 | 20.3997 | 81.181 |
| node-5 | 42.2531 | 16.1687 | 16.265 | 91.1761 |
| node-6 | 35.9258 | 13.5001 | 19.7352 | 76.6217 |
| node-7 | 36.4776 | 15.4815 | 21.2165 | 80.9634 |

Computer Specifications

| Device | Model |
|---|---|
| GPU | GeForce RTX 3090 (24 GB VRAM) |
| CPU | AMD Ryzen 7 2700X eight-core processor × 16 |
| Memory | 32 GB |

@StepTurtle (Contributor, Author) commented Jul 30, 2024

The following tables show the processing times for the RTMDet and YOLOX models with single- and multiple-camera configurations.

Computer Specifications

| Device | Model |
|---|---|
| GPU | GeForce RTX 3090 (24 GB VRAM) |
| CPU | AMD Ryzen 7 2700X eight-core processor × 16 |
| Memory | 32 GB |

RTMDet

Single Camera

| Step | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| Preprocess | 0.488408 | 0.364109 |
| Inference | 2.93087 | 0.757723 |
| Postprocess | 12.0955 | 0.912801 |
| Visualization | 24.4686 | 7.0474 |
| Fill Message | 8.88453 | 2.69375 |
| Total | 26.1742 | 5.0135 |

Multiple Camera

| Node | Average Time (ms) | Standard Deviation (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| node-0 | 72.119 | 27.7871 | 23.6683 | 170.952 |
| node-1 | 74.4599 | 28.5641 | 36.4258 | 189.597 |
| node-2 | 86.9789 | 25.7673 | 36.9122 | 191.112 |
| node-3 | 68.1571 | 25.7322 | 32.5606 | 202.151 |
| node-4 | 76.4576 | 24.6437 | 38.4349 | 147.174 |
| node-5 | 86.094 | 26.974 | 32.8472 | 184.179 |
| node-6 | 79.927 | 28.3585 | 39.9874 | 179.886 |
| node-7 | 78.0745 | 29.8809 | 23.6683 | 162.812 |

YOLOX

Single Camera

| Step | Average Time (ms) | Standard Deviation (ms) |
|---|---|---|
| Preprocess | 0.645836 | 1.06275 |
| Inference | 1.35646 | 0.860102 |
| Postprocess | 4.16757 | 3.2164 |
| Visualization | 6.27042 | 1.73397 |
| Fill Message | 2.73657 | 0.390568 |
| Total | 10.7695 | 1.61752 |

Multiple Camera

| Node | Average Time (ms) | Standard Deviation (ms) | Min (ms) | Max (ms) |
|---|---|---|---|---|
| node-0 | 65.3484 | 14.5957 | 33.0228 | 132.028 |
| node-1 | 68.2993 | 16.6715 | 40.3608 | 133.458 |
| node-2 | 69.3783 | 18.46 | 33.4685 | 123.697 |
| node-3 | 66.2178 | 17.1871 | 38.1942 | 130.991 |
| node-4 | 66.7007 | 16.8195 | 36.598 | 116.047 |
| node-5 | 67.7668 | 17.9907 | 18.2643 | 160.636 |
| node-6 | 67.4877 | 16.112 | 36.5241 | 147.61 |
| node-7 | 68.5824 | 17.7861 | 22.1659 | 126.883 |

@StepTurtle (Contributor, Author) commented:

Discussion on new message type for instance segmentation results: https://github.com/orgs/autowarefoundation/discussions/5047

@StepTurtle (Contributor, Author) commented:

Latest updates: https://www.youtube.com/watch?v=N8qrGAxzSJM

@StepTurtle (Contributor, Author) commented Oct 1, 2024

Some options for using the RTMDet results in the camera-lidar pipeline:

1) Only run RTMDet and use the ROI outputs with the current pipeline. Since RTMDet produces the same kind of bounding box results as YOLOX, you can use RTMDet directly with the current camera-lidar pipeline.

2) Fuse the instance segmentation masks with the clusters from Euclidean clustering and assign labels to the clusters.

The outputs of Euclidean clustering are only point clouds, and it is UNKNOWN which objects they belong to. roi_cluster_fusion assigns labels to the clusters using bounding boxes; mask_cluster_fusion uses instance segmentation masks instead of bounding boxes (a rough sketch of the idea is included after the links below).

It may perform better with objects that are close to each other and overlapping. I haven't tested that yet.

code: https://github.com/StepTurtle/autoware.universe/tree/feat/mask_cluster_fusion
video: https://youtu.be/nq7WJUAzpXE
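
Roughly, the fusion logic behind option 2 works like the following sketch (hypothetical types and names, assuming a 3x3 CV_64F intrinsic matrix; the actual implementation is in the branch linked above):

```cpp
#include <map>
#include <vector>
#include <opencv2/core.hpp>

struct Cluster
{
  std::vector<cv::Point3f> points;  // points of one Euclidean cluster (camera frame)
  int label = -1;                   // UNKNOWN until fusion assigns a class
};

// Assign a class label to each cluster by projecting its points into the image
// and looking up which instance mask (if any) most of them fall into.
// `mask` holds one instance id per pixel (0 = background), and `instance_labels`
// maps an instance id to its class label.
void fuse_masks_with_clusters(
  const cv::Mat & mask, const std::map<int, int> & instance_labels,
  const cv::Mat & camera_matrix, std::vector<Cluster> & clusters)
{
  for (auto & cluster : clusters) {
    std::map<int, int> votes;  // instance id -> number of projected points inside it
    for (const auto & p : cluster.points) {
      if (p.z <= 0.f) continue;  // behind the camera
      // Pinhole projection: u = fx * x / z + cx, v = fy * y / z + cy
      const int u = static_cast<int>(
        camera_matrix.at<double>(0, 0) * p.x / p.z + camera_matrix.at<double>(0, 2));
      const int v = static_cast<int>(
        camera_matrix.at<double>(1, 1) * p.y / p.z + camera_matrix.at<double>(1, 2));
      if (u < 0 || v < 0 || u >= mask.cols || v >= mask.rows) continue;
      const int instance_id = mask.at<uchar>(v, u);
      if (instance_id > 0) ++votes[instance_id];
    }
    // Pick the instance that most points fall into and take its class label.
    int best_id = 0, best_count = 0;
    for (const auto & [id, count] : votes) {
      if (count > best_count) { best_id = id; best_count = count; }
    }
    if (best_id > 0) cluster.label = instance_labels.at(best_id);
  }
}
```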

3) Fuse the point cloud with the instance segmentation masks and take the points that fall within a mask as belonging to that object.

code: https://github.com/StepTurtle/autoware.universe/tree/feat/mask_pointcloud_fusion
video: https://youtu.be/MvUnQ120IWE

4) Fuse the point cloud with the instance segmentation mask: filter out the points that do not correspond to objects within the mask and create a filtered point cloud.

I couldn't find a useful place where we can use the filtered point cloud yet.

PR: #8167
video: https://www.youtube.com/watch?v=N8qrGAxzSJM

@kminoda (Contributor) commented Oct 11, 2024

@StepTurtle Hi, first of all, thank you for your contribution to Autoware 🙏

However, I am not sure whether it would be better to merge this into Autoware at this moment (sorry for bringing this up after all of the reviews 🙏 )

I have several questions to ask:

  1. What is the specific use case that you want to solve with this instance segmentation? Can't it be solved with semantic segmentation e.g. YOLOX in current Autoware?
  2. What do you think about making RTMDet (as well as trt_nms_batch) as a separate repository for now, and add that repository to, say, pilot-auto.leodrive/autoware.repos? That way, you do not need to go through all the reviewing process, worry about breaking Autoware, care about some dependency version change (e.g. TensorRT), etc. We can merge to Autoware once it is clear that this is beneficial to AWF community.

Let me know your thoughts.

@xmfcx (Contributor) commented Oct 11, 2024

It also has competitive speed.

Why project boxes when you can project instance segmentation masks?

❌ Poor bounding box to point cloud performance

only_allow_inside_cluster = false

final-yolox.mp4

ROI cluster fusion code has very complicated rules:

❌ Reduced performance with only_allow_inside_cluster

only_allow_inside_cluster = true

Probably to solve this issue, you've added the only_allow_inside_cluster parameter recently. But even that leads to mis-labeling of the wrong objects, and it still misses the pedestrians in front. (Green is the prediction.)

2024-10-11_14-45-41.mp4

✅ Project instance segmentation masks

final-rtmdet.mp4

The projection code is much simpler and works all the time.

No need for weird workarounds.

Why not semantic segmentation?

> Can't it be solved with semantic segmentation e.g. YOLOX in current Autoware?

Why would we use the outdated semantic segmentation technology when we could achieve instance segmentation with similar performance?


https://blog.roboflow.com/difference-semantic-segmentation-instance-segmentation/


You cannot achieve this level of granularity with semantic segmentation.

@xmfcx (Contributor) commented Oct 11, 2024

@kminoda

> That way, you do not need to go through all the reviewing process, worry about breaking Autoware, care about some dependency version change (e.g. TensorRT), etc. We can merge to Autoware once it is clear that this is beneficial to AWF community.

I think this is a very backwards way of thinking. The universe is supposed to be for community contributions.

A lot of effort has already been put into this integration, and the author was very cooperative during the process.

The benefits are very clear, and I don't see a reason for refusing to accept a feature like this.

It doesn't even affect the existing repositories. I am very confused about your proposal.

@armaganarsln commented:

> @StepTurtle Hi, first of all, thank you for your contribution to Autoware 🙏
>
> However, I am not sure whether it would be better to merge this into Autoware at this moment (sorry for bringing this up after all of the reviews 🙏 )
>
> I have several questions to ask:
>
> 1. What is the specific use case that you want to solve with this instance segmentation? Can't it be solved with semantic segmentation e.g. YOLOX in current Autoware?
> 2. What do you think about making RTMDet (as well as trt_nms_batch) as a separate repository for now, and add that repository to, say, pilot-auto.leodrive/autoware.repos? That way, you do not need to go through all the reviewing process, worry about breaking Autoware, care about some dependency version change (e.g. TensorRT), etc. We can merge to Autoware once it is clear that this is beneficial to AWF community.
>
> Let me know your thoughts.

This is too late to suggest. All the work has been done, and it has to be merged into main now. I am sorry for not giving any other options, but these comments should have been made earlier.

@kminoda (Contributor) commented Oct 15, 2024

@xmfcx @armaganarsln @StepTurtle
Hi, thank you for the comment.

First, I would like to apologize if I have been disrespectful in any way. My primary concern was to understand "specifically which use cases posed challenges." However, it now makes sense to me with Fatih's comment here. Let me pass this feedback on to our team too 🙏
Regarding my second point, I initially suggested managing it in a separate repository to potentially increase development speed, especially if specific use cases were not yet fully defined. I thought this approach might benefit Leo Drive as well. However, if there is value in using this solution, let me withdraw my previous suggestion.

@armaganarsln commented:

@kminoda san, thank you, and I am sorry for misunderstanding your approach. I thought you didn't want it to be used or included in main. Anyway, there is still a lot of work to be done for it to be useful, so let's focus on that for now and see whether it improves the overall false detections in the end.
Thank you for your support.

@StepTurtle (Contributor, Author) commented:

Hey @kminoda , could you update me on your latest decision? Are the reviews ongoing, or are there other concerns you're currently considering? It would be helpful to know the latest status so I can continue working.
