[Performance] CUDAExecutionProvider without RoiAlign (opset 16 version) #21990

Closed
YuriGao opened this issue Sep 5, 2024 · 4 comments
Labels
ep:CUDA: issues related to the CUDA execution provider
performance: issues related to performance regressions
stale: issues that have not been addressed in a while; categorized by a bot

Comments


YuriGao commented Sep 5, 2024

Describe the issue

I'm using the Cascade Mask R-CNN model from detectron2. When I export it to ONNX, the model file contains RoiAlign (opset 16 version).
When running on ONNX Runtime with the CUDA EP, inference is too slow because RoiAlign falls back to the CPU EP.
Could anyone provide a RoiAlign (opset 16 version) implementation for the CUDA EP?

To reproduce

1. Export Cascade Mask R-CNN from detectron2 to ONNX;
2. Run the model with the ONNX Runtime CUDA EP.

Urgency

No response

Platform

Windows

OS Version

Win10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8 and CUDA 12.2

Model File

No response

Is this a quantized model?

No

@YuriGao YuriGao added the performance issues related to performance regressions label Sep 5, 2024
@sophies927 sophies927 added the ep:CUDA issues related to the CUDA execution provider label Sep 5, 2024
Author

YuriGao commented Sep 6, 2024

To run fast on the CUDA EP, I had to use RoiAlign (opset 10 version) and insert a Sub op before RoiAlign's rois input. Note that the Sub value depends on RoiAlign's spatial_scale attribute: it should be 0.5 / spatial_scale.
It would help everyone if someone upgraded the current RoiAlign CUDA EP implementation.

Contributor

github-actions bot commented Oct 6, 2024

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Oct 6, 2024
@davidgill97

To run fast on the CUDA EP, I had to use RoiAlign (opset 10 version) and insert a Sub op before RoiAlign's rois input. Note that the Sub value depends on RoiAlign's spatial_scale attribute: it should be 0.5 / spatial_scale. It would help everyone if someone upgraded the current RoiAlign CUDA EP implementation.

I'm also experiencing the same issue. Could you explain why you added the Sub op before rois, and how much it improved latency?

@YuriGao
Copy link
Author

YuriGao commented Apr 15, 2025

To run fast on the CUDA EP, I had to use RoiAlign (opset 10 version) and insert a Sub op before RoiAlign's rois input. Note that the Sub value depends on RoiAlign's spatial_scale attribute: it should be 0.5 / spatial_scale. It would help everyone if someone upgraded the current RoiAlign CUDA EP implementation.

I'm also experiencing the same issue. Could you explain why you added the Sub op before rois, and how much it improved latency?

As I said, RoiAlign has two versions (opset 10 and opset 16) that interpret their inputs differently. RoiAlign opset 10 has a CUDA EP kernel and opset 16 does not, so if you want to run fast on the CUDA EP, the opset 10 version is the only choice. For the same inputs, RoiAlign opset 10 does not equal opset 16, but opset 10 RoiAlign preceded by the Sub op equals opset 16 RoiAlign.
In my case, per-inference cost dropped by more than 50%. Such an improvement may also be related to ONNX Runtime's dispatch strategy.
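The equivalence claimed here follows from the coordinate transforms the two opsets use: opset 16 with its default half_pixel mode maps an ROI coordinate x to x * spatial_scale - 0.5, while opset 10 maps it to x * spatial_scale. A quick arithmetic check (spatial_scale = 0.25 is just an example value):

```python
# Opset-16 RoiAlign (half_pixel) shifts coordinates by half a pixel after
# scaling; opset 10 does not.  Subtracting 0.5 / spatial_scale from the rois
# beforehand makes the opset-10 mapping reproduce the opset-16 one.
spatial_scale = 0.25  # example: a feature map at 1/4 input resolution

for x in [0.0, 3.0, 17.5, 640.0]:
    opset16 = x * spatial_scale - 0.5
    opset10_with_sub = (x - 0.5 / spatial_scale) * spatial_scale
    assert abs(opset16 - opset10_with_sub) < 1e-9

print("opset-10 RoiAlign + Sub reproduces opset-16 half_pixel coordinates")
```

This is why the Sub constant has to be 0.5 / spatial_scale rather than a fixed 0.5: the subtraction happens before the scaling inside RoiAlign.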

@YuriGao YuriGao closed this as completed Apr 15, 2025