[Performance] CUDAExecutionProvider without RoiAlign (opset 16 version) #21990

Closed
YuriGao opened this issue Sep 5, 2024 · 4 comments
Labels
ep:CUDA: issues related to the CUDA execution provider
performance: issues related to performance regressions
stale: issues that have not been addressed in a while; categorized by a bot

Comments


YuriGao commented Sep 5, 2024

Describe the issue

I'm using the Cascade Mask R-CNN model from detectron2. When I export it to ONNX, the model file contains RoiAlign (opset 16 version).
When running on ONNX Runtime with the CUDA EP, inference is too slow because RoiAlign falls back to the CPU EP.
Could anyone provide a RoiAlign (opset 16 version) implementation for the CUDA EP?

To reproduce

1. Export Cascade Mask R-CNN from detectron2 to ONNX;
2. Run the model with the ONNX Runtime CUDA EP.

Urgency

No response

Platform

Windows

OS Version

Win10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8 and CUDA 12.2

Model File

No response

Is this a quantized model?

No

@YuriGao YuriGao added the performance issues related to performance regressions label Sep 5, 2024
@sophies927 sophies927 added the ep:CUDA issues related to the CUDA execution provider label Sep 5, 2024
Author

YuriGao commented Sep 6, 2024

To run fast on the CUDA EP, I had to use RoiAlign (opset 10 version) and insert a Sub op before RoiAlign's rois input. Note that the Sub value depends on RoiAlign's spatial_scale attribute: it should be 0.5 / spatial_scale.
It would help everyone if someone upgraded the current RoiAlign CUDA EP implementation.

Contributor

github-actions bot commented Oct 6, 2024

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.

@github-actions github-actions bot added the stale issues that have not been addressed in a while; categorized by a bot label Oct 6, 2024
@davidgill97

To run fast on the CUDA EP, I had to use RoiAlign (opset 10 version) and insert a Sub op before RoiAlign's rois input. Note that the Sub value depends on RoiAlign's spatial_scale attribute: it should be 0.5 / spatial_scale. It would help everyone if someone upgraded the current RoiAlign CUDA EP implementation.

I'm also experiencing the same issue. Could you explain why you added the Sub op before rois, and how much it improved latency?

@YuriGao
Copy link
Author

YuriGao commented Apr 15, 2025

To run fast on the CUDA EP, I had to use RoiAlign (opset 10 version) and insert a Sub op before RoiAlign's rois input. Note that the Sub value depends on RoiAlign's spatial_scale attribute: it should be 0.5 / spatial_scale. It would help everyone if someone upgraded the current RoiAlign CUDA EP implementation.

I'm also experiencing the same issue. Could you explain why you added the Sub op before rois, and how much it improved latency?

As I said, RoiAlign has two versions (opset 10 and opset 16) that interpret their inputs differently. RoiAlign opset 10 has a CUDA EP kernel and opset 16 does not, so if you want to run fast on the CUDA EP, the opset 10 version is the only choice. For the same inputs, RoiAlign opset 10 does not equal opset 16, but opset 10 RoiAlign preceded by the Sub op equals opset 16 RoiAlign.
In my case, per-inference cost dropped by more than 50%. Such an improvement may also be related to ONNX Runtime's dispatch strategy.
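The equivalence claimed here follows from the coordinate transforms the two opsets use: opset 16 with its default half_pixel mode maps an ROI coordinate x to x * spatial_scale - 0.5, while opset 10 maps it to x * spatial_scale. A quick arithmetic check (spatial_scale = 0.25 is just an example value):

```python
# Opset-16 RoiAlign (half_pixel) shifts coordinates by half a pixel after
# scaling; opset 10 does not.  Subtracting 0.5 / spatial_scale from the rois
# beforehand makes the opset-10 mapping reproduce the opset-16 one.
spatial_scale = 0.25  # example: a feature map at 1/4 input resolution

for x in [0.0, 3.0, 17.5, 640.0]:
    opset16 = x * spatial_scale - 0.5
    opset10_with_sub = (x - 0.5 / spatial_scale) * spatial_scale
    assert abs(opset16 - opset10_with_sub) < 1e-9

print("opset-10 RoiAlign + Sub reproduces opset-16 half_pixel coordinates")
```

This is why the Sub constant has to be 0.5 / spatial_scale rather than a fixed 0.5: the subtraction happens before the scaling inside RoiAlign.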

@YuriGao YuriGao closed this as completed Apr 15, 2025