[WebGPU] Optimize GEMM with vec4 #24478

xiaofeihan1 · 2025-04-21T02:02:35Z

Description

In this PR, we use vec4 to optimize GEMM when the number of columns of both A and B is divisible by 4; otherwise we fall back to the previous shader.
I will add a u32/vec2 implementation in the future, at which point we will keep only one shader.
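
To illustrate the idea, here is a minimal CPU-side sketch, not the shader code from this PR. It assumes the simplest case (no transpose, row-major storage, no alpha/beta/C term). `Vec4`, `CanUseVec4`, `GemmScalar` and `GemmVec4` are hypothetical names introduced only for this example; the real eligibility check in the PR may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a WGSL vec4<f32> value.
using Vec4 = std::array<float, 4>;

// Hypothetical eligibility check (not the PR's code): the vec4 path is taken
// only when the stored column counts of A and B are both multiples of 4.
bool CanUseVec4(std::size_t a_cols, std::size_t b_cols) {
  return a_cols % 4 == 0 && b_cols % 4 == 0;
}

// Scalar reference: Y = A * B with row-major A (m x k) and B (k x n).
std::vector<float> GemmScalar(const std::vector<float>& a,
                              const std::vector<float>& b,
                              std::size_t m, std::size_t k, std::size_t n) {
  std::vector<float> y(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        y[i * n + j] += a[i * k + p] * b[p * n + j];
  return y;
}

// vec4-style variant: each inner step updates four adjacent outputs of one
// row at once, which is what a single vec4 multiply-add does on the GPU.
std::vector<float> GemmVec4(const std::vector<float>& a,
                            const std::vector<float>& b,
                            std::size_t m, std::size_t k, std::size_t n) {
  assert(CanUseVec4(k, n));  // A has k columns, B has n columns.
  std::vector<float> y(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j4 = 0; j4 < n; j4 += 4) {
      Vec4 acc{0.0f, 0.0f, 0.0f, 0.0f};
      for (std::size_t p = 0; p < k; ++p) {
        const float a_ip = a[i * k + p];
        const float* b_vec = b.data() + p * n + j4;  // four contiguous B values
        for (std::size_t lane = 0; lane < 4; ++lane)
          acc[lane] += a_ip * b_vec[lane];
      }
      for (std::size_t lane = 0; lane < 4; ++lane)
        y[i * n + j4 + lane] = acc[lane];
    }
  }
  return y;
}
```

When the divisibility check fails, a caller would use `GemmScalar` instead, mirroring the fallback to the previous shader described above.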

Perf comparison

I ran a custom model containing only GEMM (M = N = K = 1024) with Node.js on M2/M3 Max. Roughly a 20% improvement.

| Device | !transA&&!transB | transA | transB | transA&&transB |
|---|---|---|---|---|
| M2 | 9.36->7.41 | 9.45->7.54 | 11.21->8.19 | 9.66->8.37 |
| M3 Max | 8.07->6.99 | 7.54->6.53 | 8.42->5.89 | 5.47->5.29 |
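
As a sanity check on the "roughly 20%" figure, the relative change per configuration can be computed from the numbers in the table. The sketch below assumes each entry is a before->after measurement where lower is better (the PR does not state the unit); `Sample`, `ReductionPercent`, `MeanReductionPercent`, `kM2` and `kM3Max` are names made up for this example.

```cpp
#include <cstddef>
#include <vector>

// One before/after pair from the table (unit not stated in the PR).
struct Sample {
  double before;
  double after;
};

// Relative reduction in percent: (before - after) / before * 100.
double ReductionPercent(const Sample& s) {
  return (s.before - s.after) / s.before * 100.0;
}

// Arithmetic mean of the per-configuration reductions; 0 for an empty list.
double MeanReductionPercent(const std::vector<Sample>& samples) {
  if (samples.empty()) return 0.0;
  double sum = 0.0;
  for (const Sample& s : samples) sum += ReductionPercent(s);
  return sum / static_cast<double>(samples.size());
}

// Values copied from the table, in column order:
// !transA&&!transB, transA, transB, transA&&transB.
const std::vector<Sample> kM2 = {
    {9.36, 7.41}, {9.45, 7.54}, {11.21, 8.19}, {9.66, 8.37}};
const std::vector<Sample> kM3Max = {
    {8.07, 6.99}, {7.54, 6.53}, {8.42, 5.89}, {5.47, 5.29}};
```

Under that assumption, the mean reduction works out to about 20.3% on M2 and about 15.0% on M3 Max, with the transB case improving the most on both devices.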

fs-eire · 2025-04-22T00:00:38Z

Is there a way to reuse the implementation of MatMul? My understanding is that there is some duplication between GEMM and MatMul, and it would be great if we could reuse the shared code.

xiaofeihan1 · 2025-04-22T07:26:52Z

> Is there a way to reuse the implementation of MatMul? My understanding is that there is some duplication between GEMM and MatMul, and it would be great if we could reuse the shared code.

Thanks for the callout. That's what I plan to do next.
For the current PR, I want to push forward with vec4 support for GEMM. I will take on the refactoring in future PRs because it needs some thought: there are differences between GEMM and MatMul (e.g. the latter supports batching, the former supports transposes). WDYT?
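
For readers less familiar with the difference, the transpose handling that GEMM adds on top of a plain matrix product can be sketched as a scalar reference. This follows the usual definition Y = alpha * op(A) * op(B) + beta * C; it is an illustration only, not code from onnxruntime. `GemmReference` is a hypothetical name, and C is assumed to be a full M x N matrix (no broadcasting) to keep the sketch short.

```cpp
#include <cstddef>
#include <vector>

// Y = alpha * op(A) * op(B) + beta * C, where op(X) is X or its transpose.
// Storage is row-major: A is (m x k), or (k x m) when trans_a is set;
// B is (k x n), or (n x k) when trans_b is set; C and Y are (m x n).
std::vector<float> GemmReference(const std::vector<float>& a,
                                 const std::vector<float>& b,
                                 const std::vector<float>& c,
                                 std::size_t m, std::size_t k, std::size_t n,
                                 bool trans_a, bool trans_b,
                                 float alpha, float beta) {
  std::vector<float> y(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i) {
    for (std::size_t j = 0; j < n; ++j) {
      float acc = 0.0f;
      for (std::size_t p = 0; p < k; ++p) {
        // Only the index computation changes when an operand is transposed.
        const float a_ip = trans_a ? a[p * m + i] : a[i * k + p];
        const float b_pj = trans_b ? b[j * k + p] : b[p * n + j];
        acc += a_ip * b_pj;
      }
      y[i * n + j] = alpha * acc + beta * c[i * n + j];
    }
  }
  return y;
}
```

A batched MatMul would instead wrap the untransposed inner product in an extra loop over the batch dimension, which is the kind of difference a shared implementation would have to reconcile.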

fs-eire · 2025-04-22T18:44:03Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

azure-pipelines · 2025-04-22T18:44:24Z

Azure Pipelines successfully started running 5 pipeline(s).

fs-eire · 2025-04-24T21:10:02Z

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

azure-pipelines · 2025-04-24T21:10:24Z

Azure Pipelines successfully started running 5 pipeline(s).

qjia7

LGTM, thanks.

fs-eire · 2025-04-28T23:12:43Z

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,ONNX Runtime Web CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline

fs-eire · 2025-04-28T23:12:44Z

/azp run Linux QNN CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline

azure-pipelines · 2025-04-28T23:13:04Z

Azure Pipelines successfully started running 5 pipeline(s).

azure-pipelines · 2025-04-28T23:13:11Z

Azure Pipelines successfully started running 6 pipeline(s), but failed to run 1 pipeline(s).

xiaofeihan1 added 5 commits April 21, 2025 10:00

implement vec4

fff087e

fix compile error

2a07abf

delete extra param

8b76007

check vec4 for C

90d371a

cache key

2177b8a

extract functions

4d22064

qjia7 reviewed Apr 23, 2025

onnxruntime/core/providers/webgpu/math/gemm_vec4.cc (outdated, resolved)

onnxruntime/core/providers/webgpu/math/gemm_vec4.cc (outdated, resolved)

onnxruntime/core/providers/webgpu/math/gemm_vec4.cc (outdated, resolved)

resolve comments

8a3b67f

xiaofeihan1 requested a review from qjia7 April 24, 2025 03:00

xiaofeihan1 commented Apr 24, 2025

onnxruntime/core/providers/webgpu/math/gemm.h (outdated, resolved)

qjia7 reviewed Apr 24, 2025

onnxruntime/core/providers/webgpu/math/gemm_vec4.cc (outdated, resolved)

qjia7 reviewed Apr 24, 2025

onnxruntime/core/providers/webgpu/math/gemm_vec4.cc (outdated, resolved)

xiaofeihan1 added 2 commits April 24, 2025 18:06

output is vec1 for some cases

f8d710c

remove unnecessry variable

0735137

update comments

8f8a202

xiaofeihan1 requested a review from qjia7 April 25, 2025 08:48

qjia7 reviewed Apr 28, 2025

onnxruntime/core/providers/webgpu/math/gemm_vec4.cc (outdated, resolved)

xiaofeihan1 added 3 commits April 28, 2025 11:18

cache hint

c87e643

Merge remote-tracking branch 'origin/main' into xiaofeihan/gemm_optimize_vec4

cf99d51

remove unneccesary hint

70e138b

qjia7 approved these changes Apr 28, 2025

qjia7 requested a review from fs-eire April 28, 2025 08:27

fs-eire approved these changes Apr 28, 2025

fs-eire merged commit 81fc3f1 into microsoft:main Apr 29, 2025
71 of 82 checks passed
