
Commit 8de1639

Authored Apr 15, 2025
[webgpu] Enable DP4A MatMul generation path for Qualcomm (#24408)
With this PR, the generation speed for phi4 improves 2x on Qualcomm Adreno X1 GPU (11.1 tps -> 23.2 tps for simple inputs).
1 parent: 1f14dac · commit: 8de1639

File tree

1 file changed: +3 −1

‎onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc

Lines changed: 3 additions & 1 deletion
```diff
@@ -684,7 +684,9 @@ Status MatMulNBits::ComputeInternal(onnxruntime::webgpu::ComputeContext& context
   }

   // On FP32 only GPUs, integer math is faster than FP32 therefore always use DP4A independent of length of M.
-  if ((M >= kMinMForTileOptimization || y->DataType() == DataTypeImpl::GetType<float>()) && CanApplyDP4AMatrixMatMulNBits(context, accuracy_level_, block_size, batch_count, N, K, components_a, has_zero_points)) {
+  if ((M >= kMinMForTileOptimization || y->DataType() == DataTypeImpl::GetType<float>() ||
+       context.AdapterInfo().vendor == std::string_view{"qualcomm"}) &&
+      CanApplyDP4AMatrixMatMulNBits(context, accuracy_level_, block_size, batch_count, N, K, components_a, has_zero_points)) {
     return ApplyDP4AMatrixMatMulNBits(a, b, scales, M, N, K, block_size, kMinMForTileOptimization, context, y);
   }
```
0 commit comments

Comments
 (0)
Please sign in to comment.