Improve Shape Inference for GQA #24143


Merged: 2 commits into microsoft:main on Mar 28, 2025

Conversation

peishenyan (Contributor)

Description

For the GroupQueryAttention op, if the input total_sequence_length is a constant, we can infer the shape of the present_key/present_value outputs as (batch_size, kv_num_heads, present_sequence_length, head_size).

int present_sequence_length = std::max(total_sequence_length, past_sequence_length);

(from https://github.com/microsoft/onnxruntime/blob/5ed900e9712ce2f02e40c15b945d18453d1960d8/onnxruntime/contrib_ops/cpu/bert/group_query_attention_helper.h#L185)

From the CPU EP we know that present_sequence_length = max(past_sequence_length, total_sequence_length), and that batch_size, kv_num_heads, and head_size are the same as for past_key/past_value.

This inference is important for the WebNN EP, because WebNN only supports GQA when present_sequence_length == past_sequence_length and requires static shapes for graph compilation. A sketch of the inference rule is shown below.
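As a minimal sketch of the rule this PR implements (not the actual ONNX Runtime shape-inference code; the helper name and fixed-rank signature below are assumptions for illustration), the present_key/present_value shape can be computed from the past_key/past_value shape and a constant total_sequence_length like this:

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Hypothetical helper illustrating the inference rule. Inputs:
//   past_kv_shape: (batch_size, kv_num_heads, past_sequence_length, head_size)
//   total_sequence_length: a constant value read from the graph.
// Returns the inferred shape of present_key/present_value.
std::array<int64_t, 4> InferPresentKvShape(
    const std::array<int64_t, 4>& past_kv_shape,
    int64_t total_sequence_length) {
  const int64_t past_sequence_length = past_kv_shape[2];
  // Mirrors the CPU EP rule: present = max(past, total).
  const int64_t present_sequence_length =
      std::max(past_sequence_length, total_sequence_length);
  // batch_size, kv_num_heads, and head_size carry over from past_key/value.
  return {past_kv_shape[0], past_kv_shape[1],
          present_sequence_length, past_kv_shape[3]};
}
```

For example, with past_kv_shape = (1, 8, 128, 64) and total_sequence_length = 160, the inferred present shape is (1, 8, 160, 64); and whenever total_sequence_length <= past_sequence_length, the present shape equals the past shape, which is the case WebNN supports.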


@tianleiwu (Contributor)

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline,ONNX Runtime Web CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline

@tianleiwu (Contributor)

/azp run Linux QNN CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline


Azure Pipelines successfully started running 8 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@tianleiwu tianleiwu merged commit c756e0a into microsoft:main Mar 28, 2025
95 of 101 checks passed
quic-zhaoxul pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Apr 17, 2025