Describe the feature request
We propose adding a fusion transformer to reconstruct Group Query Attention (GQA) nodes that get decomposed during WebNN EP graph processing.
Describe scenario use case
In #23416, the WebNN EP breaks one GQA node down into smaller primitive ops to meet WebNN API constraints, which prevents optimal hardware utilization. Here is the decomposed WebNN subgraph:
[image: decomposed WebNN subgraph]
Introducing a fusion transformer would reduce the operator count and enable hardware-accelerated attention kernels for WebNN-produced models.
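To make the idea concrete, here is a minimal sketch of pattern-based fusion on a toy graph. It is not ONNX Runtime code: the `Node` class and `fuse_attention` function are hypothetical, and the `MatMul -> Softmax -> MatMul` chain is an assumed, simplified stand-in for the real decomposed subgraph from #23416, which contains more ops and inputs than shown here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy graph node (hypothetical, not an ONNX Runtime type)."""
    op_type: str
    inputs: list
    outputs: list


def fuse_attention(nodes):
    """Collapse each MatMul -> Softmax -> MatMul chain into one fused node.

    `nodes` is assumed to be in topological order. The fused node is
    labelled "GroupQueryAttention" purely for illustration.
    """
    producer = {out: n for n in nodes for out in n.outputs}
    use_count = {}
    for n in nodes:
        for name in n.inputs:
            use_count[name] = use_count.get(name, 0) + 1

    used = set()      # ids of nodes already claimed by a match
    dropped = set()   # ids of nodes to delete from the graph
    fused = {}        # id of tail node -> replacement fused node
    for tail in nodes:
        if tail.op_type != "MatMul" or len(tail.inputs) < 2:
            continue
        soft = producer.get(tail.inputs[0])
        if soft is None or soft.op_type != "Softmax":
            continue
        head = producer.get(soft.inputs[0])
        if head is None or head.op_type != "MatMul":
            continue
        # Skip overlapping matches so no node is fused twice.
        if {id(head), id(soft), id(tail)} & used:
            continue
        # Intermediate tensors must not be consumed anywhere else.
        if use_count[head.outputs[0]] != 1 or use_count[soft.outputs[0]] != 1:
            continue
        used |= {id(head), id(soft), id(tail)}
        dropped |= {id(head), id(soft)}
        fused[id(tail)] = Node(
            "GroupQueryAttention",
            head.inputs + [tail.inputs[1]],  # Q, K, then V
            tail.outputs,
        )

    # Rebuild the node list, keeping the original order.
    return [fused.get(id(n), n) for n in nodes if id(n) not in dropped]


graph = [
    Node("MatMul", ["q", "k"], ["scores"]),
    Node("Softmax", ["scores"], ["probs"]),
    Node("MatMul", ["probs", "v"], ["out"]),
    Node("Add", ["out", "bias"], ["y"]),
]
fused_graph = fuse_attention(graph)
print([n.op_type for n in fused_graph])
```

A real transformer would also have to match the remaining ops of the decomposition, check shapes and attributes, and recover the extra inputs a GQA node needs. The single-consumer check is the part that carries over directly: fusion is only safe when no other node reads the intermediate tensors.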
/cc @Honry @huningxin