Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads

Rahimian, Ali K.; Govind, Manish K.; Maity, Subhajit; Reilly, Dominick; Kümmerle, Christian; Das, Srijan; Dutta, Aritra

arXiv:2406.19391 (cs)

[Submitted on 27 Jun 2024 (v1), last revised 12 Feb 2026 (this version, v4)]

Abstract: Vision Transformers and their variants have achieved remarkable success in diverse visual perception tasks. Despite their effectiveness, they suffer from two significant limitations: first, the quadratic computational complexity of multi-head self-attention (MHSA), which restricts scalability to large token counts; and second, a high dependency on large-scale training data to attain competitive performance. To address these challenges, we propose a novel sparse self-attention mechanism named Fibottention. Fibottention employs structured sparsity patterns derived from the Wythoff array, enabling $\mathcal{O}(N \log N)$ computational complexity in self-attention. By design, its sparsity patterns vary across attention heads, which provably reduces redundant pairwise interactions while ensuring sufficient and diverse coverage. This leads to an inception-like functional diversity in the attention heads and promotes more informative and disentangled representations. We integrate Fibottention into standard Transformer architectures and conduct extensive experiments across multiple domains, including image classification, video understanding, and robot learning. Results demonstrate that models equipped with Fibottention either significantly outperform or perform on par with their dense MHSA counterparts while using only $2$-$6\%$ of the pairwise interactions in self-attention heads (about $2\%$ in typical settings), resulting in substantial computational savings. Moreover, when compared to existing sparse attention mechanisms, Fibottention consistently achieves superior results on a FLOP-equivalency basis. Finally, we provide an in-depth analysis of the enhanced feature diversity resulting from our attention design and discuss its implications for efficient representation learning.
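As a rough illustration of the idea of head-specific sparsity patterns derived from the Wythoff array, the sketch below assigns each head one row of the array and lets a token pair (i, j) interact only when |i - j| is zero or an entry of that row. This is an assumption made for illustration, not the construction from the paper: the abstract does not specify how rows map to heads or how patterns are bounded, and the names `wythoff_row`, `head_pairs`, and `density` are hypothetical. Because every row grows like a Fibonacci sequence, each head keeps only about log N diagonals, which gives on the order of N log N retained pairs; and because every positive integer appears exactly once in the Wythoff array, different heads use disjoint offsets.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def wythoff_row(m, limit):
    """Return the entries of row m (1-indexed) of the Wythoff array below `limit`."""
    a = math.floor(m * PHI)
    first = math.floor(a * PHI)  # W(m, 1) = floor(floor(m * PHI) * PHI)
    second = first + a           # W(m, 2) = floor(a * PHI**2), since PHI**2 = PHI + 1
    row = []
    x, y = first, second
    while x < limit:
        row.append(x)
        x, y = y, x + y          # Fibonacci-style recurrence along the row
    return row


def head_pairs(head, n_tokens):
    """Set of (i, j) pairs kept by a head (0-indexed) in this illustrative scheme.

    Assumption: head h uses Wythoff row h + 1 as its set of diagonal offsets,
    applied symmetrically, plus the main diagonal.
    """
    offsets = wythoff_row(head + 1, n_tokens)
    pairs = {(i, i) for i in range(n_tokens)}
    for d in offsets:
        for i in range(n_tokens - d):
            pairs.add((i, i + d))
            pairs.add((i + d, i))
    return pairs


def density(head, n_tokens):
    """Fraction of all n_tokens**2 pairs that the head keeps."""
    return len(head_pairs(head, n_tokens)) / n_tokens ** 2


if __name__ == "__main__":
    n = 196  # e.g. a 14 x 14 patch grid; chosen only for illustration
    for h in range(3):
        print(h, wythoff_row(h + 1, n), round(density(h, n), 4))
```

The densities printed by this toy scheme are not expected to match the $2$-$6\%$ figures reported in the abstract, since the actual method may restrict or modify the offsets in ways not described here.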

Comments:	The complete implementation, including source code and evaluation scripts, is publicly available at: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.19391 [cs.CV]
	(or arXiv:2406.19391v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.19391

Submission history

From: Ali Khaleghi Rahimian
[v1] Thu, 27 Jun 2024 17:59:40 UTC (1,756 KB)
[v2] Tue, 17 Dec 2024 05:37:37 UTC (2,396 KB)
[v3] Fri, 20 Dec 2024 02:12:06 UTC (2,396 KB)
[v4] Thu, 12 Feb 2026 22:21:03 UTC (1,811 KB)
