The Vision Transformer (ViT) has shown strong performance in video analysis tasks. However, conventional frame-by-frame processing results in substantial redundant computation, since spatiotemporally similar content is reprocessed across frames. Existing methods often attach extra modules to detect spatiotemporally redundant regions and skip their computation, but the detection step itself incurs additional computation and latency. To address this, we propose the Encoding Prior Vision Transformer (EP-ViT), which leverages prior information already present in the video encoding to identify redundant regions at no additional computational cost. Furthermore, we introduce a redundancy-aware attention mechanism that reuses tokens whose features are unchanged across frames. Experimental results demonstrate that EP-ViT reduces the Transformer's computational cost by 58.91% without compromising accuracy, surpassing state-of-the-art methods.
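The core idea of redundancy-aware attention — recomputing attention only for tokens flagged as changed and reusing cached outputs for the rest — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `changed` mask is assumed to come from encoding priors (e.g., codec motion or residual information), and all names, shapes, and the single-head attention are simplifications chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def redundancy_aware_attention(tokens, cached_out, changed):
    """Single-head attention that only recomputes changed tokens.

    tokens:     (n, d) current-frame token features
    cached_out: (n, d) attention outputs cached from the previous frame
    changed:    (n,)   boolean mask of tokens whose features changed
    """
    out = cached_out.copy()          # unchanged tokens reuse cached outputs
    if changed.any():
        q = tokens[changed]          # queries only for changed tokens
        scores = q @ tokens.T / np.sqrt(tokens.shape[1])
        out[changed] = softmax(scores) @ tokens   # keys/values over all tokens
    return out
```

Because queries are formed only for the changed subset, the dominant attention cost scales with the number of changed tokens rather than the full token count, which is the source of the computational savings the abstract describes.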