The Vision Transformer (ViT) has shown strong performance in video analysis tasks. However, conventional frame-by-frame processing results in substantial redundant computation, since spatiotemporally similar content is reprocessed across frames. Existing methods often attach extra modules to detect spatiotemporally redundant regions and skip their computation, but the detection step itself incurs additional computation and latency. To address this, we propose the Encoding Prior Vision Transformer (EP-ViT), which leverages prior information already present in the video encoding to identify redundant regions at no additional computational cost. Furthermore, we introduce a redundancy-aware attention mechanism that reuses tokens whose features are unchanged across frames. Experimental results demonstrate that EP-ViT reduces the Transformer's computational cost by 58.91% without compromising accuracy, surpassing state-of-the-art methods.
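The core idea of redundancy-aware attention — recomputing attention only for tokens flagged as changed and reusing cached outputs for the rest — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `changed` mask is assumed to come from encoding priors (e.g., codec motion or residual information), and all names, shapes, and the single-head attention are simplifications chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def redundancy_aware_attention(tokens, cached_out, changed):
    """Single-head attention that only recomputes changed tokens.

    tokens:     (n, d) current-frame token features
    cached_out: (n, d) attention outputs cached from the previous frame
    changed:    (n,)   boolean mask of tokens whose features changed
    """
    out = cached_out.copy()          # unchanged tokens reuse cached outputs
    if changed.any():
        q = tokens[changed]          # queries only for changed tokens
        scores = q @ tokens.T / np.sqrt(tokens.shape[1])
        out[changed] = softmax(scores) @ tokens   # keys/values over all tokens
    return out
```

Because queries are formed only for the changed subset, the dominant attention cost scales with the number of changed tokens rather than the full token count, which is the source of the computational savings the abstract describes.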