1 Introduction

Video Action Recognition (VAR) is an important computer vision task with real-world applications ranging from security cameras to healthcare monitoring [1]. To date, there has been great interest in the design of efficient VAR algorithms. Motivated by their state-of-the-art performance in image classification, object detection, image segmentation, etc. [2], several previous works have proposed using Convolutional Neural Networks (CNNs) for VAR [3, 4]. Unlike CNNs for image-domain tasks, CNN-powered VAR needs to properly exploit the extra temporal dimension, since there are very small changes between adjacent frames of a video (see Fig. 1). To that end, researchers have accommodated this temporal dimension, leading to several state-of-the-art networks such as C3D [3], R(2+1)D [4], Slow [5], TSM [6], etc. However, despite their state-of-the-art performance, these networks remain far from hardware deployment: they are resource-hungry, take weeks to train, and can barely process streams in real time on smaller, resource-constrained hardware (such as Nvidia Jetson TX2 and AGX Xavier GPUs). Hence, their current achievements remain beyond the scope of real-life deployment. Consequently, there is an urgent need to make such networks amenable to resource-constrained devices, in turn enabling small devices such as autonomous drones to navigate and perform crucial tasks using local resources.

Fig. 1

Four sets of 6 adjacent frames of four videos taken from UCF-101 dataset [7]. From the top: fencing, archery, skateboarding, and frisbee catch.

The CNNs that are usually used in video-based action recognition are called 3D-CNNs, as they contain kernels that are essentially 3-dimensional tensors. They are designed in this format because they process the temporal dimension alongside the spatial dimensions, taking advantage of the similarities between neighboring video frames as well as neighboring pixels in the individual frames. There are several popular methods for speeding up the execution of CNNs, such as network pruning, quantization, knowledge distillation, etc. [8]. In this paper, we focus on network pruning for 3D-CNNs, which in itself is a non-trivial extension of 2D-CNN network pruning, since it requires a deep understanding of the temporal dimension introduced by the video format. We present our carefully designed approach for pruning 3D-CNN networks and evaluate it using an independence score at both filter-wise and depth-wise granularity. Overall, our contributions are summarized as follows:

  • We propose a pruning procedure that removes the weights with low independence scores, leading to practical acceleration (5 times faster) with minor accuracy degradation.

  • We propose pruning at filter-wise and depth-wise granularity, as well as their combination, giving finer-grained control than prior pruning methods.

  • We propose both one-shot and iterative versions of our pruning method that explore the granularity in the time domain.

  • We evaluate our approach on public datasets (UCF-101 [7] and HMDB-51 [9]) using well-known 3D-CNN architectures (C3D [3], R(2+1)D [4], Slow [5], and TSM [6]).

2 Related Work

3D CNN architecture

has been used in the literature primarily for video-processing tasks such as action recognition, semantic segmentation, and temporal action localization. For video action recognition, C3D [3], R(2+1)D [4], Slow [5], and TSM [6] are well-known architectures. Hara et al. [10] conducted a study on the various 3D-CNN architectures and the corresponding activity recognition datasets to determine the current progress in the field as well as the sufficiency of current dataset sizes for training advanced networks. Furthermore, 3D-CNNs have been proposed for contrastive learning [11], for end-to-end learning in self-driving cars [12], and for recording and exploiting long-term temporal context in camera feeds [13]. Continual 3D-CNNs [14] reuse preexisting 3D-CNN weights to reduce the per-prediction FLOPs. Overall, 3D-CNNs have been popularly adopted as a way to capture the temporal dimension of video, and have shown great success in doing so. Our pruning method takes this into account and achieves comparable performance with far fewer neural connections.

Network Pruning

is an active field of research dealing with sparsifying neural networks and reducing their computational footprint [15, 16], focusing mainly on 2D deep neural networks. Network pruning methods have shown their effectiveness by reducing neural network complexity by 10-100x with only a slight decrease in accuracy [15]. Essentially, pruning reduces computational complexity and regularizes the underlying network, thereby avoiding overfitting and achieving better generalization. In general, there are two major pruning strategies: i) unstructured pruning and ii) structured pruning. Unstructured pruning [17, 18] usually takes the form of a regularization function (e.g., the L1 or L2 norm) that seeks to zero out individual weights. However, such pruning does not achieve optimal speed-up on off-the-shelf hardware platforms like Graphics Processing Units (GPUs). This is because GPUs are designed to accelerate structured matrix multiplication, which is absent in unstructured sparse networks. On the other hand, structured pruning [19,20,21] prunes the network on a structural basis, eliminating entire layers, channels, or filters (essentially bypassing several matrix multiplications). Therefore, structured pruning brings considerable speed-up gains on GPUs. Our approach is a structured pruning method in which we prune at both filter-wise and depth-wise granularity using an independence score, enabling significant speedup.

Structured Pruning

for image classification is an active research area. For example, Wen et al. [22] use convolutional filters to determine and prune redundant channels. Liu et al. [23] view the scaling factor as a criterion for each channel and impose sparsity regularization on the factors. Molchanov et al. [24] use the Taylor expansion to estimate the importance of a filter to the final output and remove filters with small scores. DMCP [25], MetaPruning [26] use complicated rules to generate the sparsity distribution at the channel level. To reduce the disturbance of irrelevant factors, Tang et al. [27] provide a scientific control scale setting, which helps to choose the characteristics that should be removed. Lin et al. [28] use the rank of each filter to determine the importance of the corresponding filter. [29] incorporates sensitivity feedback during the training procedure. [30] first learns a target sub-network during the model training process and then uses this sub-network to guide the learning of model weights through partial regularization. [31] introduces a unified and principled framework based on information bottleneck theory.

Other works [32,33,34,35,36] also explore similar strategies for pruning 3D CNNs; however, the majority of these existing works target a single level of granularity (e.g., the filter or channel level). In contrast, our approach is applicable to both filter-wise and depth-wise granularity. In addition, we investigate both one-shot and iterative versions of our pruning method, leading to versatility and a substantial reduction in network size.

3 Methodology

We start by describing the preliminaries of 3D-CNNs and the filter pruning problem in Sec. 3.1. Then, we discuss the motivation of our approach in Sec. 3.2 and propose our pruning metric in Sec. 3.3. Moreover, different granularity schemes for pruning are presented to enhance the performance in Sec. 3.4. Lastly, combined with a gradual pruning strategy, we detail our overall algorithm in Sec. 3.5.

3.1 Preliminaries

For a 3D CNN model with L layers, the l-th convolutional layer \(\varvec{\mathcal {W}}^l = \{\varvec{\mathcal {F}}^l_1,\varvec{\mathcal {F}}^l_2, \) \(\cdots ,\varvec{\mathcal {F}}^l_{N}\} \in \mathbb {R}^{N\times C\times K_h \times K_w \times K_d}\) contains N filters \(\varvec{\mathcal {F}}^l_{i}\in \mathbb {R}^{C \times K_h \times K_w \times K_d}\), where N, C, \(K_h\), \(K_w\), \(K_d\) denote the number of filters, the number of input channels, filter-height, filter-width, and filter-depth, respectively. In general, the filter pruning problem can be formulated as,

$$\begin{aligned} \begin{aligned} \min _{\{\varvec{\mathcal {W}^l}\}_{l=1}^L}~~\mathcal {L}(\varvec{\mathcal {Y}},f(\varvec{\mathcal {X}},\varvec{\mathcal {W}}^l)), \text {s.t.}~~~&\Vert \varvec{\mathcal {W}}^l \Vert _0\le \tau ^l, \end{aligned} \end{aligned}$$
(1)

where \(\mathcal {L}(\cdot ,\cdot )\) denotes the loss function, \(\varvec{\mathcal {Y}}\) is the ground-truth label, \(\varvec{\mathcal {X}}\) is the input data, and \(f(\cdot , \cdot )\) is the output function of 3D CNN model with layers \(\{\varvec{\mathcal {W}}^l\}_{l=1}^L\). In addition, \( \Vert \cdot \Vert _0\) represents the \(\ell _0\)-norm that measures the number of non-zero filters in the l-th layer, and \(\tau ^l\) is the maximum number of filters to be maintained in the l-th layer.

Fig. 2

Procedure of vectorizing a 3D convolutional layer associated with two granularity schemes. The red line denotes the filter-wise scheme. The blue line denotes the depth-wise scheme.

3.2 Motivation

The key to solving a structured pruning problem is to determine the importance level of each filter. Under well-studied pruning routines for 2D CNN models, many previous works propose various methods to determine this importance factor [22, 23, 37]. For example, He et al. [37] consider the geometric similarity among filters and prune the most similar ones (based on the intuition that similar filters encode similar information). However, the newly introduced temporal dimension in 3D-CNNs means that directly adapting 2D-CNN pruning approaches to the 3D scenario fails to achieve good performance, because 2D pruning operates at the coarse granularity of the channel level without considering the depth level. To truly provide a suitable solution for video recognition, it is necessary to investigate the differences between the 2D and 3D modalities.

An important difference between video-based recognition and image-based classification and object detection is that there is substantial redundancy between the adjacent frames of a video. For example, as shown in Fig. 1, two contiguous frames \(\varvec{\mathcal {X}_i}\), \(\varvec{\mathcal {X}_{i+1}} \in \mathbb {R}^{H \times W \times K} \), where H, W, and K denote the height, width, and number of channels of the frame, respectively, are very similar. We can observe that most pairs of frames satisfy \(\Vert \varvec{\mathcal {X}_i} - \varvec{\mathcal {X}_{i+1}}\Vert \le \epsilon \), where \(\epsilon \) is a small similarity constant. Observing such negligible changes between adjacent frames, Sevilla et al. [38] utilize optical flow to compress the input frames. However, directly transferring this frame-level similarity to filter pruning along the depth dimension is not optimal: even though adjacent video frames are quite similar, the same may not hold for the weights after training.
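The adjacent-frame similarity above can be made concrete with a minimal sketch; the toy clip (a square drifting one pixel per frame) is purely illustrative and only stands in for the UCF-101 frames of Fig. 1:

```python
import numpy as np

def adjacent_frame_gaps(frames):
    """Frobenius-norm gap ||X_i - X_{i+1}|| between each adjacent frame pair.

    frames: array of shape (T, H, W, K) -- T frames of height H, width W,
    with K channels.
    """
    return [float(np.linalg.norm(frames[i] - frames[i + 1]))
            for i in range(len(frames) - 1)]

# Toy clip: a bright square drifting one pixel per frame, so adjacent
# frames differ far less than temporally distant ones.
clip = np.zeros((6, 32, 32, 3))
for t in range(6):
    clip[t, 8 + t:16 + t, 8:16, :] = 1.0

gaps = adjacent_frame_gaps(clip)
# Every adjacent pair is much closer than the first and last frames,
# i.e., a small epsilon bounds the adjacent gaps but not the distant one.
assert max(gaps) < float(np.linalg.norm(clip[0] - clip[5]))
```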

On the other hand, inspired by [39], we hypothesize that, under ideal circumstances, for a 3D network with maximum representative capacity, the adjacent weights, at the spatial or temporal level, tend to be independent of each other, i.e., each weight plays its own independent role rather than being easily represented by other weights. The corollary of this statement is that a network with all the adjacent weights independent of each other is the one that has the maximum representative capacity for the minimum number of non-zero weights. If that is the case, we can exploit this by creating a network (through pruning) whose maximum representative capacity approaches the capacity required to represent the dataset it is trained on, i.e., using the minimum number of independent weights required to capture the distribution of the training dataset. This constitutes the spirit of our approach: we use the independence score to represent the importance of each filter or depth slice and create a pruned version of the network. The intuition here is that when one filter of a layer is heavily linearly dependent on other filters, its contained information has largely been encoded in those other filters. As a result, even if this filter is removed, the represented information and knowledge can still be roughly re-created by the other filters following the fine-tuning stage. In other words, filters with low independence scores are replaceable.

3.3 Proposed Pruning Metric

To remove the most linearly dependent filters, inspired by [39], we leverage an Independence Score (IS)-based method to prune 3D CNNs. The entire set of filters from one layer forms a high-dimensional tensor, and we propose to extract the linear dependence information of each filter using the principles of linear algebra. To be specific, given the set of filters from the l-th layer \(\varvec{\mathcal {W}}^l\), we first matricize \(\varvec{\mathcal {W}}^l\) to \(\varvec{W}^l = [{\varvec{w}_1^l}, {\varvec{w}_2^l}, \cdots , {\varvec{w}_{N}^l}] \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\), where the row vector \(\varvec{w}_i^l \in \mathbb {R}^{CK_hK_wK_d}\) is the vectorized filter \(\varvec{\mathcal {F}}^l_i\). In this form, the linear independence of each vectorized filter \(\varvec{w}_i^l\), as a row of the matricized set of filters \(\varvec{W}^l\), can be measured via existing matrix analysis tools. Because the rank reflects the maximum number of linearly independent rows/columns of a matrix, the most straightforward solution is to use the rank to determine the independence of \(\varvec{w}_i^l\). For example, we can remove one row from the matrix, calculate the rank change, and then determine the influence and relevance of the deleted row. The lower the rank change, the lower the independence (and importance) of the removed row.
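A minimal sketch of this rank-change test, using a synthetic matrix rather than trained weights: one row is constructed as an exact linear combination of two others, so zeroing it leaves the rank unchanged while zeroing an independent row drops it by one.

```python
import numpy as np

def rank_change(W, i):
    """Rank drop when the i-th row (vectorized filter) is zeroed out."""
    masked = W.copy()
    masked[i] = 0.0
    return np.linalg.matrix_rank(W) - np.linalg.matrix_rank(masked)

rng = np.random.default_rng(0)
# Four independent rows plus one that is a linear combination of the first two.
W = rng.normal(size=(4, 10))
W = np.vstack([W, 2.0 * W[0] - W[1]])

# The dependent row contributes nothing to the rank (score 0), while an
# independent row contributes exactly one (score 1).
assert rank_change(W, 4) == 0
assert rank_change(W, 2) == 1
```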

For the l-th layer with weights \(\varvec{\mathcal {W}}^l = \{ \varvec{W}^l_1,\varvec{W}^l_2,\cdots ,\varvec{W}^l_{N} \} \in \mathbb {R}^{{N} \times C \times {K_h} \times {K_w} \times {K_d}}\), the IS of the i-th filter \(\varvec{W}_i^l \!\in \! \mathbb {R}^{C \!\times \! {K_h} \!\times \! {K_w} \!\times \! {K_d}}\) is defined and calculated as:

$$\begin{aligned} \begin{aligned} IS(\varvec{W}^l_i) \triangleq \text {Rank} ( \varvec{W}^l ) - \text {Rank} (\varvec{M}_i^l \odot \varvec{W}^l ), \end{aligned} \end{aligned}$$
(2)

where \(\varvec{W}^l \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\) is the matricized \(\varvec{\mathcal {W}}^l\), \(\odot \) is the Hadamard product, and \(\varvec{M}_i^l \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\) is the row mask matrix whose i-th row entries are zeros and other entries are ones. However, the rank cannot distinguish the different contributions to independence from different filters, as it only measures overall independence and can only coarsely assign each filter a contribution of 1 or 0. To better differentiate the scores of filters, we relax the rank and reformulate the definition of IS with the nuclear norm instead of the rank, as shown in Fig. 3:

$$\begin{aligned} \begin{aligned} IS(\varvec{W}^l_i) \triangleq \Vert \varvec{W}^l \Vert _* - \Vert \varvec{M}_i^l \odot \varvec{W}^l \Vert _*, \end{aligned} \end{aligned}$$
(3)

where \(\Vert \cdot \Vert _*\) denotes the nuclear norm. This affords us a more fine-grained way to represent the contribution of independence from different filters and enables us to rank them according to their relative independence.
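The nuclear-norm relaxation of Eq. (3) can be sketched as follows; the synthetic matrix stands in for a matricized layer, with one row built as a near-duplicate of another to mimic a redundant filter:

```python
import numpy as np

def independence_scores(W):
    """IS of each row of W: the nuclear-norm drop when that row is masked (Eq. 3)."""
    full = np.linalg.norm(W, ord='nuc')
    scores = np.empty(W.shape[0])
    for i in range(W.shape[0]):
        masked = W.copy()
        masked[i] = 0.0          # the mask M_i zeroes the i-th row
        scores[i] = full - np.linalg.norm(masked, ord='nuc')
    return scores

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 12))                           # four independent "filters"
W = np.vstack([W, W[0] + 0.01 * rng.normal(size=12)])  # near-duplicate of filter 0

scores = independence_scores(W)
# The two near-duplicate rows share their information, so each receives a
# much smaller (continuous) score than the genuinely independent rows --
# exactly the fine-grained ranking the rank alone cannot provide.
assert min(scores[0], scores[4]) < min(scores[1], scores[2], scores[3])
```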

Fig. 3

Procedure of calculating IS from a 3D convolutional layer.

3.4 Granularity Schemes

Apart from the general framework described above for calculating IS, we further consider different granularity schemes for the pruning criterion. Specifically, IS can be calculated either in a filter-wise mode (the Filter-wise Independence Score (FIS), which generates N scores) or in a depth-wise mode (the Depth-wise Independence Score (DIS), which generates \(K_d\) scores), reflecting two different perspectives on modeling the information load. FIS is intuitive in that it maps out the importance of filters and mitigates the dependence among them. DIS, on the other hand, fully leverages the similarity of adjacent frames and removes redundancy among depth-wise weights, implicitly deleting model redundancy arising from similar adjacent frames.

3.4.1 Filter-wise Independence Score

Equation 3 describes the general framework for calculating IS shown in Figs. 2 and 3. FIS can be calculated in a filter-wise way as:

$$\begin{aligned} \begin{aligned} FIS(\varvec{W}^l) \triangleq \Vert \varvec{W_f}^l \Vert _* - \Vert \varvec{M}_f^l \odot \varvec{W_f}^l \Vert _*, \end{aligned} \end{aligned}$$
(4)

where \(\varvec{W}_f^l \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\) is the matricized \(\varvec{\mathcal {W}}^l\), \(\odot \) is the Hadamard product, and \(\varvec{M}_f^l \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\) is the row mask matrix whose i-th row entries are zeros and other entries are ones.
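The filter-wise matricization and scoring of Eq. (4) can be sketched as below; the layer shape is hypothetical, and the masking reuses the nuclear-norm drop from Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, Kh, Kw, Kd = 8, 3, 3, 3, 3
W5d = rng.normal(size=(N, C, Kh, Kw, Kd))  # weights of one 3D convolutional layer

# Filter-wise matricization (the red path in Fig. 2): one row per filter.
Wf = W5d.reshape(N, C * Kh * Kw * Kd)

# FIS of filter i: nuclear-norm drop when its row is masked, as in Eq. (4).
full = np.linalg.norm(Wf, ord='nuc')
fis = np.empty(N)
for i in range(N):
    masked = Wf.copy()
    masked[i] = 0.0
    fis[i] = full - np.linalg.norm(masked, ord='nuc')

assert Wf.shape == (8, 81) and fis.shape == (8,)  # N scores, one per filter
```

The N scores are then sorted, and the filters with the smallest FIS are pruned first.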

Fig. 4

Iterative pruning scheme: x-axis denotes the number of epochs and y-axis denotes the pruning level. It is noted that the pruning level does not equal the target pruning ratio.

3.4.2 Depth-wise Independence Score

Based on Eq. 3 and as shown in Fig. 2, DIS is formulated as:

$$\begin{aligned} \begin{aligned} DIS(\varvec{W}^l) \triangleq \Vert \varvec{W_d}^l \Vert _* - \Vert \varvec{M}_{d}^l \odot \varvec{W_d}^l \Vert _*, \end{aligned} \end{aligned}$$
(5)

where \(\varvec{W}_d^l \in \mathbb {R}^{{K_d}\times {NCK_hK_w}}\) is the matricized \(\varvec{\mathcal {W}}^l\), \(\odot \) is the Hadamard product, and \(\varvec{M}_d^l \in \mathbb {R}^{{K_d}\times {NCK_hK_w}}\) is the row mask matrix whose i-th row entries are zeros and other entries are ones.
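The depth-wise matricization differs from the filter-wise one only in which axis becomes the rows; a sketch with the same hypothetical layer shape:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, Kh, Kw, Kd = 8, 3, 3, 3, 3
W5d = rng.normal(size=(N, C, Kh, Kw, Kd))

# Depth-wise matricization (the blue path in Fig. 2): one row per temporal slice.
Wd = np.moveaxis(W5d, -1, 0).reshape(Kd, N * C * Kh * Kw)

# DIS of depth slice d: nuclear-norm drop when that slice is masked (Eq. 5).
full = np.linalg.norm(Wd, ord='nuc')
dis = np.empty(Kd)
for d in range(Kd):
    masked = Wd.copy()
    masked[d] = 0.0
    dis[d] = full - np.linalg.norm(masked, ord='nuc')

assert Wd.shape == (3, 216) and dis.shape == (3,)  # K_d scores, one per slice
```

A depth slice whose weights are nearly expressible from the other slices (e.g., because adjacent frames carried redundant information into training) receives a low DIS and is removed first.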

3.4.3 Overall Independence Score

By combining Eqs. 4 and 5 and as shown in Fig. 2, the overall Filter/Depth-wise Independence Score (FDIS) can be formulated as:

$$\begin{aligned} \begin{aligned} FDIS(\varvec{W}^l) \triangleq \lambda *DIS(\varvec{W}^l) + (1 - \lambda ) *FIS(\varvec{W}^l), \end{aligned} \end{aligned}$$
(6)

where \(\lambda \in [0,1]\) is a control parameter to balance the influence of FIS and DIS in the final score.

3.5 Iterative Pruning Scheme

After defining the pruning metric and the granularity schemes, we now focus on the manner in which IS is calculated. Our proposed method, as described above, identifies filter importance via a one-shot FDIS calculation. A natural extension of this one-shot calculation is to further adjust the importance ranking via additional learning. Inspired by Zhu et al. [40], we adopt a greedy algorithm that prunes iteratively, removing a fixed percentage at each step.

To be specific, we calculate the FDIS every t epochs from the pre-trained model over a total of T epochs. After each calculation of FDIS, we prune filters according to a fixed percentage of the target pruning ratio, as shown in Eq. 7,

$$\begin{aligned} \begin{aligned} \tau ^t = \kappa ^t \times \tau , \end{aligned} \end{aligned}$$
(7)

where \(\tau ^t\) is the pruning ratio applied at the t-th epoch, \(\kappa ^t\) is the remaining level at the t-th epoch shown in Fig. 4, and \(\tau \) is the final target pruning ratio.
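A sketch of the gradual schedule in Eq. (7): \(\kappa ^t\) is taken here as a linear ramp purely for illustration, whereas the paper's actual schedule is the curve shown in Fig. 4; the epoch counts are likewise hypothetical.

```python
def schedule(tau, total_epochs, step):
    """Per-step pruning ratios tau^t = kappa^t * tau (Eq. 7), with a linear kappa."""
    steps = total_epochs // step
    return [(k + 1) / steps * tau for k in range(steps)]

ratios = schedule(tau=0.5, total_epochs=300, step=60)
assert abs(ratios[-1] - 0.5) < 1e-9                    # final step hits the target
assert all(a < b for a, b in zip(ratios, ratios[1:]))  # monotone ramp
```

Pruning a small fraction at each step, with fine-tuning in between, lets the surviving filters absorb the role of the removed ones gradually rather than all at once.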

4 Performance Evaluation

In this section, we detail the evaluation of our approach and compare it with baseline pruning approaches. In Sec. 4.1, we describe the baseline models and datasets that we use for evaluation, as well as our experimental configuration. We then discuss the performance of our pruning method on various neural networks on the UCF-101 and HMDB-51 datasets in Secs. 4.2 and 4.3, respectively.

4.1 Experimental Setup

Baselines Models and Datasets

We evaluate the performance of our algorithms on several representative convolutional neural network architectures: C3D [3], R(2+1)D [4], Slow [5], and TSM [6]; and on popular video action recognition datasets: UCF-101 [7] and HMDB-51 [9], reporting the speedup and accuracy of our proposed method. We demonstrate the efficacy of our pruning approach by compressing convolutional layers with the Filter-wise Independence Score with one-shot pruning, the Filter/Depth-wise Independence Score with one-shot pruning, and the Filter/Depth-wise Independence Score with iterative pruning. All the neural networks in question are pre-trained on the Kinetics dataset [43] and fine-tuned on the UCF-101 [7] and HMDB-51 [9] datasets. After fine-tuning, the pruned networks are evaluated via top-1 accuracy on the video recognition datasets.

Table 1 Results for C3D and R(2+1)D model on UCF-101 dataset.

UCF-101

The UCF-101 [7] dataset includes 13320 videos from 101 action categories. This dataset provides the largest diversity in terms of actions along with considerable variations in camera motion, object appearance and poses, object scales, viewpoints, illumination conditions, etc.

HMDB-51

The HMDB-51 [9] dataset contains 6849 clips divided into 51 action categories, and each category contains at least 101 video clips. Action categories can be grouped into four types: general facial actions, general body movements, body movements with object interaction, and body movements for human interaction.

Experimental Configuration

We conduct all our experiments on Nvidia A6000 and 2080 Ti GPUs with the PyTorch [44] framework. After filter pruning, we fine-tune the pruned models with Stochastic Gradient Descent (SGD) as the optimizer. Specifically, the batch size and clip length are fixed to the default training settings. The initial learning rate is set to \(10^{-3}\) to train the pre-trained model, and is then reduced to \(2\times 10^{-4}\) in the pruning and fine-tuning stages. The learning rate is fixed during the pruning process and adjusted in the fine-tuning phase with a cosine scheduler. For all granularity schemes and pruning algorithms, the total number of epochs is fixed at 300. The balance factor \(\lambda \) combining the filter-wise and depth-wise independence scores, FIS and DIS, is set to 0.5.

Table 2 Results for C3D model on HMDB-51 dataset.

4.2 UCF-101 Dataset

In this section, we discuss in detail the evaluation results of our pruning method on the C3D and R(2+1)D neural networks using the UCF-101 dataset. These results are detailed in Table 1.

4.2.1 C3D Model

We evaluate the top-1 accuracy of the pruned C3D model on the UCF-101 dataset using various granularity schemes, with a pruning sparsity of up to 54.2%. Our baseline model is trained to achieve an accuracy of 80.9%, and we analyze the performance of the model using different pruning methods. Our results reveal that FDIS yields superior accuracy over the L1 filter-wise pruning model, achieving 79.6% compared to 78.3%, while simultaneously reducing FLOPs by approximately 54.2%. Furthermore, compared to the filter-wise pruning model with the same level of complexity reduction, our FIS (Filter-wise Independence Score)-based method with one-shot pruning achieves better accuracy (78.9% vs. 78.3%). We also observe that FDIS with a one-shot pruning scheme leads to further improvement, with an accuracy of 79.3% under the same 54.2% reduction in FLOPs. Finally, we find that FDIS with an iterative pruning scheme is particularly well-suited to pruning 3D neural networks, achieving 79.6% accuracy.

4.2.2 R(2+1)D Model

For pruning the R(2+1)D model on the UCF-101 dataset, we evaluate the top-1 accuracy under 48.4% pruning sparsity with various granularity schemes. The baseline model is trained to achieve an accuracy of 92.3%. Using FDIS leads to an improvement in accuracy over the L1 filter-wise pruning model (89.7% vs. 87.8%) with about a 48.4% reduction in FLOPs. Furthermore, compared to the filter-wise pruning model with the same complexity reduction, our FIS-based method with one-shot pruning achieves better accuracy (88.7% vs. 87.8%) with less than a 48.4% FLOPs reduction. Using FDIS with a one-shot pruning scheme brings further improvement (89.3% vs. 87.8%) under the same 48.4% reduction in FLOPs.

Table 3 Results for Slow and TSM model on HMDB-51 dataset. It is worth noting that in this study, we use only the one-shot pruning strategy, in order to evaluate the impact of the proposed IS criterion in isolation. As most of the layers have a dimension of one along the time dimension, we only implement Filter-wise IS when pruning these models.
Table 4 Results for speedup of Slow and TSM models on HMDB-51 dataset under different sparsity levels for Nvidia RTX 2080Ti and Nvidia AGX Xavier GPUs.

4.3 HMDB-51 Dataset

In this section, we discuss in detail the evaluation results of our pruning method on the C3D, Slow, and TSM neural networks using the HMDB-51 dataset. These results are detailed in Tables 2 and 3.

4.3.1 C3D Model

In this study, we evaluate the top-1 accuracy of the C3D model on the HMDB-51 dataset, utilizing various granularity schemes, with a pruning sparsity of up to 54.2%. Our baseline model is trained to achieve an accuracy of 51.5%, and we analyze the performance of the model using different pruning methods. Our findings demonstrate that the FDIS-based pruning method outperforms the L1 filter-wise pruning model, achieving an accuracy of 50.6% compared to 48.3%, while simultaneously reducing FLOPs by approximately 54.2%. Furthermore, when compared to the filter-wise pruning model with the same level of complexity reduction, our Filter-wise Independence Score (FIS) method with one-shot pruning achieves better accuracy (49.7% vs. 48.3%). We also observe that using FDIS with a one-shot pruning scheme leads to further improvement, with an accuracy of 50.1% under the same 54.2% reduction in FLOPs. Moreover, we find that FDIS-based pruning with an iterative scheme is especially well-suited for pruning 3D neural networks.

4.3.2 Slow Model

It should be noted that in the Slow model, most of the layers have size one along the temporal dimension. As a result, we refrain from using the Depth-wise Independence Score when pruning the Slow model. For the Slow model on the HMDB-51 dataset, we evaluate the top-1 and top-5 accuracy under different pruning sparsity ratios. We train our baseline model to achieve 46.5% accuracy. Our method leads to an improvement in top-5 accuracy over the L1 filter-wise pruning model (75.2% vs. 74.6%) at about 20% sparsity. Furthermore, compared to the L1-based pruning model with the same complexity reduction, our method achieves better top-1 accuracy (45.8% vs. 45.2%) at 50% sparsity and a 0.3% improvement (43.2% vs. 42.9%) at the same 80% sparsity.

4.3.3 TSM Model

As in the Slow model, most of the layers in the TSM model have size one along the temporal dimension, so we likewise refrain from using the Depth-wise Independence Score when pruning it. For the TSM model on the HMDB-51 dataset, we evaluate the top-1 and top-5 accuracy under different pruning sparsity ratios. Specifically, we train our baseline model to achieve 73.4% top-1 accuracy and 93.0% top-5 accuracy. Our method leads to an improvement in top-5 accuracy over the L1 filter-wise pruning model (91.0% vs. 88.6%) at about 40% sparsity. Furthermore, compared to the L1-based pruning model with the same complexity reduction, our method achieves better top-1 accuracy (54.1% vs. 53.8%) at 50% sparsity.

4.4 Practical Runtime

In addition to evaluating accuracy, we also measure the speedup brought about by our method on desktop and embedded GPUs (Nvidia RTX 2080Ti and Nvidia AGX Xavier). The speedups are shown in Table 4. On the RTX 2080Ti, the unpruned Slow model takes 520 ms of inference time per video and the TSM model takes 15 ms. After pruning, the Slow model runs in 189 ms per video, which is 2.75 times faster. Even more substantial speedups are observed on the smaller embedded AGX Xavier GPU, demonstrating that our approach provides a feasible solution for performing video recognition on small, resource-constrained devices.

5 Conclusion

In this paper, we propose an approach to pruning 3D CNN models inspired by the similarity of adjacent frames. Our method leverages independent information to evaluate the importance of the weights of 3D networks. In addition, by exploring the different granularity schemes and pruning techniques, we develop FDIS-based pruning for 3D convolutional neural networks. Extensive experimental results on the UCF-101 and HMDB-51 datasets show that our proposed approach brings a significant practical speedup with good accuracy performance.