Abstract
Network pruning provides a promising approach for deploying costly Deep Neural Network (DNN) models on resource-constrained devices. However, most existing pruning works focus on compressing traditional convolutional DNN models for image-related classification and object-detection tasks in 2D scenarios. Due to the complicated structure of 3D CNN models, pruning such models has not been well studied. In this paper, we analyze the differing properties of 2D and 3D tasks and then propose a Filter/Depth-wise Independence Score (FDIS) to evaluate the importance of each filter. In addition, we adopt several granularity schemes to improve the performance of the proposed method. To achieve fine-grained pruning, we prune the networks gradually using an iterative pruning procedure. Furthermore, we experimentally show that weights with low independence scores contain less important information, enabling the removal of filters without serious accuracy degradation. Our proposed FDIS-based approach maintains high accuracy while delivering substantial FLOP reduction and practical acceleration.
1 Introduction
Video Action Recognition (VAR) is an important computer vision task required in the real world, ranging from deployment in security cameras to healthcare monitoring [1]. To date, there has been great interest in the design of efficient VAR algorithms. Motivated by their state-of-the-art performance in image classification, object detection, image segmentation, etc. [2], several previous works have proposed to use Convolutional Neural Networks (CNNs) for VAR [3, 4]. Unlike using CNNs for image-domain tasks, CNN-powered VAR needs to properly exploit the extra temporal dimension, since there are very small changes between adjacent frames of a video (see Fig. 1). To that end, researchers have accommodated this temporal dimension, leading to several state-of-the-art networks such as C3D [3], R(2+1)D [4], Slow [5], TSM [6], etc. However, despite achieving state-of-the-art performance, the deployment of these networks on hardware remains far from practical, as they are resource-hungry, take weeks to train, and can barely process video streams in real time on smaller, resource-constrained hardware (such as Nvidia Jetson TX2 and AGX Xavier GPUs). Hence, the current achievements of these networks remain beyond the scope of real-life deployment. Consequently, there is an urgent need to work towards making such networks useful and amenable to resource-constrained devices, in turn enabling small devices such as autonomous drones to navigate and perform crucial tasks using local resources.
Four sets of 6 adjacent frames of four videos taken from UCF-101 dataset [7]. From the top: fencing, archery, skateboarding, and frisbee catch.
The CNNs that are usually used in video-based action recognition are called 3D-CNNs, as they contain kernels that are essentially 3-dimensional tensors. They are designed in this format because they process the temporal dimension alongside the spatial dimensions, taking advantage of the similarities between neighboring video frames as well as neighboring pixels within individual frames. There are several popular methods for speeding up the execution of CNNs, such as network pruning, quantization, knowledge distillation, etc. [8]. In this paper, we focus on network pruning for 3D-CNNs, which is in itself a non-trivial extension of network pruning for 2D-CNNs, since it involves a deep understanding of the temporal dimension introduced by the video format. We present our carefully designed approach for pruning 3D-CNN networks and evaluate it using an independence score at both filter and depth-wise granularity. Overall, our contributions are summarized as follows:
-
A pruning procedure that removes the weights with low independence scores, leading to practical acceleration (5 times faster) with minor accuracy degradation.
-
We propose pruning at filter-wise and depth-wise granularity, as well as their combination, offering finer-grained control than prior pruning methods.
-
We propose both one-shot and iterative versions of our pruning method that explore the granularity in the time domain.
-
We evaluated our approach on public datasets (UCF-101 [7] and HMDB-51 [9]) using well-known 3D-CNN architectures (C3D [3], R(2+1)D [4], Slow [5], and TSM [6]).
2 Related Work
3D CNN architecture
has been used in the literature primarily for video-processing tasks such as action recognition, semantic segmentation, and temporal action localization. For video action recognition, C3D [3], R(2+1)D [4], Slow [5], and TSM [6] are well-known architectures. Hara et al. [10] conducted a study on the various 3D-CNN architectures and the corresponding activity recognition datasets to determine the current progress in the field as well as the sufficiency of the current dataset sizes to facilitate advanced network training. Furthermore, 3D-CNNs have been proposed in contrastive learning [11], for end-to-end learning in self-driving cars [12], and for recording and exploiting long-term temporal context in camera feeds [13]. Hedegaard and Iosifidis [14] propose continual 3D-CNNs that reuse preexisting 3D-CNN weights to reduce the per-prediction FLOPs. Overall, 3D-CNNs have been popularly adopted as a way to capture the temporal dimension in video formats, and have shown great success in doing so. Our pruning method takes this into account and achieves comparable performance with far fewer neural connections.
Network Pruning
is an active field of research dealing with sparsifying networks and reducing their computational footprint [15, 16], focusing mainly on 2D deep neural networks. Network pruning methods have shown their effectiveness by reducing neural network complexity by 10-100x with only a slight decrease in accuracy [15]. Essentially, pruning reduces computational complexity and regularizes the underlying network, thereby avoiding overfitting and achieving better generalization performance. In general, there have been two major pruning strategies: i) unstructured pruning and ii) structured pruning. Unstructured pruning [17, 18] usually takes the form of a regularization function (e.g., L1 or L2 norm) that seeks to zero out individual weights. However, such pruning does not achieve the optimal speed-up gain on off-the-shelf hardware platforms like Graphical Processing Units (GPUs). This is because GPUs are designed to accelerate structured matrix multiplication, which is absent in unstructured sparse networks. On the other hand, structured pruning [19,20,21] prunes the network on a structural basis, eliminating whole layers, channels, or filters (essentially bypassing several matrix multiplications). Therefore, structured pruning brings considerable speed-up gains on GPUs. Our approach is a structured pruning method that prunes at both filter-wise and depth-wise granularity using an independence score, enabling significant speedup.
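To make the structured/unstructured distinction concrete, here is a minimal NumPy sketch with toy shapes and a simple L1 magnitude criterion (an illustrative stand-in, not the paper's FDIS): unstructured pruning zeroes individual weights without changing the tensor's shape, while structured filter-wise pruning removes whole filters and actually shrinks the tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 3D conv layer: N=4 filters, C=2 input channels, 3x3x3 kernels.
W = rng.standard_normal((4, 2, 3, 3, 3))

# Unstructured pruning: zero individual small-magnitude weights.
# The tensor keeps its dense shape, so GPU matrix kernels see no speedup.
thresh = np.quantile(np.abs(W), 0.5)
W_unstructured = np.where(np.abs(W) < thresh, 0.0, W)
assert W_unstructured.shape == W.shape

# Structured (filter-wise) pruning: drop whole filters, here via a simple
# L1 criterion, shrinking the tensor and skipping entire multiplications.
l1 = np.abs(W).reshape(4, -1).sum(axis=1)
keep = np.sort(np.argsort(l1)[2:])  # keep the 2 largest-norm filters
W_structured = W[keep]
assert W_structured.shape == (2, 2, 3, 3, 3)
```

The structured variant is what enables practical acceleration on dense GPU kernels, since the downstream layer's input channels shrink accordingly.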
Structured Pruning
for image classification is an active research area. For example, Wen et al. [22] use convolutional filters to determine and prune redundant channels. Liu et al. [23] view the scaling factor as a criterion for each channel and impose sparsity regularization on the factors. Molchanov et al. [24] use the Taylor expansion to estimate the importance of a filter to the final output and remove filters with small scores. DMCP [25], MetaPruning [26] use complicated rules to generate the sparsity distribution at the channel level. To reduce the disturbance of irrelevant factors, Tang et al. [27] provide a scientific control scale setting, which helps to choose the characteristics that should be removed. Lin et al. [28] use the rank of each filter to determine the importance of the corresponding filter. [29] incorporates sensitivity feedback during the training procedure. [30] first learns a target sub-network during the model training process and then uses this sub-network to guide the learning of model weights through partial regularization. [31] introduces a unified and principled framework based on information bottleneck theory.
Other works [32,33,34,35,36] also explore similar strategies for pruning 3D CNNs; however, the majority of these existing works are aimed at a single level of granularity (e.g., filter or channel level). In contrast, our approach is applicable to both filter-wise and depth-wise granularity. In addition, the approach also investigates the one-shot and iterative versions of our pruning method, leading to versatility and a substantial reduction in network size.
3 Methodology
We start by describing the preliminaries of 3D-CNNs and the filter pruning problem in Sec. 3.1. Then, we discuss the motivation of our approach in Sec. 3.2 and propose our pruning metric in Sec. 3.3. Moreover, different granularity schemes for pruning are presented to enhance the performance in Sec. 3.4. Lastly, combined with a gradual pruning strategy, we detail our overall algorithm in Sec. 3.5.
3.1 Preliminaries
For a 3D CNN model with L layers, the l-th convolutional layer \(\varvec{\mathcal {W}}^l = \{\varvec{\mathcal {F}}^l_1,\varvec{\mathcal {F}}^l_2, \) \(\cdots ,\varvec{\mathcal {F}}^l_{N}\} \in \mathbb {R}^{N\times C\times K_h \times K_w \times K_d}\) contains N filters \(\varvec{\mathcal {F}}^l_{i}\in \mathbb {R}^{C \times K_h \times K_w \times K_d}\), where N, C, \(K_h\), \(K_w\), \(K_d\) denote the number of filters, the number of input channels, filter-height, filter-width, and filter-depth, respectively. In general, the filter pruning problem can be formulated as,

$$\min _{\{\varvec{\mathcal {W}}^l\}_{l=1}^L} \mathcal {L}\big (\varvec{\mathcal {Y}}, f(\varvec{\mathcal {X}}, \{\varvec{\mathcal {W}}^l\}_{l=1}^L)\big ) \quad \text {s.t.} \quad \Vert \varvec{\mathcal {W}}^l\Vert _0 \le \tau ^l, \;\; l = 1, \ldots , L,$$
where \(\mathcal {L}(\cdot ,\cdot )\) denotes the loss function, \(\varvec{\mathcal {Y}}\) is the ground-truth label, \(\varvec{\mathcal {X}}\) is the input data, and \(f(\cdot , \cdot )\) is the output function of 3D CNN model with layers \(\{\varvec{\mathcal {W}}^l\}_{l=1}^L\). In addition, \( \Vert \cdot \Vert _0\) represents the \(\ell _0\)-norm that measures the number of non-zero filters in the l-th layer, and \(\tau ^l\) is the maximum number of filters to be maintained in the l-th layer.
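As a small illustration of the \(\ell _0\) constraint above, the following sketch (toy shapes and the helper name are our own assumptions) counts the non-zero filters of a layer and checks the count against a per-layer budget \(\tau ^l\):

```python
import numpy as np

def num_nonzero_filters(W):
    """||W^l||_0 in the formulation: filters that are not entirely zero."""
    per_filter = np.abs(W).reshape(W.shape[0], -1).sum(axis=1)
    return int(np.count_nonzero(per_filter))

W = np.ones((8, 3, 3, 3, 3))   # toy layer with N=8 filters
W[[1, 4, 6]] = 0.0             # pruning zeroes out three whole filters
tau = 5                        # per-layer budget tau^l
assert num_nonzero_filters(W) == 5
assert num_nonzero_filters(W) <= tau
```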
Procedure of vectorizing a 3D convolutional layer associated with two granularity schemes. The red line denotes the filter-wise scheme. The blue line denotes the depth-wise scheme.
3.2 Motivation
The key to solving a structured pruning problem is to determine the importance level of each filter. Under well-studied pruning routines for 2D CNN models, many previous works propose various methods to determine this importance factor [22, 23, 37]. For example, He et al. [37] consider the geometric similarity among filters and prune the most similar ones (based on the intuition that similar filters encode similar information). However, the newly introduced temporal dimension in 3D-CNNs means that directly adapting the pruning approaches of 2D-CNN models to the 3D scenario fails to achieve good performance, because 2D pruning operates at coarse, channel-level granularity without considering the depth level. To truly provide a suitable solution for video recognition, it is necessary to investigate the differences between the 2D and 3D modalities.
An important difference between video-based recognition and image-related classification and object detection is that there is substantial redundancy in the adjacent frames of a video. For example, as shown in Fig. 1, two contiguous frames \(\varvec{\mathcal {X}_i}\), \(\varvec{\mathcal {X}_{i+1}} \in \mathbb {R}^{H \times W \times K} \), where H, W, K denote the height, width, and number of channels of the frame, respectively, are very similar. We can observe that most pairs of frames satisfy \(\Vert \varvec{\mathcal {X}_i} - \varvec{\mathcal {X}_{i+1}}\Vert \le \epsilon \), where \(\epsilon \) is a similarity constant. Observing such negligible changes between adjacent frames, Sevilla et al. [38] utilize optical flow to compress the input frames. However, directly transferring this frame-level similarity to filter pruning along the depth dimension is not optimal: even though adjacent video frames are quite similar, the same may not hold for the weights after training.
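The adjacent-frame similarity can be checked numerically. This hedged sketch builds a synthetic, slowly changing clip and verifies that every pair of neighboring frames stays within a small Frobenius distance; the clip construction and the threshold are illustrative, not taken from the paper:

```python
import numpy as np

def adjacent_frame_distances(frames):
    """||X_i - X_{i+1}|| (Frobenius) for every pair of neighboring frames."""
    diffs = (frames[1:] - frames[:-1]).reshape(len(frames) - 1, -1)
    return np.sqrt((diffs ** 2).sum(axis=1))

# Synthetic slowly-changing clip: each frame = previous frame + small noise.
rng = np.random.default_rng(0)
clip = np.cumsum(0.01 * rng.standard_normal((6, 32, 32, 3)), axis=0)

d = adjacent_frame_distances(clip)
eps = 1.0  # illustrative similarity constant
assert d.shape == (5,) and np.all(d < eps)
```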
On the other hand, inspired by [39], we hypothesize that, under ideal circumstances, for a 3D network with maximum representative capacity, the adjacent weights, at the spatial or temporal level, tend to be independent of each other, i.e., each weight plays its own independent role rather than being easily represented by other weights. The corollary of this statement is that a network with all the adjacent weights independent of each other is the one that has the maximum representative capacity for the minimum number of non-zero weights. If that is the case, we can exploit this by creating a network (through pruning) whose maximum representative capacity approaches the capacity required to represent the dataset it is trained on, i.e., using the minimum number of independent weights required to capture the distribution of the training dataset. This constitutes the spirit of our approach: we use the independence score to represent the importance of each filter or depth to create a pruned version of the network. The intuition here being that when one filter of a layer is heavily linearly dependent on other filters, it implicitly means that its contained information has been largely encoded in other filters. As a result, even if this filter is removed, the represented information and knowledge can still be roughly re-created by other filters following the fine-tuning stage. In other words, filters with low independence scores are replaceable.
3.3 Proposed Pruning Metric
To remove the most linearly dependent filters, inspired by [39], we leverage the Independence Score (IS)-based method to prune the 3D CNNs. The entire set of filters from one layer is a 3-D tensor, and we propose to extract the linear dependence information of each filter using the principles of linear algebra. To be specific, given a set of filters from the l-th layer \(\varvec{\mathcal {W}}^l\), we first matricize \(\varvec{\mathcal {W}}^l\) to \(\varvec{W}^l = [{\varvec{w}_1^l}, {\varvec{w}_2^l}, \cdots , {\varvec{w}_{N^l}^l}] \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\), where each row vector \(\varvec{w}_i^l \in \mathbb {R}^{CK_hK_wK_d}\) is the vectorized \(\varvec{W}^l_i\). In such a scenario, the linear independence of each vectorized filter \(\varvec{w}_i^l\), as a row of the matricized set of filters \(\varvec{W}^l\), can be measured via existing matrix analysis tools. Because the rank reflects the maximum number of linearly independent rows/columns in a matrix, the most straightforward solution is to use the rank to determine the independence of \(\varvec{w}\). For example, we can remove one row from the matrix, calculate the rank change, and then determine the influence and relevance of the deleted row. The lower the rank change, the lower the independence (and importance) of the removed row.
For the l-th layer with weights \(\varvec{\mathcal {W}}^l = \{ \varvec{W}^l_1,\varvec{W}^l_2,\cdots ,\varvec{W}^l_{N} \} \in \mathbb {R}^{{N} \times C \times {K_h} \times {K_w} \times {K_d}}\), the IS of the i-th filter \(\varvec{W}_i^l \!\in \! \mathbb {R}^{C \!\times \! {K_h} \!\times \! {K_w} \!\times \! {K_d}}\) is defined and calculated as:

$$\text {IS}_i^l = \text {Rank}(\varvec{W}^l) - \text {Rank}(\varvec{M}_i^l \odot \varvec{W}^l),$$
where \(\varvec{W}^l \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\) is the matricized \(\varvec{\mathcal {W}}^l\), \(\odot \) is the Hadamard product, and \(\varvec{M}_i^l \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\) is the row mask matrix whose i-th row entries are zeros and other entries are ones. However, rank cannot distinguish the different contributions to independence from different filters, as it only measures overall independence and can only coarsely determine the independence contribution (1 or 0). To better differentiate the scores of filters, we relax the rank and reformulate the definition of IS with the nuclear norm instead of the rank, as shown in Fig. 3:

$$\text {IS}_i^l = \Vert \varvec{W}^l\Vert _* - \Vert \varvec{M}_i^l \odot \varvec{W}^l\Vert _*,$$
where \(\Vert \cdot \Vert _*\) denotes the nuclear norm. This affords us a more fine-grained way to represent the contribution of independence from different filters and enables us to rank them according to their relative independence.
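A minimal sketch of the two scores above, assuming the layer is already matricized to \(N \times CK_hK_wK_d\): the rank-based IS only yields coarse 0/1 contributions, while the nuclear-norm relaxation assigns each filter a graded score. Function names and toy shapes are our own.

```python
import numpy as np

def independence_scores(W, use_rank=False):
    """IS_i = measure(W) - measure(M_i * W) for each row i of the
    matricized layer; measure is the rank or the nuclear norm."""
    if use_rank:
        measure = lambda A: int(np.linalg.matrix_rank(A))
    else:
        measure = lambda A: float(np.linalg.svd(A, compute_uv=False).sum())
    full = measure(W)
    scores = []
    for i in range(W.shape[0]):
        masked = W.copy()
        masked[i] = 0.0  # row mask M_i: zero out the i-th vectorized filter
        scores.append(full - measure(masked))
    return np.array(scores)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 10))
W[3] = W[0] + W[1]  # make filter 3 a linear combination of filters 0 and 1

rank_is = independence_scores(W, use_rank=True)
nuc_is = independence_scores(W)

# Rank change is coarse: only filter 2 (in no other filter's span) scores 1;
# filters 0, 1, 3 all score 0 because each lies in the span of the others.
assert rank_is.tolist() == [0, 0, 1, 0]
# The nuclear-norm relaxation gives every filter a graded positive score.
assert nuc_is.min() > 1e-6
```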
Procedure of calculating IS from a 3D convolutional layer.
3.4 Granularity Schemes
Apart from the general framework described above to calculate IS, we further consider different granularity schemes for the pruning criterion. To be specific, IS can be calculated either in a filter-wise mode (Filter-wise Independence Score (FIS) that generates N Independence Scores) or in a depth-wise mode (Depth-wise Independence Score (DIS) that generates \(K_d\) Independence Scores), coming from two different perspectives for modeling the information load. FIS is intuitive in the sense that it maps out the importance of filters and mitigates the dependence among them. On the other hand, DIS fully leverages the similarity of the adjacent frames and removes the redundancy among depth-wise weights, implicitly deleting the redundancy of models resulting from the similar adjacent frames.
3.4.1 Filter-wise Independence Score
Equation 3 describes the general framework for calculating IS shown in Figs. 2 and 3. FIS can be calculated in a filter-wise way as:

$$\text {FIS}_i^l = \Vert \varvec{W}_f^l\Vert _* - \Vert \varvec{M}_f^l \odot \varvec{W}_f^l\Vert _*,$$
where \(\varvec{W}_f^l \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\) is the matricized \(\varvec{\mathcal {W}}^l\), \(\odot \) is the Hadamard product, and \(\varvec{M}_f^l \in \mathbb {R}^{{N}\times {CK_hK_wK_d}}\) is the row mask matrix whose i-th row entries are zeros and other entries are ones.
Iterative pruning scheme: x-axis denotes the number of epochs and y-axis denotes the pruning level. It is noted that the pruning level does not equal the target pruning ratio.
3.4.2 Depth-wise Independence Score
Based on Eq. 3 and as shown in Fig. 2, DIS is formulated as:

$$\text {DIS}_i^l = \Vert \varvec{W}_d^l\Vert _* - \Vert \varvec{M}_d^l \odot \varvec{W}_d^l\Vert _*,$$
where \(\varvec{W}_d^l \in \mathbb {R}^{{K_d}\times {NCK_hK_w}}\) is the matricized \(\varvec{\mathcal {W}}^l\), \(\odot \) is the Hadamard product, and \(\varvec{M}_d^l \in \mathbb {R}^{{K_d}\times {NCK_hK_w}}\) is the row mask matrix whose i-th row entries are zeros and other entries are ones.
3.4.3 Overall Independence Score
By combining Eqs. 4 and 5 and as shown in Fig. 2, the overall Filter/Depth-wise Independence Score (FDIS) can be formulated as:

$$\text {FDIS}^l = \lambda \, \text {FIS}^l + (1 - \lambda )\, \text {DIS}^l,$$
where \(\lambda \in [0,1]\) is a control parameter to balance the influence of FIS and DIS in the final score.
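The two granularity schemes differ only in how the weight tensor is matricized. The sketch below (toy shapes and helper names are assumed) reshapes the same layer filter-wise to obtain N FIS values and depth-wise to obtain \(K_d\) DIS values; how exactly the two score vectors are fused with \(\lambda \) is specified by Eq. 6, so the sketch only computes both criteria.

```python
import numpy as np

def nuclear_norm(A):
    return float(np.linalg.svd(A, compute_uv=False).sum())

def row_independence_scores(M):
    """Nuclear-norm independence score of each row of a matricized layer."""
    full = nuclear_norm(M)
    scores = []
    for i in range(M.shape[0]):
        masked = M.copy()
        masked[i] = 0.0
        scores.append(full - nuclear_norm(masked))
    return np.array(scores)

rng = np.random.default_rng(0)
N, C, Kh, Kw, Kd = 4, 2, 3, 3, 3
W = rng.standard_normal((N, C, Kh, Kw, Kd))

# Filter-wise: matricize to N x (C*Kh*Kw*Kd) -> one FIS score per filter.
fis = row_independence_scores(W.reshape(N, -1))
# Depth-wise: move the depth axis first, matricize to Kd x (N*C*Kh*Kw)
# -> one DIS score per temporal slice.
dis = row_independence_scores(np.moveaxis(W, -1, 0).reshape(Kd, -1))

assert fis.shape == (N,) and dis.shape == (Kd,)
```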
3.5 Iterative Pruning Scheme
After defining the pruning metric and the granularity schemes, we now focus on the manner in which IS is calculated. Our proposed method, as described above, calculates the FDIS to identify filter importance in a one-shot calculation. A natural extension of this one-shot calculation is to further adjust the importance ranking via additional learning. Inspired by Zhu et al. [40], we adopt a greedy algorithm that prunes iteratively, removing a fixed percentage of filters at a time.
To be specific, we calculate the FDIS every t epochs from the pre-trained model in a total of T epochs. After each calculation of FDIS, we prune the filters according to a fixed percentage of the target pruning ratio shown in Eq. 7,
where \(\tau ^t\) is the target pruning ratio at the t-th epoch, \(\kappa ^t\) is the remaining level (see Fig. 4) at the t-th epoch, and \(\tau \) is the final target pruning ratio.
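The iterative scheme can be sketched as a loop that, every few epochs, recomputes the scores and prunes a fixed slice of the target ratio. All schedule constants below (300 epochs, a step of 30, a 50% target on 64 filters) are illustrative assumptions, not the paper's exact Eq. 7:

```python
def iterative_prune_schedule(total_epochs=300, step=30, target_ratio=0.5,
                             n_filters=64):
    """Every `step` epochs: fine-tune, recompute FDIS, then prune a fixed
    slice of the target ratio. Returns the filter count after each event."""
    n_steps = total_epochs // step
    per_step = target_ratio / n_steps  # fixed percentage per pruning event
    remaining = n_filters
    history = []
    for _ in range(n_steps):
        # ... fine-tune for `step` epochs and recompute FDIS here ...
        remaining -= round(n_filters * per_step)
        history.append(remaining)
    return history

hist = iterative_prune_schedule()
assert len(hist) == 10 and hist[-1] == 34  # 10 events of 3 filters each
```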
4 Performance Evaluation
In this section, we detail the evaluation of our approach and compare it with the baseline pruning approaches. In Sec. 4.1, we detail the baseline models and datasets that we use for evaluation, and also our experimental configuration. Then we go on to discuss the performance of our pruning method on various neural networks, on UCF-101 and HMDB-51 datasets in Sec. 4.2 and 4.3 respectively.
4.1 Experimental Setup
Baselines Models and Datasets
We evaluate the performance of our algorithms on several representative convolutional neural network architectures: C3D [3], R(2+1)D [4], Slow [5], and TSM [6]; and on popular video action recognition datasets: UCF-101 [7] and HMDB-51 [9], reporting the speedup and accuracy of our proposed method. We demonstrate the efficacy of our pruning approach by compressing convolutional layers with the Filter-wise Independence Score with one-shot pruning, the Filter/Depth-wise Independence Score with one-shot pruning, and the Filter/Depth-wise Independence Score with iterative pruning. All the neural networks in question are pre-trained on the Kinetics dataset [43] and fine-tuned on the UCF-101 [7] and HMDB-51 [9] datasets. After fine-tuning, the pruned networks are evaluated via top-1 accuracy on the video recognition datasets.
UCF-101
The UCF-101 [7] dataset includes 13320 videos from 101 action categories. This dataset provides the largest diversity in terms of actions along with considerable variations in camera motion, object appearance and poses, object scales, viewpoints, illumination conditions, etc.
HMDB-51
The HMDB-51 [9] dataset contains 6849 clips divided into 51 action categories, and each category contains at least 101 video clips. Action categories can be grouped into four types: general facial actions, general body movements, body movements with object interaction, and body movements for human interaction.
Experimental Configuration
We conduct all our experiments on Nvidia A6000 GPUs and 2080 Ti GPUs with the PyTorch [44] framework. After filter pruning, we perform fine-tuning on the pruned models with Stochastic Gradient Descent (SGD) as the optimizer. To be specific, the batch size and clip length are fixed to the default training setting. The initial learning rate is set to \(10^{-3}\) to train the pre-trained model, and is then reduced to \(2\times 10^{-4}\) in the pruning and fine-tuning stages. The learning rate is fixed during the pruning process and adjusted in the fine-tuning phase by a scheduler following the cosine function. For all granularity schemes and pruning algorithms, the total number of epochs is fixed to 300. The control parameter \(\lambda \) balancing the filter-wise and depth-wise independence scores, FIS and DIS, is set to 0.5.
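For reference, the cosine scheduler used in fine-tuning can be written as a small function; the exact form below (annealing from \(2\times 10^{-4}\) to zero over 300 epochs) is our assumption of the standard cosine schedule, not a detail stated in the paper:

```python
import math

def cosine_lr(epoch, total_epochs=300, base_lr=2e-4, min_lr=0.0):
    """Cosine-annealed learning rate for the fine-tuning phase."""
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1.0 + math.cos(math.pi * epoch / total_epochs))

assert abs(cosine_lr(0) - 2e-4) < 1e-12    # starts at the fine-tuning LR
assert abs(cosine_lr(300)) < 1e-12         # fully decayed at epoch 300
assert abs(cosine_lr(150) - 1e-4) < 1e-9   # halfway point
```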
4.2 UCF-101 Dataset
In this section, we discuss in detail, the evaluation results of our pruning method on C3D and R(2+1)D neural networks using UCF-101 dataset. These results are detailed in Table 1.
4.2.1 C3D Model
We evaluated the top-1 accuracy of the pruned C3D model on the UCF-101 dataset using various granularity schemes, with a pruning sparsity of up to 54.2%. Our baseline model is trained to achieve an accuracy of 80.9%, and we analyze the performance of the model under different pruning methods. Our results reveal that FDIS yields superior accuracy over the L1 filter-wise pruning model, achieving 79.6% compared to 78.3%, while simultaneously reducing FLOPs by approximately 54.2%. Furthermore, compared to the filter-wise pruning model with the same level of complexity reduction, our FIS (Filter-wise Independence Score)-based method with one-shot pruning achieves better accuracy, 78.9% compared to 78.3%. We also observe that FDIS with a one-shot pruning scheme leads to further improvement, with an accuracy of 79.3% under the same 54.2% reduction in FLOPs, and that FDIS with an iterative pruning scheme is particularly suited to pruning 3D neural networks, achieving 79.6% accuracy.
4.2.2 R(2+1)D Model
For pruning the R(2+1)D model on the UCF-101 dataset, we evaluate the top-1 accuracy under 48.4% pruning sparsity with various granularity schemes. The baseline model is trained to achieve an accuracy of 92.3%. Using FDIS leads to an improvement in accuracy over the L1 filter-wise pruning model (89.7% vs. 87.8%) with about a 48.4% reduction in FLOPs. Furthermore, compared to the filter-wise pruning model with the same complexity reduction, our FIS-based method with one-shot pruning achieves better accuracy (88.7% vs. 87.8%) with slightly less than 48.4% FLOP reduction. Using FDIS with a one-shot pruning scheme brings more improvement (89.3% vs. 87.8%) under the same 48.4% reduction in FLOPs.
4.3 HMDB-51 Dataset
In this section, we discuss in detail the evaluation results of our pruning method on the C3D, Slow, and TSM neural networks using the HMDB-51 dataset. These results are detailed in Tables 2 & 3.
4.3.1 C3D Model
In this study, we evaluate the top-1 accuracy of the C3D model on the HMDB-51 dataset, utilizing various granularity schemes, with a pruning sparsity of up to 54.2%. Our baseline model is trained to achieve an accuracy of 51.5%, and we analyze the performance of the model under different pruning methods. Our findings demonstrate that the FDIS-based pruning method outperforms the L1 filter-wise pruning model, achieving an accuracy of 50.6% compared to 48.3%, while simultaneously reducing FLOPs by approximately 54.2%. Furthermore, when compared to the filter-wise pruning model with the same level of complexity reduction, our Filter-wise Independence Score (FIS) method with one-shot pruning achieves better accuracy, 49.7% compared to 48.3%. We also observe that using FDIS with a one-shot pruning scheme leads to further improvement, with an accuracy of 50.1% under the same 54.2% reduction in FLOPs. Moreover, we find that FDIS-based pruning with an iterative pruning scheme is especially well-suited for pruning 3D neural networks.
4.3.2 Slow Model
It should be noted that in the Slow model, most of the layers have a singleton temporal (T) dimension. As a result, we refrain from applying the Depth-wise Independence Score when pruning the Slow model. For the Slow model on the HMDB-51 dataset, we evaluate the top-1 and top-5 accuracy under different pruning sparsity ratios. We train our baseline model to achieve 46.5% accuracy. Our method leads to an improvement in top-5 accuracy over the L1 filter-wise pruning model (75.2% vs. 74.6%) at about 20% sparsity. Furthermore, compared to the L1-based pruning model with the same complexity reduction, our method achieves better top-1 accuracy (45.8% vs. 45.2%) at 50% sparsity and a 0.3% improvement (43.2% vs. 42.9%) at the same 80% sparsity.
4.3.3 TSM Model
As in the Slow model, most of the layers in the TSM model have a singleton temporal (T) dimension, so we likewise refrain from applying the Depth-wise Independence Score when pruning the TSM model. For the TSM model on the HMDB-51 dataset, we evaluate the top-1 and top-5 accuracy under different pruning sparsity ratios. More specifically, we train our baseline model to achieve 73.4% top-1 accuracy and 93.0% top-5 accuracy. Our method leads to an improvement in top-5 accuracy over the L1 filter-wise pruning model (91.0% vs. 88.6%) at about 40% sparsity. Furthermore, compared to the L1-based pruning model with the same complexity reduction, our method achieves better top-1 accuracy (54.1% vs. 53.8%) at 50% sparsity.
4.4 Practical Runtime
In addition to evaluating accuracy, we also measure the speedup brought about by our method on desktop and embedded GPUs (Nvidia RTX 2080Ti and Nvidia AGX Xavier). The speedups are shown in Table 4. On the RTX 2080Ti, the unpruned per-video inference time is 520 ms for the Slow model and 15 ms for the TSM model; the pruned Slow model runs in roughly 189 ms, which is 2.75 times faster. Even more substantial speedups can be observed on the smaller embedded AGX Xavier GPU, demonstrating that our approach provides a feasible solution for performing video recognition on small, resource-constrained devices.
5 Conclusion
In this paper, we propose an approach to pruning 3D CNN models inspired by the similarity of adjacent frames. Our method leverages independence information to evaluate the importance of the weights of 3D networks. In addition, by exploring different granularity schemes and pruning techniques, we develop FDIS-based pruning for 3D convolutional neural networks. Extensive experimental results on the UCF-101 and HMDB-51 datasets show that our proposed approach brings significant practical speedup with good accuracy performance.
Data Availability
The datasets generated during and/or analysed during the current study are available in the UCF-101 [7] repository https://www.crcv.ucf.edu/data/UCF101.php. Example from: https://arxiv.org/pdf/1212.0402.pdf.
The datasets generated and/or analyzed during the current study are available in the HMDB-51 [9] repository https://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/. Example from: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6126543.
References
Ranasinghe, S., Al Machot, F., & Mayr, H. C. (2016). A review on applications of activity recognition systems with regard to performance and evaluation. International Journal of Distributed Sensor Networks, 12(8), 1550147716665520.
Khan, A., Sohail, A., Zahoora, U., & Qureshi, A. S. (2020). A survey of the recent architectures of deep convolutional neural networks. Artificial Intelligence Review, 53(8), 5455–5516.
Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). “Learning spatiotemporal features with 3D convolutional networks.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., & Paluri, M. (2018). “A closer look at spatiotemporal convolutions for action recognition.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6450–6459, 2018.
Feichtenhofer, C., Fan, H., Malik, J., & He, K. (2019). “Slowfast networks for video recognition.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211.
Lin, J., Gan, C., & Han, S. (2019). "TSM: Temporal shift module for efficient video understanding." In Proceedings of the IEEE/CVF International Conference on Computer Vision.
Soomro, K., Zamir, A.R., Shah, M. (2012). “UCF101: A dataset of 101 human actions classes from videos in the wild.” arXiv:1212.0402
Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). “A survey of model compression and acceleration for deep neural networks.” arXiv:1710.09282
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). “HMDB: A large video database for human motion recognition.” In International Conference on Computer Vision, pp. 2556–2563, IEEE.
Hara, K., Kataoka, H., & Satoh, Y. (2018). "Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6546–6555.
Qian, R., Meng, T., Gong, B., Yang, M.-H., Wang, H., Belongie, S., & Cui, Y. (2021). “Spatiotemporal contrastive video representation learning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6964–6974.
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L.D., Monfort, M., Muller, U., Zhang, J., et al. (2016). “End to end learning for self-driving cars.” arXiv:1604.07316
Beery, S., Wu, G., Rathod, V., Votel, R., & Huang, J. (2020). “Context r-cnn: Long term temporal context for per-camera object detection.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13075–13085.
Hedegaard, L., & Iosifidis, A. (2022). “Continual 3D convolutional neural networks for real-time processing of videos.” In European Conference on Computer Vision, pp. 369–385, Springer.
Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., & Peste, A. (2021). Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. Journal of Machine Learning Research, 22(241), 1–124.
Véstias, M. (2021). Efficient design of pruned convolutional neural networks on fpga. Journal of Signal Processing Systems, 93(5), 531–544.
Han, S., Mao, H., & Dally, W.J. (2015). “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding.” arXiv preprint arXiv:1510.00149
LeCun, Y., Denker, J., & Solla, S. (1989). “Optimal brain damage.” Advances in Neural Information Processing Systems, vol. 2.
Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H.P. (2017). “Pruning filters for efficient convnets.” In International Conference on Learning Representations.
Anwar, S., Hwang, K., & Sung, W. (2017). Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3), 1–18.
Sui, Y., Yin, M., Gong, Y., & Yuan, B. (2024). “Co-exploring structured sparsification and low-rank tensor decomposition for compact DNNs.” IEEE Transactions on Neural Networks and Learning Systems.
Wen, W., Wu, C., Wang, Y., Chen, Y., & Li, H. (2016). Learning structured sparsity in deep neural networks. Advances in Neural Information Processing Systems, 29, 2074–2082.
Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., & Zhang, C. (2017). “Learning efficient convolutional networks through network slimming.” In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
Molchanov, P., Mallya, A., Tyree, S., Frosio, I., & Kautz, J. (2019). “Importance estimation for neural network pruning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11264–11272.
Guo, S., Wang, Y., Li, Q., & Yan, J. (2020). “DMCP: Differentiable Markov channel pruning for neural networks.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1539–1547.
Liu, Z., Mu, H., Zhang, X., Guo, Z., Yang, X., Cheng, K.T., & Sun, J. (2019). “Metapruning: Meta learning for automatic neural network channel pruning.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3296–3305.
Tang, Y., Wang, Y., Xu, Y., Tao, D., Xu, C., Xu, C., & Xu, C. (2020). Scop: Scientific control for reliable neural network pruning. Advances in Neural Information Processing Systems, 33, 10936–10947.
Lin, M., Ji, R., Wang, Y., Zhang, Y., Zhang, B., Tian, Y., & Shao, L. (2020). “HRank: Filter pruning using high-rank feature map.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1529–1538.
Zhang, Y., & Freris, N.M. (2023). “Adaptive filter pruning via sensitivity feedback.” IEEE Transactions on Neural Networks and Learning Systems.
Gao, S., Zhang, Z., Zhang, Y., Huang, F., & Huang, H. (2023). “Structural alignment for network pruning through partial regularization.” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 17402–17412.
Guo, S., Zhang, L., Zheng, X., Wang, Y., Li, Y., Chao, F., Wu, C., Zhang, S., & Ji, R. (2023). “Automatic network pruning via Hilbert-Schmidt independence criterion lasso under information bottleneck principle.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 17458–17469.
Yao, H., Zhang, W., Malhan, R., Gryak, J., & Najarian, K. (2018). “Filter-pruned 3D convolutional neural network for drowsiness detection.” In 2018 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1258–1262.
Wang, Z., Hong, W., Tan, Y.-P., & Yuan, J. (2020). Pruning 3D filters for accelerating 3D convnets. IEEE Transactions on Multimedia, 22(8), 2126–2137.
Zhang, Y., Wang, H., Luo, Y., Yu, L., Hu, H., Shan, H., & Quek, T.Q. (2019). “Three-dimensional convolutional neural network pruning with regularization-based method.” In 2019 IEEE International Conference on Image Processing (ICIP), pp. 4270–4274, IEEE.
Xu, Z., Ajanthan, T., Vineet, V., & Hartley, R. (2020). “Ranp: Resource aware neuron pruning at initialization for 3D CNNs.” In 2020 International Conference on 3D Vision (3DV), pp. 1–10, IEEE.
Xiang, X., Wang, Z., Lao, S., & Zhang, B. (2020). Pruning multi-view stereo net for efficient 3D reconstruction. ISPRS Journal of Photogrammetry and Remote Sensing, 168, 17–27.
He, Y., Liu, P., Wang, Z., Hu, Z., & Yang, Y. (2019). “Filter pruning via geometric median for deep convolutional neural networks acceleration.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
Sevilla-Lara, L., Liao, Y., Güney, F., Jampani, V., Geiger, A., & Black, M.J. (2018). “On the integration of optical flow and action recognition.” In German Conference on Pattern Recognition, pp. 281–297, Springer.
Sui, Y., Yin, M., Xie, Y., Phan, H., Zonouz, S., & Yuan, B. (2021). “Chip: Channel independence-based pruning for compact neural networks.” In Advances in Neural Information Processing Systems.
Zhu, M., & Gupta, S. (2017). “To prune, or not to prune: exploring the efficacy of pruning for model compression.” arXiv:1710.01878.
Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2017). “Pruning convolutional neural networks for resource efficient inference.” In International Conference on Learning Representations.
Carreira, J., & Zisserman, A. (2017). “Quo vadis, action recognition? A new model and the Kinetics dataset.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). “Pytorch: An imperative style, high-performance deep learning library.” In Advances in Neural Information Processing Systems 32 (H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett, eds.), pp. 8024–8035, Curran Associates, Inc.
Acknowledgements
This work was supported by the National Science Foundation (NSF) CCF-1937403.
Funding
NSF CCF-1937403
Author information
Authors and Affiliations
Contributions
Yang proposed the idea, conducted the experiments, and wrote the paper. Khizar conducted the experiments and wrote the paper. Dario and Bo discussed ideas and revised the paper.
Corresponding authors
Ethics declarations
Conflict of interest
Editorial Board Member: Bo Yuan.
Financial interests
The authors declare that they have no financial interests.
Non-financial interests
None.
Ethics approval
N/A.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sui, Y., Anjum, K., Pompili, D. et al. Pruning 3D Convolutional Neural Networks via Channel Independence. J Sign Process Syst 97, 247–256 (2025). https://doi.org/10.1007/s11265-025-01950-1