Abstract
Fine-grained visual classification (FGVC) aims to classify sub-categories, such as different species of birds, different brands of cars, etc. Learning feature representations from discriminative parts of an object has always played an essential role in this task. Recently, applying the attention mechanism to extract discriminative parts has become a trend. However, the classical attention mechanism has two main limitations in FGVC: first, it focuses on informative channels in feature maps and ignores information-poor channels, which also contain fine-grained knowledge helpful for classification; second, it concentrates on the most salient parts of objects and overlooks inconspicuous but discriminative parts. To address these limitations, we propose channel re-attention and spatial multi-region attention for fine-grained visual classification (CRA-SMRA), which incorporates two lightweight modules that can be easily inserted into existing convolutional neural networks (CNNs). On the one hand, we propose a channel re-attention module (CRAM), which weighs the channels of the current stage's feature map by importance, yielding more discriminative features and enabling the network to mine useful fine-grained knowledge in information-poor channels. On the other hand, a spatial multi-region attention module (SMRAM) computes the spatial matching degree between feature maps of different stages, so that multi-stage feature maps attend to different discriminative parts. Our method requires no bounding-box/part annotations and can be trained end-to-end. Extensive experimental results on several fine-grained benchmark datasets demonstrate that our approach achieves state-of-the-art performance.
1 Introduction
Visual classification tasks are usually divided into coarse-grained and fine-grained visual classification. Coarse-grained visual classification (CGVC) distinguishes between different base classes, such as dogs versus cats. Unlike CGVC, fine-grained visual classification determines subclasses under a base class, e.g., different species of birds [1] or dogs [2], or different brands of aircraft [3] or cars [4]. There is remarkable similarity between different categories: as shown in the first column of Fig. 1, birds of different species can easily be mistaken for one category based on pose and appearance. Meanwhile, due to factors such as occlusion and background interference, images of the same category can differ greatly: as shown in the first row of Fig. 1, it is not easy to classify them into the same category. Therefore, FGVC has always been considered a challenging task due to the small inter-class variance and the large intra-class variance. Early works relied mainly on manual annotations (bounding boxes/part annotations) to capture subtle differences between categories. However, such additional annotation consumes human resources and requires expert knowledge in the relevant fields, which makes these methods less practical.
Recent years have seen an increasing focus on weakly supervised FGVC that only needs image-level labels. These methods can be roughly divided into two categories: 1) part-specific methods [5, 6], which mostly use an attention mechanism or object detection to locate specific salient parts of objects and obtain the corresponding discriminative features; and 2) methods based on higher-order feature encoding [7, 8], which aim to obtain more robust fine-grained deep features. But both categories have limitations: part-specific models mainly focus on the most salient parts of the object and therefore ignore inconspicuous but discriminative parts, making the features insufficiently discriminative; methods based on higher-order feature encoding require substantial computational resources and lack interpretability when the channel dimension of the feature map is high.
To address the above limitations, we propose the CRA-SMRA model. First, we use a multi-stage CNN as the backbone to obtain a more comprehensive feature representation. The multi-stage features contain low-level information (color, edge junctions, etc.) and high-level semantic information; low-level information remains unchanged when the pose of the object or the background changes, reducing intra-class variance. Next, different channels in the feature map focus on different visual information and carry different amounts of it. Therefore, we use CRAM to assess the importance of each channel of the current stage's feature map and assign a corresponding weight; the resulting feature map, which we call the channel-enhanced feature map, is used as the output of the current stage. After that, the channels with higher weights are suppressed to obtain the channel-suppressed feature map, which is fed into subsequent stages, forcing the network to learn potential fine-grained knowledge in the channels with lower weights. To keep the balance between low-level and high-level information, we unify the channel dimensions of the multi-stage outputs, significantly reducing computation while retaining higher-dimensional channels and improving interpretability. Although discriminative multi-stage features are obtained through CRAM, the parts they attend to in the spatial dimension are still highly similar, because they all follow the most discriminative parts, leading to feature redundancy. Finally, we propose SMRAM to make the multi-stage features focus on different discriminative parts of objects. With our model, the limitations of attention in weakly supervised FGVC are addressed, and more discriminative features are obtained in both the spatial and channel dimensions.
Finally, we optimize both CRAM and SMRAM in the training phase. In the testing phase, SMRAM is removed, as shown in Fig. 2. Our method merely requires image-level labels. In addition, the two proposed modules are lightweight and update their parameters adaptively, requiring no manual selection of specific channels or spatially discriminative parts.
Our contributions can be summarized as follows:
-
We propose a channel re-attention module (CRAM), which forces the network to mine potential fine-grained knowledge in the information-poor channels of the feature map, thereby extracting a more comprehensive feature representation.
-
We propose a spatial multi-region attention module (SMRAM), which allows the network to localize multiple distinct discriminative parts of an object, making the extracted features more discriminative.
-
We conduct extensive experiments on four fine-grained benchmark datasets, and experimental results show that CRA-SMRA achieves state-of-the-art results.
2 Related Work
In this section, we briefly introduce fine-grained feature learning, attention mechanism, and calculation of the matching degree of feature map in FGVC.
2.1 Fine-Grained Feature Learning
To the best of our knowledge, existing methods for fine-grained feature learning can be divided into the methods using visual information only and the methods adding extra information. The former relies entirely on visual information to solve classification problems, while the latter attempts to add additional information (such as network data, multimodal data, etc.) to build joint representations.
The methods using visual information only: Fine-grained classification methods using visual information only can be roughly divided into two categories: methods based on a localization-classification sub-network and methods based on higher-order feature encoding. The former detect the discriminative parts of the object and build the corresponding local feature representation. Early work [9, 10] employed part annotations as strong supervision to make the network pay attention to subtle differences between categories, but part annotation is expensive; therefore, most current mainstream methods use weak supervision. RA-CNN [11] recursively learned discriminative region attention and region-based feature representation at multiple scales in a mutually reinforcing manner. MA-CNN [5] adopted multi-attention CNN learning to locate various local regions and then extract corresponding features. A self-supervised approach was used in NTS-Net [12] to effectively find the discriminative parts in the image and obtain the related features for subsequent classification. In ELoPE [6], a pre-trained localization module was used to detect the relevant regions of objects. Triplet loss and scale-separated NMS were used in CCFR [13] to capture discriminative local regions and obtain joint feature representations from these regions and the entire image. In [14], a weakly supervised discriminative localization approach (WSDL) was proposed to achieve fast and accurate localization of discriminative parts in objects, greatly improving classification speed and accuracy. Methods based on higher-order feature encoding perform higher-order integration of the features generated by a CNN, obtaining more discriminative features.
The usual approach is the bilinear CNN [7], which computes the outer product of features at different spatial positions through bilinear pooling and then averages them across spatial locations to obtain bilinear features. The bilinear model provides a stronger feature representation, yet computing the outer product significantly increases the feature dimension and consumes more computing resources. A symmetric network model was used in [8], which improved accuracy while reducing the parameter dimension of the bilinear pooling model. The bilinear attention pooling (BAP) operation was proposed in WS-DAN [15], which extracted better higher-order encoded features and dramatically reduced the computational effort. A new tensor representation, comprising the sequence compatibility kernel (SCK) and dynamics compatibility kernel (DCK), was proposed in [16] to compactly capture higher-order relationships between features in fine-grained video sequences. In [17], published in TPAMI, a new PN layer for pooling feature maps was proposed to study power normalization (PN) in deep learning; its effectiveness is verified on fine-grained image classification, scene recognition, and other tasks.
Structure diagram of our proposed CRA-SMRA model. Stages 1-3, stage 4, and stage 5 are different stages of the backbone. Through CRAM, \({\mathcal {X}}_e^l\) and \({\mathcal {X}}_s^l\) are obtained: \({\mathcal {X}}_e^l\) is used as the output of the current stage, and \({\mathcal {X}}_s^l\) is input to the subsequent stage to force the network to mine potential fine-grained knowledge. SMRAM is used to make the multi-stage \({\mathcal {X}}_e^l\) spatially focus on different discriminative parts of the object. The blue dotted line in the figure indicates components used only in the training phase
The methods adding extra information: For a more efficient classification of fine-grained images, it is helpful to utilize additional information. Zhang et al. [18] proposed a web-supervised network with softly update-drop training, which combined empirical data and achieved better results while effectively reducing the harmful effects of noise in network images. The progressive mask attention (PMA) was introduced in [19] to achieve effective classification using visual and language bimodal data. KERL [20] combined rich additional information with deep neural network architecture, and used a gated graph neural network to transfer node information, generating knowledge representations. An attention fusion network (AFN) and food-ingredient joint learning module were proposed in [21] to build joint feature representations for fine-grained foods and ingredients. In [22], predictive action concepts and auxiliary descriptors (such as object descriptors) were learned by inputting image frames, which allowed the network to establish a self-supervised concept. The IDT-based BoW/FV representations proposed by [23] can be easily integrated into the I3D model, significantly reducing the inference time of the model while improving the accuracy of the model.
2.2 Attention Mechanism
The research on attention mechanisms in deep learning is inspired by the visual attention mechanism of the human brain. Applying an attention mechanism enhances useful information and suppresses useless information. The spatial attention mechanism STN was introduced in [24], which can locate objects and learn corresponding deformations, reducing the difficulty of model learning. An effective channel attention mechanism was proposed in SENet [25] to improve attention to channel features. CBAM was proposed in [26], which combined channel attention and spatial attention to further enhance the discrimination of the extracted features. A self-attention mechanism was applied in [27] to improve video classification accuracy by establishing non-local relations in the spatiotemporal dimension of videos. A cascaded attention (Cas-Attention) module was proposed in AKEN [28] to highlight the discriminative parts of objects and obtain more discriminative features. In [29], a spatial selective sampling module was used to enlarge local regions of the object, and spatial attention was embedded in multi-path convolutions to adapt kernel sizes to the region of interest and the background, thereby extracting more discriminative fine-grained features. Wang and Koniusz [30] proposed a novel Multi-order Multi-mode Transformer (3Mformer) with two modules (Multi-order Pooling (MP) and Temporal block Pooling (TP)) to establish dependencies between unlinked body joints, achieving state-of-the-art results. A new multi-granularity part sampling attention (MPSA) network for fine-grained visual classification was proposed in [31], which extracts local information at different scales and enhances high-level feature representations through discriminative local features at different granularities, achieving good results. Xu et al. [32] proposed a novel internal ensemble learning transformer (IELT) that votes on the labels of discriminative regions based on the attention map and uses spatial relationships as cross-layer features, solving the problem of inconsistent performance of multi-head self-attention (MHSA). A progressive multi-task anti-noise learning (PMAL) framework and a progressive multi-task extraction (PMD) framework were proposed in [33] to address the intra-class variation caused by image noise in FGVR: image denoising is treated as an additional task alongside image recognition, gradually forcing the model to learn noise invariance and achieving high recognition accuracy. Liu et al. [34] proposed Fine-Grained Semantic Category Reasoning (FineR), which leverages the world knowledge of a Large Language Model (LLM) as a proxy to reason about fine-grained category names, showing promise in new domains where working in the wild and collecting expert annotations is difficult.
2.3 Calculation of Matching Degree of Feature Map
The calculation of the matching degree of feature maps is used to judge the similarity between multiple feature maps. MA-CNN [5] obtained the matching degree between the channels of a feature map by computing the peak response of each channel, and adopted channel grouping to cluster channels with adjacent peak-response regions. The CA module proposed in PCA-Net [35] performs a bilinear operation on two feature maps to obtain a bilinear matrix, which largely measures the similarity between the feature maps. However, when the channel dimension of the feature map is high, the bilinear operation consumes substantial computing resources. A target-oriented matching mechanism was proposed in TOAN [36] to calculate similarity in the spatial feature dimension, reducing intra-class variance and achieving better results in fine-grained few-shot classification (FGFS). Wang and Koniusz [37] proposed an advanced variant of Dynamic Time Warping to factor out misalignment between query and support sequences of 3D body joints, while achieving the best alignment in the temporal and simulated camera viewpoint spaces. The dynamic time warping (DTW) in [38] matches sequence pairs and performs well in applications such as predicting time-series evolution and clustering time series. Mechanically, our method is similar to [39]: both SMRAM and the MC-Loss in [39] use a diversity loss to optimize model parameters and involve interaction between channels of feature maps coming from different stages of the CNN, where the feature map of each stage carries enough information to achieve good results. Functionally, the purpose of SMRAM is to make multi-stage feature maps focus on different spatial discriminative areas, while the goal of MC-Loss is to make feature channels both discriminative and diverse.
Structurally, SMRAM is a plug-and-play attention module for fine-grained classification, consisting of average pooling, batch normalization, the softmax function, CCMP, and a diversity loss. As a loss function, MC-Loss shapes the behavior of feature channels by affecting network training; it consists of a discriminative component and a diversity component. The discriminative component forces all feature channels belonging to the same class to be discriminative, while the diversity component imposes constraints on the channels.
3 Methodology
Notation: Denote by \(M(\cdot )\) the backbone convolutional network with L stages. Possible operations in each stage include the convolutional layer \(W(*)\), the batch norm layer \(BN(*)\), the ReLU activation layer \(ReLU(*)\), and the softmax function \(softmax(*)\). The feature map output by each stage is \({\mathcal {X}}^l\in R^{C_l\times W_l\times H_l}\), l=1,2,...,L, where \(C_l\), \(W_l\), and \(H_l\) are the number of channels, width, and height of the feature map, respectively. \({\mathcal {X}}_e^l\) denotes the channel-enhanced feature map of the \(l^{th}\) stage obtained by CRAM. \(AvgPool(*)\) is the average pooling operation that averages the feature values of the feature map. \(MaxPool(*)\) is the max pooling operation that takes the maximum feature value of the feature map. \(FC_1(*)\) and \(FC_2(*)\) are two convolutional layers used for dimensionality reduction and restoration of channel features, respectively. For simplicity and fairness, \({\mathcal {X}}_e^l, {\mathcal {X}}_s^l = CRAM({\mathcal {X}}^l)\); \({\mathcal {X}}_s^l\) and \({\mathcal {X}}^l\) have the same dimension \(C_l \times W_l \times H_l\), while the dimension of \({\mathcal {X}}_e^l\) becomes \(C_1 \times W_l \times H_l\) through CRAM, so that the channel dimensions of \({\mathcal {X}}_e^l\) are uniform across stages. We obtain \(L_{div}\) through SMRAM, which is used only during training.
Extracting multi-stage features is now a common and robust approach in many visual tasks. For example, the model in [40] extracts multi-stage features, utilizes an FPN module, and proposes a high-temperature refinement module to learn appropriate feature scales; its background suppression module uses classification confidence to divide the feature map into foreground and background and suppresses feature values in low-confidence areas. Unlike [40], we start from the channel and spatial dimensions of the feature map and propose CRAM and SMRAM, each extracting features more advantageous for the task; both works share the common practice of extracting multi-stage features in a CNN, achieving better classification results.
In FGVC, extracting latent fine-grained knowledge on the channels of feature maps and localizing multiple distinct discriminative parts in space are important for classification. Therefore, we propose two lightweight modules: 1) Channel re-attention module (CRAM), which can obtain the feature map of the channel enhanced \({\mathcal {X}}_e^l\) and the feature map of the channel suppressed \({\mathcal {X}}_s^l\). The former is the output of the current stage, and the latter is sent to the subsequent stage to explore the potential fine-grained knowledge. After that, we will unify the channel dimensions of the multi-stage \({\mathcal {X}}_e^l\) as the final output. 2) Spatial multi-region attention module (SMRAM), through this module, the \({\mathcal {X}}_e^l\) of each stage can focus on different parts of the object in the spatial dimension.
Compared with CBAM [26]: functionally, CRAM enhances and suppresses feature-map channels only in the channel dimension, highlighting important channels and suppressing unimportant ones to force the network to mine potential fine-grained features, whereas CBAM combines channel and spatial attention to help the network focus on salient features in both dimensions, while ignoring parts that are not salient but are helpful for fine-grained image classification. Structurally, CRAM uses average pooling and maximum pooling to integrate spatial information and simultaneously outputs a channel-enhanced feature map and a channel-suppressed feature map, helping to discover fine-grained channel features that are not salient but are helpful for classification. CBAM contains two sequential sub-modules, a channel attention module and a spatial attention module, each of which outputs only one feature map of salient features. By producing both an enhanced and a suppressed output, CRAM provides a flexible way to process feature maps at different stages, which may help the network better capture fine-grained features.
The structure diagram of CRAM. Through this module, we obtain the channel-enhanced feature map \({\mathcal {X}}_e^l\) and the channel-suppressed feature map \({\mathcal {X}}_s^l\), where \(Conv\_layer\) is used to adjust the number of channels of \({\mathcal {X}}_e^l\), and the \(F(*)\) function is shown in Eqn. (5)
3.1 Channel Re-attention Module
Suppose the backbone has L stages, and the feature map obtained at the current stage of the backbone is \({\mathcal {X}}^l\in R^{C_l\times W_l\times H_l}\), where \(C_l\), \(W_l\), and \(H_l\) are the number of channels, width, and height of the feature map, respectively. First, we use average pooling and max pooling operations to integrate the spatial information of the feature map, obtaining two feature maps \({\mathcal {X}}^{c}_{avg}\in R^{C_l\times 1\times 1}\) and \({\mathcal {X}}^{c}_{max}\in R^{C_l\times 1\times 1}\). Besides, we add the two feature maps and pass the result through the sigmoid function, obtaining the mask matrix \({\textbf {I}}\in R^{C_l\times 1\times 1}\) that represents the importance of each channel in the feature map
where AP and MP denote average pooling and maximum pooling, respectively; the MLP can be expressed by the following formula
where the activation function \(ReLU(*)\) is used to eliminate negative activations, \(FC_1(*)\) and \(FC_2(*)\) are used to maximize the retention of fine-grained knowledge and simplify the calculation. The sigmoid function is used to calculate the importance of the channel. Next, we assign corresponding weights to the channel positions of \({\mathcal {X}}^l\) to obtain the feature map of the channel enhanced \({\mathcal {X}}_e^l\) of the current stage
where \(\otimes \) represents an element-wise multiply operation. We unify the channel dimensions of \({\mathcal {X}}_e^l\) through the \(Conv\_layer\) as the output of the current stage. Then, normalize \({\textbf {I}}\) using the softmax function to get \({\textbf {I}}^*\)
From \({\textbf {I}}^*\), we can derive the mask matrix \({\textbf {S}}\) of the suppressed channels; \(F(*)\) in Fig. 3 represents this operation
where \({\textbf {I}}^*_{max}\) is the maximum value in \({\textbf {I}}^*\), and \(\omega \) and \(\delta \) are hyperparameters representing the degree to which a channel is suppressed and the threshold above which a channel needs to be suppressed, respectively. Finally, we use \({\textbf {S}}\) to obtain \({\mathcal {X}}_s^l\)
In general, the function of this module can be expressed as \({\mathcal {X}}_e^l, {\mathcal {X}}_s^l = CRAM({\mathcal {X}}^l)\): we input the feature map of the current stage into CRAM, obtaining the channel-enhanced feature map \({\mathcal {X}}_e^l\) and the channel-suppressed feature map \({\mathcal {X}}_s^l\). \({\mathcal {X}}_e^l\) is the output of the current stage, and \({\mathcal {X}}_s^l\) is fed to the next stage to force the network to mine potential fine-grained features. The CRAM is shown in Fig. 3.
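To make the data flow of CRAM concrete, the following PyTorch sketch implements the module as we read it from the description above. The reduction ratio of \(FC_1\)/\(FC_2\) and the exact thresholding rule used to build \({\textbf {S}}\) from \({\textbf {I}}^*\) (suppress by factor \(\omega \) when a normalized weight exceeds \(\delta \cdot {\textbf {I}}^*_{max}\)) are our assumptions, not details fixed by the text.

```python
import torch
import torch.nn as nn


class CRAM(nn.Module):
    """Sketch of the channel re-attention module: an SE-style MLP
    produces the importance mask I, which both enhances the current
    stage's output and builds a suppression mask S for the next stage.
    Reduction ratio and the suppression rule are assumptions."""

    def __init__(self, in_ch, out_ch, reduction=16, omega=0.5, delta=0.9):
        super().__init__()
        self.omega, self.delta = omega, delta
        self.fc1 = nn.Conv2d(in_ch, in_ch // reduction, 1)  # FC_1: reduce
        self.fc2 = nn.Conv2d(in_ch // reduction, in_ch, 1)  # FC_2: restore
        self.conv_layer = nn.Conv2d(in_ch, out_ch, 1)       # unify channel dims

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = torch.mean(x, dim=(2, 3), keepdim=True)        # AP over space
        mx = torch.amax(x, dim=(2, 3), keepdim=True)         # MP over space
        mlp = lambda t: self.fc2(torch.relu(self.fc1(t)))
        i = torch.sigmoid(mlp(avg) + mlp(mx))                # mask I, (B, C, 1, 1)
        x_e = self.conv_layer(x * i)                         # channel enhanced
        i_star = torch.softmax(i.view(b, c), dim=1).view(b, c, 1, 1)
        i_max = i_star.amax(dim=1, keepdim=True)
        # Suppress channels whose normalized weight exceeds delta * max (assumed rule)
        s = torch.where(i_star >= self.delta * i_max,
                        torch.full_like(i_star, self.omega),
                        torch.ones_like(i_star))
        x_s = x * s                                          # channel suppressed
        return x_e, x_s
```

In use, `x_e` would serve as the stage output while `x_s` replaces the stage's feature map on the path into the next stage.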
3.2 Spatial Multi-region Attention Module
Fine-grained features include spatial and channel information. We obtain the multi-stage \({\mathcal {X}}_e^l\) of the backbone through CRAM; the next step is to find the discriminative parts in space. However, focusing only on the most salient parts of objects is not enough for fine-grained classification. The key to FGVC lies in the non-salient but discriminative parts, so SMRAM is proposed to make the multi-stage \({\mathcal {X}}_e^l\) attend to different discriminative parts. The SMRAM is shown in Fig. 2.
Assume that the feature maps obtained by CRAM in the last three stages of the backbone are \({\mathcal {X}}_e^{L-2}\in R^{C_a\times W_{L-2}\times H_{L-2}}\), \({\mathcal {X}}_e^{L-1}\in R^{C_a\times W_{L-1}\times H_{L-1}}\), and \({\mathcal {X}}_e^L\in R^{C_a\times W_L\times H_L}\), respectively, where \(C_a\) denotes the number of channels after unification. To reduce computation and ensure that SMRAM can be updated adaptively, we downsample the multi-stage feature maps to the smallest spatial scale, consistent with the feature map of the last stage. That is the function of the \(Conv\_block\) in the SMRAM.
To explore the local regions of interest in the feature maps of each stage, we first use the 1\(\times \)1 convolution \(\phi (*)\) to reduce the number of channels of each stage's feature map to one, obtaining \({\mathcal {X}}_e^l\in R^{1\times W_l\times H_l}\), l=L-2, L-1, L. After that, the softmax function is applied in the spatial dimension, and its value represents the degree of spatial attention. Finally, the feature maps of the stages are concatenated in the channel dimension to obtain \({\mathcal {X}}_{concat}\in R^{3\times W_L\times H_L}\), where each channel in \({\mathcal {X}}_{concat}\) corresponds to one stage
where \({\mathcal {X}}_{ej}^l\) denotes the \(j^{th}\) spatial pixel of the \(l^{th}\) feature map. We input \({\mathcal {X}}_{concat}\) into cross-channel max pooling (CCMP) [41], which computes the maximum response of each spatial pixel across the channel dimension. With CCMP, we obtain a one-dimensional vector \({\mathcal {X}}^*_{concat}\) of length \(W_L\times H_L\). CCMP tends to respond to the peaks along the channel dimension for each pixel of \({\mathcal {X}}_{concat}\). We then apply the operation \(h(*)\), which sums and averages the elements of \({\mathcal {X}}^*_{concat}\) and measures the similarity between the feature maps of the stages: the larger the value of \(h(*)\), the more significant the difference between the local regions attended to by the feature maps of different stages. Here \(h(*)\) is defined as
where c is the length of \({\mathcal {X}}^*_{concat}\) and \(\varepsilon \) is the number of channels of \({\mathcal {X}}_{concat}\), whose value is 3 in our work. Through SMRAM, we obtain a value \(L_{div}\) that represents the similarity of the multi-stage \({\mathcal {X}}_e^l\). During training, we continuously increase the value of \(h(*)\), i.e., reduce the \(L_{div}\) loss, to make the feature maps of different stages focus on different local regions
The algorithm flow of the SMRAM is shown in Algorithm 1.
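The steps above can be sketched in PyTorch as follows. For brevity we omit the batch normalization and the \(Conv\_block\) downsampling and assume the inputs already share the spatial size of the last stage; the exact normalization inside \(h(*)\) and the form \(L_{div} = 1 - h\) are our assumptions from the text, since larger \(h\) must correspond to smaller \(L_{div}\).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SMRAM(nn.Module):
    """Sketch of the spatial multi-region attention module (training only).
    Each stage's feature map is projected to one channel by a 1x1 conv,
    softmax-normalized over space, stacked, reduced by cross-channel max
    pooling (CCMP), and averaged to give h; L_div = 1 - h is assumed."""

    def __init__(self, channels, num_stages=3):
        super().__init__()
        self.phi = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_stages)])

    def forward(self, feats):
        # feats: list of (B, C_a, H_L, W_L) tensors, one per stage,
        # already downsampled to the last stage's spatial size.
        maps = []
        for f, phi in zip(feats, self.phi):
            a = phi(f).flatten(1)                 # (B, H_L * W_L)
            maps.append(F.softmax(a, dim=1))      # spatial attention per stage
        x_concat = torch.stack(maps, dim=1)       # (B, num_stages, H_L * W_L)
        ccmp = x_concat.amax(dim=1)               # CCMP: max over stages per pixel
        # h: sum over pixels, average over batch, normalize by stage count.
        h = ccmp.sum(dim=1).mean() / len(feats)
        return 1.0 - h                            # L_div shrinks as regions diverge
```

If the three attention maps overlap completely, the per-pixel maxima sum to 1 per image and \(L_{div}\) stays near its maximum; as the maps move to disjoint regions, the sum of maxima grows and \(L_{div}\) decreases.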
3.3 Overall Network Design
We apply our proposed modules to mainstream CNNs. Taking Resnet50 as an example, it is mainly composed of five stages (L=5); after each stage, the spatial size of the feature map is halved and the number of channels is doubled. We insert CRAM after the third, fourth, and fifth stages, taking the channel-enhanced feature map \({\mathcal {X}}_e^l\) obtained by CRAM as the output of each stage. In the training phase, the \({\mathcal {X}}_e^l\) of each stage is delivered to SMRAM, making the feature maps of the three stages focus on different discriminative parts of the object. Finally, we obtain a more refined feature representation in both channel and space.
In the training phase, we calculate the classification loss of each stage, the overall classification loss, and the \(L_{div}\) loss
where y is the ground-truth label of the input image, represented as a one-hot vector, i denotes the output of the \(i^{th}\) stage of the backbone, and the softmax function computes the network's predicted label. The final optimization objective is
where N=3 denotes the outputs of the last three stages of the backbone, and \(\alpha ,\beta ,\gamma \) are hyperparameters weighting each loss term.
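The training objective can be sketched as follows: \(\alpha \) weights the per-stage cross-entropy losses, \(\beta \) the overall classification loss, and \(\gamma \) the diversity loss from SMRAM (the values 2, 1, 1 come from Sect. 4.1; the exact combination is our reading of the text).

```python
import torch
import torch.nn.functional as F


def total_loss(stage_logits, overall_logits, y, l_div,
               alpha=2.0, beta=1.0, gamma=1.0):
    """Sketch of the final optimization objective: a weighted sum of
    the N per-stage classification losses, the overall classification
    loss, and the SMRAM diversity loss L_div (assumed formulation)."""
    l_stage = sum(F.cross_entropy(logit, y) for logit in stage_logits)
    l_all = F.cross_entropy(overall_logits, y)
    return alpha * l_stage + beta * l_all + gamma * l_div
```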
In the testing phase, we take the average of the predicted label values of the three stages and the overall feature as the final prediction result.
4 Experiments
4.1 Dataset and Implementation Details
We use CUB\(\_\)200\(\_\)2011 [1], FGVC-Aircraft [3], Stanford Cars [4], and Stanford Dogs [2], four fine-grained benchmark datasets for experiments. All four benchmark datasets provide image-level labels and other annotation information, but our model only uses image-level labels. The specific settings of the dataset are shown in Table 1.
The study uses Resnet50, Resnet101 [42], and Densenet161 [43] as backbones, all pretrained on ImageNet [44]; the input image is resized to 550 \(\times \) 550 and center-cropped to 448 \(\times \) 448. In the training phase, we perform data augmentation by random horizontal flipping; in the testing phase, we only use a center crop of 448 \(\times \) 448. The hyperparameters \(\alpha ,\beta ,\gamma ,\omega ,\delta \) are set to 2, 1, 1, 0.5, and 0.9, respectively. The learning rate of the backbone and SMRAM is set to 0.0002, and the learning rate of CRAM is 0.006; cosine decay is used to adjust the learning rates. Stochastic Gradient Descent is used as the optimizer with a momentum of 0.9 and a weight decay of 0.00001. The batch size is set to 20, and a total of 200 epochs are trained. All experiments are performed on a single Tesla V100 GPU, with the Pytorch toolbox as the main implementation substrate.
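The optimization setup above can be written down directly in PyTorch. The two parameter lists passed in (backbone+SMRAM parameters versus CRAM parameters) are hypothetical stand-ins for however the real model exposes its parameter groups; everything else (learning rates, momentum, weight decay, cosine schedule over 200 epochs) follows the values stated in this section.

```python
import torch


def build_optimizer(backbone_smram_params, cram_params, epochs=200):
    """Sketch of the training setup: SGD with momentum 0.9 and weight
    decay 1e-5, lr 2e-4 for the backbone/SMRAM group and 6e-3 for CRAM,
    with cosine decay over the full training run."""
    opt = torch.optim.SGD(
        [{"params": backbone_smram_params, "lr": 2e-4},
         {"params": cram_params, "lr": 6e-3}],
        lr=2e-4, momentum=0.9, weight_decay=1e-5)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched
```

After each epoch, `sched.step()` would decay both group learning rates along the cosine curve.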
4.2 Compared with State-of-the-Art Methods
We compare the Top-1 accuracy of our method with current state-of-the-art methods on four datasets; the results are listed in Table 2, where \(1-stage\) refers to taking only the original image as input, while the other methods find discriminative parts of objects using the raw image as input and then use them as input for subsequent stages.
-
On the CUB\(\_\)200\(\_\)2011: CUB\(\_\)200\(\_\)2011 is the most challenging benchmark dataset in FGVC. With our proposed modules, the performance gains on the different backbones in Table 2 are substantial, proving the effectiveness of the proposed method. With Resnet50 as the backbone, our method improves by 9.1\(\%\) and 7.8\(\%\) over DeepLAC and Part-RCNN, respectively, both of which use extra annotation information. Our model is 4.1\(\%\) higher than RA-CNN with VGG [58]. Compared with the multi-stage methods RA-CNN, NTS, PCA-Net, MGE-CNN, S3N, and FDL, we outperform them by 4.1\(\%\), 1.9\(\%\), 1.1\(\%\), 0.9\(\%\), 0.9\(\%\), and 0.8\(\%\), respectively, and by 1.6\(\%\) over API-Net, which uses image pairs as input. Our approach dramatically improves accuracy compared with MAMC, CIN, AENet, and ACNet, all of which are one-stage methods. Table 2 also illustrates the necessity of mining the knowledge in the channel and spatial dimensions of the feature map that is helpful for fine-grained classification.
-
On the FGVC-Aircraft: Our method achieves competitive results on this dataset and surpasses RA-CNN with VGG. Compared with NTS, CIN, API-Net, ACNet, PCA-Net, and S3N, all using Resnet50 as the backbone, our method exceeds them by 1.8\(\%\), 0.6\(\%\), 0.2\(\%\), 0.8\(\%\), 0.8\(\%\), and 0.4\(\%\), respectively. The performance of our model is slightly lower than that of AENet and FDL; however, AENet takes three images as input, and FDL is a two-stage method, both of which consume more memory in data processing.
-
On the Stanford Cars: Our method achieves the best results when equipped with Resnet101 and Densenet161, respectively. When Resnet50 is used as the backbone, our method exceeds all other methods in Table 2 except API-Net. Although API-Net is 0.1\(\%\) higher than ours, it uses image pairs as input, which dramatically increases the consumption of computing resources.
-
On the Stanford Dogs: Most previous methods have not been tested on this dataset because of its computational complexity, yet our method also obtains competitive results. We achieve the best results when Resnet101 and Densenet161 are in use. When Resnet50 is used as the backbone, our method outperforms all other methods except API-Net. Although API-Net is 0.1\(\%\) higher than ours, it uses image pairs as input, dramatically increasing computing resource consumption.
Our method performs well on all four datasets thanks to its simplicity and effectiveness. When Resnet50 is the backbone, API-Net and AENet achieve the best results on Stanford Cars and Stanford Dogs, respectively, but both perform poorly on CUB\(\_\)200\(\_\)2011. Although FDL outperforms us on FGVC-Aircraft, we outperform it on the remaining datasets. Moreover, our method achieves the best results on all four datasets when Resnet101 and Densenet161 are used.
4.3 Ablation Studies
To verify the effectiveness of our proposed modules, we conduct ablation experiments on the four fine-grained benchmark datasets with Resnet50. First, we explore the role of each module by sequentially adding our modules to the backbone, as shown in Table 3. Next, we investigate two crucial parameters, the channel inhibition degree \(\delta \) and the spatial multi-region division degree \(\gamma \); the results are shown in Table 4 and Table 5, respectively.
The effect of CRAM: To obtain more discriminative fine-grained features in the channel dimension, we insert CRAMs into the third, fourth, and fifth stages of Resnet50. After adding this module, our accuracy on the four datasets increases by 1.3\(\%\), 1.4\(\%\), 0.9\(\%\), and 1.4\(\%\), respectively, proving the effectiveness of CRAM; the results are shown in the third row of Table 3. To further assess our module, we also replace CRAM with CBAM on the base network; the results of this variant are shown in the second row of the table.
The effect of SMRAM: To obtain several distinct discriminative part-level fine-grained features in the spatial dimension, we add SMRAM, and our final accuracy on the four datasets improves by a further 0.4\(\%\), 0.4\(\%\), 0.1\(\%\), and 0.2\(\%\), respectively, which directly proves the effectiveness of SMRAM. The results are shown in the fourth row of Table 3.
The setting of the parameter \(\varvec{\delta }\): The parameter \(\delta \) represents the degree to which channels need to be suppressed, i.e., it determines which channels of the feature map are probed for knowledge beneficial to fine-grained classification. By varying the value of \(\delta \), we find that when \(\delta =0.9\), the model learns a feature representation that is more helpful for classification. This also shows that the information-poor channels of the feature map contain much favorable fine-grained knowledge. The details are shown in Table 4.
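One plausible reading of the channel-suppression idea can be sketched as follows. This is a hypothetical re-implementation, not the authors' exact formulation: the function name, the use of global average pooling as channel saliency, and the fraction of channels suppressed (`frac`) are all assumptions; only the suppression strength `delta` comes from the text.

```python
import torch

def suppress_salient_channels(feat: torch.Tensor, delta: float = 0.9,
                              frac: float = 0.5) -> torch.Tensor:
    """Down-weight the most salient channels of a (B, C, H, W) feature map
    so that gradients also flow through information-poor channels.

    `delta` is the suppression strength (0.9 in the paper); `frac` is an
    assumed fraction of top channels to suppress.
    """
    b, c, h, w = feat.shape
    # Channel saliency via global average pooling over the spatial dims.
    saliency = feat.mean(dim=(2, 3))                   # (B, C)
    k = max(1, int(c * frac))
    top_idx = saliency.topk(k, dim=1).indices          # most salient channels
    mask = torch.ones_like(saliency)
    mask.scatter_(1, top_idx, 1.0 - delta)             # scale them by (1 - delta)
    return feat * mask.view(b, c, 1, 1)
```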
The setting of the parameter \(\varvec{\gamma }\): The parameter \(\gamma \) represents the degree of spatial multi-region division, i.e., how many discriminative parts are sought in space. Since the core of fine-grained classification is to explore multiple inconspicuous but discriminative parts of the object, \(\gamma \) provides information about the distribution of useful discriminative features in fine-grained images. By varying the value of \(\gamma \), we see that when \(\gamma =1\), the model can find the various discriminative parts of objects that are helpful for classification, as shown in Table 5.
Visualization of the activation maps produced by the third, fourth, and fifth stages of the model with a bird as the input image, where the hyperparameter \(\delta \) represents the degree to which channels are suppressed; the underline in the figure marks the optimal value of \(\delta \)
4.4 Visualization
We use Grad-CAM [59] to visualize the activation maps generated on the four fine-grained benchmark datasets by three models: the backbone only, the backbone plus CRAM, and the backbone plus both CRAM and SMRAM. Resnet50 serves as the backbone throughout the experiment. The first row shows the original images sampled from the four datasets. Under each original image, the activation maps in the first to third columns come from the third, fourth, and fifth stages of the multi-stage CNN. In particular, each activation map is obtained by averaging the activation values over the channel dimension of the feature map. As shown in Fig. 4, with the backbone only, the activation maps indicate that the features are not satisfactory and attend only to the most salient parts of the object. After adding CRAM, the extracted features become more comprehensive. When coupled with SMRAM, the feature maps produced by each stage become comprehensively rich and spatially focused on multiple distinct discriminative parts. In Fig. 5, we explore the performance of the model by controlling the value of the hyperparameter \(\delta \); the best \(\delta \) makes the extracted features neither weak nor redundant. In Fig. 6, we make the multi-stage feature maps attend to the optimal set of discriminative parts by controlling the value of the hyperparameter \(\gamma \). We can see that the performance of our model is best when \(\delta =0.9\) and \(\gamma =1\). The entire visualization experiment, both directly and indirectly, demonstrates the ability of CRAM to extract latent features in the channel dimension and the power of SMRAM to capture multiple distinct discriminative parts in space.
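The stage-wise maps described above are produced by averaging a feature map over its channel dimension and normalizing per image; a minimal sketch of that step (the function name and output size are our own, and the bilinear upsampling to the input resolution is an assumption for overlay purposes):

```python
import torch
import torch.nn.functional as F

def activation_map(feat: torch.Tensor, out_size: int = 448) -> torch.Tensor:
    """Average a (B, C, H, W) feature map over channels, min-max normalize
    each image's map to [0, 1], and upsample to the input resolution."""
    amap = feat.mean(dim=1, keepdim=True)                 # (B, 1, H, W)
    amin = amap.amin(dim=(2, 3), keepdim=True)
    amax = amap.amax(dim=(2, 3), keepdim=True)
    amap = (amap - amin) / (amax - amin + 1e-8)           # per-image scaling
    return F.interpolate(amap, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
```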
5 Conclusion
This work proposes channel re-attention and spatial multi-region attention for fine-grained visual classification. In particular, we propose two lightweight modules. One is the channel re-attention module, which helps the network mine the neglected fine-grained knowledge in the information-poor channels of the feature map, obtaining more discriminative feature representations. The other is the spatial multi-region attention module, which calculates the matching degree between the channel-enhanced feature maps generated at different stages; through training, the channel-enhanced feature map at each stage comes to focus on a different discriminative part of the object. The cooperation of the two proposed modules helps the network extract finer and more discriminative feature representations from both the channel and spatial dimensions. Our method achieves state-of-the-art performance on several fine-grained benchmark datasets. In the future, we plan to apply the proposed CRAM and SMRAM to other backbones, such as the Transformer. In that setting, feature maps can be extracted from multiple encoder layers of the Transformer and regarded as the feature maps of different stages. CRAM would then multiply the channel importance mask element-wise with the original feature map to enhance the features of important channels, while SMRAM would calculate the similarity between the feature maps of different stages to adjust the attention distribution, so that the model focuses on different local regions. Furthermore, by integrating CRAM and SMRAM, the Transformer could achieve seamless fusion of local and global features; the two modules can be added as independent components or adjusted and optimized as needed to achieve better performance.
References
Welinder P, Branson S, Mita T, Wah C, Schroff F, Belongie S, Perona P (2010) Caltech-ucsd birds 200
Khosla A, Jayadevaprakash N, Yao B, Li F-F (2011) Novel dataset for fine-grained image categorization: Stanford dogs. In: Proc. CVPR workshop on fine-grained visual categorization (FGVC), vol 2. Citeseer
Krause J, Stark M, Deng J, Fei-Fei L (2013) 3d object representations for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision workshops, pp 554–561
Woo S, Park J, Lee J-Y, Kweon IS (2018) Cbam: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), pp 3–19
Zheng H, Fu J, Mei T, Luo J (2017) Learning multi-attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE international conference on computer vision, pp 5209–5217
Hanselmann H, Ney H (2020) Elope: Fine-grained visual classification with efficient localization, pooling and embedding. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 1247–1256
Lin T-Y, RoyChowdhury A, Maji S (2015) Bilinear cnn models for fine-grained visual recognition. In: Proceedings of the IEEE international conference on computer vision, pp 1449–1457
Kong S, Fowlkes C (2017) Low-rank bilinear pooling for fine-grained classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 365–374
Lin D, Shen X, Lu C, Jia J (2015) Deep lac: Deep localization, alignment and classification for fine-grained recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1666–1674
Zhang N, Donahue J, Girshick R, Darrell T (2014) Part-based r-cnns for fine-grained category detection. In: European conference on computer vision, pp 834–849. Springer
Fu J, Zheng H, Mei T (2017) Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4438–4446
Yang Z, Luo T, Wang D, Hu Z, Gao J, Wang L (2018) Learning to navigate for fine-grained classification. In: Proceedings of the European conference on computer vision (ECCV), pp 420–435
Yang S, Liu S, Yang C, Wang C (2021) Re-rank coarse classification with local region enhanced features for fine-grained image recognition. arXiv preprint arXiv:2102.09875
He X, Peng Y, Zhao J (2017) Fast fine-grained image classification via weakly supervised discriminative localization. IEEE Trans Circuits Syst Video Technol 1–1
Hu T, Qi H, Huang Q, Lu Y (2019) See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. arXiv preprint arXiv:1901.09891
Koniusz P, Wang L, Cherian A (2021) Tensor representations for action recognition. IEEE Trans Pattern Anal Mach Intell 44(2):648–665
Koniusz P, Zhang H (2021) Power normalizations in fine-grained image, few-shot image and graph classification. IEEE Trans Pattern Anal Mach Intell 44(2):591–609
Zhang C, Yao Y, Liu H, Xie G-S, Shu X, Zhou T, Zhang Z, Shen F, Tang Z (2020) Web-supervised network with softly update-drop training for fine-grained visual classification. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 12781–12788
Song K, Wei X-S, Shu X, Song R-J, Lu J (2020) Bi-modal progressive mask attention for fine-grained recognition. IEEE Trans Image Process 29:7006–7018
Chen T, Lin L, Chen R, Wu Y, Luo X (2018) Knowledge-embedded representation learning for fine-grained image recognition. arXiv preprint arXiv:1807.00505
Liu C, Liang Y, Xue Y, Qian X, Fu J (2020) Food and ingredient joint learning for fine-grained recognition. IEEE Trans Circuits Syst Video Technol 99:1–1
Wang L, Koniusz P (2021) Self-supervising action recognition by statistical moment and subspace descriptors. In: Proceedings of the 29th ACM international conference on multimedia, pp 4324–4333
Wang L, Koniusz P, Huynh DQ (2019) Hallucinating idt descriptors and i3d optical flow features for action recognition with cnns. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8698–8708
Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K (2015) Spatial transformer networks. In: Advances in neural information processing systems, vol 28
Hu J, Shen L, Albanie S, Sun G, Wu E (2020) Squeeze-and-excitation networks. IEEE Trans Pattern Anal Mach Intell 42(8):2011–2023
Woo S, Park J, Lee JY, Kweon IS (2018) Cbam: convolutional block attention module. Springer, Cham
Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7794–7803
Attentional kernel encoding networks for fine-grained visual categorization (2021)
Ding Y, Han Z, Zhou Y, Zhu Y, Jiao J (2021) Dynamic perception framework for fine-grained recognition. IEEE Trans Circuits Syst Video Technol PP(99):1–1
Wang L, Koniusz P (2023) 3mformer: Multi-order multi-mode transformer for skeletal action recognition. arXiv preprint arXiv:2303.14474
Wang J, Xu Q, Jiang B, Luo B, Tang J (2024) Multi-granularity part sampling attention for fine-grained visual classification. IEEE Trans Image Process
Xu Q, Wang J, Jiang B, Luo B (2023) Fine-grained visual classification via internal ensemble learning transformer. IEEE Trans Multimed 25:9015–9028
Liu D (2024) Progressive multi-task anti-noise learning and distilling frameworks for fine-grained vehicle recognition. arXiv preprint arXiv:2401.14336
Liu M, Roy S, Li W, Zhong Z, Sebe N, Ricci E (2024) Democratizing fine-grained visual recognition with large language models. arXiv preprint arXiv:2401.13837
Zhang T, Chang D, Ma Z, Guo J (2021) Progressive co-attention network for fine-grained visual classification. In: 2021 international conference on visual communications and image processing (VCIP), pp 1–5. IEEE
Huang H, Zhang J, Yu L, Zhang J, Wu Q, Xu C (2021) Toan: target-oriented alignment network for fine-grained image categorization with few labeled samples. IEEE Trans Circuits Syst Video Technol
Wang L, Koniusz P (2022) Temporal-viewpoint transportation plan for skeletal few-shot action recognition. In: Proceedings of the Asian conference on computer vision, pp 4176–4193
Wang L, Koniusz P (2022) Uncertainty-dtw for time series and sequences. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI, pp 176–195. Springer
Chang D, Ding Y, Xie J, Bhunia AK, Li X, Ma Z, Wu M, Guo J, Song Y-Z (2020) The devil is in the channels: mutual-channel loss for fine-grained image classification. IEEE Trans Image Process 29:4683–4695
Chou P-Y, Kao Y-Y, Lin C-H (2023) Fine-grained visual classification with high-temperature refinement and background suppression. arXiv preprint arXiv:2303.06442
Goodfellow I, Warde-Farley D, Mirza M, Courville A, Bengio Y (2013) Maxout networks. In: International conference on machine learning, pp 1319–1327. PMLR
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708
Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, pp 248–255. IEEE
Fu J, Zheng H, Mei T (2017) Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4438–4446
Sun M, Yuan Y, Zhou F, Ding E (2018) Multi-attention multi-class constraint for fine-grained image recognition. In: Proceedings of the European conference on computer vision (ECCV), pp 805–821
Gao Y, Han X, Wang X, Huang W, Scott M (2020) Channel interaction networks for fine-grained image categorization. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 10818–10825
Hu Y, Liu X, Zhang B, Han J, Cao X (2021) Alignment enhancement network for fine-grained visual categorization. ACM Trans Multimed Comput Commun Appl (TOMM) 17(1s):1–20
Zhuang P, Wang Y, Qiao Y (2020) Learning attentive pairwise interaction for fine-grained classification. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 13130–13137
Chen Y, Bai Y, Zhang W, Mei T (2019) Destruction and construction learning for fine-grained image recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 5157–5166
Ji R, Wen L, Zhang L, Du D, Wu Y, Zhao C, Liu X, Huang F (2020) Attention convolutional binary neural tree for fine-grained visual categorization. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10468–10477
Liang Y, Zhu L, Wang X, Yang Y (2022) Penalizing the hard example but not too much: a strong baseline for fine-grained visual classification. IEEE Trans Neural Netw Learn Syst
Zhang L, Huang S, Liu W, Tao D (2019) Learning a mixture of granularity-specific experts for fine-grained categorization. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 8331–8340
Ding Y, Zhou Y, Zhu Y, Ye Q, Jiao J (2019) Selective sparse sampling for fine-grained image recognition. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6599–6608
Liu C, Xie H, Zha Z-J, Ma L, Yu L, Zhang Y (2020) Filtration and distillation: enhancing region attention for fine-grained visual categorization. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 11555–11562
Liang Y, Zhu L, Wang X, Yang Y (2022) A simple episodic linear probe improves visual recognition in the wild. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9559–9569
Dubey A, Gupta O, Guo P, Raskar R, Farrell R, Naik N (2018) Pairwise confusion for fine-grained visual classification. In: Proceedings of the European conference on computer vision (ECCV), pp 70–86
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626
Acknowledgements
This work was supported by the “Haiyou Plan" Industry Leadership Talent Project in Jinan (Jinan Municipal Government, 2024)
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, X., Sun, Y., Liu, X. et al. Rethinking Attention Mechanism: Channel Re-attention and Spatial Multi-region Attention for Fine-grained Visual Classification. Neural Process Lett 57, 43 (2025). https://doi.org/10.1007/s11063-025-11757-7