Abstract
In previous research on medical image classification, CNNs, constrained by local receptive fields, fail to capture long-range dependencies, resulting in poor classification performance. Transformer-based models are limited by the quadratic computational complexity of self-attention, especially when processing high-resolution medical images, which makes them difficult to deploy in computationally constrained settings without sacrificing performance. Mamba-based models have attracted considerable interest in computer vision due to their linear computational complexity. Despite their low FLOPs, however, Mamba-based models with fewer parameters perform sub-optimally on image classification tasks. To overcome these limitations, we propose InceptionMamba, a model that combines a lightweight design with high accuracy for medical image classification. Inspired by the impressive performance of the Inception architecture at relatively low computational cost, we introduce Inception modules into the Mamba-based model, and employ a channel attention mechanism to further improve performance. Additionally, we conduct an in-depth analysis of the modeling capability of the State Space Model (SSM) from the perspective of frequency response, revealing that it is better suited to medical images dominated by low-frequency components than to natural images dominated by high-frequency information. InceptionMamba demonstrates competitive performance on medical image classification tasks, surpassing most state-of-the-art methods. The source code is publicly available at https://github.com/pepper1329/InceptionMamba.
1 Introduction
Medical image classification is a core task in medical image analysis and plays a vital role in computer-aided diagnosis (CAD) systems [41, 52]. With the proliferation of medical imaging technologies, these images are extensively used in disease screening, diagnosis, and treatment monitoring [28, 31, 38, 72, 82]. However, their high dimensionality, complex background, noise, and class imbalance make manual interpretation inefficient and subjective [68].
Deep learning has shown great promise in automating medical image classification by learning discriminative features from large datasets [9, 14, 68]. These methods not only reduce clinicians’ workload but also enable early diagnosis and support intelligent healthcare applications [5, 40, 43, 47, 62, 67, 80]. Among various approaches, CNNs and Vision Transformers (ViTs) dominate medical visual representation learning [1, 13, 40, 53, 59]. Yet, challenges persist due to high inter-class similarity and low intra-class variance in medical images [55]. CNNs are efficient at capturing local patterns but limited in modeling global context due to fixed receptive fields [40, 59]. ViTs, though effective at learning long-range dependencies, suffer from high computational cost and weakened local feature learning [13, 63]. Hybrid CNN-ViT architectures have been proposed, but balancing performance and complexity remains difficult [7, 18, 65].
Recently, Structured State Space Models (SSMs) have emerged as a promising alternative, offering efficient and scalable sequence modeling by combining RNN-like recurrence with convolution-level parallelism [20, 21]. In particular, Mamba enhances long-range modeling through time-varying parameters and hardware-friendly operations, outperforming attention-based models in NLP and genomics with lower complexity [17]. Motivated by this, we propose leveraging Mamba in medical image classification to reduce computational cost while preserving or improving accuracy [17, 19].
Recent work on MambaOut [77] shows that SSM-based models underperform on general image classification due to limited global feature access in short-sequence tasks. However, given medical images’ structural and textural alignment with SSMs’ low- and mid-frequency modeling capabilities, Mamba is inherently better suited for medical image classification [6, 76]. To leverage Mamba’s strength in long-range dependency modeling and enhance local multi-scale feature extraction, we integrate the Inception module and channel attention. These components, widely validated in visual tasks [29, 32], help capture rich textures and adaptively emphasize critical features to improve classification performance.
Motivated by this insight, we introduce InceptionMamba, a lightweight SSM-based medical image classification framework combining the Inception module for local multi-scale feature extraction and Mamba for modeling long-range dependencies [32]. Integrated with a channel attention mechanism to enhance key feature representation, InceptionMamba significantly reduces parameters and FLOPs while maintaining accuracy, making it suitable for resource-limited clinical applications.
The main contributions of this paper can be summarized as follows:
-
Proposing the InceptionMamba model that combines the Inception architecture with Mamba, realizing a lightweight medical image classification approach.
-
Revealing the modeling advantages of Mamba in the medical imaging domain from the frequency response characteristics of SSMs.
-
Extensive experiments on 14 public datasets demonstrate the strong competitiveness of InceptionMamba in medical image classification.
The paper is structured as follows: Section 2 reviews recent progress in applying CNNs, Transformers, and State Space Models (SSMs) to medical image classification. Section 3 introduces the proposed InceptionMamba model and its core methods. Section 4 describes the experimental setup and results, and analyzes the frequency-domain behavior of SSMs. Section 5 concludes the paper and outlines future work.
2 Related Work
Convolutional Neural Networks. CNNs have been widely applied in medical image classification due to their strong capability in spatial feature extraction [40, 58]. Early architectures such as AlexNet laid the foundation, while subsequent models like VGG [14] and ResNet [23] introduced deeper layers and residual connections to enhance feature representation [16]. Transfer learning techniques further improved performance across various tasks, including skin lesion detection and tumor classification, thereby increasing diagnostic accuracy [14, 23, 83].
Vision Transformers. Inspired by the success of Transformers in natural language processing, Vision Transformers (ViTs) treat images as sequences of patches and leverage self-attention mechanisms to model long-range dependencies [8, 13, 35]. This enables ViTs to often outperform CNNs in medical imaging tasks by more effectively capturing global contextual information [2, 33].
Visual State Space Models. State Space Models (SSMs) model sequences using linear recurrent structures, offering a more efficient alternative to CNNs and ViTs. The S4 model [21] addressed early training instability, while Mamba [19] introduced time-varying parameters and hardware-efficient designs, achieving state-of-the-art results in vision tasks. SSMs have demonstrated strong accuracy and efficiency in medical image classification [42, 44, 56, 81, 85].
However, MambaOut revealed limitations of SSMs in image classification, prompting an analysis of their frequency response. Our experimental results show that SSMs model low-frequency signals well but perform poorly on high-frequency components. Medical images are dominated by low-frequency content, whereas natural images contain more high-frequency features. This explains the advantage of SSMs in medical imaging and their limitations in natural image domains (detailed in Sections 4.4 and 4.5).
3 Method
3.1 InceptionMamba architecture
The architecture of InceptionMamba (Figure 1) starts with a Patch Embedding layer that splits the input image \(x \in \mathbb {R}^{H \times W \times 3}\) into non-overlapping \(4\times 4\) patches, linearly projected to a default dimension \(C=96\), yielding \(x' \in \mathbb {R}^{\frac{H}{4} \times \frac{W}{4} \times C}\). The network comprises four stages of InceptionMamba Blocks, maintaining spatial dimensions while extracting features. Each stage ends with a Patch Merging module that halves spatial resolution and doubles channels. The number of blocks per stage is [2, 2, 4, 2], with channel dimensions [C, 2C, 4C, 8C]. A final classifier with adaptive pooling and a fully connected layer produces the prediction.
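The shape bookkeeping implied by this configuration can be traced with a few lines of Python (a sketch; we assume, as in typical hierarchical backbones, that Patch Merging sits between stages rather than after the final one):

```python
def inceptionmamba_stage_shapes(H, W, C=96, depths=(2, 2, 4, 2)):
    """Trace feature-map shapes through the InceptionMamba backbone.

    Patch Embedding turns an H x W x 3 image into (H/4) x (W/4) x C tokens;
    each Patch Merging halves the spatial resolution and doubles the channels.
    """
    h, w, c = H // 4, W // 4, C            # after 4x4 Patch Embedding
    shapes = []
    for i, _ in enumerate(depths):
        shapes.append((h, w, c))           # resolution within stage i
        if i < len(depths) - 1:            # Patch Merging between stages
            h, w, c = h // 2, w // 2, c * 2
    return shapes

print(inceptionmamba_stage_shapes(224, 224))
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

For a standard 224x224 input, the per-stage channel dimensions match the [C, 2C, 4C, 8C] progression described above.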
3.2 2D-Selective-Scan for Vision Data (SS2D)
2D-selective-scan (SS2D), originating from VMamba, is one of the core components of InceptionMamba [44]. SS2D consists of three parts: scan unfolding, the S6 module, and scan folding. The scan unfolding operation expands the input image into sequences along four different directions: top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right. This ensures that information from multiple directions is comprehensively captured, allowing the model to extract diverse features while maintaining linear computational complexity. Next, the sequences obtained from the four directional scans are summed and merged to restore the output to the same spatial dimensions as the input. The S6 module, derived from Mamba and built upon the S4 model, introduces a selection mechanism that dynamically adjusts the parameters of the State Space Model (SSM) based on the input. This allows the model to distinguish and retain relevant information while filtering out irrelevant signals. The pseudocode for the S6 module is provided in Algorithm 1.
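The scan unfolding and folding steps above can be sketched in plain Python. The `s6` placeholder and the choice of row-major and column-major routes are illustrative assumptions standing in for the actual S6 module and VMamba's scan ordering:

```python
def ss2d_scan(x):
    """Sketch of SS2D: unfold an H x W map along four scan routes, run a
    sequence model over each (here an identity placeholder for S6), then
    fold the sequences back to 2-D and sum them."""
    H, W = len(x), len(x[0])
    row = [x[i][j] for i in range(H) for j in range(W)]   # top-left -> bottom-right
    col = [x[i][j] for j in range(W) for i in range(H)]   # vertical scan
    routes = [row, row[::-1], col, col[::-1]]             # plus both reverses

    s6 = lambda seq: seq                                  # placeholder for S6

    out = [[0.0] * W for _ in range(H)]
    for r, seq in enumerate(routes):
        y = s6(seq)
        if r in (1, 3):                                   # undo the reversal
            y = y[::-1]
        for k, v in enumerate(y):                         # scan folding
            i, j = (k // W, k % W) if r < 2 else (k % H, k // H)
            out[i][j] += v
    return out

print(ss2d_scan([[1.0, 2.0], [3.0, 4.0]]))
# with an identity S6, every cell is accumulated once per route
```

Each route visits every pixel exactly once, so the whole pass stays linear in the number of tokens.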
3.3 InceptionMamba block
To enhance multi-scale feature representation capability and address the limitation of traditional convolutional structures constrained by fixed kernel sizes, this work adopts the Inception module as a core component [32]. The module is strategically designed to capture semantic information across different scales simultaneously. This is crucial as complex visual tasks, particularly in medical imaging, require processing both fine-grained details and large contextual patterns. By integrating parallel \(1 \times 1\) and \(3 \times 3\) convolutions, two consecutive \(3 \times 3\) convolutions, and pooling operations, the Inception module achieves effective multi-scale fusion while maintaining computational efficiency.
Specifically, the Inception module comprises four parallel branches, each contributing uniquely to the feature map:
-
\(1 \times 1\) Convolution: Primarily for feature mapping and dimensionality reduction. This acts as a bottleneck layer to significantly reduce the computational burden and number of parameters for subsequent larger convolutions.
-
\(1 \times 1\) followed by a \(3 \times 3\) Convolution: Extracts mid-scale features.
-
\(1 \times 1\) followed by two consecutive \(3 \times 3\) Convolutions: Captures information with a larger receptive field. This sequence efficiently approximates a \(5 \times 5\) receptive field but uses fewer parameters and introduces added nonlinearity between the two \(3 \times 3\) layers, thereby enhancing the model’s overall expressiveness and discriminative power.
-
\(3 \times 3\) Max Pooling followed by a \(1 \times 1\) Convolution: The pooling provides features with translational invariance, while the final \(1 \times 1\) convolution aggregates contextual information.
The concatenated outputs of these parallel branches yield a rich, multi-scale feature representation.
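The parameter saving claimed for the stacked \(3 \times 3\) branch relative to a single \(5 \times 5\) convolution can be checked with simple arithmetic:

```python
def conv_params(kernel, c_in, c_out, bias=False):
    """Weight count of a 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

C = 64
two_3x3 = 2 * conv_params(3, C, C)   # two stacked 3x3 convs: 5x5 receptive field
one_5x5 = conv_params(5, C, C)
print(two_3x3, one_5x5)              # 73728 vs 102400, a 28% saving
```

The stacked variant also inserts a nonlinearity between the two layers, which the single \(5 \times 5\) convolution cannot provide.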
In convolutional neural networks—particularly after passing through multi-scale modules such as Inception—different channels often capture distinct types of information. Some channels focus on fine edges, others represent textures, some encode global structural patterns, and a few may even contain redundant or noisy responses. Treating all channels as equally informative forces the network to process useful and irrelevant information simultaneously, which ultimately degrades the quality of the learned representation.
The theoretical motivation for introducing channel attention is to provide a feature selection mechanism that adapts to the input content. Instead of relying solely on fixed convolutional responses, channel attention evaluates the global behavior of each channel across the entire image and determines whether it should be emphasized. Channels that contain essential semantic or discriminative cues are enhanced, while unimportant or redundant channels are suppressed. In essence, this mechanism adaptively recalibrates the feature space, allowing the network to focus on the most meaningful components.
A further theoretical rationale lies in the fact that feature channels are not independent. Different visual cues tend to co-occur; for instance, textures often emerge alongside edges, and local details usually depend on global shapes for correct interpretation. Certain channels may even conflict with one another. Without modeling these inter-channel relationships, the network may struggle to integrate multi-dimensional information effectively. Channel attention addresses this issue by learning cross-channel dependencies directly from data, enabling the model to identify which channels should cooperate and which should be down-weighted. This leads to feature representations that are more coherent and better aligned with the underlying visual semantics.
Integrating channel attention after multi-scale fusion is particularly appropriate because multi-scale architectures naturally produce heterogeneous channels, yet not all scales contribute equally to every image or task. Channel attention provides a principled way to distinguish, filter, and highlight the most informative scale-specific features.
Finally, channel attention achieves these benefits with minimal computational overhead while substantially improving the discriminability of the feature representation, making it an efficient and theoretically well-founded enhancement module.
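As one concrete instance of such recalibration, a squeeze-and-excitation-style channel attention can be sketched in plain Python. The paper does not fix the exact CA variant, so the SE design and the bottleneck weights `w1`/`w2` here are illustrative assumptions:

```python
import math

def channel_attention(feats, w1, w2):
    """SE-style channel attention sketch: global average pool per channel
    (squeeze), a bottleneck MLP with ReLU and sigmoid (excitation), then
    per-channel rescaling. `feats` is a list of C channels, each an
    H x W list of lists; w1 (C x C/r) and w2 (C/r x C) are hypothetical
    bottleneck weights."""
    C = len(feats)
    # squeeze: one scalar summary per channel
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feats]
    # excitation: z @ w1 -> ReLU -> @ w2 -> sigmoid gates in (0, 1)
    h = [max(0.0, sum(z[i] * w1[i][j] for i in range(C)))
         for j in range(len(w1[0]))]
    s = [1.0 / (1.0 + math.exp(-sum(h[k] * w2[k][j] for k in range(len(h)))))
         for j in range(C)]
    # recalibrate: emphasize or suppress each channel by its gate
    return [[[v * s[c] for v in row] for row in feats[c]] for c in range(C)]

feats = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]  # C=2 channels
out = channel_attention(feats, w1=[[1.0], [1.0]], w2=[[1.0, 1.0]])
print(out[0][0][0])  # channel 0 scaled by its sigmoid gate
```

Because the gates are computed from global statistics, the rescaling adapts to the content of each input image rather than being fixed at training time.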
The SSM branch begins with Layer Normalization and splits into two sub-paths. The main path performs channel projection using a Linear layer, applies \(\text {DWConv}3 \times 3\) for local feature enhancement, uses SiLU activation, enters the SS2D module for long-range dependency modeling, and finishes with a final LayerNorm. The gating path applies a Linear layer and SiLU to produce a gating vector, which modulates the main path output via element-wise multiplication for precise feature selection. A final Linear layer projects the fused features.
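The gating interaction at the end of the branch can be illustrated with a minimal sketch (the Linear, DWConv, and SS2D stages are elided; only the SiLU gate applied to a pre-projected input is modelled):

```python
import math

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def gated_fuse(main, gate_in):
    """Element-wise gating as in the SSM branch: a SiLU-activated gating
    vector modulates the main path's output, suppressing positions where
    the gate input is strongly negative and passing through positions
    where it is strongly positive."""
    gate = [silu(g) for g in gate_in]
    return [m * g for m, g in zip(main, gate)]

print(gated_fuse([1.0, 1.0, 1.0], [-4.0, 0.0, 4.0]))
# near-zero for negative gate input, ~0 for zero, near-identity for positive
```

This multiplicative gate is what gives the branch its input-dependent feature selection: the same main-path feature can be kept or discarded depending on the gating signal.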
The InceptionMamba Block (Figure 1) is a dual-branch module that splits input channels, processes them via the Inception and SSM branches to capture multi-scale local and global features respectively, and then merges outputs through channel concatenation and channel shuffle to enhance semantic representation.
We formalize the modeling process of the InceptionMamba block for the feature maps. Given a module input \(x \in \mathbb {R}^{H \times W \times C}\) and a module output \(y \in \mathbb {R}^{H \times W \times C}\), let f denote the channel split, so that \((x_1, x_2) = f(x)\).
Next, \(f^{-1}\) and g denote channel concatenation and channel shuffle, respectively, and P denotes the permute operation used to rearrange the feature map to match the convolution layout. The modeling process of the Inception branch can then be defined as \(z_1 = \text {Inception}(P(x_1))\).
Furthermore, the Channel Attention branch recalibrates the Inception output: \(\tilde{z}_1 = \text {CA}(z_1) \otimes z_1\).
Meanwhile, the modeling process of the SSM branch can be defined as \(z_2 = \text {SSM-Branch}(x_2)\).
In summary, the output of the InceptionMamba block can be formulated as \(y = g(f^{-1}(P^{-1}(\tilde{z}_1), z_2))\).
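The channel-level plumbing of the block (split f, concatenation \(f^{-1}\), shuffle g) can be sketched in plain Python, with channels modelled as list entries:

```python
def channel_split(x):
    """f: split the channels into two halves, one per branch."""
    C = len(x)
    return x[: C // 2], x[C // 2 :]

def channel_concat(a, b):
    """f^{-1}: concatenate branch outputs along the channel axis."""
    return a + b

def channel_shuffle(x, groups=2):
    """g: interleave channels across groups (as in ShuffleNet) so that
    Inception-branch and SSM-branch channels mix."""
    C = len(x)
    per = C // groups
    return [x[g * per + i] for i in range(per) for g in range(groups)]

x = list(range(8))                        # 8 channels, labelled 0..7
a, b = channel_split(x)                   # branch inputs: [0..3] and [4..7]
y = channel_shuffle(channel_concat(a, b))
print(y)                                  # [0, 4, 1, 5, 2, 6, 3, 7]
```

After the shuffle, adjacent channels come from different branches, so subsequent layers see local and global features side by side.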
4 Experiments
4.1 Datasets
We adopted 14 publicly available medical image datasets to comprehensively evaluate the effectiveness and potential of InceptionMamba in medical image classification tasks.
PAD-UFES-20 [51]. PAD-UFES-20 contains 2,298 samples from six skin lesion types—BCC, SCC (including Bowen’s disease), ACK, SEK, MEL, and NEV—collected via various smartphones.
Fetal-Planes-DB [4]. This maternal-fetal ultrasound dataset, collected from two hospitals with diverse operators and devices, contains expert-labeled images in six classes: four fetal planes (Abdomen, Brain, Femur, Thorax), maternal cervix, and a general category. Fetal brain images are further divided into three subplanes for fine-grained classification.
CPN X-ray [36, 57]. The public CPN X-ray dataset includes 5,228 chest images labeled as COVID-19, Normal, or Pneumonia, supporting deep learning-based disease classification.
Kvasir [54]. The Kvasir dataset, annotated by expert endoscopists, includes hundreds of GI tract images per class, covering anatomical landmarks, pathological findings, and endoscopic procedures such as lesion removal.
MedMNIST [73, 75]. MedMNIST is a large-scale MNIST-like biomedical image collection with 12 standardized 2D and 6 3D datasets, supporting lightweight classification across diverse tasks and scales. This work uses ten 2D datasets, including PathMNIST, DermaMNIST, OCTMNIST, PneumoniaMNIST, RetinaMNIST, BreastMNIST, BloodMNIST, and OrganMNIST variants.
For non-MedMNIST datasets, we preserved original class distributions and adopted MedMamba’s sample splits for fair comparison [81]. For MedMNIST, official data splits were used without changes.
4.2 Experimental setup
All experiments were conducted on a single NVIDIA GeForce RTX 4090 GPU. For non-MedMNIST datasets, we used the AdamW optimizer with an initial learning rate of 1e-4, weight decay of 1e-4, batch size of 32, and trained for 150 epochs. For MedMNIST datasets, we followed MedMNISTv2 settings and trained InceptionMamba for 100 epochs using AdamW with an initial learning rate of 1e-3, decayed by 0.1 at the 50th and 75th epochs. A batch size of 64 was used. To ensure objective evaluation on raw data, we did not apply any pretrained models or data augmentation.
4.3 Analysis
4.3.1 Performance Analysis
Table 1 presents a unified comparison of InceptionMamba with a variety of reference models on the PAD-UFES-20, Kvasir, and Fetal-Planes-DB datasets. On PAD-UFES-20, InceptionMamba achieves competitive results with the lowest FLOPs (0.8G) and smallest parameter count (7.0M), obtaining 63.3% OA and 0.840 AUC. Compared to MedMamba-T, it improves OA and AUC by 4.5% and 0.032, while reducing FLOPs and parameters by 60% and 51.7%, respectively. Although its OA is 0.2% lower than that of Nest-tiny, InceptionMamba achieves the highest AUC among all methods. On the Kvasir dataset, InceptionMamba again outperforms all baseline models with 83.8% OA and 0.986 AUC, while maintaining the lowest computational cost. It shows a 4.5% OA and 0.01 AUC improvement over MedMamba-S, alongside a 77.1% reduction in FLOPs and 69.3% fewer parameters. For Fetal-Planes-DB, InceptionMamba achieves the best results overall, with 94.6% OA and 0.995 AUC, surpassing MedMamba-B by 0.2% in OA and 0.002 in AUC, while using 89.2% fewer FLOPs and 85.1% fewer parameters. These results consistently demonstrate that InceptionMamba achieves state-of-the-art performance with significantly improved efficiency across diverse datasets.
Table 2 compares AUC and OA of InceptionMamba with other models on MedMNIST subsets. MedMamba-X applies data augmentation. InceptionMamba outperforms ResNet18, ResNet50, and Mamba variants across all datasets. Specifically, it improves OA on PneumoniaMNIST by 0.4% over MedVit-S, and on DermaMNIST, it surpasses MedMamba-T by 2.8% in OA and 4% in AUC, demonstrating superiority in dermoscopic image classification. InceptionMamba achieves the best results on DermaMNIST, PneumoniaMNIST, and BloodMNIST, balancing high accuracy with model efficiency.
In summary, InceptionMamba achieves strong performance on most medical image classification tasks in MedMNIST while remaining lightweight.
4.3.2 Ablation Studies
We performed ablation studies to evaluate the impact of key architectural components. As shown in Table 3, using only the SSM module leads to the lowest performance across all datasets. The SSM module serves as the baseline configuration.
Introducing the Inception convolution and fusing it with the SSM branch significantly boosts performance while substantially reducing model complexity. For instance, on the Kvasir dataset, adding the Inception convolution increases OA by \(3.9\%\), rising from \(78.1\%\) to \(82.0\%\). Similarly, on the PneumoniaMNIST and DermaMNIST datasets, this step yields OA improvements of \(2.6\%\) and \(3.6\%\), respectively. Notably, this improvement is achieved while dramatically reducing FLOPs and parameters by \(1.2\text {G}\) and \(9.7\text {M}\) respectively, dropping the complexity from \(2.0\text {G}/16.2\text {M}\) to \(0.8\text {G}/6.5\text {M}\).
Furthermore, incorporating the Channel Attention (CA) module enhances feature selection and improves model performance without notable computational overhead. The parameters only increase by about \(0.5\text {M}\) with the addition of the CA module. On Kvasir, the CA module further boosts OA to \(83.8\%\); on DermaMNIST, OA reaches \(80.7\%\).
These consistent results demonstrate that our approach successfully achieves effective capture of multi-scale features through the efficient fusion of Inception and SSM, coupled with precise feature filtering via Channel Attention. This method improves model performance while reducing complexity and parameter size, achieving an efficient balance.
To further validate Mamba’s effectiveness in medical image classification, we conducted ablations on the SSM branch. Specifically, we tested removing SS2D and replacing it with a standard SSM. As shown in Table 4, excluding SS2D yields the lowest AUC and OA. Introducing SSM improves both metrics, and incorporating SS2D, as in VMamba, leads to further significant gains. Thus, the Mamba model proves effective for medical image classification.
4.3.3 Visual Analysis
To enhance the interpretability of the InceptionMamba model, we applied the Grad-CAM method to visualize the internal decision-making mechanism of InceptionMamba. The Grad-CAM results are displayed as rainbow-colored maps, where red indicates highly relevant regions, yellow indicates moderately relevant regions, and blue indicates low relevance. In Figure 2, we present the heatmap results generated by InceptionMamba on the PAD-UFES-20 dataset. It is clearly observed that, in most cases, the model accurately focuses on the lesion areas in the images, with lesion regions predominantly shown in red.
We further used t-SNE to visualize features from InceptionMamba and MedMamba (Figure 3). The results show that InceptionMamba’s features have stronger representativeness and discriminability: samples of the same class form clearer clusters, while different classes are more separated. This demonstrates InceptionMamba’s superior ability in capturing inter-class differences and modeling medical images effectively.
4.3.4 Discussion
Previous work, MambaOut [77], demonstrated limited effectiveness of Mamba on image classification, while our results reveal significant advantages in medical image classification. This discrepancy arises from intrinsic differences between natural and medical images.
Natural images contain rich textures, complex edges, and diverse local details that correspond to high-frequency components in the frequency domain, making them visually detailed. Medical images (e.g., MRI, CT, ultrasound) emphasize overall morphology with smoother content, soft boundaries, and dominant low-frequency components, giving a sense of smoothness.
Theoretically, Mamba acts as a low-pass filter, preserving low-frequency information effectively. To quantify this, we propose the “Frequency Centroid” metric, a weighted average frequency based on energy distribution. Our analysis confirms that medical images concentrate energy in low frequencies, whereas natural images emphasize high frequencies, explaining Mamba’s superior performance in medical image classification.
4.4 The frequency-domain response characteristics of state space models (SSMs)
SSM-based models such as S4 and Mamba, grounded in classical continuous systems, excel at long-sequence modeling [19, 21]. They map a one-dimensional input \(x(t) \in \mathbb {R}\) to an output \(y(t) \in \mathbb {R}\) via a hidden state \(h(t) \in \mathbb {R}^N\), governed by a linear ODE [19, 42, 44]:
\(h'(t) = A h(t) + B x(t), \quad y(t) = C^{\top } h(t).\)
Here, \(A \in \mathbb {R}^{N \times N}\) is the state matrix, while \(B, C \in \mathbb {R}^{N \times 1}\) are projection parameters.
To analyze the frequency-domain response characteristics of state space models, we consider a one-dimensional scalar simplified model:
\(h'(t) = -\lambda h(t) + x(t), \quad y(t) = h(t),\)
where \(\lambda > 0\) is the state decay coefficient, which determines the degree to which the state retains past information.
Performing response analysis of the system to a unit impulse input yields the impulse response:
\(h(t) = e^{-\lambda t}\, 1_{t \ge 0},\)
where \(1_{t \ge 0}\) is the unit step function, indicating that the system response takes effect from the input moment. This response describes the system’s memory and decay characteristics in response to an instantaneous impulse signal.
Performing the Fourier transform on the impulse response yields the magnitude of the system’s frequency response:
\(|H(\omega )| = \frac{1}{\sqrt{\lambda ^2 + \omega ^2}}.\)
The function monotonically decreases with frequency \(\omega \), indicating strong suppression of high-frequency components and sensitivity to low frequencies, thus exhibiting typical low-pass filter behavior. This suggests that recursive mechanisms in state space models like Mamba inherently filter out high-frequency signals, making them well-suited for modeling low-frequency dominant data. Yu et al. (2024) also validated Mamba’s superior accuracy in low-frequency prediction by decomposing inputs into frequency bands.
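This low-pass behaviour is easy to check numerically, assuming the scalar system \(h'(t) = -\lambda h(t) + x(t)\), whose magnitude response is \(1/\sqrt{\lambda ^2 + \omega ^2}\):

```python
import math

def ssm_gain(omega, lam=1.0):
    """Magnitude response |H(w)| = 1 / sqrt(lam^2 + w^2) of the scalar SSM
    h'(t) = -lam * h(t) + x(t): a first-order low-pass filter."""
    return 1.0 / math.sqrt(lam * lam + omega * omega)

gains = [ssm_gain(w) for w in (0.0, 1.0, 10.0, 100.0)]
print(gains)  # strictly decreasing: high frequencies are attenuated
```

At \(\omega = 0\) the gain is \(1/\lambda \), and it falls off roughly as \(1/\omega \) for \(\omega \gg \lambda \), which is exactly the attenuation of high-frequency content discussed above.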
In image modeling tasks, this characteristic has significant implications. High-frequency visual information, such as edges, textures, and fine details, are important components for natural image recognition. Due to their frequency response characteristics, state space models have relatively weaker capability in modeling abrupt regions and local high-frequency details in images. In contrast, convolutional neural networks (CNNs) and Transformers are better at encoding such high-frequency information. Medical images such as MRI, CT, and ultrasound are often dominated by low-frequency features like organ boundaries and structural contours. Therefore, we propose that state space models possess a natural structural adaptation advantage in medical image modeling tasks.
4.5 Frequency-Domain Comparison of Natural and Medical Images
We introduce the Frequency Centroid as a core metric to quantitatively analyze image frequency distributions. Based on the 2D Fourier transform, it captures the energy distribution in the frequency domain and reflects the ratio of high- and low-frequency components, enabling frequency comparison across datasets.
Images are converted from spatial to frequency domain via 2D discrete Fourier transform (2D-DFT), producing a complex spectrum F(u, v) representing cosine and sine components at various scales and orientations. The squared magnitude yields the power spectral density (PSD) at each frequency.
PSD quantifies image energy at specific frequencies, intuitively reflecting frequency energy distribution. Low frequencies correspond to large, smooth areas; high frequencies represent details, textures, edges, or noise. High PSD in high-frequency regions indicates rich details, while concentration in low frequencies implies smoother, simpler structures.
To further quantify frequency structure, we define the Frequency Centroid:
\(f_{avg} = \frac{\sum _{u,v} f(u,v)\, P(u,v)}{\sum _{u,v} P(u,v)},\)
where f(u, v) denotes the Euclidean distance between the frequency point (u, v) and the center of the frequency spectrum, representing the magnitude of that frequency component; P(u, v) is the corresponding power spectral density at that point. This metric essentially represents the weighted expectation of frequency magnitudes, with weights given by the energy of each frequency component. Therefore, it can be regarded as the "centroid" or "center of mass" of the image’s energy distribution in the frequency domain.
Images rich in high-frequency components yield higher \(f_{avg}\) values, while those dominated by low-frequency components have lower \(f_{avg}\). Therefore, \(f_{avg}\) reliably reflects image frequency distribution characteristics.
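A minimal implementation of this metric (a naive \(O(N^4)\) DFT, suitable only for tiny demo images; real pipelines would use an FFT) illustrates the behaviour: a constant image has a centroid near zero, while a checkerboard concentrates its energy at the highest frequencies:

```python
import cmath

def frequency_centroid(img):
    """Frequency Centroid f_avg: PSD-weighted mean distance of frequency
    points from the spectrum centre, via a naive 2-D DFT."""
    H, W = len(img), len(img[0])
    num = den = 0.0
    for u in range(H):
        for v in range(W):
            F = sum(img[i][j] * cmath.exp(-2j * cmath.pi * (u * i / H + v * j / W))
                    for i in range(H) for j in range(W))
            du = min(u, H - u)                # distance from the centred
            dv = min(v, W - v)                # spectrum origin
            f = (du * du + dv * dv) ** 0.5
            p = abs(F) ** 2                   # power spectral density
            num += f * p
            den += p
    return num / den

flat = [[1.0] * 8 for _ in range(8)]                          # smooth image
checker = [[(i + j) % 2 * 1.0 for j in range(8)] for i in range(8)]
print(frequency_centroid(flat), frequency_centroid(checker))
# the smooth image scores ~0; the checkerboard scores far higher
```

The checkerboard places half its spectral energy at the Nyquist frequency, pulling the centroid up, which is the same effect that separates texture-rich natural images from smoother medical images.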
All images were resized to 224\(\times \)224 pixels and normalized. The \(f_{avg}\) was computed per image and averaged at the dataset level. A histogram was created to visualize frequency-domain differences between medical and natural images. Using the Frequency Centroid, we compared frequency features of representative medical image subsets from MedMNIST (PathMNIST, ChestMNIST, OCTMNIST, OrganAMNIST, BreastMNIST) with the natural image dataset ImageNet.
Figure 4 presents the results showing that Frequency Centroid values for natural image datasets are significantly higher than those for all medical image subsets. The mean \(f_{avg}\) of ImageNet notably exceeds that of medical subsets, reflecting richer textures and details in natural images and their higher frequency energy distribution. This aligns with the complex, diverse structures in natural scenes.
In contrast, medical image subsets exhibit lower Frequency Centroid values, indicating spectral energy concentrated in low-frequency regions. This reflects the smooth, simple tissue structures typical of medical images, supporting state space models’ natural fit for low-frequency components and explaining Mamba’s superior performance in medical imaging tasks.
Notably, the BreastMNIST dataset exhibits a Frequency Centroid significantly higher than other medical subsets. Correspondingly, Table 2 shows that Mamba underperforms compared to CNN- and Transformer-based models on BreastMNIST. In contrast, OCTMNIST also has a relatively high Frequency Centroid but still demonstrates strong performance with Mamba (Table 2). We attribute this to the inherent sequentiality and structural continuity of OCT images in the spatial domain. As illustrated in Figure 5, OCT images are generated by lateral scanning of eye tissue, which effectively “unfolds” the 3D structure into a continuous tomographic image, line by line. This orderly spatial unfolding creates a sequence-like structure that aligns well with Mamba’s state space scanning approach, enabling it to more effectively capture long-range dependencies across image regions and thereby significantly outperform CNN and Transformer models on this dataset.
In summary, based on the frequency-domain characteristics of state space models and the Frequency Centroid metric, we analyzed frequency distribution differences between medical and natural images. The results show medical images strongly emphasize low frequencies, which aligns with state space models’ sensitivity to low-frequency features. This consistency explains the superior performance of the Mamba model in medical image classification.
5 Conclusion
In this paper, we propose a medical image classification model, InceptionMamba, based on structured state space model (SSM), fully leveraging the advantages of SSM. We extract global features and multi-scale local features in parallel using SSM and Inception, respectively, while further optimizing the semantic representation of features through a channel attention mechanism. InceptionMamba achieves a good balance between efficient modeling and computational resource consumption. We conducted extensive validation on 14 public datasets, and experimental results demonstrate that the model exhibits highly competitive performance on medical image classification tasks.
Data Availability
No datasets were generated or analysed during the current study.
References
Anthimopoulos M, Christodoulidis S, Ebner L et al (2016) Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Trans Med Imaging 35(5):1207–1216
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30
Bisong E (2019) Google automl: cloud vision. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. Springer, p 581–598
Burgos-Artizzu XP, Coronado-Gutiérrez D, Valenzuela-Alcaraz B et al (2020) Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Sci Rep 10(1):10200
Campanella G, Hanna MG, Geneslaw L et al (2019) Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 25(8):1301–1309
Cenedese M, Axås J, Bäuerlein B et al (2022) Data-driven modeling and prediction of non-linearizable dynamics via spectral submanifolds. Nat Commun 13(1):872
Chen J, Lu Y, Yu Q, et al (2021) Transunet: Transformers make strong encoders for medical image segmentation. arXiv:2102.04306
Chen J, Wu P, Zhang X et al (2024) Add-vit: Cnn-transformer hybrid architecture for small data paradigm processing. Neural Process Lett 56(3):198
Chen X, Wang X, Zhang K et al (2022) Recent advances and clinical applications of deep learning in medical image analysis. Med Image Anal 79:102444
Chu X, Tian Z, Wang Y et al (2021) Twins: Revisiting the design of spatial attention in vision transformers. Adv Neural Inf Process Syst 34:9355–9366
Ding M, Xiao B, Codella N, et al (2022) Davit: Dual attention vision transformers. In: European conference on computer vision, Springer, pp 74–92
Ding X, Zhang X, Ma N, et al (2021) Repvgg: Making vgg-style convnets great again. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13733–13742
Dosovitskiy A, Beyer L, Kolesnikov A, et al (2020) An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929
Esteva A, Kuprel B, Novoa RA et al (2017) Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639):115–118
Feurer M, Klein A, Eggensperger K, et al (2015) Efficient and robust automated machine learning. Advances in neural information processing systems 28
Fırat H, Asker ME, Bayındır Mİ et al (2023) Hybrid 3d/2d complete inception module and convolutional neural network for hyperspectral remote sensing image classification. Neural Process Lett 55(2):1087–1130
Fu DY, Dao T, Saab KK, et al (2022) Hungry hungry hippos: Towards language modeling with state space models. arXiv:2212.14052
Gao Y, Zhou M, Metaxas DN (2021) Utnet: a hybrid transformer architecture for medical image segmentation. In: International conference on medical image computing and computer-assisted intervention, Springer, pp 61–71
Gu A, Dao T (2023) Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752
Gu A, Dao T, Ermon S et al (2020) Hippo: Recurrent memory with optimal polynomial projections. Adv Neural Inf Process Syst 33:1474–1487
Gu A, Goel K, Ré C (2021) Efficiently modeling long sequences with structured state spaces. arXiv:2111.00396
Gu J, Wang Z, Kuen J et al (2018) Recent advances in convolutional neural networks. Pattern Recogn 77:354–377
Gulshan V, Peng L, Coram M et al (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama 316(22):2402–2410
Han K, Xiao A, Wu E et al (2021) Transformer in transformer. Adv Neural Inf Process Syst 34:15908–15919
Han K, Wang Y, Chen H et al (2022) A survey on vision transformer. IEEE Trans Pattern Anal Mach Intell 45(1):87–110
Hatamizadeh A, Yin H, Heinrich G, et al (2023) Global context vision transformers. In: International Conference on Machine Learning, PMLR, pp 12633–12646
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hoang DT, Shulman ED, Turakulov R et al (2024) Prediction of dna methylation-based tumor types from histopathology in central nervous system tumors with deep learning. Nat Med 30(7):1952–1961
Hu J, Shen L, Sun G (2018) Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7132–7141
Huang G, Liu Z, Van Der Maaten L, et al (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4700–4708
Huang SC, Pareek A, Jensen M et al (2023) Self-supervised learning for medical image classification: a systematic review and implementation guidelines. NPJ Digital Medicine 6(1):74
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, PMLR, pp 448–456
Jiang J, Xu H, Xu X et al (2023) Transformer-based fused attention combined with cnns for image classification. Neural Process Lett 55(9):11905–11919
Jin H, Song Q, Hu X (2019) Auto-keras: An efficient neural architecture search system. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp 1946–1956
Khan S, Naseer M, Hayat M et al (2022) Transformers in vision: A survey. ACM computing surveys (CSUR) 54(10s):1–41
Kumar S, Shastri S, Mahajan S et al (2022) Litecovidnet: A lightweight deep neural network model for detection of covid-19 using x-ray images. Int J Imaging Syst Technol 32(5):1464–1480
Li Y, Wu CY, Fan H, et al (2022) Mvitv2: Improved multiscale vision transformers for classification and detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4804–4814
Li Z, Jiang J, Chen K et al (2021) Preventing corneal blindness caused by keratitis using artificial intelligence. Nat Commun 12(1):3738
Li Z, Liu F, Yang W et al (2021) A survey of convolutional neural networks: analysis, applications, and prospects. IEEE transactions on neural networks and learning systems 33(12):6999–7019
Litjens G, Kooi T, Bejnordi BE et al (2017) A survey on deep learning in medical image analysis. Med Image Anal 42:60–88
Liu L, Sun H, Li F (2023) A lie group kernel learning method for medical image classification. Pattern Recogn 142:109735
Liu X, Zhang C, Zhang L (2024a) Vision mamba: A comprehensive survey and taxonomy. arXiv:2405.04404
Liu Y, Chen PHC, Krause J et al (2019) How to read articles that use machine learning: users’ guides to the medical literature. JAMA 322(18):1806–1816
Liu Y, Tian Y, Zhao Y et al (2024) Vmamba: Visual state space model. Adv Neural Inf Process Syst 37:103031–103063
Liu Z, Lin Y, Cao Y, et al (2021) Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
Liu Z, Mao H, Wu CY, et al (2022) A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11976–11986
Lundervold AS, Lundervold A (2019) An overview of deep learning in medical imaging focusing on mri. Zeitschrift fuer medizinische Physik 29(2):102–127
Maaz M, Shaker A, Cholakkal H, et al (2022) Edgenext: efficiently amalgamated cnn-transformer architecture for mobile vision applications. In: European conference on computer vision, Springer, pp 3–20
Manzari ON, Ahmadabadi H, Kashiani H et al (2023) Medvit: a robust vision transformer for generalized medical image classification. Comput Biol Med 157:106791
Mehta S, Rastegari M (2022) Separable self-attention for mobile vision transformers. arXiv:2206.02680
Pacheco AG, Lima GR, Salomao AS et al (2020) Pad-ufes-20: A skin lesion dataset composed of patient data and clinical images collected from smartphones. Data Brief 32:106221
Park CW, Seo SW, Kang N, et al (2020) Artificial intelligence in health care: current applications and issues. Journal of Korean medical science 35(42)
Parvaiz A, Khalid MA, Zafar R et al (2023) Vision transformers in medical computer vision–a contemplative retrospection. Eng Appl Artif Intell 122:106126
Pogorelov K, Randel KR, Griwodz C, et al (2017) Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection. In: Proceedings of the 8th ACM on Multimedia Systems Conference, pp 164–169
Raghu M, Zhang C, Kleinberg J, et al (2019) Transfusion: Understanding transfer learning for medical imaging. Advances in neural information processing systems 32
Ruan J, Li J, Xiang S (2024) Vm-unet: Vision mamba unet for medical image segmentation. arXiv:2402.02491
Shastri S, Kansal I, Kumar S et al (2022) Cheximagenet: a novel architecture for accurate classification of covid-19 with chest x-ray digital images using deep convolutional neural networks. Heal Technol 12(1):193–204
Shen D, Wu G, Suk HI (2017) Deep learning in medical image analysis. Annu Rev Biomed Eng 19(1):221–248
Shin HC, Roth HR, Gao M et al (2016) Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging 35(5):1285–1298
Tan M, Le Q (2019) Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning, PMLR, pp 6105–6114
Tan M, Le Q (2021) Efficientnetv2: Smaller models and faster training. In: International conference on machine learning, PMLR, pp 10096–10106
Topol EJ (2019) High-performance medicine: the convergence of human and artificial intelligence. Nat Med 25(1):44–56
Touvron H, Cord M, Douze M, et al (2021a) Training data-efficient image transformers & distillation through attention. In: International conference on machine learning, PMLR, pp 10347–10357
Touvron H, Cord M, Sablayrolles A, et al (2021b) Going deeper with image transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 32–42
Valanarasu JMJ, Patel VM (2022) Unext: Mlp-based rapid medical image segmentation network. In: International conference on medical image computing and computer-assisted intervention, Springer, pp 23–33
Vasu PKA, Gabriel J, Zhu J, et al (2023) Mobileone: An improved one millisecond mobile backbone. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7907–7917
Wang B, Jin S, Yan Q et al (2020) Ai-assisted ct imaging analysis for covid-19 screening: Building and deploying a medical ai system. Appl Soft Comput 98:106897
Wang W, Liang D, Chen Q, et al (2019) Medical image classification using deep learning. In: Deep learning in healthcare: paradigms and applications. Springer, p 33–51
Wang W, Xie E, Li X et al (2022) Pvt v2: Improved baselines with pyramid vision transformer. Computational visual media 8(3):415–424
Wu X, Feng Y, Xu H et al (2023) Ctranscnn: Combining transformer and cnn in multilabel medical image classification. Knowl-Based Syst 281:111030
Xu W, Xu Y, Chang T, et al (2021) Co-scale conv-attentional image transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9981–9990
Yadav SS, Jadhav SM (2019) Deep convolutional neural network based medical image classification for disease diagnosis. Journal of Big data 6(1):1–18
Yang J, Shi R, Ni B (2021) Medmnist classification decathlon: A lightweight automl benchmark for medical image analysis. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), IEEE, pp 191–195
Yang J, Li C, Dai X et al (2022) Focal modulation networks. Adv Neural Inf Process Syst 35:4203–4217
Yang J, Shi R, Wei D et al (2023) Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data 10(1):41
Yu A, Lyu D, Lim SH, et al (2024) Tuning frequency bias of state space models. arXiv:2410.02035
Yu W, Wang X (2025) Mambaout: Do we really need mamba for vision? In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp 4484–4496
Yu W, Luo M, Zhou P, et al (2022) Metaformer is actually what you need for vision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10819–10829
Yu W, Si C, Zhou P et al (2023) Metaformer baselines for vision. IEEE Trans Pattern Anal Mach Intell 46(2):896–912
Tang YX, Tang YB, Peng Y, et al (2020) Automated abnormality classification of chest radiographs using deep convolutional neural networks. NPJ Digital Medicine 3(1)
Yue Y, Li Z (2024) Medmamba: Vision mamba for medical image classification. arXiv:2403.03849
Zhang X, Zhao Z, Wang R et al (2024) A multicenter proof-of-concept study on deep learning-based intraoperative discrimination of primary central nervous system lymphoma. Nat Commun 15(1):3768
Zhang Y, Li X, Chen W et al (2024) Image classification based on low-level feature enhancement and attention mechanism. Neural Process Lett 56(4):217
Zhang Z, Zhang H, Zhao L, et al (2022) Nested hierarchical transformer: Towards accurate, data-efficient and interpretable visual understanding. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 3417–3425
Zhao S, Wu X, Tian K, et al (2025) A transformer-based hierarchical hybrid encoder network for semantic segmentation. Neural Process Lett 57(4):66
Funding
The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.
Author information
Contributions
B.Q.H. conceived the study, designed the methodology, performed the experiments, and wrote the manuscript. Y.L. provided guidance on the research and revised the manuscript. B.T. assisted in conducting part of the experiments and data summarization. G.F. supervised the project and provided guidance on the experimental design and manuscript revision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Huang, B., Liu, Y., Tang, B. et al. InceptionMamba: A Lightweight and Effective Model for Medical Image Classification Revealing Mamba’s Low-Frequency Bias. Neural Process Lett 58, 15 (2026). https://doi.org/10.1007/s11063-025-11823-0