1 Introduction

Medical image classification is a core task in medical image analysis and plays a vital role in computer-aided diagnosis (CAD) systems [41, 52]. With the proliferation of medical imaging technologies, these images are extensively used in disease screening, diagnosis, and treatment monitoring [28, 31, 38, 72, 82]. However, their high dimensionality, complex background, noise, and class imbalance make manual interpretation inefficient and subjective [68].

Deep learning has shown great promise in automating medical image classification by learning discriminative features from large datasets [9, 14, 68]. These methods not only reduce clinicians’ workload but also enable early diagnosis and support intelligent healthcare applications [5, 40, 43, 47, 62, 67, 80]. Among various approaches, CNNs and Vision Transformers (ViTs) dominate medical visual representation learning [1, 13, 40, 53, 59]. Yet, challenges persist due to high inter-class similarity and low intra-class variance in medical images [55]. CNNs are efficient at capturing local patterns but limited in modeling global context due to fixed receptive fields [40, 59]. ViTs, though effective at learning long-range dependencies, suffer from high computational cost and weakened local feature learning [13, 63]. Hybrid CNN-ViT architectures have been proposed, but balancing performance and complexity remains difficult [7, 18, 65].

Recently, Structured State Space Models (SSMs) have emerged as a promising alternative, offering efficient and scalable sequence modeling by combining RNN-like recurrence with convolution-level parallelism [20, 21]. In particular, Mamba enhances long-range modeling through time-varying parameters and hardware-friendly operations, outperforming attention-based models in NLP and genomics with lower complexity [17]. Motivated by this, we propose leveraging Mamba in medical image classification to reduce computational cost while preserving or improving accuracy [17, 19].

Recent work on MambaOut [77] shows that SSM-based models underperform on general image classification due to limited global feature access in short-sequence tasks. However, given medical images’ structural and textural alignment with SSMs’ low- and mid-frequency modeling capabilities, Mamba is inherently better suited for medical image classification [6, 76]. To leverage Mamba’s strength in long-range dependency modeling and enhance local multi-scale feature extraction, we integrate the Inception module and channel attention. These components, widely validated in visual tasks [29, 32], help capture rich textures and adaptively emphasize critical features to improve classification performance.

Motivated by this insight, we introduce InceptionMamba, a lightweight SSM-based medical image classification framework combining the Inception module for local multi-scale feature extraction and Mamba for modeling long-range dependencies [32]. Integrated with a channel attention mechanism to enhance key feature representation, InceptionMamba significantly reduces parameters and FLOPs while maintaining accuracy, making it suitable for resource-limited clinical applications.

The main contributions of this paper can be summarized as follows:

  • Proposing the InceptionMamba model that combines the Inception architecture with Mamba, realizing a lightweight medical image classification approach.

  • Revealing the modeling advantages of Mamba in the medical imaging domain from the frequency response characteristics of SSMs.

  • Extensive experiments on 14 public datasets demonstrate the strong competitiveness of InceptionMamba in medical image classification.

The paper is structured as follows: Section 2 reviews recent progress in applying CNNs, Transformers, and State Space Models (SSMs) to medical image classification. Section 3 introduces the proposed InceptionMamba model and its core methods. Section 4 describes the experimental setup and results, and analyzes the frequency-domain behavior of SSMs. Section 5 concludes the paper and outlines future work.

2 Related Work

Convolutional Neural Networks. CNNs have been widely applied in medical image classification due to their strong capability in spatial feature extraction [40, 58]. Early architectures such as AlexNet laid the foundation, while subsequent models like VGG [14] and ResNet [23] introduced deeper layers and residual connections to enhance feature representation [16]. Transfer learning techniques further improved performance across various tasks, including skin lesion detection and tumor classification, thereby increasing diagnostic accuracy [14, 23, 83].

Vision Transformers. Inspired by the success of Transformers in natural language processing, Vision Transformers (ViTs) treat images as sequences of patches and leverage self-attention mechanisms to model long-range dependencies [8, 13, 35]. This enables ViTs to often outperform CNNs in medical imaging tasks by more effectively capturing global contextual information [2, 33].

Visual State Space Models. State Space Models (SSMs) model sequences using linear recurrent structures, offering a more efficient alternative to CNNs and ViTs. The S4 model [21] addressed early training instability, while Mamba [19] introduced time-varying parameters and hardware-efficient designs, achieving state-of-the-art results in vision tasks. SSMs have demonstrated strong accuracy and efficiency in medical image classification [42, 44, 56, 81, 85].

However, MambaOut revealed limitations of SSMs in image classification, prompting an analysis of their frequency response. Our experimental results show that SSMs model low-frequency signals well but perform poorly on high-frequency components. Medical images are dominated by low-frequency content, whereas natural images contain more high-frequency features. This explains the advantage of SSMs in medical imaging and their limitations in natural image domains (detailed in Sections 4.4 and 4.5).

3 Method

3.1 InceptionMamba architecture

The architecture of InceptionMamba (Figure 1) starts with a Patch Embedding layer that splits the input image \(x \in \mathbb {R}^{H \times W \times 3}\) into non-overlapping \(4\times 4\) patches, linearly projected to a default dimension \(C=96\), yielding \(x' \in \mathbb {R}^{\frac{H}{4} \times \frac{W}{4} \times C}\). The network comprises four stages of InceptionMamba Blocks, maintaining spatial dimensions while extracting features. Each stage ends with a Patch Merging module that halves spatial resolution and doubles channels. The number of blocks per stage is [2, 2, 4, 2], with channel dimensions [C, 2C, 4C, 8C]. A final classifier with adaptive pooling and a fully connected layer produces the prediction.
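The Patch Embedding step above can be sketched in PyTorch as a stride-4 convolution, which is equivalent to splitting the image into non-overlapping \(4\times 4\) patches and linearly projecting each one. This is a minimal sketch; the trailing LayerNorm is an assumption borrowed from common VMamba-style implementations, not confirmed by the text.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """4x4 patchify + linear projection, realized as a stride-4 convolution."""
    def __init__(self, in_ch=3, dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=4, stride=4)
        self.norm = nn.LayerNorm(dim)  # assumed normalization, VMamba-style

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x).permute(0, 2, 3, 1)     # (B, H/4, W/4, C), channels-last
        return self.norm(x)

x = torch.randn(1, 3, 224, 224)
out = PatchEmbedding()(x)                        # spatial size 224/4 = 56
```

The four subsequent stages then operate on this map with depths [2, 2, 4, 2] and channel widths [C, 2C, 4C, 8C], each followed by Patch Merging as described above.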

Fig. 1 The overall architecture of the InceptionMamba model. The InceptionMamba Block is the core module of InceptionMamba

3.2 2D-Selective-Scan for Vision Data (SS2D)

2D-selective-scan (SS2D), originating from VMamba, is one of the core components of InceptionMamba [44]. SS2D consists of three parts: scan unfolding, the S6 module, and scan folding. The scan unfolding operation expands the input image into sequences along four different directions: top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right. This ensures that information from multiple directions is comprehensively captured, allowing the model to extract diverse features while maintaining linear computational complexity. Next, the sequences obtained from the four directional scans are summed and merged to restore the output to the same spatial dimensions as the input. The S6 module, derived from Mamba and built upon the S4 model, introduces a selection mechanism that dynamically adjusts the parameters of the State Space Model (SSM) based on the input. This allows the model to distinguish and retain relevant information while filtering out irrelevant signals. The pseudocode for the S6 module is provided in Algorithm 1.
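The scan unfold/fold steps can be sketched with NumPy index manipulation. This is a simplified sketch: the four directions are approximated here as row-major, column-major, and the reverse of each (matching the four corner-to-corner orders in spirit), and the per-direction S6 processing is replaced by the identity so the merge behavior is easy to see.

```python
import numpy as np

def scan_unfold(x):
    """Unfold an (H, W, C) feature map into four directional 1-D sequences."""
    H, W, C = x.shape
    s1 = x.reshape(H * W, C)                     # top-left -> bottom-right
    s2 = s1[::-1]                                # bottom-right -> top-left
    s3 = x.transpose(1, 0, 2).reshape(H * W, C)  # column-major scan
    s4 = s3[::-1]                                # reversed column-major scan
    return [s1, s2, s3, s4]

def scan_fold(seqs, H, W):
    """Invert each directional scan and sum the four results (scan merge)."""
    C = seqs[0].shape[-1]
    y1 = seqs[0].reshape(H, W, C)
    y2 = seqs[1][::-1].reshape(H, W, C)
    y3 = seqs[2].reshape(W, H, C).transpose(1, 0, 2)
    y4 = seqs[3][::-1].reshape(W, H, C).transpose(1, 0, 2)
    return y1 + y2 + y3 + y4

# With identity "S6" processing, folding returns four summed copies of the input.
x = np.random.rand(4, 4, 2)
merged = scan_fold(scan_unfold(x), 4, 4)
```

In SS2D proper, each of the four sequences passes through an S6 block before `scan_fold`, so each spatial position aggregates state from all four scan directions.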

Algorithm 1 Pseudo-code for the S6 block in SS2D [19, 42, 44]

3.3 InceptionMamba block

To enhance multi-scale feature representation capability and address the limitation of traditional convolutional structures constrained by fixed kernel sizes, this work adopts the Inception module as a core component [32]. The module is strategically designed to capture semantic information across different scales simultaneously. This is crucial as complex visual tasks, particularly in medical imaging, require processing both fine-grained details and large contextual patterns. By integrating parallel \(1 \times 1\) and \(3 \times 3\) convolutions, two consecutive \(3 \times 3\) convolutions, and pooling operations, the Inception module achieves effective multi-scale fusion while maintaining computational efficiency.

Specifically, the Inception module comprises four parallel branches, each contributing uniquely to the feature map:

  • \(1 \times 1\) Convolution: Primarily for feature mapping and dimensionality reduction. This acts as a bottleneck layer to significantly reduce the computational burden and number of parameters for subsequent larger convolutions.

  • \(1 \times 1\) followed by a \(3 \times 3\) Convolution: Extracts mid-scale features.

  • \(1 \times 1\) followed by two consecutive \(3 \times 3\) Convolutions: Captures information with a larger receptive field. This sequence efficiently approximates a \(5 \times 5\) receptive field but uses fewer parameters and introduces added nonlinearity between the two \(3 \times 3\) layers, thereby enhancing the model’s overall expressiveness and discriminative power.

  • \(3 \times 3\) Max Pooling followed by a \(1 \times 1\) Convolution: The pooling provides features with translational invariance, while the final \(1 \times 1\) convolution aggregates contextual information.

The concatenated outputs of these parallel branches yield a rich, multi-scale feature representation.
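The four branches can be sketched in PyTorch as follows. The per-branch channel widths (one quarter of the input each) and the use of max pooling in the fourth branch follow the description above but are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def basic_conv(cin, cout, k):
    """Conv -> BatchNorm -> ReLU; padding keeps the spatial size unchanged."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class InceptionBranch(nn.Module):
    """Four parallel branches whose outputs are concatenated along channels."""
    def __init__(self, c):
        super().__init__()
        b = c // 4  # assumed equal split of output channels
        self.b1 = basic_conv(c, b, 1)                                 # 1x1
        self.b2 = nn.Sequential(basic_conv(c, b, 1),
                                basic_conv(b, b, 3))                  # 1x1 -> 3x3
        self.b3 = nn.Sequential(basic_conv(c, b, 1),
                                basic_conv(b, b, 3),
                                basic_conv(b, b, 3))                  # ~5x5 field
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                basic_conv(c, b, 1))                  # pool -> 1x1

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x),
                          self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 48, 56, 56)
out = InceptionBranch(48)(x)   # four 12-channel branches -> 48 channels
```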

In convolutional neural networks—particularly after passing through multi-scale modules such as Inception—different channels often capture distinct types of information. Some channels focus on fine edges, others represent textures, some encode global structural patterns, and a few may even contain redundant or noisy responses. Treating all channels as equally informative forces the network to process useful and irrelevant information simultaneously, which ultimately degrades the quality of the learned representation.

The theoretical motivation for introducing channel attention is to provide a feature selection mechanism that adapts to the input content. Instead of relying solely on fixed convolutional responses, channel attention evaluates the global behavior of each channel across the entire image and determines whether it should be emphasized. Channels that contain essential semantic or discriminative cues are enhanced, while unimportant or redundant channels are suppressed. In essence, this mechanism adaptively recalibrates the feature space, allowing the network to focus on the most meaningful components.

A further theoretical rationale lies in the fact that feature channels are not independent. Different visual cues tend to co-occur; for instance, textures often emerge alongside edges, and local details usually depend on global shapes for correct interpretation. Certain channels may even conflict with one another. Without modeling these inter-channel relationships, the network may struggle to integrate multi-dimensional information effectively. Channel attention addresses this issue by learning cross-channel dependencies directly from data, enabling the model to identify which channels should cooperate and which should be down-weighted. This leads to feature representations that are more coherent and better aligned with the underlying visual semantics.

Integrating channel attention after multi-scale fusion is particularly appropriate because multi-scale architectures naturally produce heterogeneous channels, yet not all scales contribute equally to every image or task. Channel attention provides a principled way to distinguish, filter, and highlight the most informative scale-specific features.

Finally, channel attention achieves these benefits with minimal computational overhead while substantially improving the discriminability of the feature representation, making it an efficient and theoretically well-founded enhancement module.
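A minimal sketch of such a channel attention module is given below, assuming a squeeze-and-excitation-style design: global average pooling summarizes each channel, a small bottleneck produces per-channel weights, and a sigmoid gates the features. The pooling window and reduction ratio are assumptions; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel recalibration: weight each channel by its global response."""
    def __init__(self, c, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # global behavior per channel
        self.fc = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.fc(self.pool(x))                # (B, C, 1, 1) channel weights in (0, 1)
        return x * w                             # emphasize informative channels

x = torch.randn(2, 48, 14, 14)
out = ChannelAttention(48)(x)
```

The overhead is two 1x1 convolutions on a pooled vector, which is negligible relative to the backbone, consistent with the ~0.5M parameter increase reported later.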

The SSM branch begins with Layer Normalization and splits into two sub-paths. The main path performs channel projection using a Linear layer, applies \(\text {DWConv}3 \times 3\) for local feature enhancement, uses SiLU activation, enters the SS2D module for long-range dependency modeling, and finishes with a final LayerNorm. The gating path applies a Linear layer and SiLU to produce a gating vector, which modulates the main path output via element-wise multiplication for precise feature selection. A final Linear layer projects the fused features.

The InceptionMamba Block (Figure 1) is a dual-branch module that splits input channels, processes them via the Inception and SSM branches to capture multi-scale local and global features respectively, and then merges outputs through channel concatenation and channel shuffle to enhance semantic representation.

We formalize the modeling process of the InceptionMamba block on feature maps. Given a module input \(x \in \mathbb {R}^{H \times W \times C}\) and a module output \(y \in \mathbb {R}^{H \times W \times C}\), we use f to denote the channel split, which gives

$$\begin{aligned} (x_1, x_2) = f(x), \quad x_i \in \mathbb {R}^{H \times W \times \frac{C}{2}}, \quad i = 1, 2 \end{aligned}$$

Next, \(f^{-1}\) and g denote channel concatenation and channel shuffle, respectively. To match the convolution operation, we apply a permute operation to rearrange the original feature map. Based on the above, the modeling process of the Inception branch can be defined as follows:

$$\begin{aligned} \overline{x}_1 \in \mathbb {R}^{H \times W \times \frac{C}{2}} \leftarrow \text {permute}(x_1) \end{aligned}$$
$$\begin{aligned} \text {BasicConv}_{z \times z}(\cdot ) = \text {ReLU}(\text {BatchNorm}(\text {Conv}_{z \times z}(\cdot ))) \end{aligned}$$
$$\begin{aligned} x_1' = \text {BasicConv}_{1 \times 1}(\text {AvgPool}_{3 \times 3}(\overline{x_1})) \end{aligned}$$
$$\begin{aligned} x_1'' = \text {BasicConv}_{1 \times 1}(\overline{x_1}) \end{aligned}$$
$$\begin{aligned} x_1''' = \text {BasicConv}_{3 \times 3}(\text {BasicConv}_{3 \times 3}(\text {BasicConv}_{1 \times 1}(\overline{x_1}))) \end{aligned}$$
$$\begin{aligned} x_1'''' = \text {BasicConv}_{3 \times 3}(\text {BasicConv}_{1 \times 1}(\overline{x_1})) \end{aligned}$$
$$\begin{aligned} \widehat{x_1} = \text {Concat}(x_1', x_1'', x_1''', x_1'''') \end{aligned}$$
$$\begin{aligned} \widetilde{x_1} \in \mathbb {R}^{H \times W \times \frac{C}{2}} \leftarrow \text {Permute}(\widehat{x_1}) \end{aligned}$$

Furthermore, the modeling process of the Channel Attention branch can be defined as follows:

$$\begin{aligned} x_2' = \text {Sigmoid}(\text {Conv}_{1 \times 1}(\text {AvgPool}_{3 \times 3}(\widetilde{x_1}))) \end{aligned}$$
$$\begin{aligned} \widetilde{x_2} = x_2' \otimes \widetilde{x_1} \end{aligned}$$

Meanwhile, the modeling process of SSM-Branch can be defined as follows:

$$\begin{aligned} \overline{x}_3&= \text {LayerNorm}_1(x_2) \\ x_3'&= \text {SiLU}(\text {DWConv}(\text {Linear}(\overline{x}_3))) \\ x_3''&= \text {LayerNorm}_2(\text {SS2D}(x_3')) \\ x_3'''&= \text {SiLU}(\text {Linear}(\overline{x}_3)) \\ \widetilde{x}_3&= \text {Linear}(x_3'' \otimes x_3''') \end{aligned}$$

In summary, the output of the InceptionMamba block can be formulated as follows:

$$\begin{aligned} y = x \oplus g(f^{-1}(\widetilde{x_2}, \widetilde{x_3})) \end{aligned}$$
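The split f, concatenation \(f^{-1}\), and shuffle g around the two branches can be sketched as below, with the branch bodies replaced by identity placeholders. The two-group shuffle interleaves channels from the Inception and SSM halves so information mixes across them.

```python
import torch

def channel_shuffle(x, groups=2):
    """Channel shuffle g: interleave channels across the concatenated groups."""
    B, C, H, W = x.shape
    x = x.view(B, groups, C // groups, H, W)
    return x.transpose(1, 2).reshape(B, C, H, W)

x = torch.randn(1, 8, 4, 4)
x1, x2 = torch.chunk(x, 2, dim=1)                # f: channel split into halves
# ... x1 -> Inception + channel attention, x2 -> SSM branch (identity here) ...
y = channel_shuffle(torch.cat([x1, x2], dim=1))  # f^{-1} then g
```

With two groups, output channel 1 comes from the second group (input channel C/2), so adjacent output channels alternate between the two branches.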

4 Experiments

4.1 Datasets

We adopted 14 publicly available medical image datasets to comprehensively evaluate the effectiveness and potential of InceptionMamba in medical image classification tasks.

PAD-UFES-20 [51]. PAD-UFES-20 contains 2,298 samples from six skin lesion types—BCC, SCC (including Bowen’s disease), ACK, SEK, MEL, and NEV—collected via various smartphones.

Fetal-Planes-DB [4]. This maternal-fetal ultrasound dataset, collected from two hospitals with diverse operators and devices, contains expert-labeled images in six classes: four fetal planes (Abdomen, Brain, Femur, Thorax), maternal cervix, and a general category. Fetal brain images are further divided into three subplanes for fine-grained classification.

CPN X-ray [36, 57]. The public CPN X-ray dataset includes 5,228 chest images labeled as COVID-19, Normal, or Pneumonia, supporting deep learning-based disease classification.

Kvasir [54]. The Kvasir dataset, annotated by expert endoscopists, includes hundreds of GI tract images per class, covering anatomical landmarks, pathological findings, and endoscopic procedures such as lesion removal.

MedMNIST [73, 75]. MedMNIST is a large-scale MNIST-like biomedical image collection with 12 standardized 2D and 6 standardized 3D datasets, supporting lightweight classification across diverse tasks and scales. This work uses ten 2D datasets: PathMNIST, DermaMNIST, OCTMNIST, PneumoniaMNIST, RetinaMNIST, BreastMNIST, BloodMNIST, and the three OrganMNIST variants.

For non-MedMNIST datasets, we preserved original class distributions and adopted MedMamba’s sample splits for fair comparison [81]. For MedMNIST, official data splits were used without changes.

4.2 Experimental setup

All experiments were conducted on a single NVIDIA GeForce RTX 4090 GPU. For non-MedMNIST datasets, we used the AdamW optimizer with an initial learning rate of 1e-4, weight decay of 1e-4, batch size of 32, and trained for 150 epochs. For MedMNIST datasets, we followed MedMNISTv2 settings and trained InceptionMamba for 100 epochs using AdamW with an initial learning rate of 1e-3, decayed by 0.1 at the 50th and 75th epochs. A batch size of 64 was used. To ensure objective evaluation on raw data, we did not apply any pretrained models or data augmentation.
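The MedMNIST schedule described above can be sketched as follows (the model is a placeholder; only the optimizer and learning-rate decay reflect the stated configuration):

```python
import torch

# AdamW, initial lr 1e-3, decayed by 0.1 at epochs 50 and 75, 100 epochs total.
model = torch.nn.Linear(10, 2)  # stand-in for InceptionMamba
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50, 75], gamma=0.1)

for epoch in range(100):
    # ... one training epoch over the official MedMNIST split ...
    scheduler.step()
```

After both milestones the learning rate ends at 1e-5, i.e. two factor-of-10 decays from the initial value.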

4.3 Analysis

4.3.1 Performance Analysis

Table 1 presents a unified comparison of InceptionMamba with a variety of reference models on the PAD-UFES-20, Kvasir, and Fetal-Planes-DB datasets. On PAD-UFES-20, InceptionMamba achieves competitive results with the lowest FLOPs (0.8G) and smallest parameter count (7.0M), obtaining 63.3% OA and 0.840 AUC. Compared to MedMamba-T, it improves OA and AUC by 4.5% and 0.032, while reducing FLOPs and parameters by 60% and 51.7%, respectively. Although its OA is 0.2% lower than that of Nest-tiny, InceptionMamba achieves the highest AUC among all methods.

On the Kvasir dataset, InceptionMamba again outperforms all baseline models with 83.8% OA and 0.986 AUC, while maintaining the lowest computational cost. It shows a 4.5% OA and 0.01 AUC improvement over MedMamba-S, alongside a 77.1% reduction in FLOPs and 69.3% fewer parameters.

For Fetal-Planes-DB, InceptionMamba achieves the best results overall, with 94.6% OA and 0.995 AUC, surpassing MedMamba-B by 0.2% in OA and 0.002 in AUC, while using 89.2% fewer FLOPs and 85.1% fewer parameters. These results consistently demonstrate that InceptionMamba achieves state-of-the-art performance with significantly improved efficiency across diverse datasets.

Table 2 compares AUC and OA of InceptionMamba with other models on MedMNIST subsets. MedMamba-X applies data augmentation. InceptionMamba outperforms ResNet18, ResNet50, and Mamba variants across all datasets. Specifically, it improves OA on PneumoniaMNIST by 0.4% over MedVit-S, and on DermaMNIST, it surpasses MedMamba-T by 2.8% in OA and 4% in AUC, demonstrating superiority in dermoscopic image classification. InceptionMamba achieves the best results on DermaMNIST, PneumoniaMNIST, and BloodMNIST, balancing high accuracy with model efficiency.

In summary, InceptionMamba achieves strong performance on most medical image classification tasks in MedMNIST while remaining lightweight.

Table 1 Performance comparison of InceptionMamba and reference models on PAD-UFES-20, Kvasir, and Fetal-Planes-DB datasets
Table 2 Performance comparison of InceptionMamba and reference models on MedMNIST. Bold font indicates the best values. OA represents the model’s average overall accuracy
Table 3 Performance of InceptionMamba under different component configurations

4.3.2 Ablation Studies

We performed ablation studies to evaluate the impact of key architectural components. As shown in Table 3, using only the SSM module leads to the lowest performance across all datasets. The SSM module serves as the baseline configuration.

Introducing the Inception convolution and fusing it with the SSM branch significantly boosts performance while substantially reducing model complexity. For instance, on the Kvasir dataset, adding the Inception convolution increases OA by \(3.9\%\), rising from \(78.1\%\) to \(82.0\%\). Similarly, on the PneumoniaMNIST and DermaMNIST datasets, this step yields OA improvements of \(2.6\%\) and \(3.6\%\), respectively. Notably, this improvement is achieved while dramatically reducing FLOPs and parameters by \(1.2\text {G}\) and \(9.7\text {M}\) respectively, dropping the complexity from \(2.0\text {G}/16.2\text {M}\) to \(0.8\text {G}/6.5\text {M}\).

Furthermore, incorporating the Channel Attention (CA) module enhances feature selection and improves model performance without notable computational overhead: the parameter count increases by only about \(0.5\text {M}\). On Kvasir, the CA module further boosts OA to \(83.8\%\); on DermaMNIST, OA reaches \(80.7\%\).

These consistent results demonstrate that our approach successfully achieves effective capture of multi-scale features through the efficient fusion of Inception and SSM, coupled with precise feature filtering via Channel Attention. This method improves model performance while reducing complexity and parameter size, achieving an efficient balance.

To further validate Mamba’s effectiveness in medical image classification, we conducted ablations on the SSM branch. Specifically, we tested removing SS2D and replacing it with a standard SSM. As shown in Table 4, excluding SS2D yields the lowest AUC and OA. Introducing SSM improves both metrics, and incorporating SS2D, as in VMamba, leads to further significant gains. Thus, the Mamba model proves effective for medical image classification.

Table 4 Ablation results comparing different variants of the SSM branch

4.3.3 Visual Analysis

To enhance the interpretability of the InceptionMamba model, we applied the Grad-CAM method to visualize the internal decision-making mechanism of InceptionMamba. The Grad-CAM results are displayed as rainbow-colored maps, where red indicates highly relevant regions, yellow indicates moderately relevant regions, and blue indicates low relevance. In Figure 2, we present the heatmap results generated by InceptionMamba on the PAD-UFES-20 dataset. It is clearly observed that, in most cases, the model accurately focuses on the lesion areas in the images, with lesion regions predominantly shown in red.

We further used t-SNE to visualize features from InceptionMamba and MedMamba (Figure 3). The results show that InceptionMamba’s features have stronger representativeness and discriminability: samples of the same class form clearer clusters, while different classes are more separated. This demonstrates InceptionMamba’s superior ability in capturing inter-class differences and modeling medical images effectively.

Fig. 2 Grad-CAM visualizations on PAD-UFES-20

Fig. 3 t-SNE plots on Kvasir

Discussion. Previous work MambaOut [77] demonstrated limited effectiveness of Mamba on general image classification, while our results reveal significant advantages in medical image classification. This discrepancy arises from intrinsic differences between natural and medical images.

Natural images contain rich textures, complex edges, and diverse local details that correspond to high-frequency components in the frequency domain, making them visually detailed. Medical images (e.g., MRI, CT, ultrasound) emphasize overall morphology with smoother content, soft boundaries, and dominant low-frequency components, giving a sense of smoothness.

Theoretically, Mamba acts as a low-pass filter, preserving low-frequency information effectively. To quantify this, we propose the “Frequency Centroid” metric, a weighted average frequency based on energy distribution. Our analysis confirms that medical images concentrate energy in low frequencies, whereas natural images emphasize high frequencies, explaining Mamba’s superior performance in medical image classification.

4.4 The frequency-domain response characteristics of state space models (SSMs)

SSM-based models such as S4 and Mamba, grounded in classical continuous systems, excel in long sequence modeling [19, 21]. They map a one-dimensional input \(x(t) \in \mathbb {R}\) to an output \(y(t) \in \mathbb {R}\) via a hidden state \(h(t) \in \mathbb {R}^N\), governed by a linear ODE [19, 42, 44].

$$\begin{aligned} \begin{aligned} h'(t)&= \textbf{A} h(t) + \textbf{B} x(t) \\ y(t)&= \textbf{C} h(t) \end{aligned} \end{aligned}$$
(1)

Here, \(\textbf{A} \in \mathbb {R}^{N \times N}\) is the state matrix, while \(\textbf{B} \in \mathbb {R}^{N \times 1}\) and \(\textbf{C} \in \mathbb {R}^{1 \times N}\) are the input and output projection parameters, respectively.

To analyze the frequency domain response characteristics of the state space models, we consider a one-dimensional scalar simplified model:

$$\begin{aligned} h'(t) = \lambda h(t) + x(t) \end{aligned}$$
(2)

where \(\lambda \in \mathbb {R}\) is the state decay coefficient (\(\lambda < 0\) for a stable, decaying system), which determines the degree to which the state retains past information.

Analyzing the system’s response to a unit impulse input yields the impulse response:

$$\begin{aligned} h(t) = e^{\lambda t} \cdot \textbf{1}_{t \ge 0} \end{aligned}$$
(3)

where \(\textbf{1}_{t \ge 0}\) is the unit step function, indicating that the system response takes effect from the input moment onward. This response describes the system’s memory of, and decay from, an instantaneous impulse signal.

Applying the Fourier transform to the impulse response yields the magnitude of the system’s frequency response:

$$\begin{aligned} \left| H(j\omega ) \right| = \frac{1}{\sqrt{\omega ^{2} + \lambda ^{2}}} \end{aligned}$$
(4)

The function monotonically decreases with frequency \(\omega \), indicating strong suppression of high-frequency components and sensitivity to low frequencies, thus exhibiting typical low-pass filter behavior. This suggests that recursive mechanisms in state space models like Mamba inherently filter out high-frequency signals, making them well-suited for modeling low-frequency dominant data. Yu et al. (2024) also validated Mamba’s superior accuracy in low-frequency prediction by decomposing inputs into frequency bands.

In image modeling tasks, this characteristic has significant implications. High-frequency visual information, such as edges, textures, and fine details, is an important component of natural image recognition. Due to their frequency response characteristics, state space models have relatively weaker capability in modeling abrupt regions and local high-frequency details in images. In contrast, convolutional neural networks (CNNs) and Transformers are better at encoding such high-frequency information. Medical images such as MRI, CT, and ultrasound are often dominated by low-frequency features like organ boundaries and structural contours. Therefore, we propose that state space models possess a natural structural adaptation advantage in medical image modeling tasks.
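The low-pass behavior predicted by Eq. (4) can be checked numerically: sampling the impulse response \(e^{\lambda t}\) and taking its discrete Fourier transform should reproduce \(1/\sqrt{\omega^2 + \lambda^2}\) at low frequencies, with the magnitude decreasing monotonically in \(\omega\). The specific values of \(\lambda\), the step size, and the sequence length below are illustrative choices.

```python
import numpy as np

# Scalar SSM h'(t) = λ h(t) + x(t) with λ < 0: impulse response e^{λt}, t >= 0.
lam = -2.0
dt, N = 1e-3, 2**16
t = np.arange(N) * dt
h = np.exp(lam * t)                        # sampled impulse response
H = np.fft.rfft(h) * dt                    # numeric approximation of the FT
w = 2 * np.pi * np.fft.rfftfreq(N, dt)     # angular frequency grid
analytic = 1.0 / np.sqrt(w**2 + lam**2)    # Eq. (4): |H(jω)|
```

The numeric magnitude tracks the analytic curve closely at low frequencies and decays with \(\omega\), confirming the low-pass filter interpretation.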

4.5 Frequency-Domain Comparison of Natural and Medical Images

We introduce the Frequency Centroid as a core metric to quantitatively analyze image frequency distributions. Based on the 2D Fourier transform, it captures the energy distribution in the frequency domain and reflects the ratio of high- and low-frequency components, enabling frequency comparison across datasets.

Images are converted from the spatial to the frequency domain via the 2D discrete Fourier transform (2D-DFT), producing a complex spectrum F(u, v) representing cosine and sine components at various scales and orientations. The squared magnitude yields the power spectral density (PSD) at each frequency.

$$\begin{aligned} P(u,v) = |F(u,v)|^{2} \end{aligned}$$
(5)

PSD quantifies image energy at specific frequencies, intuitively reflecting frequency energy distribution. Low frequencies correspond to large, smooth areas; high frequencies represent details, textures, edges, or noise. High PSD in high-frequency regions indicates rich details, while concentration in low frequencies implies smoother, simpler structures.

Fig. 4 Histogram Comparison of Frequency Centroids Across Datasets

To further quantify frequency structure, we define the Frequency Centroid:

$$\begin{aligned} f_{\textrm{avg}} = \frac{\sum _{u,v} f(u,v) \cdot P(u,v)}{\sum _{u,v} P(u,v)} \end{aligned}$$
(6)

where f(u, v) denotes the Euclidean distance between the frequency point (u, v) and the center of the frequency spectrum, representing the magnitude of that frequency component; P(u, v) is the corresponding power spectral density at that point. This metric essentially represents the weighted expectation of frequency magnitudes, with weights given by the energy of each frequency component. Therefore, it can be regarded as the "centroid" or "center of mass" of the image’s energy distribution in the frequency domain.

Images rich in high-frequency components yield higher \(f_{avg}\) values, while those dominated by low-frequency components have lower \(f_{avg}\). Therefore, \(f_{avg}\) reliably reflects image frequency distribution characteristics.

All images were resized to 224\(\times \)224 pixels and normalized. The \(f_{avg}\) was computed per image and averaged at the dataset level. A histogram was created to visualize frequency-domain differences between medical and natural images. Using the Frequency Centroid, we compared frequency features of representative medical image subsets from MedMNIST (PathMNIST, ChestMNIST, OCTMNIST, OrganAMNIST, BreastMNIST) with the natural image dataset ImageNet.
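The per-image computation of Eqs. (5) and (6) can be sketched in NumPy as follows. The two synthetic test images (a smooth window and white noise) are illustrative stand-ins for low-frequency medical and high-frequency natural content, not samples from the evaluated datasets.

```python
import numpy as np

def frequency_centroid(img):
    """Energy-weighted mean radial frequency (Eq. 6) of a 2-D grayscale image."""
    F = np.fft.fftshift(np.fft.fft2(img))      # centered 2D-DFT spectrum
    P = np.abs(F) ** 2                         # power spectral density (Eq. 5)
    H, W = img.shape
    u, v = np.meshgrid(np.arange(H) - H // 2,
                       np.arange(W) - W // 2, indexing="ij")
    f = np.sqrt(u**2 + v**2)                   # distance to spectrum center
    return (f * P).sum() / P.sum()             # PSD-weighted frequency average

rng = np.random.default_rng(0)
smooth = np.outer(np.hanning(224), np.hanning(224))  # smooth, low-frequency image
noisy = rng.standard_normal((224, 224))              # broadband, high-frequency image
```

As expected, the smooth image yields a much lower centroid than the noise image, mirroring the medical-versus-natural contrast reported in Figure 4.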

Fig. 5 Illustration of the OCTMNIST dataset

Figure 4 presents the results showing that Frequency Centroid values for natural image datasets are significantly higher than those for all medical image subsets. The mean \(f_{avg}\) of ImageNet notably exceeds that of medical subsets, reflecting richer textures and details in natural images and their higher frequency energy distribution. This aligns with the complex, diverse structures in natural scenes.

In contrast, medical image subsets exhibit lower Frequency Centroid values, indicating spectral energy concentrated in low-frequency regions. This reflects the smooth, simple tissue structures typical of medical images, supporting state space models’ natural fit for low-frequency components and explaining Mamba’s superior performance in medical imaging tasks.

Notably, the BreastMNIST dataset exhibits a Frequency Centroid significantly higher than other medical subsets. Correspondingly, Table 2 shows that Mamba underperforms compared to CNN- and Transformer-based models on BreastMNIST. In contrast, OCTMNIST also has a relatively high Frequency Centroid but still demonstrates strong performance with Mamba (Table 2). We attribute this to the inherent sequentiality and structural continuity of OCT images in the spatial domain. As illustrated in Figure 5, OCT images are generated by lateral scanning of eye tissue, which effectively “unfolds” the 3D structure into a continuous tomographic image, line by line. This orderly spatial unfolding creates a sequence-like structure that aligns well with Mamba’s state space scanning approach, enabling it to more effectively capture long-range dependencies across image regions and thereby significantly outperform CNN and Transformer models on this dataset.

In summary, based on the frequency-domain characteristics of state space models and the Frequency Centroid metric, we analyzed frequency distribution differences between medical and natural images. The results show medical images strongly emphasize low frequencies, which aligns with state space models’ sensitivity to low-frequency features. This consistency explains the superior performance of the Mamba model in medical image classification.

5 Conclusion

In this paper, we propose InceptionMamba, a medical image classification model based on the structured state space model (SSM) that fully leverages the advantages of SSMs. We extract global features and multi-scale local features in parallel using SSM and Inception branches, respectively, while further optimizing the semantic representation of features through a channel attention mechanism. InceptionMamba achieves a good balance between modeling capability and computational resource consumption. We conducted extensive validation on 14 public datasets, and the experimental results demonstrate that the model exhibits highly competitive performance on medical image classification tasks.