1 Introduction

Colonoscopy is the gold standard for examining the inner lining of the colon and rectum, offering details of natural color and texture essential for detecting abnormalities such as polyps, inflammation, bleeding, and cancerous lesions. Endoscopic imaging more broadly remains unmatched for visualizing hollow organs such as the colon, stomach, and esophagus. However, the procedure poses challenges due to the complexities in controlling the trajectory and viewpoint of the endoscope, including its distance and orientation relative to tissue. Suboptimal viewpoints can hinder lesion detection and diagnosis, while obstructions from anatomical structures or imaging artifacts—such as overexposure, underexposure, and specular highlights—further complicate the examination.

Medical imaging plays a pivotal role in modern healthcare by providing essential visual information for diagnosis and treatment planning. However, illumination artifacts—such as overexposure, underexposure, and specular reflections—can obscure critical details, increasing the risk of misdiagnosis. Traditional enhancement techniques, including histogram equalization, wavelet-based methods, and deep learning approaches, have been developed to address issues like low contrast and poor illumination [6, 13, 22]. While effective, these methods often require expertise in computer science and image processing, limiting their accessibility to clinicians. Moreover, they can be time-consuming and impractical in fast-paced clinical environments where efficiency and accuracy are vital.

To address this, we propose a novel enhancement system that allows clinicians to specify desired image changes via natural language prompts. This simplifies the enhancement process, enabling high-quality imaging without requiring complex software or technical knowledge. To the best of our knowledge, this is the first model specifically tailored for prompt-assisted enhancement in endoscopic imaging, bridging technical innovation and clinical usability.

This work presents the development and implementation of our prompt-assisted enhancement system, emphasizing its capacity to improve image quality and support more accurate diagnoses. Experimental results demonstrate the system’s effectiveness across diverse clinical scenarios, underscoring its value in medical imaging.

The paper is organized as follows: Sect. 2 outlines the motivation, Sect. 3 reviews related work on prompt-assisted image models, Sect. 4 details the datasets, training procedure, and exposure correction, and Sect. 5 reports enhancement results and per-class classification metrics for the trained BERT model.

2 Motivation

Endoscopic imaging is critical in modern diagnostics, particularly for minimally invasive procedures. Visualizing and navigating within hollow organs like the colon and esophagus is essential for effective diagnosis and treatment planning. A major advancement in this field is 3D reconstruction, which offers detailed maps of internal structures to support more precise interventions. However, applying traditional 3D reconstruction techniques to endoscopy faces challenges due to the unique lighting conditions inside the human body.

In domains like robotics and autonomous driving, methods such as Simultaneous Localization and Mapping (SLAM) [18], Density-Based Spatial Clustering (DBSCAN) [12], optical flow [21], and Structure from Motion (SfM) [20] have been used for 3D surface reconstruction. While effective in controlled settings, these techniques struggle under the unpredictable illumination typical of endoscopic environments.

Many 3D reconstruction algorithms, especially in autonomous systems, rely on the brightness constancy assumption—that pixel intensities remain stable across frames. For example, CNN-SLAM uses convolutional neural networks (CNNs) to predict depth maps based on this principle. However, in endoscopy, lighting conditions vary due to factors like camera angle, tissue curvature, and direct reflections, invalidating this assumption. Luo et al. [17] showed that depth predictions can substitute stereo measurements in systems like Stereo Direct Sparse Odometry (DSO), but these approaches still falter under the extreme photometric variations found in endoscopy.
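To make this failure mode concrete, the following sketch (ours, not taken from any cited system) measures how often the brightness constancy assumption is violated between two co-registered frames; an exposure shift of the kind caused by a moving light source breaks the assumption at almost every pixel, while mild sensor noise does not. The threshold value is illustrative.

```python
import numpy as np

def brightness_constancy_violation(frame_a, frame_b, threshold=10.0):
    """Fraction of pixels whose intensity changes by more than `threshold`
    between two aligned grayscale frames (0-255 floats).

    Under the brightness constancy assumption this fraction should stay
    near zero for a static scene; large values flag photometric drift."""
    diff = np.abs(frame_a.astype(np.float64) - frame_b.astype(np.float64))
    return float((diff > threshold).mean())

rng = np.random.default_rng(0)
scene = rng.uniform(60, 180, size=(64, 64))

# Static scene with mild sensor noise: the assumption roughly holds.
noisy = scene + rng.normal(0.0, 1.0, size=scene.shape)

# The same scene after a strong exposure shift, as a light source
# moving closer to the tissue might cause.
overexposed = np.clip(scene * 1.6, 0, 255)

stable_score = brightness_constancy_violation(scene, noisy)
drift_score = brightness_constancy_violation(scene, overexposed)
```

Depth-prediction pipelines that assume `drift_score`-like behavior is rare will systematically misestimate geometry on endoscopic footage, which is precisely why photometric correction is needed upstream.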

These challenges are particularly evident in colonoscopy, where endoscope movement, tissue curvature, and lighting variability lead to overexposed, underexposed, or specular artifacts, as seen in Fig. 1. Such inconsistencies undermine traditional methods that assume stable illumination. Additional complications include moisture, surgical tools, and occlusions, which further reduce the reliability of SLAM-based 3D reconstruction.

To address these limitations, image enhancement has become essential for pre-processing. Our approach employs a prompt-assisted enhancement system powered by a BERT-based model, which interprets natural language prompts like ‘Remove the white dots’ for specularity removal or ‘Fix the underexposed areas’ for shadow correction. The model identifies and enhances problematic regions, correcting major illumination artifacts that would otherwise hinder 3D reconstruction. It targets common issues like underexposure, overexposure, and specular highlights, improving image consistency for SLAM and other computer vision methods.

In parallel, detecting unusable frames is equally important. Severe lighting changes, specular reflections, or distortions may render certain frames unfit for processing. Our system integrates a corrupted frame detection algorithm, such as the one by Axel Vega et al. [7], to flag or discard such frames based on a defined corruption threshold. This ensures that only high-quality data is used, maintaining pipeline integrity—especially critical in clinical scenarios where image quality varies.

Variable lighting and photometric inconsistencies in endoscopy hinder traditional 3D reconstruction methods, which often fail under clinical conditions despite success in controlled environments. Our prompt-assisted enhancement system improves input quality, enabling more accurate and robust reconstructions. Combined with corrupted frame detection, it ensures only reliable frames are processed. This approach addresses multiple image artifacts while maintaining real-time applicability in endoscopic procedures.

3 State-of-the-Art

Prompt-assisted image enhancement is a relatively recent development in computer vision, drawing inspiration from advances in natural language processing (NLP). Foundational work by Li and Liang [15] and Lester et al. [14] demonstrated how prompts can guide model behavior. In image enhancement, typical methods employ text embeddings or text-to-image generative models. For instance, Kawar et al. [11] introduced Imagic, which applies diffusion models to text-based image editing, while Brooks et al. [4] developed InstructPix2Pix, which combines GPT-3 with Stable Diffusion to perform image transformations based on natural language instructions.

Other notable approaches improve neural architectures for specialized tasks, particularly in medical imaging. Fischer, Alexander, and Yang [8] proposed a parameter-efficient segmentation method where continuous prompt tokens, prepended to the input and optimized via gradient descent, guide a frozen backbone to perform specific tasks without modifying the core model—enabling flexibility in adapting to new segmentation problems.

To our knowledge, no current models directly address prompt-based enhancement for endoscopic images. The most relevant work is InstructIR by Conde et al. [5], which serves as a primary reference for this study. InstructIR uses a GPT-4 language model to interpret prompts and perform image enhancement across seven degradation types: denoising, deblurring, dehazing, deraining, super-resolution, low-light enhancement, and general image improvement, using a dataset of 10,000 unique prompts.

The model employs NAFNet as the image backbone, featuring a 4-level encoder-decoder design. The encoder includes block configurations (2, 2, 4, 8) across levels 1 to 4, while the decoder uses (2, 2, 2, 2). Four middle blocks are inserted between encoder and decoder to enhance feature extraction. Skip connections use addition instead of concatenation. Task routing is managed by the Instruction Condition Block (ICB) embedded in both encoder and decoder components.
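The block layout described above can be summarized in a small configuration sketch; the key names here are ours for illustration, not identifiers from the InstructIR code base.

```python
# Illustrative summary of the NAFNet backbone configuration used by
# InstructIR, as described in the text; dictionary keys are ours.
nafnet_cfg = {
    "encoder_blocks": [2, 2, 4, 8],   # levels 1-4 of the encoder
    "decoder_blocks": [2, 2, 2, 2],   # levels 1-4 of the decoder
    "middle_blocks": 4,               # inserted between encoder and decoder
    "skip_connection": "add",         # addition instead of concatenation
    "task_routing": "ICB",            # Instruction Condition Block
}

total_blocks = (
    sum(nafnet_cfg["encoder_blocks"])
    + sum(nafnet_cfg["decoder_blocks"])
    + nafnet_cfg["middle_blocks"]
)
```

Additive skip connections keep channel counts constant across the encoder-decoder boundary, which is one reason the design stays parameter-efficient.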

Fig. 1.
Four side-by-side endoscopic images of a red and pink tissue. The first image is labeled "Original," showing the tissue with natural lighting. The second, labeled "Underexposed," appears darker. The third, "Brightness Correction," shows enhanced lighting. The fourth, "Specularity Removal," reduces glare on the tissue surface.

InstructIR results for underexposure, overexposure, and specular reflection correction.

InstructIR shows limitations in medical imaging. As seen in Fig. 1, it struggles with specularity removal and loses detail in bright regions even after brightness correction. To address this, we implemented Endo-LMSPEC [9] and Endo-STTN [6]. The lack of a prompt dataset covering exposure and specularity made GPT-4 fine-tuning impractical, so we used a BERT model, which is better suited to this closed-set classification task.

Fig. 2.
A flow chart illustrating a medical diagnostic process. On the left, a doctor performs an endoscopy on a patient. The process begins with "Gastroesophageal endoscopy" leading to a decision point: "H. Pylori detected" or "Not detected." The "Classifier" step follows, leading to two pathways: "Endo-AID" and "ENDO-SIM." Each pathway shows an endoscopic image with annotations: "Gastric cancer segment" and "Non-cancerous segment." The chart visually represents the classification and diagnosis process in gastroenterology.

Pipeline of the proposed prompt-assisted image enhancement model.

Prompt-driven generative models are increasingly relevant in medical imaging. González-González et al. [19] proposed a diffusion-based method for biomedical image translation guided by natural language. Although centered on microscopy, their work highlights the expanding use of language-guided diffusion in clinical imaging and supports developing prompt-based enhancement methods for specific modalities, as done here for colonoscopy.

4 Materials and Methods

The complete model is implemented in Python, integrating pre-trained models (Endo-LMSPEC and Endo-STTN) dynamically based on user prompts. As seen in Fig. 2, the system uses BERT to process the prompts that guide enhancement, while Endo-LMSPEC and Endo-STTN operate on image data only. The system is designed for real-time use in endoscopic procedures. The workflow is as follows:

  1. Upload and Preprocessing: User uploads images; the system clears previous files, resizes images, and generates specular masks.

  2. Prompt Interpretation: User provides a prompt, and a BERT model classifies it.

  3. Enhancement Selection: Based on the prompt, the system selects enhancement methods (specularity removal, exposure correction, or both).

  4. Enhancement Application: The selected models are applied to the images.

  5. Comparison and Validation: Original and enhanced images are displayed for user validation.
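The dispatch logic of steps 2–4 can be sketched as follows. Here `classify_prompt` is a keyword-based stand-in for the fine-tuned BERT classifier of Sect. 4.3, and the function names, class labels, and keyword lists are ours for illustration, not the system's actual API.

```python
# Six enhancement classes, mirroring the prompt dataset of Sect. 4.4.
CLASSES = (
    "specularity",
    "underexposure",
    "overexposure",
    "specularity+underexposure",
    "specularity+overexposure",
    "overexposure+underexposure",
)

def classify_prompt(prompt: str) -> str:
    """Keyword stand-in for the BERT classifier (step 2)."""
    p = prompt.lower()
    spec = any(k in p for k in ("white dots", "specular", "glare"))
    under = any(k in p for k in ("underexposed", "dark", "shadow"))
    over = any(k in p for k in ("overexposed", "too bright", "washed out"))
    if spec and under:
        return "specularity+underexposure"
    if spec and over:
        return "specularity+overexposure"
    if under and over:
        return "overexposure+underexposure"
    if spec:
        return "specularity"
    if under:
        return "underexposure"
    return "overexposure"

def select_enhancers(label: str) -> list:
    """Map a predicted class to the models applied in step 4."""
    models = []
    if "specularity" in label:
        models.append("Endo-STTN")    # specularity removal
    if "underexposure" in label or "overexposure" in label:
        models.append("Endo-LMSPEC")  # exposure correction
    return models
```

A real deployment would replace `classify_prompt` with a forward pass through the fine-tuned BERT model, but the routing to Endo-STTN and Endo-LMSPEC stays the same.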

4.1 Exposure Correction

Endo-LMSPEC corrects exposure artifacts using a modified LMSPEC model that handles different exposure conditions as input: underexposed, overexposed, or both (see Fig. 3). It integrates Laplacian pyramid decomposition with U-Net sub-networks and GANs for multi-scale illumination correction. The general flow can be seen in Fig. 4, and the key steps include:

  1. Patch Extraction: Images are divided into patches based on intensity and gradient thresholds:

     $$ P_i = \{ I(x, y) \mid \text{threshold}(I, \nabla I) \}. $$

  2. Multi-Scale Decomposition: Patches are decomposed using a Laplacian pyramid:

     $$ L(P_i) = \{ P_i^1, P_i^2, \dots, P_i^L \}. $$

  3. U-Net Sub-Networks: Four U-Net-like sub-networks process the pyramid levels:

     $$ O_i^l = \text{U-Net}(P_i^l). $$

  4. Discriminator (GAN): A discriminator classifies patches as real or fake:

     $$ D(O_i) \in \{ 0, 1 \}. $$

Loss Functions. The total loss is a weighted combination of:

$$\begin{aligned} \mathcal{L}_{\text{total}} = \alpha \mathcal{L}_{\text{pyr}} + \beta \mathcal{L}_{\text{rec}} + \gamma \mathcal{L}_{\text{SSIM}} + \delta \mathcal{L}_{\text{adv}}, \end{aligned}$$
(1)

where:

$$\begin{aligned} \mathcal{L}_{\text{pyr}} = \sum_{l=1}^{L} \Vert L(O_i^l) - G(P_{\text{GT}}^l) \Vert_2^2, \end{aligned}$$
(2)
$$\begin{aligned} \mathcal{L}_{\text{rec}} = \Vert O_i - P_{\text{GT}} \Vert_1, \end{aligned}$$
(3)
$$\begin{aligned} \mathcal{L}_{\text{SSIM}} = 1 - \text{SSIM}(O_i, P_{\text{GT}}), \end{aligned}$$
(4)
$$\begin{aligned} \mathcal{L}_{\text{adv}} = - \log D(O_i). \end{aligned}$$
(5)

In the above equations, \(O_i\) denotes the output generated by the model for the i-th sample. \(P_{\text {GT}}\) represents the corresponding ground truth target (e.g., the reference image, frame, or patch of the dataset), and \(P_{\text {GT}}^l\) is its down-sampled version at the l-th level of the image pyramid. The SSIM term, \(\textrm{SSIM}(O_i, P_{\text {GT}})\), measures the structural similarity between the generated output and the ground truth.
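As a concrete illustration, the weighted combination in Eqs. (1)–(5) can be sketched in NumPy. The 2×2 average pooling, the single-window SSIM, and the fixed discriminator score `d_score` are deliberate simplifications of the actual Gaussian/Laplacian pyramids and trained discriminator; the weights are illustrative.

```python
import numpy as np

def ssim_global(x, y, c1=6.5025, c2=58.5225):
    """Single-window SSIM over the whole patch (no sliding window);
    the constants assume a 255 dynamic range. Illustration only."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2))

def downsample(img):
    """2x2 average pooling, a stand-in for pyramid reduction."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def total_loss(output, gt, levels=3,
               alpha=1.0, beta=1.0, gamma=1.0, delta=1.0, d_score=0.9):
    """Weighted combination from Eq. (1). The pyramid term compares
    down-sampled copies directly (simplifying Eq. (2)); the trained
    discriminator of Eq. (5) is replaced by a fixed score d_score."""
    l_pyr, o, g = 0.0, output, gt
    for _ in range(levels):
        l_pyr += np.sum((o - g) ** 2)       # Eq. (2), simplified
        o, g = downsample(o), downsample(g)
    l_rec = np.sum(np.abs(output - gt))     # Eq. (3): L1 reconstruction
    l_ssim = 1.0 - ssim_global(output, gt)  # Eq. (4)
    l_adv = -np.log(d_score)                # Eq. (5)
    return alpha * l_pyr + beta * l_rec + gamma * l_ssim + delta * l_adv
```

For identical output and ground truth, only the adversarial term survives, which matches the intuition that a perfect reconstruction is penalized only insofar as the discriminator remains unconvinced.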

Fig. 3.
Flow chart illustrating a process for image analysis in endoscopy. The input, labeled "Endo4IE," is divided into training splits: underexposed, overexposed, and both. Patch extraction follows, separating corrupted and normal frames. Pre-processing is indicated with a gear icon. Frequency decomposition is shown with two pyramids: a Laplacian pyramid for corrupted patches and a Gaussian pyramid for normal patches, with dimensions 128x128 and 256x256 noted.

Overall input, preprocessing, and input for Endo-LMSPEC.

Fig. 4.
Flow chart illustrating a two-phase training process. On the left, a Laplacian Pyramid with corrupted patches is shown, labeled with levels l_1 to l_4 . Arrows indicate the flow to two adaptation phases. The first phase involves a series of operations leading to "First weights." The second phase, labeled "Transfer Learning," leads to a "Final model." An enhanced frame image is shown at the end. The dimensions 128x128 and 256x256 are noted between the pyramid and adaptation phases.

Two-phase training process.

4.2 Specularity Removal

To adapt STTN for specularity removal in endoscopy, we modified Zeng et al.’s method [23]. As shown in Fig. 5, following Daher et al. [6], the system segments specularities, relocates them to create pseudo ground truth, and trains STTN on Hyper-Kvasir videos through embedding, matching, and attending stages. A temporal GAN, initialized by training with random masks, provides an adversarial loss that sharpens the inpainted frames.

  1. Pseudo Ground Truth Generation: Specularity masks are generated using the Dichromatic Reflection Model (DRM) and translated to create pseudo ground truth.

  2. Training: The model is trained using a temporal GAN with loss functions:

     $$\begin{aligned} L = \lambda_{\text{hole}} \cdot L_{\text{hole}} + \lambda_{\text{valid}} \cdot L_{\text{valid}} + \lambda_{\text{adv}} \cdot L_{\text{adv}}, \end{aligned}$$
     (6)

     where:

     $$\begin{aligned} L_{\text{hole}} = \frac{\Vert M^T \odot (Y^T - \hat{Y}^T) \Vert_1}{\Vert M^T \Vert_1}, \end{aligned}$$
     (7)
     $$\begin{aligned} L_{\text{valid}} = \frac{\Vert (1 - M^T) \odot (Y^T - \hat{Y}^T) \Vert_1}{\Vert (1 - M^T) \Vert_1}, \end{aligned}$$
     (8)
     $$\begin{aligned} L_{\text{adv}} = -\mathbb{E}_{z \sim P_{Y_1}(z)}[D(z)]. \end{aligned}$$
     (9)
Fig. 5.
Diagram illustrating a flow chart for a neural network model. The process begins with "Hyper Kvasir" and "Pseudo Masks" inputs, which are processed through an "Encoder." The encoded data is divided into three components labeled Q, K, and V, each passing through a 1x1 convolution layer. These components are combined in a "Head" section, producing outputs denoted as alpha_{i,j} and o_i . The outputs are further processed through a 3x3 convolution layer and a "Decoder," resulting in the final "Inpainted" output. The diagram is layered, indicating multiple iterations or layers of processing.

General pipeline of Endo-STTN [6].
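The masked L1 terms of Eqs. (7)–(8) reduce to per-region averages of the absolute error, as the following NumPy sketch shows; the 4×4 frame and mask are toy values of our own, not dataset frames.

```python
import numpy as np

def hole_valid_losses(y, y_hat, mask):
    """Masked L1 terms from Eqs. (7)-(8): `mask` is 1 on specular
    (hole) pixels and 0 elsewhere; each term is normalised by the
    number of pixels it covers."""
    diff = np.abs(y - y_hat)
    l_hole = (mask * diff).sum() / mask.sum()
    l_valid = ((1 - mask) * diff).sum() / (1 - mask).sum()
    return l_hole, l_valid

# Toy example: the inpainter is perfect outside the hole but off by 8
# intensity levels inside it.
y = np.full((4, 4), 100.0)
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0              # 4 hole pixels
y_hat = y.copy()
y_hat[:2, :2] -= 8.0            # error confined to the hole
l_hole, l_valid = hole_valid_losses(y, y_hat, mask)
```

Weighting `l_hole` more heavily than `l_valid` (via the lambdas in Eq. (6)) focuses training on the inpainted specular regions while still discouraging the network from altering already-valid tissue.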

4.3 BERT-Based Sequence Classification

To automatically categorize user instructions into corresponding enhancement operations, we fine-tuned a BERT-based sequence classification model. The task was framed as a six-class classification problem, where each prompt is mapped to one of the defined enhancement actions: specularity removal, underexposure correction, overexposure correction, and their pairwise combinations.

The model is initialized with pre-trained weights from the bert-base-uncased checkpoint and fine-tuned end-to-end using a linear classification head appended to the [CLS] token representation. The final layer outputs a six-dimensional logit vector, which a softmax converts into a probability distribution over the enhancement classes.

The model is optimized using the categorical cross-entropy loss function:

$$ L_{\text {CE}} = - \sum _{i=1}^{N} y_i \log (\hat{y}_i), $$

where \( y_i \) is the ground truth one-hot encoded label, and \( \hat{y}_i \) is the softmax probability assigned to class \( i \). This loss is appropriate for multi-class classification tasks where each input belongs to exactly one category.
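For a single prompt, the one-hot sum collapses to the negative log-probability of the true class. A minimal NumPy sketch (illustrative values, not model outputs):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, true_class):
    """Categorical cross-entropy for one prompt: with a one-hot
    target, L_CE reduces to -log p(true class)."""
    probs = softmax(logits)
    return -np.log(probs[true_class])

# A confident, correct 6-class prediction incurs a small loss;
# a uniform (uninformative) prediction incurs -log(1/6) = log 6.
confident = cross_entropy(np.array([5.0, 0, 0, 0, 0, 0]), 0)
uniform = cross_entropy(np.zeros(6), 2)
```

Because the loss grows without bound as the true-class probability approaches zero, confidently wrong classifications are penalized far more than merely uncertain ones.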

Training was conducted using the Adam optimizer with a learning rate of \(1 \times 10^{-5}\), batch size of 32, and for a total of 15 epochs. Early stopping was applied based on validation F1-score to avoid overfitting. The dataset was split into 70% training, 15% validation, and 15% test partitions. Input prompts were truncated or padded to a maximum sequence length of 64 tokens.

Performance was evaluated using precision, recall, and F1-score for each class, as well as overall accuracy. The model achieved a macro-averaged F1-score of 0.89, confirming its ability to generalize across different phrasings of user intent. Detailed results per class are shown in Table 3.

4.4 Used Datasets

We use three datasets to train our system: (1) Endo4IE [10] for the image enhancement block, (2) a Structure from Motion (SfM)-based dataset for depth prediction using real colonoscopic videos, and (3) a custom prompt dataset for training the BERT model to classify and interpret enhancement prompts. Each dataset is essential to developing and evaluating its corresponding model component.

1) Endo4IE. To train the Endo-LMSPEC model, we used the Endo4IE dataset by Garcia-Vega et al. [10], created from frames extracted from EAD [2], EDD [1], and HyperKvasir [3]. Using CycleGAN [16], synthetic overexposed and underexposed versions of original endoscopic images were generated. The dataset includes: (1) 2,216 unmodified ground truth frames, (2) 1,231 overexposed frames, and (3) 985 underexposed frames. It was split into 70% training, 27% validation, and 3% test sets.

2) Prompt-Based Dataset for BERT Training. To train our BERT model for prompt classification, we curated a custom dataset of 1,150 prompts, each assigned to one of six key image enhancement classes. As no existing dataset addressed prompt-based enhancement, we ensured a balanced distribution across all classes (see Table 1). The most frequent classes—‘specularity removal with underexposure correction’ and ‘specularity removal with overexposure correction’—contain 227 and 207 prompts, respectively, allowing the model to distinguish between single and combined enhancement types. Table 2 shows example prompts given to the model and the target enhancement each maps to.

Table 1. Distribution of enhancement classes in the prompt dataset.
Table 2. Example prompts and their corresponding enhancement classes in the dataset.

Each prompt was carefully mapped to its corresponding enhancement class, allowing BERT to predict the correct technique from textual input. The dataset was divided into training and test sets, ensuring balanced representation across all six classes.
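A per-class shuffle-and-slice keeps the split balanced. This sketch assumes the dataset is a list of (prompt, label) pairs; the test fraction, seed, and function name are illustrative rather than the exact values and code used.

```python
import random
from collections import defaultdict

def stratified_split(examples, test_frac=0.15, seed=42):
    """Split (prompt, label) pairs so every class keeps the same
    train/test proportion. Illustrative stand-in for our actual
    dataset preparation."""
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[1]].append(ex)
    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        # round half up, but reserve at least one test example per class
        n_test = max(1, int(len(items) * test_frac + 0.5))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# 6 classes x 10 prompts each -> 2 test prompts per class.
examples = [(f"prompt {i} for class {c}", c)
            for c in range(6) for i in range(10)]
train, test = stratified_split(examples)
```

Stratifying matters here because the six classes are close to, but not exactly, uniform in size; a naive global shuffle could leave a rare combined class underrepresented in the test set.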

5 Results and Discussion

5.1 BERT Classification Results

Table 3 summarizes BERT’s performance on various image enhancement tasks prompted via natural language. The model achieved an overall accuracy of 0.88, demonstrating strong classification capabilities, though performance varied by class.

On the test set, the model performed best on specularity removal (F1-score: 0.93, precision: 0.94, recall: 0.91) and overexposure correction (F1-score: 0.92, precision: 0.90, recall: 0.95). Underexposure was more challenging (F1-score: 0.84), with high precision (0.95) but lower recall (0.75), indicating missed cases.

In combined classes, ‘Specularity + Underexposure’ achieved an F1-score of 0.85 (precision: 0.76, recall: 0.96), while ‘Specularity + Overexposure’ scored 0.83 (precision: 0.94, recall: 0.74), suggesting class overlap affected recall. Additional training data may improve these cases.

The best results were for ‘Overexposure + Underexposure’, with an F1-score of 0.96, perfect recall (1.00), and precision of 0.92, likely due to the clear visual characteristics of this combination.

Table 3. BERT classification results per enhancement class. Overall accuracy: 0.88. Macro average—Precision: 0.90, Recall: 0.89, F1: 0.89. Weighted average—Precision: 0.89, Recall: 0.88, F1: 0.88.
Table 4. Quantitative results for enhancement methods across exposure, specularities, and both. Lower MSE and higher SSIM are better. Endo-LMSPEC achieved the lowest MSE for exposure (266.16 vs. 900.43 for LMSPEC and 850.65 for STTN), and the combined model performed best for both artifacts (422.26). SSIM scores also favored our models, indicating better image quality.

The model’s overall weighted precision, recall, and F1-score are 0.89, 0.88, and 0.88, respectively, indicating consistent performance across classes. Similar macro average values confirm that the model handles each class reliably, without major imbalances.

5.2 Enhancement Results

We evaluated the performance of the proposed enhancement models using three standard image quality metrics: Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM). These metrics were computed over the test split of the dataset, and results are reported separately for three types of photometric corruption: exposure errors, specularities, and frames affected by both simultaneously.
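For reference, the two pixel-level metrics can be computed as below (assuming 8-bit frames with a peak value of 255); note that the exposure-correction MSE of 266.16 in Table 4 corresponds to the 23.88 dB reported in Table 5, which is a useful cross-check on both tables.

```python
import numpy as np

def mse(ref, img):
    """Mean squared error between two frames of equal shape."""
    return float(np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2))

def psnr(ref, img, max_val=255.0):
    """PSNR in dB for 8-bit frames; higher is better,
    infinite for identical frames."""
    err = mse(ref, img)
    if err == 0:
        return float("inf")
    return 10.0 * np.log10(max_val**2 / err)

# Cross-check against the reported results: MSE 266.16 on 8-bit
# frames corresponds to roughly 23.88 dB.
exposure_psnr = 10.0 * np.log10(255.0**2 / 266.16)
```

SSIM, the third metric, additionally compares local luminance, contrast, and structure, which is why it is reported alongside these purely pixel-wise measures.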

Table 4 presents the MSE and SSIM scores. Lower MSE values indicate smaller pixel-level differences from the ground truth, while higher SSIM scores reflect improved perceptual quality and structural preservation. Among all methods, Endo-LMSPEC achieved the lowest MSE for exposure correction (266.16) and the highest SSIM (0.811), significantly outperforming classical LMSPEC (MSE: 900.43, SSIM: 0.700). For frames corrupted by specularities, the best performance was achieved by the combined model (Endo-LMSPEC + Endo-STTN), which obtained the highest SSIM (0.834) and the lowest MSE (401.57). A similar trend is observed in the mixed artifact scenario, where the combined model again yielded the best SSIM (0.817) and lowest error.

Table 5. PSNR results for enhancement methods across exposure, specularities, and both artifact types. Higher is better.
Fig. 6.
A grid of endoscopic images showing different processing techniques. The columns are labeled: "Regular frame," "Corrupted frame," "Endo-LMSPEC," "Endo-STTN," and "Endo-LMSPEC + Endo-STTN." The rows are labeled: "Overexposure + Specularities," "Specularities," and "Underexposure." Each image displays variations in lighting and clarity, demonstrating the effects of different processing methods on endoscopic visuals.

Qualitative comparison of enhancement methods across various photometric artifacts. Each row represents a different type of degradation: overexposure with specularities (top), isolated specularities (middle), and underexposure (bottom). From left to right: reference (regular) frame, corrupted input, Endo-LMSPEC output, Endo-STTN output, and the combined Endo-LMSPEC + Endo-STTN output. The combined method shows improved robustness in restoring contrast, reducing artifacts, and preserving anatomical detail across all scenarios.

Table 5 summarizes the PSNR results, where higher values correspond to better reconstruction quality. Endo-LMSPEC performed best on pure exposure correction (23.88 dB), whereas Endo-STTN was particularly effective at handling isolated specularities (22.64 dB). Notably, the combined model achieved the best overall PSNR for frames exhibiting both types of artifacts (23.60 dB), confirming its capacity to generalize across complex photometric conditions.

These quantitative findings are supported by the qualitative comparisons shown in Fig. 6. The top row depicts frames with overexposure and specular reflections; the middle row shows isolated specularities; and the bottom row displays underexposed scenes. In each case, the combined model demonstrates improved contrast restoration, reduced artifact visibility, and better anatomical continuity compared to individual methods. Notably, it mitigates over-smoothing and retains realistic color tones across all scenarios, illustrating the complementary strengths of Endo-LMSPEC and Endo-STTN when integrated.

Together, these results confirm that combining structural and temporal enhancement strategies yields a more robust and generalizable solution for photometric artifact correction in endoscopic imaging.

6 Conclusion and Future Work

This study highlights the effectiveness of customized image pre-processing to improve camera trajectory reconstruction in 3D colonoscopy using RNN-SLAM. By applying localized corrections to under- and overexposed regions, rather than global gamma adjustments, the Endo-LMSPEC model significantly enhances trajectory estimation and reconstruction quality. It proves more effective in mitigating the illumination artifacts and specular reflections common in colonoscopic imaging.

A key contribution is the introduction of a prompt-driven, spatially-aware enhancement framework. Leveraging a BERT-based model, clinicians can guide enhancement through natural language prompts. Unlike systems such as InstructIR, which apply global changes, our method enables fine-grained, context-aware corrections aligned with clinical needs. This human-in-the-loop approach supports real-time, intuitive interaction without manual annotations, advancing AI integration in clinical workflows.

Future work will focus on incorporating outlier rejection into 3D point cloud generation to improve mesh fidelity, and evaluating robustness under motion blur and camera-induced distortions. We also plan to expand comparisons with emerging prompt-based models—such as MedSegDiff, Med-PaLM, and InstructIR—to further contextualize our approach within current advancements in medical image enhancement.