SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Abstract
Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
1 Introduction
Vision Transformers (ViTs) [vit] have become the backbone of modern computer vision, powering foundation models like CLIP [CLIP], DINO [dino, jose2025dinov2, dinov3], and SigLIP [siglip, SigLIP2] through contrastive and self-supervised learning. ViTs process images as sequences of patches using self-attention [vaswani2017attention], but because of its quadratic complexity they are typically pre-trained at a single, low resolution. While this suffices for image-level tasks, it limits performance on dense prediction tasks like segmentation, which require fine-grained, per-pixel understanding.
This limitation is particularly severe in training-free open-vocabulary segmentation (OVS), where models must segment categories specified by text without access to pixel-level supervision. OVS approaches leverage ViTs as image feature extractors within Vision-Language models (VLMs), like CLIP [CLIP], for their strong pre-training, but struggle with high-resolution test inputs due to discrepancies with the training resolution. Two common strategies attempt to address this issue: (i) interpolating positional encodings and (ii) sliding-window inference.
The first approach supports single-pass processing and is efficient, but yields poor accuracy. In contrast, we show that the second approach, when using small strides and highly overlapping windows, significantly improves performance. Sliding-window inference enables a patch to appear in multiple contexts and enhances prediction accuracy through aggregation in overlapping regions. Interestingly, strides that are not divisible by the patch size lead to even better results, as sub-patch areas are exposed to diverse contexts.
However, sliding-window inference with small strides increases computational cost, making it impractical for real-world use (see Fig. 1). Although a growing number of architectures [Pix2Struct, NaVit, SigLIP2, dinov3] are designed to handle variable resolutions, they are either not widely adopted or, as our experiments indicate, leave room for improvement on standard OVS benchmarks.
To address these challenges, we propose SPAR: a method for enabling efficient resolution-agnostic feature extraction in ViTs via teacher-student distillation. SPAR transfers the spatial reasoning of a frozen VLM teacher that operates in a finely-strided sliding-window manner, to a fast, single-pass student. The student learns to align its image features with the teacher’s using a regression loss, without requiring architectural changes or additional supervision. As shown in Fig. 1, SPAR surpasses the teacher’s performance while maintaining single-pass efficiency.
SPAR uses the unchanged ViT architecture of the VLM and is compatible with a wide range of ViT backbones. During training, the student is exposed to diverse resolutions and aspect ratios, promoting robust generalization. We find that unfreezing only a small subset of layers suffices for strong resolution tolerance. To reduce compute, teacher features are precomputed and reused across training iterations. Our OVS experiments with vanilla zero-shot predictors show that SPAR is effective with the ViTs of OpenCLIP [openclip] and SigLIP2 [SigLIP2], and even with DINOv3.txt [dinov3], which already includes components that make it resilient to resolution changes.
In summary, we introduce SPAR, a framework for producing resolution-agnostic ViTs that deliver strong accuracy in a single forward pass. SPAR is applied to the vision encoder of common VLMs and requires no architectural modifications or additional labels. Rather than introducing new components, SPAR focuses on a principled training strategy that ingrains resolution flexibility without sacrificing efficiency or feature-space alignment. The resulting model excels in training-free open-vocabulary segmentation across resolutions, matching or surpassing finely-strided sliding-window teachers while being faster at inference.
2 Related Work
Open-vocabulary semantic segmentation (OVSS). Unlike standard semantic segmentation [zheng2021setr, ZegFormer], open-vocabulary semantic segmentation (OVSS) allows specifying and recognizing arbitrary categories using text. Progress on this task has closely followed advances in Vision–Language Models (VLMs). CLIP [CLIP] introduces contrastive pretraining on web-scale image–text pairs, demonstrating remarkable zero-shot classification capability. Subsequent models, such as ALIGN [ALIGN] and the SigLIP family [siglip, SigLIP2], further scale datasets and refine training objectives – the latter transitioning from contrastive formulations to more efficient, sigmoid-based matching losses – thereby improving both semantic alignment and localization. LSeg [LSeg] is the first to adapt CLIP for OVSS by aligning per-pixel visual features with frozen text embeddings. MaskCLIP [MaskCLIP] exposes spatial cues within CLIP’s vision tower by modifying its final attention layer to recover fine-grained localization.
Current methods are broadly categorized as training-based [LSeg, OpenSeg, ZegFormer, catseg, jose2025dinov2] and training-free [MaskCLIP, SCLIP, ProxyCLIP, CorrCLIP]. Training-based approaches include: (i) methods that correlate text embeddings with regions produced by decoupled mask proposal generators [OpenSeg, ZegFormer, ZSEG, OVSeg, ODISE]; (ii) models that couple CLIP features with proposal generation [FCCLIP]; and (iii) variants that directly adapt CLIP for dense prediction [FCCLIP, MaskCLIP, ClearCLIP]. Training-free methods, on the other hand, aim to preserve the pre-trained weights of CLIP without any fine-tuning. Some rectify localization weaknesses within CLIP itself by refining attention calculation and token intermixing [MaskCLIP, SCLIP, ClearCLIP, ITACLIP], others incorporate spatial priors from vision foundation models [ProxyCLIP, Trident, CorrCLIP], or aid localization with generative priors [OVDiff, DiffSegmenter, CLIPer]. Another line of work performs propagation of features or predictions to improve their spatial consistency [dinoiser, stojnic2025lposs]. Across all these works, Vision Transformers (ViTs) remain the dominant backbone, and their ingrained resolution sensitivity persists as a shared limitation; one that our method addresses.
Resolution-agnostic transformers. While Vision Transformers (ViTs) are powerful, fixed-resolution pretraining limits their robustness to varying input sizes. The naive remedy of interpolating positional encodings to match new image resolutions causes notable performance degradation [SwinTransformer, DeIT, CPE, ResFormer]. Pix2Struct [Pix2Struct] is among the first to address this issue by fixing token sequence length and resizing images with aspect ratios intact, breaking from traditional sliding-window inference [conv_window_slide, SwinTransformer]. Building on this idea, several works introduce multi-resolution training to aid robustness. ResFormer [ResFormer] trains on square crops of varying sizes and expands positional encodings during attention with neighborhood-focused components. FlexiViT [FlexiViT] removes fixed patch sizes, randomly sampling them to expose the model to diverse sequence lengths. NaViT [NaVit] factorizes positional encodings along spatial axes, enabling processing of arbitrary-resolution images with native aspect ratios. SigLIP2’s NaFlex [SigLIP2] unifies these ideas, combining flexible patching with variable sequence lengths for improved resolution generalization. However, NaFlex remains optimized for image–text alignment at the image- and not patch-level. This limits its use for dense prediction tasks like OVSS, as demonstrated in Sec. 4. Overall, these methods enhance robustness through diverse pretraining resolutions rather than post hoc adaptation.
Recent works continue this trend. ViTAR [ViTAR] expands training resolutions via fuzzy positional encoding, using lossy compression to reduce attention costs, while UniViTAR [UniViTAR] merges image and video modeling with aspect ratio preservation, introducing non-standard ViT changes. OryxMLLM [OryxMLLM] trains on native resolutions using dynamic feature compression. In contrast, our approach adapts ViTs to varying resolutions post hoc, without requiring additional labeled data or architectural modifications.
3 Method
3.1 Task Definition
Open Vocabulary Segmentation (OVS) aims to assign a semantic class to every pixel in an image $I \in \mathbb{R}^{H \times W \times 3}$. The set of classes $\mathcal{C}$, containing $K$ class names, is arbitrary and can be specified at inference time.
3.2 Preliminary
Vision Transformers (ViTs) process images by dividing them into square patches, projecting each patch into a vector representation, and adding positional encodings based on patch location. These vectors are concatenated into a sequence and passed through transformer blocks. A special CLS token is typically included, but we omit it here.
Foundational ViTs are traditionally pre-trained on square images of fixed image resolution $R \times R$ and consist of an encoder $f$ that produces a feature map $F = f(I) \in \mathbb{R}^{N \times d}$ per input image, where $N = h \cdot w$ is the number of patches, $d$ is the feature dimension, and $h$, $w$ are the spatial dimensions after reshaping into a tensor of size $h \times w \times d$. Due to patching, $h = w = R / p$, where $p$ is the patch size. Positional encodings are learned for that resolution only, i.e. on a grid of $h \times w$ possible positions.
Vision-Language Models (VLMs) combine a vision encoder with a text encoder to map images and text into a shared feature space. Let $T \in \mathbb{R}^{K \times d}$ denote the concatenated text features for all classes in $\mathcal{C}$.
OVS is performed by computing dot-product similarities between normalized image features and text features. The 2D map of class similarities for all classes is given by
$$S = \bar{F}\,\bar{T}^{\top} \in \mathbb{R}^{N \times K},$$
where $\bar{\cdot}$ denotes normalization applied along the feature dimension, and $S$ is reshaped to $h \times w \times K$ and upsampled to the resolution of $I$ using bilinear interpolation. Upsampling is necessary as the feature map has low resolution due to the patchification of the input during inference, which limits segmentation quality, especially for small objects. Operating at higher resolutions, along with arbitrary aspect ratios, is critical for accurate segmentation. Therefore, for an image $I$ of arbitrary size $H \times W$, we require an encoder capable of computing $F = f(I)$, with $F \in \mathbb{R}^{h \times w \times d}$, that produces feature maps of spatial resolution proportional to the input size, i.e. $h = H / p$, $w = W / p$.
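The similarity computation described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, the toy features are invented for illustration, and the bilinear upsampling step is omitted, so the arg-max is taken directly on the patch grid.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def class_similarities(patch_features, text_features):
    """Cosine similarity between every patch feature and every class text feature.

    patch_features: list of N vectors (one per patch), each of length d.
    text_features:  list of K vectors (one per class), each of length d.
    Returns an N x K nested list.
    """
    patches = [l2_normalize(f) for f in patch_features]
    texts = [l2_normalize(t) for t in text_features]
    return [[sum(a * b for a, b in zip(f, t)) for t in texts] for f in patches]

def predict_patch_labels(patch_features, text_features, h, w):
    """Arg-max class per patch, reshaped to an h x w grid (no upsampling here)."""
    sims = class_similarities(patch_features, text_features)
    labels = [max(range(len(row)), key=row.__getitem__) for row in sims]
    return [labels[r * w:(r + 1) * w] for r in range(h)]

# Toy example (invented values): a 2 x 2 patch grid, 2-D features, two classes.
patch_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 2.0]]
text_features = [[1.0, 0.0], [0.0, 1.0]]
label_grid = predict_patch_labels(patch_features, text_features, h=2, w=2)
```

In a full pipeline the similarity map would be upsampled to the image resolution before the arg-max, which is where the coarse patch grid limits segmentation quality.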
3.3 Pre-trained Baselines
Single-pass inference at arbitrary resolutions is feasible with ViT-based VLMs since self-attention can handle sequences of varying length. The required adjustment is interpolating positional encodings to match the new resolution. However, this approach suffers from performance degradation due to the training-inference resolution mismatch, and due to interpolation, which disrupts the absolute understanding of positional information learned during training.
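To make the positional-encoding adjustment concrete, the following sketch bilinearly resamples a grid of encoding vectors to a new grid size. It is an illustration under assumptions: the helper name is hypothetical, corner-aligned sampling is chosen for simplicity, and actual ViT code bases may use a different interpolation mode or alignment convention.

```python
def resize_pos_grid(grid, new_h, new_w):
    """Bilinearly resample an h x w grid of positional-encoding vectors.

    grid: nested list indexed as grid[row][col] -> list of d floats.
    Corner-aligned sampling is an assumption made for simplicity.
    """
    h, w = len(grid), len(grid[0])
    d = len(grid[0][0])

    def src_coord(i, new_n, old_n):
        # Map output index i to a (fractional) source coordinate.
        return 0.0 if new_n == 1 else i * (old_n - 1) / (new_n - 1)

    out = []
    for r in range(new_h):
        y = src_coord(r, new_h, h)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        wy = y - y0
        row = []
        for c in range(new_w):
            x = src_coord(c, new_w, w)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            wx = x - x0
            # Blend the four neighbouring encodings, channel by channel.
            vec = [
                (1 - wy) * ((1 - wx) * grid[y0][x0][k] + wx * grid[y0][x1][k])
                + wy * ((1 - wx) * grid[y1][x0][k] + wx * grid[y1][x1][k])
                for k in range(d)
            ]
            row.append(vec)
        out.append(row)
    return out

# Toy example: a 2 x 2 grid of 1-D encodings stretched to 3 x 3.
pos = [[[0.0], [1.0]], [[2.0], [3.0]]]
pos_3x3 = resize_pos_grid(pos, 3, 3)
```

The new in-between positions receive blended encodings that the model never saw during pre-training, which is one way to picture the degradation discussed above.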
Sliding-window inference is a common strategy for processing an arbitrary-resolution image $I$. The image is divided into overlapping, or at least adjacent, windows of size $R \times R$, each processed independently. We denote the image windows by $I_i$, where $i = 1, \dots, M$. The stride $s$ denotes how far we move between neighboring windows horizontally and/or vertically, controlling their overlap.
Feature maps are computed as $F_i = f(I_i)$, and the class similarities $S_i$ are computed from $F_i$ and the text features as above. The final prediction is obtained by stitching:
$$S = \mathrm{stitch}\big(\{S_i\}_{i=1}^{M}\big). \tag{1}$$
Stitching refers to merging the individual window predictions into a single coherent output map, typically by averaging overlapping regions and aligning them to their original spatial positions.
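A minimal sketch of stitching by averaging is given below. The function name and the toy inputs are hypothetical, and each window is reduced to a single-channel 2D map for brevity; in practice the same accumulation would run per class or per feature channel.

```python
def stitch(window_maps, origins, out_h, out_w):
    """Average per-window 2-D maps into one out_h x out_w map.

    window_maps: list of 2-D nested lists, one per window.
    origins:     list of (top, left) positions of each window in the output.
    Cells covered by several windows receive the mean of their values;
    cells covered by no window stay at 0.0.
    """
    total = [[0.0] * out_w for _ in range(out_h)]
    count = [[0] * out_w for _ in range(out_h)]
    for wmap, (top, left) in zip(window_maps, origins):
        for r, row in enumerate(wmap):
            for c, value in enumerate(row):
                total[top + r][left + c] += value
                count[top + r][left + c] += 1
    return [
        [total[r][c] / count[r][c] if count[r][c] else 0.0 for c in range(out_w)]
        for r in range(out_h)
    ]

# Two 1 x 3 windows with stride 2 over a 1 x 5 row: the middle cell overlaps.
stitched = stitch([[[1.0, 1.0, 1.0]], [[3.0, 3.0, 3.0]]], [(0, 0), (0, 2)], 1, 5)
# stitched == [[1.0, 1.0, 2.0, 3.0, 3.0]]
```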
This method preserves the native resolution of the vision encoder, avoiding single-pass issues, but incurs higher computational cost proportional to the total number of windows $M$. Even with batch processing, sliding-window inference is slower than single-pass for small strides. Nevertheless, small strides yield better performance by increasing window overlap, allowing each pixel to be seen in multiple contexts. This redundancy improves robustness through prediction averaging, akin to test-time augmentation.
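To see how quickly the window count grows as the stride shrinks, consider the following sketch. The image and window sizes are hypothetical, and snapping the last window to the image border is one common convention assumed here, not necessarily the one used in the experiments.

```python
def window_starts(length, window, stride):
    """Start offsets of sliding windows along one axis.

    Windows advance by `stride`; a final window is shifted back so that it
    ends exactly at the image border (an assumed convention).
    """
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)
    return starts

def num_windows(height, width, window, stride):
    """Total number of windows, i.e. the number of encoder inputs per image."""
    rows = len(window_starts(height, window, stride))
    cols = len(window_starts(width, window, stride))
    return rows * cols

# Hypothetical sizes: a 672 x 896 image with 224 x 224 windows.
counts = {s: num_windows(672, 896, 224, s) for s in (224, 112, 24)}
# counts == {224: 12, 112: 35, 24: 580}
```

Under these assumed sizes, going from a stride of half the window to a stride of 24 multiplies the number of encoder inputs by more than an order of magnitude.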
3.4 Single-Pass Any-Resolution (SPAR)
Performance vs. inference time trade-off. Observing the trade-off between speed and accuracy in single-pass versus sliding-window inference (see Fig. 1), we propose a feature distillation approach that combines their advantages. A student model is trained to mimic the feature embeddings of a sliding-window teacher while maintaining the efficiency of single-pass inference. Both models have the same architecture, allowing the student to be initialized with the same weights and keep the same feature space. A method overview is presented in Figure 2.
Sliding-window teacher. Assume $I$ is a training image of arbitrary resolution and aspect ratio, divided into windows $I_1, \dots, I_M$. Instead of stitching predictions, we stitch features over windows of size $R \times R$, producing the teacher feature map $F_{\mathrm{t}}$ by
$$F_{\mathrm{t}} = \mathrm{stitch}\big(\{f(I_i)\}_{i=1}^{M}\big),$$
where stitching refers to merging feature maps from windows into a single coherent representation aligned to the feature map layout that the single-pass model would produce, by averaging overlapping regions.
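Distillation loss. The student model $f_{\theta}$ processes the entire image in a single pass to produce $F_{\mathrm{s}} = f_{\theta}(I)$. The distillation loss minimizes the mean squared error between teacher and student features, averaged over all $N \cdot d$ feature entries:
$$\mathcal{L} = \frac{1}{N d}\,\lVert F_{\mathrm{t}} - F_{\mathrm{s}} \rVert_F^2 .$$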
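As a plain illustration of the regression objective, the sketch below computes a mean squared error between two feature maps stored as nested lists. Averaging over all entries is an assumed normalization, and the function name and toy values are hypothetical.

```python
def feature_mse(teacher, student):
    """Mean squared error between two feature maps of identical shape.

    Each map is a list of N feature vectors of length d; the error is
    averaged over all N * d entries (an assumed normalization).
    """
    if len(teacher) != len(student):
        raise ValueError("feature maps must contain the same number of patches")
    sq_sum, n = 0.0, 0
    for t_vec, s_vec in zip(teacher, student):
        if len(t_vec) != len(s_vec):
            raise ValueError("feature vectors must have the same dimension")
        for t, s in zip(t_vec, s_vec):
            sq_sum += (t - s) ** 2
            n += 1
    return sq_sum / n

# Toy example: only one of four entries differs, by 2.0.
teacher_feats = [[1.0, 0.0], [0.0, 1.0]]
student_feats = [[1.0, 0.0], [0.0, 3.0]]
loss = feature_mse(teacher_feats, student_feats)  # (2.0 ** 2) / 4 == 1.0
```

Because the loss only compares feature maps, it needs no labels, which matches the annotation-free training setup described in the text.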
Training. The teacher remains frozen during training. We empirically observe that training only the last few blocks of the student (two by default) is a good choice for the standard OVS settings, whereas training all parameters excels at very large resolution inference and on other tasks. Optimization is performed over a dataset of images with diverse resolutions and aspect ratios. No annotations are required since the loss operates solely on feature maps; a generic image dataset suffices.
Training runs for multiple epochs, and teacher feature maps are precomputed and stored to save time. To enable easier feature-level stitching, all training images are resampled bilinearly to have dimensions divisible by the patch size $p$, and the window size $R$ is also divisible by $p$. If the stride $s$ is divisible by $p$, stitching occurs at the encoder's native feature resolution due to full patch alignment. However, experiments show that using a stride not divisible by $p$ improves performance by exposing pixels to more diverse contexts. In such cases, if $s \cdot u$ is divisible by $p$ for an integer factor $u$, stitching is performed by upsampling all feature maps by a factor of $u$ before merging, and then downsampling by a factor of $u$ after merging to restore the original feature resolution.
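The divisibility condition above has a closed form: the smallest integer factor whose product with the stride is divisible by the patch size equals the patch size divided by the greatest common divisor of stride and patch size. A small sketch, with a hypothetical function name, is:

```python
from math import gcd

def upsample_factor(stride, patch):
    """Smallest positive integer u such that stride * u is divisible by patch."""
    return patch // gcd(stride, patch)

# With a patch size of 16, a stride of 24 needs features upsampled by 2,
# while a patch-aligned stride such as 112 needs no upsampling at all.
factor_24 = upsample_factor(24, 16)    # 2
factor_112 = upsample_factor(112, 16)  # 1
```

After upsampling by this factor, every window offset falls on a whole cell of the upsampled feature grid, so windows can be aligned and averaged without further resampling.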
4 Experiments
4.1 Implementation Details
| Models | Inference | Upsampler | Voc21 | Voc20 | CS | ADE | C60 | C59 | Avg | Δ vs. single-pass | Δ vs. sliding-window |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SigLIP2 [SigLIP2] – ViT-B-16 | | | | | | | | | | | |
| NaFlex [SigLIP2] | single-pass | bilinear | 35.8 | 66.6 | 22.8 | 16.2 | 23.7 | 25.2 | 31.7 | | |
| Pre-trained | sliding-window | bilinear | 45.0 | 77.5 | 38.4 | 21.8 | 30.7 | 34.0 | 41.2 | | |
| Pre-trained | single-pass | bilinear | 36.1 | 71.3 | 23.5 | 16.8 | 24.5 | 26.1 | 33.1 | | |
| SPAR | single-pass | bilinear | 47.3 | 81.5 | 38.4 | 23.4 | 33.8 | 37.2 | 43.6 | +10.5 | +2.4 |
| LPOSS [stojnic2025lposs]+Pre-tr. | single-pass | bilinear | 46.1 | 89.6 | 34.5 | 19.9 | 30.9 | 35.2 | 42.7 | | |
| LPOSS [stojnic2025lposs]+SPAR | single-pass | bilinear | 51.2 | 89.7 | 39.2 | 25.8 | 34.5 | 39.8 | 46.7 | +4.0 | |
| Pre-trained | single-pass | AnyUp [AnyUp] | 42.4 | 82.3 | 33.2 | 23.1 | 30.1 | 34.5 | 40.9 | | |
| SPAR | single-pass | AnyUp [AnyUp] | 51.0 | 86.2 | 38.6 | 26.1 | 37.1 | 41.5 | 46.8 | +5.9 | |
| OpenCLIP [openclip] – ViT-B-16 | | | | | | | | | | | |
| Pre-trained | sliding-window | bilinear | 48.5 | 55.1 | 30.8 | 15.4 | 25.2 | 26.6 | 33.6 | | |
| Pre-trained | single-pass | bilinear | 43.1 | 52.3 | 17.7 | 11.5 | 20.3 | 21.4 | 27.7 | | |
| SPAR | single-pass | bilinear | 48.5 | 57.6 | 25.7 | 16.2 | 28.2 | 30.3 | 34.4 | +6.7 | +0.8 |
| Pre-trained | single-pass | AnyUp [AnyUp] | 48.5 | 61.0 | 20.9 | 14.8 | 25.3 | 26.9 | 32.9 | | |
| SPAR | single-pass | AnyUp [AnyUp] | 50.3 | 59.2 | 26.1 | 17.0 | 29.2 | 32.0 | 35.6 | +2.7 | |
| DINOv3.txt [dinov3] – ViT-L-16 | | | | | | | | | | | |
| Pre-trained | sliding-window | bilinear | 42.6 | 90.8 | 39.7 | 24.9 | 31.8 | 34.4 | 44.0 | | |
| Pre-trained | single-pass | bilinear | 46.0 | 89.5 | 35.9 | 24.4 | 32.1 | 34.8 | 43.8 | | |
| SPAR | single-pass | bilinear | 43.1 | 91.3 | 40.1 | 25.4 | 31.6 | 35.0 | 44.4 | +0.6 | +0.4 |
| Pre-trained | single-pass | AnyUp [AnyUp] | 46.2 | 89.9 | 32.4 | 24.9 | 32.4 | 35.1 | 43.5 | | |
| SPAR | single-pass | AnyUp [AnyUp] | 42.8 | 91.5 | 36.1 | 25.6 | 31.7 | 35.0 | 43.8 | +0.3 | |
Models. We employ three pre-trained VLMs. Most experiments use SigLIP2 [SigLIP2] with the ViT-B-16 vision encoder, pre-trained on images. SigLIP2 employs attention pooling over patch tokens with a learned query to obtain the final image-level representation. We skip the pooling step and directly project the value representation of the patch tokens through the output linear layer of the attention-pooling module and then through the rest of the network. In this way, the patch representation we obtain is compatible with that of the text encoder. For the CLIP [CLIP] experiments, we use OpenCLIP [openclip] with ViT-B-16 and native image size . We follow MaskCLIP [MaskCLIP] and set the attention matrix of the last encoder block to identity. Lastly, we use DINOv3.txt [dinov3] with ViT-L-16 and native image size . Inference is performed in accordance with the original instructions for semantic segmentation [dinov3]. As the image-level representation is a concatenation of a CLS token and the average of patch representations, we keep only the second half of the textual feature dimensions for later similarity comparison.
Methods. For SPAR, we evaluate single-pass inference. For pre-trained models, we use two variants: the single-pass baseline, which has exactly the same runtime complexity as ours, and the sliding-window baseline for different values of stride $s$. In particular, $s = 24$ corresponds to the settings of the teacher model during distillation. Note that, when evaluating sliding-window processing, stitching happens for class similarities at the pixel level, while during distillation we stitch together upsampled vision features and resample back down to patch level, as discussed in Sec. 3. Additionally, the case for SigLIP2 corresponds to the commonly used setting in OVS literature [stojnic2025lposs, dinoiser, SCLIP] of taking half the window size due to its low runtime cost. The window size is set to the native image size of each model, i.e. for SigLIP2, for MaskCLIP and for DINOv3.txt. We further show complementarity by combining SPAR with learnable upsampling by AnyUp [AnyUp], and label propagation performed by LPOSS [stojnic2025lposs]. The former is a feature-agnostic model that operates independently per feature dimension. Even though it is not its original use, we apply it to upsample class similarities and not the features themselves in all benchmarks, which has reduced complexity, i.e. . LPOSS is a training‑free OVSS approach that refines initial patch‑based predictions from a vision‑language model by propagating VLM labels (CLIP) across patches in accordance with spatial and DINOv2-semantic similarity. For LPOSS, we replace both models with SPAR-SigLIP2 to highlight its improved spatial coherence. We additionally report the performance of NaFlex, SigLIP2's native variant for handling images of any resolution. Unless otherwise stated, bilinear interpolation is used to upsample predictions.
Datasets and evaluation. We evaluate on six standard open-vocabulary semantic segmentation benchmarks derived from four datasets: Pascal VOC [voc], Pascal Context [context], ADE20K [ade20k] and Cityscapes [cityscapes]. VOC21 and Context60 variants include an additional background class over VOC20 and Context59. To allow evaluation for models that have large native resolution, e.g. SigLIP2, we bilinearly resize images of all datasets to have their shorter side equal to 672, with the exception of Cityscapes, which is used at its original resolution. We use mean Intersection over Union (mIoU) as the metric of choice for semantic segmentation. By Avg we denote the average mIoU over all six datasets. We train on a subset of 25k images from SA-1B [sam], an 11-million-image dataset for class-agnostic semantic segmentation. Please note that we do not use any ground-truth annotations from SA-1B.
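For readers unfamiliar with the metric, a minimal sketch of mIoU on flat label lists follows. The function name is hypothetical and the ignore label 255 is an assumed convention. Benchmark implementations typically accumulate intersections and unions over the whole dataset before averaging, which this toy version does not attempt.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean Intersection over Union over classes present in pred or target.

    pred, target: flat lists of integer class labels of equal length.
    Pixels whose target equals `ignore_index` are skipped (the value 255
    is an assumption, not taken from the paper).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both the predicted and true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0

pred = [0, 0, 1, 1, 1, 0]
target = [0, 0, 1, 1, 0, 255]
score = mean_iou(pred, target, num_classes=2)  # both classes have IoU 2/3
```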
Training of SPAR. For training, we choose the teacher with sliding windows and stride $s = 24$ for its good balance of performance and inference time (see Fig. 1). We apply image augmentations once per image, as the stitched feature map from the teacher network is computed once and stored. An image is augmented with random resizing of the shorter side to in the default case, and in the experiments with larger resolution. We apply axis-independent random cropping, with side lengths from 512 to the maximum possible, and horizontal flipping, each with a 0.5 probability.
We train for 10 epochs with a constant learning rate, using the AdamW [adamw] optimizer with weight decay, while tuning only the ViT's last two blocks, unless otherwise mentioned. The default choice for the training set consists of 25k images from SA-1B [sam]. More training details are provided in the supplement.
4.2 Main Results
SPAR improvements and compatibility. We report the quantitative results across three different backbones in Sec. 4.1. Compared to pre-trained single-pass inference, SPAR yields major gains of +10.5 and +6.7 mIoU for SigLIP2 and OpenCLIP, respectively. Moreover, SPAR consistently surpasses the teacher across all models in average performance as well as individually on most datasets.
For DINOv3.txt, SPAR produces only slight gains over the teacher. DINOv3.txt already performs well in single-pass inference, leaving limited room for distillation from sliding-windows. This is likely due to the model’s RoPE encodings [RoPE] and an additional high-res fine-tuning stage. Nevertheless, SPAR on Cityscapes, which has larger resolution test images, achieves a substantial gain from 35.9 to 40.1 mIoU, effectively recovering performance on images that deviate most from the training resolution.
Moreover, we see that the SigLIP2 NaFlex variant does not perform competitively compared to either SPAR or the sliding-window baseline, demonstrating its limitations for dense prediction tasks despite its additional training for resolution robustness. AnyUp [AnyUp], combined with the single-pass baseline, overall does not manage to exceed SPAR. However, in tandem, it proves compatible and boosts the performance by an appreciable margin, notably by 3.2 on average for SPAR-SigLIP2. Interestingly, AnyUp is not effective when combined with DINOv3.txt. Finally, the combination of SPAR and LPOSS [stojnic2025lposs] further boosts performance (another +3.1 mIoU), demonstrating how SPAR-trained models are complementary to other OVS methods.
Performance on large resolution inference. Pre-trained single-pass inference and NaFlex do not benefit from larger resolutions, which is not the case for pre-trained sliding-window inference. The default variant of SPAR benefits from increasing effective resolution up to k2, and surpasses the sliding-window approach when but not the more costly . Nevertheless, we observe that training the full network improves SPAR’s ability to perform large resolution inference, which is further enhanced by increasing the maximum shorter side length in training from 2048 to 2560. These two choices help with large, but not with small, resolution inference. Overall, SPAR achieves the best performance across all inference-time resolutions.
Due to the observation in Sec. 4.1 that with SPAR, DINOv3.txt benefits at larger resolution, we replicate the same experiment for DINOv3.txt in Fig. 4. We observe that SPAR consistently improves DINOv3.txt performance across all resolutions, despite the design and training choices of DINOv3.txt that already promote resolution robustness.
Performance vs. inference time. In Fig. 1, we analyze the trade-off between inference time and performance of SPAR SigLIP2 and the pre-trained models for different variants. SPAR achieves higher results than the best sliding-window variant while being faster than the teacher, and gains 10.5 mIoU points over the pre-trained single-pass model at the same inference time. For the sliding-window variants, we observe that smaller stride values lead to much better performance than a stride of half the window size, which is commonly adopted in the OVS literature due to its low runtime cost. Furthermore, a stride value not divisible by the patch size gives a good performance boost without compromising inference time, justifying its use for the SPAR teacher.
We further analyze the performance and inference time trade-off for increasing image resolution in Sec. 4.1, with images scaled to the same area as denoted on the axis. Compared to single-pass inference, the sliding-window approach with incurs a substantial runtime cost, being two orders of magnitude slower. All SPAR variants are noticeably cheaper, providing a great performance-cost trade-off.
Supplementary Material
6 Training Details
We conduct all training on two NVIDIA RTX A6000 GPUs using precomputed image features generated by the corresponding teacher model in sliding-window mode with a stride of 24. For feature generation, we use the smallest upsampling factor $u$ such that the upsampled feature maps of cropped windows have their corresponding image sub-patches fully overlapping. Stitching is then done by simply aligning and averaging feature maps. For $s = 24$ and patch size $p = 16$, we choose $u = 2$, which is the smallest value of $u$ for which $s \cdot u$ is divisible by $p$. Both the up- and downsampling are bilinear. To accommodate variable input sizes despite using precomputed features, we train with a batch size of 1. Larger batches would require generating teacher features in grouped batches so that all samples share the same sequence length. This would either reduce the effective diversity of the training set, as image tuples would always co-occur in the same context, or introduce additional complexity to gradient accumulation, along with masking in attention to prevent cross-sample interaction. Attention does not usually support unpadded sequences of different lengths in a batch, which would necessitate flattening all sequences together into a single long one. Training is performed in mixed float16 precision.
By default, for the initial rescaling in image augmentation, we use MMCV’s RandomResize with scale=(2048,1024), ratio_range=(0.5,1) and keep_ratio=True. For the extended image range experiments, denoted by , we adjust scale=(2560,2560) and ratio_range=(0.2,1), while still keeping aspect ratio intact with keep_ratio=True. The dataset class names and their feature extraction are as in [SCLIP] and its official Github implementation. The costs for SPAR SigLIP2 – ViT-B-16 time-wise include 9 hours for feature extraction on a single A6000 GPU, and 1.5 hours of training on 2 A6000. Feature storage takes about 170GB.
For experiments involving SA-1B [sam], we use the first 25k images from archive files sa_000000.tar, sa_000001.tar, and sa_000002.tar, as named in the master file for downloading SA-1B. For the alternate seed used in the training-set ablation of the main paper, we instead sample images sequentially from the randomly chosen sa_000165.tar, sa_000205.tar, and sa_000569.tar.
7 Performance Across Seeds
We report repeated experimental trials to assess the repeatability of SPAR. For each setting, we compute the mean and standard deviation across three independent runs. The aggregated results for the individual tuning configurations and training sets are presented in the two tables of this section, and mirror the training-configuration and training-set ablations in the main paper. By Avg we denote the average mIoU over six datasets: Voc21, Voc20, Cityscapes, ADE20K, Context60, and Context59. Mean ± std indicates the mean and standard deviation of Avg over three independent runs.
| Training Configuration | Avg | Mean ± std |
|---|---|---|
| SPAR model | | |
| All params | 42.5 | 42.3 ± 0.27 |
| Last block | 43.3 | 43.2 ± 0.14 |
| Last 2 blocks (default) | 43.6 | 43.6 ± 0.04 |
| Last 3 blocks | 42.9 | 43.1 ± 0.15 |
| Patch projection | 38.0 | 38.1 ± 0.11 |
| Positional encoding | 39.6 | 39.6 ± 0.03 |
| Last 2 blocks - MLP | 42.7 | 42.8 ± 0.07 |
| Last 2 blocks - QKV | 41.7 | 41.7 ± 0.04 |

| Training Set | Avg | Mean ± std |
|---|---|---|
| SPAR model | | |
| ADE20k+CS+VOC | 43.4 | 43.6 ± 0.27 |
| ADE20k | 43.0 | 43.1 ± 0.12 |
| SA-1B 1.25k (5%) | 41.1 | 41.0 ± 0.09 |
| SA-1B 2.5k (10%) | 42.1 | 41.9 ± 0.31 |
| SA-1B 12.5k (50%) | 43.2 | 43.0 ± 0.15 |
| SA-1B 25k (100%) | 43.6 | 43.6 ± 0.04 |
| SA-1B 25k (diff. subset) | 43.6 | 43.3 ± 0.30 |
| SA-1B 50k (200%) | 43.6 | 43.5 ± 0.10 |
8 Measuring Inference Time
In this section, we provide details on how we measure inference time for the experiments reported in Fig. 1. To ensure a fair comparison, we only accumulate the time required for the forward passes needed to process each image. In single-pass mode, this corresponds to timing the underlying Vision Transformer (ViT) for a single image. For sliding-window inference, we sum the time required to process each sub-batch of window crops. The sub-batch size is set to 60 and is kept constant across experiments; if an image’s total number of windows is smaller, they are processed in a single batch. All experiments are conducted on a single NVIDIA RTX A6000 GPU, and inference time is measured by accumulating the differences between start and end timestamps using the native time package. Before timing each forward pass, we perform 10 warm-up passes with the data. Measurement is done on the 11th pass.
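The timing protocol can be sketched as follows. This is an illustration rather than the authors' script: the function names are hypothetical, `time.perf_counter` is one choice from Python's `time` module, and on a GPU an explicit device synchronization before reading each timestamp would normally be needed, which is left out here.

```python
import time

def split_into_sub_batches(windows, size=60):
    """Group window crops into sub-batches of at most `size` items."""
    return [windows[i:i + size] for i in range(0, len(windows), size)]

def time_forward(forward, batches, warmup=10):
    """Time the forward passes for one image after `warmup` untimed repetitions.

    forward: callable applied to each batch (the whole image in single-pass
             mode, or one sub-batch of window crops in sliding-window mode).
    batches: list of inputs belonging to this image.
    Returns the accumulated wall-clock seconds of the timed repetition.
    """
    for _ in range(warmup):
        for batch in batches:
            forward(batch)
    elapsed = 0.0
    for batch in batches:
        start = time.perf_counter()
        forward(batch)  # on GPU, synchronize before reading the clock (omitted)
        elapsed += time.perf_counter() - start
    return elapsed

# Toy usage with a stand-in "model" that only records the batch sizes it sees.
calls = []
batches = split_into_sub_batches(list(range(130)), size=60)
elapsed = time_forward(lambda b: calls.append(len(b)), batches, warmup=10)
```

Only the time spent inside the forward calls is accumulated, so cropping and stitching overhead is excluded, in line with the description above.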
9 SPAR with LPOSS
For experiments combining SPAR-SigLIP2 with LPOSS [stojnic2025lposs], we tune the hyperparameter, which controls the Laplacian computation during label propagation. For single-pass inference, we set to better align with the label distribution produced by SPAR, while for pretrained single-pass and sliding-window inference we retain the default , as we found these settings to perform best for each approach.
10 Vision-only dense prediction tasks details
We utilize the code of [benchmarking-benchmark] with only minor adaptations to enable training and evaluation in a single pass using native image resolutions. We disable scale jittering and cropping augmentations in the training source code: scale jittering would distort images in a way that does not reflect reality, and cropping to a fixed size is inconsistent with our goal of training a transformer capable of processing images at their native aspect ratio. Horizontal flipping remains enabled during training. In Sec. 10, we additionally report linear probing results when using the code’s default augmentations and resizing of images: for VOC21 and ADE20K, and for Cityscapes. Training the last two blocks and all parameters becomes more similar for VOC21 and ADE20K, while the gap on Cityscapes closes. SPAR still provides a noticeable performance benefit.
| SigLIP2 – ViT-B-16 | VOC21 | CS | ADE |
|---|---|---|---|
| Linear Probe - native resolution | | | |
| Pre-trained single-pass | 67.1 | 54.1 | 36.0 |
| SPAR (Last 2 blocks) | 70.2 | 56.2 | 38.1 |
| SPAR (All) | 68.9 | 66.7 | 36.5 |
| Linear Probe - repository resolution | | | |
| Pre-trained single-pass | 71.2 | 57.0 | 37.7 |
| SPAR (Last 2 blocks) | 74.9 | 60.9 | 40.0 |
| SPAR (All) | 75.0 | 67.1 | 39.1 |
We use the official implementation of Hummingbird [hummingbird] for KNN segmentation, leaving images at their native resolution and aspect ratio. Due to the memory limits of the A6000, we use only 50% of the VOC21 and 30% of the ADE20K training images to construct the index used to classify patches from the evaluation images; these are the largest subsets that avoid out-of-memory errors. Cityscapes uses the full training set. Because the training images are randomly sampled, we report the mean over three trials and observe minimal fluctuation: the standard deviation never exceeds 0.4 mIoU points.
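The subsampling and trial-averaging procedure can be sketched as below. This is only an illustration under stated assumptions: the helper names `sample_index_subset` and `summarize_trials` are hypothetical and not part of the Hummingbird code, the seeds are arbitrary, and the sample standard deviation is used because the text does not say which estimator is reported.

```python
import random
import statistics


def sample_index_subset(train_ids, fraction, seed):
    """Draw a random subset of training image ids for building the KNN index.

    `fraction` would be 0.5 for VOC21, 0.3 for ADE20K, and 1.0 for
    Cityscapes according to the text; the seeding scheme is an assumption.
    """
    rng = random.Random(seed)
    k = max(1, round(len(train_ids) * fraction))
    return sorted(rng.sample(train_ids, k))


def summarize_trials(scores):
    """Return (mean, sample standard deviation) over per-trial mIoU scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# One subset per trial, each drawn with a different (arbitrary) seed.
subsets = [sample_index_subset(list(range(1000)), 0.5, seed) for seed in range(3)]
```

Each subset would be used to build one index, and the three resulting mIoU values would then be passed to `summarize_trials`.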
The panoptic segmentation experiments report ADE20K results for images resized to , while Cityscapes is kept at its native resolution, following standard practice. We additionally evaluated other resolutions and observed the same trend: training all parameters yields the highest-quality representations and emerges as a promising approach for future research in resolution-agnostic panoptic segmentation.
| GT Panoptic | ADE | CS | |||
|---|---|---|---|---|---|
| (native) | |||||
| Pre-trained single pass | 33.9 | 34.3 | 33.6 | 31.0 | 31.5 |
| SPAR Last 2 blocks | 34.6 | 35.2 | 34.0 | 30.7 | 28.6 |
| SPAR All parameters | 37.9 | 39.7 | 41.1 | 39.0 | 52.4 |
11 SPAR and other distillation schemes
In Tab. 9, we explore additional distillation targets. We isolate the effect of multi-resolution training by distilling from a single-pass teacher, for which we precompute the image features and otherwise maintain the setup described in Sec. 4. During training, the student additionally sees the image bilinearly interpolated by a random factor and attempts to align its features, which are bilinearly up- or downsampled as needed to match the teacher’s. This yields only a +2.9 mIoU average improvement over the pre-trained single-pass baseline, highlighting the importance of the teacher’s multi-context supervision. We also explore using a teacher with a finer stride of , which underperforms by 3 mIoU on average, emphasizing the benefit of observing pixels in the context of different patches.
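The alignment step of this multi-resolution baseline can be illustrated with a small dependency-free sketch. It is not the training code: feature maps are plain nested lists of shape H x W x C, the align-corners bilinear convention is an assumption, and mean squared error stands in for the feature regression loss, whose exact form is not given in this section. The function names are hypothetical.

```python
def bilinear_resize(feat, out_h, out_w):
    """Bilinearly resize an H x W x C nested-list feature map.

    Assumes the align-corners convention: the corner cells of the input
    grid map exactly onto the corner cells of the output grid.
    """
    in_h, in_w = len(feat), len(feat[0])
    channels = len(feat[0][0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            row.append([
                (1 - wy) * ((1 - wx) * feat[y0][x0][c] + wx * feat[y0][x1][c])
                + wy * ((1 - wx) * feat[y1][x0][c] + wx * feat[y1][x1][c])
                for c in range(channels)
            ])
        out.append(row)
    return out


def feature_regression_loss(student, teacher):
    """Mean squared error after resampling the student grid to the teacher grid."""
    resized = bilinear_resize(student, len(teacher), len(teacher[0]))
    squared = [
        (s - t) ** 2
        for row_s, row_t in zip(resized, teacher)
        for cell_s, cell_t in zip(row_s, row_t)
        for s, t in zip(cell_s, cell_t)
    ]
    return sum(squared) / len(squared)
```

In this sketch the student features come from the rescaled image, so their grid is larger or smaller than the precomputed teacher grid, and the resampling brings both to the same shape before the loss is computed.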
To quantify potential benefits that might be missed by not aligning class similarities, we experiment with using the class lists of ADE20K and Cityscapes, respectively. This approach overfits to the domain of the class set used (e.g., using Cityscapes classes yields high performance on Cityscapes but not on other datasets), reaffirming that SPAR does not require any knowledge of the target domain to be effective and in fact benefits from the classless approach.
| SigLIP2 – ViT-B-16 | VOC21 | VOC20 | CS | ADE | C60 | C59 | Avg |
|---|---|---|---|---|---|---|---|
| Pre-trained single-pass | 36.1 | 71.3 | 23.5 | 16.8 | 24.5 | 26.1 | 33.1 |
| Multi-resolution distill. | 38.0 | 76.8 | 26.3 | 19.0 | 26.8 | 29.0 | 36.0 |
| SPAR w/ Teach. | 43.8 | 77.6 | 36.2 | 21.3 | 31.0 | 34.0 | 40.6 |
| SPAR w/ Teach. | 47.3 | 81.5 | 38.4 | 23.4 | 33.8 | 37.2 | 43.6 |
| Class sim. distill. ADE | 48.0 | 75.4 | 37.1 | 23.0 | 32.8 | 36.4 | 42.1 |
| Class sim. distill. CS | 44.0 | 64.8 | 38.3 | 19.8 | 26.0 | 31.8 | 37.4 |
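The last column of the table above is consistent with a per-row mean over the six benchmark columns. The short check below recomputes it from the reported numbers, under that assumption; the labels "first row" and "second row" for the two SPAR entries are placeholders, since the teacher strides are not legible in the table.

```python
# Per-dataset scores copied from the table: VOC21, VOC20, CS, ADE, C60, C59.
ROWS = {
    "Pre-trained single-pass": [36.1, 71.3, 23.5, 16.8, 24.5, 26.1],
    "Multi-resolution distill.": [38.0, 76.8, 26.3, 19.0, 26.8, 29.0],
    "SPAR w/ Teach. (first row)": [43.8, 77.6, 36.2, 21.3, 31.0, 34.0],
    "SPAR w/ Teach. (second row)": [47.3, 81.5, 38.4, 23.4, 33.8, 37.2],
    "Class sim. distill. ADE": [48.0, 75.4, 37.1, 23.0, 32.8, 36.4],
    "Class sim. distill. CS": [44.0, 64.8, 38.3, 19.8, 26.0, 31.8],
}


def row_mean(scores):
    """Unweighted mean over the six benchmark columns."""
    return sum(scores) / len(scores)


averages = {name: row_mean(scores) for name, scores in ROWS.items()}

# Gain of multi-resolution distillation over the pre-trained baseline.
multi_res_gain = (averages["Multi-resolution distill."]
                  - averages["Pre-trained single-pass"])

# Gap between the two SPAR teacher configurations.
teacher_gap = (averages["SPAR w/ Teach. (second row)"]
               - averages["SPAR w/ Teach. (first row)"])

for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

Under this reading, `multi_res_gain` corresponds to the +2.9 mIoU improvement quoted in the text, and `teacher_gap` to the roughly 3 mIoU difference between the two teachers.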
12 Qualitative Analysis
We provide additional semantic segmentation results in Fig. 6 and PCA visualizations in Fig. 7 for ADE20K [ade20k] and Cityscapes [cityscapes]. The backbones used, MaskCLIP [MaskCLIP] with OpenCLIP [openclip] and SigLIP2 [SigLIP2], are indicated per row in the figures. The segmentation maps demonstrate how SPAR denoises teacher predictions while preserving semantic alignment. SPAR-OpenCLIP improves the delineation of bathroom elements and buildings (1st and 3rd columns in Fig. 6) while suppressing noisy classes on the roads (4th and 5th columns). SPAR-SigLIP2 behaves similarly but is slightly more robust. The PCA visualizations further show how SPAR improves inter-object separability and intra-object consistency without losing finer details. The former is visible in the PCA visualizations of both backbones, e.g., in the clearer separation of people from the wall or bedroom elements (1st and 2nd columns in Fig. 7), while improved intra-object consistency is most apparent in the interior of the camper (3rd column).