SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

Naomi Kombol¹ Ivan Martinović¹ Siniša Šegvić¹ Giorgos Tolias²
¹Faculty of Electrical Engineering and Computing, University of Zagreb
²VRG, Faculty of Electrical Engineering, Czech Technical University in Prague

Abstract

Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR

1 Introduction

Figure 1: Performance vs. inference time trade-off. Comparison between the pre-trained SigLIP2 – ViT-B-16 with single-pass or sliding-window (stride value reported in text labels) inference, and our single-pass SPAR-distilled model. The teacher has sliding windows of size $512\times 512$ with stride 24. We report average performance across six datasets along with average inference time for a $1024\times 2048$ image. *ND: stride not divisible by patch size.

Vision Transformers (ViTs) [vit] have become the backbone of modern computer vision, powering foundation models like CLIP [CLIP], DINO [dino, jose2025dinov2, dinov3], and SigLIP [siglip, SigLIP2] through contrastive and self-supervised learning. ViTs process images as sequences of patches using self-attention [vaswani2017attention], but because self-attention has quadratic complexity in the sequence length, they are typically pre-trained at a single, low resolution. While this suffices for image-level tasks, it limits performance on dense prediction tasks like segmentation, which require fine-grained, per-pixel understanding.

This limitation is particularly severe in training-free open-vocabulary segmentation (OVS), where models must segment categories specified by text without access to pixel-level supervision. OVS approaches leverage ViTs as image feature extractors within Vision-Language models (VLMs), like CLIP [CLIP], for their strong pre-training, but struggle with high-resolution test inputs due to discrepancies with the training resolution. Two common strategies attempt to address this issue: (i) interpolating positional encodings and (ii) sliding-window inference.

The first approach supports single-pass processing and is efficient, but yields poor accuracy. In contrast, we show that the second approach, when using small strides and highly overlapping windows, significantly improves performance. Sliding-window inference enables a patch to appear in multiple contexts and enhances prediction accuracy through aggregation in overlapping regions. Interestingly, strides that are not divisible by the patch size lead to even better results, as sub-patch areas are exposed to diverse contexts.

However, sliding-window inference with small strides increases computational cost, making it impractical for real-world use (see Fig. 1). Although a growing number of architectures [Pix2Struct, NaVit, SigLIP2, dinov3] are designed to handle variable resolutions, they are either not widely adopted or, as our experiments indicate, leave room for improvement on standard OVS benchmarks.

To address these challenges, we propose SPAR: a method for enabling efficient resolution-agnostic feature extraction in ViTs via teacher-student distillation. SPAR transfers the spatial reasoning of a frozen VLM teacher that operates in a finely-strided sliding-window manner, to a fast, single-pass student. The student learns to align its image features with the teacher’s using a regression loss, without requiring architectural changes or additional supervision. As shown in Fig. 1, SPAR surpasses the teacher’s performance while maintaining single-pass efficiency.

SPAR uses the unchanged ViT architecture of the VLM and is compatible with a wide range of ViT backbones. During training, the student is exposed to diverse resolutions and aspect ratios, promoting robust generalization. We find that unfreezing only a small subset of layers suffices for strong resolution tolerance. To reduce compute, teacher features are precomputed and reused across training iterations. Our OVS experiments with vanilla zero-shot predictors show that SPAR is effective with the ViTs of OpenCLIP [openclip] and SigLIP2 [SigLIP2], and also with DINOv3.txt [dinov3], which already includes components that make it resilient to resolution changes.

In summary, we introduce SPAR, a framework for producing resolution-agnostic ViTs that deliver strong accuracy in a single forward pass. SPAR is applied to the vision encoder of common VLMs and requires no architectural modifications or additional labels. Rather than introducing new components, SPAR focuses on a principled training strategy that ingrains resolution flexibility without sacrificing efficiency or feature-space alignment. The resulting model excels in training-free open-vocabulary segmentation across resolutions, matching or surpassing finely-strided sliding-window teachers while being up to $52\times$ faster.

2 Related Work

Open-vocabulary semantic segmentation (OVSS). Unlike standard semantic segmentation [zheng2021setr, ZegFormer], open-vocabulary semantic segmentation (OVSS) allows specifying and recognizing arbitrary categories using text. Progress on this task has closely followed advances in Vision–Language Models (VLMs). CLIP [CLIP] introduces contrastive pretraining on web-scale image–text pairs, demonstrating remarkable zero-shot classification capability. Subsequent models, such as ALIGN [ALIGN] and the SigLIP family [siglip, SigLIP2], further scale datasets and refine training objectives – the latter transitioning from contrastive formulations to more efficient, sigmoid-based matching losses – thereby improving both semantic alignment and localization. LSeg [LSeg] is the first to adapt CLIP for OVSS by aligning per-pixel visual features with frozen text embeddings. MaskCLIP [MaskCLIP] exposes spatial cues within CLIP’s vision tower by modifying its final attention layer to recover fine-grained localization.

Current methods are broadly categorized as training-based [LSeg, OpenSeg, ZegFormer, catseg, jose2025dinov2] and training-free [MaskCLIP, SCLIP, ProxyCLIP, CorrCLIP]. Training-based approaches include: (i) methods that correlate text embeddings with regions produced by decoupled mask proposal generators [OpenSeg, ZegFormer, ZSEG, OVSeg, ODISE]; (ii) models that couple CLIP features with proposal generation [FCCLIP]; and (iii) variants that directly adapt CLIP for dense prediction [FCCLIP, MaskCLIP, ClearCLIP]. Training-free methods, on the other hand, aim to preserve the pre-trained weights of CLIP without any fine-tuning. Some rectify localization weaknesses within CLIP itself by refining attention calculation and token intermixing [MaskCLIP, SCLIP, ClearCLIP, ITACLIP], others incorporate spatial priors from vision foundation models [ProxyCLIP, Trident, CorrCLIP], or aid localization with generative priors [OVDiff, DiffSegmenter, CLIPer]. Another line of work performs propagation of features or predictions to improve their spatial consistency [dinoiser, stojnic2025lposs]. Across all these works, Vision Transformers (ViTs) remain the dominant backbone, and their ingrained resolution sensitivity persists as a shared limitation, one that our method addresses.

Resolution-agnostic transformers. While Vision Transformers (ViTs) are powerful, fixed-resolution pretraining limits their robustness to varying input sizes. The naive remedy of interpolating positional encodings to match new image resolutions causes notable performance degradation [SwinTransformer, DeIT, CPE, ResFormer]. Pix2Struct [Pix2Struct] is among the first to address this issue by fixing token sequence length and resizing images with aspect ratios intact, breaking from traditional sliding-window inference [conv_window_slide, SwinTransformer]. Building on this idea, several works introduce multi-resolution training to aid robustness. ResFormer [ResFormer] trains on square crops of varying sizes and expands positional encodings during attention with neighborhood-focused components. FlexiViT [FlexiViT] removes fixed patch sizes, randomly sampling them to expose the model to diverse sequence lengths. NaViT [NaVit] factorizes positional encodings along spatial axes, enabling processing of arbitrary-resolution images with native aspect ratios. SigLIP2’s NaFlex [SigLIP2] unifies these ideas, combining flexible patching with variable sequence lengths for improved resolution generalization. However, NaFlex remains optimized for image–text alignment at the image level rather than the patch level. This limits its use for dense prediction tasks like OVSS, as demonstrated in Sec. 4. Overall, these methods enhance robustness through diverse pretraining resolutions rather than post hoc adaptation.

Recent works continue this trend. ViTAR [ViTAR] expands training resolutions via fuzzy positional encoding, using lossy compression to reduce attention costs, while UniViTAR [UniViTAR] merges image and video modeling with aspect ratio preservation, introducing non-standard ViT changes. OryxMLLM [OryxMLLM] trains on native resolutions using dynamic feature compression. In contrast, our approach adapts ViTs to varying resolutions post hoc, without requiring additional labeled data or architectural modifications.

3 Method

Figure 2: Overview of SPAR. During training, the teacher branch uses a frozen foundational vision encoder to generate feature maps via a sliding-window process followed by stitching. Stitching refers to merging the feature maps of overlapping windows into a unified representation aligned with the original image layout. The student branch, initialized from the same pre-trained weights, trains to match the teacher’s output using efficient single-pass inference. At inference time, the student model enables fast and accurate segmentation at diverse resolutions and aspect ratios using a single forward pass.

3.1 Task Definition

Open-vocabulary segmentation (OVS) aims to assign a semantic class to every pixel in an image $X\in\mathbb{R}^{3\times H\times W}$. The set of classes $\mathcal{C}$, containing $C$ class names, is arbitrary and can be specified at inference time.

3.2 Preliminary

Vision Transformers (ViTs) process images by dividing them into square patches, projecting each patch into a vector representation, and adding positional encodings based on patch location. These vectors are concatenated into a sequence and passed through transformer blocks. A special CLS token is typically included, but we omit it here.

Foundational ViTs are traditionally pre-trained on square images of fixed $K\times K$ resolution and consist of an encoder $f:\mathbb{R}^{3\times K\times K}\rightarrow\mathbb{R}^{d\times n}$ that produces a feature map $V=f(X)\in\mathbb{R}^{d\times n}$ per input image, where $n=k\cdot k$ is the number of patches, $d$ is the feature dimension, and $(k,k)$ are the spatial dimensions after reshaping into a tensor of size $d\times k\times k$. Due to patching, $k=K/P$, where $P$ is the patch size. Positional encodings are learned for that resolution only, i.e. on a $k\times k$ grid of possible positions.

Vision-Language Models (VLMs) combine a vision encoder with a text encoder to map images and text into a shared feature space. Let $F\in\mathbb{R}^{d\times C}$ denote the concatenated text features for all classes in $\mathcal{C}$.

OVS is performed by computing dot-product similarities between normalized image features and text features. The 2D map of class similarities for all classes given by

$$Y(X)=\mathrm{norm}(V)^{\top}\mathrm{norm}(F)\in[0,1]^{C\times n},$$

where $\mathrm{norm}(\cdot)$ denotes $\ell_{2}$ normalization applied along the feature dimension, is reshaped and upsampled to the resolution of $X$ using bilinear interpolation. Upsampling is necessary because the feature map $V$ has low resolution due to the patchification of the input, which limits segmentation quality, especially for small objects. Operating at higher resolutions, along with arbitrary aspect ratios, is critical for accurate segmentation. Therefore, for an image $X\in\mathbb{R}^{3\times H\times W}$ of arbitrary size, we require an encoder $f:\mathbb{R}^{3\times H\times W}\rightarrow\mathbb{R}^{d\times n}$, with $n=h\cdot w$, that produces feature maps of spatial resolution proportional to the input size, i.e. $h=H/P$, $w=W/P$.
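The zero-shot prediction step can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not taken from the paper: the function names are hypothetical, patches are assumed to be stored row by row, and nearest-neighbour repetition stands in for bilinear upsampling to keep the example dependency-free. The product is arranged as $\mathrm{norm}(F)^{\top}\mathrm{norm}(V)$ so that the result has the $C\times n$ layout.

```python
import numpy as np

def l2_normalize(x, axis=0, eps=1e-12):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def zero_shot_segment(V, F, h, w, P):
    """V: (d, n) patch features with n = h * w; F: (d, C) text features.
    Returns an (h * P, w * P) map of predicted class indices."""
    d, n = V.shape
    assert n == h * w, "feature count must match the patch grid"
    # Class-by-patch cosine similarities, shape (C, n).
    Y = l2_normalize(F).T @ l2_normalize(V)
    # Assumption: patches are ordered row by row, so a plain reshape recovers the grid.
    Y = Y.reshape(-1, h, w)
    # Stand-in for bilinear upsampling: repeat each patch score P times per axis.
    Y_up = Y.repeat(P, axis=1).repeat(P, axis=2)
    return Y_up.argmax(axis=0)

rng = np.random.default_rng(0)
H, W, P, d, C = 64, 96, 16, 8, 3
h, w = H // P, W // P
V = rng.normal(size=(d, h * w))   # placeholder image features
F = rng.normal(size=(d, C))       # placeholder text features
labels = zero_shot_segment(V, F, h, w, P)
print(labels.shape)  # (64, 96)
```

Because each patch contributes a single score per class, every $P\times P$ block of the output depends on one feature vector, which is why the feature-map resolution bounds the detail of the segmentation.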

3.3 Pre-trained Baselines

Single-pass inference at arbitrary resolutions is feasible with ViT-based VLMs since self-attention can handle sequences of varying length. The required adjustment is interpolating positional encodings to match the new resolution. However, this approach suffers from performance degradation due to the training-inference resolution mismatch, and due to interpolation, which disrupts the absolute understanding of positional information learned during training.
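A minimal sketch of the positional-encoding adjustment, assuming the encodings are stored as a $k\times k\times d$ grid. Bilinear resampling with half-pixel sample positions is chosen here for brevity; the interpolation mode and the function name `resize_bilinear` are assumptions of this example, not details stated in the paper.

```python
import numpy as np

def resize_bilinear(grid, out_h, out_w):
    """Bilinearly resample a (k_h, k_w, d) grid to (out_h, out_w, d).
    Sample positions use the half-pixel convention and are clamped at the border."""
    in_h, in_w, _ = grid.shape
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]   # vertical blend weights
    wx = (xs - x0)[None, :, None]   # horizontal blend weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

K, P, d = 512, 16, 4
k = K // P                                  # 32 x 32 grid learned at pre-training
rng = np.random.default_rng(0)
pos = rng.normal(size=(k, k, d))            # placeholder positional encodings
resized = resize_bilinear(pos, 1024 // P, 2048 // P)
print(resized.shape)  # (64, 128, 4)
```

Every resized entry is a blend of neighbouring learned encodings, so no token at the new resolution sees exactly the positional signal it was trained with.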

Sliding-window inference is a common strategy for processing an arbitrary-resolution image $X\in\mathbb{R}^{3\times H\times W}$. The image is divided into $m$ overlapping, or at least adjacent, windows of size $K\times K$, each processed independently. We denote the image windows by $\{X_{w_{i}}\}_{i=1}^{m}$, where $X_{w_{i}}\in\mathbb{R}^{3\times K\times K}$. The stride $s$ is the horizontal and/or vertical offset, in pixels, between neighboring windows and controls their overlap.
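To make the relation between stride and window count concrete, the sketch below enumerates window origins. The border handling (clamping the last window so it ends at the image edge) is an assumed convention, since the text does not specify one, and the helper names are hypothetical; the resulting counts are therefore illustrative only.

```python
def window_origins(size, K, s):
    """Top-left offsets of K-wide windows along one axis with stride s.
    Assumption: the last window is clamped so that it ends at the border."""
    if size <= K:
        return [0]
    count = -(-(size - K) // s) + 1          # ceil((size - K) / s) + 1
    return [min(i * s, size - K) for i in range(count)]

def sliding_windows(H, W, K, s):
    """All (y, x) window origins for an H x W image."""
    return [(y, x) for y in window_origins(H, K, s) for x in window_origins(W, K, s)]

H, W, K, P = 1024, 2048, 512, 16
m_fine = len(sliding_windows(H, W, K, 24))     # finely strided, stride 24
m_coarse = len(sliding_windows(H, W, K, 512))  # adjacent, non-overlapping windows
tokens_single = (H // P) * (W // P)            # tokens seen by one single pass
tokens_fine = m_fine * (K // P) ** 2           # tokens processed over all windows
print(m_fine, m_coarse)            # 1495 8
print(tokens_single, tokens_fine)  # 8192 1530880
```

Token count is only a rough proxy for cost, because self-attention scales quadratically within each sequence, but it shows how quickly the number of windows grows as the stride shrinks.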

Feature maps are computed as $V_{w_{i}}=f(X_{w_{i}})$, and class similarities as $Y(X_{w_{i}})=V_{w_{i}}^{\top}F$. The final prediction is obtained by stitching:

$$Y_{\text{stitch}}(X)=\text{stitch}(\{Y(X_{w_{i}})\}_{i=1}^{m})\in[0,1]^{C\times h\times w}. \tag{1}$$

Stitching refers to merging the individual window predictions into a single coherent output map, typically by averaging overlapping regions and aligning them to their original spatial positions.

This method preserves the native resolution of the vision encoder, avoiding single-pass issues, but incurs higher computational cost proportional to the total number of windows $m$ . Even with batch processing, sliding-window inference is slower than single-pass for small strides. Nevertheless, small strides yield better performance by increasing window overlap, allowing each pixel to be seen in multiple contexts. This redundancy improves robustness through prediction averaging, akin to test-time augmentation.

3.4 Single-Pass Any-Resolution (SPAR)

Performance vs. inference time trade-off. Observing the trade-off between speed and accuracy in single-pass versus sliding-window inference (see Fig. 1), we propose a feature distillation approach that combines their advantages. A student model is trained to mimic the feature embeddings of a sliding-window teacher while maintaining the efficiency of single-pass inference. Both models have the same architecture, allowing the student to be initialized with the same weights and keep the same feature space. A method overview is presented in Figure 2.

Sliding-window teacher. Assume $X\in\mathbb{R}^{3\times H\times W}$ is a training image of arbitrary resolution and aspect ratio. Instead of stitching predictions, we stitch features over windows of size $K\times K$ producing the teacher feature map by

$$V_{\text{teacher}}(X)=\text{stitch}(\{f(X_{w_{i}})\}_{i=1}^{m}),$$

where stitching refers to merging feature maps $f(X_{w_{i}})$ from $m$ windows into a single coherent representation aligned to the feature map layout that the single-pass model would produce, by averaging overlapping regions.
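Feature stitching by averaging can be sketched as follows for the simplest case, in which the image size, window size and all window offsets are multiples of the patch size $P$. The function name and the accumulate-and-count implementation are assumptions of this example; the per-window features are passed in as arrays rather than computed by an encoder.

```python
import numpy as np

def stitch_features(window_feats, origins, H, W, K, P):
    """Average per-window feature maps into one (d, H // P, W // P) map.
    window_feats[i]: (d, K // P, K // P) features of the window whose
    top-left pixel is origins[i] = (y, x). All sizes are multiples of P."""
    d = window_feats[0].shape[0]
    k = K // P
    acc = np.zeros((d, H // P, W // P))   # running sum of features
    cnt = np.zeros((1, H // P, W // P))   # number of windows covering each cell
    for feat, (y, x) in zip(window_feats, origins):
        r, c = y // P, x // P
        acc[:, r:r + k, c:c + k] += feat
        cnt[:, r:r + k, c:c + k] += 1
    return acc / np.maximum(cnt, 1)

# Two overlapping windows on a 32 x 48 image: the shared column is averaged.
H, W, K, P = 32, 48, 32, 16
feats = [np.zeros((2, 2, 2)), np.full((2, 2, 2), 2.0)]
stitched = stitch_features(feats, [(0, 0), (0, 16)], H, W, K, P)
print(stitched[0, 0])  # [0. 1. 2.]
```

The middle column is covered by both windows and receives the mean of their features, while the outer columns keep the value of the single window that covers them.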

Distillation loss. The student model $g:\mathbb{R}^{3\times H\times W}\rightarrow\mathbb{R}^{d\times n}$ processes the entire image in a single pass to produce $V_{\text{student}}(X)=g(X)$. The distillation loss minimizes the mean squared error between teacher and student features:

$$\mathcal{L}_{\text{distill}}=\|V_{\text{teacher}}(X)-V_{\text{student}}(X)\|_{2}^{2}.$$
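As a small illustration of the regression objective, the sketch below computes both the mean squared error named in the text and the squared norm written in the formula; the two differ only by the constant factor $d\cdot n$. The function name and the `reduction` argument are assumptions of this example, and random arrays stand in for real features.

```python
import numpy as np

def distill_loss(v_teacher, v_student, reduction="mean"):
    """Squared-error feature regression between (d, n) teacher and student maps."""
    sq_err = (v_teacher - v_student) ** 2
    if reduction == "mean":
        return sq_err.mean()   # mean squared error over all d * n entries
    return sq_err.sum()        # squared L2 (Frobenius) norm of the difference

rng = np.random.default_rng(0)
v_t = rng.normal(size=(8, 24))               # placeholder for stitched teacher features
v_s = v_t + 0.1 * rng.normal(size=(8, 24))   # placeholder for student features
mse = distill_loss(v_t, v_s)
sse = distill_loss(v_t, v_s, reduction="sum")
```

Since the scaling is constant for a fixed feature-map size, the choice mainly affects the effective learning rate when images of different sizes are mixed.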

Training. The teacher remains frozen during training. We empirically observe that training only the last blocks of the student is a good choice for the standard OVS settings, but training all parameters excels at very large resolution inference and other tasks. Optimization is performed over a dataset of images with diverse resolutions and aspect ratios. No annotations are required since the loss operates solely on feature maps; a generic image dataset suffices.

Training runs for multiple epochs, and teacher feature maps are precomputed and stored to save time. To enable easier feature-level stitching, all training images are resampled bilinearly to have dimensions divisible by the patch size $P$, and the window size $K$ is also divisible by $P$. If the stride $s$ is divisible by $P$, stitching occurs at the encoder’s native feature resolution due to full patch alignment. However, experiments show that using a stride not divisible by $P$ improves performance by exposing pixels to more diverse contexts. In such cases, if $s$ is divisible by $P/r$, stitching is performed by upsampling all feature maps by a factor of $r$ before merging, and then downsampling by a factor of $r$ after merging to restore the original feature resolution.
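The sub-patch stitching procedure can be sketched as below. Several choices are assumptions of this example rather than details given in the text: $r$ is taken as the smallest valid factor, $r=P/\gcd(s,P)$; upsampling uses nearest-neighbour repetition; downsampling uses $r\times r$ average pooling; and the function names are hypothetical.

```python
import math
import numpy as np

def upsampling_factor(s, P):
    """Smallest r such that the stride s is a multiple of P / r."""
    return P // math.gcd(s, P)

def stitch_subpatch(window_feats, origins, H, W, K, P, r):
    """Stitch on a grid r times finer than the patch grid, then pool back.
    window_feats[i]: (d, K // P, K // P); origins[i] = (y, x) in pixels,
    each a multiple of P // r."""
    d = window_feats[0].shape[0]
    cell = P // r                  # pixels per cell of the fine grid
    kf = (K // P) * r              # window side length in fine cells
    acc = np.zeros((d, H // cell, W // cell))
    cnt = np.zeros((1, H // cell, W // cell))
    for feat, (y, x) in zip(window_feats, origins):
        # Assumed nearest-neighbour upsampling by a factor of r.
        fine = feat.repeat(r, axis=1).repeat(r, axis=2)
        i, j = y // cell, x // cell
        acc[:, i:i + kf, j:j + kf] += fine
        cnt[:, i:i + kf, j:j + kf] += 1
    merged = acc / np.maximum(cnt, 1)
    # Assumed r x r average pooling back to the (H // P, W // P) patch grid.
    h, w = H // P, W // P
    return merged.reshape(d, h, r, w, r).mean(axis=(2, 4))

P, K, s = 16, 32, 24
r = upsampling_factor(s, P)        # 2, since 24 is a multiple of 16 / 2 = 8
H, W = 32, 80
origins = [(0, 0), (0, 24), (0, 48)]
feats = [np.full((1, 2, 2), v) for v in (0.0, 3.0, 6.0)]
out = stitch_subpatch(feats, origins, H, W, K, P, r)
print(out[0, 0])  # [0.   0.75 3.   5.25 6.  ]
```

In this toy case the second and fourth output cells blend contributions from two windows at sub-patch granularity, which a stride divisible by $P$ could not produce.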

Models	Inference	Upsampler	VOC21	VOC20	CS	ADE	C60	C59	$\text{Mean}_{6}$	$\Delta$ vs. single-pass	$\Delta$ vs. sliding-window
SigLIP2 [SigLIP2] – ViT-B-16
NaFlex [SigLIP2]	single-pass	bilinear	35.8	66.6	22.8	16.2	23.7	25.2	31.7		
Pre-trained	sliding-window	bilinear	45.0	77.5	38.4	21.8	30.7	34.0	41.2		ref.
Pre-trained	single-pass	bilinear	36.1	71.3	23.5	16.8	24.5	26.1	33.1	ref.	
SPAR	single-pass	bilinear	47.3	81.5	38.4	23.4	33.8	37.2	43.6	+10.5	+2.4
LPOSS [stojnic2025lposs] + Pre-trained	single-pass	bilinear	46.1	89.6	34.5	19.9	30.9	35.2	42.7	ref.	
LPOSS [stojnic2025lposs] + SPAR	single-pass	bilinear	51.2	89.7	39.2	25.8	34.5	39.8	46.7	+4.0	
Pre-trained	single-pass	AnyUp [AnyUp]	42.4	82.3	33.2	23.1	30.1	34.5	40.9	ref.	
SPAR	single-pass	AnyUp [AnyUp]	51.0	86.2	38.6	26.1	37.1	41.5	46.8	+5.9	
OpenCLIP [openclip] – ViT-B-16
Pre-trained	sliding-window	bilinear	48.5	55.1	30.8	15.4	25.2	26.6	33.6		ref.
Pre-trained	single-pass	bilinear	43.1	52.3	17.7	11.5	20.3	21.4	27.7	ref.	
SPAR	single-pass	bilinear	48.5	57.6	25.7	16.2	28.2	30.3	34.4	+6.7	+0.8
Pre-trained	single-pass	AnyUp [AnyUp]	48.5	61.0	20.9	14.8	25.3	26.9	32.9	ref.	
SPAR	single-pass	AnyUp [AnyUp]	50.3	59.2	26.1	17.0	29.2	32.0	35.6	+2.7	
DINOv3.txt [dinov3] – ViT-L-16
Pre-trained	sliding-window	bilinear	42.6	90.8	39.7	24.9	31.8	34.4	44.0		ref.
Pre-trained	single-pass	bilinear	46.0	89.5	35.9	24.4	32.1	34.8	43.8	ref.	
SPAR	single-pass	bilinear	43.1	91.3	40.1	25.4	31.6	35.0	44.4	+0.6	+0.4
Pre-trained	single-pass	AnyUp [AnyUp]	46.2	89.9	32.4	24.9	32.4	35.1	43.5	ref.	
SPAR	single-pass	AnyUp [AnyUp]	42.8	91.5	36.1	25.6	31.7	35.0	43.8	+0.3	

Training Configuration	Reported $\mathrm{Mean}_{6}$	$\overline{\mathrm{Mean}_{6}}\pm\sigma_{\mathrm{Mean}_{6}}$
SPAR model
All params	42.5	42.3 ± 0.27
Last block	43.3	43.2 ± 0.14
Last 2 blocks (default)	43.6	43.6 ± 0.04
Last 3 blocks	42.9	43.1 ± 0.15
Patch projection	38.0	38.1 ± 0.11
Positional encoding	39.6	39.6 ± 0.03
Last 2 blocks - MLP	42.7	42.8 ± 0.07
Last 2 blocks - QKV	41.7	41.7 ± 0.04

Training Set	Reported $\mathrm{Mean}_{6}$	$\overline{\mathrm{Mean}_{6}}\pm\sigma_{\mathrm{Mean}_{6}}$
SPAR model
ADE20k+CS+VOC	43.4	43.6 ± 0.27
ADE20k	43.0	43.1 ± 0.12
SA-1B 1.25k (5%)	41.1	41.0 ± 0.09
SA-1B 2.5k (10%)	42.1	41.9 ± 0.31
SA-1B 12.5k (50%)	43.2	43.0 ± 0.15
SA-1B 25k (100%)	43.6	43.6 ± 0.04
SA-1B 25k (diff. subset)	43.6	43.3 ± 0.30
SA-1B 50k (200%)	43.6	43.5 ± 0.10

SigLIP2 – ViT-B-16	VOC21	CS	ADE
Linear Probe - native resolution
Pre-trained single-pass	67.1	54.1	36.0
SPAR (Last 2 blocks)	70.2	56.2	38.1
SPAR (All)	68.9	66.7	36.5
Linear Probe - repository resolution
Pre-trained single-pass	71.2	57.0	37.7
SPAR (Last 2 blocks)	74.9	60.9	40.0
SPAR (All)	75.0	67.1	39.1

GT Panoptic	ADE				CS
	$512^{2}$	$640^{2}$	$800^{2}$	$1024^{2}$	$1024\times 2048$ (native)
Pre-trained single pass	33.9	34.3	33.6	31.0	31.5
SPAR Last 2 blocks	34.6	35.2	34.0	30.7	28.6
SPAR All parameters	37.9	39.7	41.1	39.0	52.4


[Figure panel titles: Image, Ground-truth, OpenCLIP Single-pass, OpenCLIP Sliding-window, SPAR OpenCLIP, SigLIP2 Single-pass, SigLIP2 Sliding-window, SPAR SigLIP2]