EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors

Luca Bartolomei¹,²,³ Fabio Tosi² Matteo Poggi¹,² Stefano Mattoccia¹,² Guillermo Gallego³
¹Advanced Research Center on Electronic System (ARCES), ²Department of Computer Science and Engineering (DISI), University of Bologna, Italy
³TU Berlin, Robotics Institute Germany, Einstein Center Digital Future, SCIoI Excellence Cluster, Germany
Project page: https://bartn8.github.io/eventhub

Abstract

We propose EventHub, a novel framework for training deep event stereo networks without ground-truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either both proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or only proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from the RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.

[Figure 1 panels. Top row, EventHub train data: NeRFSt [63] (two samples), ScanNet++ [75], DSEC [18]. Bottom row, test on DSEC [18]: EMatch [82], MAE 0.95px; E-FoundationStereo (Ours), MAE 0.89px. Test on M3ED [7]: EMatch [82], MAE 4.05px; E-FoundationStereo (Ours), MAE 2.53px.]

Figure 1: EventHub: LiDAR-free proxy data for robust event stereo. Our factory generates training data from multiple sources [63, 75, 18] (top), allowing our E-FoundationStereo to match EMatch [82] in-domain [18] and outperform it in generalization [7] (bottom).

1 Introduction

Now nearing its fiftieth anniversary [62], stereo matching has evolved rapidly over the past decade thanks to deep learning [62], enabling the high-accuracy, high-resolution depth maps that are crucial for applications such as autonomous driving, 3D scene reconstruction, augmented reality, and robotic navigation. Recent deep stereo models achieve remarkable performance [69, 6], even in a zero-shot manner, thanks to large quantities of labeled data, i.e., millions of labeled synthetic and real images. Acquiring those images required years of effort from the community, ranging from sophisticated active setups [55] for collecting high-accuracy real datasets to large computing resources for rendering large-scale photo-realistic synthetic datasets [69].

Recently, the introduction of the first commercial event cameras [39] led to the creation of a novel branch of stereo literature that aims to estimate depth from a pair of synchronized event cameras [20]. These sensors capture asynchronous per-pixel brightness changes occurring in the scene, so-called “events” [15, 53]. An event is characterized by a pixel coordinate (the location where the change occurred), a timestamp (when it occurred), and a polarity ($\pm 1$, indicating whether brightness increased or decreased). The asynchronous working principle enables these sensors to capture information at microsecond resolution, allowing them to surpass traditional frame-based cameras in challenging scenarios, such as fast motion (resulting in no motion blur) and high dynamic range (resulting in no over/under-exposure). As a drawback, adapting the large body of image-based computer vision algorithms to event cameras is not trivial due to the very sparse nature of events [15].

Despite the growing interest in event-based stereo matching, the availability of labeled datasets remains very limited compared to the traditional frame-based domain [20]. Capturing dense and accurate ground truth for asynchronous event streams is significantly more challenging due to the still-emergent event community and the substantial deviation from traditional frame-based cameras.

In this paper, we introduce a novel framework for training deep event stereo networks effortlessly and without any ground truth. By leveraging state-of-the-art novel view synthesis solutions [59], we can generate event stereo training data from image sequences collected with a single color camera, along with proxy depth labels. Alternatively, when paired RGB stereo images and event stereo data are available, we distill the knowledge of stereo foundation models processing the former to annotate the latter. This approach drastically reduces the need for complex data acquisition setups and large-scale manual labeling efforts, democratizing access to high-quality training data for event-based stereo (Fig. 1). With our data, we then repurpose stereo foundation models to obtain a new generation of state-of-the-art event stereo models with unparalleled generalization capabilities, which can in turn be used to further improve the original color models in challenging scenarios.

We summarize our main contributions as follows:

• We propose EventHub, the first framework combining neural rendering data generation and cross-modal distillation from RGB stereo foundation models to train event stereo networks without active sensor supervision.

• We demonstrate superior out-of-domain generalization compared to LiDAR-supervised models, reducing error by up to 50% on M3ED and MVSEC datasets.

• We establish bi-directional knowledge transfer between RGB and event modalities, enabling event models to improve the performance of RGB stereo foundation models in challenging nighttime conditions.

[Figure 2 panels: Low LiDAR Density (A), DSEC Raw Scan (7x7 dilation); Low Accumulation Density (B), DSEC Ground-Truth; Accumulation Errors (C), MVSEC Ground-Truth; Reprojection Errors (D), M3ED Ground-Truth; Non-Lambertian Surfaces (E), DSEC Ground-Truth.]

Figure 2: Limitations of LiDAR-supervised real-world datasets. Despite their popularity [18, 7, 84], LiDAR annotations remain sparse (A), poorly capture dynamic scenes (B–C), are prone to reprojection errors (D), and struggle on transparent or reflective surfaces (E).

2 Related Work

Frame-based Stereo. Stereo depth estimation has transitioned from traditional hand-crafted approaches [56] to data-driven solutions [62, 51, 33]. Early learning-based methods [76, 44] focused on individual matching components, while later works [45, 30, 8, 72] introduced fully trainable pipelines combining feature extraction, cost aggregation, and disparity prediction, by leveraging the abundance of synthetic data [45, 65] for training. Building on optical flow principles, recurrent architectures [42, 70, 66] performed iterative refinement over correlation volumes. Transformer models [22, 38, 71] further enhanced matching through global attention mechanisms. Generalizing across different environments remains challenging. Solutions include learning domain-agnostic features [78, 80], incorporating geometric constraints [2, 61], self-supervised learning with photometric consistency [21, 67, 52], distillation from traditional methods [60, 3, 10] and radiance field supervision [63, 41]. Recently, foundation models trained on massive diverse datasets have demonstrated unprecedented zero-shot capabilities [69, 6, 29, 11], establishing a new state of the art. However, the scarcity of annotated stereo data in challenging conditions (e.g., night) limits their performance in such scenarios, leaving room for improvement.

Event-based Stereo. While monocular event-based depth estimation [23, 34, 4, 86] has been explored, we focus on stereo configurations exploiting binocular geometry [20]. Early stereo methods [58, 32] relied on temporal coincidence matching via frame accumulation or event-driven search, later enhanced with epipolar geometry and temporal-luminance constraints [54, 28]. Neuromorphic implementations [14, 50] deployed cooperative networks on specialized spiking hardware, while deep learning approaches [64, 1, 81] introduced learnable representations and spatio-temporal encoders. Recent works incorporate temporal context [12, 49, 19, 24], attention mechanisms [9] and unified architectures [82] that handle both stereo and optical flow. Hybrid configurations combine events with frames: binocular setups (2E+2F) [47, 13, 49] fuse modalities through recurrent networks or selection mechanisms, while asymmetric systems (1E+1F) [68, 77, 43] address cross-modal alignment challenges. Despite progress, event stereo development is severely constrained by the limited amount of annotated data [20]. Existing datasets [18, 84, 7] remain orders of magnitude smaller than frame-based ones. They also lack diversity, which constrains the ability of models to generalize beyond their training domains. This motivates us to seek alternative training strategies.

Neural Rendering for Training Data Generation. Neural radiance fields [46] enable photorealistic novel view synthesis (NVS) from sparse images, facilitating synthetic training data generation. NeRF-supervised frameworks [63] rendered stereo pairs from monocular sequences, using synthesized images and depth as proxy supervision for stereo networks. Similar approaches emerged for optical flow [41], with confidence-based filtering to remove unreliable proxy labels. Beyond depth estimation, neural rendering has been exploited for object detection [16], learning dense descriptors [74], semantic labeling [83], 6D pose estimation [35] and automated annotation in driving scenes [27]. For event cameras, video-to-event simulators [17] recycle existing datasets into synthetic streams, while specialized engines [36] generate event data with optical flow labels. The concurrent work GS2E [37] generates multi-view event data from sparse RGB images via 3D Gaussian Splatting [31], but tests it only briefly, on NVS and image deblurring. However, no prior work has explored neural rendering for generating event stereo training data. Leveraging efficient radiance field rendering [48, 31, 59, 40], we propose the first framework to synthesize stereo event streams from monocular RGB data, enabling large-scale training without active sensors such as LiDAR.

[Figure 3 diagram labels. (i) Multi-View Images, COLMAP, Regularized Dense 3D Optimization, Virtual Trajectory Construction (local trajectories $\Gamma_{x}$, $\Gamma_{y}$, $\Gamma_{z}$; global trajectory $\Omega$), Motion-Adaptive Stereo Rendering, Rendered Stereo Events, Rendered Depth & Confidence, Rendered RGB Triplet. (ii) Stereo RGB, Teacher-SFM, Proxy Estimation, Misaligned Proxy, Reprojection, Aligned Proxy. (iii) Stereo Events, Adapted-SFM, Training, Disparity Map, Supervision, EventHub.]

Figure 3: Framework Overview: We obtain training data through two complementary approaches: (i) Event Data Factory: SVRaster [59] generates synthetic event stereo pairs and depth labels from sparse RGB images via virtual camera trajectories (left); (ii) Stereo Cross-Modal Distillation: existing RGB stereo models produce proxy depth labels for real event data in calibrated RGB-Event stereo setups (top right). (iii) Both data sources are combined in EventHub to train/adapt event stereo networks (bottom right).

3 Method

Sourcing accurate depth labels is costly and time-consuming, as it typically requires the use of active sensors such as LiDARs which, despite their high accuracy, provide very sparse data (Fig. 2 (A)). This limitation is partially mitigated by temporal accumulation; however, this strategy is ineffective on moving objects (Fig. 2 (B)) or introduces noise due to the motion of dynamic entities (Fig. 2 (C)). Finally, imprecise calibration and non-Lambertian surfaces also harm the quality of the annotations (Fig. 2 (D,E)).

Aiming to remove the dependency on noisy labeled data captured with costly LiDAR-based setups, we turn to a much cheaper data modality that is simpler to obtain and already available in abundance: color images. Through the lens of RGB cameras, we can exploit state-of-the-art depth estimation techniques to annotate data with proxy labels whose accuracy is not far from that of LiDAR sensors. Most available color images, however, are collected by a single RGB camera, usually navigating through the scene, unpaired with any event camera counterpart. In this setting, besides proxy labels, we also need to generate proxy events. Conversely, when color images are paired with event data collected within the same environment, we can exploit camera calibration and multi-view geometry to annotate the real events, without the need to generate proxy events.

The overview of our framework is shown in Fig. 3. We develop techniques to extract proxy labels in the two above-mentioned settings: (i) one based on NVS frameworks, in which a modified SVRaster [59] is used to generate both proxy events and proxy labels from RGB sequences (Sec. 3.1.1), and (ii) one leveraging robust RGB stereo matching in dual RGB–Event stereo setups, where the RGB pair provides the proxy supervision through state-of-the-art models, such as FoundationStereo [69] (Sec. 3.1.2). Moreover, to exploit the knowledge already available in the color image domain, we take a step further by exploring how to adapt pre-trained, robust RGB-based stereo matching networks [6, 69] to the event domain, thereby minimizing the need for labeled event data (Sec. 3.2).

3.1 EventHub: Data Generation

3.1.1 Synthetic Generation via Novel View Synthesis

Given sparse RGB images of a static scene, NVS frameworks [46, 31] reconstruct high-fidelity digital representations that can be rendered from arbitrary viewpoints. While NeRF-based data factories for RGB stereo exist [63], an equivalent pipeline for the event-based stereo domain remains unexplored: NVS frameworks typically output static frames rather than events, requiring fast rendering to match the event camera’s temporal resolution, plus additional components to handle frames-to-events generation and motion trajectories. Therefore, we propose a novel pipeline for event stereo data generation by leveraging SVRaster [59] as the NVS foundation framework. We now describe the proposed pipeline step by step.

Image Capture and Camera Calibration. After collecting $N$ multi-view RGB images $\hat{\mathbf{I}}_{i}$ of a static scene, we follow [63] and deploy COLMAP [57] to recover intrinsics $\hat{\mathbf{K}}\in\mathbb{R}^{3\times 3}$ and $N$ camera poses $[\hat{\mathbf{R}}_{i}|\hat{\mathbf{t}}_{i}]=\hat{\mathbf{T}}_{i}\in\mathbb{SE}(3)$ .

Regularized Dense 3D Optimization. Next, for each captured scene, we feed $\hat{\mathbf{I}}_{i}$, $\hat{\mathbf{K}}$, and $\hat{\mathbf{T}}_{i}$ to SVRaster’s training pipeline, obtaining a radiance representation of the scene. We follow [59] and use both MSE and SSIM to optimize the rendered image. The rendered color $\mathbf{I}$ and corresponding depth $\mathbf{Z}$ along each camera ray are defined as:

$$\mathbf{I}=\sum_{i=1}^{N}T_{i}\alpha_{i}\mathbf{c}_{i},\;\mathbf{Z}=\sum_{i=1}^{N}T_{i}\alpha_{i}z_{i},\;T_{i}=\prod_{j=1}^{i-1}(1-\alpha_{j}), \tag{1}$$

where $\alpha_{i}\in[0,1]$ , $T_{i}\in[0,1]$ , $\mathbf{c}_{i}\in[0,1]^{3}$ , and $z_{i}>0$ are the opacity, the transmittance [46], the color, and the depth of the $i$ -th voxel, respectively.
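As an illustration of Eq. (1), the following minimal NumPy sketch composites color and depth along a single ray. It is not SVRaster's implementation: the function name `composite_ray` and the per-ray array layout are assumptions made for this example only.

```python
import numpy as np

def composite_ray(alpha, color, z):
    """Alpha-composite color and depth along one ray, following Eq. (1).

    alpha: (N,) opacities, color: (N, 3) colors, z: (N,) depths,
    all ordered front to back (layout assumed for this sketch).
    """
    alpha = np.asarray(alpha, dtype=np.float64)
    # T_i = prod_{j<i} (1 - alpha_j); the empty product gives T_1 = 1.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    weights = transmittance * alpha  # T_i * alpha_i
    rendered_color = weights @ np.asarray(color, dtype=np.float64)
    rendered_depth = float(weights @ np.asarray(z, dtype=np.float64))
    return rendered_color, rendered_depth, weights
```

For example, a half-transparent voxel at depth 1 in front of an opaque voxel at depth 3 contributes equal weights of 0.5, so the rendered depth is 2.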

To further improve depth quality, we applied several regularizers during training: among these, (i) $\mathcal{L}_{N-\text{mean}}$ and $\mathcal{L}_{N-\text{med}}$ enforce self-consistency between rendered depth and normals, respectively aggregated using mean and median [25]; (ii) $\mathcal{L}_{\text{DAv2}}$ enforces the rendered depth to be consistent with monocular predictions from DepthAnythingV2 [73]. We studied additional regularizers $\mathcal{L}_{\text{asc}}$ , $\mathcal{L}_{\text{sparse}}$ , and $\mathcal{L}_{\text{mast3r}}$ , with further details in the supplementary material. Each regularizer’s contribution is weighted inside a regularization loss $\mathcal{L}_{\text{reg}}\doteq\lambda_{{N-\text{mean}}}\mathcal{L}_{{N-\text{mean}}}+\lambda_{N-\text{med}}\mathcal{L}_{N-\text{med}}+\lambda_{\text{asc}}\mathcal{L}_{\text{asc}}+\lambda_{\text{sparse}}\mathcal{L}_{\text{sparse}}+\lambda_{\text{DAv2}}\mathcal{L}_{\text{DAv2}}+\lambda_{\text{mast3r}}\mathcal{L}_{\text{mast3r}}$ , yielding the total loss:

$$\mathcal{L}\doteq\mathcal{L}_{\text{MSE}}+\lambda_{\text{SSIM}}\mathcal{L}_{\text{SSIM}}+\mathcal{L}_{\text{reg}}. \tag{2}$$

Figure 4: Qualitative examples of events and proxy annotations by EventHub. From top to bottom, examples obtained from NeRF-Stereo [63], ScanNet++ [75] through novel view synthesis, and from DSEC [18] through cross-modal distillation.

Virtual Trajectory Construction. Given that the captured scene is static and an event camera triggers events only when the logarithmic intensity changes exceed a threshold, we emulate such variations by simulating virtual camera egomotion. We design two types of virtual trajectories: a local trajectory $\Gamma(\tau)$ and a global trajectory $\Omega(\tau)$ . Both $\Gamma(\tau)$ and $\Omega(\tau)$ are continuous functions mapping a virtual time instant $\tau\in[0,1]$ into a virtual pose $\mathbf{T}_{\tau}\in\mathbb{SE}(3)$ . Given an initial camera pose $\hat{\mathbf{T}}_{i}$ (previously estimated using COLMAP), $\Gamma(\tau)=[\hat{\mathbf{R}}_{i}|\hat{\mathbf{t}}_{i}+\tau\mathbf{r}]$ applies a $\tau$ -weighted translation $\mathbf{r}$ along an arbitrary axis, e.g., $\mathbf{r}=(0\ 1\ 0)^{\top}$ . This simple setup is well-suited for object-centric captured scenes [63], where the quality of novel views tends to degrade as the rendering pose moves farther from those observed during training.

Conversely, the global trajectory $\Omega(\tau)$ is obtained by performing a least-squares fit of three cubic splines to a subset (typically one-half or one-third) of the estimated camera poses, producing smooth and continuous motion. Although a single cubic spline suffices to model the translation component $\mathbf{t}_{\tau}$ , two additional splines followed by a re-orthogonalization ensure that $\mathbf{R}_{\tau}\in\mathbb{SO}(3)$ . This configuration enables the synthesis of complex trajectories involving large rotations. However, to maintain meaningful viewpoints (i.e., camera orientations directed toward observed scene regions), it is generally preferable to employ this type of trajectory for indoor recordings [75].
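The two trajectory ingredients can be sketched as follows. This is a hedged illustration, not the authors' code: `local_trajectory` implements $\Gamma(\tau)$ as written above, while `rotation_from_two_axes` shows one possible re-orthogonalization (Gram-Schmidt on two interpolated axis vectors), which is an assumption since the text does not specify the scheme. The least-squares spline fit itself is omitted.

```python
import numpy as np

def local_trajectory(R, t, r, tau):
    """Gamma(tau) = [R | t + tau * r]: translation-only motion from a COLMAP pose."""
    R = np.asarray(R, dtype=np.float64)
    t = np.asarray(t, dtype=np.float64)
    return R, t + tau * np.asarray(r, dtype=np.float64)

def rotation_from_two_axes(a, b):
    """Turn two (non-parallel) interpolated axis vectors into a rotation matrix.

    Gram-Schmidt is an assumed choice for the re-orthogonalization step.
    """
    x = a / np.linalg.norm(a)
    y = b - np.dot(x, b) * x          # remove the component of b along x
    y = y / np.linalg.norm(y)
    z = np.cross(x, y)                # right-handed third axis, so det = +1
    return np.stack([x, y, z], axis=1)
```

Building the third axis with a cross product guarantees a proper rotation (determinant $+1$) even when the two spline outputs are only approximately orthonormal.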

Motion-Adaptive Stereo Rendering. After defining a virtual stereo baseline $b$ and recovering the focal length $f$ from the intrinsic matrix $\mathbf{K}$ of the virtual camera, we render the disparity map $\mathbf{D}=(b\cdot f)/{\mathbf{Z}}$ used for the supervision of stereo networks. Although depth regularization improves stability, the rendered depth maps still exhibit noise. To mitigate this, [63] proposed to extract trinocular images $\mathbf{I}_{LL}$ , $\mathbf{I}_{L}$ , $\mathbf{I}_{R}$ (where $\mathbf{I}_{LL}$ and $\mathbf{I}_{R}$ are rendered after applying respectively stereo translations $(b\ 0\ 0)^{\top}$ and $(-b\ 0\ 0)^{\top}$ to $\mathbf{t}_{\tau}$ ), balancing $\mathbf{D}$ with a trinocular photometric loss using Ambient Occlusion $\mathbf{C}_{\text{AO}}$ as the weighting term (more details in the supplementary material). To improve confidence estimation, we design a voxel-based confidence measure:

$$\mathbf{C}_{\text{Vsize}}=\text{norm}\left(\sum_{i=1}^{N}T_{i}s_{i}\right)\odot\text{norm}\left(\sum_{i=1}^{N}T_{i}\alpha_{i}\right), \tag{3}$$

where $s_{i}$ denotes the size of the $i$-th voxel along the camera ray, $\odot$ represents the Hadamard product, and $\text{norm}(\cdot)$ normalizes the input to the range $[0,1]$.
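The disparity conversion $\mathbf{D}=(b\cdot f)/\mathbf{Z}$ and the confidence of Eq. (3) can be sketched in NumPy as below. The per-pixel array layout `(H, W, N)` and the choice of per-image min-max scaling for $\text{norm}(\cdot)$ are assumptions of this example; the text only states that the output lies in $[0,1]$.

```python
import numpy as np

def depth_to_disparity(Z, baseline, focal):
    """D = (b * f) / Z for a rendered depth map Z (Z > 0 assumed)."""
    return baseline * focal / np.asarray(Z, dtype=np.float64)

def minmax_norm(x, eps=1e-12):
    """Scale an array to [0, 1]; min-max scaling is an assumed form of norm()."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.min()) / (x.max() - x.min() + eps)

def voxel_size_confidence(T, alpha, s):
    """Eq. (3): T, alpha, s have shape (H, W, N), one entry per voxel on each ray."""
    size_term = (T * s).sum(axis=-1)         # sum_i T_i * s_i
    opacity_term = (T * alpha).sum(axis=-1)  # sum_i T_i * alpha_i
    return minmax_norm(size_term) * minmax_norm(opacity_term)
```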

Given that SVRaster [59] does not natively support event generation, we leverage ESIM [17] to simulate stereo events from rendered frames: given two consecutive virtual frames $\mathbf{I}_{L}(\tau)$ and $\mathbf{I}_{L}(\tau+\Delta\tau)$ (or $\mathbf{I}_{R}(\tau)$ and $\mathbf{I}_{R}(\tau+\Delta\tau)$ if $(-b\ 0\ 0)^{\top}$ is applied), ESIM simulates the event stream along the virtual motion, assuming $\Delta\tau$ is sufficiently small. Since both $\Gamma(\tau)$ and $\Omega(\tau)$ are continuous functions, we can render frames with arbitrary $\Delta\tau$, avoiding frame interpolation [17]; however, choosing $\Delta\tau$ is not trivial: values that are too large introduce simulation artifacts, while values that are too small lead to redundant computation. Starting from a conservative value, e.g., $\Delta\tau=\frac{1}{32}$, we dynamically adapt it using pixel motion: given the depth map $\mathbf{Z}$ and the camera poses $\mathbf{T}_{\tau}$ and $\mathbf{T}_{\tau+\Delta\tau}$, we compute the optical flow by projecting 3D points reconstructed from $\mathbf{Z}$ into the next view using the known relative motion $\mathbf{T}_{\tau\rightarrow\tau+\Delta\tau}$ and intrinsics $\mathbf{K}$. The flow field $\mathbf{F}$ is then obtained as the displacement between corresponding pixel projections across the two frames. To ensure bounded pixel motion and prevent event artifacts, we set the number of intermediate renderings between $\tau$ and $\tau+\Delta\tau$ to $2^{n}$ with $n=\max(\lceil\log_{2}(|\mathbf{F}|_{\max})\rceil,0)$.
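The motion-adaptive rule can be sketched as follows. This is an illustrative NumPy version under stated assumptions: `T_rel` is taken to be a $4\times 4$ matrix mapping points from the current camera frame to the next one, and $|\mathbf{F}|_{\max}$ is read as the largest per-pixel flow magnitude in pixels. Both function names are hypothetical.

```python
import math
import numpy as np

def flow_from_depth(Z, K, T_rel):
    """Rigid flow induced by the relative pose T_rel (4x4, current -> next camera)."""
    H, W = Z.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T.astype(np.float64)
    pts = (np.linalg.inv(K) @ pix) * Z.reshape(1, -1)    # back-project to 3D
    pts_next = T_rel[:3, :3] @ pts + T_rel[:3, 3:4]      # move to the next view
    proj = K @ pts_next
    uv_next = proj[:2] / proj[2:3]                       # perspective division
    return (uv_next - pix[:2]).T.reshape(H, W, 2)

def num_intermediate_renderings(flow):
    """2**n with n = max(ceil(log2(|F|_max)), 0)."""
    max_mag = float(np.linalg.norm(flow, axis=-1).max())
    if max_mag <= 1.0:          # log2 would be <= 0 (or undefined at 0): n = 0
        return 1
    return 2 ** max(math.ceil(math.log2(max_mag)), 0)
```

With this rule, the per-step pixel displacement stays at or below roughly one pixel: a maximum flow of 5 px, for instance, yields $n=3$ and therefore 8 intermediate renderings.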

Dataset	MIX 1	MIX 2	MIX 3	MIX 4
NeRF-Stereo [63]	✓	✓		✓
ScanNet++ [75]		✓		✓
DSEC [18]			✓	✓

Table 1: Combinations of datasets used by EventHub. Proxy labels are applied to annotate each dataset.

	Training Method	SE-CFF [49]				EMatch [82]				E-FoundationStereo				E-StereoAnywhere				Avg Rank
		1PE	2PE	3PE	MAE	1PE	2PE	3PE	MAE	1PE	2PE	3PE	MAE	1PE	2PE	3PE	MAE	
(A)	Photometric [63]	88.54	71.73	55.35	7.94	92.31	69.17	44.90	3.37	93.85	73.12	49.82	3.65	92.55	72.25	51.65	4.11	6.94
(B)	EV-SceneFlow [17, 45]	66.30	50.18	41.47	3.50	71.64	53.67	42.02	3.56	61.80	48.04	41.68	3.10	64.97	49.54	41.74	3.21	6.06
(C)	MIX 1	45.61	25.15	16.54	1.87	47.17	23.23	13.58	1.70	38.70	15.63	8.83	1.39	35.86	15.13	9.03	1.36	5.00
	MIX 2	38.52	18.17	11.02	1.56	41.06	17.81	9.97	1.45	27.20	9.72	5.53	1.04	31.49	12.45	7.32	1.23	4.00
	MIX 3	24.73	8.58	5.08	1.01	31.30	11.14	5.90	1.15	20.99	6.82	4.10	0.89	24.35	8.22	4.76	0.99	2.75
	MIX 4	27.31	9.69	5.66	1.07	26.13	8.55	4.71	0.99	20.42	6.53	3.91	0.87	23.90	7.97	4.62	0.96	2.25
(D)	LiDAR (GT)	13.82	4.05	2.37	0.66	24.11	7.80	3.99	0.89	12.53	3.48	1.98	0.60	14.66	4.32	2.51	0.69	1.00

Table 2: In-domain experimental results – DSEC [18] dataset. We train four event stereo models according to different training protocols, not exploiting LiDAR annotations (A,B,C), compared against in-domain training on DSEC with LiDAR labels (D).

3.1.2 Stereo Cross-Modal Distillation

In the second setting, we assume the availability of data collected in the same environment by two calibrated sensors: RGB and event cameras. In this case, we can obtain proxy labels from color images and transfer them to the events domain by exploiting multi-view geometry, if needed.

More specifically, as we focus on the event stereo matching task, we assume the availability of paired color and event stereo pairs $(\mathbf{I}_{L},\mathbf{I}_{R})-(\mathbf{E}_{L},\mathbf{E}_{R})$, which is often the case for the most popular event stereo datasets in the literature [84, 18, 7]. Accordingly, we can use an off-the-shelf, state-of-the-art Stereo Foundation Model (SFM) [6, 69] $\Phi_{c}$ to predict proxy labels by processing a color stereo pair $(\mathbf{I}_{L},\mathbf{I}_{R})$. Then, knowing the relative pose $\mathbf{T}_{c\rightarrow e}$ between the left color camera and the left event camera, we can transfer the labels and annotate the event data.

Specifically, the disparity map predicted by $\Phi_{c}$ is converted into depth $\mathbf{Z}_{c}$ using the baseline $b_{c}$ and focal length $f_{c}$ of the color stereo camera:

$$\mathbf{Z}_{c}=(b_{c}\cdot f_{c})/\mathbf{D}_{c},\quad\quad\text{with}\quad\mathbf{D}_{c}=\Phi_{c}(\mathbf{I}_{L},\mathbf{I}_{R}). \tag{4}$$

Then, letting $\mathbf{u}_{c}$ be a pixel in homogeneous coordinates on the color camera frame, we back-project it into a 3D point $\mathbf{p}_{c}$ according to depth $\mathbf{Z}_{c}(\mathbf{u}_{c})$ and intrinsics $\mathbf{K}_{c}$. The point $\mathbf{p}_{c}$ is then expressed in the event camera reference system by applying the transformation $\mathbf{T}_{c\rightarrow e}$ between the two cameras, obtaining

$$\mathbf{p}_{e}={\mathbf{T}}_{c\rightarrow e}\mathbf{p}_{c},\quad\quad\text{with}\quad\mathbf{p}_{c}=\mathbf{Z}_{c}(\mathbf{u}_{c})\mathbf{K}_{c}^{-1}\mathbf{u}_{c}. \tag{5}$$

Next, we project $\mathbf{p}_{e}$ onto the event camera image plane according to intrinsics $\mathbf{K}_{e}$ and store its $z$ coordinate at the resulting pixel. Doing this for every pixel in $\mathbf{I}_{L}$ yields a depth map $\mathbf{Z}_{e}$ aligned with $\mathbf{E}_{L}$. Finally, we obtain the disparity map through triangulation, yielding proxy labels $\mathbf{D}_{e}={(b_{e}\cdot f_{e})}/{\mathbf{Z}_{e}}$ for the event stereo pair $(\mathbf{E}_{L},\mathbf{E}_{R})$, where $b_{e}$ and $f_{e}$ are the baseline and focal length of the event stereo camera. In this way, we can distill the knowledge of state-of-the-art stereo models [6, 69] and reuse it in the event domain to pursue the same advances achieved in RGB stereo. This procedure is not needed if a sensor such as the DAVIS camera is available, providing pixel-aligned grayscale images and event streams [84]: in such a case, the initial disparity map $\mathbf{D}_{c}$ already coincides with $\mathbf{D}_{e}$.
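The label transfer of Eqs. (4) and (5) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `transfer_disparity`, nearest-pixel splatting, the z-buffer that keeps the closest point when several land on the same event pixel, and the use of 0 to mark missing labels are all assumptions of this example.

```python
import numpy as np

def transfer_disparity(D_c, K_c, K_e, T_ce, b_c, f_c, b_e, f_e, out_shape):
    """Warp a color-camera disparity map into event-camera proxy disparities."""
    H, W = D_c.shape
    valid = D_c > 0
    Z_c = np.where(valid, b_c * f_c / np.maximum(D_c, 1e-9), 0.0)   # Eq. (4)
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T.astype(np.float64)
    p_c = (np.linalg.inv(K_c) @ pix) * Z_c.reshape(1, -1)           # Eq. (5), back-projection
    p_e = T_ce[:3, :3] @ p_c + T_ce[:3, 3:4]                        # Eq. (5), change of frame
    z = p_e[2]
    ok = valid.reshape(-1) & (z > 0)
    proj = K_e @ p_e[:, ok]
    x = np.round(proj[0] / proj[2]).astype(int)   # nearest-pixel splatting (assumed)
    y = np.round(proj[1] / proj[2]).astype(int)
    z = z[ok]
    He, We = out_shape
    inside = (x >= 0) & (x < We) & (y >= 0) & (y < He)
    Z_e = np.full(out_shape, np.inf)
    # z-buffer (assumed): keep the nearest depth per event pixel.
    np.minimum.at(Z_e, (y[inside], x[inside]), z[inside])
    Z_e[np.isinf(Z_e)] = 0.0                      # 0 marks pixels without a label
    return np.where(Z_e > 0, b_e * f_e / np.maximum(Z_e, 1e-9), 0.0)
```

When the two cameras share intrinsics, pose, baseline, and focal length, the function returns the input disparity unchanged, which mirrors the pixel-aligned DAVIS case mentioned above.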

3.2 Repurposing RGB Stereo into Event Stereo

Besides exploiting RGB stereo models to distill proxy annotations for the event domain, we further benefit from the vast priors they learned from the abundant RGB stereo data available by repurposing the models themselves into event stereo models. In other words, we design and train an event stereo model $\Phi_{e}$ having the same architecture as an RGB stereo one, starting from pre-trained weights $\Phi_{c}$ (i.e., those used to distill proxy labels).

To this end, keeping the number of input channels unchanged between RGB and event stereo avoids any modification to the original deep neural network model. We therefore encode event streams into stacked tensors using the 3-channel Tencode representation [26], sampling a fixed number of events backward in time, and pass them as inputs to $\Phi_{e}$:

$$(x,y,t,p)\!\rightarrow\!\mathbf{S}(x,y)=\begin{cases}(1,\,\tfrac{t_{\max}-t}{\Delta t},\,0),&p>0\\ (0,\,\tfrac{t_{\max}-t}{\Delta t},\,1),&p\leq 0,\end{cases} \tag{6}$$

with $t_{\max}$ being the timestamp of the most recent event in the time window $\Delta t$ over which events are stacked.
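Eq. (6) can be sketched as a simple per-event loop. This example is an illustration under assumptions, not the reference Tencode code: events are assumed to be sorted by increasing timestamp, $\Delta t$ is taken as the time span of the selected events, and newer events overwrite older ones at the same pixel.

```python
import numpy as np

def tencode(events, height, width, num_events):
    """Encode the most recent `num_events` events as an (H, W, 3) tensor, Eq. (6).

    events: (M, 4) array with columns x, y, t, p, sorted by increasing t.
    """
    ev = np.asarray(events, dtype=np.float64)
    ev = ev[max(len(ev) - num_events, 0):]        # sample backward in time
    S = np.zeros((height, width, 3))
    if len(ev) == 0:
        return S
    t_max = ev[-1, 2]
    dt = max(t_max - ev[0, 2], 1e-12)             # assumed definition of Delta t
    for x, y, t, p in ev:                         # chronological: newer overwrite older
        tau = (t_max - t) / dt
        S[int(y), int(x)] = (1.0, tau, 0.0) if p > 0 else (0.0, tau, 1.0)
    return S
```

Because the output has three channels, it can be fed to a network designed for RGB input without changing its first layer, which is the motivation stated above.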

4 Experiments

4.1 Implementation and Experimental Settings

EventHub Settings. We collect proxy data from multiple sources [63, 75, 18]. For datasets without ready-to-use events [63, 75], we employ our NVS pipeline to synthesize both proxy events and depth, while for [18] we apply our distillation pipeline to estimate proxy depth only. Each NVS scene is optimized independently, setting $\lambda_{\text{SSIM}}=0.02$ , $\lambda_{{N-\text{mean}}}=\lambda_{N-\text{med}}=0.0005$ , $\lambda_{\text{DAv2}}=0.01$ and disabling all other regularizers. We use three local trajectories $\Gamma_{x},\Gamma_{y},\Gamma_{z}$ (one per axis) for [63], and a global trajectory $\Omega$ for [75] with additional processing described in the supplementary material.

For NeRF-Stereo [63], we render the 270 scenes three times (one for each baseline $b\in\{0.1,0.3,0.5\}$) at $640\times 480$ px resolution, setting $\Delta\tau=0.03$. For our selection of 403 ScanNet++ [75] scenes, we render at both $640\times 480$ px and $1280\times 720$ px resolutions, each with three baselines $b\in\{0.05,0.08,0.1\}$, setting $\Delta\tau=0.015$. To filter noisy labels, we adopt the curation pipeline of [69]: we train SE-CFF [49] paired with Tencode [26] on all the NVS data and discard samples with excessive pixel errors, yielding $\sim 70$ k curated pairs. For DSEC [18] proxy labeling, we retain the train split of [5] (excluding night sequences) and generate proxy labels via FoundationStereo [69] ViT-L, clipping depth to $[0.5,100]$ m before reprojection, obtaining a total of $\sim 30$ k samples. Figure 4 shows some annotated examples generated from the three datasets by EventHub, while Table 1 reports different mixtures of training data derived from them and used in our experiments.

Stereo Models and Training Settings. We evaluate two event-based stereo networks [49, 82], using Tencode [26] and VoxelGrid [85] event representations, respectively, and two RGB-based models [69, 6] adapted through our repurposing strategy and dubbed E-FoundationStereo and E-StereoAnywhere, respectively. Event networks are trained from scratch with a learning rate (lr) of $5\cdot 10^{-4}$ , while RGB models are fine-tuned from the authors’ ViT-S checkpoints using $\text{lr}=5\cdot 10^{-5}$ and freezing the DAv2-S prior only. Training is performed in PyTorch with the AdamW optimizer, OneCycle learning rate scheduler, and data augmentations including random cropping at $576\times 448$ px. All models are trained for 10 epochs on a single A100 GPU with batch size 2. On NVS data [63, 75] we use the NeRF-supervised loss [63], while on distilled data [18] and non-EventHub data we use the original loss of each model. These settings are used for all experiments.

[Figure 5 panels. Top row: Events, Photometric, EV-SceneFlow. Bottom row: MIX 3 (ours), MIX 4 (ours), LiDAR (GT).]

Figure 5: Qualitative results on DSEC dataset [18]. Predictions by E-FoundationStereo trained according to different protocols.

Model	Training Method	M3ED (Day) 1PE	M3ED (Day) 2PE	M3ED (Day) 3PE	M3ED (Day) MAE	M3ED (Night) 1PE	M3ED (Night) 2PE	M3ED (Night) 3PE	M3ED (Night) MAE	M3ED (Indoor) 1PE	M3ED (Indoor) 2PE	M3ED (Indoor) 3PE	M3ED (Indoor) MAE	Avg Rank
SE-CFF [49]	MIX 3	\cellcolorsecondcolor46.84	\cellcolorsecondcolor25.31	\cellcolorsecondcolor16.99	\cellcolorsecondcolor2.81	\cellcolorsecondcolor58.32	\cellcolorsecondcolor36.07	24.45	3.50	51.03	31.43	23.84	4.52	2.50
SE-CFF [49]	MIX 4	\cellcolorfirstcolor35.65	\cellcolorfirstcolor15.18	\cellcolorfirstcolor9.37	\cellcolorfirstcolor1.22	\cellcolorfirstcolor51.57	\cellcolorfirstcolor26.84	\cellcolorfirstcolor15.33	\cellcolorfirstcolor1.70	\cellcolorsecondcolor48.56	\cellcolorfirstcolor27.36	\cellcolorfirstcolor19.33	\cellcolorfirstcolor2.95	\cellcolorfirstcolor1.08
SE-CFF [49]	LiDAR (GT)	58.82	41.41	32.93	3.05	58.94	36.78	\cellcolorsecondcolor23.55	\cellcolorsecondcolor2.20	\cellcolorfirstcolor45.33	\cellcolorsecondcolor28.39	\cellcolorsecondcolor21.59	\cellcolorsecondcolor4.48	\cellcolorsecondcolor2.42
EMatch [82]	MIX 3	86.18	77.61	72.13	40.36	90.02	84.94	82.26	45.41	76.99	65.23	58.98	15.80	3.00
EMatch [82]	MIX 4	\cellcolorfirstcolor43.99	\cellcolorfirstcolor20.87	\cellcolorfirstcolor12.93	\cellcolorfirstcolor2.23	\cellcolorfirstcolor63.69	\cellcolorfirstcolor38.74	\cellcolorfirstcolor26.38	\cellcolorfirstcolor5.03	\cellcolorfirstcolor58.42	\cellcolorfirstcolor32.16	\cellcolorfirstcolor21.48	\cellcolorfirstcolor3.10	\cellcolorfirstcolor1.00
EMatch [82]	LiDAR (GT)	\cellcolorsecondcolor83.16	\cellcolorsecondcolor71.65	\cellcolorsecondcolor62.81	\cellcolorsecondcolor12.22	\cellcolorsecondcolor83.36	\cellcolorsecondcolor73.03	\cellcolorsecondcolor66.06	\cellcolorsecondcolor18.63	\cellcolorsecondcolor64.34	\cellcolorsecondcolor46.85	\cellcolorsecondcolor38.80	\cellcolorsecondcolor7.95	\cellcolorsecondcolor2.00
E-FoundationStereo	MIX 3	\cellcolorsecondcolor33.44	\cellcolorsecondcolor19.20	\cellcolorsecondcolor12.37	\cellcolorsecondcolor1.49	\cellcolorsecondcolor49.26	\cellcolorsecondcolor26.09	\cellcolorsecondcolor14.94	\cellcolorsecondcolor1.84	\cellcolorsecondcolor40.27	22.08	\cellcolorsecondcolor15.73	\cellcolorfirstcolor2.37	\cellcolorsecondcolor2.00
E-FoundationStereo	MIX 4	\cellcolorfirstcolor26.38	\cellcolorfirstcolor11.57	\cellcolorfirstcolor6.96	\cellcolorfirstcolor0.98	\cellcolorfirstcolor46.90	\cellcolorfirstcolor23.09	\cellcolorfirstcolor12.96	\cellcolorfirstcolor1.54	40.74	\cellcolorfirstcolor21.83	\cellcolorfirstcolor15.61	\cellcolorsecondcolor2.45	\cellcolorfirstcolor1.25
E-FoundationStereo	LiDAR (GT)	54.80	39.48	31.43	2.89	55.77	34.93	22.60	1.99	\cellcolorfirstcolor38.93	\cellcolorsecondcolor22.03	15.96	2.87	2.75
E-StereoAnywhere	MIX 3	\cellcolorsecondcolor47.53	\cellcolorsecondcolor27.85	\cellcolorsecondcolor18.83	4.48	\cellcolorsecondcolor59.21	\cellcolorsecondcolor34.29	\cellcolorsecondcolor21.21	3.64	46.02	26.71	19.55	\cellcolorsecondcolor3.19	\cellcolorsecondcolor2.42
E-StereoAnywhere	MIX 4	\cellcolorfirstcolor34.99	\cellcolorfirstcolor13.01	\cellcolorfirstcolor7.88	\cellcolorfirstcolor1.12	\cellcolorfirstcolor58.62	\cellcolorfirstcolor27.85	\cellcolorfirstcolor14.42	\cellcolorfirstcolor1.71	\cellcolorfirstcolor42.74	\cellcolorfirstcolor23.79	\cellcolorfirstcolor16.84	\cellcolorfirstcolor2.58	\cellcolorfirstcolor1.00
E-StereoAnywhere	LiDAR (GT)	63.70	43.33	33.90	\cellcolorsecondcolor3.26	63.23	41.22	26.72	\cellcolorsecondcolor2.78	\cellcolorsecondcolor44.27	\cellcolorsecondcolor25.81	\cellcolorsecondcolor18.80	3.72	2.58

Table 3: Out-of-domain experimental results – M3ED [7] dataset. We compare the generalization capability of the four event stereo models trained with MIX 3 and MIX 4 against their counterparts trained using DSEC LiDAR labels.
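The Avg Rank column can be reproduced with a short sketch. It assumes that Avg Rank is the mean, over the twelve error metrics, of each training method's rank among the three methods for a given model (rank 1 being the lowest error); the paper does not spell out this definition, but it matches the values reported for SE-CFF. The function name `average_ranks` is hypothetical.

```python
def average_ranks(scores):
    """Mean per-metric rank of each method (lowest error gets rank 1).

    `scores` maps a method name to its list of error metrics; all lists
    must have the same length. Ties are not treated specially here.
    """
    methods = list(scores)
    num_metrics = len(scores[methods[0]])
    totals = {m: 0 for m in methods}
    for k in range(num_metrics):
        ordered = sorted(methods, key=lambda m: scores[m][k])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / num_metrics for m in methods}


# SE-CFF rows of Table 3: Day, Night, Indoor, each with 1PE, 2PE, 3PE, MAE.
se_cff = {
    "MIX 3": [46.84, 25.31, 16.99, 2.81, 58.32, 36.07, 24.45, 3.50,
              51.03, 31.43, 23.84, 4.52],
    "MIX 4": [35.65, 15.18, 9.37, 1.22, 51.57, 26.84, 15.33, 1.70,
              48.56, 27.36, 19.33, 2.95],
    "LiDAR (GT)": [58.82, 41.41, 32.93, 3.05, 58.94, 36.78, 23.55, 2.20,
                   45.33, 28.39, 21.59, 4.48],
}
ranks = average_ranks(se_cff)
```

Rounded to two decimals, the result gives 2.50, 1.08 and 2.42 for MIX 3, MIX 4 and LiDAR (GT), in agreement with the table.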

Model	Training Method	MVSEC (Day) 1PE	MVSEC (Day) 2PE	MVSEC (Day) 3PE	MVSEC (Day) MAE	MVSEC (Night) 1PE	MVSEC (Night) 2PE	MVSEC (Night) 3PE	MVSEC (Night) MAE	MVSEC (Indoor) 1PE	MVSEC (Indoor) 2PE	MVSEC (Indoor) 3PE	MVSEC (Indoor) MAE	Avg Rank
SE-CFF [49]	MIX 3	\cellcolorsecondcolor77.13	\cellcolorsecondcolor53.87	\cellcolorsecondcolor31.12	\cellcolorsecondcolor3.21	\cellcolorsecondcolor78.19	\cellcolorsecondcolor54.75	\cellcolorsecondcolor31.24	\cellcolorsecondcolor3.64	\cellcolorsecondcolor42.14	\cellcolorsecondcolor24.53	18.04	3.30	\cellcolorsecondcolor2.17
SE-CFF [49]	MIX 4	\cellcolorfirstcolor31.99	\cellcolorfirstcolor12.00	\cellcolorfirstcolor6.88	\cellcolorfirstcolor1.11	\cellcolorfirstcolor40.54	\cellcolorfirstcolor19.00	\cellcolorfirstcolor10.13	\cellcolorfirstcolor1.45	\cellcolorfirstcolor29.66	\cellcolorfirstcolor12.34	\cellcolorfirstcolor6.89	\cellcolorfirstcolor1.39	\cellcolorfirstcolor1.00
SE-CFF [49]	LiDAR (GT)	97.82	94.85	90.47	6.12	96.89	92.95	87.14	6.07	46.67	26.37	\cellcolorsecondcolor15.43	\cellcolorsecondcolor1.78	2.83
EMatch [82]	MIX 3	\cellcolorsecondcolor93.80	\cellcolorsecondcolor80.23	\cellcolorsecondcolor59.05	\cellcolorsecondcolor6.00	\cellcolorsecondcolor91.74	\cellcolorsecondcolor75.37	\cellcolorsecondcolor49.51	\cellcolorsecondcolor4.67	\cellcolorsecondcolor58.44	\cellcolorsecondcolor35.05	\cellcolorsecondcolor24.24	3.02	\cellcolorsecondcolor2.08
EMatch [82]	MIX 4	\cellcolorfirstcolor56.29	\cellcolorfirstcolor21.61	\cellcolorfirstcolor6.67	\cellcolorfirstcolor1.39	\cellcolorfirstcolor68.51	\cellcolorfirstcolor40.48	\cellcolorfirstcolor14.92	\cellcolorfirstcolor1.81	\cellcolorfirstcolor46.03	\cellcolorfirstcolor21.40	\cellcolorfirstcolor12.36	\cellcolorfirstcolor1.93	\cellcolorfirstcolor1.00
EMatch [82]	LiDAR (GT)	99.47	98.36	95.83	6.70	98.56	96.20	92.40	6.21	66.21	43.21	28.13	\cellcolorsecondcolor2.60	2.92
E-FoundationStereo	MIX 3	\cellcolorsecondcolor81.78	\cellcolorsecondcolor54.75	\cellcolorsecondcolor28.95	\cellcolorsecondcolor2.75	\cellcolorsecondcolor81.89	\cellcolorsecondcolor58.54	\cellcolorsecondcolor36.27	\cellcolorsecondcolor2.73	\cellcolorsecondcolor34.45	\cellcolorsecondcolor18.51	\cellcolorsecondcolor12.48	1.62	\cellcolorsecondcolor2.08
E-FoundationStereo	MIX 4	\cellcolorfirstcolor45.94	\cellcolorfirstcolor20.92	\cellcolorfirstcolor9.45	\cellcolorfirstcolor1.33	\cellcolorfirstcolor58.15	\cellcolorfirstcolor38.14	\cellcolorfirstcolor18.02	\cellcolorfirstcolor1.75	\cellcolorfirstcolor24.55	\cellcolorfirstcolor9.11	\cellcolorfirstcolor5.29	\cellcolorfirstcolor1.07	\cellcolorfirstcolor1.00
E-FoundationStereo	LiDAR (GT)	97.91	94.65	89.45	6.04	97.32	94.15	89.74	6.26	40.19	21.27	12.62	\cellcolorsecondcolor1.61	2.92
E-StereoAnywhere	MIX 3	\cellcolorsecondcolor77.22	\cellcolorsecondcolor60.04	\cellcolorsecondcolor42.41	\cellcolorsecondcolor4.37	\cellcolorsecondcolor75.97	\cellcolorsecondcolor57.26	\cellcolorsecondcolor35.88	\cellcolorsecondcolor3.47	\cellcolorsecondcolor40.68	\cellcolorsecondcolor24.74	\cellcolorsecondcolor19.64	4.18	\cellcolorsecondcolor2.08
E-StereoAnywhere	MIX 4	\cellcolorfirstcolor68.18	\cellcolorfirstcolor44.60	\cellcolorfirstcolor20.92	\cellcolorfirstcolor1.96	\cellcolorfirstcolor72.21	\cellcolorfirstcolor47.93	\cellcolorfirstcolor20.39	\cellcolorfirstcolor1.96	\cellcolorfirstcolor21.27	\cellcolorfirstcolor7.84	\cellcolorfirstcolor4.48	\cellcolorfirstcolor0.94	\cellcolorfirstcolor1.00
E-StereoAnywhere	LiDAR (GT)	98.39	95.68	90.43	7.85	97.59	94.77	89.27	8.12	49.80	31.71	21.39	\cellcolorsecondcolor2.79	2.92

Table 4: Out-of-domain experimental results – MVSEC [84] dataset. We compare the generalization capability of the four event stereo models trained with MIX 3 and MIX 4 against their counterparts trained using DSEC LiDAR labels.

4.2 Evaluation Datasets & Protocol

Datasets. We run our evaluation on three main datasets: DSEC [18], M3ED [7] and MVSEC [84]. DSEC features $640\times 480$ px event stereo pairs captured by Prophesee Gen3.1 sensors, with ground truth depth annotations obtained from a 32-line LiDAR whose scans are accumulated and post-processed. We use the validation split proposed in [5] to evaluate in-domain performance. M3ED and MVSEC are instead used to evaluate the generalization performance of models trained under different paradigms (no data from these datasets is used for training). M3ED contains $1280\times 720$ px event stereo pairs captured by Prophesee IMX636 sensors and annotated by a 64-line LiDAR. MVSEC provides $346\times 260$ px event stereo pairs captured by DAVIS346B sensors, annotated with a 16-line LiDAR accumulated via LOAM [79].

Evaluation Metrics. We evaluate the networks using two main disparity metrics: the Mean-Absolute-Error (MAE) in pixels, and the percentage of pixels having an absolute disparity error larger than a specific threshold, set to 1, 2, and 3 pixels (namely, 1PE, 2PE, and 3PE).
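As a minimal sketch of these metrics, the following plain-Python function computes MAE and the n-pixel error rates over valid pixels. The name `disparity_metrics`, the flat-list inputs and the use of `None` to mark pixels without ground truth are assumptions made for illustration; the numbers in the example are made up.

```python
def disparity_metrics(pred, gt, thresholds=(1, 2, 3)):
    """Return MAE (px) and n-pixel error rates (%) over valid pixels.

    `pred` and `gt` are flat sequences of disparities; pixels whose ground
    truth is None are skipped (an assumed convention for sparse LiDAR).
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    metrics = {"MAE": sum(errors) / len(errors)}
    for n in thresholds:
        # Percentage of pixels whose absolute error exceeds n pixels.
        metrics[f"{n}PE"] = 100.0 * sum(e > n for e in errors) / len(errors)
    return metrics


pred = [10.0, 12.5, 30.0, 8.0, 5.0]
gt = [10.5, 10.0, 26.0, None, 5.0]
m = disparity_metrics(pred, gt)
```

Since only pixels with a LiDAR return contribute, regions without ground truth cannot influence any of the four scores.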

We highlight the best and second-best scores.

4.3 In-Domain Evaluation

We first assess how the different mixtures of data generated by EventHub impact the accuracy of trained models, and compare our training strategies with existing LiDAR-free alternatives as well as with LiDAR supervision. Table 2 collects the outcome of this evaluation, carried out by training four event stereo models according to four main strategies. The first two rows report results obtained by training the models: (A) using a photometric loss between DSEC's RGB stereo images projected into the event frame, or (B) using a synthetic event dataset derived from SceneFlow [45] via [17], which provides perfect ground-truth disparities, yet proxy event data. Both approaches are scarcely effective, with MAE never dropping below 3 pixels for any model.

Then, we report the results achieved by training on the data and annotations produced by EventHub (C), involving different mixtures of data. Notably, MIX 1 already yields substantially lower errors, benefiting from the stronger supervision of the proxy labels rendered by SVRaster. Adding data from ScanNet++ (MIX 2) yields moderate improvements. Training on MIX 3 further boosts performance across all stereo models, which is not surprising since it involves training data from the same domain used in the evaluation. Nonetheless, combining all data sources (MIX 4) produces the best overall performance for EMatch [82], E-FoundationStereo and E-StereoAnywhere. The bottom row reports the accuracy obtained by running in-domain supervised training using LiDAR ground truth (D), which unsurprisingly yields the lowest errors. Still, models trained with MIX 4 get very close to this upper bound despite not using any ground-truth annotation from LiDAR.

Despite the thoroughness of Tab. 2, the shortcomings of LiDAR data for both training and evaluation are not fully apparent from the numbers alone. Figure 5 shows how the sparse nature of LiDAR annotations hampers the network's ability to produce dense and accurate disparity maps. In contrast, training with MIX 3 already avoids most of the artifacts introduced by LiDAR supervision, even though this is not reflected in the error metrics, which are themselves computed against LiDAR data.

Model	Training Method	DSEC (Night) 1PE	DSEC (Night) 2PE	DSEC (Night) 3PE	DSEC (Night) MAE
FoundationStereo (ViT-S)	Author’s Weights [69]	68.40	47.69	35.80	3.89
FoundationStereo (ViT-S)	MIX 3 (Night)	\cellcolorfirstcolor24.75	\cellcolorfirstcolor8.09	\cellcolorfirstcolor4.25	\cellcolorfirstcolor1.01
FoundationStereo (ViT-L)	Author’s Weights [69]	30.06	14.87	10.88	1.87
FoundationStereo (ViT-L)	MIX 3 (Night)	\cellcolorfirstcolor25.33	\cellcolorfirstcolor8.48	\cellcolorfirstcolor4.56	\cellcolorfirstcolor1.06
StereoAnywhere (ViT-S)	Author’s Weights [6]	33.01	14.24	8.93	1.61
StereoAnywhere (ViT-S)	MIX 3 (Night)	\cellcolorfirstcolor30.61	\cellcolorfirstcolor10.81	\cellcolorfirstcolor5.72	\cellcolorfirstcolor1.22
StereoAnywhere (ViT-L)	Author’s Weights [6]	\cellcolorfirstcolor31.34	13.20	8.27	1.52
StereoAnywhere (ViT-L)	MIX 3 (Night)	32.42	\cellcolorfirstcolor11.45	\cellcolorfirstcolor5.93	\cellcolorfirstcolor1.23

Table 5: Experimental results on DSEC night images [18] – RGB stereo models. Fine-tuning SFMs on proxy labels derived from our event models yields improvements on nighttime images.

4.4 Out-of-Domain Evaluation

We now extend our evaluation beyond the single domain represented by DSEC, moving to M3ED and MVSEC. These cover both indoor and outdoor scenarios and are collected by sensors with very different properties, thus representing a significant domain shift with respect to DSEC.

Table 3 presents the results achieved by the four models on M3ED, each supervised with MIX 3, MIX 4 and LiDAR training strategies, i.e., the same models trained on DSEC and transferred to M3ED without any additional fine-tuning. Since MIX 1 and MIX 2 perform worse than MIX 3 and MIX 4, they are omitted from here onward. We report results averaged over three main subdomains: Day, Night and Indoor scenes. We observe that models trained with MIX 4 largely outperform their counterparts trained with LiDAR annotations in terms of generalization. This confirms the sub-optimality of supervision provided by sparse and noisy LiDAR measurements, which is often surpassed even by MIX 3 alone (which just replaces LiDAR annotations with proxy labels, without additional training data).

Table 4 reports the outcome of the evaluation on MVSEC, using the same four models trained with MIX 3, MIX 4, or LiDAR labels, averaging results over the same three subdomains as before. Once again, MIX 4 emerges as the clear winner in terms of generalization, ranking first on every metric, while MIX 3 achieves the second-best results except on a few Indoor metrics. Finally, Fig. 6 shows examples from the M3ED and MVSEC datasets, highlighting that every model produces much sharper and more accurate disparity maps when trained with MIX 4 than with LiDAR labels.

4.5 Closing the Loop: Improving SFMs at Night

Finally, we investigate whether the event-based stereo models trained with proxy labels can, in turn, serve as sources of new proxy labels for annotating color images in scenarios where conventional SFMs struggle, such as nighttime conditions. To this end, we generate proxy labels from stereo pairs $(\mathbf{E}_{L},\mathbf{E}_{R})$ and transfer them to $(\mathbf{I}_{L},\mathbf{I}_{R})$ , reversing the procedure described in Sec. 3.1.2.

Table 5 presents the results of this experiment, showing that both FoundationStereo and StereoAnywhere perform poorly on nighttime images. After fine-tuning them for 10 epochs on the proxy labels predicted by their E-FoundationStereo and E-StereoAnywhere counterparts, their accuracy improves substantially, effectively closing the loop across modalities.

5 Conclusion

We presented EventHub, a paradigm for supervising deep event-stereo networks that does not rely on expensive, yet noisy, annotations from LiDAR sensors. EventHub leverages novel view synthesis and knowledge distillation to obtain proxy labels (and proxy events, when needed) directly from conventional color image collections. Models trained with EventHub achieve superior generalization, outperforming those trained with LiDAR labels in cross-domain scenarios. These models can be used to close the loop across modalities, yielding proxy labels to improve RGB stereo models in scenarios where they struggle.

Supplementary Material

This document reports additional material related to the CVPR paper “EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors”.

• First, we present an extended description of our Novel View Synthesis (NVS) pipeline in Section 3, including details about the depth regularizers used to improve depth estimation (Sec. 6.1), and the novel voxel-based confidence $\mathbf{C}_{\text{Vsize}}$ (Sec. 6.2).

• Next, we include additional implementation details, in particular, regarding the global trajectory $\Omega(\tau)$ (Section 8.1), the dataset splits (Sec. 8.2), and the stereo model losses (Sec. 8.3).

• Finally, we present extensive qualitative results regarding both generated data from [63, 75, 18] using our EventHub pipeline, and disparity estimation from our trained event stereo networks, using the three evaluation datasets [84, 18, 7].

6 Method Overview: Additional Details

In this section, we include an extended description of our EventHub pipeline.

6.1 Depth Regularizers

To improve the quality of our NVS generation pipeline, we rely on a subset of the following regularization strategies:

• $\mathcal{L}_{N-\text{mean}}$ and $\mathcal{L}_{N-\text{med}}$: both losses encourage agreement between depth and normal renderings, obtained through mean and median aggregation, respectively [25];

• $\mathcal{L}_{\text{DAv2}}$ promotes consistency between the rendered depth and the monocular predictions from DepthAnythingV2 [73];

• $\mathcal{L}_{\text{asc}}$ encourages density to increase monotonically along the ray direction;

• $\mathcal{L}_{\text{sparse}}$ fosters depth regularization using COLMAP [57] sparse 3D points;

• $\mathcal{L}_{\text{MASt3R}}$ guides the depth regularization following MASt3R predictions.

Ablation study and metrics for NVS. To assess the contribution of each depth regularizer, we conducted an ablation experiment on ScanNet++ [75], which provides ground-truth depth. In particular, we selected a small dataset split and evaluated the contribution of each regularizer using two metrics for NVS image quality (i.e., PSNR and SSIM) and two metrics for depth evaluation (i.e., MAE and $\delta\leq\rho$ ):

• Peak Signal-to-Noise Ratio (PSNR). For color images $\mathbf{I}$ and ground-truth $\mathbf{I}^{*}$, PSNR is defined based on the mean squared error (MSE):

$\text{MSE}=\frac{1}{N}\sum_{i}(\mathbf{I}_{i}-\mathbf{I}_{i}^{*})^{2},\qquad\text{PSNR}=-10\log_{10}(\text{MSE}),$ (7)

where $N$ is the number of pixels, and $\mathbf{I}_{i}$ and $\mathbf{I}_{i}^{*}$ are the RGB values of the $i$-th pixel in the rendered and ground-truth images, respectively. Higher PSNR indicates better agreement with the ground truth.

• Structural Similarity Index (SSIM). SSIM measures similarity between predicted and ground-truth color images $\mathbf{I}$ and $\mathbf{I}^{*}$ by comparing local windows $\mathbf{X}\in\mathcal{N}_{\mathbf{I}}$ and $\mathbf{Y}\in\mathcal{N}_{\mathbf{I}^{*}}$:

$\text{SSIM}(\mathbf{I},\mathbf{I}^{*})=\frac{1}{M}\sum_{\mathbf{X}\in\mathcal{N}_{\mathbf{I}},\ \mathbf{Y}\in\mathcal{N}_{\mathbf{I}^{*}}}\frac{(2\mu_{\mathbf{X}}\mu_{\mathbf{Y}}+C_{1})(2\sigma_{\mathbf{X}\mathbf{Y}}+C_{2})}{(\mu_{\mathbf{X}}^{2}+\mu_{\mathbf{Y}}^{2}+C_{1})(\sigma_{\mathbf{X}}^{2}+\sigma_{\mathbf{Y}}^{2}+C_{2})},$ (8)

where $M$ is the number of windows, $C_{1}$ and $C_{2}$ are constants, $\mu_{\mathbf{X}},\mu_{\mathbf{Y}}$ are the local means, $\sigma_{\mathbf{X}}^{2},\sigma_{\mathbf{Y}}^{2}$ the local variances, and $\sigma_{\mathbf{X}\mathbf{Y}}$ the local covariance. Higher values indicate better structural similarity.

• Mean Absolute Error (MAE). It measures the average magnitude of errors:

$\text{MAE}=\frac{1}{N}\sum_{i}|\mathbf{Z}_{i}-\mathbf{Z}_{i}^{*}|,$ (9)

where $N$ is the number of pixels, $\mathbf{Z}_{i}$ and $\mathbf{Z}_{i}^{*}$ are the predicted and ground-truth depths of the $i$-th pixel, respectively.

• Threshold Accuracy. It reports the percentage of predicted depths within a threshold $\rho$ (in our ablation experiment $\rho=1.25$), indicating the proportion of accurate predictions:

$\text{Accuracy}=\frac{1}{N}\sum_{i}\chi\left(\max\Big(\frac{\mathbf{Z}_{i}}{\mathbf{Z}_{i}^{*}},\frac{\mathbf{Z}_{i}^{*}}{\mathbf{Z}_{i}}\Big)\leq\rho\right)=\frac{1}{N}\sum_{i}\chi\left(\delta\leq\rho\right),$ (10)

where $N$ is the number of pixels, $\mathbf{Z}_{i}$ and $\mathbf{Z}_{i}^{*}$ are the predicted and ground-truth depths of the $i$-th pixel, respectively, and $\chi(\cdot)$ is the indicator function.
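Two of these metrics, PSNR (Eq. 7) and threshold accuracy (Eq. 10), can be sketched in a few lines of plain Python. The function names, the flat-list inputs, the assumption that intensities are normalized to $[0,1]$ and that all depths are strictly positive are choices made for this illustration only.

```python
import math


def psnr(rendered, reference):
    """PSNR in dB (Eq. 7), assuming intensities normalized to [0, 1]."""
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    return math.inf if mse == 0 else -10.0 * math.log10(mse)


def threshold_accuracy(pred_depth, gt_depth, rho=1.25):
    """Percentage of pixels with max(Z/Z*, Z*/Z) <= rho (Eq. 10).

    Assumes strictly positive predicted and ground-truth depths.
    """
    hits = sum(max(z / g, g / z) <= rho for z, g in zip(pred_depth, gt_depth))
    return 100.0 * hits / len(pred_depth)


quality = psnr([0.5, 0.5], [0.4, 0.6])
accuracy = threshold_accuracy([1.0, 2.0, 3.0, 4.0], [1.1, 2.0, 4.0, 3.0])
```

The ratio in the threshold test is symmetric, so over- and under-estimating depth by the same factor are penalized equally.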

Row	$\lambda_{{N-\text{mean}}}$	$\lambda_{N-\text{med}}$	$\lambda_{\text{asc}}$	$\lambda_{\text{sparse}}$	$\lambda_{\text{DAv2}}$	$\lambda_{\text{MASt3R}}$	PSNR	SSIM( $\times 100$ )	MAE (cm)	$\delta\leq 1.25$ (%)
1	-	-	-	-	-	-	\cellcolorfirstcolor33.85	\cellcolorfirstcolor87.36	8.81	93.51
2	0.001	0.001	-	-	-	-	33.25	86.77	9.03	92.43
3	0.001	0.001	0.01	-	-	-	33.25	86.77	9.03	92.47
4	0.001	0.001	0.01	0.01	-	-	33.25	86.77	8.99	92.52
5	0.001	0.001	0.01	0.01	0.01	-	33.19	86.73	6.61	96.23
6	0.001	0.001	0.01	0.01	0.01	0.01	26.86	79.64	38.71	67.31
7	0.001	0.001	0.01	-	0.01	-	33.20	86.73	6.58	96.25
8	0.001	0.001	-	-	0.01	-	33.19	86.73	\cellcolorsecondcolor6.57	\cellcolorsecondcolor96.29
9	-	-	-	-	0.01	-	\cellcolorsecondcolor33.74	\cellcolorsecondcolor87.24	7.15	95.89
10	0.0005	0.0005	-	-	0.01	-	33.37	86.91	\cellcolorfirstcolor6.38	\cellcolorfirstcolor96.44

Table 6: Depth regularization ablation. PSNR values are given in decibels. SSIM values are multiplied by 100. MAE values are reported in centimeters.

Model	PSNR $\uparrow$	MAE (cm) $\downarrow$	$\delta\leq 1.25$ (%) $\uparrow$	Setup Time (min/scene) $\downarrow$	FPS $\uparrow$
Depth Anything v3 [40]	19.19	41.94	53.92	$\sim$ 1	165
Instant-NGP [48]	29.21	24.68	82.88	$\sim$ 8	5
3DGS [31]	32.51	22.36	76.23	$\sim$ 20	165
SVRaster [59]	33.37	6.38	96.44	$\sim$ 20	143

Table 7: Comparison between different NVS engines. SVRaster achieves the best trade-off between rendering quality, setup time and rendering speed.

Ablation Analysis. Table 6 reports the results of our study on depth-guided regularization terms. The left columns indicate the weights $\lambda$ set for each regularizer, starting with the default values from [59]. Without any regularization (row 1), SVRaster achieves solid PSNR and SSIM but exhibits a relatively large depth error (MAE = 8.81 cm). Introducing the first four regularizers, i.e., $\mathcal{L}_{N-\text{mean}}$ and $\mathcal{L}_{N-\text{med}}$ (row 2), $\mathcal{L}_{\text{asc}}$ (row 3), and $\mathcal{L}_{\text{sparse}}$ (row 4), yields no meaningful improvement: MAE stays around 9 cm, while PSNR and SSIM drop slightly with respect to row 1. In contrast, incorporating the monocular prior from DepthAnythingV2 [73] (row 5) produces a substantial reduction in depth error (MAE decreases by 25% with respect to row 1, from 8.81 cm to 6.61 cm) while leaving image quality nearly unchanged. Adding $\mathcal{L}_{\text{MASt3R}}$ on top of all other regularizers (row 6), however, severely degrades performance. Given the strong influence of $\mathcal{L}_{\text{DAv2}}$, we perform additional ablations where the remaining regularizers are progressively removed (rows 7, 8, and 9). This analysis shows a minor contribution from $\mathcal{L}_{\text{asc}}$ and $\mathcal{L}_{\text{sparse}}$, whereas disabling $\mathcal{L}_{N-\text{mean}}$ and $\mathcal{L}_{N-\text{med}}$ leads to worse depth than in row 5 (MAE of 7.15 cm in row 9). Therefore, we reintroduce these two terms with halved weights (row 10), which yields the best overall depth performance. We adopt this last configuration as the final set of depth-regularization weights for our NVS pipeline.

Impact of the NVS engine. To support our choice of using SVRaster to render both proxy labels and event streams, we report a comparison with other state-of-the-art novel view synthesis approaches in Table 7. In addition to rendering quality, we also consider the setup time necessary to process each scene before rendering can start, as well as the speed at which data is generated. Notably, Depth Anything v3 has the lowest setup time, as it directly predicts a 3DGS field in a feed-forward fashion rather than through per-scene optimization. However, this speed is traded for a much lower rendering quality. Instant-NGP still requires a low setup time, yet features a very low rendering speed and sub-optimal rendering quality. Finally, although requiring the highest setup time, 3DGS and SVRaster yield the highest rendering quality: of the two, SVRaster stands out thanks to the careful use of depth regularization.

6.2 Novel Voxel-based Confidence

Despite the added depth regularization, the resulting depth maps may still contain noticeable noise. To address this issue, [63] introduced a trinocular photometric loss:

$\mathcal{L}_{\text{NS}}=\lambda_{\text{disp}}\cdot\eta(\mathbf{C}_{\text{AO}};\mu_{\text{AO}})\cdot\mathcal{L}_{\text{disp}}+\mathbf{M}_{\text{auto}}\cdot\lambda_{\text{3p}}\cdot(1-\eta(\mathbf{C}_{\text{AO}};\mu_{\text{AO}}))\cdot\mathcal{L}_{\text{3p}},$ (11)

where $\mathcal{L}_{\text{disp}}$ is the disparity supervision loss with respect to the estimated disparity $\mathbf{D}_{e}$ (further details in Sec. 8.3), $\lambda_{\text{disp}}=1.0$ and $\lambda_{\text{3p}}=0.1$ are the loss weights set to the default values in [63], $\eta(\mathbf{C}_{\text{AO}};\mu_{\text{AO}})$ is the truncation function that truncates confidence $\mathbf{C}_{\text{AO}}$ using the threshold $\mu_{\text{AO}}=0.5$ :

$\eta(\mathbf{C};\mu)=\begin{cases}0&\text{if}\ \mathbf{C}\leq\mu\\ \mathbf{C}&\text{otherwise}\end{cases},\qquad\mathbf{C}_{\text{AO}}=\text{norm}\left(\sum_{i=1}^{N}T_{i}\alpha_{i}^{2}\right),\qquad\text{norm}(\mathbf{X})=\frac{\mathbf{X}-\min(\mathbf{X})}{\max(\mathbf{X})-\min(\mathbf{X})}.$ (12)

Given the three rendered images $\mathbf{I}_{LL}$, $\mathbf{I}_{L}$, $\mathbf{I}_{R}$, where $\mathbf{I}_{LL}$ and $\mathbf{I}_{R}$ are rendered after applying the respective stereo translations $(b\ 0\ 0)^{\top}$ and $(-b\ 0\ 0)^{\top}$ to the translation component $\mathbf{t}_{\tau}$ of the virtual trajectory $\Gamma(\tau)$ or $\Omega(\tau)$, we define the trinocular photometric loss $\mathcal{L}_{\text{3p}}$ as follows:

$\mathcal{L}_{\text{3p}}(\mathbf{I}_{LL},\mathbf{I}_{L},\mathbf{I}_{R})=\min\Bigl(\mathcal{L}_{\text{2p}}\bigl(\mathbf{I}_{L},\mathcal{W}(\mathbf{I}_{LL},\mathbf{D}_{e})\bigr),\mathcal{L}_{\text{2p}}\bigl(\mathbf{I}_{L},\mathcal{W}(\mathbf{I}_{R},-\mathbf{D}_{e})\bigr)\Bigr),$ (13)

where $\mathcal{L}_{\text{2p}}$ is the standard photometric loss, $\mathcal{W}(\cdot,\cdot)$ is the backward warping function using the estimated disparity $\mathbf{D}_{e}$ from the event stereo model, and $\mathbf{M}_{\text{auto}}$ is the automasking term that removes untextured regions. The standard photometric loss $\mathcal{L}_{\text{2p}}$ and the automasking term $\mathbf{M}_{\text{auto}}$ are defined, respectively, as follows:

$\mathcal{L}_{\text{2p}}(\mathbf{I},\mathbf{I}^{\mathcal{W}})=\beta\,\frac{1-\text{SSIM}(\mathbf{I},\mathbf{I}^{\mathcal{W}})}{2}+(1-\beta)\left|\mathbf{I}-\mathbf{I}^{\mathcal{W}}\right|,$ (14)

$\mathbf{M}_{\text{auto}}=\chi\left(\min\mathcal{L}_{\text{3p}}\left(\mathcal{W}(\mathbf{I}_{LL},\mathbf{D}_{e}),\mathbf{I}_{L},\mathcal{W}(\mathbf{I}_{R},-\mathbf{D}_{e})\right)<\min\mathcal{L}_{\text{3p}}\left(\mathbf{I}_{LL},\mathbf{I}_{L},\mathbf{I}_{R}\right)\right).$ (15)
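The way Eq. 11 blends the two supervision terms through the truncated confidence of Eq. 12 can be sketched per pixel as follows. The function names are hypothetical, the loss terms $\mathcal{L}_{\text{disp}}$ and $\mathcal{L}_{\text{3p}}$ and the mask $\mathbf{M}_{\text{auto}}$ are assumed to be precomputed scalars for one pixel, and mapping a constant input to zero in `min_max_norm` is an assumption of this sketch.

```python
def truncate(conf, mu):
    """Truncation function eta(C; mu) of Eq. 12: zero out low confidence."""
    return conf if conf > mu else 0.0


def min_max_norm(values):
    """norm(X) of Eq. 12: rescale a list of values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant input maps to 0
    return [(v - lo) / (hi - lo) for v in values]


def ns_loss_pixel(conf, l_disp, l_3p, m_auto, mu=0.5, lam_disp=1.0, lam_3p=0.1):
    """Per-pixel L_NS of Eq. 11, given precomputed loss terms."""
    eta = truncate(conf, mu)
    return lam_disp * eta * l_disp + m_auto * lam_3p * (1.0 - eta) * l_3p


confident = ns_loss_pixel(conf=0.8, l_disp=2.0, l_3p=1.0, m_auto=1)
uncertain = ns_loss_pixel(conf=0.3, l_disp=2.0, l_3p=1.0, m_auto=1)
```

A confident pixel is driven mostly by the disparity term, whereas a pixel below the threshold falls back entirely on the photometric term.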

Confidence	Threshold	MAE (cm)	$\delta\leq 1.25$ (%)	Density (%)
-	-	6.38	96.44	100.00
$\mathbf{C}_{\text{AO}}$	0.35	\cellcolorsecondcolor6.26	\cellcolorfirstcolor96.60	\cellcolorsecondcolor95.45
$\mathbf{C}_{\text{Vsize}}$	0.75	\cellcolorfirstcolor6.23	\cellcolorsecondcolor96.57	\cellcolorfirstcolor97.42
$\mathbf{C}_{\text{AO}}$	0.40	\cellcolorsecondcolor6.22	\cellcolorfirstcolor96.66	\cellcolorsecondcolor92.04
$\mathbf{C}_{\text{Vsize}}$	0.80	\cellcolorfirstcolor6.15	\cellcolorsecondcolor96.61	\cellcolorfirstcolor95.56
$\mathbf{C}_{\text{AO}}$	0.45	\cellcolorsecondcolor6.20	\cellcolorfirstcolor96.72	\cellcolorsecondcolor87.38
$\mathbf{C}_{\text{Vsize}}$	0.85	\cellcolorfirstcolor6.03	\cellcolorsecondcolor96.69	\cellcolorfirstcolor91.63

Table 8: Confidence threshold study. Comparison between ambient occlusion confidence $\mathbf{C}_{\text{AO}}$ [63] and our voxel-based confidence $\mathbf{C}_{\text{Vsize}}$ on ScanNet++ [75]. MAE values are reported in centimeters.

We studied a replacement for $\mathbf{C}_{\text{AO}}$ that exploits the properties peculiar to the underlying NVS engine [59], and introduced $\mathbf{C}_{\text{Vsize}}$ using the voxel size as a confidence measure. Indeed, voxel sizes are defined during scene optimization and encouraged to be smaller for voxels seen from multiple viewpoints (i.e., those points in the scene that are more constrained by multi-view geometry). With reference to Equation 3, we include additional details:

$\mathbf{C}_{\text{Vsize}}=\text{norm}\left(\sum_{i=1}^{N}T_{i}s_{i}\right)\odot\text{norm}\left(\sum_{i=1}^{N}T_{i}\alpha_{i}\right)=\mathbf{C}^{\prime}_{\text{Vsize}}\odot\mathbf{C}_{\text{hole}},$ (16)

where $\mathbf{C}^{\prime}_{\text{Vsize}}$ assigns high confidence to pixels whose rays intersect small voxels, and $\mathbf{C}_{\text{hole}}$ is the hole confidence that assigns low confidence to pixels whose rays intersect empty space. We conducted an ablation experiment to compare the performance of our novel voxel-based confidence $\mathbf{C}_{\text{Vsize}}$ against the ambient occlusion confidence $\mathbf{C}_{\text{AO}}$ from [63]. We evaluated both approaches using different truncation thresholds on a small ScanNet++ [75] subset (i.e., 07f5b601ee, 08bbbdcc3d, 0c5385e84b, 210f741378, 25aa952aa3, 39f36da05b, 56a0ec536c, 5a269ba6fe, a1d9da703c, bc2fce1d81, be0ed6b33c, daffc70503, dc263dfbf0, ef18cf0708, fb564c935d), reporting depth estimation results in Table 8. Notably, our $\mathbf{C}_{\text{Vsize}}$ consistently achieves lower MAE while maintaining a higher density than $\mathbf{C}_{\text{AO}}$. We selected $\mu_{\text{Vsize}}=0.75$ as the final truncation threshold for our voxel-based confidence.
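A minimal sketch of Eq. 16 is given below, operating on a handful of rays instead of a full image. The function names and the per-ray list of $(T_{i},\alpha_{i},s_{i})$ samples are hypothetical, and the sketch follows the equation literally: it takes the per-sample term $s_{i}$ as given and does not cover how $s_{i}$ is derived from the voxel size inside the NVS engine.

```python
def min_max_norm(values):
    """Rescale a list of values to [0, 1]; a constant input maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def voxel_confidence(rays):
    """C_Vsize per Eq. 16, one value per ray.

    Each ray is a list of (T_i, alpha_i, s_i) samples: transmittance,
    opacity, and the voxel-size term of the i-th sample along the ray.
    """
    size_term = min_max_norm([sum(t * s for t, _, s in ray) for ray in rays])
    hole_term = min_max_norm([sum(t * a for t, a, _ in ray) for ray in rays])
    # Element-wise product of the two normalized maps.
    return [cv * ch for cv, ch in zip(size_term, hole_term)]


rays = [
    [(1.0, 0.5, 0.2), (0.5, 0.5, 0.4)],  # ray hitting two semi-opaque voxels
    [(1.0, 0.0, 0.0)],                   # ray through empty space
    [(1.0, 1.0, 1.0)],                   # ray hitting one opaque voxel
]
confidence = voxel_confidence(rays)
```

The ray crossing empty space accumulates no opacity, so the hole term drives its confidence to zero regardless of the voxel-size term.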

7 Additional Experiments

We now report further, focused experiments.

Further in-domain comparisons. Table 9 reports additional experiments on DSEC, aimed at assessing the impact of rendering quality on the accuracy of the trained stereo models. We conduct this evaluation along two axes. At the top, we compare the results achieved by replacing SVRaster as the rendering engine of our pipeline with the feed-forward model Depth Anything v3 [40]: despite the much faster data generation process enabled by the latter, we observe a significant drop in the accuracy of the trained models. At the bottom, we extend the amount of synthetic data used to generate proxy events with E2VID [17], specifically by including TartanAir together with SceneFlow: despite the improvement enabled by the larger amount of initial data, a consistent gap remains between models trained on this kind of data and ours. Importantly, we emphasize that event data generated from synthetic RGB datasets are not direct competitors to our EventHub framework; rather, the two sources could be combined to enhance performance further.

Efficiency Analysis. In Table 10, we report the complexity of each of the stereo backbones involved in our experiments, detailing the number of parameters, the FLOPs, the runtime, and the peak memory usage. SE-CFF is the least complex architecture, although it achieves the worst results in our evaluation. In contrast, E-StereoAnywhere and E-FoundationStereo are the most computationally intensive architectures.

Convergence Analysis. By fixing the number of epochs to 10 across the different datasets, as described in the main paper, we obtain different numbers of total training steps, possibly biasing the evaluation of the trained models. However, as shown in Figure 7, the models converge quickly to stable results, with marginal or no improvements obtained by extending the training for more iterations, as occurs when using larger data splits such as MIX 2 and MIX 4.

Training Method	SE-CFF				E-FoundationStereo
Training Method	1PE $\downarrow$	2PE $\downarrow$	3PE $\downarrow$	MAE $\downarrow$	1PE $\downarrow$	2PE $\downarrow$	3PE $\downarrow$	MAE $\downarrow$
MIX 3 (SVRaster [59])	24.73	8.58	5.08	1.01	20.99	6.82	4.10	0.89
MIX 3 (Depth Anything v3 [40])	74.35	47.82	30.63	3.18	71.37	41.01	23.99	2.54
EV-SceneFlow [17, 45]	66.30	50.18	41.47	3.50	61.80	48.04	41.68	3.10
EV-(SceneFlow+TartanAir) [17, 45, 65]	57.78	33.01	19.75	2.17	41.86	23.13	16.58	1.76

Table 9: Further in-domain experimental results – DSEC dataset [18]. On top: comparison between SVRaster and Depth Anything v3 generated data. At the bottom: results by extending the synthetic data used to generate proxy events.

Model	Parameters (M)	FLOPs (G)	Runtime (ms)	Peak Memory (MB)
SE-CFF [49]	2.97	85.98	46.27	379.13
EMatch [82]	6.71	501.95	115.20	3090.49
E-StereoAnywhere	39.96	1566.58	219.81	1479.82
E-FoundationStereo	60.09	4445.51	280.11	1525.13

Table 10: Hardware analysis on DSEC dataset [18]. Measurements taken on a X GPU.

8 Additional Details Concerning Implementation and Experimental Settings

In this section, we include additional implementation details, in particular an extended overview of our global trajectory $\Omega(\tau)$, the dataset splits used for both training [63, 75, 18] and evaluation [18, 7, 84], and the adaptation of the $\mathcal{L}_{\text{NS}}$ loss for each stereo architecture, i.e., SE-CFF [49], EMatch [82], E-StereoAnywhere [6], and E-FoundationStereo [69].

8.1 Global Trajectory Implementation

For each selected ScanNet++ scene, we gather all COLMAP training poses $[\hat{\mathbf{R}}_{i}|\hat{\mathbf{t}}_{i}]=\hat{\mathbf{T}}_{i}\in\mathbb{SE}(3)$ and project $\hat{\mathbf{t}}_{i}=(\hat{x}_{i},\hat{y}_{i},\hat{z}_{i})^{\top}$ onto a 2D top view by discarding the $\hat{z}_{i}$ component. We then compute the corresponding $\alpha$-shape, yielding an obstacle-avoiding 2D circular path. The resulting 2D curve is lifted back to 3D via a nearest-neighbor search, and the lifted points are used to fit three splines by least squares (implemented using the SciPy package). As detailed in the main paper, one spline provides a continuous representation of the translation component $\mathbf{t}_{\tau}$, while the splines $\mathbf{r}(\tau)$ and $\mathbf{l}(\tau)$ parametrize the rotation component $\mathbf{R}_{\tau}$:

\mathbf{R}_{\tau}=\begin{bmatrix}\mathbf{d}(\tau)\times\mathbf{l}(\tau)&\mathbf{d}(\tau)&\mathbf{l}(\tau)\end{bmatrix},\quad\mathbf{d}(\tau)=\mathbf{l}(\tau)\times\mathbf{r}(\tau).

(17)

However, since the pose orientations in ScanNet++ are essentially random, which causes unnatural camera egomotion, we re-estimate the camera rotations $\hat{\mathbf{R}}_{i}$ to align them with the direction of motion. Specifically, we approximate the motion direction $\nabla\mathbf{t}_{\tau}$ using finite differences, and construct the updated orientation:

\mathbf{R}^{\prime}_{\tau}=\begin{bmatrix}\mathbf{r}^{\prime}(\tau)&\mathbf{d}^{\prime}(\tau)&\nabla\mathbf{t}_{\tau}\end{bmatrix},\quad\mathbf{r}^{\prime}(\tau)=\mathbf{g}\times\nabla\mathbf{t}_{\tau},\quad\mathbf{d}^{\prime}(\tau)=\nabla\mathbf{t}_{\tau}\times\mathbf{r}^{\prime}(\tau),

(18)

where $\mathbf{g}=\left(0\ 0\ 1\right)^{\top}$ denotes the ScanNet++ gravity vector. Finally, we clamp the $\hat{z}_{i}$ translation component to its $[45,55]$ -th percentile range to suppress strong vertical oscillations. This procedure yields a “human-like” walking trajectory through the scene, as shown in Figure 8 (right).
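The orientation re-estimation of Equation 18 and the height clamping can be sketched as follows. This is a minimal illustration under stated assumptions: the trajectory is given as an array of sampled positions, the helper names `motion_aligned_rotations` and `clamp_height` are hypothetical, and the explicit normalization of each axis is added here so that the result is a valid rotation matrix (Equation 18 does not spell it out). The order in which the two steps are composed is not specified by the sketch.

```python
import numpy as np

def motion_aligned_rotations(t, g=np.array([0.0, 0.0, 1.0])):
    """Build R' = [r' | d' | grad_t] for each sample of a trajectory t of shape (N, 3)."""
    fwd = np.gradient(t, axis=0)                       # finite-difference motion direction
    fwd /= np.linalg.norm(fwd, axis=1, keepdims=True)  # assumption: unit-length direction
    right = np.cross(g, fwd)                           # r' = g x grad_t
    # Note: this degenerates if the motion is parallel to the gravity vector g.
    right /= np.linalg.norm(right, axis=1, keepdims=True)
    down = np.cross(fwd, right)                        # d' = grad_t x r'
    return np.stack([right, down, fwd], axis=2)        # columns: r', d', grad_t

def clamp_height(t, lo_pct=45.0, hi_pct=55.0):
    """Clamp the z component of t to its [lo_pct, hi_pct] percentile range."""
    lo, hi = np.percentile(t[:, 2], [lo_pct, hi_pct])
    out = t.copy()
    out[:, 2] = np.clip(out[:, 2], lo, hi)
    return out
```

Because $\mathbf{r}^{\prime}$ is orthogonal to the motion direction by construction, the three columns form a right-handed orthonormal frame whenever the motion is not vertical.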

8.2 ScanNet++ Scenes Used for NVS

We enrich Section 4.1 with further information regarding the datasets used for event data generation [63, 18, 75]. For event data generation from Novel View Synthesis, we collect 30 samples for each scene, where each sample is composed of the stereo streams $\mathbf{E}_{L}$ and $\mathbf{E}_{R}$, the intrinsics $\mathbf{K}$, the baseline $b$, the RGB triplet $\mathbf{I}_{LL}$, $\mathbf{I}_{L}$, and $\mathbf{I}_{R}$, the depth $\mathbf{Z}$, and the confidence $\mathbf{C}_{\text{Vsize}}$. The maximum number of events for the event stereo streams $\mathbf{E}_{L}$ and $\mathbf{E}_{R}$ is limited to $650\,000$ and $1\,000\,000$ events, respectively, for the samples at resolutions $640\times 480$ px and $1280\times 720$ px. Furthermore, we randomize the contrast threshold using a uniform distribution $\mathcal{U}(0.15,0.25)$. We used all 270 scenes from the NeRF Stereo Dataset [63] – i.e., starting from scene $0000$ up to scene $0269$ – while we selected the following 403 scenes from ScanNet++ [75]: 00777c41d4, 0271889ec0, 02c2ddee2a, 036bce3393, 0452249a1e, 04d0dc245b, 04df8734b7, 052d72e137, 0658da5bc0, 068ba2946c, 06b5863f73, 06bc6d1b24, 076c822ecc, 079a326597, 07f5b601ee, 08bbbdcc3d, 09a6767fc2, 09bced689e, 0a5c013435, 0c5385e84b, 0c6c7145ba, 0c7962bd64, 0caa1ae59a, 0d8ead0038, 0e100756bf, 0e350246d3, 0e900bcc5c, 0f0191b10b, 0f25f24a4f, 0f3474b837, 10242d1eaf, 10c8ab99f4, 1117299565, 1204e08f17, 124a6e789b, 12c0f7a7da, 13285009a4, 132cb783ed, 13b4efaf62, 15c4aa5bbb, 16c9bd2e1e, 1730c7d709, 1841a0b525, 192ab15daf, 1a130d092a, 1a3100752b, 1a8e0d78c0, 1b9692f0c7, 1bb93d185e, 1c08823a41, 1c4b893630, 1c7a683c92, 1d003b07bd, 1eacc65607, 20871b98f3, 20ff72df6e, 210f741378, 216b9e55e8, 238b940049, 246fe09e98, 2489b7f4fe, 24b248e676, 251443268c, 25aa952aa3, 25bae29ab3, 25bde9e167, 260db9cf5a, 260fa55d50, 2634683a9f, 2748de13fb, 2779f8f9e2, 27dc178a3d, 281bc17764, 2970e95b65, 29c7afafed, 2a1b555966, 2a496183e1, 2b71155e0d, 2f5996ff01, 2f6f83ea1f, 302a7f6b67, 303745abc7, 
30f4a2b44d, 320c3af000, 324d07a5b3, 3391ff8a71, 3423e509af, 35050f41c5, 355e5e32db, 364f01bc18, 37562e7f48, 3799bd47b3, 37c9538a2b, 38fcf02d0b, 390eda9157, 39580e2a43, 39e6ee46df, 39f36da05b, 3a3745a437, 3aa115e55e, 3b90310b1c, 3c8d535d49, 3caf4324fd, 3cbb18c391, 3ce6d36ab5, 3d838ee1ab, 3e7e4b07c4, 3e928dc2f6, 3ff873c77e, 413085a827, 41b00feddb, 4318f8bb3c, 4380e4646a, 43cd995c51, 4422722c49, 4423a61d09, 442b144761, 44c85584ae, 4517d988d8, 45d2e33be1, 46001f434d, 4610b2104c, 46638cfd0f, 47b37eb6f9, 4808c4a397, 480ddaadc0, 484ad681df, 48573f4c95, 48701abb21, 4897e95232, 49789448b8, 4aef651da7, 4c141d5b1b, 4c5c60fa76, 4d451d9c36, 4e0b8cbd33, 4ea827f5a1, 504cf57907, 511061232, 51bdbf173f, 523657b4d0, 5334a4164a, 53755e535e, 546292a9db, 54b005d19d, 55b2bf8036, 5654092cc2, 56669a70bc, 56a0ec536c, 58960ff105, 589f5c7c58, 58f6a5c5ec, 59e3f1ea37, 5a269ba6fe, 5a9cdde1ba, 5aeac3800a, 5bc6227191, 5c215ef3b0, 5d152fab1b, 5d902f1593, 5ea3e738c3, 5f0fb991a7, 6126572846, 612f70fe00, 617326da3e, 618310ed87, 6183f0657d, 61adeff7d5, 6248c6742d, 635852d56e, 639f2c4d5a, 6464461276, 64672b5bf5, 652d9cb0d7, 666d04a14a, 66ba53719a, 66c98f4a9b, 67d702f2e8, 696317583f, 69e56cf0f8, 69e5939669, 6ad6cef000, 6b19334aeb, 6b40d1a939, 6bd39ac392, 6da1d5ab04, 6f1848d1e3, 70945f435a, 709ab5bffe, 70f0e494b2, 712b9ae775, 724c40236c, 72f527a47c, 73f9370962, 75d29d69b8, 7739004a45, 77b40ce601, 785e7504b9, 791a5c253d, 7b04052ad0, 7b4a316aea, 7b4cb756d4, 7c0ba828a9, 7c31a42404, 7c31bccde5, 7d8d37ca38, 7e7d2e8640, 7f22d5ef1b, 7f68c514bd, 7f77abce34, 7fb8ff20e9, 8013901416, 80ffca8a48, 81a82c3618, 82f448db76, 82ff39b7ef, 85251de7d1, 85dc2702b7, 867d97cf3d, 871efc90fa, 8737a0d1ad, 88627b561e, 8890d0a267, 88f265fe25, 893fb90e89, 8be0cd3817, 8d0f714398, 8de35c04a3, 8e22c48c20, 8f82c394d6, 8fc40ba77b, 9084d4cd97, 909a9ea5fc, 91fc568d84, 9444b90aaa, 9471b8d485, 94b1acde81, 95748dd597, 95d525fbfd, 97e5512e91, 9816c49e97, 98b4ec142f, 98fe276aa8, 99010a8938, 9b74afd2d2, 9bfbc75700, 9c7b4394af, 9cfea269dd, 
9d8fcc4215, 9dc5ad040f, 9ef5fc6271, a08d9a2476, a1d9da703c, a23f391ba9, a30646cae6, a31b2ef388, a492fe77aa, a4d48ea6b3, a4e227f506, a892730b61, a8f7f66985, a9e4791c7e, aa852f7871, aab83fd6f1, ab046f8faf, ab11145646, ab6983ae6c, abf29d2474, ac250f0ead, acd69a1746, ad2d07fd11, adf4ab4a53, aea84db0de, b068706ef0, b08a908f0f, b09431c547, b0b004c40f, b0f057c684, b0fe0c610f, b1d75ecd55, b20a261fdf, b24697b3a1, b2632b738a, b3ac0beef0, b4b39438f0, b5918e4637, b6d73041c8, b97261909e, bac7ee3b1b, bb05a0c48c, bb0ad8a081, bc2fce1d81, bc400d86e1, be05b26a38, be0ed6b33c, be8367fcbe, bf07750a0b, bf50f418ba, bfcfe53c6a, bfd3fd54d2, c026d108e0, c07c707449, c08d1d52b7, c0da8f4a4d, c0f5742640, c29b5e479c, c2d714d386, c31ebd4b22, c40466a844, c465f388d1, c47168fab2, c4aaedcfd1, c4d4cb61f6, c601466b77, c842edbdf5, c856c41c99, c8eeef6427, c8f2218ee2, c9a8357e8f, ca0c580422, cab239278a, cb7785f6ad, cc5ea8026c, ccfd3ed9c7, cd0b6082d2, ce12db9e81, cec8312f4e, d054227009, d1345a65c1, d1f82299d0, d240136ce4, d2f44bf242, d537ef1d41, d551dac194, d61691f945, d6a77f7c22, d6bb698875, d7abfc4b17, d7b871aaa8, d807fb583b, d918af9c5f, d986399f4c, daffc70503, db5293a870, dc263dfbf0, dd685be466, de3c77cecd, de5881aa12, deb1867829, dec0b11090, defd3457db, dfa70fb232, dfac5b38df, e050c15a8d, e0de253456, e1aa584dd5, e2caaaf5b5, e3ad7115db, e3b3b0d0c7, e3c1da58dd, e3e0617f98, e3ecd49e2b, e3ef8b690b, e4007ff6b5, e4e625a3e4, e4fb2a623b, e5a769dbf5, e667e09fe6, e69064f2f3, e7ccd75e5d, e81c8b3eec, e8e81396b6, e8ea9b4da8, e909f8213d, e9e16b6043, eaa6c90310, eaab7bcc15, eab5494dca, eb8ef9b4cc, ec2cb8dae1, ed2216380b, eea4ad9c04, eeeb9836b8, ef18cf0708, ef25276c25, f19ca0a52e, f248c2bcdc, f25f5e6f63, f2e6c43543, f38b0108a1, f3f016ba3f, f576071590, f6659a3107, f6a9b64a0d, f847086d15, f8d5147d1d, f8e13ab4ae, f8eac0ad24, f97de2c3e9, faba6e97d7, faec2f0468, fb152519ad, fb564c935d, fb893ffaf3, fb9b4c2f15, fd361ab85f, fd8560cfd6, fe1733741f, and ff17657f71.
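The per-sample generation settings described at the beginning of this section (event caps per resolution and contrast threshold randomization) can be summarized in a short sketch. The names `MAX_EVENTS`, `sample_contrast_threshold`, and `cap_events` are hypothetical, the event array layout is left unspecified, and the way the cap is enforced (here, keeping the earliest events) is an assumption made only for illustration.

```python
import numpy as np

# Caps stated in the text: 650,000 events at 640x480 px and 1,000,000 at 1280x720 px.
MAX_EVENTS = {(640, 480): 650_000, (1280, 720): 1_000_000}

def sample_contrast_threshold(rng):
    """Draw one contrast threshold from the uniform distribution U(0.15, 0.25)."""
    return float(rng.uniform(0.15, 0.25))

def cap_events(events, resolution):
    """Keep at most the allowed number of events for the given (width, height)."""
    limit = MAX_EVENTS[resolution]
    # Assumption: events are time-ordered along axis 0 and the earliest ones are kept.
    return events[:limit]
```

A generator would call `sample_contrast_threshold(np.random.default_rng(seed))` once per simulated stream and apply `cap_events` to each of $\mathbf{E}_{L}$ and $\mathbf{E}_{R}$; whether the threshold is drawn per sample or per scene is not fixed by this sketch.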

8.3 Custom Stereo Losses

As mentioned in Section 6.2, we adapt $\mathcal{L}_{\text{NS}}$ for each event stereo model – i.e., SE-CFF [49], EMatch [82], E-StereoAnywhere [6], and E-FoundationStereo [69]. In particular, we start from the original loss proposed by the authors of each architecture, obtaining the following losses:

• We adapt $\mathcal{L}_{\text{NS}}$ for SE-CFF [49] starting from its multi-scale disparity loss:

\mathcal{L}^{\prime}_{\text{NS}}=\sum^{L}_{s}w_{s}\left[\left(\lambda_{\text{disp}}\cdot\eta(\mathbf{C}^{(s)}_{\text{Vsize}};\mu_{\text{Vsize}})\cdot\mathcal{L}^{(s)}_{\text{disp}}+\mathbf{M}^{(s)}_{\text{auto}}\cdot\lambda_{\text{3p}}\cdot(1-\eta(\mathbf{C}^{(s)}_{\text{Vsize}};\mu_{\text{Vsize}}))\cdot\mathcal{L}^{(s)}_{\text{3p}}\right)+\lambda_{\text{smooth}}\cdot\mathcal{L}^{(s)}_{\text{smooth}}\right],

(19)

where $L$ is the number of scales, $w_{s}$ is the weight for the $s$-th scale, $\mathcal{L}^{(s)}_{\text{disp}}$ is an L1 loss computed at scale $s$, $\mathcal{L}^{(s)}_{\text{smooth}}$ is a gradient regularization term that ensures smooth disparity estimates, and $\lambda_{\text{smooth}}=0.1$ is the weighting term for $\mathcal{L}^{(s)}_{\text{smooth}}$.

• For the other stereo networks – i.e., EMatch [82], E-StereoAnywhere [6], and E-FoundationStereo [69] – we adopt a RAFTStereo-like [42] loss with further supervision for the initial disparity estimation:

\mathcal{L}^{\prime\prime}_{\text{NS}}=\left[\sum^{N}_{i}w_{i}\left[\left(\lambda_{\text{disp}}\cdot\eta(\mathbf{C}_{\text{Vsize}};\mu_{\text{Vsize}})\cdot\mathcal{L}^{i}_{\text{disp}}+\mathbf{M}_{\text{auto}}\cdot\lambda_{\text{3p}}\cdot(1-\eta(\mathbf{C}_{\text{Vsize}};\mu_{\text{Vsize}}))\cdot\mathcal{L}^{i}_{\text{3p}}\right)\right]\right]+\mathcal{L}^{0}_{\text{NS}},

(20)

where $N$ is the number of refinement steps, $w_{i}$ is the exponentially increasing weight for the $i$-th refined disparity, $\mathcal{L}^{i}_{\text{disp}}$ is an L1 loss computed with respect to the $i$-th refined disparity, and $\mathcal{L}^{0}_{\text{NS}}$ is the NeRF-supervised loss for the initial disparity.

As mentioned in the main paper (Sec. 4.1), the losses $\mathcal{L}^{\prime}_{\text{NS}}$ and $\mathcal{L}^{\prime\prime}_{\text{NS}}$ are used for NVS data only – where $\mathbf{C}_{\text{Vsize}}$ and the RGB triplet $\mathbf{I}_{LL}$, $\mathbf{I}_{L}$, and $\mathbf{I}_{R}$ are available. For the other sources of data – i.e., distilled data from [18] and ground-truth supervised training – we maintain only the $\mathcal{L}^{(s)}_{\text{disp}}$ and $\mathcal{L}^{i}_{\text{disp}}$ terms from $\mathcal{L}^{\prime}_{\text{NS}}$ and $\mathcal{L}^{\prime\prime}_{\text{NS}}$, respectively.
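To make the structure of Equation 20 concrete, the sketch below evaluates the bracketed sum over refinement steps with NumPy. It is an illustration under several assumptions rather than our training code: the gating $\eta$ is assumed to zero out confidences below $\mu_{\text{Vsize}}$ and pass the remaining values through, the photometric term $\mathcal{L}^{i}_{\text{3p}}$ is taken as a precomputed per-pixel map, the weights follow the RAFT-Stereo-style schedule $w_{i}=\gamma^{N-1-i}$ with an assumed $\gamma=0.9$, the default values of `lambda_disp` and `lambda_3p` are placeholders, and per-pixel terms are averaged over the image. The initial-disparity term $\mathcal{L}^{0}_{\text{NS}}$ is omitted and would be added separately.

```python
import numpy as np

def eta(conf, mu):
    """Assumed gating: keep confidence values >= mu and zero out the rest."""
    return np.where(conf >= mu, conf, 0.0)

def sequence_ns_loss(disp_preds, proxy_disp, photo_losses, conf, auto_mask,
                     mu=0.75, lambda_disp=1.0, lambda_3p=1.0, gamma=0.9):
    """Weighted sum over refinement steps of gated L1 and masked photometric terms."""
    n = len(disp_preds)
    gate = eta(conf, mu)
    total = 0.0
    for i, (pred, photo) in enumerate(zip(disp_preds, photo_losses)):
        w_i = gamma ** (n - 1 - i)              # exponentially increasing weight
        l_disp = np.abs(pred - proxy_disp)      # per-pixel L1 against the proxy label
        per_pixel = (lambda_disp * gate * l_disp
                     + auto_mask * lambda_3p * (1.0 - gate) * photo)
        total += w_i * per_pixel.mean()         # assumption: mean reduction over pixels
    return float(total)
```

For data sources without confidence maps and RGB triplets, keeping only the `l_disp` term (that is, a gate of one and no photometric term) recovers the plain sequence L1 loss mentioned above.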

9 Additional Qualitative Results

In this section, we collect additional qualitative results, including full samples from the EventHub data (Sec. 9.1), predictions generated by event-based stereo networks (Sec. 9.2), and finally, a qualitative comparison of conventional RGB stereo foundation models such as FoundationStereo, before and after fine-tuning on EventHub data, on challenging night sequences (Sec. 9.3).

9.1 Qualitative Samples from EventHub

We report a few training samples generated with our EventHub pipeline, obtained both by means of cross-modal distillation and by deploying novel view synthesis.

Figure 9 shows three samples from the DSEC dataset, obtained through the former paradigm. From left to right, we display the left image from the color stereo pair, the left event frame, and the proxy disparity map generated by FoundationStereo [69] and projected over the event frame, as described in Sec. 3.1.2. We can notice, in particular, the high level of detail of these predicted labels, which is crucial for providing the event stereo models with strong guidance.

Figures 10 and 11 collect four examples from scenes available in the NeRFStereo [63] and ScanNet++ [75] datasets, respectively. From left to right, we show rendered RGB and event frames, followed by rendered depth maps, confidence maps based on voxel sizes, and rendered depth maps masked according to confidence thresholding. The latter further highlight the importance of confidence thresholding in removing outliers in the rendered depth maps.

9.2 Predictions from Event Stereo Networks

We report additional qualitative results concerning event stereo models trained with different supervision flavors.

Figures 12, 13 and 14 collect two samples each from the DSEC [18], M3ED [7] and MVSEC [84] datasets, respectively. On every dataset, we can clearly notice how MIX 4 allows each of the four models involved in our experiments to be trained at its best, with the novel models introduced by repurposing stereo foundation models from the RGB literature [6, 69] – E-StereoAnywhere and E-FoundationStereo – benefiting the most from the superior annotations produced by EventHub.

9.3 Predictions from RGB SFMs at Night

We conclude by showing qualitatively how we can improve the original stereo foundation models – StereoAnywhere and FoundationStereo, from which we derived our new E-StereoAnywhere and E-FoundationStereo frameworks – on challenging conditions where they struggle, by distilling the knowledge of E-StereoAnywhere and E-FoundationStereo themselves.

Figures 15 and 16 collect two nighttime images from DSEC [18] each. From left to right, we show (a) the left color image, then the predictions by FoundationStereo [69] respectively (b) before any further fine-tuning – i.e., using the original weights – and (c) after being fine-tuned on proxy labels distilled by E-FoundationStereo. After fine-tuning, FoundationStereo learns to deal with this challenging domain and is able to better retain fine details in the predicted disparity maps.

Acknowledgment. The authors gratefully acknowledge the EuroHPC Joint Undertaking for awarding this project access to supercomputing resources under Proposal ID EHPC-DEV-2025D05-081.

References

[1] S. H. Ahmed, H. W. Jang, S. N. Uddin, and Y. J. Jung (2021) Deep event stereo leveraged by event-to-image translation. In AAAI Conf. Artificial Intell., Vol. 35, pp. 882–890.
[2] F. Aleotti, F. Tosi, P. Z. Ramirez, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano (2021) Neural disparity refinement for arbitrary resolution stereo. In Int. Conf. 3D Vision (3DV), pp. 207–217.
[3] F. Aleotti, F. Tosi, L. Zhang, M. Poggi, and S. Mattoccia (2020) Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation. In Eur. Conf. Comput. Vis. (ECCV), pp. 614–632.
[4] L. Bartolomei, E. Mannocci, F. Tosi, M. Poggi, and S. Mattoccia (2025) Depth AnyEvent: a cross-modal distillation paradigm for event-based monocular depth estimation. In Int. Conf. Comput. Vis. (ICCV).
[5] L. Bartolomei, M. Poggi, A. Conti, and S. Mattoccia (2024) Lidar-event stereo fusion with hallucinations. In Eur. Conf. Comput. Vis. (ECCV), pp. 125–145.
[6] L. Bartolomei, F. Tosi, M. Poggi, and S. Mattoccia (2025) Stereo Anywhere: robust zero-shot deep stereo matching even where either stereo or mono fail. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR).
[7] K. Chaney, F. Cladera, Z. Wang, A. Bisulco, M. A. Hsieh, C. Korpela, V. Kumar, C. J. Taylor, and K. Daniilidis (2023) M3ED: multi-robot, multi-sensor, multi-environment event dataset. In IEEE Conf. Comput. Vis. Pattern Recog. Workshops (CVPRW), pp. 4016–4023.
[8] J. Chang and Y. Chen (2018) Pyramid stereo matching network. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 5410–5418.
[9] W. Chen, Y. Zhang, X. Sun, and F. Wu (2024) Event-based stereo depth estimation by temporal-spatial context learning. IEEE Signal Process. Lett. 31, pp. 1429–1433.
[10] Z. Chen, X. Ye, W. Yang, Z. Xu, X. Tan, Z. Zou, E. Ding, X. Zhang, and L. Huang (2021) Revealing the reciprocal relations between self-supervised stereo and monocular depth estimation. In Int. Conf. Comput. Vis. (ICCV), pp. 15529–15538.
[11] J. Cheng, L. Liu, G. Xu, X. Wang, Z. Zhang, Y. Deng, J. Zang, Y. Chen, Z. Cai, and X. Yang (2025) MonSter: marry monodepth to stereo unleashes power. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR).
[12] H. Cho, J. Kang, and K. Yoon (2024) Temporal event stereo via joint learning with stereoscopic flow. In Eur. Conf. Comput. Vis. (ECCV), pp. 294–314.
[13] H. Cho and K. Yoon (2022) Event-image fusion stereo using cross-modality feature propagation. In AAAI Conf. Artificial Intell., Vol. 36, pp. 454–462.
[14] M. Firouzi and J. Conradt (2016) Asynchronous event-based cooperative stereo matching using neuromorphic silicon retinas. Neural Process. Lett. 43 (2), pp. 311–326.
[15] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza (2022) Event-based vision: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44 (1), pp. 154–180.
[16] Y. Ge, H. Behl, J. Xu, S. Gunasekar, N. Joshi, Y. Song, X. Wang, L. Itti, and V. Vineet (2022) Neural-Sim: learning to generate training data with NeRF. In Eur. Conf. Comput. Vis. (ECCV), pp. 477–493.
[17] D. Gehrig, M. Gehrig, J. Hidalgo-Carrió, and D. Scaramuzza (2020) Video to events: recycling video datasets for event cameras. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 3586–3595.
[18] M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza (2021) DSEC: a stereo event camera dataset for driving scenarios. IEEE Robot. Autom. Lett. 6 (3), pp. 4947–4954.
[19] S. Ghosh and G. Gallego (2022) Multi-event-camera depth estimation and outlier rejection by refocused events fusion. Adv. Intell. Syst. 4 (12), pp. 2200221.
[20] S. Ghosh and G. Gallego (2025) Event-based stereo depth estimation: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 47 (10), pp. 9130–9149.
[21] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 270–279.
[22] W. Guo, Z. Li, Y. Yang, Z. Wang, R. H. Taylor, M. Unberath, A. Yuille, and Y. Li (2022) Context-enhanced stereo transformer. In Eur. Conf. Comput. Vis. (ECCV), pp. 263–279.
[23] J. Hidalgo-Carrió, D. Gehrig, and D. Scaramuzza (2020) Learning monocular dense depth from events. In Int. Conf. 3D Vision (3DV), pp. 534–542.
[24] D. Hitzges, S. Ghosh, and G. Gallego (2025) DERD-Net: learning depth from event-based ray densities. In Adv. Neural Inf. Process. Syst. (NeurIPS).
[25] B. Huang, Z. Yu, A. Chen, A. Geiger, and S. Gao (2024) 2D gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH Conf. papers, pp. 1–11.
[26] Z. Huang, L. Sun, C. Zhao, S. Li, and S. Su (2023) EventPoint: self-supervised interest point detection and description for event-based camera. In IEEE Winter Conf. Appl. Comput. Vis. (WACV), pp. 5396–5405.
[27] Y. Huo, G. Jiang, H. Wei, J. Liu, S. Zhang, H. Liu, X. Huang, M. Lu, J. Peng, D. Li, et al. (2025) EGSRAL: an enhanced 3D Gaussian splatting based renderer with automated labeling for large-scale driving scene. In AAAI Conf. Artificial Intell., Vol. 39, pp. 3860–3867.
[28] S. Ieng, J. Carneiro, M. Osswald, and R. Benosman (2018) Neuromorphic event-based generalized time-based stereovision. Front. Neurosci. 12, pp. 442.
[29] H. Jiang, Z. Lou, L. Ding, R. Xu, M. Tan, W. Jiang, and R. Huang (2025) DEFOM-stereo: depth foundation model based stereo matching. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR).
[30] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In Int. Conf. Comput. Vis. (ICCV), pp. 66–75.
[31] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis (2023) 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. on Graph. (TOG) 42 (4), pp. 139–1.
[32] J. Kogler, M. Humenberger, and C. Sulzbachner (2011) Event-based stereo matching approaches for frameless address event stereo data. In Int. Symp. Visual Computing, pp. 674–685.
[33] H. Laga, L. V. Jospin, F. Boussaid, and M. Bennamoun (2020) A survey on deep learning techniques for stereo-based depth estimation. IEEE Trans. Pattern Anal. Mach. Intell. 44 (4), pp. 1738–1764.
[34] J. L. Lee and G. H. Lee (2025) Distil-E2D: distilling image-to-depth priors for event-based monocular depth estimation. In Adv. Neural Inf. Process. Syst. (NeurIPS).
[35] A. Legrand, R. Detry, and C. De Vleeschouwer (2024) Domain generalization for 6D pose estimation through NeRF-based image synthesis. arXiv preprint arXiv:2407.10762.
[36] Y. Li, Z. Huang, S. Chen, X. Shi, H. Li, H. Bao, Z. Cui, and G. Zhang (2023) Blinkflow: a dataset to push the limits of event-based optical flow estimation. In IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), pp. 3881–3888.
[37] Y. Li, C. Feng, Z. Tang, K. Deng, W. Yu, Y. Tian, and L. Yuan (2025) GS2E: Gaussian splatting is an effective data generator for event stream generation. In Adv. Neural Inf. Process. Syst. (NeurIPS).
[38] Z. Li, X. Liu, N. Drenkow, A. Ding, F. X. Creighton, R. H. Taylor, and M. Unberath (2021) Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In Int. Conf. Comput. Vis. (ICCV), pp. 6197–6206.
[39] P. Lichtsteiner, C. Posch, and T. Delbruck (2008) A 128 $\times$ 128 120 dB 15 $\mu$s latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 43 (2), pp. 566–576.
[40] H. Lin, S. Chen, J. H. Liew, D. Y. Chen, Z. Li, Y. Zhao, S. Peng, H. Guo, X. Zhou, G. Shi, J. Feng, and B. Kang (2026) Depth Anything 3: recovering the visual space from any views. In Int. Conf. Learn. Representations (ICLR).
[41] H. Ling, Y. Sun, Q. Sun, I. Tsang, and Y. Zheng (2024) Self-assessed generation: trustworthy label generation for optical flow and stereo matching in real-world. arXiv preprint arXiv:2410.10453.
[42] L. Lipson, Z. Teed, and J. Deng (2021) Raft-stereo: multilevel recurrent field transforms for stereo matching. In Int. Conf. 3D Vision (3DV), pp. 218–227.
[43] H. Lou, J. Liang, M. Teng, B. Fan, Y. Xu, and B. Shi (2024) Zero-shot event-intensity asymmetric stereo via visual prompting from image domain. In Adv. Neural Inf. Process. Syst. (NeurIPS), Vol. 37, pp. 13274–13301.
[44] W. Luo, A. G. Schwing, and R. Urtasun (2016) Efficient deep learning for stereo matching. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 5695–5703.
[45] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox (2016) A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 4040–4048.
[46] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2021) NeRF: representing scenes as neural radiance fields for view synthesis. Comm. of the ACM 65 (1), pp. 99–106.
[47] M. Mostafavi, K. Yoon, and J. Choi (2021) Event-intensity stereo: estimating depth by the best of both worlds. In Int. Conf. Comput. Vis. (ICCV), pp. 4258–4267.
[48] T. Müller, A. Evans, C. Schied, and A. Keller (2022) Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. 41 (4), pp. 102:1–102:15.
[49] Y. Nam, M. Mostafavi, K. Yoon, and J. Choi (2022) Stereo depth from events cameras: concentrate and focus on the future. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 6114–6123.
[50] M. Osswald, S. Ieng, R. Benosman, and G. Indiveri (2017) A spiking neural network model of 3D perception for event-based neuromorphic stereo vision systems. Scientific reports 7 (1), pp. 40703. Cited by: §2.
[51] M. Poggi, F. Tosi, K. Batsos, P. Mordohai, and S. Mattoccia (2021) On the synergies between machine learning and binocular stereo for depth estimation from images: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 44 (9), pp. 5314–5334. Cited by: §2.
[52] M. Poggi and F. Tosi (2024) Federated online adaptation for deep stereo. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Cited by: §2.
[53] C. Posch, T. Serrano-Gotarredona, B. Linares-Barranco, and T. Delbruck (2014-10) Retinomorphic event-based vision sensors: bioinspired cameras with spiking output. Proc. IEEE 102 (10), pp. 1470–1484. External Links: Document Cited by: §1.
[54] P. Rogister, R. Benosman, S. Ieng, P. Lichtsteiner, and T. Delbruck (2011) Asynchronous event-based binocular stereo matching. IEEE Trans. Neural Netw. Learn. Syst. 23 (2), pp. 347–353. Cited by: §2.
[55] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nešić, X. Wang, and P. Westling (2014) High-resolution stereo datasets with subpixel-accurate ground truth. In Pattern Recognition: 36th German Conference, GCPR 2014, Münster, Germany, September 2-5, 2014, Proceedings 36, pp. 31–42. Cited by: §1.
[56] D. Scharstein and R. Szeliski (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47, pp. 7–42. Cited by: §2.
[57] J. L. Schonberger and J. Frahm (2016) Structure-from-motion revisited. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 4104–4113. Cited by: §3.1.1, 4th item.
[58] S. Schraml, P. Schön, and N. Milosevic (2007) Smartcam for real-time stereo vision-address-event based embedded system. In Int. Conf. Computer Vision Theory and Applications, Vol. 2, pp. 466–471. Cited by: §2.
[59] C. Sun, J. Choe, C. Loop, W. Ma, and Y. F. Wang (2025) Sparse voxels rasterization: real-time high-fidelity radiance field rendering. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), Cited by: §1, Figure 3, Figure 3, §2, §3.1.1, §3.1.1, §3.1.1, §3, §6.1, §6.2, Table 7, Table 9.
[60] A. Tonioni, M. Poggi, S. Mattoccia, and L. Di Stefano (2017) Unsupervised adaptation for deep stereo. In Int. Conf. Comput. Vis. (ICCV), pp. 1605–1613. Cited by: §2.
[61] F. Tosi, F. Aleotti, P. Z. Ramirez, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano (2024) Neural disparity refinement. IEEE Trans. Pattern Anal. Mach. Intell.. Cited by: §2.
[62] F. Tosi, L. Bartolomei, and M. Poggi (2025) A survey on deep stereo matching in the twenties. Int. J. Comput. Vis., pp. 1–32. Cited by: §1, §2.
[63] F. Tosi, A. Tonioni, D. De Gregorio, and M. Poggi (2023) NeRF-supervised deep stereo. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 855–866. Cited by: Figure 1, Figure 1, §2, §2, Figure 4, Figure 4, Figure 4, §3.1.1, §3.1.1, §3.1.1, §3.1.1, Table 1, Table 2, §4.1, §4.1, §4.1, 3rd item, §6.2, §6.2, §6.2, Table 8, Table 8, §8.2, §8, Figure 10, Figure 10, §9.1.
[64] S. Tulyakov, F. Fleuret, M. Kiefel, P. Gehler, and M. Hirsch (2019) Learning an event sequence embedding for dense event-based deep stereo. In Int. Conf. Comput. Vis. (ICCV), pp. 1527–1537. External Links: Document Cited by: §2.
[65] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer (2020) TartanAir: a dataset to push the limits of visual SLAM. In IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), pp. 4909–4916. External Links: Document Cited by: §2, Table 9.
[66] X. Wang, G. Xu, H. Jia, and X. Yang (2024) Selective-stereo: adaptive frequency information selection for stereo matching. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). Cited by: §2.
[67] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu (2019) UnOS: unified unsupervised optical-flow and stereo-depth estimation by watching videos. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). Cited by: §2.
[68] Z. Wang, L. Pan, Y. Ng, Z. Zhuang, and R. Mahony (2021) Stereo hybrid event-frame (SHEF) cameras for 3D perception. In IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), pp. 9758–9764. Cited by: §2.
[69] B. Wen, M. Trepte, J. Aribido, J. Kautz, O. Gallo, and S. Birchfield (2025) FoundationStereo: zero-shot stereo matching. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR). Cited by: §1, §2, §3.1.2, §3.1.2, §3, §4.1, §4.1, Table 5, Table 5, 2nd item, §8.3, §8, Figure 9, Figure 9, §9.1, §9.2, §9.3.
[70] G. Xu, X. Wang, X. Ding, and X. Yang (2023) Iterative geometry encoding volume for stereo matching. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 21919–21928. Cited by: §2.
[71] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, F. Yu, D. Tao, and A. Geiger (2023) Unifying flow, stereo and depth estimation. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §2.
[72] H. Xu and J. Zhang (2020) AANet: adaptive aggregation network for efficient stereo matching. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 1959–1968. Cited by: §2.
[73] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao (2024) Depth anything v2. In Adv. Neural Inf. Process. Syst. (NeurIPS), Vol. 37, pp. 21875–21911. Cited by: §3.1.1, 2nd item, §6.1.
[74] L. Yen-Chen, P. Florence, J. T. Barron, T. Lin, A. Rodriguez, and P. Isola (2022) NeRF-supervision: learning dense object descriptors from neural radiance fields. In IEEE Int. Conf. Robot. Autom. (ICRA), pp. 6496–6503. Cited by: §2.
[75] C. Yeshwanth, Y. Liu, M. Nießner, and A. Dai (2023) ScanNet++: a high-fidelity dataset of 3D indoor scenes. In Int. Conf. Comput. Vis. (ICCV), pp. 12–22. Cited by: Figure 1, Figure 1, Figure 4, Figure 4, Figure 4, §3.1.1, Table 1, §4.1, §4.1, §4.1, 3rd item, §6.1, §6.2, Table 8, Table 8, §8.2, §8, Figure 11, Figure 11, §9.1.
[76] J. Zbontar and Y. LeCun (2015) Computing the stereo matching cost with a convolutional neural network. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 1592–1599. Cited by: §2.
[77] D. Zhang, Q. Ding, P. Duan, C. Zhou, and B. Shi (2022) Data association between event streams and intensity frames under diverse baselines. In Eur. Conf. Comput. Vis. (ECCV), pp. 72–90. Cited by: §2.
[78] F. Zhang, X. Qi, R. Yang, V. Prisacariu, B. Wah, and P. Torr (2020) Domain-invariant stereo matching networks. In Eur. Conf. Comput. Vis. (ECCV). Cited by: §2.
[79] J. Zhang, S. Singh, et al. (2014) LOAM: lidar odometry and mapping in real-time. In Robotics: Science and Systems, Vol. 2, pp. 1–9. Cited by: §4.2.
[80] J. Zhang, X. Wang, X. Bai, C. Wang, L. Huang, Y. Chen, L. Gu, J. Zhou, T. Harada, and E. R. Hancock (2022) Revisiting domain generalized stereo matching networks from a feature consistency perspective. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 13001–13011. Cited by: §2.
[81] K. Zhang, K. Che, J. Zhang, J. Cheng, Z. Zhang, Q. Guo, and L. Leng (2022) Discrete time convolution for fast event-based stereo. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 8676–8686. Cited by: §2.
[82] P. Zhang, L. Zhu, X. Wang, L. Wang, and H. Huang (2025) EMatch: a unified framework for event-based optical flow and stereo matching. In Int. Conf. Comput. Vis. (ICCV), pp. 5845–5855. Cited by: Figure 1, Figure 1, §2, Table 2, Figure 6, §4.1, §4.3, Table 3, Table 4, Table 10, 2nd item, §8.3, §8, Figure 12, Figure 13, Figure 14.
[83] S. Zhi, T. Laidlow, S. Leutenegger, and A. J. Davison (2021) In-place scene labelling and understanding with implicit scene representation. In Int. Conf. Comput. Vis. (ICCV), pp. 15838–15847. Cited by: §2.
[84] A. Z. Zhu, D. Thakur, T. Özaslan, B. Pfrommer, V. Kumar, and K. Daniilidis (2018) The multivehicle stereo event camera dataset: an event camera dataset for 3D perception. IEEE Robot. Autom. Lett. 3 (3), pp. 2032–2039. Cited by: Figure 2, Figure 2, §2, §3.1.2, §3.1.2, Figure 6, Figure 6, Figure 6, §4.2, Table 4, Table 4, 3rd item, §8, Figure 14, Figure 14, §9.2.
[85] A. Z. Zhu, L. Yuan, K. Chaney, and K. Daniilidis (2019) Unsupervised event-based learning of optical flow, depth, and egomotion. In IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), pp. 989–997. External Links: Document Cited by: §4.1.
[86] J. Zhu, T. Pan, Z. Cao, Y. Liu, J. T. Kwok, and H. Xiong (2025) Depth any event stream: enhancing event-based monocular depth estimation via dense-to-sparse distillation. In Int. Conf. Comput. Vis. (ICCV), pp. 5146–5155. Cited by: §2.

[Figure: qualitative comparison on M3ED [7] and MVSEC [84]. Columns: Events & Ground Truth, SE-CFF [49], EMatch [82], E-StereoAnywhere, E-FoundationStereo. Rows, for each dataset: LiDAR (GT) and MIX 4.]


[Figure panels: 2D top-view showing the $\alpha$-shape (black); final 3D global trajectory inside the scene.]