[1] Alaya Studio, Shanda AI Research Tokyo
[2] National Taiwan University [3] The University of Tokyo [4] National Yang Ming Chiao Tung University

Generative World Renderer

Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang
zhixiang.wang@shanda.com

(April 2, 2026)

Abstract

Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.

Project page: https://alaya-studio.github.io/renderer/

Code: https://github.com/ShandaAI/AlayaRenderer

Correspondence: Zhixiang Wang <zhixiang.wang@shanda.com>

Figure 1: We present a large-scale dataset curated from game engines to support scalable generative world rendering. The dataset provides high-resolution RGB videos with aligned G-buffers, covering continuous and dynamic scenes, long temporal trajectories, and diverse visual conditions.

1 Introduction

Digital world modeling involves two fundamental tasks: forward rendering, which synthesizes photorealistic images from scene attributes (geometry, materials, and lighting) via the rendering equation kajiya1986rendering ; and inverse rendering, which decomposes observed images back into these physical components. Recent advancements in generative models have begun to bridge this gap, treating rendering and its inverse as two sides of the same coin within a unified framework liang2025diffusion ; chen2024unirendere . At the heart of this unification lies the G-buffer—a rich, intermediate representation that provides explicit geometric and material guidance for controllable synthesis while serving as the supervision target for decomposition.

Despite the conceptual elegance of this unification, scaling bidirectional rendering to “in-the-wild” scenarios remains a formidable challenge. The primary bottleneck is data: the scarcity of large-scale, diverse, and temporally continuous video sequences synchronized with high-fidelity ground-truth G-buffers. Existing synthetic datasets often feature limited scene complexity, static camera trajectories, simplified material models, and a lack of adverse weather conditions like fog, rain, or snow. These limitations lead to a persistent domain gap, where models fail to handle the long-tail complexity of real-world videos, such as imperfect “delighting” in cluttered environments, fine-grained vegetation geometry, or temporal flickering under rapid motion. As illustrated in Figure 2, these data-starved models struggle to maintain physical plausibility and temporal coherence. This data bottleneck caps the potential of current models to tackle real-world complexity and serve as generative world renderers.

In this paper, we address this data bottleneck by introducing a large-scale, continuous video dataset curated from two AAA games, specifically designed to advance both video inverse rendering and G-buffer-conditioned forward synthesis. Our dataset comprises over 4M frames at 720p/30fps, featuring five synchronized G-buffer channels (depth, normals, albedo, metallic, and roughness) aligned with high-quality RGB frames. Unlike previous short-clip collections, our data consists of long, uninterrupted sequences across diverse urban and natural environments under varying atmospheric conditions (e.g., sunny, rainy, foggy, sunset). We develop a non-intrusive pipeline that intercepts runtime G-buffers at the rendering API level, bypassing the need for decompilation or asset extraction. We employ a dual-screen stitched capture strategy to record high-resolution buffers with minimal quality loss.

Crucially, this dataset enables a leap in bidirectional capability. For inverse rendering, it provides the dense supervision necessary for robust material decomposition in complex scenes. For forward rendering, it allows generative models to learn a flexible prior that transcends rigid geometry; for instance, our model can leverage G-buffers to synthesize complex volumetric effects (e.g., fog and rain) that are often omitted in simplified physics-based renderers.

To facilitate wider practical applicability, our dataset pairs clean RGB frames with synthesized motion-blur variants, ensuring that models trained on our data remain resilient to common real-world imaging degradations. Furthermore, recognizing the inherent challenges in evaluating real-world performance, we introduce a VLM-based evaluation protocol. This framework systematically assesses semantic correctness, spatial fidelity, and temporal consistency, demonstrating a strong correlation with human preferences in scenarios where traditional per-frame metrics fall short.

Contributions.

Our contributions are three-fold:

• A Large-Scale Dataset: A continuous, high-fidelity G-buffer and video dataset featuring 4M frames with rich dynamics, diverse weather, and long-term temporal coherence.
• An Efficient Data Curation Pipeline: A novel capture framework based on graphics API interception and dual-screen stitching that enables scalable acquisition of high-resolution G-buffers.
• Enhanced Rendering Performance & Evaluation: Evidence that fine-tuning on our data significantly improves state-of-the-art models (e.g., DiffusionRenderer) in both decomposition and controllable editing, supported by a new VLM-based ranking protocol for real-world assessment.

2 Related Work

Inverse Rendering and Forward Rendering Methods.

Forward rendering synthesizes images from scene attributes by solving the rendering equation kajiya1986rendering , typically via Monte Carlo path tracing with microfacet BRDF models cook1982reflectance ; walter2007microfacet as systematized in modern rendering pipelines pharr2016pbr ; akenine2018realtime . On the sampling front, spatiotemporal reservoir resampling (ReSTIR) bitterli2020restir ; lin2022gris has dramatically accelerated real-time path tracing by reusing samples across pixels and frames, enabling interactive rendering of scenes with millions of dynamic lights—including the Cyberpunk 2077 environments from which our dataset is collected. The broader neural rendering paradigm tewari2022advances has progressively augmented or replaced components of the classical pipeline. Early works learn to interpret neural textures via deferred shading nalbach2017deep ; thies2019deferred and extend this to free-viewpoint relighting gao2020deferred and city-scale lighting factorization liu2020factorize ; neural appearance models then compress complex layered SVBRDFs into compact latent textures decoded by small MLPs for real-time BRDF evaluation, importance sampling, and level-of-detail filtering rainer2019neural ; kuznetsov2021neumip ; sztrajman2021neural ; fan2022neural ; zeltner2024neural ; xu2025neuralmaterials ; neural radiance caching accelerates global illumination at path vertices muller2021nrc ; and end-to-end neural renderers such as RenderFormer zeng2025renderformer directly render triangle meshes with full global illumination via a transformer, without per-scene training. Most recently, diffusion models have been repurposed as data-driven generative renderers that directly map G-buffers and lighting descriptions to photorealistic images liang2025diffusion ; chen2024unirendere ; zeng2024rgb , or augment the forward pass with physics-inspired constraints pilight2026 . 
Unlike classical path tracers, these generative approaches implicitly learn complex light transport—including volumetric scattering, global illumination, and view-dependent effects—from paired supervision, bypassing explicit material models and costly per-sample Monte Carlo integration. This paradigm also extends to controllable relighting and scene editing jin2024neural ; zhang2025scaling ; he2025unirelight ; physicalrelighting2025 , yet its scalability is fundamentally gated by the availability of large-scale, temporally continuous G-buffer–RGB pairs—precisely the gap our dataset addresses.

Inverse rendering decomposes images into geometry, reflectance, and materials for relighting and editing. Early optimization methods land1971lightness ; barron2014shape ; grosse2009ground struggle with real-world complexity. Learning-based approaches improved intrinsic decomposition li2018cgintrinsics ; sengupta2019neural ; li2020inverse ; li2021openrooms ; careaga2024colorful and material estimation li2018materials ; deschaintre2018single ; boss2020two ; lopes2024material using synthetic supervision, while neural fields enabled joint optimization via shape-reflectance factorization zhang2021nerfactor , tensorial representations jin2023tensoir , room-scale decomposition ye2023intrinsicnerf , neural SDFs zhu2023i2 , Gaussian splatting with BRDF decomposition liang2024gs ; gao2024relightable ; saito2024relightable , and physics-based losses wu2025pbr —though all require per-scene optimization. Diffusion models improved generalization through joint intrinsic prediction luo2024intrinsicdiffusion , bidirectional material decomposition zeng2024rgb , probabilistic formulations kocsis2024intrinsic , stochastic inverse rendering enyo2024diffusion , lighting-material disambiguation chen2024intrinsicanything , multi-view intrinsic decomposition li2024idarb , SVBRDF synthesis vecchio2024matfuse ; sartor2023matfusion ; vecchio2024controlmat , single-image PBR extraction lopes2024material ; litman2025materialfusion , text-to-PBR generation kocsis2025intrinsix , and multi-view G-buffer estimation wang2025mage ; he2025neural . On the relighting side, diffusion-based methods enable object-level relighting jin2024neural , illumination editing with consistent light transport zhang2025scaling , indoor scene relighting xing2025luminet , video relighting he2025unirelight , and dynamic human rendering wang2024intrinsicavatar . 
Recent video diffusion models enable temporally consistent inverse rendering liang2025diffusion and video-level PBR material extraction munkberg2025videomat , but train on short, object-centric clips—our dataset provides the long, continuous sequences with complex dynamics these methods require.

Datasets and Game-Based Collection.

Acquiring real-world G-buffers remains challenging, making synthetic datasets essential. Indoor datasets provide disentangled reflectance roberts2021hypersim , SVBRDF annotations li2021openrooms ; zhu2022learning , laser scans yeshwanth2023scannet++ , scalable mid-level vision data eftekhar2021omnidata , and controllable lighting at scale li2018interiornet . Material resources offer PBR collections vecchio2024matsynth ; ma2023opensvbrdf ; zhou2023photomat , relighting benchmarks ummenhofer2024objects ; ren2022diligent102 , while outdoor datasets cover city-scale scenes li2023matrixcity ; wang2025lightcity , aerial imagery liu2021urbanscene3d , and driving scenarios barua2025gta . Object-centric resources include video frames ling2024dl3dv , 3D objects deitke2023objaverse ; wu2023omniobject3d , outdoor 3DGS data xiong2024gauu , and depth/flow benchmarks wang2020tartanair ; mehl2023spring . Procedurally generated datasets raistrick2024infinigen ; raistrick2024infinigen provide multi-modal ground truth (depth, normals, albedo) without external assets, and configurable simulation platforms ge2024behavior offer adjustable scene parameters, though both lack the visual fidelity and content diversity of artist-crafted game worlds. Long synthetic video datasets with multi-modal annotations zheng2023pointodyssey ; yang2024depth further demonstrate the value of temporal supervision at scale. Games enable photorealistic data extraction via graphics interception richter2016playing ; richter2017playing , DirectX injection krahenbuhl2018free , engine plugins qiu2017unrealcv ; ros2016synthia ; huang2018deepmvs ; dosovitskiy2017carla ; shah2017airsim ; pollok2019unrealgt , and ReShade/OBS pipelines zhou2025omniworld , with domain adaptation addressing sim-to-real gaps tobin2017domain ; tremblay2018training ; hoyer2022daformer ; mikami2021scaling . 
However, existing datasets remain image-centric or provide sparse channels with short sequences—we extract synchronized multi-channel G-buffers (depth, normals, albedo, metallic, roughness) as continuous long-duration video from AAA games.

Temporal Consistency and Depth Estimation.

Real videos exhibit motion blur from finite exposure; MPI-Sintel’s clean or degraded passes butler2012naturalistic and high-resolution synthetic benchmarks mehl2023spring establish the design philosophy we follow, generating blur via frame interpolation huang2022real ; jiang2018super ; reda2022film . Temporal consistency methods span recurrent networks lai2018learning , deep video priors lei2020blind , video diffusion blattmann2023align ; blattmann2023stable , feature propagation geyer2023tokenflow ; qi2023fatezero , spatial-temporal constraints yang2024fresco ; yang2023rerender , token merging li2024vidtome , temporal transformers yan2023temporally , content deformation fields ouyang2024codef , flow-based methods liang2024flowvid , streaming video translation liang2024looking , and correspondence-guided diffusion chu2024medm . Depth estimation has advanced with foundation models yang2024depth , diffusion priors ke2024repurposing ; bochkovskii2024depth ; he2024lotus , joint depth-normal prediction hu2024metric3d ; fu2024geowizard , surface normal estimation bae2024rethinking ; ye2024stablenormal , optical flow wang2024sea ; dong2024memflow , and temporally consistent video depth chen2025video ; hu2025depthcrafter ; shao2025learning ; wang2024nvds . Recent work further extends video diffusion priors to temporally consistent normal estimation bin2025normalcrafter . Our long sequences with ground-truth geometry support training and validating such temporally-aware methods.

Evaluation Protocols.

Standard metrics (PSNR, LPIPS zhang2018unreasonable ) miss cross-buffer consistency, while perceptual metrics fu2023dreamsim and video metrics (FVD unterthiner2019fvd ) show quality-consistency trade-offs ge2024content ; huang2024vbench . For real videos without ground truth, VLMs enable semantic evaluation via quality assessment wang2023exploring ; wu2023q ; wu2023qalign , faithfulness VQA hu2023tifa , compositional benchmarks huang2023t2i , 3D evaluation wu2024gpt , preference learning xu2023imagereward , and multimodal depiction deqa_score ; depictqa_v2 ; depictqa_v1 . More broadly, the LLM-as-a-Judge paradigm zheng2023judging ; liu2023g has been extended to vision-language settings chen2024mllm ; lee2024prometheus , video quality understanding he2024videoscore ; zhang2024q ; wang2025aigv , and open evaluation platforms for generative models jiang2024genai . We introduce a VLM-based ranking protocol targeting material channels (metallic, roughness) where VLM priors provide meaningful common-sense judgments.

3 Dataset Construction

3.1 G-buffer Interception

We use ReShade reshade to intercept the rendering pipeline at the graphics API level. However, extracting complete G-buffers is non-trivial because modern games employ engine- and title-specific G-buffer packing and encodings, with no standardized layout across titles. To address this, we first conduct offline frame analysis with RenderDoc renderdoc to identify the candidate render passes and their associated render-target attachments, including their formats, dimensions, and sample counts.

Based on offline frame inspection and pass tracing, we implement game-specific ReShade add-ons that hook graphics API callbacks to monitor per-frame render-target bindings. During runtime, we maintain a small pool of candidate attachments and GPU-copy only those that satisfy stable invariants: a consistent format, a consistent extent, and a recurrent binding pattern. To disambiguate the desired render target, we log lightweight per-frame signatures, including format and extent stability and pass-local draw-call spans. After selection, we bind the selected render target as an input texture to the ReShade effect runtime; the effect shader can then shade it to the screen.
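The selection logic described above can be illustrated with a small simulation. This is only a sketch of the "stable invariants" idea under our own assumptions: the class name `CandidateTracker`, the `min_frames` threshold, and the string handles are hypothetical, and the code does not use the ReShade add-on API (real add-ons are written in C++ against graphics API callbacks).

```python
from collections import defaultdict


class CandidateTracker:
    """Keep render-target candidates whose format and extent stay stable.

    Hypothetical illustration only: handles, formats, and the min_frames
    threshold are assumptions, not values from the capture pipeline.
    """

    def __init__(self, min_frames=30):
        self.min_frames = min_frames
        # handle -> list of (format, extent) signatures, one per frame
        self.history = defaultdict(list)

    def observe(self, frame_bindings):
        """Record the render targets bound during one frame.

        frame_bindings maps a handle to (format, (width, height)).
        """
        for handle, signature in frame_bindings.items():
            self.history[handle].append(signature)
        # A candidate that is not bound this frame breaks the recurrent
        # binding pattern, so it is dropped from the pool.
        for handle in list(self.history):
            if handle not in frame_bindings:
                del self.history[handle]

    def stable_candidates(self):
        """Handles seen for min_frames frames with one unique signature."""
        return sorted(
            handle
            for handle, signatures in self.history.items()
            if len(signatures) >= self.min_frames and len(set(signatures)) == 1
        )
```

In this sketch a candidate is discarded as soon as it misses a frame or changes its format or extent; a real add-on would likely tolerate occasional misses and would also use the pass-local draw-call spans mentioned above.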

In our capture pipeline, the only normal information we can reliably obtain from the rendering pipeline is a world-space normal buffer. We adopt camera-space normals to match prior inverse rendering work liang2025diffusion , but cannot convert world to camera space due to the lack of reliable access to the view matrix. We therefore reconstruct camera-space normals from depth by inverse projection and finite differences during the ReShade effects stage:

$$\mathbf{n}=\operatorname{normalize}\!\left(\frac{\partial\mathbf{P}}{\partial x}\times\frac{\partial\mathbf{P}}{\partial y}\right), \tag{1}$$

where $\mathbf{P}$ is the view-space position reconstructed from the depth buffer.
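As a minimal NumPy sketch of Eq. (1), the function below back-projects a depth map to view-space positions and takes the cross product of their finite differences. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy` and linear depth along the camera z-axis; the actual ReShade shader and the game's projection are not specified here, and the sign of the resulting normal depends on the chosen axis convention.

```python
import numpy as np


def normals_from_depth(depth, fx, fy, cx, cy):
    """Estimate camera-space normals from a linear depth map (Eq. 1).

    depth: (H, W) array of view-space z values.
    fx, fy, cx, cy: hypothetical pinhole intrinsics (an assumption).
    Returns an (H, W, 3) array of unit normals.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    # Inverse projection: pixel (u, v) with depth z -> view-space point P.
    P = np.stack([(u - cx) / fx * depth,
                  (v - cy) / fy * depth,
                  depth], axis=-1)
    # Finite differences along image x (columns) and image y (rows).
    dP_dx = np.gradient(P, axis=1)
    dP_dy = np.gradient(P, axis=0)
    n = np.cross(dP_dx, dP_dy)
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(length, 1e-12)
```

For a fronto-parallel plane (constant depth), this returns the normal (0, 0, 1) at every pixel under the convention above. Finite differences blur normals across depth discontinuities, which is an inherent limitation of depth-derived normals.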

Beyond the geometry, exporting material channels introduces additional reliability challenges. Metallic and roughness are often packed into different channels of a single G-buffer render target. Recording these channels via screen-capture video can cause channel coupling artifacts (e.g., inter-channel bleeding) even under near-lossless settings. We therefore decouple the maps and render them into spatially distinct screen regions, ensuring that compression noise does not cross-contaminate material properties.

3.2 Synchronized Multi-Screen Recording

Directly exporting multi-channel G-buffers per frame is prohibitively costly due to storage bandwidth, file-management overhead, and GPU-to-CPU readback stalls. Instead, we shade target buffers to the screen and record them with hardware-accelerated capture, enabling scalable, temporally synchronized acquisition without modifying the game engine.

To ensure strict temporal synchronization across all six data channels, we employ a “mosaic” compositing strategy. We render all G-buffers onto a unified canvas captured via OBS (Open Broadcaster Software) at a near-lossless bitrate. To overcome the resolution limits of a single display, we stitch two 2K monitors, allowing us to record each channel at an effective resolution of 720p. Since expanding the display area inherently increases the game’s field of view, we apply a center-crop to the source buffers before tiling them onto the final output, preserving the intended aspect ratio.
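The mosaic idea can be sketched as follows: each channel is center-cropped to a fixed tile size and written into its own region of a single canvas, so that one recorded video frame carries all channels with identical timing. The grid layout (`rows`, `cols`) and the helper names are assumptions for illustration; the paper does not specify the actual tile arrangement.

```python
import numpy as np


def center_crop(img, out_h, out_w):
    """Crop the central out_h x out_w window (img must be at least that big)."""
    h, w = img.shape[:2]
    top, left = (h - out_h) // 2, (w - out_w) // 2
    return img[top:top + out_h, left:left + out_w]


def tile_mosaic(channels, rows, cols, tile_h=720, tile_w=1280):
    """Place center-cropped channel images on one canvas, row by row."""
    if len(channels) > rows * cols:
        raise ValueError("not enough tiles for all channels")
    canvas = np.zeros((rows * tile_h, cols * tile_w, 3), dtype=np.uint8)
    for idx, img in enumerate(channels):
        r, c = divmod(idx, cols)
        canvas[r * tile_h:(r + 1) * tile_h,
               c * tile_w:(c + 1) * tile_w] = center_crop(img, tile_h, tile_w)
    return canvas


def split_mosaic(canvas, num_channels, cols, tile_h=720, tile_w=1280):
    """Recover the individual tiles from a recorded mosaic frame."""
    tiles = []
    for idx in range(num_channels):
        r, c = divmod(idx, cols)
        tiles.append(canvas[r * tile_h:(r + 1) * tile_h,
                            c * tile_w:(c + 1) * tile_w])
    return tiles
```

Because every channel of a given time step lives in the same video frame, synchronization reduces to slicing the canvas after decoding.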

3.3 Scene Traversal Strategies

We collect data from Cyberpunk 2077 and Black Myth: Wukong with distinct traversal strategies to maximize diversity. For Cyberpunk 2077, we use a semi-automated driving setup and define long-range waypoints to generate continuous trajectories with variable speeds, yielding rich temporal dynamics. Beyond driving, we also capture sequences in which the player walks along streets and through indoor scenes to broaden the coverage of viewpoints and environments. For Black Myth: Wukong, we capture exploration sequences from completed save files. We deliberately avoid combat and instead traverse a wide range of environments and routes to obtain diverse appearances and scene content.

3.4 Dataset Statistics

We provide rich annotations for the dataset, covering scene, weather, camera and scene motion, and texture. We first deploy Qwen3-VL-235B-A22B-Instruct Qwen3-VL using vLLM kwon2023efficient . For each video clip, we uniformly sample five frames along the temporal axis and feed the timestamped frames into Qwen to obtain the corresponding annotations.
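Uniform temporal sampling of a clip can be sketched as below. Whether the first and last frames are included is not stated in the text, so including both endpoints is an assumption, as are the function names and the default frame rate.

```python
def uniform_frame_indices(num_frames, num_samples=5):
    """Pick num_samples frame indices evenly spaced over a clip.

    Assumption: both the first and the last frame are included.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [num_frames // 2]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def frame_timestamps(indices, fps=30.0):
    """Convert frame indices to timestamps in seconds."""
    return [i / fps for i in indices]
```

For a 57-frame clip, `uniform_frame_indices(57)` selects frames 0, 14, 28, 42, and 56.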

Annotation labels.

For each clip, we annotate four categorical attributes. The texture summarizes the dominant material/appearance cues (e.g., plastic, metal, brick, sky, painted, stone, glass, …); the weather labels of clips include sunny, cloudy, foggy, rainy, snowy; scene indicates whether the clip is indoor or outdoor; and motion describes camera–scene dynamics with four cases: camera static scene moving, camera moving scene moving, camera moving scene static, and both static. Together, these attributes capture diverse environmental settings and visual characteristics, highlighting the broad variability of our dataset across conditions and content.

Distribution Analysis.

We analyze the distributions of metallic and roughness in the two game sources. As shown in Figure 3, Cyberpunk 2077 exhibits a higher proportion of pixels with large metallic values than Black Myth: Wukong, while Black Myth: Wukong contains more high-roughness regions. This trend is consistent with their dominant visual themes: Cyberpunk 2077 features metal-rich urban environments, whereas Black Myth: Wukong more often depicts natural scenes with rough, diffuse materials. Together, these two domains provide complementary coverage of common real-world material appearances. We further analyze pixel brightness using luminance and the HSV Value channel. Figure 3 shows that Black Myth: Wukong concentrates at low values, consistent with outdoor scenes where occlusions yield more shadowed regions. In contrast, Cyberpunk 2077 is more balanced across the range.

Finally, we filter out clips annotated as having both a static scene and a static camera, and we exclude frames with excessively low luminance.
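The brightness statistics and the low-luminance filter can be sketched as follows. The Rec. 709 luma weights, the threshold `min_luma`, and the function names are assumptions for illustration; the paper does not state which luminance formula or cutoff is used.

```python
import numpy as np

# Rec. 709 luminance weights for (R, G, B); which formula the dataset
# statistics use is an assumption here.
REC709 = np.array([0.2126, 0.7152, 0.0722])


def mean_luminance(frame):
    """Mean luminance of a float RGB frame with values in [0, 1]."""
    return float((frame @ REC709).mean())


def hsv_value(frame):
    """HSV Value channel: the per-pixel maximum over R, G, B."""
    return frame.max(axis=-1)


def keep_frame(frame, min_luma=0.05):
    """Drop frames that are too dark (min_luma is a hypothetical cutoff)."""
    return mean_luminance(frame) >= min_luma
```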

3.5 Dataset Post-processing

To maximize dataset reusability for downstream pixel-aligned tasks, we capture RGB frames with the engine motion blur disabled, providing sharp canonical observations with clean temporal correspondences. Since real videos often contain camera-induced motion blur, we additionally release an offline motion-blurred RGB variant to reduce this domain gap.

We approximate exposure integration by first interpolating $8$ RGB sub-frames with RIFE huang2022real , averaging them in the linear domain, and converting back to RGB:

$$I^{\text{blur}}_{t}=\mathrm{RGB}\!\Big(\tfrac{1}{K}\sum_{i=1}^{K}\mathrm{Lin}\!\big(\tilde{I}_{t,i}\big)\Big), \tag{2}$$

where $\tilde{I}_{t,i}$ denotes the $i$-th RIFE-interpolated RGB sub-frame at time $t$, $K=8$ is the number of sub-frames, and $\mathrm{Lin}(\cdot)$ / $\mathrm{RGB}(\cdot)$ are the RGB $\leftrightarrow$ linear conversions.
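A minimal sketch of the averaging step in Eq. (2) is given below. It assumes that $\mathrm{Lin}$ and $\mathrm{RGB}$ are the standard sRGB transfer functions, which the text does not state explicitly, and it takes the interpolated sub-frames as input rather than calling RIFE.

```python
import numpy as np


def srgb_to_linear(x):
    """sRGB-encoded values in [0, 1] -> linear light (assumed Lin)."""
    x = np.clip(np.asarray(x, dtype=np.float64), 0.0, 1.0)
    return np.where(x <= 0.04045, x / 12.92, ((x + 0.055) / 1.055) ** 2.4)


def linear_to_srgb(y):
    """Linear light in [0, 1] -> sRGB-encoded values (assumed RGB)."""
    y = np.clip(np.asarray(y, dtype=np.float64), 0.0, 1.0)
    return np.where(y <= 0.0031308, y * 12.92,
                    1.055 * np.power(y, 1.0 / 2.4) - 0.055)


def synthesize_motion_blur(subframes):
    """Average K sub-frames in the linear domain, as in Eq. (2).

    subframes: sequence of (H, W, 3) float arrays in [0, 1], e.g. the
    interpolated frames between two consecutive captured frames.
    """
    linear = srgb_to_linear(np.stack(subframes, axis=0))
    return linear_to_srgb(linear.mean(axis=0))
```

Averaging in the linear domain matters because light adds linearly during exposure: the average of a black and a white sub-frame encodes to roughly 0.74 in sRGB, noticeably brighter than the 0.5 obtained by averaging the encoded values directly.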

4 VLM-based Evaluation on Real-Scene Test Cases

Validating that our dataset improves real-world generalization requires evaluating material predictions on real captures, where ground truth is generally unavailable. User studies offer one alternative, but they scale poorly: judging material properties in complex scenes demands domain expertise and substantial annotation effort.

In contrast, modern vision-language models (VLMs) encode extensive material-related world knowledge and can serve as scalable judges for relative comparisons without requiring ground truth. We focus on metallic and roughness, as these properties exhibit strong semantic and appearance priors (e.g., recognizable material categories and characteristic specular behavior), which facilitate more consistent pairwise preferences. Moreover, for complex videos, VLMs can leverage global context while still attending to localized cues (e.g., thin structures, specular trims, or brief temporal flicker), enabling them to identify subtle but systematic failure patterns across different methods. We provide the detailed prompt used for VLM evaluation in Figure A.1.

5 Experiments

5.1 Training and Experimental Setup

Our primary baseline is DiffusionRenderer liang2025diffusion , as it is currently the only accessible method that performs video inverse rendering. We additionally include two recent diffusion-based image inverse rendering models zeng2024rgb ; zheng2025dnf as baselines. Since DiffusionRenderer's training dataset has not been released, we cannot retrain it on its original data; we therefore fine-tune the official implementation from the released pre-trained weights to assess whether our dataset improves real-video generalization while keeping the model fixed. We use the Cosmos-based DiffusionRenderer checkpoint as our baseline, as it performs better than the SVD variant. We fully fine-tune the model using fixed-length clips of 57 frames sampled at 24 FPS and a resolution of $1280\times 720$ . We use Cyberpunk 2077 for training and Black Myth: Wukong for testing. We fine-tune two variants on data with and without motion effects, and select the motion-augmented variant as our final model due to its consistently better performance. We also fine-tune a longer-clip variant with 113 frames under the same settings; as shown in Figure 2, it substantially improves long-video inference. During inference, we strictly follow DiffusionRenderer’s original protocol for consistency. For image-based inverse rendering methods, we run inference on each video frame independently.

We also demonstrate game editing as a practical application of our dataset. To implement this, we first use Qwen3-VL-235B-A22B-Instruct Qwen3-VL to generate descriptive captions for each video clip. Given that G-buffers provide dense geometric and material priors, our prompts focus exclusively on lighting and environmental effects, enabling users to manipulate these attributes via text during inference. Architecturally, we adapt Wan 2.1-T2V-1.3B wan2025 by incorporating G-buffers as conditional inputs. Following the original training configuration of the base model, we fully fine-tune it on Black Myth: Wukong at a spatial resolution of $832\times 480$ (480p) and a frame rate of 16 FPS, using 81-frame clips.

Evaluation and Generalization.

In the absence of directly comparable methods for this specific task, we establish a baseline by adapting DiffusionRenderer’s forward renderer. Specifically, we employ DiffusionLight to extract environment maps from the videos, which then serve as the lighting conditions to produce the rendered results. A subset of the Black Myth: Wukong dataset is reserved for testing. Furthermore, to assess the generalizability of our model, we conduct cross-dataset evaluations on Cyberpunk 2077. This experiment demonstrates that our model generalizes effectively to unseen game environments, maintaining high-fidelity and controllable video synthesis.

Table 1: Quantitative evaluation of inverse rendering on the Black Myth: Wukong video dataset. DNF denotes DNF-Intrinsic, and DR denotes DiffusionRenderer. Note that RGB $\leftrightarrow$ X does not output depth.

	Depth						Normals		Albedo				Metallic		Roughness
	Abs Rel $\downarrow$	RMSE $\downarrow$	RMSE log $\downarrow$	$\delta<1.25$ $\uparrow$	$\delta<1.25^{2}$ $\uparrow$	$\delta<1.25^{3}$ $\uparrow$	AngularError $\downarrow$	Acc@11.25^∘ $\uparrow$	PSNR $\uparrow$	LPIPS $\downarrow$	si-PSNR $\uparrow$	si-LPIPS $\downarrow$	RMSE $\downarrow$	MAE $\downarrow$	RMSE $\downarrow$	MAE $\downarrow$
RGB $\leftrightarrow$ X	-	-	-	-	-	-	78.05^∘	0.035	8.74	0.619	20.11	0.626	0.510	0.503	0.349	0.313
DNF	0.862	0.026	0.918	0.361	0.610	0.762	53.21^∘	0.065	13.84	0.702	15.59	0.701	0.245	0.183	0.566	0.543
DR	1.118	0.030	0.723	0.267	0.496	0.684	45.01^∘	0.110	17.53	0.648	19.90	0.646	0.230	0.134	0.281	0.237
Ours	0.697	0.023	0.430	0.609	0.761	0.852	42.57^∘	0.150	16.44	0.628	21.44	0.635	0.104	0.024	0.266	0.218

Table 2: Quantitative evaluation of inverse rendering on the Sintel dataset.

	Depth					Albedo
	RMSE $\downarrow$	RMSE log $\downarrow$	$\delta<1.25\uparrow$	$\delta<1.25^{2}\uparrow$	$\delta<1.25^{3}\uparrow$	PSNR $\uparrow$	LPIPS $\downarrow$	si-PSNR $\uparrow$	si-LPIPS $\downarrow$
RGB $\leftrightarrow$ X	-	-	-	-	-	8.69	0.613	16.69	0.595
DNF-Intrinsic	0.249	1.090	0.371	0.590	0.710	13.16	0.546	13.68	0.535
DiffusionRenderer	0.268	0.911	0.331	0.560	0.707	14.87	0.505	17.46	0.497
Ours	0.220	0.745	0.478	0.649	0.776	15.40	0.486	17.80	0.491

Table 3: VLM evaluation metrics. R: Roughness; M: Metallic.

Methods		Sem. $\downarrow$	App. $\downarrow$	Temp. $\downarrow$
R	DiffusionRenderer	2.45	2.40	2.10
	Ours	1.78	1.78	2.08
	Ours (w/ motion blur)	1.78	1.83	1.83
M	DiffusionRenderer	2.35	2.28	2.00
	Ours	1.90	2.13	2.15
	Ours (w/ motion blur)	1.75	1.60	1.85

Table 4: User study. Groups 1 and 2 represent samples where the VLM prefers our model and DiffusionRenderer, respectively. The reported percentages denote the agreement rate between human experts and VLM predictions.

Channel	Group 1 (VLM prefers our model)	Group 2 (VLM prefers DiffusionRenderer)
Metallic	85%	70%
Roughness	75%	61%

Table 5: Ablation study on motion blur.

	Depth				Albedo
Method	RMSE log $\downarrow$	$\delta\!<\!1.25\uparrow$	$\delta\!<\!1.25^{2}\uparrow$	$\delta\!<\!1.25^{3}\uparrow$	PSNR $\uparrow$	LPIPS $\downarrow$	si-PSNR $\uparrow$	si-LPIPS $\downarrow$
Ours	0.773	0.467	0.649	0.756	15.73	0.513	17.37	0.513
Ours (w/ motion blur)	0.745	0.478	0.649	0.776	15.40	0.486	17.80	0.491

5.2 Quantitative Evaluation of Inverse Rendering

Evaluation Metrics.

For synthetic benchmarks with ground truth, we report standard metrics for each modality. Following prior work luo2020consistent , we evaluate depth in disparity space and apply scale-and-shift alignment, reporting AbsRel, RMSE, RMSE-log, and threshold accuracies $\delta<1.25^{n}$ ( $n\!=\!1,2,3$ ). For albedo, we report PSNR and LPIPS, as well as their scale-invariant counterparts (si-PSNR and si-LPIPS) to reduce sensitivity to global intensity scaling. For normals, we report mean angular error and the accuracy under an $11.25^{\circ}$ threshold (Acc@11.25^∘). For material parameters (metallic and roughness), we report RMSE and MAE.
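The depth and normal metrics can be sketched as follows. This is an illustrative implementation under stated assumptions, not the exact evaluation code: the alignment is an ordinary least-squares fit of a global scale and shift, the inputs are assumed to be already converted to the space in which they are compared (e.g., disparity), and masking of invalid pixels is omitted.

```python
import numpy as np


def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t minimizing ||s * pred + t - gt||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t


def depth_metrics(pred, gt):
    """AbsRel, RMSE, RMSE-log, and delta < 1.25^n for positive maps."""
    pred = np.maximum(pred, 1e-6)  # guard against non-positive values
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }


def angular_error_deg(n_pred, n_gt):
    """Per-pixel angle in degrees between predicted and reference normals."""
    dot = np.sum(n_pred * n_gt, axis=-1)
    norms = np.linalg.norm(n_pred, axis=-1) * np.linalg.norm(n_gt, axis=-1)
    cos = np.clip(dot / np.maximum(norms, 1e-12), -1.0, 1.0)
    return np.degrees(np.arccos(cos))
```

Given per-pixel errors `err = angular_error_deg(...)`, the mean angular error is `err.mean()` and Acc@11.25° is `(err < 11.25).mean()`.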

Black Myth: Wukong Benchmark.

As there is currently no public benchmark for video inverse rendering, we construct a quantitative test set from our Black Myth: Wukong capture. Specifically, we hold out 39 video clips, each containing 57 frames, covering diverse materials, lighting, and dynamic events. We evaluate all methods on the same held-out clips and report per-modality metrics averaged over frames and then over clips. As shown in Table 1, our fine-tuned model achieves the best performance on depth and normal estimation, attains the highest scale-invariant albedo PSNR, and markedly improves metallic and roughness accuracy.

Sintel Benchmark.

We additionally evaluate on the MPI-Sintel final pass, which contains realistic effects such as motion blur and depth-of-field, and provides ground-truth albedo and depth. A subtle difference is that Sintel’s albedo annotation does not enforce a fully black sky region, which is inconsistent with our dataset convention. To avoid penalizing methods for this annotation mismatch, we exclude samples containing sky regions when evaluating albedo on Sintel. As shown in Table 2, our fine-tuned model achieves the best overall performance on both depth and albedo: it improves RMSE, RMSE-log, and the $\delta$ accuracies for depth, and it yields higher PSNR and lower LPIPS for albedo, including the scale-invariant variants.

Real-World Video Evaluation.

To evaluate real-scene generalization, we collect 40 real-world video cases from online sources, spanning indoor and outdoor scenes, varying motion magnitude (slow to fast camera/object motion), and diverse times of day.

Because real videos lack ground-truth intrinsic buffers, we use a video-capable vision-language model (VLM) to score and rank predictions. We adopt Gemini 3 Pro team2023gemini as the judge model due to its strong video understanding and temporal reasoning. Given each test video, we compose a fixed-layout grid where the RGB reference and method outputs are played synchronously, and prompt the VLM to rate (i) temporal consistency, (ii) spatial quality, and (iii) semantic plausibility, producing a structured score and ranking for each modality. We report aggregated VLM rankings in Table 3. Fine-tuning on our dataset improves nearly all metrics over DiffusionRenderer; the one exception is metallic temporal consistency without motion blur (2.15 vs. 2.00). Motion augmentation further improves or matches results except for roughness appearance.
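One plausible way to aggregate per-video rankings into a table of scores is to average each method's rank per criterion, where lower is better. The sketch below assumes this mean-rank aggregation and a hypothetical input structure; the exact aggregation used for the reported numbers is not specified in the text.

```python
from collections import defaultdict
from statistics import mean


def mean_ranks(judgments):
    """Average rank per (criterion, method) over all judged videos.

    judgments: list with one dict per video, mapping a criterion name to
    the methods ordered from best to worst (hypothetical structure).
    """
    ranks = defaultdict(list)
    for per_video in judgments:
        for criterion, ordering in per_video.items():
            for position, method in enumerate(ordering, start=1):
                ranks[(criterion, method)].append(position)
    return {key: mean(values) for key, values in ranks.items()}
```

With three compared methods, each mean rank lies between 1 (always ranked best) and 3 (always ranked worst).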

User Study.

We conduct a user study to validate the accuracy of our VLM-based evaluation. We recruit 25 CG experts and use a pairwise preference test between DiffusionRenderer and our fine-tuned model, since judging material cues in complex scenes requires domain expertise. For metallic and roughness, we sample three cases where the VLM prefers DiffusionRenderer and three where it prefers ours, and report the agreement rate between experts and the VLM across questions. As shown in Table 4, expert judgments generally align with the VLM, with lower agreement for roughness due to more ambiguous cues.
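One way to compute such an agreement rate is sketched below in Python. The text does not say whether agreement is counted per individual vote or per question after a majority vote, so this sketch assumes the per-vote definition; the function name `agreement_rate` and the labels `"ours"` and `"baseline"` are hypothetical.

```python
def agreement_rate(expert_votes, vlm_choice):
    """Fraction of individual expert votes that match the VLM's preference.

    expert_votes: dict mapping question id -> list of expert choices.
    vlm_choice:   dict mapping question id -> the VLM's choice.
    """
    agree = 0
    total = 0
    for question, votes in expert_votes.items():
        agree += sum(1 for v in votes if v == vlm_choice[question])
        total += len(votes)
    if total == 0:
        raise ValueError("no votes to score")
    return agree / total


if __name__ == "__main__":
    votes = {
        "q1": ["ours", "ours", "baseline"],
        "q2": ["baseline", "baseline", "baseline"],
    }
    vlm = {"q1": "ours", "q2": "ours"}
    print(agreement_rate(votes, vlm))
```

A majority-vote variant would instead reduce each question to a single expert decision before comparing with the VLM, which weights questions equally regardless of how split the experts are.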

5.3 Qualitative Evaluation of Inverse Rendering

We present qualitative comparisons on real-world video sequences in Figure 4, visualizing (top to bottom) albedo, normals, depth, metallic, and roughness. Compared to DiffusionRenderer, our method disentangles intrinsic scene properties more effectively. Notably, it produces cleaner albedo maps with more thorough delighting, and reconstructs more precise depth and normals that faithfully preserve structures while resisting outdoor illumination artifacts. Furthermore, our fine-tuned model yields semantically accurate metallic and roughness predictions, even under transient atmospheric disruptions such as smoke and volumetric scattering. These visual results corroborate our quantitative findings and highlight the advantage of our proposed dataset in enabling robust, physically grounded inverse rendering in the wild. Additional visualizations are provided in Figure 5.

5.4 Ablation Study of Inverse Renderer

We ablate motion effects by fine-tuning two variants under identical settings: with vs. without motion augmentation. Motion augmentation improves most synthetic metrics (Table 5) and yields better real-video generalization with higher temporal stability (Figure 8), especially under strong motion blur where it reduces flicker and boundary crawling.

5.5 Evaluation of Relighting

We further conduct a qualitative evaluation of relighting to demonstrate the generalization benefits unlocked by our dataset. Specifically, we collect diverse environment maps and synthesize images using the frozen forward renderer of DiffusionRenderer, conditioned on the G-buffers estimated by both the baseline and our fine-tuned inverse renderer. As shown in Figure 6, although the forward renderer is not fine-tuned on our dataset, the images synthesized from our improved G-buffers exhibit significantly better consistency with the target environment maps, particularly in the sky regions where baseline models often struggle. We attribute this superior relighting to our data-centric paradigm’s enhanced ability to decouple intrinsic scene properties from environmental illumination. By disentangling lighting effects and spurious highlights during the inverse rendering stage, the fine-tuned model yields exceptionally clean and accurate G-buffers. Consequently, this enables the off-the-shelf forward renderer to generate highly realistic, illumination-consistent novel views, proving the efficacy of our proposed data. In essence, these results highlight that scaling and improving training data is a highly promising, direct pathway to overcoming the inherent ambiguities of inverse rendering in the wild.

5.6 Evaluation on Game Editing

To showcase the practical efficacy of our dataset for downstream tasks, we investigate its application in high-fidelity video game editing. As illustrated by the qualitative comparisons in Figure 7, our G-buffer-conditioned model, fine-tuned on our data, outperforms existing video-to-video baselines. We benchmark our method against three representative paradigms: (i) a ControlNet-based framework guided by RGB-derived edge maps; (ii) an SDEdit-style stochastic editing pipeline; and (iii) a physics-informed baseline leveraging DiffusionRenderer, where the environment maps are estimated from our outputs via DiffusionLight. To ensure a rigorous and controlled comparison, all diffusion-based baselines utilize the same pre-trained backbone and text prompts as our method.

Compared to these alternatives, our approach achieves a superior balance between editability and visual fidelity to the original game render. Relying solely on edge maps provides spatial conditioning that preserves basic geometry but fails to guarantee material fidelity. Furthermore, because edge extraction from raw RGB frames is inherently unstable, this baseline suffers from severe temporal inconsistencies in the output video. In the context of video games, however, high-quality G-buffers can be stably retrieved as a native byproduct, serving as a superior alternative to noisy, image-derived proxies. Conversely, SDEdit introduces excessive deviation from the input; crucial but visually small objects frequently disappear, which disrupts underlying gameplay logic and degrades player interactivity. While DiffusionRenderer successfully maintains both geometry and material properties, it struggles with aggressive editing tasks, such as dramatic style transfers or the insertion of novel visual effects. Additionally, its reliance on environment maps severely limits user accessibility. In contrast, because our training data inherently captures complex in-game visual effects (e.g., volumetric fog, rain), our model learns a flexible prior. Rather than being strictly bottlenecked by the bare input geometry, it can seamlessly hallucinate and integrate rich atmospheric effects during the editing process.

6 Conclusion

In this work, we present a large-scale video dataset designed to unify inverse and forward rendering. Curated from high-fidelity commercial games, it provides long-term RGB sequences synchronized with dense G-buffers, alongside synthesized motion-blur variants to bridge the sim-to-real gap. To tackle the challenge of evaluating in-the-wild performance without ground truth, we introduce a novel, scalable VLM-based evaluation protocol. Extensive experiments show that fine-tuning DiffusionRenderer on our dataset substantially improves both robust material decomposition and the fidelity of G-buffer-conditioned video synthesis. Ultimately, our data and training recipe enable highly reliable, temporally coherent bidirectional rendering, offering a critical foundation for advanced world simulation and controllable generative editing in the wild.

Appendix

A.1 License Statement and Data Release Policy

Our dataset is constructed using in-game visual and geometric data from two commercial titles: Cyberpunk 2077 (CD PROJEKT RED) and Black Myth: Wukong (Game Science). To ensure strict ethical and legal compliance with the developers’ intellectual property rights, we outline our data collection methodology, release policy, and licensing terms as follows:

• Data Collection Methodology: To ensure strict legal compliance with the End User License Agreements (EULA), our data curation toolkit operates entirely at the rendering API level. Our pipeline strictly intercepts the graphics API to capture runtime G-buffers (e.g., albedo, normals, depth) and final rendering outputs during gameplay. It does not involve decompiling the game executables, circumventing anti-tamper mechanisms, or extracting proprietary source assets (such as original 3D meshes or textures) from the games’ installation files. This API-level capture aligns with established fair-use practices for dataset curation in the computer vision community.

• Compliance and Licensing: In accordance with the developers’ Fan Content Policies and EULAs, which permit non-commercial derivative works and sharing, our dataset will be released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license. This strictly limits the usage of our dataset to non-commercial research purposes.

• Precedents: Our data collection and distribution framework is highly consistent with established and widely recognized synthetic datasets in the computer vision community, such as the GTA-V dataset [103] and VIPER [102], which have profoundly catalyzed subsequent breakthroughs in various downstream computer vision tasks.

• Gated Access: To prevent unauthorized mass distribution, we will not provide direct, public download links. Instead, the dataset will be released via gated access. Researchers wishing to use the dataset must formally agree to and sign a strict Terms of Use (ToU) agreement, acknowledging the original copyrights and committing to non-commercial use, before access is granted.

• Open-Source Toolkit: To promote transparency, reproducibility, and future research in video inverse rendering and relighting, we will fully open-source our data curation toolkit. This will enable researchers to utilize our pipeline to legally curate data from other games, facilitating the continuous expansion and diversification of such datasets.

A.2 VLM Evaluation Prompt

To ensure reproducibility, we provide the exact prompt used for the Vision-Language Model (VLM) evaluation regarding metallic prediction. The details of the prompt design are illustrated in Figure A.1.

References

[1] T. Akenine-Möller, E. Haines, N. Hoffman, A. Pesce, M. Iwanicki, and S. Hillaire. Real-Time Rendering. A K Peters/CRC Press, 4th edition, 2018.
[2] G. Bae and A. J. Davison. Rethinking inductive biases for surface normal estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9535–9545, 2024.
[3] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y. Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y. Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang, Q. Wang, Y. Wang, T. Xie, Y. Xu, H. Xu, J. Xu, Z. Yang, M. Yang, J. Yang, A. Yang, B. Yu, F. Zhang, H. Zhang, X. Zhang, B. Zheng, H. Zhong, J. Zhou, F. Zhou, J. Zhou, Y. Zhu, and K. Zhu. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.
[4] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE transactions on pattern analysis and machine intelligence, 37(8):1670–1687, 2014.
[5] H. B. Barua, K. Stefanov, K. Wong, A. Dhall, and G. Krishnasamy. Gta-hdr: A large-scale synthetic dataset for hdr image reconstruction. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 7876–7886. IEEE, 2025.
[6] Y. Bin, W. Hu, H. Wang, X. Chen, and B. Wang. Normalcrafter: Learning temporally consistent normals from video diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8330–8339, 2025.
[7] B. Bitterli, C. Wyman, M. Pharr, P. Shirley, A. Lefohn, and W. Jarosz. Spatiotemporal reservoir resampling for real-time ray tracing with dynamic direct lighting. ACM Transactions on Graphics, 39(4):148:1–148:17, 2020.
[8] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
[9] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023.
[10] A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y. Zhou, S. R. Richter, and V. Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024.
[11] M. Boss, V. Jampani, K. Kim, H. Lensch, and J. Kautz. Two-shot spatially-varying brdf and shape estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3982–3991, 2020.
[12] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, pages 611–625. Springer, 2012.
[13] C. Careaga and Y. Aksoy. Colorful diffuse intrinsic image decomposition in the wild. ACM Transactions on Graphics (TOG), 43(6):1–12, 2024.
[14] C. Careaga and Y. Aksoy. Physically controllable relighting of photographs. In Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers, pages 1–10, 2025.
[15] D. Chen, R. Chen, S. Zhang, Y. Wang, Y. Liu, H. Zhou, Q. Zhang, Y. Wan, P. Zhou, and L. Sun. Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark. In Forty-first International Conference on Machine Learning, 2024.
[16] S. Chen, H. Guo, S. Zhu, F. Zhang, Z. Huang, J. Feng, and B. Kang. Video depth anything: Consistent depth estimation for super-long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22831–22840, 2025.
[17] X. Chen, S. Peng, D. Yang, Y. Liu, B. Pan, C. Lv, and X. Zhou. Intrinsicanything: Learning diffusion priors for inverse rendering under unknown illumination. In European Conference on Computer Vision, pages 450–467. Springer, 2024.
[18] Z. Chen, T. Xu, W. Ge, L. Wu, D. Yan, J. He, L. Wang, L. Zeng, S. Zhang, and Y.-C. Chen. Uni-renderer: Unifying rendering and inverse rendering via dual stream diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26504–26513, 2025.
[19] E. Chu, T. Huang, S.-Y. Lin, and J.-C. Chen. Medm: Mediating image diffusion models for video-to-video translation with temporal correspondence guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 1353–1361, 2024.
[20] R. L. Cook and K. E. Torrance. A reflectance model for computer graphics. ACM Transactions on Graphics, 1(1):7–24, 1982.
[21] crosire and ReShade contributors. Reshade: A generic post-processing injector for games and video software. https://reshade.me/, 2026. Accessed: 2026-01-21.
[22] M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V. Voleti, S. Y. Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023.
[23] V. Deschaintre, M. Aittala, F. Durand, G. Drettakis, and A. Bousseau. Single-image svbrdf capture with a rendering-aware deep network. ACM Transactions on Graphics (ToG), 37(4):1–15, 2018.
[24] Q. Dong and Y. Fu. Memflow: Optical flow estimation and prediction with memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19068–19078, 2024.
[25] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017.
[26] A. Eftekhar, A. Sax, J. Malik, and A. Zamir. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021.
[27] Y. Enyo and K. Nishino. Diffusion reflectance map: Single-image stochastic inverse rendering of illumination and reflectance. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11873–11883, 2024.
[28] Z. Fan, J. Guo, Y. Wang, T. Xiao, H. Zhang, C. Zou, Z. Chen, P. Hong, Y. Guo, and L.-Q. Yan. Neural layered BRDFs. In ACM SIGGRAPH 2022 Conference Proceedings, 2022.
[29] S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, and P. Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023.
[30] X. Fu, W. Yin, M. Hu, K. Wang, Y. Ma, P. Tan, S. Shen, D. Lin, and X. Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. In European Conference on Computer Vision, pages 241–258. Springer, 2024.
[31] D. Gao, G. Chen, Y. Dong, P. Peers, K. Xu, and X. Tong. Deferred neural lighting: Free-viewpoint relighting from unstructured photographs. ACM Transactions on Graphics, 39(6):258:1–258:15, 2020.
[32] J. Gao, C. Gu, Y. Lin, Z. Li, H. Zhu, X. Cao, L. Zhang, and Y. Yao. Relightable 3d gaussians: Realistic point cloud relighting with brdf decomposition and ray tracing. In European Conference on Computer Vision, pages 73–89. Springer, 2024.
[33] S. Ge, A. Mahapatra, G. Parmar, J.-Y. Zhu, and J.-B. Huang. On the content bias in fréchet video distance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7277–7288, 2024.
[34] Y. Ge, Y. Tang, J. Xu, C. Gokmen, C. Li, W. Ai, B. J. Martinez, A. Aydin, M. Anvari, A. K. Chakravarthy, et al. Behavior vision suite: Customizable dataset generation via simulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22401–22412, 2024.
[35] M. Geyer, O. Bar-Tal, S. Bagon, and T. Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
[36] R. Grosse, M. K. Johnson, E. H. Adelson, and W. T. Freeman. Ground truth dataset and baseline evaluations for intrinsic image algorithms. In 2009 IEEE 12th International Conference on Computer Vision, pages 2335–2342. IEEE, 2009.
[37] J. He, H. Li, W. Yin, Y. Liang, L. Li, K. Zhou, H. Zhang, B. Liu, and Y.-C. Chen. Lotus: Diffusion-based visual foundation model for high-quality dense prediction. arXiv preprint arXiv:2409.18124, 2024.
[38] K. He, R. Liang, J. Munkberg, J. Hasselgren, N. Vijaykumar, A. Keller, S. Fidler, I. Gilitschenski, Z. Gojcic, and Z. Wang. Unirelight: Learning joint decomposition and synthesis for video relighting. arXiv preprint arXiv:2506.15673, 2025.
[39] X. He, D. Jiang, G. Zhang, M. Ku, A. Soni, S. Siu, H. Chen, A. Chandra, Z. Jiang, A. Arulraj, et al. Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2105–2123, 2024.
[40] Z. He, T. Wang, X. Huang, X. Pan, and Z. Liu. Neural lightrig: Unlocking accurate object normal and material estimation with multi-light diffusion. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26514–26524, 2025.
[41] L. Hoyer, D. Dai, and L. Van Gool. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9924–9935, 2022.
[42] M. Hu, W. Yin, C. Zhang, Z. Cai, X. Long, H. Chen, K. Wang, G. Yu, C. Shen, and S. Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[43] W. Hu, X. Gao, X. Li, S. Zhao, X. Cun, Y. Zhang, L. Quan, and Y. Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2005–2015, 2025.
[44] Y. Hu, B. Liu, J. Kasai, Y. Wang, M. Ostendorf, R. Krishna, and N. A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023.
[45] K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
[46] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2821–2830, 2018.
[47] Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024.
[48] Z. Huang, T. Zhang, W. Heng, B. Shi, and S. Zhou. Real-time intermediate flow estimation for video frame interpolation. In European Conference on Computer Vision, pages 624–642. Springer, 2022.
[49] D. Jiang, M. Ku, T. Li, Y. Ni, S. Sun, R. Fan, and W. Chen. Genai arena: An open evaluation platform for generative models. Advances in Neural Information Processing Systems, 37:79889–79908, 2024.
[50] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and J. Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 9000–9008, 2018.
[51] H. Jin, Y. Li, F. Luan, Y. Xiangli, S. Bi, K. Zhang, Z. Xu, J. Sun, and N. Snavely. Neural gaffer: Relighting any object via diffusion. Advances in Neural Information Processing Systems, 37:141129–141152, 2024.
[52] H. Jin, I. Liu, P. Xu, X. Zhang, S. Han, S. Bi, X. Zhou, Z. Xu, and H. Su. Tensoir: Tensorial inverse rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165–174, 2023.
[53] J. T. Kajiya. The rendering equation. In Proceedings of the 13th Annual Conference on Computer Graphics and Interactive Techniques, pages 143–150, 1986.
[54] B. Karlsson and RenderDoc contributors. Renderdoc: Graphics debugger. https://renderdoc.org/, 2026. Accessed: 2026-01-21.
[55] B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9492–9502, 2024.
[56] P. Kocsis, L. Höllein, and M. Nießner. Intrinsix: High-quality pbr generation using image priors. arXiv preprint arXiv:2504.01008, 2025.
[57] P. Kocsis, V. Sitzmann, and M. Nießner. Intrinsic image diffusion for indoor single-view material estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5198–5208, 2024.
[58] P. Krähenbühl. Free supervision from video games. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2955–2964, 2018.
[59] A. Kuznetsov, K. Mullia, Z. Xu, M. Hašan, and R. Ramamoorthi. NeuMIP: Multi-resolution neural materials. ACM Transactions on Graphics, 40(4):175:1–175:13, 2021.
[60] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
[61] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman, E. Yumer, and M.-H. Yang. Learning blind video temporal consistency. In Proceedings of the European conference on computer vision (ECCV), pages 170–185, 2018.
[62] E. H. Land and J. J. McCann. Lightness and retinex theory. Journal of the Optical society of America, 61(1):1–11, 1971.
[63] S. Lee, S. Kim, S. Park, G. Kim, and M. Seo. Prometheus-vision: Vision-language model as a judge for fine-grained evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 11286–11315, 2024.
[64] C. Lei, Y. Xing, and Q. Chen. Blind video temporal consistency via deep video prior. Advances in Neural Information Processing Systems, 33:1083–1093, 2020.
[65] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger. Interiornet: Mega-scale multi-sensor photo-realistic indoor scenes dataset. arXiv preprint arXiv:1809.00716, 2018.
[66] X. Li, C. Ma, X. Yang, and M.-H. Yang. Vidtome: Video token merging for zero-shot video editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7486–7495, 2024.
[67] Y. Li, L. Jiang, L. Xu, Y. Xiangli, Z. Wang, D. Lin, and B. Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023.
[68] Z. Li, M. Shafiei, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker. Inverse rendering for complex indoor scenes: Shape, spatially-varying lighting and svbrdf from a single image. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2475–2484, 2020.
[69] Z. Li and N. Snavely. Cgintrinsics: Better intrinsic image decomposition through physically-based rendering. In Proceedings of the European conference on computer vision (ECCV), pages 371–387, 2018.
[70] Z. Li, K. Sunkavalli, and M. Chandraker. Materials for masses: Svbrdf acquisition with a single mobile phone image. In Proceedings of the European conference on computer vision (ECCV), pages 72–87, 2018.
[71] Z. Li, T. Wu, J. Tan, M. Zhang, J. Wang, and D. Lin. Idarb: Intrinsic decomposition for arbitrary number of input views and illuminations. arXiv preprint arXiv:2412.12083, 2024.
[72] Z. Li, T.-W. Yu, S. Sang, S. Wang, M. Song, Y. Liu, Y.-Y. Yeh, R. Zhu, N. Gundavarapu, J. Shi, et al. Openrooms: An open framework for photorealistic indoor scene datasets. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7190–7199, 2021.
[73] F. Liang, A. Kodaira, C. Xu, M. Tomizuka, K. Keutzer, and D. Marculescu. Looking backward: Streaming video-to-video translation with feature banks. arXiv preprint arXiv:2405.15757, 2024.
[74] F. Liang, B. Wu, J. Wang, L. Yu, K. Li, Y. Zhao, I. Misra, J.-B. Huang, P. Zhang, P. Vajda, et al. Flowvid: Taming imperfect optical flows for consistent video-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8207–8216, 2024.
[75] R. Liang, Z. Gojcic, H. Ling, J. Munkberg, J. Hasselgren, C.-H. Lin, J. Gao, A. Keller, N. Vijaykumar, S. Fidler, et al. Diffusion renderer: Neural inverse and forward rendering with video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26069–26080, 2025.
[76] Z. Liang, Z. Chen, Y. Chen, T. Wei, T. Wang, and X. Pan. Pi-light: Physics-inspired diffusion for full-image relighting. arXiv preprint arXiv:2601.22135, 2026.
[77] Z. Liang, Q. Zhang, Y. Feng, Y. Shan, and K. Jia. Gs-ir: 3d gaussian splatting for inverse rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21644–21653, 2024.
[78] D. Lin, M. Kettunen, B. Bitterli, J. Pantaleoni, C. Yuksel, and C. Wyman. Generalized resampled importance sampling: Foundations of ReSTIR. ACM Transactions on Graphics, 41(4):75:1–75:23, 2022.
[79] L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024.
[80] Y. Litman, O. Patashnik, K. Deng, A. Agrawal, R. Zawar, F. De la Torre, and S. Tulsiani. Materialfusion: Enhancing inverse rendering with material diffusion priors. In 2025 International Conference on 3D Vision (3DV), pages 802–812. IEEE, 2025.
[81] A. Liu, S. Ginosar, T. Zhou, A. A. Efros, and N. Snavely. Learning to factorize and relight a city. In European Conference on Computer Vision, pages 544–561. Springer, 2020.
[82] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 2511–2522, 2023.
[83] Y. Liu, F. Xue, and H. Huang. Urbanscene3d: A large scale urban scene dataset and simulator. arXiv preprint arXiv:2107.04286, 2021.
[84] I. Lopes, F. Pizzati, and R. de Charette. Material palette: Extraction of materials from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4379–4388, 2024.
[85] J. Luo, D. Ceylan, J. S. Yoon, N. Zhao, J. Philip, A. Frühstück, W. Li, C. Richardt, and T. Wang. Intrinsicdiffusion: Joint intrinsic layers from latent diffusion models. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.
[86] X. Luo, J.-B. Huang, R. Szeliski, K. Matzen, and J. Kopf. Consistent video depth estimation. ACM Transactions on Graphics (ToG), 39(4), Article 71, 2020.
[87] X. Ma, X. Xu, L. Zhang, K. Zhou, and H. Wu. Opensvbrdf: A database of measured spatially-varying reflectance. ACM Transactions on Graphics (TOG), 42(6):1–14, 2023.
[88] L. Mehl, J. Schmalfuss, A. Jahedi, Y. Nalivayko, and A. Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4981–4991, 2023.
[89] H. Mikami, K. Fukumizu, S. Murai, S. Suzuki, Y. Kikuchi, T. Suzuki, S.-i. Maeda, and K. Hayashi. A scaling law for synthetic-to-real transfer: How much is your pre-training effective? arXiv preprint arXiv:2108.11018, 2021.
[90] T. Müller, F. Rousselle, J. Novák, and A. Keller. Real-time neural radiance caching for path tracing. arXiv preprint arXiv:2106.12372, 2021.
[91] J. Munkberg, Z. Wang, R. Liang, T. Shen, and J. Hasselgren. Videomat: Extracting pbr materials from video diffusion models. In Computer Graphics Forum, volume 44, page e70180. Wiley Online Library, 2025.
[92] O. Nalbach, E. Arabadzhiyska, D. Mehta, H.-P. Seidel, and T. Ritschel. Deep shading: Convolutional neural networks for screen-space shading. Computer Graphics Forum, 36(4):65–78, 2017.
[93] H. Ouyang, Q. Wang, Y. Xiao, Q. Bai, J. Zhang, K. Zheng, X. Zhou, Q. Chen, and Y. Shen. Codef: Content deformation fields for temporally consistent video processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8089–8099, 2024.
[94] M. Pharr, W. Jakob, and G. Humphreys. Physically Based Rendering: From Theory to Implementation. Morgan Kaufmann, 3rd edition, 2016.
[95] T. Pollok, L. Junglas, B. Ruf, and A. Schumann. Unrealgt: using unreal engine to generate ground truth datasets. In International Symposium on Visual Computing, pages 670–682. Springer, 2019.
[96] C. Qi, X. Cun, Y. Zhang, C. Lei, X. Wang, Y. Shan, and Q. Chen. Fatezero: Fusing attentions for zero-shot text-based video editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15932–15942, 2023.
[97] W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, and Y. Wang. Unrealcv: Virtual worlds for computer vision. In Proceedings of the 25th ACM international conference on Multimedia, pages 1221–1224, 2017.
[98] G. Rainer, W. Jakob, A. Ghosh, and T. Weyrich. Neural BTF compression and interpolation. Computer Graphics Forum, 38(2):235–244, 2019.
[99] A. Raistrick, L. Mei, K. Kayan, D. Yan, Y. Zuo, B. Han, H. Wen, M. Parakh, S. Alexandropoulos, L. Lipson, et al. Infinigen indoors: Photorealistic indoor scenes using procedural generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21783–21794, 2024.
[100] F. Reda, J. Kontkanen, E. Tabellion, D. Sun, C. Pantofaru, and B. Curless. Film: Frame interpolation for large motion. In European Conference on Computer Vision, pages 250–266. Springer, 2022.
[101] J. Ren, F. Wang, J. Zhang, Q. Zheng, M. Ren, and B. Shi. Diligent102: A photometric stereo benchmark dataset with controlled shape and material variation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12581–12590, 2022.
[102] S. R. Richter, Z. Hayder, and V. Koltun. Playing for benchmarks. In Proceedings of the IEEE international conference on computer vision, pages 2213–2222, 2017.
[103] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In European conference on computer vision, pages 102–118. Springer, 2016.
[104] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021.
[105] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.
[106] S. Saito, G. Schwartz, T. Simon, J. Li, and G. Nam. Relightable gaussian codec avatars. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 130–141, 2024.
[107] S. Sartor and P. Peers. Matfusion: a generative diffusion model for svbrdf capture. In SIGGRAPH Asia 2023 conference papers, pages 1–10, 2023.
[108] S. Sengupta, J. Gu, K. Kim, G. Liu, D. W. Jacobs, and J. Kautz. Neural inverse rendering of an indoor scene from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8598–8607, 2019.
[109] S. Shah, D. Dey, C. Lovett, and A. Kapoor. Airsim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and service robotics: Results of the 11th international conference, pages 621–635. Springer, 2017.
[110] J. Shao, Y. Yang, H. Zhou, Y. Zhang, Y. Shen, V. Guizilini, Y. Wang, M. Poggi, and Y. Liao. Learning temporally consistent video depth from video diffusion priors. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22841–22852, 2025.
[111] A. Sztrajman, G. Rainer, T. Ritschel, and T. Weyrich. Neural BRDF representation and importance sampling. Computer Graphics Forum, 40(6):332–346, 2021.
[112] Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[113] Team Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
[114] A. Tewari, J. Thies, B. Mildenhall, P. Srinivasan, E. Tretschk, Y. Wang, C. Lassner, V. Sitzmann, R. Martin-Brualla, S. Lombardi, T. Simon, C. Theobalt, M. Niessner, J. T. Barron, G. Wetzstein, M. Zollhoefer, and V. Golyanik. Advances in neural rendering. Computer Graphics Forum, 41(2):218–280, 2022.
[115] J. Thies, M. Zollhöfer, and M. Nießner. Deferred neural rendering: Image synthesis using neural textures. In ACM SIGGRAPH 2019 Conference Proceedings, 2019.
[116] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.
[117] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 969–977, 2018.
[118] B. Ummenhofer, S. Agrawal, R. Sepulveda, Y. Lao, K. Zhang, T. Cheng, S. Richter, S. Wang, and G. Ros. Objects with lighting: A real-world dataset for evaluating reconstruction and rendering for object relighting. In 2024 International Conference on 3D Vision (3DV), pages 137–147. IEEE, 2024.
[119] T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Fvd: A new metric for video generation. 2019.
[120] G. Vecchio and V. Deschaintre. Matsynth: A modern pbr materials dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22109–22118, 2024.
[121] G. Vecchio, R. Martin, A. Roullier, A. Kaiser, R. Rouffet, V. Deschaintre, and T. Boubekeur. Controlmat: a controlled generative approach to material capture. ACM Transactions on Graphics, 43(5):1–17, 2024.
[122] G. Vecchio, R. Sortino, S. Palazzo, and C. Spampinato. Matfuse: controllable material generation with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4429–4438, 2024.
[123] B. Walter, S. R. Marschner, H. Li, and K. E. Torrance. Microfacet models for refraction through rough surfaces. In Proceedings of the 18th Eurographics Conference on Rendering Techniques, pages 195–206, 2007.
[124] H. Wang, Z. Wang, X. Long, C. Lin, G. Hancke, and R. W. Lau. Mage: Single image to material-aware 3d via the multi-view g-buffer estimation model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10985–10995, 2025.
[125] J. Wang, K. C. Chan, and C. C. Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI conference on artificial intelligence, volume 37, pages 2555–2563, 2023.
[126] J. Wang, H. Duan, G. Zhai, J. Wang, and X. Min. Aigv-assessor: Benchmarking and evaluating the perceptual quality of text-to-video generation with lmm. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18869–18880, 2025.
[127] J. Wang, Q. Hu, C. Bao, Y. Zhu, H. Bao, Z. Cui, and G. Zhang. Lightcity: An urban dataset for outdoor inverse rendering and reconstruction under multi-illumination conditions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 26477–26487, 2025.
[128] S. Wang, B. Antic, A. Geiger, and S. Tang. Intrinsicavatar: Physically based inverse rendering of dynamic humans from monocular videos via explicit ray tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1877–1888, 2024.
[129] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020.
[130] Y. Wang, L. Lipson, and J. Deng. Sea-raft: Simple, efficient, accurate raft for optical flow. In European Conference on Computer Vision, pages 36–54. Springer, 2024.
[131] Y. Wang, M. Shi, J. Li, C. Hong, Z. Huang, J. Peng, Z. Cao, J. Zhang, K. Xian, and G. Lin. Nvds+: Towards efficient and versatile neural stabilizer for video depth estimation. IEEE transactions on pattern analysis and machine intelligence, 2024.
[132] H. Wu, Z. Zhang, E. Zhang, C. Chen, L. Liao, A. Wang, C. Li, W. Sun, Q. Yan, G. Zhai, et al. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181, 2023.
[133] H. Wu, Z. Zhang, W. Zhang, C. Chen, L. Liao, C. Li, Y. Gao, A. Wang, E. Zhang, W. Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023.
[134] S. Wu, S. Basu, T. Broedermann, L. Van Gool, and C. Sakaridis. Pbr-nerf: Inverse rendering with physics-based neural fields. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10974–10984, 2025.
[135] T. Wu, G. Yang, Z. Li, K. Zhang, Z. Liu, L. Guibas, D. Lin, and G. Wetzstein. Gpt-4v(ision) is a human-aligned evaluator for text-to-3d generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22227–22238, 2024.
[136] T. Wu, J. Zhang, X. Fu, Y. Wang, J. Ren, L. Pan, W. Wu, L. Yang, J. Wang, C. Qian, et al. Omniobject3d: Large-vocabulary 3d object dataset for realistic perception, reconstruction and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 803–814, 2023.
[137] X. Xing, K. Groh, S. Karaoglu, T. Gevers, and A. Bhattad. Luminet: Latent intrinsics meets diffusion models for indoor scene relighting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 442–452, 2025.
[138] B. Xiong, Z. Li, and Z. Li. Gauu-scene: A scene reconstruction benchmark on large scale 3d reconstruction dataset using gaussian splatting. arXiv preprint arXiv:2401.14032, 2024.
[139] J. Xu, X. Liu, Y. Wu, Y. Tong, Q. Li, M. Ding, J. Tang, and Y. Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36:15903–15935, 2023.
[140] Z. Xu, X. Chen, C. Liu, B. Wang, L. Wang, Z. Montazeri, and L.-Q. Yan. Towards comprehensive neural materials: Dynamic structure-preserving synthesis with accurate silhouette at instant inference speed. In ACM SIGGRAPH 2025 Conference Papers, 2025.
[141] W. Yan, D. Hafner, S. James, and P. Abbeel. Temporally consistent transformers for video generation. In International Conference on Machine Learning, pages 39062–39098. PMLR, 2023.
[142] L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao. Depth anything v2. Advances in Neural Information Processing Systems, 37:21875–21911, 2024.
[143] S. Yang, Y. Zhou, Z. Liu, and C. C. Loy. Rerender a video: Zero-shot text-guided video-to-video translation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
[144] S. Yang, Y. Zhou, Z. Liu, and C. C. Loy. Fresco: Spatial-temporal correspondence for zero-shot video translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8703–8712, 2024.
[145] C. Ye, L. Qiu, X. Gu, Q. Zuo, Y. Wu, Z. Dong, L. Bo, Y. Xiu, and X. Han. Stablenormal: Reducing diffusion variance for stable and sharp normal. ACM Transactions on Graphics (ToG), 43(6):1–18, 2024.
[146] W. Ye, S. Chen, C. Bao, H. Bao, M. Pollefeys, Z. Cui, and G. Zhang. Intrinsicnerf: Learning intrinsic neural radiance fields for editable novel view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 339–351, 2023.
[147] C. Yeshwanth, Y.-C. Liu, M. Nießner, and A. Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023.
[148] Z. You, X. Cai, J. Gu, T. Xue, and C. Dong. Teaching large language models to regress accurate image quality scores using score distribution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14483–14494, 2025.
[149] Z. You, J. Gu, X. Cai, Z. Li, K. Zhu, C. Dong, and T. Xue. Enhancing descriptive image quality assessment with a large-scale multi-modal dataset. IEEE Transactions on Image Processing, 2025.
[150] Z. You, Z. Li, J. Gu, Z. Yin, T. Xue, and C. Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In European Conference on Computer Vision, pages 259–276, 2024.
[151] T. Zeltner, F. Rousselle, A. Weidlich, P. Clarberg, J. Novák, B. Bitterli, A. Evans, T. Davidovič, S. Kallweit, and A. Lefohn. Real-time neural appearance models. ACM Transactions on Graphics, 43(3):33:1–33:17, 2024.
[152] C. Zeng, Y. Dong, P. Peers, H. Wu, and X. Tong. RenderFormer: Transformer-based neural rendering of triangle meshes with global illumination. In ACM SIGGRAPH 2025 Conference Papers, 2025.
[153] Z. Zeng, V. Deschaintre, I. Georgiev, Y. Hold-Geoffroy, Y. Hu, F. Luan, L.-Q. Yan, and M. Hašan. RGB↔X: Image decomposition and synthesis using material- and lighting-aware diffusion models. In ACM SIGGRAPH 2024 Conference Papers, SIGGRAPH ’24, New York, NY, USA, 2024. Association for Computing Machinery.
[154] L. Zhang, A. Rao, and M. Agrawala. Scaling in-the-wild training for diffusion-based illumination harmonization and editing by imposing consistent light transport. In The Thirteenth International Conference on Learning Representations, 2025.
[155] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
[156] X. Zhang, P. P. Srinivasan, B. Deng, P. Debevec, W. T. Freeman, and J. T. Barron. Nerfactor: Neural factorization of shape and reflectance under an unknown illumination. ACM Transactions on Graphics (ToG), 40(6):1–18, 2021.
[157] Z. Zhang, Z. Jia, H. Wu, C. Li, Z. Chen, Y. Zhou, W. Sun, X. Liu, X. Min, W. Lin, et al. Q-bench-video: Benchmarking the video quality understanding of lmms. arXiv preprint arXiv:2409.20063, 2024.
[158] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.
[159] R. Zheng, Q. Zhang, C. Long, and W.-S. Zheng. Dnf-intrinsic: Deterministic noise-free diffusion for indoor inverse rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10342–10352, 2025.
[160] Y. Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19855–19865, 2023.
[161] X. Zhou, M. Hasan, V. Deschaintre, P. Guerrero, Y. Hold-Geoffroy, K. Sunkavalli, and N. K. Kalantari. Photomat: A material generator learned from single flash photos. In ACM SIGGRAPH 2023 conference proceedings, pages 1–11, 2023.
[162] Y. Zhou, Y. Wang, J. Zhou, W. Chang, H. Guo, Z. Li, K. Ma, X. Li, Y. Wang, H. Zhu, et al. Omniworld: A multi-domain and multi-modal dataset for 4d world modeling. arXiv preprint arXiv:2509.12201, 2025.
[163] J. Zhu, Y. Huo, Q. Ye, F. Luan, J. Li, D. Xi, L. Wang, R. Tang, W. Hua, H. Bao, et al. I2-sdf: Intrinsic indoor scene reconstruction and editing via raytracing in neural sdfs. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12489–12498, 2023.
[164] J. Zhu, F. Luan, Y. Huo, Z. Lin, Z. Zhong, D. Xi, R. Wang, H. Bao, J. Zheng, and R. Tang. Learning-based inverse rendering of complex indoor scenes with differentiable monte carlo raytracing. In SIGGRAPH Asia 2022 Conference Papers. ACM, 2022.