Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining

Li, Junxuan; Khirodkar, Rawal; He, Chengan; Jiang, Zhongshi; Nam, Giljoo; Yang, Lingchen; Lee, Jihyun; Zakharov, Egor; Su, Zhaoen; Abdrashitov, Rinat; Dong, Yuan; Martinez, Julieta; Li, Kai; Tan, Qingyang; Shiratori, Takaaki; Hu, Matthew; Guo, Peihong; Huang, Xuhua; Zarei, Ariyan; Pesavento, Marco; Xu, Yichen; Wen, He; Deng, Teng; Borsos, Wyatt; Thakrar, Anjali; Bazin, Jean-Charles; Stoll, Carsten; Hidalgo, Ginés; Booth, James; Wang, Lucy; Ma, Xiaowen; Rong, Yu; Thalanki, Sairanjith; Cao, Chen; Häne, Christian; Kar, Abhishek; Bouaziz, Sofien; Saragih, Jason; Sheikh, Yaser; Saito, Shunsuke

Abstract: High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but models trained on it struggle to generalize to real-world data because of the limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained control over facial expressions and finger-level articulation, with strong identity preservation. Notably, we observe that relightability and loose-garment support emerge and generalize to unconstrained inputs, along with zero-shot robustness to stylized imagery, despite the absence of direct supervision.

Comments:	Accepted at CVPR 2026.
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Cite as:	arXiv:2604.02320 [cs.CV]
	(or arXiv:2604.02320v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2604.02320
