Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Chen, Daiwei; Fu, Zhoutong; Jiang, Chengming; Zhang, Haichao; Zhou, Ran; Wang, Tan; Yao, Chunnan; Li, Guoyao; Cai, Rui; Cao, Yihan; Jiang, Ruijie; Borisyuk, Fedor; Shen, Jianqiang; Wu, Jingwei; Vinayak, Ramya Korlakai

arXiv:2604.02324 (cs)

[Submitted on 2 Apr 2026]

Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
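
The collapse that the abstract attributes to mean initialization can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the random embedding table, the sizes, the function names, and the description token ids are all hypothetical. In particular, `description_init` is a simple stand-in (averaging the pretrained embeddings of a paired text description), not the GTI procedure from the paper, whose details are not given in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained embedding table: 50 existing tokens, 8 dimensions.
vocab_size, dim, num_new = 50, 8, 4
pretrained = rng.normal(size=(vocab_size, dim))


def mean_init(embeddings, num_new):
    """Give every new token the same vector: the mean of the existing rows."""
    mean_vec = embeddings.mean(axis=0)
    return np.tile(mean_vec, (num_new, 1))


def description_init(embeddings, description_token_ids):
    """Hypothetical stand-in for grounding (NOT the paper's GTI): average the
    pretrained embeddings of the tokens in each new token's paired description."""
    return np.stack([embeddings[ids].mean(axis=0) for ids in description_token_ids])


# Hypothetical paired descriptions, given as ids into the existing vocabulary.
descriptions = [[1, 5, 9], [2, 7], [11, 30, 42, 3], [8, 19]]

mean_rows = mean_init(pretrained, num_new)
grounded_rows = description_init(pretrained, descriptions)

# Rank of the new-token block: rank 1 means all new rows lie on a single line.
mean_rank = np.linalg.matrix_rank(mean_rows)
grounded_rank = np.linalg.matrix_rank(grounded_rows)
print("rank under mean init:", mean_rank)
print("rank under description init:", grounded_rank)
```

Under mean initialization every new row is identical, so the new-token block has rank 1 and carries no inter-token distinctions at the start of fine-tuning. The description-based stand-in places each new token at a different point in the pretrained space, which gives a higher-rank block in this toy setup.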

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2604.02324 [cs.CL]
	(or arXiv:2604.02324v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2604.02324

Submission history

From: Daiwei Chen
[v1] Thu, 2 Apr 2026 17:59:19 UTC (11,179 KB)
