
Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Daiwei Chen¹,², Zhoutong Fu², Chengming Jiang², Haichao Zhang³, Ran Zhou²,
Tan Wang², Chunnan Yao², Guoyao Li², Rui Cai⁴, Yihan Cao², Ruijie Jiang²,
Fedor Borisyuk², Jianqiang Shen², Jingwei Wu², Ramya Korlakai Vinayak¹
¹University of Wisconsin-Madison, ²LinkedIn Corporation, ³Northeastern University, ⁴University of California, Davis. Work done during internship at LinkedIn Corporation. Correspondence to: dchen365@wisc.edu

Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.

1 Introduction

Pretrained language models (LMs) are increasingly adapted to specialized domains by extending their vocabulary with new learnable tokens. A prominent example is generative retrieval, where items (TIGER; GRreview) or documents (tay2022transformer) are assigned discrete semantic codes and generated autoregressively by the LM; similar challenges arise whenever domain-specific symbols must be integrated into a pretrained vocabulary. These systems introduce thousands of new tokens into the model’s vocabulary, and a fundamental challenge is how to incorporate them into the pretrained embedding space so that the LM can transfer its general-purpose knowledge to the novel-token domain.

The prevailing practice initializes new token embeddings as the mean of the existing vocabulary embeddings (hewitt2021initializing). This heuristic is widely adopted because it is simple, places new tokens on the pretrained embedding manifold, and provides a tighter KL-divergence upper bound on output probabilities. However, it collapses all new tokens into a single point in embedding space, erasing inter-token distinctions and stripping domain-level semantics. An existing alternative (LC-Rec) employs auxiliary-task adaptation of the full LM to induce linguistic signals for new tokens, but the multi-task training introduces an objective mismatch: the auxiliary losses are not aligned with the target downstream task, resulting in limited and inconsistent gains.

In this paper, we identify token-embedding misalignment as a fundamental limitation when extending pretrained LMs with new vocabulary, and propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in an LM’s pretrained embedding space, before fine-tuning, better enables the model to leverage its general-purpose knowledge for novel-token domains. The intuition is that pretrained LM embeddings encode rich linguistic structure, semantically related tokens occupy nearby regions (levy2014neural), and the model’s attention and feed-forward layers have learned to exploit this geometry (gao2019representation). If new tokens are placed meaningfully within this structure, the LM can immediately leverage its existing representations to process them in context, rather than relying on fine-tuning alone to recover from a degenerate starting point. This motivates framing vocabulary extension as a token-grounding problem: new token embeddings should be grounded in linguistically meaningful representations while remaining coherent with the pretrained LM’s embedding geometry.

Figure 1: Overview of the GTI grounding stage. The LM backbone and original vocabulary embeddings are frozen (snowflake); only the newly introduced Semantic-ID (SID) token embeddings ($|\mathcal{V}_{\mathrm{SID}}|\!\times\!D$ parameters, fire) are trained. Paired prompts map between natural language descriptions and SID tokens in both directions, grounding the new tokens in the pretrained embedding space. This stage is inserted before standard end-to-end fine-tuning (see Section 3).

Building on this hypothesis, we introduce GTI, a simple and effective grounded token initialization method. Before downstream fine-tuning, GTI freezes the LM backbone and grounds newly introduced token embeddings using paired supervision between natural language descriptions and the corresponding new tokens (Figure 1). This grounding stage resolves the mismatch between well-trained vocabulary embeddings and newly initialized tokens, providing the LM with a semantically structured starting point for subsequent end-to-end fine-tuning of the full model for target downstream tasks.

We validate GTI on Generative Recommendation (GR) (TIGER; pmlr-v235-zhai24a), a challenging and practically important application of vocabulary extension. GRs have attracted growing attention in both academia and industry (ding2026doesgenerativerecommendationgeneralize; mtgr; oneSearch; OneRec), as they dramatically simplify retrieval by autoregressively generating item identifiers token-by-token from user interaction histories, replacing the expensive user–item inner products required by traditional dense-embedding methods (MF; NCF; LightGCN; NGCF). GRs can further exploit scaling-law behavior as model size and data increase (mtgr), offering a clear path to continued improvement. The GR setting is a particularly demanding testbed for grounded token initialization: large sets of new learnable Semantic-ID (SID) tokens must be incorporated into pretrained LMs, each encoding fine-grained item-level semantics and hierarchical codebook structure that should be properly grounded in the LM’s embedding space to support effective retrieval.

Contributions.

1. Diagnosis. Through spectral and geometric analysis, we characterize the token-embedding misalignment caused by mean initialization: all new learnable tokens collapse into a degenerate, low-rank subspace that does not fully recover under subsequent fine-tuning. This motivates the Grounded Token Initialization Hypothesis: linguistically grounding new tokens before fine-tuning better enables the LM to leverage its pretrained knowledge for the new domain.

2. Methodology. We introduce GTI, a simple and effective grounding stage that freezes the LM backbone and learns new token embeddings via paired linguistic supervision before standard fine-tuning, providing a semantically structured starting point for downstream adaptation.

3. Finding. On generative recommendation benchmarks, spanning industry-scale and public datasets, GTI consistently outperforms both direct supervised fine-tuning and LC-Rec (LC-Rec), an existing approach that jointly adapts the full model via auxiliary tasks. These results suggest that token initialization is a key bottleneck in vocabulary extension.

2 Token-Embedding Misalignment

We formalize the vocabulary extension problem in the context of generative retrieval, our primary application domain, and then use spectral and geometric diagnostics to characterize a systematic token-embedding misalignment that arises from standard initialization practices when new tokens are added to a pretrained language model.

Generative Retrieval.

We adopt the framework of TIGER. Each item $I_{i}\in\mathcal{I}$ has content features (title, description, etc.) that a pretrained text encoder maps to a semantic embedding $\mathbf{z}_{i}\in\mathbb{R}^{d}$ . An RQ-VAE (RQ-VAE) with $L$ codebook levels of $K$ entries each discretizes $\mathbf{z}_{i}$ into a Semantic ID $(c_{1},\ldots,c_{L})$ , $c_{l}\in\{1,\ldots,K\}$ , via recursive residual quantization:

\mathbf{r}_{1}:=\mathbf{z}_{i};\qquad c_{l}=\arg\min_{k}\bigl\|\mathbf{r}_{l}-\mathbf{q}_{k}^{(l)}\bigr\|_{2},\quad\mathbf{r}_{l+1}:=\mathbf{r}_{l}-\mathbf{q}_{c_{l}}^{(l)},\quad l=1,\ldots,L,

where $\bigl\{\mathbf{q}_{k}^{(l)}\bigr\}_{k=1}^{K}\!\subset\mathbb{R}^{d}$ is the level-$l$ codebook. The $K\!\times\!L$ SID codes are appended to the LM’s original vocabulary $\mathcal{V}_{\mathrm{text}}$ as new tokens $\mathcal{V}_{\mathrm{SID}}$; SID tokens are labeled by level to encode codebook membership, e.g., <a_1>, <b_1>, <c_1>. Given a context $\mathbf{x}$, either a user’s interaction history (retrieval) or a natural language query (search), the LM generates the target Semantic ID autoregressively:

P_{\theta}(c_{1},\ldots,c_{L}\mid\mathbf{x})=\prod_{t=1}^{L}P_{\theta}(c_{t}\mid c_{<t},\mathbf{x}).
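The recursive residual quantization that assigns a Semantic ID can be sketched in a few lines. This is a minimal illustration with randomly drawn codebooks standing in for a trained RQ-VAE; the function name `residual_quantize` and the toy sizes are assumptions, and codes are 0-indexed here whereas the text indexes them from 1.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Return the code sequence (c_1, ..., c_L) of z and the final residual.

    codebooks has shape (L, K, d); level l holds the K vectors q_k^(l).
    """
    residual = np.asarray(z, dtype=float)
    codes = []
    for level_codebook in codebooks:
        # c_l = argmin_k || r_l - q_k^(l) ||_2
        distances = np.linalg.norm(level_codebook - residual, axis=1)
        c = int(np.argmin(distances))
        codes.append(c)
        # r_{l+1} = r_l - q_{c_l}^(l)
        residual = residual - level_codebook[c]
    return codes, residual

# Toy setup (random codebooks, not a trained RQ-VAE).
rng = np.random.default_rng(0)
L, K, d = 3, 8, 4
codebooks = rng.normal(size=(L, K, d))
z = rng.normal(size=d)

codes, final_residual = residual_quantize(z, codebooks)
# The selected code vectors sum to an approximation of z.
reconstruction = sum(codebooks[l, c] for l, c in enumerate(codes))
print(codes, np.linalg.norm(final_residual))
```

Each level quantizes what the previous levels left unexplained, which is why earlier codes carry coarser information than later ones.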

Mean-of-Vocabulary Initialization.

Standard practice initializes all novel token embeddings to the mean of the existing vocabulary embeddings (hewitt2021initializing):

\mathbf{e}_{c}:=\frac{1}{|\mathcal{V}_{\mathrm{text}}|}\sum_{v\in\mathcal{V}_{\mathrm{text}}}\mathbf{e}_{v},\quad\forall\;c\in\mathcal{V}_{\mathrm{SID}},

(1)

where $\mathbf{e}_{v}$ denotes the input embedding of token $v$ .
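As a small illustration of Eq. 1, the sketch below extends a toy embedding matrix with mean-initialized rows. The random matrix is only a stand-in for pretrained embeddings, and the helper name `extend_with_mean_init` is hypothetical.

```python
import numpy as np

def extend_with_mean_init(E_text, num_new):
    """Append num_new rows, each equal to the mean of the pretrained rows (Eq. 1)."""
    mean_vec = E_text.mean(axis=0, keepdims=True)
    E_new = np.repeat(mean_vec, num_new, axis=0)
    return np.concatenate([E_text, E_new], axis=0)

rng = np.random.default_rng(0)
E_text = rng.normal(size=(100, 16))      # toy stand-in for pretrained embeddings
E = extend_with_mean_init(E_text, num_new=12)

E_sid = E[100:]
# Every new row is the same vector, so the new block has rank one.
rank_sid = np.linalg.matrix_rank(E_sid)
print(E.shape, rank_sid)
```

Because all new rows coincide, the new block spans a single direction before any fine-tuning takes place.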

Diagnosing the misalignment.

Under mean-of-vocabulary initialization (Eq. 1), every new token receives an identical embedding, 1) collapsing all inter-token distinctions and 2) discarding the semantic structure each token should encode (Fig. 2, left). This heuristic is nonetheless widely adopted (wolf2020transformers) because it places new tokens on the pretrained manifold and yields a tighter KL-divergence upper bound on output probabilities compared with random initialization (hewitt2021initializing). Random initialization, conversely, assigns distinct vectors to each token but places them without coherent relation to the pretrained manifold, providing no linguistic prior for the model to build on. Pairwise cosine similarities among token embeddings (Fig. 5) confirm that mean initialization produces a near-uniform similarity block across all SID tokens, while random initialization yields unstructured noise.

We examine whether supervised fine-tuning recovers the structure lost under mean initialization. After supervised fine-tuning, the pairwise similarities among SID embeddings (Fig. 2(c) and Fig. 6, left and middle) remain flat or noisy, and the singular-value decomposition of the SID embedding matrix $E_{\mathrm{SID}}\in\mathbb{R}^{|\mathcal{V}_{\mathrm{SID}}|\times d}$ fine-tuned from the mean-initialized state (Fig. 2(b) and Fig. 7) reveals rapid spectral decay and low effective rank. Supervised fine-tuning alone therefore does not recover the inter-token structure that is missing under mean or random initialization. Taken together, these analyses show that neither strategy provides a suitable starting point: mean initialization places tokens on the pretrained manifold but erases discrimination, while random initialization preserves discrimination but lacks linguistic grounding.
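The spectral diagnostic can be illustrated on synthetic matrices. The text does not state which definition of effective rank is used, so the sketch below assumes the entropy-based variant (the exponential of the Shannon entropy of the normalized singular values); both test matrices are synthetic and are not the paper's embeddings.

```python
import numpy as np

def effective_rank(E, eps=1e-12):
    """Entropy-based effective rank of a matrix (assumed definition)."""
    s = np.linalg.svd(E, compute_uv=False)
    p = s / (s.sum() + eps)            # normalized singular-value spectrum
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
n, d = 64, 32
# Nearly collapsed block: one shared vector plus tiny per-row noise.
collapsed = np.tile(rng.normal(size=(1, d)), (n, 1)) + 1e-3 * rng.normal(size=(n, d))
# Well-spread block: independent Gaussian rows.
spread = rng.normal(size=(n, d))

print(effective_rank(collapsed), effective_rank(spread))
```

A spectrum dominated by one singular value gives an effective rank close to 1, whereas a slowly decaying spectrum gives a value close to the ambient dimension.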

Grounded Token Initialization (GTI) Hypothesis.

These observations motivate our central hypothesis: linguistically grounding novel tokens in an LM’s pretrained embedding space, before downstream fine-tuning, better enables the model to leverage its general-purpose knowledge for novel-token domains. Rather than relying on fine-tuning alone to recover from a degenerate initialization, we propose inserting a simple and efficient grounding stage that learns new token embeddings via linguistic supervision with the backbone frozen, before proceeding to standard end-to-end fine-tuning. We operationalize this hypothesis in Section 3 and verify its effectiveness empirically in Section 4.

3 GTI: Grounded Token Initialization Stage

The diagnosis in Section 2 motivates a straightforward modification to the standard training pipeline: before downstream fine-tuning, insert a grounding stage that freezes the LM backbone and only learns new token embeddings via paired linguistic supervision. This design builds on the established principle of training new token embeddings within a frozen LM (ToolkenGPT; YoLLaVA). We term the resulting procedure GTI. Despite its simplicity, we show that this additional stage yields consistent improvements across multiple benchmarks, including both public and industry-scale datasets (Section 4), suggesting that token initialization is a key bottleneck in vocabulary extension.

Algorithm.

Let $\mathcal{V}=\mathcal{V}_{\text{text}}\cup\mathcal{V}_{\text{new}}$ denote the extended vocabulary, where $\mathcal{V}_{\text{new}}$ are the newly added domain tokens. Given a pretrained autoregressive LM with input-embedding matrix $E\in\mathbb{R}^{|\mathcal{V}|\times d}$ , we partition $E$ into the pretrained rows $E_{\text{text}}$ and the new rows $E_{\text{new}}\in\mathbb{R}^{|\mathcal{V}_{\text{new}}|\times d}$ . Each domain entity is associated with a natural-language description $x_{i}$ (e.g., title or definition) and a canonical new-token sequence $y_{i}=(c_{i,1},\dots,c_{i,L})$ . We instantiate GTI in the generative recommendation setting, where $\mathcal{V}_{\text{new}}=\mathcal{V}_{\text{SID}}$ , $x_{i}$ is an item title/description, and $y_{i}$ is the corresponding SID sequence.

We construct a grounding corpus $\mathcal{D}_{\mathrm{ground}}=\{(x_{i},y_{i})\}_{i=1}^{n}$ pairing each description with its new token sequence, along with reversed pairs $\{(y_{i},x_{i})\}$ that require the model to generate descriptions from new tokens. Bidirectional training encourages new token embeddings to encode semantics in both the input and output directions (see the ablation in Section 4.3 and template details in Appendix 7.2). Using an instruction-style prompt template $\texttt{prompt}(x)$ (Appendix 7.2), we minimize the negative log-likelihood over $E_{\text{new}}$ :

\min_{E_{\text{new}}}\sum_{(x,y)\in\mathcal{D}_{\text{ground}}}\sum_{t=1}^{|y|}-\log P_{\theta}\big(y_{t}\big|y_{<t},\texttt{prompt}(x)\big)

(2)

where $\theta$ denotes all LM parameters. During this stage, all parameters except $E_{\text{new}}$ are held fixed, including $E_{\text{text}}$ and the LM head, which shares weights with $E$ via the standard tied-embedding parameterization. This weight tying means the grounding stage simultaneously shapes how the model reads and generates new tokens. After grounding, we retain the learned $E_{\text{new}}$ as initialization and proceed with standard supervised fine-tuning of all model parameters $\theta$ . Implementation details are provided in Algorithm 1.
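The freezing scheme of Eq. 2 can be pictured with a deliberately tiny model. The sketch below is not the authors' implementation: it replaces the frozen transformer backbone with a parameter-free toy (the context vector is the mean of the context embeddings, and the logits come from the same tied matrix), uses hypothetical token-id pairs, and applies plain gradient descent. It only illustrates that masking the gradient to the new rows leaves $E_{\text{text}}$ untouched while the tied new rows receive signal from both the input and the output side.

```python
import numpy as np

def nll_and_grad(E, context_ids, target_id):
    """NLL of one target token and its gradient w.r.t. the tied matrix E."""
    h = E[context_ids].mean(axis=0)            # toy context representation
    logits = E @ h                             # tied input/output embeddings
    p = np.exp(logits - logits.max())
    p /= p.sum()
    d_logits = p.copy()
    d_logits[target_id] -= 1.0                 # d(NLL)/d(logits)
    grad = np.outer(d_logits, h)               # output-side (LM-head) path
    d_h = E.T @ d_logits                       # input-side (embedding) path
    np.add.at(grad, context_ids, d_h / len(context_ids))
    return -np.log(p[target_id]), grad

def mean_loss(E, pairs):
    return float(np.mean([nll_and_grad(E, c, t)[0] for c, t in pairs]))

def ground_new_rows(E, n_text, pairs, lr=0.1, steps=300):
    """Gradient descent on the NLL that updates only rows n_text.. of E."""
    E = E.copy()
    trainable = np.zeros((E.shape[0], 1))
    trainable[n_text:] = 1.0                   # E_text rows stay frozen
    for _ in range(steps):
        grad = sum(nll_and_grad(E, c, t)[1] for c, t in pairs) / len(pairs)
        E -= lr * trainable * grad
    return E

rng = np.random.default_rng(0)
n_text, n_new, d = 6, 3, 4
E_text = 0.5 * rng.normal(size=(n_text, d))
E_init = np.vstack([E_text, np.tile(E_text.mean(axis=0), (n_new, 1))])  # Eq. 1

# Hypothetical grounding pairs: (context token ids, target token id).
forward = [([0, 1], 6), ([2, 3], 7), ([4, 5], 8)]   # description -> new token
reverse = [([6], 0), ([7], 2), ([8], 4)]            # new token -> description
pairs = forward + reverse

E_grounded = ground_new_rows(E_init, n_text, pairs)
print(mean_loss(E_init, pairs), mean_loss(E_grounded, pairs))
```

In a real LM the same effect would be obtained by disabling gradients for the backbone and masking the embedding gradient to the new rows; the toy only makes the two gradient paths explicit.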

 


4 Experiments


We evaluate GTI within the highly demanding domain of generative recommendation. This domain serves as an ideal testbed for the initialization bottleneck, as it requires incorporating thousands of new Semantic-ID (SID) tokens into a pretrained language model. To empirically validate whether aligning these tokens with the model’s pre-existing linguistic geometry can prevent semantic collapse, we evaluate across two diverse environments: an industrial-scale candidate retrieval system and the public Vibrent Clothes Rental benchmark.



4.1 Setup


Datasets.


We evaluate across two distinct scales and domains.


(1) Industrial candidate retrieval. This dataset consists of job requirement–candidate pairs from a world-leading recruitment platform. Each pair is categorized into three relevance levels (good, good&maybe, and not match) by an internal LLM evaluator according to how many job requirements a candidate satisfies. Due to data-sharing constraints, we report only relative performance gains over the SFT baseline for this dataset. (Data access and use complied with internal privacy and security frameworks. Member profile attributes were processed in accordance with applicable member controls and visibility settings, and analyses were conducted on de-identified datasets with results reported in aggregate only.)


(2) Vibrent Clothes Rental. To validate generalizability, we adapt the public Vibrent Clothes Rental Dataset (vibrent_clothes_rental_dataset) into a generative retrieval task, treating users as queries and clothing items as candidates based on historical rental transactions.



Baselines.


To strictly isolate the initialization bottleneck, all methods share an identical Qwen3-0.6B backbone and RQ-VAE tokenization structure, differing only in how they introduce novel tokens.


(1) Vanilla SFT: New SID tokens are mean-initialized (Eq. 1), inducing a semantic collapse. The model relies entirely on downstream fine-tuning to disambiguate tokens from this degenerate starting point.


(2) LC-Rec (LC-Rec): A recent multi-task approach that begins from the same collapsed state but attempts to recover semantic structure by applying auxiliary natural language alignment objectives during fine-tuning.


(3) GTI (Ours): Using the grounding stage described in Section 3, we ground the new SID tokens into distinct, semantically meaningful regions of the frozen LM’s embedding space, providing a structurally rich starting point for the subsequent SFT procedure.



Evaluation Metrics.


We measure retrieval accuracy using Top- $K$  Precision, Recall, and NDCG. For the industrial dataset, we sample 200 jobs as evaluation queries (retrieving 200 candidates each). To comply with data-sharing constraints, we isolate the direct performance uplift of our grounding stage by reporting results strictly as a relative percentage gain over the standard SFT baseline, formulated as  $(M_{\text{method}}-M_{\text{Baseline}})/M_{\text{Baseline}}$ . For the public Vibrent dataset, we adopt the standard leave-one-out sequence splitting strategy (SelfAttnSeqRec; P5).
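For reference, the three top-$K$ metrics can be written compactly. The sketch assumes binary relevance and the common $1/\log_2(\text{rank}+1)$ discount; the composite NDCG reported for the industrial dataset presumably uses graded relevance levels, which this sketch does not reproduce. Item names are placeholders.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG with a log2 position discount."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["a", "b", "c", "d", "e"]      # model output, best first
relevant = {"a", "c", "f"}              # ground-truth relevant items

print(precision_at_k(ranked, relevant, 5),
      recall_at_k(ranked, relevant, 5),
      ndcg_at_k(ranked, relevant, 5))
```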





Table 1: Relative Precision@K gain (%) over SFT baseline on a real-world candidate retrieval dataset. Bold and underline denote the best result.




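| Methodology | Good Match P@5 | Good Match P@10 | Good Match P@20 | Good Match P@50 | Good Match P@100 | Good & Maybe Match P@5 | Good & Maybe Match P@10 | Good & Maybe Match P@20 | Good & Maybe Match P@50 | Good & Maybe Match P@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| MI+Vanilla SFT (Baseline) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| MI+Multi-task SFT (LC-Rec) | +6.38% | +5.20% | +3.87% | +3.00% | +3.47% | +5.63% | +5.35% | +2.98% | +3.32% | +3.05% |
| GTI+Multi-task SFT (Ours) | +21.63% | +13.59% | +8.16% | +6.35% | +4.25% | +15.83% | +10.89% | +5.74% | +5.87% | +4.10% |
| GTI: extra gain over LC-Rec ( $\Delta$ ) | +15.25% | +8.39% | +4.29% | +3.35% | +0.78% | +10.20% | +5.54% | +2.76% | +2.55% | +1.05% |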






Figure 3: Relative gain versus candidate pool size. Left/Middle: Relative Precision@K gain under Good Match and Good & Maybe Match; Right: Relative NDCG@K gain (Composite). GTI consistently outperforms both baselines across all pool sizes, with the largest gains at small  $K$ . Shaded areas denote variability across runs.



Implementation Details.


Across both datasets, we employ Qwen3-0.6B as the backbone language model. Semantic IDs are constructed via RQ-VAE, following the formulation in TIGER. For GTI, the grounding stage freezes all parameters except  $E_{\text{new}}$  and trains for 8,000 steps with batch size 128; all parameters are then unfrozen for an additional 8,000 steps at the same batch size, followed by the standard SFT procedure used for the baseline. All experiments use four NVIDIA H100 GPUs.


Industrial dataset. Candidate-level semantic representations are obtained by fine-tuning Mistral-E5 in a two-tower architecture with recruiter engagement signals, producing 1024-dimensional embeddings. The RQ-VAE uses  $L=3$  codebook levels with  $K=8{,}192$  codes per level. The subsequent SFT baseline trains with a batch size of 512 for 1,600 steps.


Public dataset. Item-level semantic representations are derived using the off-the-shelf Qwen3-Embedding-0.6B encoder, yielding 1024-dimensional vectors. The RQ-VAE uses a 3-layer MLP encoder–decoder with ReLU activations,  $L=4$  codebook levels with  $K=256$  codes per level (32-dimensional codes), and the diversity regularizer of LETTER to encourage balanced codebook utilization. The RQ-VAE is trained for 20K epochs. The SFT baseline trains with batch size 512 for 1,600 steps.
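The codebook settings above determine how many tokens are added and how many parameters the grounding stage trains ($|\mathcal{V}_{\mathrm{SID}}|\times D$, as noted in Figure 1). A quick calculation, assuming the 1024-dimensional hidden size of Qwen3-0.6B reported in Appendix 7.3 and one new token per code at each level; the helper name is hypothetical:

```python
def sid_vocab_stats(num_levels, codes_per_level, hidden_dim):
    """Number of new SID tokens and of trainable embedding parameters."""
    num_tokens = num_levels * codes_per_level
    return num_tokens, num_tokens * hidden_dim

industrial = sid_vocab_stats(num_levels=3, codes_per_level=8192, hidden_dim=1024)
public = sid_vocab_stats(num_levels=4, codes_per_level=256, hidden_dim=1024)

print(industrial)  # (24576, 25165824)
print(public)      # (1024, 1048576)
```

Under these assumptions the grounding stage trains roughly 25.2M parameters on the industrial dataset and roughly 1.0M on the public one.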


Table 2: Relative NDCG@K (Composite) gain (%) over SFT baseline on a real-world candidate retrieval dataset. Bold and underline denote the best result.


| Methodology | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@50 | NDCG@100 |
|---|---|---|---|---|---|
| MI+Vanilla SFT (Baseline) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| MI+Multi-task SFT (LC-Rec) | +6.94% | +4.38% | +1.94% | +1.95% | +1.01% |
| GTI+Multi-task SFT (Ours) | +17.88% | +12.03% | +6.90% | +4.99% | +2.89% |
| GTI: extra gain over LC-Rec ( $\Delta$ ) | +10.94% | +7.65% | +4.96% | +3.04% | +1.88% |




Table 3: Relative Recall@K and NDCG@K gain (%) over the SFT baseline on the Vibrent dataset.




| Methodology | Recall@5 | Recall@10 | Recall@20 | Recall@50 | Recall@100 | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@50 | NDCG@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| MI+Vanilla SFT (Baseline) | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% |
| MI+Multi-task SFT (LC-Rec) | +7.69% | +11.86% | +13.41% | +12.03% | +15.73% | +8.47% | +10.74% | +11.30% | +11.18% | +13.26% |
| GTI+Vanilla SFT (Ours) | +1.71% | +22.03% | +26.02% | +21.55% | +18.54% | -5.19% | +8.02% | +12.23% | +12.83% | +12.46% |






Figure 4: Relative gain versus candidate pool size. Left: Relative Recall@K gain; Right: Relative NDCG@K gain. Shaded areas denote variability across runs.





4.2 Overall Performance Analysis


Tables 1 and 2 and Figure 3 report the overall performance on the industrial-scale dataset.


The effectiveness of GTI. Across all cutoffs, evaluation metrics, and relevance thresholds (Good Match and Good & Maybe Match), GTI outperforms both baselines. Under the strict Good Match criterion, GTI achieves a +21.63% relative gain at P@5 over vanilla SFT, compared to +6.38% for LC-Rec, yielding an extra gain  $\Delta$  of 15.25 percentage points attributable to the grounding stage. This pattern is consistent across evaluation settings: under Good & Maybe Match, GTI maintains a clear advantage (+15.83% vs. +5.63% at P@5), and NDCG@5 exhibits the same trend (+17.88% vs. +6.94%). Sweeping the candidate pool size from 5 to 200 (Figure 3) further confirms that the improvement is robust across retrieval scales.
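The relationship between the reported quantities can be checked with simple arithmetic. In the sketch below the absolute P@5 values are hypothetical (the industrial metrics are not disclosed), while the relative gains are taken from Table 1; it shows that $\Delta$ is a difference of two relative gains over the same baseline, which is not the same as the relative gain of GTI over LC-Rec.

```python
def relative_gain(metric_method, metric_baseline):
    """Relative gain over the baseline, in percent."""
    return 100.0 * (metric_method - metric_baseline) / metric_baseline

# Hypothetical absolute P@5 values, for illustration only.
baseline, lc_rec = 0.200, 0.220
print(round(relative_gain(lc_rec, baseline), 2))   # 10.0

# Relative gains at P@5 (Good Match) reported in Table 1.
gti_gain, lc_rec_gain = 21.63, 6.38

# Extra gain: difference of the two relative gains, in percentage points.
delta = round(gti_gain - lc_rec_gain, 2)
print(delta)                                       # 15.25

# Relative gain of GTI over LC-Rec itself, implied by the same two numbers.
gti_over_lc_rec = 100.0 * ((1 + gti_gain / 100) / (1 + lc_rec_gain / 100) - 1)
print(round(gti_over_lc_rec, 2))
```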


Evidence for the GTI hypothesis. The comparison between LC-Rec and GTI provides a controlled test of our hypothesis, as both methods introduce linguistic supervision for new tokens but differ in when it is applied: LC-Rec incorporates auxiliary language modeling objectives during fine-tuning while retaining mean initialization, whereas GTI addresses the initialization directly through a grounding stage that precedes fine-tuning. The consistent performance gap (extra gain  $\Delta$ ) between the two methods, despite sharing the same downstream SFT procedure, suggests that grounding new tokens before fine-tuning provides a more effective starting point than relying on auxiliary objectives alone, consistent with the Grounded Token Initialization hypothesis.


Controlled comparison on public dataset.


To disentangle the effect of grounded initialization from that of multi-task adaptation and assess the generalization of our method beyond the proprietary dataset, we compare GTI+Vanilla SFT against LC-Rec (Multi-task SFT) on the public Vibrent dataset (Table 3 and Figure 4). Even without auxiliary objectives during fine-tuning, GTI achieves substantially higher Recall at  $K\geq 10$  (e.g., +26.02% vs. +13.41% at Recall@20) and comparable NDCG at  $K\geq 20$ , although it trails LC-Rec at small cutoffs (e.g., -5.19% vs. +8.47% at NDCG@5). This indicates that the grounding stage alone accounts for a large portion of the downstream improvement.





4.3 Further Analysis


Figure 5: Pairwise cosine-similarity matrices under three initialization strategies. Each matrix shows similarities between 50 pretrained tokens (upper-left block) and 50 SID tokens (bottom-right block); for better visualization, the 50 tokens in each group are chosen at random from the pretrained tokens and the Semantic-ID tokens, respectively. Random initialization (left) yields noninformative SID embeddings. Mean initialization (middle) collapses SID tokens into a near-uniform block. GTI (right) produces differentiated intra-SID structure with meaningful affinities to pretrained tokens.




Figure 6: Pairwise SID similarity after fine-tuning (public dataset). We visualize the pairwise cosine similarity matrix of SID embeddings at the fine-tuned checkpoint. GTI is the only initialization strategy that preserves clear blockwise hierarchical structure among SID tokens, suggesting improved preservation of semantic geometry. By contrast, mean and random initialization produce flat or noisy similarity patterns even after the SFT stage.


The preceding results establish that grounded initialization improves downstream performance; we now investigate why. We use spectral and geometric diagnostics on the SID embedding subspace, both at initialization and after fine-tuning. These analyses provide direct evidence for the Grounded Token Initialization Hypothesis (Section 2).


Grounded initialization produces differentiated embedding geometry.


Figure 5 visualizes pairwise cosine similarities among pretrained vocabulary tokens and SID tokens under three initialization strategies. Random initialization avoids uniformity but yields unstructured noise with no coherent affinity to the pretrained manifold. Mean initialization produces a uniform SID block, confirming the collapse diagnosed in Section 2. In contrast, GTI produces rich, differentiated structure within the SID block together with coherent cross-block affinities to relevant lexical tokens.
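The kind of matrix shown in Figure 5 can be reproduced on synthetic data. The sketch below uses random vectors as a stand-in for pretrained embeddings and compares only the mean and random strategies, since the GTI block requires a trained model; all sizes are toy values.

```python
import numpy as np

def cosine_similarity_matrix(E):
    """Pairwise cosine similarities between the rows of E."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    unit = E / np.clip(norms, 1e-12, None)
    return unit @ unit.T

rng = np.random.default_rng(0)
n, d = 50, 64
E_text = rng.normal(size=(n, d))                    # stand-in for pretrained tokens
sid_mean = np.tile(E_text.mean(axis=0), (n, 1))     # mean initialization
sid_rand = rng.normal(size=(n, d))                  # random initialization

# Joint matrices: pretrained tokens upper-left, SID tokens bottom-right.
S_mean = cosine_similarity_matrix(np.vstack([E_text, sid_mean]))
S_rand = cosine_similarity_matrix(np.vstack([E_text, sid_rand]))

sid_block_mean = S_mean[n:, n:]
sid_block_rand = S_rand[n:, n:]
off_diag = ~np.eye(n, dtype=bool)
print(sid_block_mean.min(), sid_block_rand[off_diag].mean())
```

The mean-initialized SID block is uniformly 1, while the randomly initialized block has off-diagonal similarities scattered around 0 with no relation to the pretrained rows.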


Figure 7: (a) Singular-Value Spectra of SID embedding matrix after SFT: GTI initialization yields slower spectral decay and higher effective rank than mean initialization. (b) Representational Similarity Analysis (RSA) of SID embeddings after SFT: We compare the pairwise geometry of the ground-truth RQ-VAE codebook vectors and the learned SID embeddings using Pearson  $r$  and Spearman  $\rho$ . GTI initialization achieves the highest correlation under both metrics, indicating better preservation of the semantic structure among SID embeddings than mean or random initialization.



Grounded structure persists through fine-tuning.


We next examine whether the structure induced by grounding persists through fine-tuning. (1) Pairwise cosine similarities among SID embeddings after fine-tuning on the public dataset (Figure 6) show that only the GTI-initialized model preserves the blockwise hierarchical structure encoded by the RQ-VAE; mean and random initialization produce flat or noisy similarity patterns. (2) The singular-value spectrum of  $E_{\mathrm{SID}}\in\mathbb{R}^{|\mathcal{V}_{\text{SID}}|\times d}$  after fine-tuning on the industrial dataset (Figure 7a) shows that mean initialization leads to rapid spectral decay and low effective rank, while grounded initialization yields slower decay and higher effective rank, indicating a non-degenerate subspace with multiple active directions along which items differ (see Appendix for extended SVD analysis of the industrial dataset). (3) Representational similarity analysis (RSA) between the learned SID embeddings and the ground-truth RQ-VAE codebook vectors (Figure 7b) shows that GTI-initialized embeddings better preserve the original semantic structure through training. Taken together, these results suggest that the grounding stage seeds embedding structure that persists through fine-tuning, corroborating the downstream performance gains.
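A minimal sketch of the RSA computation follows. Because the codebook vectors and the LM embeddings live in spaces of different dimension, the comparison is made between their pairwise-similarity profiles. The choice of cosine similarity as the pairwise measure, the rank computation without tie handling, and all data below are assumptions for illustration.

```python
import numpy as np

def pairwise_cosine_upper(X):
    """Upper-triangular pairwise cosine similarities of the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = unit @ unit.T
    return S[np.triu_indices(len(X), k=1)]

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    """Pearson correlation of ranks (assumes no tied values)."""
    def rank(v):
        return np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(a), rank(b))

rng = np.random.default_rng(0)
codebook = rng.normal(size=(20, 8))                  # stand-in for RQ-VAE code vectors

# Embeddings that keep the codebook geometry: isometric lift to 32-d plus noise.
Q, _ = np.linalg.qr(rng.normal(size=(32, 8)))        # orthonormal columns
aligned = codebook @ Q.T + 0.05 * rng.normal(size=(20, 32))
# Embeddings unrelated to the codebook.
unrelated = rng.normal(size=(20, 32))

ref = pairwise_cosine_upper(codebook)
for name, emb in [("aligned", aligned), ("unrelated", unrelated)]:
    sims = pairwise_cosine_upper(emb)
    print(name, pearson(ref, sims), spearman(ref, sims))
```

A high correlation means that tokens whose code vectors are similar also have similar learned embeddings, regardless of how the two spaces are oriented.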



5 Related Work


Vocabulary Extension in Language Models.


Extending a pretrained LM’s vocabulary with new tokens is a recurring challenge. Standard approaches initialize new embeddings at the vocabulary mean (hewitt2021initializing) or randomly, then rely on fine-tuning. ToolkenGPT (ToolkenGPT) and Yo’LLaVA (YoLLaVA) show that training only new token embeddings against a frozen LM can be effective for tool invocation and visual concept grounding, respectively. GTI reframes this mechanism as an initialization strategy: by grounding new tokens before fine-tuning, the learned structure serves as a starting point that benefits arbitrary downstream tasks, rather than being tied to a specific end use.



Generative Recommendation.


We adopt generative recommendation as our primary evaluation domain, as it requires injecting thousands of novel tokens into a pretrained LM, making it a demanding testbed for vocabulary extension. This paradigm frames retrieval as autoregressive decoding of Semantic IDs (SIDs) discretized via RQ-VAE (vq-vae; RQ-VAE; TIGER; LC-Rec). We provide an extended discussion in Appendix 7.5.



6 Conclusion


Through spectral and geometric diagnostics, we show that mean-of-vocabulary initialization collapses new tokens into a degenerate subspace that fine-tuning does not fully recover. Motivated by this diagnosis, we propose GTI, a lightweight grounding stage that learns only the new token embeddings via paired linguistic supervision before standard fine-tuning. On generative recommendation benchmarks spanning industrial-scale and public datasets, GTI outperforms both mean initialization and auxiliary-task adaptation in the majority of evaluation settings, with further analyses indicating that grounded structure persists through fine-tuning. These findings support the Grounded Token Initialization Hypothesis. As the grounding mechanism makes no assumptions about the downstream task, an important direction for future work is to test its generality in broader vocabulary-extension settings beyond recommendation.


References







7 Appendix



7.1 Datasets



7.1.1 Retrieval Dataset


Industrial Candidate Retrieval Dataset.


The industrial-scale candidate retrieval dataset consists of job requirement–candidate pairs collected in 2025 from a world-leading professional networking platform with global user coverage and evaluated by our internal LLM judge. According to product policy, which measures how many job requirements a candidate satisfies, each pair is assigned to one of three relevance levels: good match, good&maybe match, and not match. We use the good match pairs for supervised fine-tuning (SFT). (Data access and use complied with internal privacy and security frameworks. Member profile attributes were processed in accordance with applicable member controls and visibility settings, and analyses were conducted on de-identified datasets with results reported in aggregate only.)


The member profile dataset contains profiles of users who provide at least one of the following attributes: geographic location, job positions, education history, or skill information.



Vibrent Dataset.


The Vibrent Clothes Rental Dataset is a publicly available dataset from Kaggle. To complement our industrial dataset with a publicly available benchmark, we also evaluate our method on it. The dataset contains anonymized user–item rental transactions from a clothing rental platform. We construct a candidate retrieval task by treating users as queries and clothing items as candidates, where observed rental interactions are considered positive relevance signals, and non-interacted items are treated as negatives during training.
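Section 4.1 states that this dataset is split with the leave-one-out strategy. A sketch of how per-user rental sequences might be built and split is given below; the train/validation/test convention (last item for test, second-to-last for validation) follows common practice in sequential recommendation and is an assumption here, as are the field layout and the example records.

```python
from collections import defaultdict

def leave_one_out_split(interactions):
    """Split (user, item, timestamp) records into per-user train/valid/test."""
    by_user = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))

    train, valid, test = {}, {}, {}
    for user, events in by_user.items():
        items = [item for _, item in sorted(events)]   # chronological order
        if len(items) < 3:
            train[user] = items          # too short to hold anything out
            continue
        train[user] = items[:-2]
        valid[user] = items[-2]
        test[user] = items[-1]
    return train, valid, test

# Hypothetical rental records: (user, item, timestamp).
rentals = [
    ("u1", "dress", 3), ("u1", "coat", 1), ("u1", "scarf", 2), ("u1", "boots", 4),
    ("u2", "hat", 1),
]
train, valid, test = leave_one_out_split(rentals)
print(train, valid, test)
```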






7.2 Prompt Templates



7.2.1 Prompt Template: Auxiliary Task (Item Title/Description  $\leftrightarrow$  New Vocabulary Tokens)


      


7.2.2 Prompt Template: Search Query Task


   


7.2.3 Prompt Template: Retrieval Task


   


7.3 Implementation Details


We utilize the pre-trained Qwen3-Embedding-0.6B encoder to extract semantic representations for items. The encoder processes item metadata including titles and descriptions to generate 1024-dimensional dense vectors that capture semantic similarities between items.
We process text features of products by concatenating them as: [TITLE] [DESCRIPTION]. We set the maximum input sequence length as 2048. The final outputs are dense semantic embeddings:  $z_{i}\in\mathbb{R}^{1024}$  for item  $i$ .


Our Residual Quantized Variational Autoencoder (RQ-VAE) follows the TIGER (TIGER) framework with carefully designed architectural specifications to ensure effective quantization of semantic representations. The encoder architecture consists of a 3-layer Multi-Layer Perceptron (MLP) with hidden dimensions of [1024, 512, 256], utilizing ReLU activation functions and applying a dropout rate of 0.1 between layers. The residual quantization mechanism employs four codebook layers, each containing 256 entries with 32-dimensional codes. This hierarchical quantization approach enables fine-grained representation of semantic information while maintaining discrete tokenization properties essential for language model integration. We trained the model for 20,000 epochs to achieve a high codebook utilization rate and minimize collision rates. To further prevent collisions where multiple items map to identical sequences of semantic IDs, we employed the Sinkhorn-Knopp trick used by LC-Rec (LC-Rec), which ensures uniform distribution of item semantics across codebook embeddings in the final layer.
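The residual quantization step described above can be illustrated with a minimal pure-Python sketch. It assumes nearest-neighbor codebook lookup under squared Euclidean distance; the function name `residual_quantize` and the toy codebooks are hypothetical and far smaller than the four 256-entry, 32-dimensional codebooks in our setup. The encoder, training loop, and Sinkhorn-Knopp balancing are not shown.

```python
def residual_quantize(z, codebooks):
    """Map a latent vector z to one code index per codebook level.

    At each level, pick the entry nearest to the current residual, then
    subtract it so the next level encodes what is left over.
    """
    residual = list(z)
    codes = []
    for codebook in codebooks:
        # Nearest entry by squared Euclidean distance (assumed lookup rule).
        dists = [
            sum((r - c) ** 2 for r, c in zip(residual, entry))
            for entry in codebook
        ]
        idx = min(range(len(codebook)), key=dists.__getitem__)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


# Toy example: 2 levels with 2-dimensional codes (hypothetical values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine level
]
codes, residual = residual_quantize([1.2, 1.0], codebooks)
print(codes, residual)
```

The returned list of indices plays the role of a Semantic ID: one new vocabulary token per level, with later levels refining the residual left by earlier ones.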


The base language model is Qwen3-0.6B, with a hidden dimension of 1024, 28 transformer layers, and a maximum context length of 32,768 tokens. This configuration provides sufficient capacity for sequential recommendation tasks while remaining computationally efficient. Parameter-efficient fine-tuning uses Quantized Low-Rank Adaptation (QLoRA) with a rank of 8 and an alpha value of 32. The LoRA adapters apply a dropout rate of 0.05 and target the attention and feed-forward projection matrices: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. We also mark embed_tokens and lm_head as modules to be saved, so that the embedding layer and the language modeling head are trained in full and stored alongside the adapters, while all other backbone weights remain frozen. This configuration enables efficient adaptation while preserving pre-trained knowledge.
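For reference, the adapter hyperparameters above can be collected into a plain dictionary. The sketch below is illustrative only: the key names follow common LoRA tooling but are assumptions, not our exact configuration file, and the helpers `lora_scaling` and `lora_param_count` are hypothetical. It shows two quantities implied by the settings under the standard LoRA formulation: the scaling factor alpha/r and the number of adapter parameters added to a single projection of shape d_in × d_out.

```python
# Hyperparameters from the text; key names are assumed for illustration.
lora_config = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "modules_to_save": ["embed_tokens", "lm_head"],
}


def lora_scaling(cfg):
    """Standard LoRA multiplies the low-rank update by alpha / r."""
    return cfg["lora_alpha"] / cfg["r"]


def lora_param_count(d_in, d_out, r):
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


print(lora_scaling(lora_config))
# A square 1024 x 1024 projection is used here purely as an example shape.
print(lora_param_count(1024, 1024, lora_config["r"]))
```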


We implement the token-embedding grounding stage of GTI by extending the Hugging Face TRL (huggingface_trl_quickstart) SFTTrainer to update only the Semantic-ID embedding matrix while freezing the LM backbone; the trainer consumes paired (title/description, SemID) examples and optimizes the embeddings as outlined in the pseudocode below. Unless otherwise stated, we train for 10 epochs with a learning rate of 1e-3 and a batch size of 16.




Algorithm 1 GTI Grounding Stage

Input: Pretrained model $\mathcal{M}$ with embedding matrix $E\in\mathbb{R}^{V\times d}$; new token indices $\mathcal{T}\subseteq\{0,\ldots,V-1\}$; paired corpus $\mathcal{D}=\{(\text{text}_{j},\text{token}_{j})\}$; learning rate $\eta$
Output: Model $\mathcal{M}$ with grounded embeddings for tokens in $\mathcal{T}$

// Setup: freeze all parameters except new token embeddings
Freeze all parameters of $\mathcal{M}$ except $E$
Construct binary mask $\mathbf{m}\in\{0,1\}^{V}$ where $m_{i}=1$ iff $i\in\mathcal{T}$
$\mathbf{M}\leftarrow\mathbf{m}\otimes\mathbf{1}_{d}$   // Broadcast to $\mathbb{R}^{V\times d}$

// Training: update only new token embeddings via masked gradients
for each batch $\mathcal{B}\subset\mathcal{D}$ do
    $\mathcal{L}\leftarrow\textsc{LM\_Loss}(\mathcal{M},\mathcal{B})$   // Forward pass
    $\nabla E\leftarrow\nabla_{E}\mathcal{L}$   // Compute gradients
    $E\leftarrow E-\eta\cdot(\nabla E\odot\mathbf{M})$   // Update only new token embeddings
end for
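The masked update in Algorithm 1 can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not our TRL-based implementation: the language-modeling loss is replaced by a stand-in squared-error loss toward fixed target vectors, and the names `toy_grad` and `masked_sgd_step` are hypothetical. The point is only that rows outside $\mathcal{T}$ stay untouched even when they receive a nonzero gradient.

```python
def toy_grad(E, batch):
    """Gradient of a stand-in loss: 0.5 * ||E[token] - target||^2, batch-averaged.

    This replaces LM_Loss purely for illustration; the real grounding stage
    backpropagates the language-modeling loss through the frozen backbone.
    """
    grad = [[0.0] * len(row) for row in E]
    for token, target in batch:
        for k, t in enumerate(target):
            grad[token][k] += (E[token][k] - t) / len(batch)
    return grad


def masked_sgd_step(E, grad, new_token_ids, lr):
    """E <- E - lr * (grad * M), where M is 1 only on rows of new tokens."""
    mask = [1.0 if i in new_token_ids else 0.0 for i in range(len(E))]
    return [
        [e - lr * mask[i] * g for e, g in zip(E[i], grad[i])]
        for i in range(len(E))
    ]


# Vocabulary of 5 tokens (d = 2); tokens 3 and 4 are the newly added ones.
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.3, 0.4], [0.3, 0.4]]
new_token_ids = {3, 4}
# The batch touches one pretrained token (1) and one new token (3).
batch = [(1, [1.0, 1.0]), (3, [1.0, 0.0])]

E_new = E
for _ in range(50):
    E_new = masked_sgd_step(E_new, toy_grad(E_new, batch), new_token_ids, lr=0.5)

print(E_new[1])  # pretrained row: unchanged despite a nonzero gradient
print(E_new[3])  # new-token row: has moved toward its target
```

In a deep learning framework the same effect is typically obtained by multiplying the embedding gradient by the mask before the optimizer step, which keeps the pretrained rows bit-identical.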



7.4 Analysis Details


Table 4: Additional retrieval results on the Vibrent dataset. We report Recall@K and NDCG@K for Baseline (MI+Vanilla SFT), LC-Rec (MI+Multi-task SFT), and our method GTI+Vanilla SFT. GTI+Vanilla SFT achieves the best Recall@K at every cutoff except K=5 and remains competitive on NDCG@K (best at K=20 and K=50), further supporting the effectiveness of grounded token initialization for generative retrieval.

Methodology	Recall@5	Recall@10	Recall@20	Recall@50	Recall@100	NDCG@5	NDCG@10	NDCG@20	NDCG@50	NDCG@100
MI+Vanilla SFT (Baseline)	0.0226	0.0342	0.0475	0.0771	0.1031	0.0150	0.0188	0.0222	0.0280	0.0322
MI+Multi-task SFT (LC-Rec)	0.0243	0.0382	0.0539	0.0863	0.1194	0.0163	0.0208	0.0247	0.0311	0.0365
GTI+Vanilla SFT (Ours)	0.0230	0.0417	0.0599	0.0937	0.1222	0.0143	0.0203	0.0249	0.0316	0.0362
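For readers less familiar with the two metrics in Table 4, the following sketch computes them for a single query. It assumes binary relevance, a recall denominator equal to the number of relevant items, and the usual log2 position discount; the function names are hypothetical and the exact evaluation code used for the table may differ in such details.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg


# One query whose single relevant item is ranked third.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"c"}
print(recall_at_k(ranked, relevant, 2), recall_at_k(ranked, relevant, 5))
print(ndcg_at_k(ranked, relevant, 5))
```

Recall@K only records whether the relevant item is retrieved within the cutoff, whereas NDCG@K also rewards placing it earlier, which is why the two metrics can rank methods differently at small K.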





Representational Similarity Analysis (RSA).


To quantitatively measure whether the learned representations preserve the semantic structure of the new SID vocabulary tokens, we perform representational similarity analysis. Given the trained RQ-VAE codebooks, which encode the compressed representations of the SID tokens, we define the oracle semantic embeddings as $X=\{x_{1},\ldots,x_{n}\}$, $x_{i}\in\mathbb{R}^{32}$. We denote the corresponding token embeddings learned by the language model as $\hat{X}=\{\hat{x}_{1},\ldots,\hat{x}_{n}\}$, $\hat{x}_{i}\in\mathbb{R}^{d}$, where $d$ is the hidden dimension of the language model. We construct pairwise token similarity matrices $S_{X},S_{\hat{X}}\in\mathbb{R}^{n\times n}$, where:




 $(S_{X})_{i,j}=\cos(x_{i},x_{j}),\qquad(S_{\hat{X}})_{i,j}=\cos(\hat{x}_{i},\hat{x}_{j}).$ 



We then vectorize the upper-triangular entries of $S_{X}$ and $S_{\hat{X}}$ and compute their correlation; we report both Spearman and Pearson correlations to capture complementary aspects of representational alignment. This yields an RSA score that quantifies the extent to which the learned representation space preserves the pairwise semantic relations of the oracle space. Since RSA compares representational geometry rather than coordinates directly, it is well suited to our setting, where the oracle and learned embeddings live in different ambient dimensions ($32$ vs. $d$).
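The RSA computation can be sketched as follows in pure Python. This is a minimal illustration rather than our analysis code: it assumes the diagonal of each similarity matrix is excluded (only pairs with i < j are kept), uses average ranks for ties in the Spearman variant, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def upper_triangle_sims(X):
    """Cosine similarities for all pairs i < j (diagonal excluded by assumption)."""
    n = len(X)
    return [cosine(X[i], X[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def rsa_score(X, X_hat, method="spearman"):
    """Correlate the pairwise-similarity structure of two embedding sets."""
    a, b = upper_triangle_sims(X), upper_triangle_sims(X_hat)
    if method == "spearman":
        a, b = ranks(a), ranks(b)
    return pearson(a, b)


# Oracle points in R^2 and a scaled copy embedded in R^3 (hypothetical data).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -0.5]]
X_hat = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0], [2.0, -1.0, 0.0]]
print(rsa_score(X, X_hat, "pearson"), rsa_score(X, X_hat, "spearman"))
```

Because the second set is a scaled, zero-padded copy of the first, its cosine geometry is identical and both scores equal 1 up to floating-point error, even though the two sets live in different dimensions. The score is undefined when all pairwise similarities in either set are identical, since the correlation then has zero variance in one argument.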



Extended SVD analysis of the industrial dataset.


The slower spectral decay and higher effective rank observed with GTI initialization suggest that this method preserves a more expressive and diverse feature space throughout the SFT process, preventing the dimensional collapse often associated with mean initialization.


Figure 8: Singular-Value Spectra of SID embedding matrix after SFT for Industrial dataset.
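To make the notion of effective rank concrete, the sketch below computes it from a list of singular values. It assumes the entropy-based definition (the exponential of the Shannon entropy of the normalized singular values); this definition and the function name `effective_rank` are assumptions for illustration, and the spectra shown are hypothetical rather than measured.

```python
import math


def effective_rank(singular_values, eps=1e-12):
    """exp(entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s / total > eps]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


collapsed = [10.0, 0.0, 0.0, 0.0]   # all energy in one direction
decaying = [4.0, 2.0, 1.0, 0.5]     # fast spectral decay
flat = [1.0, 1.0, 1.0, 1.0]         # energy spread evenly

for name, spectrum in [("collapsed", collapsed), ("decaying", decaying), ("flat", flat)]:
    print(name, round(effective_rank(spectrum), 3))
```

A matrix whose rows are all identical has exactly one nonzero singular value, so a fully collapsed block of mean-initialized embeddings would score 1 under this definition, while slower spectral decay pushes the score toward the number of available dimensions.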



7.5 Full Related Work


RQ-VAE and Semantic IDs.


Vector-quantized autoencoders (vq-vae; haichao) learn discrete item representations by mapping continuous embeddings to codebook entries. Residual Quantized VAEs (RQ-VAE) (RQ-VAE) extend this with a hierarchy of residual codebooks, producing multi-level Semantic IDs (SIDs) that capture progressively finer semantic distinctions. Unlike conventional item IDs, SIDs carry compositional structure amenable to autoregressive generation, making them a standard component in generative recommendation (TIGER; LC-Rec; mtgr). Crucially, each codebook entry becomes a new token in the LM vocabulary, and how these tokens are initialized is precisely the bottleneck our work addresses.



Generative Recommendation.


Generative retrieval reframes recommendation as autoregressive decoding of item identifiers rather than nearest-neighbor search in embedding space (Aniket; chen2025pal). TIGER (TIGER) introduced RQ-VAE-learned SIDs as generation targets, and LC-Rec (LC-Rec) added auxiliary linguistic objectives during fine-tuning to improve SID representations. Several systems have demonstrated industrial-scale deployment: MTGR (mtgr) integrates generative retrieval with DLRM cross-feature signals; OneSearch (oneSearch) combines keyword-enhanced quantization with preference-aware rewards; and OneRec (OneRec; OneRec_TechReport) unifies retrieval and ranking via session-wise generation. Complementary directions include LLM-driven knowledge-graph recommenders (cai2025boosting) and MLLM-based world-knowledge integration (zhang2025linkedout). All of these systems must inject novel tokens into a pretrained LM; our work addresses a step that is upstream of and complementary to their contributions, namely how those tokens should be initialized.



Connection to Dimensional Collapse.


The initialization collapse we diagnose is related to dimensional collapse in contrastive and self-supervised learning (jing2021understanding; jiang2024hard), where learned representations are restricted to a low-dimensional subspace, eliminating fine-grained distinctions (Figure 2). Mean-of-vocabulary initialization induces a similar effect: all new tokens start at the same point, forming a rank-deficient configuration. jiang2024hard show that appropriate initialization can mitigate dimensional collapse in contrastive learning, which parallels our finding that grounding new tokens before fine-tuning preserves a higher-rank, more differentiated embedding subspace.

NDCG@K (Composite), change relative to MI+Vanilla SFT (Baseline)
Methodology	NDCG@5	NDCG@10	NDCG@20	NDCG@50	NDCG@100
MI+Vanilla SFT (Baseline)	0.00%	0.00%	0.00%	0.00%	0.00%
MI+Multi-task SFT (LC-Rec)	+6.94%	+4.38%	+1.94%	+1.95%	+1.01%
GTI+Multi-task SFT (Ours)	+17.88%	+12.03%	+6.90%	+4.99%	+2.89%
GTI: extra gain over LC-Rec ($\Delta$)	+10.94%	+7.65%	+4.96%	+3.04%	+1.88%
