License: CC BY 4.0
arXiv:2604.02324v1 [cs.CL] 02 Apr 2026

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

Daiwei Chen1,2, Zhoutong Fu2, Chengming Jiang2, Haichao Zhang3, Ran Zhou2,
Tan Wang2, Chunnan Yao2, Guoyao Li2, Rui Cai4, Yihan Cao2, Ruijie Jiang2,
Fedor Borisyuk2, Jianqiang Shen2, Jingwei Wu2, Ramya Korlakai Vinayak1
1University of Wisconsin-Madison 2LinkedIn Corporation 3Northeastern University 4University of California, Davis
Work done during internship at LinkedIn Corporation. Correspondence to: dchen365@wisc.edu
Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.

1 Introduction

Pretrained language models (LMs) are increasingly adapted to specialized domains by extending their vocabulary with new learnable tokens. A prominent example is generative retrieval, where items (TIGER; GRreview) or documents (tay2022transformer) are assigned discrete semantic codes and generated autoregressively by the LM; similar challenges arise whenever domain-specific symbols must be integrated into a pretrained vocabulary. These systems introduce thousands of new tokens into the model’s vocabulary, and a fundamental challenge is how to incorporate them into the pretrained embedding space so that the LM can transfer its general-purpose knowledge to the novel-token domain.

The prevailing practice initializes new token embeddings as the mean of the existing vocabulary embeddings (hewitt2021initializing). This heuristic is widely adopted because it is simple, places new tokens on the pretrained embedding manifold, and provides a tighter KL-divergence upper bound on output probabilities. However, it collapses all new tokens into a single point in embedding space, erasing inter-token distinctions and stripping domain-level semantics. An existing alternative (LC-Rec) employs auxiliary-task adaptation of the full LM to induce linguistic signals for new tokens, but the multi-task training introduces an objective mismatch: the auxiliary losses are not aligned with the target downstream task, resulting in limited and inconsistent gains.

In this paper, we identify token-embedding misalignment as a fundamental limitation when extending pretrained LMs with new vocabulary, and propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in an LM’s pretrained embedding space, before fine-tuning, better enables the model to leverage its general-purpose knowledge for novel-token domains. The intuition is that pretrained LM embeddings encode rich linguistic structure, semantically related tokens occupy nearby regions (levy2014neural), and the model’s attention and feed-forward layers have learned to exploit this geometry (gao2019representation). If new tokens are placed meaningfully within this structure, the LM can immediately leverage its existing representations to process them in context, rather than relying on fine-tuning alone to recover from a degenerate starting point. This motivates framing vocabulary extension as a token-grounding problem: new token embeddings should be grounded in linguistically meaningful representations while remaining coherent with the pretrained LM’s embedding geometry.

Figure 1: Overview of the GTI grounding stage. The LM backbone and original vocabulary embeddings are frozen (snowflake); only the newly introduced Semantic-ID (SID) token embeddings ($|\mathcal{V}_{\mathrm{SID}}| \times D$ parameters, fire) are trained. Paired prompts map between natural language descriptions and SID tokens in both directions, grounding the new tokens in the pretrained embedding space. This stage is inserted before standard end-to-end fine-tuning (see Section 3).

Building on this hypothesis, we introduce GTI, a simple and effective grounded token initialization method. Before downstream fine-tuning, GTI freezes the LM backbone and grounds newly introduced token embeddings using paired supervision between natural language descriptions and the corresponding new tokens (Figure 1). This grounding stage resolves the mismatch between well-trained vocabulary embeddings and newly initialized tokens, providing the LM with a semantically structured starting point for subsequent end-to-end fine-tuning of the full model for target downstream tasks.

We validate GTI on Generative Recommendation (GR) (TIGER; pmlr-v235-zhai24a), a challenging and practically important application of vocabulary extension. GRs have attracted growing attention in both academia and industry (ding2026doesgenerativerecommendationgeneralize; mtgr; oneSearch; OneRec), as they dramatically simplify retrieval by autoregressively generating item identifiers token-by-token from user interaction histories, replacing the expensive user–item inner products required by traditional dense-embedding methods (MF; NCF; LightGCN; NGCF). GRs can further exploit scaling-law behavior as model size and data increase (mtgr), offering a clear path to continued improvement. The GR setting is a particularly demanding testbed for grounded token initialization: large sets of new learnable Semantic-ID (SID) tokens must be incorporated into pretrained LMs, each encoding fine-grained item-level semantics and hierarchical codebook structure that should be properly grounded in the LM’s embedding space to support effective retrieval.

Contributions.

  1. Diagnosis. Through spectral and geometric analysis, we characterize the token-embedding misalignment caused by mean initialization: all new learnable tokens collapse into a degenerate, low-rank subspace that does not fully recover under subsequent fine-tuning. This motivates the Grounded Token Initialization Hypothesis: linguistically grounding new tokens before fine-tuning better enables the LM to leverage its pretrained knowledge for the new domain.

  2. Methodology. We introduce GTI, a simple and effective grounding stage that freezes the LM backbone and learns new token embeddings via paired linguistic supervision before standard fine-tuning, providing a semantically structured starting point for downstream adaptation.

  3. Finding. On generative recommendation benchmarks spanning industry-scale and public datasets, GTI outperforms both direct supervised fine-tuning and LC-Rec (LC-Rec), an existing approach that jointly adapts the full model via auxiliary tasks, in most evaluation settings. These results suggest that token initialization is a key bottleneck in vocabulary extension.

2 Token-Embedding Misalignment

We formalize the vocabulary extension problem in the context of generative retrieval, our primary application domain, and then use spectral and geometric diagnostics to characterize a systematic token-embedding misalignment that arises from standard initialization practices when new tokens are added to a pretrained language model.

Generative Retrieval.

We adopt the framework of TIGER. Each item $I_i \in \mathcal{I}$ has content features (title, description, etc.) that a pretrained text encoder maps to a semantic embedding $\mathbf{z}_i \in \mathbb{R}^{d}$. An RQ-VAE (RQ-VAE) with $L$ codebook levels of $K$ entries each discretizes $\mathbf{z}_i$ into a Semantic ID $(c_1, \ldots, c_L)$, $c_l \in \{1, \ldots, K\}$, via recursive residual quantization:

$$\mathbf{r}_{1} := \mathbf{z}_{i}; \qquad c_{l} = \arg\min_{k} \bigl\|\mathbf{r}_{l} - \mathbf{q}_{k}^{(l)}\bigr\|_{2}, \quad \mathbf{r}_{l+1} := \mathbf{r}_{l} - \mathbf{q}_{c_{l}}^{(l)}, \quad l = 1, \ldots, L,$$

where $\bigl\{\mathbf{q}_{k}^{(l)}\bigr\}_{k=1}^{K} \subset \mathbb{R}^{d}$ is the level-$l$ codebook. The $K \times L$ SID codes [Footnote 1: SID tokens are labeled by level to encode codebook membership, e.g., <a_1>, <b_1>, <c_1>.] are appended to the LM’s original vocabulary $\mathcal{V}_{\mathrm{text}}$ as new tokens $\mathcal{V}_{\mathrm{SID}}$. Given a context $\mathbf{x}$, either a user’s interaction history (retrieval) or a natural language query (search), the LM generates the target Semantic ID autoregressively:

$$P_{\theta}(c_{1}, \ldots, c_{L} \mid \mathbf{x}) = \prod_{t=1}^{L} P_{\theta}(c_{t} \mid c_{<t}, \mathbf{x}).$$
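The recursive residual quantization defined above can be illustrated with a small standard-library sketch. It covers only the code-assignment step for already-trained codebooks; the function names and the toy codebooks are assumptions made for illustration and are not taken from the paper or from any RQ-VAE implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance; it has the same argmin as the L2 norm."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def residual_quantize(z: Vector, codebooks: List[List[Vector]]) -> List[int]:
    """Return the 1-based codes (c_1, ..., c_L) for a semantic embedding z.

    codebooks[l] holds the K vectors of the level-(l+1) codebook.
    """
    residual = list(z)  # r_1 := z_i
    codes = []
    for codebook in codebooks:
        # c_l = argmin_k || r_l - q_k^(l) ||_2
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(residual, codebook[k]),
        )
        codes.append(best + 1)  # codes are indexed from 1 in the text
        # r_{l+1} := r_l - q_{c_l}^(l)
        residual = [r - q for r, q in zip(residual, codebook[best])]
    return codes


# Toy setting (hypothetical values): L = 2 levels, K = 2 entries, d = 2.
toy_codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],    # level 1
    [[0.25, 0.0], [0.0, 0.25]],  # level 2
]
codes = residual_quantize([0.9, 0.3], toy_codebooks)
print(codes)  # [1, 2]
```

Each level quantizes what the previous levels left unexplained, which is why the resulting codes form a coarse-to-fine hierarchy.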
Mean-of-Vocabulary Initialization.

Standard practice initializes all novel token embeddings to the mean of the existing vocabulary embeddings (hewitt2021initializing):

$$\mathbf{e}_{c} := \frac{1}{|\mathcal{V}_{\mathrm{text}}|} \sum_{v \in \mathcal{V}_{\mathrm{text}}} \mathbf{e}_{v}, \quad \forall\, c \in \mathcal{V}_{\mathrm{SID}}, \tag{1}$$

where $\mathbf{e}_{v}$ denotes the input embedding of token $v$.
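To make the consequence of Eq. 1 concrete, the following sketch applies mean-of-vocabulary initialization to a tiny, made-up embedding table and checks the pairwise cosine similarity among the new rows. The three-dimensional vectors and the helper names are illustrative assumptions only.

```python
import math
from typing import List


def mean_embedding(text_embeddings: List[List[float]]) -> List[float]:
    """Column-wise mean of the existing vocabulary embeddings (Eq. 1)."""
    n = len(text_embeddings)
    dim = len(text_embeddings[0])
    return [sum(row[j] for row in text_embeddings) / n for j in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical pretrained rows; real tables have tens of thousands of rows.
text_rows = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 0.0, 0.0]]
num_new_tokens = 4

mean_vec = mean_embedding(text_rows)
new_rows = [list(mean_vec) for _ in range(num_new_tokens)]

# Every pair of new tokens is identical, so every cosine similarity is 1
# (up to floating-point rounding): the uniform block described in the text.
pairwise = [[cosine(a, b) for b in new_rows] for a in new_rows]
```

Because all new rows are copies of one vector, the embedding layer cannot distinguish one new token from another until fine-tuning moves the rows apart.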


Figure 2: Token-embedding collapse under mean initialization and the effect of grounding. (a) Left: Mean initialization maps all SID tokens (white triangles) to a single point, collapsing inter-token distinctions. Top-right: GTI grounds SID tokens (colored triangles) into distinct regions by training only the $|\mathcal{V}_{\mathrm{SID}}| \times d$ embedding parameters while freezing the backbone. Bottom-right: Fine-tuning without grounding does not fully resolve the collapse (see Figure 7). (b) and (c): GTI initialization yields higher effective rank and preserves blockwise hierarchical structure among SID tokens after downstream supervised fine-tuning.

Diagnosing the misalignment.

Under mean-of-vocabulary initialization (Eq. 1), every new token receives an identical embedding, 1) collapsing all inter-token distinctions and 2) discarding the semantic structure each token should encode (Fig. 2, left). This heuristic is nonetheless widely adopted (wolf2020transformers) because it places new tokens on the pretrained manifold and yields a tighter KL-divergence upper bound on output probabilities compared with random initialization (hewitt2021initializing). Random initialization, conversely, assigns distinct vectors to each token but places them without coherent relation to the pretrained manifold, providing no linguistic prior for the model to build on. Pairwise cosine similarities among token embeddings (Fig. 5) confirm that mean initialization produces a near-uniform similarity block across all SID tokens, while random initialization yields unstructured noise.

We examine whether supervised fine-tuning recovers the structure lost under mean initialization. The pairwise similarity among SID embeddings (Fig. 2(c) and Fig. 6, left and middle) and the singular-value decomposition of the SID embedding matrix $E_{\mathrm{SID}} \in \mathbb{R}^{|\mathcal{V}_{\mathrm{SID}}| \times d}$ after supervised fine-tuning from the mean-initialized state (Fig. 2(b) and Fig. 7) reveal rapid spectral decay and low effective rank, confirming that supervised fine-tuning alone does not recover the inter-token structure lost at mean or random initialization. Taken together, these analyses show that neither strategy provides a suitable starting point: mean initialization places tokens on the pretrained manifold but erases discrimination, while random initialization preserves discrimination but lacks linguistic grounding.
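The text does not state which definition of effective rank is used. As an assumption, the sketch below uses a common entropy-based definition: the exponential of the Shannon entropy of the normalized singular values. It takes singular values as input (computing them from $E_{\mathrm{SID}}$ would need a linear-algebra library), and the two example spectra are invented to contrast a collapsed and a spread-out case.

```python
import math
from typing import Sequence


def effective_rank(singular_values: Sequence[float]) -> float:
    """exp(H(p)) with p_i = s_i / sum(s); an assumed, entropy-based definition."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical spectra: one dominant direction versus four equal directions.
collapsed = effective_rank([10.0, 0.01, 0.01, 0.01])
spread = effective_rank([1.0, 1.0, 1.0, 1.0])
```

Under this definition a spectrum with rapid decay scores close to 1, while a flat spectrum scores close to the number of nonzero singular values, which matches the qualitative reading of "low effective rank" in the diagnosis.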

Grounded Token Initialization (GTI) Hypothesis.

These observations motivate our central hypothesis: linguistically grounding novel tokens in an LM’s pretrained embedding space, before downstream fine-tuning, better enables the model to leverage its general-purpose knowledge for novel-token domains. Rather than relying on fine-tuning alone to recover from a degenerate initialization, we propose inserting a simple and efficient grounding stage that learns new token embeddings via linguistic supervision with the backbone frozen, before proceeding to standard end-to-end fine-tuning. We operationalize this hypothesis in Section 3 and verify its effectiveness empirically in Section 4.

3 GTI: Grounded Token Initialization Stage

The diagnosis in Section 2 motivates a straightforward modification to the standard training pipeline: before downstream fine-tuning, insert a grounding stage that freezes the LM backbone and only learns new token embeddings via paired linguistic supervision. This design builds on the established principle of training new token embeddings within a frozen LM (ToolkenGPT; YoLLaVA). We term the resulting procedure GTI. Despite its simplicity, we show that this additional stage yields consistent improvements across multiple benchmarks, including both public and industry-scale datasets (Section 4), suggesting that token initialization is a key bottleneck in vocabulary extension.

Algorithm.

Let $\mathcal{V} = \mathcal{V}_{\text{text}} \cup \mathcal{V}_{\text{new}}$ denote the extended vocabulary, where $\mathcal{V}_{\text{new}}$ is the set of newly added domain tokens. Given a pretrained autoregressive LM with input-embedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, we partition $E$ into the pretrained rows $E_{\text{text}}$ and the new rows $E_{\text{new}} \in \mathbb{R}^{|\mathcal{V}_{\text{new}}| \times d}$. Each domain entity is associated with a natural-language description $x_i$ (e.g., title or definition) and a canonical new-token sequence $y_i = (c_{i,1}, \dots, c_{i,L})$. We instantiate GTI in the generative recommendation setting, where $\mathcal{V}_{\text{new}} = \mathcal{V}_{\text{SID}}$, $x_i$ is an item title/description, and $y_i$ is the corresponding SID sequence.

We construct a grounding corpus $\mathcal{D}_{\mathrm{ground}} = \{(x_i, y_i)\}_{i=1}^{n}$ pairing each description with its new token sequence, along with reversed pairs $\{(y_i, x_i)\}$ that require the model to generate descriptions from new tokens [Footnote 2: Bidirectional training encourages new token embeddings to encode semantics in both the input and output directions; see the ablation in Section 4.3 and template details in Appendix 7.2.]. Using an instruction-style prompt template $\texttt{prompt}(x)$ (shown in the listing below), we minimize the negative log-likelihood over $E_{\text{new}}$:

$$\min_{E_{\text{new}}} \sum_{(x,y) \in \mathcal{D}_{\text{ground}}} \sum_{t=1}^{|y|} -\log P_{\theta}\bigl(y_{t} \mid y_{<t}, \texttt{prompt}(x)\bigr) \tag{2}$$

where $\theta$ denotes all LM parameters. During this stage, all parameters except $E_{\text{new}}$ are held fixed, including $E_{\text{text}}$ and the LM head, which shares weights with $E$ via the standard tied-embedding parameterization. This weight tying means the grounding stage simultaneously shapes how the model reads and generates new tokens. After grounding, we retain the learned $E_{\text{new}}$ as initialization and proceed with standard supervised fine-tuning of all model parameters $\theta$. Implementation details are provided in Algorithm 1.
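The mechanics of Eq. 2, namely minimizing the negative log-likelihood while updating only the new embedding rows, can be sketched on a toy model. This is not the paper's Algorithm 1: the "backbone" here is a hypothetical stand-in that averages the prompt embeddings, the output head is tied to the embedding table, only the text-to-new-token direction is exercised, and all sizes, learning rates, and names are assumptions chosen so the example runs with the standard library.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Example = Tuple[List[int], int]  # (prompt text-token ids, target new-token id)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(E: Matrix, prompt: List[int]) -> List[float]:
    """Toy stand-in for the frozen backbone: mean of the prompt embeddings."""
    return [sum(E[t][j] for t in prompt) / len(prompt) for j in range(len(E[0]))]


def mean_nll(E: Matrix, data: List[Example]) -> float:
    losses = []
    for prompt, target in data:
        h = context_vector(E, prompt)
        probs = softmax([dot(row, h) for row in E])  # tied head: logits = E h
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


def ground(E: Matrix, num_text: int, data: List[Example],
           lr: float = 0.5, epochs: int = 200) -> Matrix:
    """Gradient descent on the NLL that updates only rows >= num_text."""
    for _ in range(epochs):
        for prompt, target in data:
            h = context_vector(E, prompt)
            probs = softmax([dot(row, h) for row in E])
            for v in range(num_text, len(E)):  # text rows are never touched
                # d(-log p_target) / dE[v] = (p_v - 1[v == target]) * h
                coeff = probs[v] - (1.0 if v == target else 0.0)
                E[v] = [w - lr * coeff * hj for w, hj in zip(E[v], h)]
    return E


text_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
num_text = len(text_rows)
mean_row = [sum(r[j] for r in text_rows) / num_text for j in range(3)]
E = [list(r) for r in text_rows] + [list(mean_row), list(mean_row)]  # mean init
data = [([0, 2], 4), ([1, 2], 5)]

loss_before = mean_nll(E, data)
ground(E, num_text, data)
loss_after = mean_nll(E, data)
```

In this toy run the two new rows start identical and end up distinct, while the pretrained rows are left exactly as they were. With a real LM the same effect would be obtained by freezing every parameter except the new rows of the tied embedding matrix in the training framework of choice.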

Listing: Item Title/Description → Semantic IDs (Text → New Vocabulary Tokens)
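A minimal sketch of how the bidirectional grounding corpus described in this section might be assembled is given below. The prompt wording is hypothetical, since the original template text is not reproduced here; the level-tagged token format follows the `<a_1>`, `<b_1>`, `<c_1>` convention from Footnote 1, and the function names are assumptions.

```python
from typing import List, Sequence, Tuple

# Hypothetical prompt wording; the actual templates are not shown in this text.
TEXT_TO_SID = "Which item has the following description? {text}"
SID_TO_TEXT = "Describe the item with identifier {sid}."


def sid_to_string(codes: Sequence[int]) -> str:
    """Render codes as level-tagged tokens, e.g. (3, 7) -> '<a_3><b_7>'."""
    return "".join(
        f"<{chr(ord('a') + level)}_{code}>" for level, code in enumerate(codes)
    )


def build_grounding_corpus(
    items: List[Tuple[str, Sequence[int]]],
) -> List[Tuple[str, str]]:
    """Return (prompt, target) pairs in both directions for every item."""
    corpus = []
    for text, codes in items:
        sid = sid_to_string(codes)
        corpus.append((TEXT_TO_SID.format(text=text), sid))  # (x_i, y_i)
        corpus.append((SID_TO_TEXT.format(sid=sid), text))   # (y_i, x_i)
    return corpus


pairs = build_grounding_corpus([("Red wool coat", (3, 7, 1))])
for prompt, target in pairs:
    print(prompt, "->", target)
```

Each item contributes one pair per direction, so the new tokens appear both as generation targets and as prompt inputs during grounding.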

4 Experiments

We evaluate GTI within the highly demanding domain of generative recommendation. This domain serves as an ideal testbed for the initialization bottleneck, as it requires incorporating thousands of new Semantic-ID (SID) tokens into a pretrained language model. To empirically validate whether aligning these tokens with the model’s pre-existing linguistic geometry can prevent semantic collapse, we evaluate across two diverse environments: an industrial-scale candidate retrieval system and the public Vibrent Clothes Rental benchmark.

4.1 Setup

Datasets.

We evaluate across two distinct scales and domains.

(1) Industrial candidate retrieval. This dataset consists of job requirement–candidate pairs from a world-leading recruitment platform. Each pair is categorized into three relevance levels (good, good&maybe, and not match) by an internal LLM evaluator according to how many job requirements a candidate satisfies. Due to data-sharing constraints, we report only relative performance gains over the SFT baseline for this dataset. [Footnote 3: Data access and use complied with internal privacy and security frameworks. Member profile attributes were processed in accordance with applicable member controls and visibility settings, and analyses were conducted on de-identified datasets with results reported in aggregate only.]

(2) Vibrent Clothes Rental. To validate generalizability, we adapt the public Vibrent Clothes Rental Dataset (vibrent_clothes_rental_dataset) into a generative retrieval task, treating users as queries and clothing items as candidates based on historical rental transactions.

Baselines.

To strictly isolate the initialization bottleneck, all methods share an identical Qwen3-0.6B backbone and RQ-VAE tokenization structure, differing only in how they introduce novel tokens.

(1) Vanilla SFT: New SID tokens are mean-initialized (Eq. 1), inducing a semantic collapse. The model relies entirely on downstream fine-tuning to disambiguate tokens from this degenerate starting point.

(2) LC-Rec (LC-Rec): A recent multi-task approach that begins from the same collapsed state but attempts to recover semantic structure by applying auxiliary natural language alignment objectives during fine-tuning.

(3) GTI (Ours): Using the grounding stage described in Section 3, we ground the new SID tokens into distinct, semantically meaningful regions of the frozen LM’s embedding space, providing a structurally rich starting point for the subsequent SFT procedure.

Evaluation Metrics.

We measure retrieval accuracy using Top-$K$ Precision, Recall, and NDCG. For the industrial dataset, we sample 200 jobs as evaluation queries (retrieving 200 candidates each). To comply with data-sharing constraints, we isolate the direct performance uplift of our grounding stage by reporting results strictly as a relative percentage gain over the standard SFT baseline, formulated as $(M_{\text{method}} - M_{\text{Baseline}}) / M_{\text{Baseline}}$. For the public Vibrent dataset, we adopt the standard leave-one-out sequence splitting strategy (SelfAttnSeqRec; P5).
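The metrics above can be written down in a few lines. The sketch assumes binary relevance for all three metrics, which may differ from the graded "Composite" NDCG reported for the industrial dataset; the item identifiers and the scores passed to `relative_gain` are made up.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


def relative_gain(method: float, baseline: float) -> float:
    """(M_method - M_Baseline) / M_Baseline, as in the evaluation setup."""
    return (method - baseline) / baseline


ranked = ["a", "b", "c", "d"]   # hypothetical retrieved list
relevant = {"a", "c", "e"}      # hypothetical ground truth

p4 = precision_at_k(ranked, relevant, 4)   # 2 hits out of 4
r4 = recall_at_k(ranked, relevant, 4)      # 2 hits out of 3 relevant items
n4 = ndcg_at_k(ranked, relevant, 4)
gain = relative_gain(0.6, 0.5)             # 20% relative gain
```

Reporting `relative_gain` instead of raw scores hides the absolute level of the baseline, so gains at different cutoffs are comparable only relative to the baseline at the same cutoff.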


Table 1: Relative Precision@K gain (%) over SFT baseline on a real-world candidate retrieval dataset. Bold and underline denote the best result.

Precision@K (Good Match)
Methodology                              P@5      P@10     P@20     P@50     P@100
MI+Vanilla SFT (Baseline)                0.00%    0.00%    0.00%    0.00%    0.00%
MI+Multi-task SFT (LC-Rec)               +6.38%   +5.20%   +3.87%   +3.00%   +3.47%
GTI+Multi-task SFT (Ours)                +21.63%  +13.59%  +8.16%   +6.35%   +4.25%
GTI: extra gain over LC-Rec ($\Delta$)   +15.25%  +8.39%   +4.29%   +3.35%   +0.78%

Precision@K (Good & Maybe Match)
Methodology                              P@5      P@10     P@20     P@50     P@100
MI+Vanilla SFT (Baseline)                0.00%    0.00%    0.00%    0.00%    0.00%
MI+Multi-task SFT (LC-Rec)               +5.63%   +5.35%   +2.98%   +3.32%   +3.05%
GTI+Multi-task SFT (Ours)                +15.83%  +10.89%  +5.74%   +5.87%   +4.10%
GTI: extra gain over LC-Rec ($\Delta$)   +10.20%  +5.54%   +2.76%   +2.55%   +1.05%

Figure 3: Relative gain versus candidate pool size. Left/Middle: Relative Precision@K gain under Good Match and Good & Maybe Match; Right: Relative NDCG@K gain (Composite). GTI consistently outperforms both baselines across all pool sizes, with the largest gains at small $K$. Shaded areas denote variability across runs.

Implementation Details.

Across both datasets, we employ Qwen3-0.6B as the backbone language model. Semantic IDs are constructed via RQ-VAE, following the formulation in TIGER. For GTI, the grounding stage freezes all parameters except $E_{\text{new}}$ and trains for 8,000 steps with batch size 128; all parameters are then unfrozen for an additional 8,000 steps at the same batch size, followed by the standard SFT procedure used for the baseline. All experiments use four NVIDIA H100 GPUs.

Industrial dataset. Candidate-level semantic representations are obtained by fine-tuning Mistral-E5 in a two-tower architecture with recruiter engagement signals, producing 1024-dimensional embeddings. The RQ-VAE uses $L=3$ codebook levels with $K=8{,}192$ codes per level. The subsequent SFT baseline trains with a batch size of 512 for 1,600 steps.

Public dataset. Item-level semantic representations are derived using the off-the-shelf Qwen3-Embedding-0.6B encoder, yielding 1024-dimensional vectors. The RQ-VAE uses a 3-layer MLP encoder–decoder with ReLU activations, $L=4$ codebook levels with $K=256$ codes per level (32-dimensional codes), and the diversity regularizer of LETTER to encourage balanced codebook utilization. The RQ-VAE is trained for 20K epochs. The SFT baseline trains with batch size 512 for 1,600 steps.
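As a quick sanity check on the two configurations, the sketch below computes how many SID tokens ($K \times L$) each setting adds and how many parameters the grounding stage would train ($|\mathcal{V}_{\mathrm{SID}}| \times d$). The hidden size of 1024 for Qwen3-0.6B is an assumption that should be verified against the model configuration, and the helper names are illustrative.

```python
from typing import List

# Assumption: embedding width of the Qwen3-0.6B backbone; check the model config.
HIDDEN_SIZE = 1024


def num_sid_tokens(levels: int, codes_per_level: int) -> int:
    """K x L new tokens, one per (level, code) combination."""
    return levels * codes_per_level


def grounding_parameters(levels: int, codes_per_level: int, hidden: int) -> int:
    """|V_SID| x d trainable parameters during the grounding stage."""
    return num_sid_tokens(levels, codes_per_level) * hidden


def sid_token_names(levels: int, codes_per_level: int) -> List[str]:
    """Level-tagged names in the style of Footnote 1: <a_1>, <b_1>, ..."""
    return [
        f"<{chr(ord('a') + level)}_{code}>"
        for level in range(levels)
        for code in range(1, codes_per_level + 1)
    ]


industrial_tokens = num_sid_tokens(3, 8192)   # 24576
public_tokens = num_sid_tokens(4, 256)        # 1024
industrial_params = grounding_parameters(3, 8192, HIDDEN_SIZE)
public_params = grounding_parameters(4, 256, HIDDEN_SIZE)
```

Under the stated assumption, the grounding stage trains roughly 25 million parameters in the industrial setting and about 1 million in the public setting, a small fraction of a 0.6B-parameter backbone.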

Table 2: Relative NDCG@K (Composite) gain (%) over SFT baseline on a real-world candidate retrieval dataset. Bold and underline denote the best result.

NDCG@K (Composite)
Methodology                              @5       @10      @20      @50      @100
MI+Vanilla SFT (Baseline)                0.00%    0.00%    0.00%    0.00%    0.00%
MI+Multi-task SFT (LC-Rec)               +6.94%   +4.38%   +1.94%   +1.95%   +1.01%
GTI+Multi-task SFT (Ours)                +17.88%  +12.03%  +6.90%   +4.99%   +2.89%
GTI: extra gain over LC-Rec ($\Delta$)   +10.94%  +7.65%   +4.96%   +3.04%   +1.88%

Table 3: Relative Recall@K and NDCG@K (%) over SFT baseline on Vibrent Dataset.

Recall@K
Methodology                              @5       @10      @20      @50      @100
MI+Vanilla SFT (Baseline)                0.00%    0.00%    0.00%    0.00%    0.00%
MI+Multi-task SFT (LC-Rec)               +7.69%   +11.86%  +13.41%  +12.03%  +15.73%
GTI+Vanilla SFT (Ours)                   +1.71%   +22.03%  +26.02%  +21.55%  +18.54%

NDCG@K
Methodology                              @5       @10      @20      @50      @100
MI+Vanilla SFT (Baseline)                0.00%    0.00%    0.00%    0.00%    0.00%
MI+Multi-task SFT (LC-Rec)               +8.47%   +10.74%  +11.30%  +11.18%  +13.26%
GTI+Vanilla SFT (Ours)                   -5.19%   +8.02%   +12.23%  +12.83%  +12.46%


Figure 4: Relative gain versus candidate pool size. Left: Relative Recall@K gain; Right: Relative NDCG@K gain. Shaded areas denote variability across runs.

4.2 Overall Performance Analysis

Tables 1 and 2 and Figure 3 detail the overall performance on the industrial-scale dataset.

The effectiveness of GTI initialization. Across all cutoffs, evaluation metrics, and relevance thresholds (Good Match and Good & Maybe Match), GTI outperforms both baselines. Under the strict Good Match criterion, GTI achieves +21.63% relative gain at P@5 over vanilla SFT, compared to +6.38% for LC-Rec, yielding an extra gain $\Delta$ of 15.25 percentage points attributable to the grounding stage. This pattern is consistent across evaluation settings: under Good & Maybe Match, GTI maintains a clear advantage (+15.83% vs. +5.63% at P@5), and NDCG@5 exhibits the same trend (+17.88% vs. +6.94%). Sweeping the candidate pool size from 5 to 200 (Figure 3) further confirms that the improvement is robust across retrieval scales.
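The $\Delta$ row of Table 1 is the difference between two relative gains, so it is measured in percentage points over the SFT baseline. The sketch below recomputes it from the Good Match values reported in Table 1 and, as an additional illustration that is not reported in the paper, converts the P@5 numbers into GTI's gain relative to LC-Rec itself.

```python
# Relative Precision@K gains over the SFT baseline (Good Match), from Table 1.
lc_rec = {5: 6.38, 10: 5.20, 20: 3.87, 50: 3.00, 100: 3.47}
gti = {5: 21.63, 10: 13.59, 20: 8.16, 50: 6.35, 100: 4.25}

# Extra gain in percentage points: difference of the two relative gains.
extra_gain = {k: round(gti[k] - lc_rec[k], 2) for k in gti}
print(extra_gain)  # {5: 15.25, 10: 8.39, 20: 4.29, 50: 3.35, 100: 0.78}

# Gain of GTI relative to LC-Rec (not to the baseline) at P@5:
# both methods are scaled by the same unknown baseline value, which cancels.
gti_over_lc_rec_at_5 = (1 + gti[5] / 100) / (1 + lc_rec[5] / 100) - 1
print(f"{gti_over_lc_rec_at_5:.1%}")  # 14.3%
```

The two views answer different questions: the percentage-point difference compares uplift over the shared baseline, while the ratio expresses how much better GTI's absolute precision is than LC-Rec's.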

Evidence for the GTI hypothesis. The comparison between LC-Rec and GTI provides a controlled test of our hypothesis, as both methods introduce linguistic supervision for new tokens but differ in when it is applied: LC-Rec incorporates auxiliary language modeling objectives during fine-tuning while retaining mean initialization, whereas GTI addresses the initialization directly through a grounding stage that precedes fine-tuning. The consistent performance gap (extra gain $\Delta$) between the two methods, despite sharing the same downstream SFT procedure, suggests that grounding new tokens before fine-tuning provides a more effective starting point than relying on auxiliary objectives alone, consistent with the Grounded Token Initialization hypothesis.

Controlled comparison on public dataset.

To disentangle the effect of grounded initialization from that of multi-task adaptation and assess the generalization of our method beyond the proprietary dataset, we compare GTI+Vanilla SFT against LC-Rec (Multi-task SFT) on the public Vibrent dataset (Table 3 and Figure 4). Even without auxiliary objectives during fine-tuning, GTI achieves substantially higher Recall at $K \geq 10$ (e.g., +26.02% vs. +13.41% at Recall@20) and comparable NDCG at $K \geq 10$, although it trails LC-Rec at $K = 5$ on both metrics. This indicates that the grounding stage alone accounts for a large portion of the downstream improvement.

4.3 Further Analysis

Figure 5: Pairwise cosine-similarity matrices under three initialization strategies. Each matrix shows similarities between 50 pretrained tokens (upper-left block) and 50 SID tokens (bottom-right block) [Footnote 4: For better visualization, we randomly choose 50 tokens separately from pretrained tokens or Semantic-ID tokens.]. Random initialization (left) yields noninformative SID embeddings. Mean initialization (middle) collapses SID tokens into a near-uniform block. GTI (right) produces differentiated intra-SID structure with meaningful affinities to pretrained tokens.


Figure 6: Pairwise SID similarity after fine-tuning (public dataset). We visualize the pairwise cosine similarity matrix of SID embeddings at the fine-tuned checkpoint. GTI is the only initialization strategy that preserves a clear blockwise hierarchical structure among SID tokens, suggesting improved preservation of semantic geometry. By contrast, mean and random initialization produce flat or noisy similarity patterns even after the SFT stage.

The preceding results establish that grounded initialization improves downstream performance; we now investigate why. We use spectral and geometric diagnostics on the SID embedding subspace, both at initialization and after fine-tuning. These analyses provide direct evidence for the Grounded Token Initialization Hypothesis (Section 2).

Grounded initialization produces differentiated embedding geometry.

Figure 5 visualizes pairwise cosine similarities among pretrained vocabulary tokens and SID tokens under three initialization strategies. Random initialization avoids uniformity but yields unstructured noise with no coherent affinity to the pretrained manifold. Mean initialization produces a uniform SID block, confirming the collapse diagnosed in Section 2. In contrast, GTI produces rich, differentiated structure within the SID block together with coherent cross-block affinities to relevant lexical tokens.

Figure 7: (a) Singular-value spectra of the SID embedding matrix after SFT: GTI initialization yields slower spectral decay and higher effective rank than mean initialization. (b) Representational Similarity Analysis (RSA) of SID embeddings after SFT: We compare the pairwise geometry of the ground-truth RQ-VAE codebook vectors and the learned SID embeddings using Pearson $r$ and Spearman $\rho$. GTI initialization achieves the highest correlation under both metrics, indicating better preservation of the semantic structure among SID embeddings than mean or random initialization.

Grounded structure persists through fine-tuning.

We next examine whether the structure induced by grounding persists through fine-tuning. (1) Pairwise cosine similarities among SID embeddings after fine-tuning on the public dataset (Figure 6) show that only the GTI-initialized model preserves the blockwise hierarchical structure encoded by the RQ-VAE; mean and random initialization produce flat or noisy similarity patterns. (2) The singular-value spectrum of $E_{\mathrm{SID}} \in \mathbb{R}^{|\mathcal{V}_{\mathrm{SID}}| \times d}$ after fine-tuning on the industrial dataset (Figure 7a) shows that mean initialization leads to rapid spectral decay and low effective rank, while grounded initialization yields slower decay and higher effective rank, indicating a non-degenerate subspace with multiple active directions along which items differ (see the Appendix for an extended SVD analysis of the industrial dataset). (3) Representational similarity analysis (RSA) between the learned SID embeddings and the ground-truth RQ-VAE codebook vectors (Figure 7b) shows that GTI-initialized embeddings better preserve the original semantic structure through training. Taken together, these results suggest that the grounding stage seeds embedding structure that persists through fine-tuning, corroborating the downstream performance gains.
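The RSA comparison in item (3) can be sketched as follows: compute the pairwise similarities within each set of vectors, then correlate the two lists of similarities with Pearson $r$ and Spearman $\rho$. Using cosine similarity for the pairwise geometry is an assumption, since the text does not name the measure, and the vectors below are toy values rather than real codebook entries.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def pairwise_similarities(vectors: List[Sequence[float]]) -> List[float]:
    """Upper triangle of the cosine-similarity matrix, row by row."""
    n = len(vectors)
    return [cosine(vectors[i], vectors[j]) for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def rsa(reference: List[Sequence[float]],
        learned: List[Sequence[float]]) -> Tuple[float, float]:
    """Pearson r and Spearman rho between the two pairwise-similarity lists."""
    a, b = pairwise_similarities(reference), pairwise_similarities(learned)
    return pearson(a, b), pearson(ranks(a), ranks(b))


# Toy check: a rescaled copy has identical cosine geometry.
codebook = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
rescaled = [[2.0 * x for x in row] for row in codebook]
r, rho = rsa(codebook, rescaled)
```

Because only pairwise similarities are compared, the two sets of vectors may live in spaces of different dimensionality, which is what allows codebook vectors to be compared with LM embeddings.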

5 Related Work

Vocabulary Extension in Language Models.

Extending a pretrained LM’s vocabulary with new tokens is a recurring challenge. Standard approaches initialize new embeddings at the vocabulary mean (hewitt2021initializing) or randomly, then rely on fine-tuning. ToolkenGPT (ToolkenGPT) and Yo’LLaVA (YoLLaVA) show that training only new token embeddings against a frozen LM can be effective for tool invocation and visual concept grounding, respectively. GTI reframes this mechanism as an initialization strategy: by grounding new tokens before fine-tuning, the learned structure serves as a starting point that benefits arbitrary downstream tasks, rather than being tied to a specific end use.

Generative Recommendation.

We adopt generative recommendation as our primary evaluation domain, as it requires injecting thousands of novel tokens into a pretrained LM, making it a demanding testbed for vocabulary extension. This paradigm frames retrieval as autoregressive decoding of Semantic IDs (SIDs) discretized via RQ-VAE (vq-vae; RQ-VAE; TIGER; LC-Rec). We provide an extended discussion in Appendix 7.5.

6 Conclusion

Through spectral and geometric diagnostics, we show that mean-of-vocabulary initialization collapses new tokens into a degenerate subspace that fine-tuning does not fully recover. Motivated by this diagnosis, we propose GTI, a lightweight grounding stage that learns only the new token embeddings via paired linguistic supervision before standard fine-tuning. On generative recommendation benchmarks spanning industrial-scale and public datasets, GTI consistently outperforms both mean initialization and auxiliary-task adaptation, with further analyses confirming that grounded structure persists through fine-tuning. These findings support the Grounded Token Initialization Hypothesis. As the grounding mechanism makes no assumptions about the downstream task, an important direction for future work is to test its generality in broader vocabulary-extension settings beyond recommendation.

References

7 Appendix

7.1 Datasets

7.1.1 Retrieval Dataset

Industrial Candidate Retrieval Dataset.

The industrial-scale candidate retrieval dataset consists of job requirement–candidate pairs collected in 2025 from a world-leading professional networking platform with global user coverage and evaluated by our internal LLM judge. According to product policy, which measures how many job requirements a candidate satisfies, each pair is assigned to one of three relevance levels: good match, good&maybe match, and not match. We use the good match pairs for supervised fine-tuning (SFT). [Footnote 5: Data access and use complied with internal privacy and security frameworks. Member profile attributes were processed in accordance with applicable member controls and visibility settings, and analyses were conducted on de-identified datasets with results reported in aggregate only.]

The member profile dataset contains profiles of users who provide at least one of the following attributes: geographic location, job positions, education history, or skill information.

Vibrent Dataset.

To complement our industrial dataset with a public benchmark, we also evaluate our method on the Vibrent Clothes Rental Dataset, which is publicly available on Kaggle. The dataset contains anonymized user–item rental transactions from a clothing rental platform. We construct a candidate retrieval task by treating users as queries and clothing items as candidates: observed rental interactions are positive relevance signals, and non-interacted items are treated as negatives during training.

7.2 Prompt Templates

7.2.1 Prompt Template: Auxiliary Task (Item Title/Description ↔ New Vocabulary Tokens)

Item Title/Description → Semantic IDs (Title → New Vocabulary Tokens) (see footnote 6)

Item Title/Description → Semantic IDs (Description → New Vocabulary Tokens)

Item Title/Description → Semantic IDs (Title+Description → New Vocabulary Tokens)

Semantic IDs → Item Title/Description (New Vocabulary Tokens → Title)

Semantic IDs → Item Title/Description (New Vocabulary Tokens → Description)

Semantic IDs → Item Title/Description (New Vocabulary Tokens → Title+Description)

Footnote 6: Most of the Item Title/Description ↔ Semantic IDs prompts and retrieval prompts are adapted from LC-Rec (LC-Rec).

7.2.2 Prompt Template: Search Query Task

Candidate Description → Semantic ID Alignment Prompt (see footnote 7)

Semantic ID → Candidate Description Alignment Prompt

Search Query → Candidate Semantic ID Alignment Prompt

Footnote 7: For brevity, we illustrate only three representative prompting templates.

7.2.3 Prompt Template: Retrieval Task

Retrieval Prompt (Template 1)

Retrieval Prompt (Template 2)

Retrieval Prompt (Template 3)

7.3 Implementation Details

We utilize the pre-trained Qwen3-Embedding-0.6B encoder to extract semantic representations for items. The encoder processes item metadata including titles and descriptions to generate 1024-dimensional dense vectors that capture semantic similarities between items. We process text features of products by concatenating them as: [TITLE] [DESCRIPTION]. We set the maximum input sequence length as 2048. The final outputs are dense semantic embeddings: $z_{i}\in\mathbb{R}^{1024}$ for item $i$.

Our Residual Quantized Variational Autoencoder (RQ-VAE) follows the TIGER (TIGER) framework with carefully designed architectural specifications to ensure effective quantization of semantic representations. The encoder architecture consists of a 3-layer Multi-Layer Perceptron (MLP) with hidden dimensions of [1024, 512, 256], utilizing ReLU activation functions and applying a dropout rate of 0.1 between layers. The residual quantization mechanism employs four codebook layers, each containing 256 entries with 32-dimensional codes. This hierarchical quantization approach enables fine-grained representation of semantic information while maintaining discrete tokenization properties essential for language model integration. We trained the model for 20,000 epochs to achieve a high codebook utilization rate and minimize collision rates. To further prevent collisions where multiple items map to identical sequences of semantic IDs, we employed the Sinkhorn-Knopp trick used by LC-Rec (LC-Rec), which ensures uniform distribution of item semantics across codebook embeddings in the final layer.
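To make the residual quantization step concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the latent has already been mapped to 32 dimensions, uses four codebooks of 256 entries as described above, and substitutes randomly initialized codebooks for trained ones. The function names `residual_quantize` and `reconstruct` are illustrative.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Greedy residual quantization: pick one code index per codebook level."""
    residual = z.copy()
    codes = []
    for codebook in codebooks:                       # each codebook: (256, 32)
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))                  # nearest code at this level
        codes.append(idx)
        residual = residual - codebook[idx]          # pass the remainder to the next level
    return codes, residual

def reconstruct(codes, codebooks):
    """Sum the selected code vectors across levels."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
num_levels, codebook_size, code_dim = 4, 256, 32     # values stated in the text
# Untrained stand-ins for the learned codebooks (assumption for illustration).
codebooks = [rng.normal(size=(codebook_size, code_dim)) for _ in range(num_levels)]
z = rng.normal(size=code_dim)                        # hypothetical encoder output

codes, residual = residual_quantize(z, codebooks)
z_hat = reconstruct(codes, codebooks)
num_new_tokens = num_levels * codebook_size          # one LM token per codebook entry
```

Under these settings each item is represented by a sequence of four code indices, and this particular configuration would add 4 × 256 = 1024 tokens to the vocabulary if every codebook entry becomes a new token.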

The base language model is Qwen3-0.6B, with a hidden dimension of 1024. The model architecture comprises 28 transformer layers supporting a maximum context length of 32,768 tokens. This configuration provides sufficient capacity for processing sequential recommendation tasks while maintaining computational efficiency. Parameter-efficient fine-tuning is implemented through Quantized Low-Rank Adaptation (QLoRA) with a rank of 8 and an alpha value of 32. The LoRA adaptation applies a dropout rate of 0.05 and targets the projection matrices q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj. We also list embed_tokens and lm_head as modules to save, so that the embedding layer and the language modeling head are trained in full and stored alongside the adapters, while the remaining base weights stay frozen. This configuration enables efficient adaptation while preserving pre-trained knowledge.
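The adapter settings in this paragraph can be summarized as a configuration fragment. The sketch below is a plain Python dictionary whose keys follow the keyword names of the Hugging Face PEFT `LoraConfig` class. Whether the authors used PEFT directly, and the `task_type` value, are assumptions.

```python
# Adapter hyperparameters as stated in the text; keys mirror PEFT's LoraConfig keywords.
lora_settings = {
    "r": 8,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "modules_to_save": ["embed_tokens", "lm_head"],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the text
}

# LoRA scales the low-rank update by lora_alpha / r.
lora_scaling = lora_settings["lora_alpha"] / lora_settings["r"]
```

In PEFT, modules listed under `modules_to_save` are kept fully trainable and written to the checkpoint next to the adapter weights, which is what allows the embedding layer and output head to change during fine-tuning.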

We implement the token-embedding grounding stage of GTI by extending the Hugging Face TRL (huggingface_trl_quickstart) SFTTrainer to update only the Semantic-ID embedding matrix while freezing the LM backbone; the trainer consumes paired (title/description, SemID) examples and optimizes the embeddings as outlined in the pseudocode below. Unless otherwise stated, we train for 10 epochs with a learning rate of 1e-3 and a batch size of 16.

Input: Pretrained model $\mathcal{M}$ with embedding matrix $E\in\mathbb{R}^{V\times d}$; new token indices $\mathcal{T}\subseteq\{0,\ldots,V-1\}$; paired corpus $\mathcal{D}=\{(\text{text}_{j},\text{token}_{j})\}$
Output: Model $\mathcal{M}$ with grounded embeddings for tokens in $\mathcal{T}$

// Setup: freeze all parameters except new token embeddings
Freeze all parameters of $\mathcal{M}$
Construct binary mask $\mathbf{m}\in\{0,1\}^{V}$ where $m_{i}=1$ iff $i\in\mathcal{T}$
$\mathbf{M}\leftarrow\mathbf{m}\otimes\mathbf{1}_{d}$  // Broadcast to $\mathbb{R}^{V\times d}$

// Training: update only new token embeddings via masked gradients
for each batch $\mathcal{B}\subset\mathcal{D}$ do
    $\mathcal{L}\leftarrow\textsc{LM\_Loss}(\mathcal{M},\mathcal{B})$  // Forward pass
    $\nabla E\leftarrow\nabla_{E}\mathcal{L}$  // Compute gradients
    $E\leftarrow E-\eta\cdot(\nabla E\odot\mathbf{M})$  // Update only new token embeddings
end for

Algorithm 1: GTI Grounding Stage
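Algorithm 1 can be illustrated with a small NumPy sketch. It is not the TRL-based implementation described above. The frozen backbone is replaced by fixed context vectors `H` (an assumption standing in for hidden states computed from the paired text), the loss is a softmax cross-entropy over logits `H @ E.T`, and the toy sizes and learning rate are illustrative only.

```python
import numpy as np

def lm_loss_and_grad(E, H, targets):
    """Mean cross-entropy of logits H @ E.T and its gradient with respect to E."""
    logits = H @ E.T                                    # (B, V)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    rows = np.arange(H.shape[0])
    loss = -np.log(probs[rows, targets]).mean()
    probs[rows, targets] -= 1.0                         # softmax minus one-hot
    grad_E = probs.T @ H / H.shape[0]                   # (V, d)
    return loss, grad_E

def grounding_step(E, H, targets, new_token_ids, lr):
    """One masked update: only the rows listed in new_token_ids change."""
    mask = np.zeros((E.shape[0], 1))
    mask[new_token_ids] = 1.0                           # broadcasts over the d columns
    loss, grad_E = lm_loss_and_grad(E, H, targets)
    return E - lr * (grad_E * mask), loss

rng = np.random.default_rng(0)
V_old, d = 8, 6                                         # toy sizes (assumption)
new_token_ids = [8, 9, 10, 11]
E_old = rng.normal(size=(V_old, d))
# Start the new rows at the mean of the existing vocabulary (mean initialization).
E_init = np.vstack([E_old, np.tile(E_old.mean(axis=0), (len(new_token_ids), 1))])
targets = np.array(new_token_ids * 4)                   # each new token appears four times
H = rng.normal(size=(len(targets), d))                  # stand-in for frozen backbone outputs

E = E_init.copy()
losses = []
for _ in range(200):
    E, loss = grounding_step(E, H, targets, new_token_ids, lr=0.05)
    losses.append(loss)
```

In this toy setting the first `V_old` rows of `E` are left exactly as they were, while the new rows move away from their shared starting point and from one another.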

7.4 Analysis Details

Table 4: Additional retrieval results on the Vibrent dataset. We report Recall@K and NDCG@K for Baseline (MI+Vanilla SFT), LC-Rec (MI+Multi-task SFT), and our method GTI+Vanilla SFT. GTI+Vanilla SFT achieves the best performance on most Recall@K metrics and remains competitive on NDCG@K, further supporting the effectiveness of grounded token initialization for generative retrieval.

| Methodology | Recall@5 | Recall@10 | Recall@20 | Recall@50 | Recall@100 | NDCG@5 | NDCG@10 | NDCG@20 | NDCG@50 | NDCG@100 |
|---|---|---|---|---|---|---|---|---|---|---|
| MI+Vanilla SFT (Baseline) | 0.0226 | 0.0342 | 0.0475 | 0.0771 | 0.1031 | 0.0150 | 0.0188 | 0.0222 | 0.0280 | 0.0322 |
| MI+Multi-task SFT (LC-Rec) | 0.0243 | 0.0382 | 0.0539 | 0.0863 | 0.1194 | 0.0163 | 0.0208 | 0.0247 | 0.0311 | 0.0365 |
| GTI+Vanilla SFT (Ours) | 0.0230 | 0.0417 | 0.0599 | 0.0937 | 0.1222 | 0.0143 | 0.0203 | 0.0249 | 0.0316 | 0.0362 |
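For reference, the two metrics reported in Table 4 can be computed per query as in the sketch below. It uses the common binary-relevance definitions of Recall@K and NDCG@K; the exact evaluation protocol behind the table (for example, how the per-query scores are averaged) is not specified here, so the details are assumptions.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: discounted gain divided by the ideal discounted gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)                # rank is 0-based, so position 1 has discount log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking for a single query, with two relevant items.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
recall_2 = recall_at_k(ranked, relevant, 2)
ndcg_2 = ndcg_at_k(ranked, relevant, 2)
```

Recall@K only counts whether relevant items are retrieved, whereas NDCG@K also rewards placing them near the top, which is why a method can lead on one metric and trail on the other.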

Representational Similarity Analysis (RSA).

To quantitatively measure whether the learned representations preserve the semantic structure of the new SID vocabulary tokens, we perform representational similarity analysis. Given the well-trained RQ-VAE codebooks, which encode the compressed representation of the new SID vocabulary tokens, we define the oracle semantic embeddings as $X=\{x_{1},\ldots,x_{n}\}$, $x_{i}\in\mathbb{R}^{32}$. Let the corresponding learned token embeddings from the language model be $\hat{X}=\{\hat{x}_{1},\ldots,\hat{x}_{n}\}$, $\hat{x}_{i}\in\mathbb{R}^{d}$, where $d$ depends on the language model dimensionality. We construct pairwise token similarity matrices $S_{X},S_{\hat{X}}\in\mathbb{R}^{n\times n}$, where:

$$(S_{X})_{i,j}=\cos(x_{i},x_{j}),\qquad(S_{\hat{X}})_{i,j}=\cos(\hat{x}_{i},\hat{x}_{j}).$$

We then vectorize the upper-triangular entries of $S_{X}$ and $S_{\hat{X}}$ and compute their correlation (we implement both Spearman correlation and Pearson correlation to capture complementary aspects of representational alignment). This yields an RSA score that quantifies the extent to which the learned representation space preserves the pairwise semantic relations of the oracle space. Since RSA compares representational geometry rather than coordinates directly, it is well suited to our setting where the oracle and learned embeddings live in different ambient dimensions ($32$ vs. $d$).
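The RSA procedure above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions: the oracle and learned embeddings are random stand-ins, the Spearman correlation is computed as a Pearson correlation of ranks without tie handling, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def upper_triangle(S):
    """Vectorize the strictly upper-triangular entries of S."""
    i, j = np.triu_indices(S.shape[0], k=1)
    return S[i, j]

def pearson(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(a, b):
    """Pearson correlation of ranks; assumes there are no tied values."""
    def rank(v):
        return np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(a), rank(b))

def rsa_scores(X, X_hat):
    """Return (Pearson, Spearman) RSA scores between two embedding sets."""
    sx = upper_triangle(cosine_similarity_matrix(X))
    sx_hat = upper_triangle(cosine_similarity_matrix(X_hat))
    return pearson(sx, sx_hat), spearman(sx, sx_hat)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 32))                      # stand-in for 32-dim oracle embeddings
Q, _ = np.linalg.qr(rng.normal(size=(64, 32)))     # Q has orthonormal columns, shape (64, 32)
X_iso = X @ Q.T                                    # same geometry, embedded in 64 dims
X_rand = rng.normal(size=(40, 64))                 # unrelated 64-dim embeddings

p_iso, s_iso = rsa_scores(X, X_iso)
p_rand, s_rand = rsa_scores(X, X_rand)
```

The isometric copy `X_iso` preserves all pairwise cosines and therefore scores close to 1 despite living in a different dimension, while the unrelated embeddings score near 0. Note that if all learned embeddings coincide, every cosine equals 1 and the correlation is undefined because the similarity vector has zero variance.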

Extended SVD analysis of the industrial dataset.

The slower spectral decay and higher effective rank observed with GTI initialization suggest that this method preserves a more expressive and diverse feature space throughout the SFT process, preventing the dimensional collapse often associated with mean initialization.
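The spectral quantities discussed here can be computed as in the following sketch. Effective rank is not defined in this section, so the code assumes the entropy-based definition (the exponential of the Shannon entropy of the normalized singular values) and applies it to the uncentered matrix. The matrices are synthetic stand-ins, not the embeddings from the experiments.

```python
import numpy as np

def singular_values(E):
    """Singular values of the embedding matrix, in descending order."""
    return np.linalg.svd(E, compute_uv=False)

def effective_rank(E, eps=1e-12):
    """Exponential of the Shannon entropy of the normalized singular values."""
    s = singular_values(E)
    p = s / s.sum()
    p = p[p > eps]                                 # drop numerically zero directions
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
n_tokens, d = 256, 64                              # toy sizes (assumption)
mean_vec = rng.normal(size=d)
E_mean = np.tile(mean_vec, (n_tokens, 1))          # every new token at the same point
E_spread = rng.normal(size=(n_tokens, d))          # tokens at distinct locations

erank_mean = effective_rank(E_mean)
erank_spread = effective_rank(E_spread)
```

A matrix whose rows are all identical has a single non-zero singular value and an effective rank of 1, which is the starting configuration under mean initialization. A flatter spectrum yields a larger effective rank.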

Figure 8: Singular-Value Spectra of SID embedding matrix after SFT for Industrial dataset.

7.5 Full Related Work

RQ-VAE and Semantic IDs.

Vector-quantized autoencoders (vq-vae; haichao) learn discrete item representations by mapping continuous embeddings to codebook entries. Residual Quantized VAEs (RQ-VAE) (RQ-VAE) extend this with a hierarchy of residual codebooks, producing multi-level Semantic IDs (SIDs) that capture progressively finer semantic distinctions. Unlike conventional item IDs, SIDs carry compositional structure amenable to autoregressive generation, making them a standard component in generative recommendation (TIGER; LC-Rec; mtgr). Crucially, each codebook entry becomes a new token in the LM vocabulary, and how these tokens are initialized is precisely the bottleneck our work addresses.

Generative Recommendation.

Generative retrieval reframes recommendation as autoregressive decoding of item identifiers rather than nearest-neighbor search in embedding space (Aniket; chen2025pal). TIGER (TIGER) introduced RQ-VAE-learned SIDs as generation targets, and LC-Rec (LC-Rec) added auxiliary linguistic objectives during fine-tuning to improve SID representations. Several systems have demonstrated industrial-scale deployment: MTGR (mtgr) integrates generative retrieval with DLRM cross-feature signals; OneSearch (oneSearch) combines keyword-enhanced quantization with preference-aware rewards; and OneRec (OneRec; OneRec_TechReport) unifies retrieval and ranking via session-wise generation. Complementary directions include LLM-driven knowledge-graph recommenders (cai2025boosting) and MLLM-based world-knowledge integration (zhang2025linkedout). All of these systems must inject novel tokens into a pretrained LM; our work addresses a step that is upstream of and complementary to their contributions, namely how those tokens should be initialized.

Connection to Dimensional Collapse.

The initialization collapse we diagnose is related to dimensional collapse in contrastive and self-supervised learning (jing2021understanding; jiang2024hard), where learned representations are restricted to a low-dimensional subspace, eliminating fine-grained distinctions (Figure 2). Mean-of-vocabulary initialization induces a similar effect: all new tokens start at the same point, forming a rank-deficient configuration. jiang2024hard show that appropriate initialization can mitigate dimensional collapse in contrastive learning, which parallels our finding that grounding new tokens before fine-tuning preserves a higher-rank, more differentiated embedding subspace.
