No Single Best Model for Diversity:
Learning a Router for Sample Diversity

Yuhan Liu^♠ Fangyuan Xu^♠ Vishakh Padmakumar^♡
Daphne Ippolito^♢ Eunsol Choi^♠
^♠ New York University ^♡ Stanford University ^♢ Carnegie Mellon University
yl13579@nyu.edu

Abstract

When a prompt permits a large number of valid answers, generating them comprehensively is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality score of the unique answers in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs and find that no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, for each prompt, there exists a model that significantly outperforms all other models at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-WildChat, our trained router outperforms the single best model baseline ( $26.3\%$ vs $23.8\%$ ). We further show generalization to an out-of-domain dataset (NB-Curated) as well as to different answer-generation prompting strategies. Our work lays a foundation for studying the generation of comprehensive answers when a suite of models is available.

1 Introduction

Various tasks require language models (LMs) to generate diverse high-quality responses. These range from creative writing (Padmakumar and He, 2023; Chung et al., 2025), dialogues (Lin and Tomlin, ), code (Wu et al., 2026), math (Wu et al., 2025), scientific discovery (Novikov et al., 2025; Gottweis et al., 2025), survey response simulation (Meister et al., 2024) to synthetic data generation (Honovich et al., 2022). Evaluation on these tasks thus should move beyond the quality of a single output to the diversity and quality of a set of outputs. However, existing metrics fall short in quantifying the coverage of open-ended answer space. In this work, we introduce diversity coverage, a metric that measures the total quality scores assigned to all unique answers in the predicted answer set, relative to the best possible answer set with the same number of answers.

Prior work has focused on methods for improving the diversity of generations from a single LLM, such as changing the inference hyperparameters (Holtzman et al., 2019; Kambhatla et al., 2022; Santurkar et al., 2023; Nguyen et al., 2024) or the prompt (Lu et al., 2024b; Zhang et al., 2025b). In this paper, we ask whether diversity coverage can be further improved by taking advantage of the many available LMs, each of which behaves differently on the same prompt. We hypothesize that heterogeneous LLMs can be effectively ensembled to leverage their complementary strengths. Through a pilot study, we find that no single LLM dominates in generating a diverse and high-quality output set, and that different LLMs excel at answering different open-ended questions (Section 3). If we could pick the optimal model for each example, $33.0\%$ diversity coverage could be achieved on NB-WildChat, revealing a large gap compared to using the overall best model in the pool ( $23.8\%$ ).

However, determining the optimal model for a question can be challenging, as optimizing diversity involves analyzing the joint effect of all sampled answers. Although many previous works have proposed routing to select the best LLM (Jiang et al., 2023; Zhang et al., 2025a; Lu et al., 2024a), they rank candidates by comparing a single response against the ground truth. The problem becomes far more challenging when comparing answer sets, as it is expensive to sample and compare multiple outputs for each example. The difficulty increases further for open-ended questions, for which no gold answer sets are available. Lastly, one might expect that simple heuristics based on model metadata (e.g., family or size) could help predict diversity. However, we show in Figure 1 that models of different families and sizes generate the most diverse outputs for disjoint sets of queries.

Motivated by this, we propose a simple approach to predict the best model to respond to any given query. We train a router that scores the diversity and quality of each candidate model in the routing pool and routes to the best LLM to generate the answers. We frame this as a classification task and create a training dataset of queries and model scores. Our router outperforms the Top overall baseline on NB-WildChat and also generalizes to NB-Curated. Using the same trained router, we further explore ensembling outputs from two models per query, which brings additional performance gains. We will release the code and dataset publicly. Our key contributions are:

1. We propose a new metric, diversity coverage, to jointly measure the diversity and quality of an answer set for open-ended questions with a diverse output space.

2. Motivated by the finding that no single model excels at generating diverse outputs for all queries, we propose to train a router which predicts the best model for generation given an input query. We evaluate our method comprehensively on multiple datasets, demonstrating competitive performance in both in-domain (NB-WildChat) and out-of-domain (NB-Curated) settings.

3. We further investigate the effect of training data size and prompting strategies, as well as the inference efficiency of using a diversity router. More broadly, our findings highlight the potential of multi-LLM systems, where models with complementary strengths are combined to produce more diverse and high-quality solutions.

Figure 1: Left: LLMs exhibit different diversity coverage. Right: There is no universal best model on *NB-WildChat*. A model is considered the best model only if its diversity score is $5\%$ higher than that of the second-best candidate. Queries without a model satisfying this margin are labeled as “No dominant single models”. On *Simple Questions*, all models perform similarly, resulting in $100\%$ of “No dominant single models”. On *NB-WildChat*, no model consistently dominates across all queries.

2 Task Formulation

Many queries admit multiple valid responses rather than a single correct answer. We summarize datasets containing such queries in Table 1. Depending on the number of possible answers, we categorize the datasets into two types:

• Fixed answer set. Each query has a finite ground truth answer set $A^{*}$ . The predicted answer is correct if it belongs to $A^{*}$ .

• Open-ended answer set. Queries admit infinitely many valid answers, and listing them comprehensively is infeasible. The validity of a predicted answer can be evaluated based on its quality and assigned a scalar value.

Dataset	#	Question	Possible Valid Answers
Simple Questions (Zhang et al., 2024)	$23$	Output a random country in North America.	United States / Canada / Mexico / …
NB-Curated¹ (Zhang et al., 2025c)	$92$	Tell me a funny joke.	Who is Adam and why he is optimizing my code? / What did a late tomato say to other tomatoes? I will ketchup, …
NB-WildChat (Zhang et al., 2025c)	$1$ k	Give me a very simple way to remember the formula for tangent.	Tangent = Opposite / Adjacent / Draw a unit circle …
Infinity-Chat (Jiang et al., 2025)	$26$ k	Write a story about America.	In the heartland of America there was a small town … / My decision to go to the United States …

¹ The original released dataset has 100 questions. We filter out 4 questions that ask for multiple answers (violating our hypothesis) and 4 questions that do not have multiple correct answers.

Table 1: Dataset statistics. Simple Questions has a fixed answer set and all other datasets have open-ended answer sets.

2.1 Task definition

Given a query $q$ and a generation budget $B$ (i.e., the number of answers produced by the model), the task is to derive an answer set $A=\{a_{1},\ldots,a_{B}\}$ that covers as many distinct and high-quality answers as possible. Our task definition assumes two functions:

• uniq( $q,A$ ): Given a query $q$ and an answer set $A$ , it outputs a subset $A_{d}\subseteq A$ consisting of distinct answers only (i.e., no two answers in $A_{d}$ are equivalent to each other). We follow prior work (Zhang et al., 2025c) to derive $A_{d}$ : we iterate over the answers in $A$ and greedily add an answer $a$ to $A_{d}$ if it is not equivalent to any answer already in $A_{d}$ . The equivalence of two answers is determined by exact string match for queries with a fixed answer set and by an equivalence classifier for open-ended questions. The process is described in Appendix C.

• quality( $q,a_{i}$ ): Given an individual answer $a_{i}$ for query $q$ , it outputs a scalar value representing the quality of $a_{i}$ . This can be computed either by comparing against ground truth answer sets (factual queries) or by using a reward model (open-ended queries).
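The greedy construction of $A_{d}$ can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function signature and the `exact_match` helper are assumptions, and for open-ended questions the `equivalent` callable would be replaced by the equivalence classifier described in Appendix C.

```python
from typing import Callable, List


def uniq(answers: List[str], equivalent: Callable[[str, str], bool]) -> List[str]:
    """Greedily keep an answer only if it is not equivalent to any kept answer."""
    distinct: List[str] = []
    for a in answers:
        if not any(equivalent(a, kept) for kept in distinct):
            distinct.append(a)
    return distinct


def exact_match(a: str, b: str) -> bool:
    """Equivalence for fixed-answer-set queries: exact string match."""
    return a == b


answers = ["Canada", "Mexico", "Canada", "United States", "Mexico"]
distinct = uniq(answers, exact_match)
print(distinct)  # ['Canada', 'Mexico', 'United States']
```

Because the procedure is greedy, the result depends on the equivalence check but preserves the order in which distinct answers first appear.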

2.2 New Evaluation Metric: Diversity Coverage

Given a predicted set of answers $A=\{a_{1},\ldots,a_{B}\}$ to query $q$ , we introduce a new metric, diversity coverage (div-cov) as follows:

$$\operatorname{div-cov}(q,A)\coloneqq\frac{1}{\operatorname{max-uniq-sum}(q,B)}\sum_{a\in\operatorname{uniq}(q,A)}\operatorname{quality}(q,a)$$

We define max-uniq-sum( $q,B$ ) as the maximum score that one can reach by generating an answer set of size $B$ where each answer in the set is distinct and achieves maximum quality.

For questions with a fixed answer set $A^{*}$ , assuming $B\geq|A^{*}|$ , this measures the proportion of unique ground-truth answers covered by the answer set, equivalent to the coverage rate metric proposed by Zhang et al. (2024).
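To make the definition concrete, the sketch below computes div-cov for one toy fixed-answer-set query and one toy open-ended query. Everything specific here is an assumption for illustration: the gold set, the reward values, the 0 to 10 reward scale, and the choice of max-uniq-sum as $|A^{*}|$ (fixed case, $B\geq|A^{*}|$) or $B$ times the maximum reward (open-ended case).

```python
from typing import Callable, List


def div_cov(
    answers: List[str],
    equivalent: Callable[[str, str], bool],
    quality: Callable[[str], float],
    max_uniq_sum: float,
) -> float:
    """Sum the quality of greedily deduplicated answers, then normalize."""
    distinct: List[str] = []
    for a in answers:
        if not any(equivalent(a, kept) for kept in distinct):
            distinct.append(a)
    return sum(quality(a) for a in distinct) / max_uniq_sum


# Fixed answer set (hypothetical gold set): quality is 1 for a gold answer,
# else 0, and max-uniq-sum is |A*| because the budget B >= |A*|.
gold = {"red", "green", "blue", "yellow"}
fixed_score = div_cov(
    ["red", "red", "blue", "purple"],
    equivalent=lambda a, b: a == b,
    quality=lambda a: 1.0 if a in gold else 0.0,
    max_uniq_sum=len(gold),
)

# Open-ended query (assumed 0-10 reward scale, budget B = 4):
# max-uniq-sum = B * 10.
rewards = {"joke A": 8.0, "joke B": 6.0}
open_score = div_cov(
    ["joke A", "joke B", "joke A", "joke B"],
    equivalent=lambda a, b: a == b,
    quality=lambda a: rewards[a],
    max_uniq_sum=4 * 10.0,
)
print(fixed_score, open_score)  # 0.5 0.35
```

In the open-ended example, repeating the same two answers wastes half of the budget, which is exactly what the normalization by max-uniq-sum penalizes.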

Prior works measure diversity by the number of unique, valid outputs for questions with a fixed answer set (Zhang et al., 2024), or via pairwise embedding-based similarity (Zhang et al., 2025b; Jiang et al., 2025). Such metrics either do not work for questions with an open-ended answer space or do not account for the quality of the answers (Zhang et al., 2025b; Jiang et al., 2025). Zhang et al. (2025c) propose a unified metric for quality and diversity over an ordered answer list, which penalizes answers generated later to account for user patience. This paper focuses on evaluating the quality and diversity of a set of answers, regardless of generation order, making diversity coverage better suited to our purpose.

3 A pilot study on ensembling models to maximize diversity coverage

In this section, we first study whether a strong LLM can dominate other models in diversity coverage across a range of questions (Section 3.1). We find that no single model dominates, motivating us to explore the upper bound of the gains if we use a pool of LLMs instead of a single LLM (Section 3.2). We compare several ensembling strategies under an oracle model selection setting. We find that picking the best LLM per question is the most promising, which leads us to develop a model router in later sections.

3.1 No single model is best at diversity coverage for all questions

Model sets

We study $18$ models from four open-source model families with different parameter counts: Llama (Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B, Llama-3.3-70B), Qwen (Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, Qwen2.5-72B), OLMo (OLMo-2-0425-1B, OLMo-2-1124-7B, OLMo-2-1124-13B, OLMo-2-0325-32B), and Gemma (gemma-3-1b, gemma-3-4b, gemma-3-12b, gemma-3-27b).

Settings

For each model and query, we sample $N$ answers with a prompt that instructs the model to enumerate as many valid answers as possible (see Appendix B for the full template). This prompt encourages models to explore the space of possible responses rather than produce a single canonical answer.² We keep the decoding method fixed throughout the paper (described in Appendix D). For each query, we define the “dominant” model by the following two criteria: (1) the model achieves the highest diversity coverage, and (2) its score is at least $5\%$ higher than that of the answer set generated by any other model.

² Zhang et al. (2025b) found that this method elicits a more diverse answer set than sampling multiple single answers. We further compare different prompt templates (e.g., generating one or two answers in a single generation) in Appendix H.1, finding that generating all answers in a single prompt elicits the most diverse answer set.
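The two dominance criteria can be checked per query as in the sketch below. The text does not say whether the $5\%$ margin is absolute or relative, so the sketch assumes an absolute gap in diversity coverage and exposes it as a parameter. The model names and scores are hypothetical.

```python
from typing import Dict, Optional


def dominant_model(scores: Dict[str, float], margin: float = 0.05) -> Optional[str]:
    """Return the best model if it beats the runner-up by `margin`, else None.

    `scores` maps model name -> diversity coverage in [0, 1] for one query.
    The margin is treated as an absolute difference (an assumption).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2:
        return ranked[0][0] if ranked else None
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    return best if best_score - second_score >= margin else None


clear_winner = dominant_model({"model_a": 0.40, "model_b": 0.31, "model_c": 0.10})
no_winner = dominant_model({"model_a": 0.40, "model_b": 0.38})
print(clear_winner, no_winner)  # model_a None
```

Queries for which the function returns `None` correspond to the “No dominant single models” category in Figure 1.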

Results

On Simple Questions, we find no dominant model for any query. On NB-WildChat, this remains true for only 30% of queries, and more than five models are each dominant on at least 5% of queries, suggesting that optimizing the model choice per question can be fruitful. Figure 1 compares the per-model frequency of achieving the best diversity coverage on NB-WildChat.

3.2 Oracle experiment: how much does picking the best model(s) per query improve?

Motivated by the finding that different models generate the most diverse outputs for different queries, we ask whether ensembling outputs from multiple models can improve diversity. We assume an oracle setting, where we have access to the diversity coverage scores of all LLMs on all queries. We compare three strategies to ensemble models with a fixed generation budget $B$ :³

³ In our experiments, we fix $B$ to be $50$ answers per question unless otherwise stated.

Method	SQ	Curated	WildChat
Top overall model	$96.9\%$	$47.0\%$	$23.8\%$
Top two overall models	$97.1\%$	$45.6\%$	$25.6\%$
Random model / query	$92.7\%$	$37.5\%$	$18.1\%$
Top model / query	97.9%	59.6%	33.0%

Table 2: Diversity coverage scores for ensembling multiple LLMs on Simple Questions (SQ), NB-Curated (Curated) and NB-WildChat (WildChat).

• Top overall model. We select the single model with the best average diversity coverage per dataset, representing the best possible performance without ensembling. The selected top models for Simple Questions, NB-Curated and NB-WildChat are, respectively, Llama-3.1-8B, Qwen3-14B and OLMo-2-0425-1B.

• Top two overall models. We select the two models with the highest average diversity coverage per dataset, then ensemble their outputs, generating $B/2$ answers per model. The selected model pairs are, respectively, (Llama-3.1-8B, Llama-3.3-70B), (Qwen3-14B, Llama-3.1-8B) and (OLMo-2-0425-1B, OLMo-2-1124-7B).

• Top model per query. For each query, we select the model with the highest diversity coverage. This represents the oracle performance of always choosing the best LLM for a given query. We also report the performance of randomly choosing a model per query (Random model per query) as a baseline.
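Given a table of per-query, per-model diversity coverage scores, the Top overall, Random and Top-model-per-query numbers can be computed as in the following sketch. The score table is hypothetical, and the random baseline is reported here as its expected value (the mean over models) rather than a single random draw.

```python
from typing import Dict, List

# Hypothetical div-cov scores: scores[query][model].
scores: Dict[str, Dict[str, float]] = {
    "q1": {"model_a": 0.30, "model_b": 0.10},
    "q2": {"model_a": 0.20, "model_b": 0.50},
    "q3": {"model_a": 0.40, "model_b": 0.20},
}
models: List[str] = ["model_a", "model_b"]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Top overall model: best average div-cov over the dataset.
average = {m: mean([scores[q][m] for q in scores]) for m in models}
top_overall = max(models, key=lambda m: average[m])
top_overall_score = average[top_overall]

# Random model per query: expected div-cov under a uniform choice of model.
random_score = mean([mean(list(scores[q].values())) for q in scores])

# Top model per query (oracle): best model chosen independently per query.
oracle_score = mean([max(scores[q].values()) for q in scores])

print(top_overall, round(top_overall_score, 3), round(oracle_score, 3))
```

The Top two overall strategy cannot be read off such a table: it requires sampling $B/2$ answers from each model and recomputing diversity coverage on the merged answer set, since duplicates across models are removed by uniq.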

Table 2 shows that query-level model selection (Top model per query) is consistently the best strategy across all three datasets. The gap increases as questions become more open-ended (on NB-Curated and NB-WildChat). For Simple Questions, using the single best LLM (Top overall model) recovers $96.9\%$ of the ground truth targets. Open-ended questions, however, are more challenging, and choosing the best model per query yields non-trivial gains. This is evidenced by results on NB-Curated, where selecting the top model per query ( $59.6\%$ ) yields a $27\%$ relative improvement over the second-best strategy (Top overall model, $47.0\%$ ), and on NB-WildChat, where the relative improvement over the second-best strategy (Top two overall models, $25.6\%$ ) is $29\%$ .
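The two relative improvements can be verified directly from Table 2, taking the second-best strategy on each dataset (Top overall model on NB-Curated, Top two overall models on NB-WildChat):

```latex
\frac{59.6 - 47.0}{47.0} \approx 0.268 \;\;(\text{NB-Curated}),
\qquad
\frac{33.0 - 25.6}{25.6} \approx 0.289 \;\;(\text{NB-WildChat}).
```

Measured against the single Top overall model on NB-WildChat ( $23.8\%$ ) instead, the same calculation gives $(33.0-23.8)/23.8\approx 0.387$ .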

4 Learning to ensemble multiple models for diverse outputs

Oracle routing significantly improves diversity coverage, but it is costly as we need to sample and evaluate outputs from all candidate LLMs. This motivates us to train a router to predict the most promising model without sampling the entire answer sets from all models.

4.1 Router

Problem setting

Given a query $q$ and a suite of models $\mathcal{M}=\{m_{1},m_{2},\ldots,m_{n}\}$ , a router ranks the models by $\operatorname{div-cov}(q,A^{(i)})$ , where $A^{(i)}$ is the answer set generated by $m_{i}$ for query $q$ under some budget $B$ . The oracle model index for $q$ is defined as $i^{*}=\arg\max_{i}\operatorname{div-cov}(q,A^{(i)})$ . The pairs of queries $q_{j}$ and their oracle indices $i_{j}^{*}$ constitute the router training data $\mathcal{D}=\{(q_{j},i_{j}^{*})\}$ .
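Constructing $\mathcal{D}$ from precomputed diversity coverage scores amounts to one argmax per query, as sketched below. The queries and scores are hypothetical, and breaking ties in favor of the lowest model index is an assumption, since the text does not specify a tie-breaking rule.

```python
from typing import Dict, List, Tuple


def build_router_dataset(div_cov_scores: Dict[str, List[float]]) -> List[Tuple[str, int]]:
    """Map each query to the index of its highest-scoring model.

    `div_cov_scores[q][i]` is div-cov(q, A^(i)) for model m_i.
    Ties go to the lowest index (an assumption).
    """
    dataset = []
    for query, scores in div_cov_scores.items():
        oracle_index = max(range(len(scores)), key=lambda i: scores[i])
        dataset.append((query, oracle_index))
    return dataset


toy_scores = {
    "Tell me a funny joke.": [0.10, 0.40, 0.20],
    "Write a story about America.": [0.50, 0.30, 0.10],
}
router_data = build_router_dataset(toy_scores)
print(router_data)
```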

Classification Objectives

We compare two classification formulations for the router:

• $|\mathcal{M}|$ -way classification: the router is a single classifier $r_{\theta}:\mathcal{Q}\rightarrow\{1,\ldots,|\mathcal{M}|\}$ which predicts the oracle best model index $i_{j}^{*}$ for each query $q_{j}$ . Letting $r_{\theta}(q)_{i}$ denote the predicted probability of selecting model $m_{i}$ , we train the router with the cross-entropy loss: $\mathcal{L}_{\text{multi}}=\mathbb{E}_{(q_{j},i_{j}^{*})\sim\mathcal{D}}\left[-\log r_{\theta}(q_{j})_{i_{j}^{*}}\right].$

• Binary classification: For each LLM $m_{i}$ , we derive a binary training dataset $\mathcal{D}^{(i)}=\{(q_{j},y_{j}^{(i)})\}$ from $\mathcal{D}$ , where $y_{j}^{(i)}=\mathbbm{1}[i=i_{j}^{*}]$ indicates whether $m_{i}$ is the oracle best model for query $q_{j}$ . We then train a binary classifier $r_{\theta}^{(i)}:\mathcal{Q}\rightarrow[0,1]$ to predict this label using the binary cross-entropy loss. At inference time, the router evaluates all binary classifiers $\{r_{\theta}^{(i)}(q)\}_{i=1}^{|\mathcal{M}|}$ and selects the model with the highest predicted score: $\arg\max_{i\in\{1,\ldots,|\mathcal{M}|\}}r_{\theta}^{(i)}(q)$ .
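The two objectives and the binary router's inference rule can be written out for a single example as below. This is a dependency-free sketch of the loss computations only, not the training code: the logits and probabilities are made-up numbers standing in for classifier outputs, and model indices are zero-based.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multiway_loss(logits: List[float], oracle_index: int) -> float:
    """Cross-entropy for one example: -log r_theta(q)_{i*}."""
    return -math.log(softmax(logits)[oracle_index])


def binary_labels(oracle_index: int, n_models: int) -> List[int]:
    """y^(i) = 1[i == i*] for each model i."""
    return [1 if i == oracle_index else 0 for i in range(n_models)]


def binary_cross_entropy(prob: float, label: int) -> float:
    """Binary cross-entropy for one classifier output in (0, 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def route_binary(classifier_scores: List[float]) -> int:
    """Inference: pick the model whose binary classifier scores highest."""
    return max(range(len(classifier_scores)), key=lambda i: classifier_scores[i])


uniform_loss = multiway_loss([0.0, 0.0, 0.0], oracle_index=1)  # equals log(3)
labels = binary_labels(oracle_index=2, n_models=4)
chosen = route_binary([0.2, 0.7, 0.4])
print(round(uniform_loss, 4), labels, chosen)
```

One practical consequence of the binary formulation is that each classifier sees a single positive class against all others, so each derived dataset $\mathcal{D}^{(i)}$ is typically imbalanced.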

Query encoding

We experiment with two input featurizations: (1) infly/inf-retriever-v1 (Yang et al., 2025), a retriever fine-tuned from Qwen-2-7B for information retrieval tasks. We refer to it as model-agnostic encodings (agn). (2) Model hidden states: we encode the query using each model $m_{i}$ and extract the representation from the final layer’s last hidden state. We hypothesize that this representation encodes rich information on how the model decodes its outputs. We refer to it as model-specific encodings (spec).

4.2 Experiment settings

Training and evaluation data

We split the $1,000$ NB-WildChat prompts from Zhang et al. (2025c) into train, validation and test sets containing 70%, 10% and 20% of the data, respectively. We conduct out-of-domain evaluation on NB-Curated questions.
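A 70/10/20 split of this kind can be produced as in the sketch below. The shuffling and the seed are assumptions; the text does not say how prompts were assigned to the splits.

```python
import random
from typing import List, Tuple


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices 0..n-1 and split them 70% / 10% / 20%."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 70 // 100
    n_val = n * 10 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 100 200
```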

Evaluation metrics

Diversity coverage jointly measures the diversity and quality of the generated answer set. To disentangle the two effects, we additionally report metrics that measure each aspect. Quality (Qual) measures the average quality score across all sampled answers: $\frac{1}{|A|}\sum_{a\in A}\operatorname{quality}(q,a).$ Uniqueness (Unq) measures the number of semantically non-equivalent answers: $|A_{d}|$ . Unique Quality (Unq Qual) measures the average quality score over unique answers only: $\frac{1}{|A_{d}|}\sum_{a\in{\text{uniq}(q,A)}}\operatorname{quality}(q,a).$ Together, these metrics reveal whether improvements in diversity coverage arise from generating more distinct answers, improving answer quality, or both.
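The three auxiliary metrics follow directly from the per-answer quality scores and the indices kept by uniq, as in this sketch (the quality values and indices are hypothetical). For a single query, the definitions imply that the numerator of div-cov equals Unq multiplied by Unq Qual, which the last lines check.

```python
from typing import Dict, List


def disentangled_metrics(qualities: List[float], distinct_idx: List[int]) -> Dict[str, float]:
    """Compute Qual, Unq and Unq Qual for one query.

    `qualities[k]` is quality(q, a_k); `distinct_idx` lists the positions
    of the answers kept by uniq(q, A).
    """
    unq = len(distinct_idx)
    return {
        "Qual": sum(qualities) / len(qualities),
        "Unq": unq,
        "Unq Qual": sum(qualities[k] for k in distinct_idx) / unq,
    }


qualities = [4.0, 4.0, 2.0, 6.0]  # the second answer duplicates the first
distinct_idx = [0, 2, 3]
metrics = disentangled_metrics(qualities, distinct_idx)

# Numerator of div-cov: total quality over unique answers.
unique_quality_sum = sum(qualities[k] for k in distinct_idx)
print(metrics, unique_quality_sum)
```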

Baselines

We consider several non-routing baselines. For a fair comparison with our trained routers, we restrict these methods to access only training-set labels and evaluate them on the test set. We implement the baselines from Section 3.2: Top overall, Top two overall, and Random model per query. We also include a Frequency baseline, where models are sampled in proportion to how frequently they reach the highest diversity coverage. We additionally compute Top model per query and Top two models per query as oracle performance on diversity coverage, using ground-truth labels on the test set. Specifically, Top model per query selects the best model for each query, and Top two models per query selects the best pair over all model combinations. If two models are selected, we take half of the answers from each model.
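The Frequency baseline can be sketched as weighted sampling over the training-set oracle labels. The label list below is hypothetical, and the use of a seeded generator is an assumption made for reproducibility.

```python
import random
from collections import Counter
from typing import List


def frequency_baseline(train_oracle_labels: List[str], n_queries: int, seed: int = 0) -> List[str]:
    """Sample one model per test query, proportional to training win counts."""
    counts = Counter(train_oracle_labels)
    models = sorted(counts)
    weights = [counts[m] for m in models]
    rng = random.Random(seed)
    return rng.choices(models, weights=weights, k=n_queries)


# Hypothetical oracle labels on the training set.
train_labels = ["model_a"] * 6 + ["model_b"] * 3 + ["model_c"] * 1
picks = frequency_baseline(train_labels, n_queries=5)
print(picks)
```

Models that never win on the training set receive zero probability, which distinguishes this baseline from uniform random selection.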

Router Models

We implement three types of router models, and describe their implementation details in Appendix E.

• KNN (Fix, 1985). A simple, non-parametric classifier whose predictions are obtained from the $K$ nearest neighbours in the training data, with $K\in\{1,5\}$ .

• BERT (Devlin et al., 2019). Following other routing literature (Ong et al., 2024; Zhang et al., 2025a), we fine-tune BERT with a classification head that selects among the $|\mathcal{M}|$ models.⁴

• MLP. We report results for training $|\mathcal{M}|$ binary MLP classifiers and for training one MLP classifier for $|\mathcal{M}|$ -way classification. We report results for (1) using inf-retriever to encode the query for all classifiers: Binary MLP (agn) and M-way MLP (agn), and (2) using the candidate model $m_{i}$ ’s last-layer hidden states to encode the query for the respective classifier: Binary MLP (spec) and M-way MLP (spec).

⁴ We did not experiment with implementing $|\mathcal{M}|$ BERT models for binary classification, given that fine-tuning BERT is computationally more expensive than fine-tuning the 2-layer MLP router.
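As an illustration of the simplest router, the sketch below routes a query by majority vote over its nearest training queries in embedding space. Cosine similarity and majority voting are assumptions (the implementation details are in Appendix E), and the two-dimensional vectors stand in for real query encodings.

```python
import math
from collections import Counter
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_route(query_vec: List[float], train_vecs: List[List[float]],
              train_labels: List[int], k: int = 5) -> int:
    """Return the most common oracle model index among the k nearest queries."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda j: cosine(query_vec, train_vecs[j]),
        reverse=True,
    )
    votes = Counter(train_labels[j] for j in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set: embeddings and oracle model indices (hypothetical).
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]]
train_labels = [0, 0, 1, 1, 0]

route_k3 = knn_route([1.0, 0.05], train_vecs, train_labels, k=3)
route_k1 = knn_route([0.0, 1.0], train_vecs, train_labels, k=1)
print(route_k3, route_k1)  # 0 1
```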

Method	NB-WildChat				NB-Curated (OOD)
Method	#Unq	Qual	Unq Qual	Cov.	#Unq	Qual	Unq Qual	Cov.
Top overall	42.6	3.0	2.9	23.8%	35.4	6.0	5.7	38.6%
Frequency	33.1	3.8	3.6	21.0%	28.2	7.2	7.1	39.6%
Random model per query	27.8	3.7	3.6	18.1%	27.8	7.2	7.0	37.5%
Top model per query (oracle)	38.8	4.5	4.4	33.0%	30.3	7.6	7.4	59.6%
KNN (N=1)	34.3	3.7	3.6	23.1%	28.2	7.3	7.1	39.7%
KNN (N=5)	34.9	3.8	3.7	24.1%	29.8	7.3	7.1	40.2%
M-way BERT	40.3	3.3	3.2	24.4%	35.0	6.3	6.2	40.3%
M-way MLP (agn)	35.1	3.9	3.8	25.3%	30.1	7.6	7.5	40.3%
M-way MLP (spec)	39.3	3.5	3.4	25.9%	34.6	6.3	6.1	40.2%
Binary MLP (agn)	38.4	3.5	3.4	25.7%^∗∗	32.8	7.1	7.0	40.7%^∗∗
Binary MLP (spec)	38.1	3.6	3.5	26.3%^∗∗	30.8	7.0	6.8	39.3%^ns

Table 3: A per-query router selects over 18 models to maximize diversity coverage (Cov.). We train our best MLP router for 5 runs with random seeds to compute statistical significance for our best system (bolded) against Top overall, ^∗∗ indicating significantly better and ^ns indicating not significant.

5 Results

5.1 Performance Evaluation

We report performance in Table 3.⁵ Top overall is the best-performing non-routing baseline for in-domain evaluation on NB-WildChat. This indicates that the LLM chosen from training labels maintains strong diversity coverage on the test set. The Frequency baseline generalizes better to out-of-domain NB-Curated questions. KNN routers yield only marginal improvement. MLP-based routers outperform the baselines and the other routers on NB-WildChat. Specifically, binary routers with model-specific query encodings bring the greatest gains ( $26.3\%$ ), surpassing the Top overall baseline ( $23.8\%$ ). For MLP classifiers, model-specific query encodings (spec) provide more useful information than model-agnostic encodings (agn) in-domain, but generalize worse.

⁵ We also report accuracy (i.e., how frequently the router predicts the ground-truth best model) in Table 6 in the Appendix.

Method	NB-WildChat				NB-Curated (OOD)
Method	#Unq	Qual	Unq Qual	Cov.	#Unq	Qual	Unq Qual	Cov.
Top 2 overall	39.1	3.4	3.2	23.8%	31.6	6.7	6.3	38.3%
Top 2 per query	40.7	4.5	4.5	35.8%	41.3	7.7	7.6	62.6%
Router	38.4	3.8	3.6	26.7%^∗∗	32.3	7.1	6.8	42.2%^∗∗

Table 4: Performance of ensembling two models per query. We use the best single-model router from Table 3 (Binary MLP (spec)) and ensemble the top 2 models ranked by its prediction scores. ^∗∗ indicates significantly better than Top 2 overall.

A router trained to select a single model can also be used to ensemble outputs from two models, which provides further gains.

We observe consistent gains when using our trained router (Binary MLP (spec)) to select two models, as presented in Table 4, both in-domain ( $26.41\%$ vs. $23.8\%$ ) and out-of-domain. Moving from Top overall to Top 2 overall, the diversity coverage of the best non-routing baseline does not improve, whereas the oracle (Top per query) improves steadily from one to two models. We show that our best routers are significantly better than the Top overall and Top two overall baselines across 5 checkpoints trained under different random seeds. We further discuss how the number of models selected affects answer diversity in Section H.2.
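Reusing a single-model router for two-model ensembling can be sketched as follows: rank the models by their router scores, keep the top two, and draw half of the budget from each. The `sample` callable and the score values are hypothetical stand-ins for real generation and real classifier outputs, and an even budget is assumed.

```python
from typing import Callable, List, Sequence


def ensemble_top2(
    router_scores: Sequence[float],
    sample: Callable[[int, int], List[str]],
    budget: int,
) -> List[str]:
    """Take budget/2 answers from each of the two highest-scoring models."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    first, second = ranked[0], ranked[1]
    half = budget // 2  # assumes an even budget, e.g. B = 50
    return sample(first, half) + sample(second, half)


def fake_sample(model_index: int, n: int) -> List[str]:
    """Hypothetical generator standing in for sampling n answers from a model."""
    return [f"m{model_index}-answer{j}" for j in range(n)]


answers = ensemble_top2([0.1, 0.8, 0.6], fake_sample, budget=4)
print(answers)  # ['m1-answer0', 'm1-answer1', 'm2-answer0', 'm2-answer1']
```

Diversity coverage is then computed on the merged set, so answers that both models produce count only once after uniq.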

Scaling training data size consistently produces a better router.

Would training on a larger dataset improve performance? On Infinity-Chat, we show in Figure 2 that router performance increases steadily as the training set grows from 500 to 1k to 2k examples. We further find in Appendix F that training also scales on NB-WildChat and that routers can generalize across the two datasets.

5.2 Efficiency Evaluation

In Figure 3, we show the inference-time efficiency of various methods when generating an answer set on NB-WildChat. We use 2 H200 GPUs for answer sampling⁶ and 1 H200 GPU for diversity coverage calculation. We compare the latency of three methods: Top (Top overall), Router, which is the Binary MLP classifier (spec), and Oracle (Top model per query).

⁶ We assume no parallelization in sampling generations. If two models are selected, the process is performed sequentially (i.e., model by model).

Inference with our router is about $2\text{-}3\times$ slower than the Top baseline. The routing itself is not very costly, yet sampling becomes more expensive because the router often directs queries to a bigger LLM than the Top baseline model (OLMo-2-0425-1B).

The oracle setting, while showing the strongest performance, is also much more expensive, introducing up to $19\times$ computation overhead compared to our router. This is because its routing involves brute-force computation of diversity coverage for all models to find the best candidate per query. In contrast, our router introduces only a fixed overhead that does not scale with the number of selected models.

6 Discussions: Different Prompt Templates

Method	Prompt Template
Method	Gen 1	Gen 2	Gen All
Top overall	18.5%	19.7%	23.8%
Random	9.9%	13.2%	18.1%
Frequency	15.6%	17.1%	21.0%
Oracle (G-1)	25.6%	22.3%	20.4%
Oracle (G-2)	19.5%	28.3%	21.0%
Oracle (G-All)	14.7%	18.8%	33.0%
Router (G-1)	19.1%	20.4%	14.4%
Router (G-2)	19.7%	21.6%	19.0%
Router (G-All)	14.8%	18.1%	26.2%

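Figure 4: Div-Cov (%) results on NB-WildChat with various prompting strategies. Columns give the prompt used for generation; rows give the prompt under which the oracle labels were derived or the router was trained. Cells where the two match are in-domain evaluation, and all other cells are out-of-domain evaluation.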

A large body of work on diversity has focused on improving the prompt, whereas throughout this paper we have used a fixed prompt template to sample answers and compute diversity coverage. In this last section, we compare our default prompt with two alternative prompt templates, with the exact prompts provided in Appendix B:

• Generate one (G-1): produce one random answer for the given question.

• Generate two (G-2): provide two different answers for the given question.

• Generate all (G-All): list all possible answers sequentially. This is our default.

Table 7 summarizes the results. In the first block, we use the same baselines as in Table 3. Comparing the three prompts, we find that our default prompt (G-All) achieves the highest overall performance, as shown by the diversity scores (Cov.) in the first block.

In the second and third blocks, we report the oracle (Top model per query) and router results. We use our best router (Binary MLP (spec)) for this experiment. Router (X) is a router trained under prompt type X. Oracle (X) denotes that we always use the ground-truth labels derived by sampling with prompt X as predictions. Training a router improves diversity for all prompts, as every router beats the Top overall baseline under its own prompt. However, we see little generalization across prompts for both the oracle and the trained router. For instance, when generating with the G-1 prompt, the oracle model chosen for the G-All prompt performs worse than the baselines under the G-1 prompt. Moreover, larger gains are observed when routing under better prompts. A more detailed comparison is provided in Appendix H.1.

Degrading Answer Quality While Listing Multiple Answers

Should we always use the G-All prompt? Figure 5 plots average answer quality under two prompt strategies (G-1 and G-All). For the G-All prompt, we plot the quality of answers at different positions within the same generation. For the G-1 prompt, which produces one answer per generation, the quality is plotted as a single dashed line. We find two trends: (1) the G-1 prompt consistently generates answers with higher average quality than the G-All prompt, and (2) with the G-All prompt, as the generation continues, answer quality decreases and the variance of the quality scores increases. Therefore, when individual answer quality matters more, the G-1 prompt can be more appropriate, even though it makes diverse answers harder to elicit.

7 Related Work

Improving output diversity

Concerns about the output diversity of LLMs (Padmakumar and He, 2023; Anderson et al., 2024; West and Potts, 2025) have prompted two categories of solutions: methods that modify model weights (Lanchantin et al., 2025; Chung et al., 2025; Sorensen et al., 2025; Puri et al., 2026) and inference-time methods (Welleck et al., 2024; Levy et al., 2023; Meister et al., 2024; Xiao et al., 2025; Kambhatla et al., 2022; Santurkar et al., 2023; Hayati et al., 2024; Wang et al., 2025). A line of work proposes advanced prompting strategies, such as denial prompting (Lu et al., 2024b), probabilistic prompting (Wong et al., 2024), and verbalized sampling (Zhang et al., 2025b). All of these methods focus on improving the diversity of a single model, whereas we study a multi-LLM setting.

Routers for LLMs

Prior work finds that combining multiple models is often better than relying on a single one (Jiang et al., 2023; Feng et al., 2024; 2025a; 2025b; 2026a; 2026b). Building on this insight, many works train a router that selects among multiple LLMs to achieve better task performance (Jiang et al., 2023; Zhang et al., 2025a; Lu et al., 2024a) or efficiency (Chen et al., 2024; Ding et al., 2024; Ong et al., 2024; Zhang et al., 2025a). Simple methods (Ding et al., 2024; Ong et al., 2024) demonstrate the effectiveness of routing by switching between a stronger and a weaker model, which balances cost and quality. Other works (Jiang et al., 2023; Lu et al., 2024a) train routers over many top-performing LLMs to leverage their complementary expertise. All existing methods aim to enhance end performance measured on a single generation per question, and none of them discusses how routing can benefit the diversity and quality of a set of answers. To the best of our knowledge, we are the first to propose a router that promotes diversity coverage by harnessing the complementary strengths of heterogeneous models.

8 Conclusion

In this paper, we study mixing outputs from multiple LLMs as a strategy to improve response diversity. We first formalize diversity as the coverage of high-quality responses and propose a unified evaluation metric that applies to both finite and open-ended answer spaces. To optimize this metric, we introduce a router that dynamically selects the most suitable LLM(s) for each query, showing improved performance. Further scaling the training data consistently improves the router. We make a few simplifying assumptions: (1) when two models are selected, their outputs are mixed in equal proportions; and (2) only one or two models are used per query. We encourage future research to relax these assumptions and to explore efficiency-aware routing.

Acknowledgments

This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise. The work is partially funded by NSF CAREER award 2443271.

References

B. R. Anderson, J. H. Shah, and M. Kreminski (2024) Homogenization effects of large language models on human creative ideation. In Proceedings of the 16th Conference on Creativity & Cognition, pp. 413–425.
S. Chen, W. Jiang, B. Lin, J. T. Kwok, and Y. Zhang (2024) RouterDC: query-based router by dual contrastive learning for assembling large language models. arXiv preprint arXiv:2409.19886.
J. J. Y. Chung, V. Padmakumar, M. Roemmele, Y. Sun, and M. Kreminski (2025) Modifying large language model post-training for diverse creative writing. arXiv preprint arXiv:2503.17126.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
D. Ding, A. Mallick, C. Wang, R. Sim, S. Mukherjee, V. Rühle, L. V. S. Lakshmanan, and A. H. Awadallah (2024) Hybrid LLM: cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618.
S. Feng, Y. Bai, Z. Yang, Y. Wang, Z. Tan, J. Yan, Z. Lei, W. Ding, W. Shi, H. Wang, Z. Qi, Y. Jiang, H. Wang, C. Huang, Y. Fei, J. Yao, Y. Du, L. Zettlemoyer, Y. Choi, and Y. Tsvetkov (2026a) MoCo: a one-stop shop for model collaboration research. arXiv:2601.21257.
S. Feng, W. Ding, A. Liu, Z. Wang, W. Shi, Y. Wang, Z. Shen, X. Han, H. Lang, C. Lee, T. Pfister, Y. Choi, and Y. Tsvetkov (2025a) When one LLM drools, multi-LLM collaboration rules. arXiv:2502.04506.
S. Feng, K. Panaganti, Y. Tsvetkov, and W. Yu (2026b) The single-multi evolution loop for self-improving model collaboration systems. arXiv:2602.05182.
S. Feng, T. Sorensen, Y. Liu, J. Fisher, C. Y. Park, Y. Choi, and Y. Tsvetkov (2024) Modular pluralism: pluralistic alignment via multi-LLM collaboration. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 4151–4171.
S. Feng, Z. Wang, P. Goyal, Y. Wang, W. Shi, H. Xia, H. Palangi, L. Zettlemoyer, Y. Tsvetkov, C. Lee, and T. Pfister (2025b) Heterogeneous swarms: jointly optimizing model roles and weights for multi-LLM systems. arXiv:2502.04510.
E. Fix (1985) Discriminatory analysis: nonparametric discrimination, consistency properties. Vol. 1, USAF school of Aviation Medicine. Cited by: 1st item.
J. Gottweis, W. Weng, A. Daryin, T. Tu, A. Palepu, P. Sirkovic, A. Myaskovsky, F. Weissenberger, K. Rong, R. Tanno, et al. (2025) Towards an ai co-scientist. arXiv preprint arXiv:2502.18864. Cited by: §1.
S. A. Hayati, M. Lee, D. Rajagopal, and D. Kang (2024) How far can we extract diverse perspectives from large language models?. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 5336–5366. External Links: Link, Document Cited by: §7.
A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2019) The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: §1.
O. Honovich, T. Scialom, O. Levy, and T. Schick (2022) Unnatural instructions: tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689. Cited by: §1.
D. Jiang, X. Ren, and B. Y. Lin (2023) LLM-blender: ensembling large language models with pairwise ranking and generative fusion. In Annual Meeting of the Association for Computational Linguistics, External Links: Link Cited by: §1, §7.
L. Jiang, Y. Chai, M. Li, M. Liu, R. Fok, N. Dziri, Y. Tsvetkov, M. Sap, A. Albalak, and Y. Choi (2025) Artificial hivemind: the open-ended homogeneity of language models (and beyond). External Links: Link Cited by: Appendix F, §2.2, Table 1.
G. Kambhatla, I. Stewart, and R. Mihalcea (2022) Surfacing racial stereotypes through identity portrayal. In Proceedings of the 2022 ACM conference on Fairness, Accountability, and Transparency, pp. 1604–1615. Cited by: §1, §7.
J. Lanchantin, A. Chen, S. Dhuliawala, P. Yu, J. Weston, S. Sukhbaatar, and I. Kulikov (2025) Diverse preference optimization. arXiv preprint arXiv:2501.18101. Cited by: §7.
I. Levy, B. Bogin, and J. Berant (2023) Diverse demonstrations improve in-context compositional generalization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 1401–1422. External Links: Link, Document Cited by: §7.
J. Lin and N. Tomlin. User simulators bridge rl with real-world interaction. Note: https://jessylin.com/2025/07/10/user-simulators-1/ Cited by: §1.
C. Y. Liu, L. Zeng, J. Liu, R. Yan, J. He, C. Wang, S. Yan, Y. Liu, and Y. Zhou (2024) Skywork-reward: bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451. Cited by: footnote 7.
K. Lu, H. Yuan, R. Lin, J. Lin, Z. Yuan, C. Zhou, and J. Zhou (2024a) Routing to the expert: efficient reward-guided ensemble of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 1964–1974. External Links: Link, Document Cited by: §1, §7.
Y. Lu, D. Wang, T. Li, D. Jiang, S. Khudanpur, M. Jiang, and D. Khashabi (2024b) Benchmarking language model creativity: a case study on code generation. arXiv preprint arXiv:2407.09007. Cited by: §1, §7.
C. Meister, T. Pimentel, G. Wiher, and R. Cotterell (2023) Locally typical sampling. Transactions of the Association for Computational Linguistics (TACL) 11. Cited by: Appendix I.
N. Meister, C. Guestrin, and T. Hashimoto (2024) Benchmarking distributional alignment of large language models. arXiv preprint arXiv:2411.05403. Cited by: §1, §7.
M. N. Nguyen, A. Baker, C. Neo, A. Roush, A. Kirsch, and R. Shwartz-Ziv (2024) Turning up the heat: min-p sampling for creative and coherent llm outputs. arXiv preprint arXiv:2407.01082. Cited by: §1.
A. Novikov, N. Vũ, M. Eisenberger, E. Dupont, P. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabian, et al. (2025) Alphaevolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131. Cited by: §1.
I. Ong, A. Almahairi, V. Wu, W. Chiang, T. Wu, J. Gonzalez, M. W. Kadous, and I. Stoica (2024) RouteLLM: learning to route llms with preference data. ArXiv abs/2406.18665. External Links: Link Cited by: 2nd item, §7.
V. Padmakumar and H. He (2023) Does writing with language models reduce content diversity?. arXiv preprint arXiv:2309.05196. Cited by: Appendix I, §1, §7.
I. Puri, M. Damani, I. Shenfeld, M. Ghassemi, J. Andreas, and Y. Kim (2026) Reaching beyond the mode: rl for distributional reasoning in language models. External Links: 2603.24844, Link Cited by: §7.
M. Roemmele, A. S. Gordon, and R. Swanson (2017) Evaluating story generation systems using automated linguistic analyses. In SIGKDD 2017 Workshop on Machine Learning for Creativity, pp. 13–17. Cited by: Appendix I.
S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, and T. Hashimoto (2023) Whose opinions do language models reflect?. In International Conference on Machine Learning, pp. 29971–30004. Cited by: §1, §7.
A. See, A. Pappu, R. Saxena, A. Yerukola, and C. D. Manning (2019) Do massively pretrained language models make better storytellers?. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, pp. 843–861. External Links: Link, Document Cited by: Appendix I.
T. Sorensen, B. Newman, J. Moore, C. Park, J. Fisher, N. Mireshghallah, L. Jiang, and Y. Choi (2025) Spectrum tuning: post-training for distributional coverage and in-context steerability. arXiv preprint arXiv:2510.06084. Cited by: Appendix I, §7.
G. Tevet and J. Berant (2021) Evaluating the evaluation of diversity in natural language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 326–346. External Links: Link, Document Cited by: Appendix I.
Q. Wang, S. Pan, T. Linzen, and E. Black (2025) Multilingual prompting for improving llm generation diversity. arXiv preprint arXiv:2505.15229. Cited by: §7.
S. Welleck, A. Bertsch, M. Finlayson, H. Schoelkopf, A. Xie, G. Neubig, I. Kulikov, and Z. Harchaoui (2024) From decoding to meta-generation: inference-time algorithms for large language models. arXiv preprint arXiv:2406.16838. Cited by: §7.
P. West and C. Potts (2025) Base models beat aligned models at randomness and creativity. arXiv preprint arXiv:2505.00047. Cited by: §7.
J. Wong, Y. Orlovskiy, M. Luo, S. A. Seshia, and J. E. Gonzalez (2024) Simplestrat: diversifying language model generation with stratification. arXiv preprint arXiv:2410.09038. Cited by: §7.
C. H. Wu, S. Goyal, and A. Raghunathan (2025) Mode-conditioning unlocks superior test-time scaling. arXiv preprint arXiv:2512.01127. Cited by: §1.
J. Wu, H. Li, X. Zhang, J. Guo, J. Luo, S. Liu, Y. Huang, R. Chu, S. Li, and Y. Yang (2026) X-coder: advancing competitive programming with fully synthetic tasks, solutions, and tests. arXiv preprint arXiv:2601.06953. Cited by: §1.
W. Xiao, H. Zhao, and L. Huang (2025) The role of diversity in in-context learning for large language models. arXiv preprint arXiv:2505.19426. Cited by: §7.
J. Yang, J. Wan, Y. Yao, Y. X. Wei Chu, E. Wang, and Y. Qi (2025) Inf-retriever-v1 (revision 5f469d7). Hugging Face. External Links: Link, Document Cited by: §4.1.
H. Zhang, T. Feng, and J. You (2025a) Router-r1: teaching llms multi-round routing and aggregation via reinforcement learning. ArXiv abs/2506.09033. External Links: Link Cited by: §1, 2nd item, §7.
J. Zhang, S. Yu, D. Chong, A. Sicilia, M. R. Tomz, C. D. Manning, and W. Shi (2025b) Verbalized sampling: how to mitigate mode collapse and unlock llm diversity. arXiv preprint arXiv:2510.01171. Cited by: §H.1, §1, §2.2, §7, footnote 2.
Y. Zhang, H. Diddee, S. Holm, H. Liu, X. Liu, V. Samuel, B. Wang, and D. Ippolito (2025c) NoveltyBench: evaluating language models for humanlike diversity. arXiv preprint arXiv:2504.05228. Cited by: Appendix C, Appendix I, Appendix I, 1st item, §2.2, Table 1, Table 1, §4.2, footnote 7.
Y. Zhang, A. Schwarzschild, N. Carlini, Z. Kolter, and D. Ippolito (2024) Forcing diffuse distributions out of language models. arXiv preprint arXiv:2404.10859. Cited by: Appendix I, §2.2, §2.2, Table 1.

Appendix A The distribution of most diverse models

Figure 6 shows how frequently each model is the best model when the threshold is set to $10\%$ .

Appendix B Prompts

Figure 7: Prompt, generate one, simple questions

Figure 8: Prompt, generate two, simple questions

Figure 9: Prompt, generate all, simple questions

Figure 10: Prompt, generate one, open-ended questions

Figure 11: Prompt, generate two, open-ended questions

Figure 12: Prompt, generate all, open-ended questions

Figure 13: Prompt_verbalized_all, simple questions

Figure 14: Prompt_verbalized_all, open-ended questions

Figure 15: System_vanilla, simple questions and open-ended questions

Figure 16: System_verbalized_all, simple questions and open-ended questions

Appendix C Diversity coverage calculation details on open-ended questions

Here we discuss how we evaluate the quality and diversity of answers to open-ended questions. We follow the exact procedure of Zhang et al. (2025c) to partition the answer set and calculate the quality scores. To determine semantic equivalence, we apply their equivalence classifier (the $\textsc{Eq}$ call in Algorithm 1 below) to pairs of generations and retain a subset with no mutually equivalent pairs (see Algorithm 1 below). This classifier is finetuned on $1100$ pairs of human-annotated generations conditioned on prompts sampled from NB-Curated and NB-WildChat. We then score the quality of each answer ${a^{\prime}}\in{A_{d}}$ following their process: the score is first produced by a reward model and then mapped to $\{1,\ldots,10\}$ .⁷ Their mapping is calibrated by aligning the distribution of reward model scores (from 2,400 MT-Bench generations) with GPT-4–judged quality scores, using thresholds to map reward values to the 1–10 scale.

⁷ We use the Skywork-Reward-Gemma-2-27B-v0.2 model (Liu et al., 2024) as the reward model and the equivalence classifier released by Zhang et al. (2025c) at https://huggingface.co/yimingzhang/deberta-v3-large-generation-similarity.

Algorithm 1 Extract semantically nonequivalent generations

1: Input: Sampled generations $A=\{a_{1},\dots,a_{B}\}$
2: Input: Equivalence classifier $\textsc{Eq}(\cdot,\cdot)$
3: Output: Set of semantically nonequivalent generations $A_{d}$
4: $A_{d}\leftarrow\emptyset$
5: for $i=1$ to $B$ do
6:     $is\_duplicate\leftarrow\textbf{False}$
7:     for each $a^{\prime}\in A_{d}$ do
8:         $s\leftarrow\textsc{Eq}(a_{i},a^{\prime})$ $\triangleright$ Similarity score
9:         if $s>\tau$ then
10:             $is\_duplicate\leftarrow\textbf{True}$
11:             break
12:         end if
13:     end for
14:     if not $is\_duplicate$ then
15:         $A_{d}\leftarrow A_{d}\cup\{a_{i}\}$
16:     end if
17: end for
18: return $A_{d}$
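As a rough illustration of Algorithm 1, the sketch below implements the greedy de-duplication loop in Python. The names `extract_nonequivalent`, `toy_eq`, and `coverage` are hypothetical and not taken from the paper. In particular, `toy_eq` is a toy word-overlap stand-in for the finetuned equivalence classifier, and `coverage` encodes only one plausible reading of diversity coverage for open-ended questions (summed quality of unique answers divided by the score of $N$ answers that all receive the maximum quality of 10), which is an assumption rather than the authors' exact formula.

```python
from typing import Callable, List, Sequence


def extract_nonequivalent(
    generations: Sequence[str],
    eq_score: Callable[[str, str], float],
    tau: float,
) -> List[str]:
    """Greedy de-duplication in the spirit of Algorithm 1.

    A generation is kept only if its similarity to every previously
    kept generation is at most `tau`.
    """
    kept: List[str] = []
    for candidate in generations:
        # any() short-circuits, mirroring the `break` in the inner loop.
        is_duplicate = any(eq_score(candidate, prev) > tau for prev in kept)
        if not is_duplicate:
            kept.append(candidate)
    return kept


def coverage(
    unique_quality: Sequence[float], n_target: int, max_score: float = 10.0
) -> float:
    """ASSUMPTION: coverage = summed quality of unique answers divided by
    the score of an ideal set of `n_target` answers rated `max_score`."""
    return sum(unique_quality) / (n_target * max_score)


def toy_eq(a: str, b: str) -> float:
    """Toy stand-in for the classifier: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


samples = ["a red apple", "A red apple", "a green pear", "blue sky today"]
unique = extract_nonequivalent(samples, toy_eq, tau=0.5)
print(unique)
```

With the toy scorer, the second sample is dropped as a duplicate of the first, while the remaining three are kept. In practice `eq_score` would wrap the released classifier, and the threshold `tau` would follow the original procedure.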

Appendix D Decoding settings

We set the target number ( $N$ ) of answers to $50$ unless otherwise stated. The temperature and top-$p$ are fixed to $1.0$ and $1.0$ respectively. The maximum number of tokens is set to $4096$ . We use 2 H200 GPUs for all models. The batch size is $64$ . We repeat the sampling process until $N$ answers are collected. The inference time varies by model size and family. We disable the thinking mode for Qwen models.

Appendix E Router implementation details

We use the Adam optimizer with a learning rate of $10^{-3}$ . For the BERT classifier, we use the AdamW optimizer with a learning rate of $2\times 10^{-5}$ . During training, we perform a grid search over label type ({soft, one-hot}), weight decay, and hidden dimension. Routers are selected based on the best scores on the validation set. We experiment with soft labels and one-hot labels as training signals. The soft labels are obtained by normalizing each model's diversity coverage score against that of the most diverse model for the query. One-hot labels are given by $\mathbbm{1}[m_{i}=m_{j}^{*}]$ . We find that soft labels work best with the M-way MLP classifier, while one-hot labels work best with the Binary MLP classifier.
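The two label types can be sketched as follows. This is a minimal illustration under the assumption that "normalizing against the most diverse model" means dividing every model's coverage score by the maximum score for that query; the function names `soft_labels` and `one_hot_labels` are hypothetical.

```python
from typing import List, Sequence


def soft_labels(cov_scores: Sequence[float]) -> List[float]:
    """ASSUMPTION: divide each model's coverage by the best model's
    coverage for the same query, so the best model gets label 1.0."""
    best = max(cov_scores)
    if best <= 0:
        # Degenerate query where no model scores anything.
        return [0.0 for _ in cov_scores]
    return [s / best for s in cov_scores]


def one_hot_labels(cov_scores: Sequence[float]) -> List[int]:
    """Indicator of the best model; ties resolve to the first index."""
    best_idx = max(range(len(cov_scores)), key=lambda i: cov_scores[i])
    return [1 if i == best_idx else 0 for i in range(len(cov_scores))]


# Coverage scores of three hypothetical models on one query.
scores = [0.10, 0.25, 0.20]
soft = soft_labels(scores)
hard = one_hot_labels(scores)
print(soft, hard)
```

Soft labels keep information about near-best models, which may explain why they suit an M-way classifier, whereas one-hot labels give a sharper target for per-model binary decisions.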

Appendix F Scaling router training data

We experiment with training the router on $500$ and $1$ K samples from NB-WildChat, and on $500$ , $1$ K, and $2$ K samples from Infinity-Chat (Jiang et al., 2025). The results are shown in Table 5. Increasing NB-WildChat training data from $500$ to $1$ K improves diversity coverage on the NB-WildChat test set, though the gain does not transfer to Infinity-Chat. In contrast, scaling Infinity-Chat data from $500$ to $2$ K steadily improves performance on both the Infinity-Chat test set and the NB-WildChat test set, indicating stronger generalization. Finally, jointly training on a combination of NB-WildChat and Infinity-Chat further improves performance, slightly surpassing the best router trained on $1$ K NB-WildChat data in Table 3 ( $26.4\%$ vs. $26.3\%$ ).

Method	Training Data	Size	Eval: NB-WildChat	Eval: Infinity-Chat
Random			18.13%	18.24%
Top Overall			23.83%	23.13%
Oracle			33.04%	30.50%
Router	NB-WildChat	500	25.28% $\scriptstyle\pm 0.28\%$	22.58% $\scriptstyle\pm 0.30\%$
Router	NB-WildChat	1K	26.27% $\scriptstyle\pm 0.13\%$	22.58% $\scriptstyle\pm 0.39\%$
Router	Infinity-Chat	500	23.98% $\scriptstyle\pm 0.67\%$	22.54% $\scriptstyle\pm 0.60\%$
Router	Infinity-Chat	1K	24.95% $\scriptstyle\pm 0.28\%$	23.54% $\scriptstyle\pm 0.12\%$
Router	Infinity-Chat	2K	25.13% $\scriptstyle\pm 0.36\%$	23.78% $\scriptstyle\pm 0.16\%$
Router	NB-WildChat and Infinity-Chat	1K and 1K	26.05% $\scriptstyle\pm 0.32\%$	23.36% $\scriptstyle\pm 0.23\%$
Router	NB-WildChat and Infinity-Chat	1K and 2K	26.40% $\scriptstyle\pm 0.21\%$	23.55% $\scriptstyle\pm 0.10\%$

Table 5: Router performance (diversity coverage) steadily improves with more training data. We report the average and variance of 5 training runs with different random seeds.

Appendix G Router Performance

Method	NB-WildChat					NB-Curated (OOD)
Method	Acc	#U	Q	UQ	Cov.	Acc	#U	Q	UQ	Cov.
Top Overall	19.5%	42.6	3.0	2.9	23.8%	3.4%	35.4	6.0	5.7	38.6%
Random M / Q	5.9%	27.8	3.7	3.6	18.1%	5.6%	27.8	7.2	7.0	37.5%
Frequency	12.0%	33.1	3.8	3.6	21.0%	9.0%	28.2	7.2	7.1	39.6%
Top M / Q (oracle)	100%	38.8	4.5	4.4	33.0%	100%	30.3	7.6	7.4	59.6%
$1$ NN	16.5%	34.3	3.7	3.6	23.1%	5.6%	28.2	7.3	7.1	39.7%
$5$ NN	17.5%	34.9	3.8	3.7	24.1%	12.4%	29.8	7.3	7.1	40.2%
M-way BERT	22.0%	40.3	3.3	3.2	24.4%	11.2%	35.0	6.3	6.2	40.3%
M-way MLP(agn)	24.0%	35.1	3.9	3.8	25.3%	12.4%	30.1	7.6	7.5	40.3%
M-way MLP(spec)	27.0%	39.3	3.5	3.4	25.9%	5.6%	34.6	6.3	6.1	40.2%
Binary MLP (agn)	23.9%	38.4	3.5	3.4	25.7%^∗∗	10.8%	32.8	7.1	7.0	40.7%^∗∗
Binary MLP (spec)	23.9%	38.1	3.6	3.5	26.3%^∗∗	13.3%	30.8	7.0	6.8	39.3%^ns

Table 6: A per-query router selecting over 18 models to maximize diversity coverage (Cov.). #U denotes the number of unique outputs, Q denotes average quality, and UQ denotes the quality of unique outputs. Accuracy (Acc) measures how frequently the router predicts the oracle model (ground truth target). Random M / Q denotes a random model per query, and Top M / Q denotes the top model per query.

Appendix H Discussion

H.1 Different Prompt Templates

Prompting methods affect generation diversity (Zhang et al., 2025b). We have shown that model ensembling is effective for answers generated by sequential prompting, where models are asked to generate as many distinct answers as possible in one generation and later answers depend on previous ones. Does it also work for other prompting methods? We extend the study in Section 3 to compare three different prompt types:⁸

⁸ Please refer to Appendix B for the exact prompts.

• Generate one: The model is prompted to produce one random answer for the given question.

• Generate two: The model is prompted to provide two possible and different answers for the given question.

• Generate all (our default setting): The model is prompted to list out all possible answers sequentially.

Method	Gen 1			Gen 2			Gen All
Method	#Unq	Quality	Cov.	#Unq	Quality	Cov.	#Unq	Quality	Cov.
Random	13.7	4.9	9.9%	18.3	4.5	13.2%	27.8	3.7	18.1%
Frequency	25.4	3.9	15.6%	25.2	4.1	17.1%	33.1	3.8	21.0%
Top Overall	31.8	3.2	18.5%	34.8	3.1	19.7%	42.6	3.0	23.8%
Oracle (G-1)	32.8	4.1	25.6%	32.1	3.7	22.3%	31.0	3.3	20.4%
Oracle (G-2)	24.1	4.3	19.5%	32.3	4.7	28.3%	31.6	3.3	21.0%
Oracle (G-All)	16.4	5.0	14.7%	23.1	4.6	18.8%	38.8	4.5	33.0%
Router (G-1)	33.1	3.2	19.1%	33.3	3.3	20.4%	25.8	2.8	14.4%
Router (G-2)	26.5	4.2	19.7%	30.5	3.9	21.6%	29.2	3.2	19.0%
Router (G-All)	18.3	4.7	14.8%	24.4	4.3	18.1%	37.5	3.7	26.2%

Table 7: Training the router under different prompting strategies (in domain and out-of-domain evaluation) on NB-WildChat. Router (X) is a router trained under prompt type X. Oracle (X) denotes that we always use ground truth labels of prompt X as predictions. Training a router improves diversity for all prompts, as all routers beat their Top Overall baselines. However, different prompt templates seem to elicit different levels of diversity in LLMs, as the oracle predictions don’t generalize across prompts.

Routing improves diversity for all prompts, yet a router trained on one prompt does not generalize to others.

We ablate the prompting strategies, retrain routers, and evaluate them on all types of prompts. The performance is presented in Table 7. We find that the generate all prompt yields the highest diversity coverage, as shown by the diversity scores (Cov.) of the random and oracle baselines. Training a router in-domain consistently improves diversity coverage, yet neither oracle labels nor routers generalize across prompts. Finally, larger gains are observed when routing under better prompts.

	Gen 1		Gen 2		Gen All
	Cov.	Len	Cov.	Len	Cov.	Len
Random	$16.7\%$	$49.3$	$22.4\%$	$37.0$	$37.5\%$	$17.1$
Top model	$33.5\%$	$47.7$	$35.4\%$	$38.8$	$47.0\%$	$22.6$
Top 2 models	$32.7\%$	$31.6$	$36.1\%$	$34.2$	$45.6\%$	$24.0$
Top model per query	$43.6\%$	$37.8$	$46.6\%$	$30.3$	$59.6\%$	$21.9$

Table 8: Oracle diversity coverage (Cov.) and answer lengths (Len) for different prompt templates on NB-Curated questions.

Tradeoff of the generate all prompt.

Although generate all is the best prompt for diversity coverage, it trades answer quality for diversity. Under the routing setting, Table 8 shows that answer length decreases from generate one to generate two to generate all, by up to about $65\%$ (from $49.3$ to $17.1$ ). In addition, as shown in Table 7, although the number of unique answers sampled increases, the average answer quality deteriorates from generate one to generate two to generate all. This is further supported by comparing average answer quality among different prompting methods in Figure 17, which shows that generate all has the lowest answer quality while generate one has the highest. These findings hold for models across different sizes and families. Interestingly, a closer look at the answer generation process in Figure 5 suggests that answers generated at later positions have worse quality under the sequential generate all prompt.

H.2 Discussions: Other configurations/hyperparameters that we can vary

Ratio	#Unq	Qual	Unq Qual	Cov.
Oracle model pair
0:50	42.50	4.16	3.95	35.40%
5:45	43.90	4.11	3.94	36.82%
10:40	45.00	4.13	3.99	37.56%
15:35	45.50	4.09	3.88	37.10%
20:30	46.30	3.93	3.80	36.46%
25:25	45.30	4.00	3.87	36.40%
Top 2 model pair
0:50	36.70	3.74	3.53	24.36%
5:45	38.10	3.73	3.46	25.58%
10:40	39.20	3.58	3.34	25.60%
15:35	40.50	3.40	3.20	25.34%
20:30	42.10	3.33	3.15	26.28%
25:25	43.60	3.26	3.12	26.98%
30:20	44.30	3.12	3.04	26.54%
35:15	45.60	3.06	3.02	27.48%
40:10	46.20	2.99	2.98	27.62%
45:5	46.70	2.91	2.90	27.10%
50:0	47.20	2.85	2.88	27.10%
avg	42.75	3.27	3.15	26.36%
Random model pair
0:50	35.93	3.24	3.23	20.86%
5:45	36.56	3.26	3.24	21.51%
10:40	37.19	3.23	3.21	21.72%
15:35	37.49	3.22	3.20	21.79%
20:30	37.73	3.23	3.20	21.86%
25:25	37.72	3.22	3.19	21.89%

Table 9: Exploring different ratios while varying models per question. Performance is reported on 10 questions sampled from NB-WildChat. Top 2 models are olmo-2-0425-1b and olmo-2-0325-32b.

Strategy	#Unq	Qual	Unq Qual	Cov.
Oracle ratio	44.80	3.66	3.55	32.58%
Overall best (40:10)	46.20	2.99	2.98	27.62%
Half/half	43.60	3.26	3.12	26.98%

Table 10: Always use the top 2 models (olmo-2-0425-1b and olmo-2-0325-32b), while varying ratios per question.

More flexible proportions of sampling per model

In the previous setting of routing to two models, we split the sampled answers equally (i.e., if two models are selected to generate 50 answers, each contributes 25 answers). Will a more flexible proportion lead to more diversity? Under the same setting of sampling 50 answers from two models, we experiment with a set of possible ratios 0.0:1.0, 0.1:0.9, 0.2:0.8, 0.3:0.7, 0.4:0.6, and 0.5:0.5 (original) to divide the budget between the two models. We conduct two experiments: (1) pick one ratio for all questions and vary the model choices; (2) fix the two models to ensemble (the top 2 by individual performance) and vary the ratio for each question. We present the results in Table 9 and Table 10 respectively. We find that for oracle, random, and top 2 model pairs, different global ratios make little difference in output diversity. If we fix the 2 models to ensemble and optimize the ratio for each question, the score improves over rigid half/half mixing ( $32.58\%$ vs $26.98\%$ ).
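The budget split between two models can be sketched as below. This is an illustrative sketch only: `split_budget` and `mix_answers` are hypothetical helper names, the samplers are stubs standing in for real model calls, and rounding the first model's share to the nearest integer is an assumption.

```python
from typing import Callable, List, Tuple


def split_budget(total: int, share_a: float) -> Tuple[int, int]:
    """Split `total` answers between two models.

    `share_a` is the fraction assigned to model A; model B receives
    the remainder, so the two counts always sum to `total`.
    """
    if not 0.0 <= share_a <= 1.0:
        raise ValueError("share_a must lie in [0, 1]")
    n_a = round(total * share_a)
    return n_a, total - n_a


def mix_answers(
    sample_a: Callable[[int], List[str]],
    sample_b: Callable[[int], List[str]],
    total: int,
    share_a: float,
) -> List[str]:
    """Draw answers from both models according to the chosen split."""
    n_a, n_b = split_budget(total, share_a)
    return sample_a(n_a) + sample_b(n_b)


# Stub samplers standing in for two LLMs (hypothetical).
def model_a(n: int) -> List[str]:
    return [f"A-{i}" for i in range(n)]


def model_b(n: int) -> List[str]:
    return [f"B-{i}" for i in range(n)]


mixed = mix_answers(model_a, model_b, total=50, share_a=0.2)
print(len(mixed), split_budget(50, 0.2))
```

A per-question ratio search would call `mix_answers` once per candidate ratio and keep the ratio with the highest diversity coverage.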

$N$	#Unq	Qual	Unq Qual	Cov.
1	42.50	4.16	3.95	35.40%
2	45.30	4.00	3.87	36.08%
3	44.00	4.04	3.86	35.72%
4	44.20	4.13	3.79	35.96%
5	43.80	4.13	3.79	35.74%
6	44.30	3.93	3.58	35.04%
7	43.40	4.08	3.76	35.12%
8	43.30	3.94	3.70	34.02%
9	42.90	4.02	3.73	34.00%
10	43.10	3.92	3.65	33.36%
11	42.40	3.87	3.66	31.82%
12	42.40	3.80	3.56	30.92%
13	40.40	3.83	3.65	29.56%
14	40.10	3.80	3.63	28.88%
15	38.80	3.70	3.58	27.38%
16	37.90	3.66	3.53	26.10%
17	37.30	3.73	3.54	25.66%
18	36.60	3.67	3.56	24.32%

Table 11: Oracle models: fix $N$ models to select for all questions and vary model choices per question.

$N$	#Unq	Qual	Unq Qual	Cov.
1	47.20	2.85	2.88	27.10%
2	43.60	3.26	3.12	26.98%
3	41.80	3.21	3.06	24.90%
4	41.90	3.31	3.07	25.40%
5	40.60	3.47	3.27	25.42%
6	40.70	3.53	3.34	26.60%
7	40.90	3.59	3.42	27.00%
8	40.00	3.65	3.54	27.26%
9	38.40	3.85	3.60	26.48%
10	38.90	3.84	3.64	27.20%
11	37.70	3.81	3.52	25.70%
12	38.50	3.62	3.30	25.00%
13	35.20	3.70	3.48	23.22%
14	37.20	3.57	3.35	23.96%
15	36.40	3.60	3.44	23.98%
16	35.80	3.67	3.48	23.82%
17	36.40	3.38	3.28	21.98%
18	34.90	3.57	3.52	22.50%

Table 12: Top $N$ models: fix $N$ models to select for all questions and vary model choices per question.

$N$	#Unq	Qual	Unq Qual	Cov.
1	35.93	3.24	3.23	20.86%
2	37.72	3.22	3.19	21.85%
3	38.63	3.23	3.17	22.38%
4	39.03	3.21	3.13	22.60%
5	39.11	3.22	3.13	22.89%
6	38.89	3.29	3.18	23.36%
7	38.53	3.34	3.22	23.58%
8	38.33	3.37	3.26	23.70%
9	38.35	3.44	3.32	24.08%
10	38.05	3.47	3.34	24.26%
11	37.82	3.46	3.32	23.66%
12	37.45	3.49	3.36	23.80%
13	37.44	3.53	3.39	23.70%
14	36.93	3.54	3.41	23.49%
15	36.46	3.58	3.44	23.49%
16	35.94	3.59	3.45	23.32%
17	36.66	3.63	3.51	23.92%
18	36.60	3.67	3.56	24.32%

Table 13: Random models: fix $N$ models to select for all questions and vary model choices per question.

Strategy	#Unq	Qual	Unq Qual	Cov.
Oracle $N$	44.10	3.93	3.73	34.24%
Best overall ( $N=8$ )	40.00	3.65	3.54	27.26%
Random $N$	39.23	3.53	3.35	25.25%

Table 14: Fix the model order to ensemble (ranking of individual performance) and vary $N$ per question.

Varying the number of models to ensemble

In previous experiments, we fix the number of activated models to be $1$ (routing to the best model per query) or $2$ (routing to the two best models per query). Will sampling answers from more models, while keeping the total number of answers unchanged, improve diversity? We answer this question with two experiments: (1) fix the number of models $N$ for all questions and vary the selected models per question; (2) fix the order in which models are selected (ranked by individual performance) and vary the number $N$ per question. We present the results in Table 11, Table 12, Table 13, and Table 14. We find that routing to a custom model per question remains the most promising approach (under oracle settings). Routing to two models can offer further gains, but ensembling more than two models does not improve output diversity.
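The contrast between a single global $N$ and a per-question $N$ (as in Table 14) can be illustrated with a small sketch. The coverage values below are made-up toy numbers, not results from the paper, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

# cov[q][n] = diversity coverage on question q when ensembling the top-n models.
CoverageTable = List[Dict[int, float]]


def best_global_n(cov: CoverageTable) -> Tuple[int, float]:
    """Choose one N for all questions by maximizing mean coverage."""
    candidates = sorted(cov[0])
    means = {n: sum(q[n] for q in cov) / len(cov) for n in candidates}
    best = max(candidates, key=lambda n: means[n])
    return best, means[best]


def oracle_n_per_question(cov: CoverageTable) -> Tuple[List[int], float]:
    """Choose the best N separately for each question (oracle setting)."""
    picks = [max(sorted(q), key=lambda n: q[n]) for q in cov]
    mean = sum(q[n] for q, n in zip(cov, picks)) / len(cov)
    return picks, mean


# Toy coverage table for three questions and N in {1, 2, 3}.
toy = [
    {1: 0.30, 2: 0.35, 3: 0.28},
    {1: 0.40, 2: 0.33, 3: 0.31},
    {1: 0.20, 2: 0.26, 3: 0.29},
]

global_n, global_mean = best_global_n(toy)
picks, oracle_mean = oracle_n_per_question(toy)
print(global_n, picks)
```

By construction the per-question oracle can never score below the best global choice, which is why it serves as an upper bound in this kind of analysis.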

Scaling the number of candidates and generations

In this paper, we study selecting models from a pool of 18 candidates. However, in a real-world setting, there are hundreds of models users can choose from. Therefore, future work can explore employing a larger pool of LLMs that better harness their complementary strengths of uncovering more diverse answers. Besides, the number of answers to open-ended questions is infinitely large, and future work is encouraged to explore sample sizes beyond $50$ .

Appendix I Extended related work

Measuring output diversity

Traditional metrics to measure lexical diversity and text style are based on token and POS n-gram statistics (Roemmele et al., 2017; See et al., 2019; Tevet and Berant, 2021; Meister et al., 2023) and embedding similarity between candidates (Padmakumar and He, 2023). Later works go beyond the distinctness of outputs and also measure the validity of each response. Zhang et al. (2024) propose to evaluate the diversity of LLMs by calculating the coverage of gold targets and the KL-divergence from the desired distribution. However, providing ground-truth distributions for open-ended questions is non-trivial. Closely related to our work, Zhang et al. (2025c) introduce the notion of user-perceived utility, which jointly models uniqueness and quality while accounting for user patience. In this framework, uniqueness is computed by partitioning sampled answers into non-equivalent groups, and answer quality is estimated using reward model scores. However, this metric penalizes later-generated responses, whereas our goal is to assess how well a set of answers covers the answer space regardless of generation order.

Similarly, Sorensen et al. (2025) evaluate how well a model covers an open-ended output space using validity and diversity metrics. However, their evaluation relies on expensive human annotations and is thus limited to a single model with four generations per prompt. In this work, we build on the framework of Zhang et al. (2025c) and propose diversity coverage, a metric that evaluates how well a set of generated answers covers the valid answer space across many generations and multiple LLMs without requiring additional human supervision.

Appendix J Verbalized Sampling

Similar to our baselines, Verbalized Sampling is a recent prompting technique that increases LLM output diversity. We decided not to include it in the main experiments since it performs similarly to (if not worse than) our generate all baseline. We include the evidence below in Table 15 and Table 16:

Prompt	Model	Cov. %
Prompt	Model	1	10	20	50	100	1000
prompt_vanilla	Llama 8B	$6.09$	$47.96$	$66.70$	$88.56$	$92.44$	$96.51$
prompt_verbalized_all		$6.09$	$47.96$	$66.69$	$89.83$	$93.81$	$97.76$
system_vanilla		$6.09$	$43.42$	$55.31$	$72.00$	$83.02$	$95.23$
system_verbalized_all		$6.09$	$44.11$	$60.26$	$79.16$	$89.30$	$94.98$
prompt_vanilla	GPT-4o	$6.09$	$48.20$	$67.18$	$89.80$	$92.80$	$94.10$
prompt_verbalized_all		$6.09$	$46.60$	$62.27$	$84.26$	$87.15$	$92.21$
system_vanilla		$6.09$	$43.83$	$54.77$	$68.12$	$76.33$	$91.21$
system_verbalized_all		$6.09$	$44.46$	$56.07$	$68.59$	$77.62$	$92.00$

Table 15: Comparison with verbalized sampling on Simple Questions, generating up to $1,10,20,\ldots,1000$ answers. Prompt_vanilla is the existing generate-all prompt. System_verbalized_all is the original prompt proposed in verbalized sampling.

Prompt	Model	Cov. %
Prompt	Model	1	5	10	20	50	100	200
prompt_vanilla	Llama 8B	$0.38$	$1.67$	$3.04$	$5.22$	$10.00$	$16.17$	$26.13$
prompt_verbalized_all		$0.37$	$1.63$	$2.98$	$4.99$	$9.24$	$14.46$	$23.76$
system_vanilla		$0.34$	$1.34$	$2.42$	$4.19$	$8.91$	$14.86$	$24.55$
system_verbalized_all		$0.37$	$1.57$	$2.82$	$4.54$	$9.05$	$15.31$	$25.53$
prompt_vanilla	GPT-4o	$0.42$	$1.84$	$3.51$	$6.30$	$11.01$	$16.47$	$24.78$
prompt_verbalized_all		$0.40$	$1.86$	$3.29$	$5.00$	$8.65$	$13.09$	$19.61$
system_vanilla		$0.41$	$1.57$	$2.69$	$4.08$	$7.30$	$11.05$	$16.94$
system_verbalized_all		$0.40$	$1.71$	$2.68$	$4.16$	$7.30$	$11.12$	$17.50$

Table 16: Comparison with verbalized sampling on NB-Curated, generating up to $1,5,10,\ldots,200$ responses. Prompt_vanilla is the existing generate-all prompt. System_verbalized_all is the original prompt proposed in verbalized sampling.

Appendix K Generating diverse outputs out of a single model

K.1 Experiment settings

Decoding settings

For each prompting strategy and desired number of answers $N$ , we repeatedly sample generations from the model until we collect $N$ answers. For Simple Questions, we use $N\in\{1,10,20,50,100,1000\}$ . For NB-Curated, we use $N\in\{1,5,10,20,50,100,200\}$ . We set the temperature to $1.0$ and top_p to $1.0$ . The max_len is set to 2048 by default. For the generate all setting in NB-Curated, we extend the max_len to 4096 because generations cannot be finished within 2048 tokens.
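The repeated-sampling loop can be sketched as follows. This is a minimal sketch under stated assumptions: `collect_answers` and `fake_generate` are hypothetical names, the `max_calls` safety cap and the truncation of any surplus answers to exactly $N$ are choices made for illustration, and the stub generator stands in for a real model call with temperature $1.0$ and top_p $1.0$.

```python
import itertools
from typing import Callable, List


def collect_answers(
    generate: Callable[[], List[str]],
    n_target: int,
    max_calls: int = 1000,
) -> List[str]:
    """Call `generate` repeatedly until `n_target` answers are collected.

    Each call may return one, two, or many answers depending on the
    prompt type. ASSUMPTION: surplus answers are truncated, and
    `max_calls` guards against a generator that never yields enough.
    """
    answers: List[str] = []
    calls = 0
    while len(answers) < n_target and calls < max_calls:
        answers.extend(generate())
        calls += 1
    return answers[:n_target]


# Stub generator returning three answers per call (hypothetical).
_counter = itertools.count()


def fake_generate() -> List[str]:
    return [f"answer-{next(_counter)}" for _ in range(3)]


answers = collect_answers(fake_generate, n_target=10)
print(len(answers))
```

The same loop covers all three prompt types: generate one returns a single answer per call, generate two returns a pair, and generate all returns a list.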

K.2 Results

Comparing different prompting strategies

Figure 18 shows how different prompting strategies affect the diversity of the combined answers. For all models on both datasets, sequential generation yields substantially more answer diversity than parallel methods. With the best prompt, models on Simple Questions saturate at a coverage rate above $90\%$. This suggests that for easy diversity questions, nearly all models have good knowledge of the full answer space. On NB-Curated, answer diversity keeps growing as more generations are sampled. This suggests substantial room for uncovering more unique, high-quality responses to open-ended queries.

How does model size affect diversity?

According to Figure 18, most models perform similarly on Simple Questions, except for the smallest model (Qwen 0.6B). On NB-Curated, medium-sized models (Llama 8B and Qwen 14B) consistently achieve higher overall diversity than extremely large or small ones. We hypothesize that these models best balance answer distinctness and quality, and therefore achieve the highest diversity performance. Figure 20 shows that answer uniqueness decreases as model size grows, while Figure 17 shows that answer quality increases with model size. Finally, we also observe that model rankings are largely unchanged regardless of the number of collected answers.
