No Single Best Model for Diversity:
Learning a Router for Sample Diversity
Abstract
When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality score assigned to the unique answers in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding that no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, for each prompt, there exists a model that significantly outperforms all other models at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-WildChat, our trained router outperforms the single best model baseline (26.3% vs 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as to different answer-generation prompting strategies. Our work lays a foundation for studying the generation of comprehensive answers when we have access to a suite of models.
1 Introduction
Various tasks require language models (LMs) to generate diverse high-quality responses. These range from creative writing (Padmakumar and He, 2023; Chung et al., 2025), dialogues (Lin and Tomlin, ), code (Wu et al., 2026), math (Wu et al., 2025), scientific discovery (Novikov et al., 2025; Gottweis et al., 2025), survey response simulation (Meister et al., 2024) to synthetic data generation (Honovich et al., 2022). Evaluation on these tasks thus should move beyond the quality of a single output to the diversity and quality of a set of outputs. However, existing metrics fall short in quantifying the coverage of open-ended answer space. In this work, we introduce diversity coverage, a metric that measures the total quality scores assigned to all unique answers in the predicted answer set, relative to the best possible answer set with the same number of answers.
Prior works focused on methods for improving the diversity of generations from a single LLM, such as changing the inference hyperparameters (Holtzman et al., 2019; Kambhatla et al., 2022; Santurkar et al., 2023; Nguyen et al., 2024) or the prompt (Lu et al., 2024b; Zhang et al., 2025b). In this paper, we ask whether diversity coverage can be further improved by taking advantage of the many available LMs, each of which behaves differently on the same prompt. We hypothesize that heterogeneous LLMs can be effectively ensembled in order to leverage their complementary strengths. Through a pilot study, we identify that no single LLM dominates in generating a diverse and high-quality output set, and that different LLMs excel at answering different open-ended questions (Section 3). If we could pick the optimal model for each example, a diversity coverage of 33.0% could be achieved on NB-WildChat (Table 2), revealing a large gap compared to using the overall best model in the ensemble (23.8% in Table 3).
However, determining the optimal model for a question can be challenging, as optimizing diversity involves analyzing the joint effect of all sampled answers. Although many previous works have proposed routing to select the best LLM (Jiang et al., 2023; Zhang et al., 2025a; Lu et al., 2024a), they rank candidates by comparing a single response against the ground truth. The problem becomes much harder when comparing answer sets, as it is expensive to sample and compare multiple outputs for each example. The difficulty increases further for open-ended questions, as no gold answer sets are available. Lastly, one might expect that simple heuristics based on model metadata (e.g., family or size) could help predict diversity. However, we show in Figure 1 that models of different families and sizes generate the most diverse outputs for disjoint sets of queries.
Motivated by this, we propose a simple approach to predict the best model to respond to any given query. We train a router that scores the diversity and quality of each candidate model in the routing pool and routes to the best LLM to generate the answers. We frame routing as a classification task and create a training dataset of queries and model scores. Our router outperforms the top overall baselines on NB-WildChat and also generalizes to NB-Curated. Using the same trained router, we further explore ensembling outputs from two models per query, which brings additional performance gains. We will release the code and dataset publicly. Our key contributions are:
1. We propose a new metric, diversity coverage, to jointly measure the diversity and quality of an answer set to open-ended questions with diverse output space.
2. Motivated by the finding that no single model excels at generating diverse outputs to all queries, we propose to train a router which predicts the best model for generation given an input query. We evaluate our method comprehensively on multiple datasets, demonstrating competitive performance on both in-domain (NB-WildChat) and out-of-domain (NB-Curated) settings.
3. We further investigate the effect of training data size, prompting strategies as well as inference efficiency of using a diversity router. More broadly, our findings highlight the potential of multi-LLM systems, where models with complementary strengths are combined to produce more diverse and high-quality solutions.
2 Task Formulation
Many queries admit multiple valid responses rather than a single correct answer. We summarize datasets containing such queries in Table 1. Depending on the number of possible answers, we categorize the datasets into two types:
• Fixed answer set. Each query $q$ has a finite ground-truth answer set $A^*_q$. A predicted answer is correct if it belongs to $A^*_q$.
• Open-ended answer set. Queries admit infinitely many valid answers, and listing them comprehensively is infeasible. The validity of a predicted answer can be evaluated based on its quality and assigned a scalar value.
| Dataset | # | Example question | Possible valid answers |
| Simple Questions (Zhang et al., 2024) | | Output a random country in North America. | United States / Canada / Mexico / … |
| NB-Curated (Zhang et al., 2025c) | | Tell me a funny joke. | Who is Adam and why he is optimizing my code? / What did a late tomato say to other tomatoes? I will ketchup, … |
| NB-WildChat (Zhang et al., 2025c) | k | Give me a very simple way to remember the formula for tangent. | Tangent = Opposite / Adjacent / Draw a unit circle … |
| Infinity-Chat (Jiang et al., 2025) | k | Write a story about America. | In the heartland of America there was a small town … / My decision to go to the United States … |
Note on NB-Curated: The original released dataset has 100 questions. We filter out 4 questions that ask for multiple answers (violating our hypothesis) and 4 questions that do not have multiple correct answers.
2.1 Task definition
Given a query $q$ and a generation budget $k$ (i.e., the number of answers produced by the model), the task is to derive an answer set $A$ that covers as many distinct and high-quality answers as possible. Our task definition assumes two functions:
• uniq($q$, $A$): Given a query $q$ and an answer set $A$, it outputs a subset of $A$ consisting of distinct answers only (i.e., no two answers in uniq($q$, $A$) are equivalent to each other). We follow prior work (Zhang et al., 2025c) to derive uniq($q$, $A$). We iterate over the answers in $A$ and greedily add an answer to uniq($q$, $A$) if it is not equivalent to any answer already added. The equivalence of two answers is determined by exact string match for queries with a fixed answer set and by an equivalence classifier for open-ended questions. The process is described in Appendix C.
• quality($q$, $a$): Given an individual answer $a$ for query $q$, it outputs a scalar value representing the quality of the answer $a$. This can be done either by comparing against ground-truth answer sets (factual queries) or by using a reward model (open-ended queries).
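The greedy deduplication described for uniq can be sketched as below. This is a minimal illustration rather than the paper's implementation: the function names and the toy answers are hypothetical, and the equivalence test is passed in as a callable so that either exact string match or a learned equivalence classifier could be plugged in.

```python
from typing import Callable, List


def uniq(answers: List[str], equivalent: Callable[[str, str], bool]) -> List[str]:
    """Greedily keep each answer that is not equivalent to any answer kept so far."""
    kept: List[str] = []
    for answer in answers:
        if not any(equivalent(answer, seen) for seen in kept):
            kept.append(answer)
    return kept


def exact_match(a: str, b: str) -> bool:
    # Equivalence test for fixed-answer-set queries: exact string match.
    return a == b


# Toy answer set (hypothetical) for "Output a random country in North America."
answers = ["Canada", "Mexico", "Canada", "United States", "Mexico"]
unique_answers = uniq(answers, exact_match)
print(unique_answers)
```

Because the procedure is greedy, the first occurrence of each equivalence class is the one that is kept, so the result can depend on answer order when the equivalence test is not transitive.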
2.2 New Evaluation Metric: Diversity Coverage
Given a predicted set of answers $A$ to query $q$, we introduce a new metric, diversity coverage (div-cov), as follows:
$$\text{div-cov}(q, A) = \frac{\sum_{a \in \text{uniq}(q, A)} \text{quality}(q, a)}{\text{max-uniq-sum}(|A|)}$$
We define max-uniq-sum($k$) as the maximum score that one can reach by generating an answer set of size $k$ where each answer in the set is distinct and achieves maximum quality.
For questions with a fixed answer set $A^*_q$, where an answer has quality 1 if it belongs to $A^*_q$ and 0 otherwise, and assuming $|A| = |A^*_q|$, this measures the proportion of unique ground-truth answers covered by the answer set, equivalent to the coverage rate metric proposed by Zhang et al. (2024).
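As a worked illustration of the metric, the sketch below sums quality over the unique answers and divides by the best achievable sum for an answer set of the same size. It assumes that the best possible set consists of as many distinct answers as were generated, each scoring a known maximum quality; the `max_quality` parameter, the toy scores, and the function name are hypothetical.

```python
from typing import Callable, Dict, List


def div_cov(
    answers: List[str],
    quality: Callable[[str], float],
    equivalent: Callable[[str, str], bool],
    max_quality: float,
) -> float:
    """Sum of quality over unique answers, divided by the best achievable sum."""
    unique: List[str] = []
    for answer in answers:
        if not any(equivalent(answer, seen) for seen in unique):
            unique.append(answer)
    # Assumption: the best answer set of the same size has len(answers)
    # distinct answers, each scoring max_quality.
    max_uniq_sum = len(answers) * max_quality
    return sum(quality(a) for a in unique) / max_uniq_sum


# Hypothetical reward-model scores on a 0-10 scale.
toy_scores: Dict[str, float] = {"joke A": 8.0, "joke B": 6.0, "joke C": 4.0}
answers = ["joke A", "joke B", "joke A", "joke C"]
score = div_cov(answers, toy_scores.__getitem__, lambda a, b: a == b, max_quality=10.0)
print(score)
```

In this toy case the duplicate "joke A" contributes nothing, so the numerator is 8 + 6 + 4 = 18 against a maximum of 4 × 10 = 40.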
Prior works measure diversity by the number of unique, valid outputs for questions with a fixed answer set (Zhang et al., 2024), or via pairwise embedding-based similarity (Zhang et al., 2025b; Jiang et al., 2025). The former does not work for questions with an open-ended answer space, and the latter does not account for the quality of the answers. Zhang et al. (2025c) propose a unified metric for quality and diversity over an ordered answer list, which penalizes answers generated later to account for user patience. This paper focuses on evaluating the quality and diversity of a set of answers, regardless of generation order, making diversity coverage better suited to our purpose.
3 A pilot study on ensembling models to maximize diversity coverage
In this section, we first study whether a strong LLM can dominate other models in diversity coverage for a range of questions (Section 3.1). We find that no single model dominates, motivating us to explore the upper bound of the gains if we use a pool of LLMs instead of a single LLM (Section 3.2). We compare several ensembling strategies under an oracle model selection setting. We find that picking the best LLM per question is the most promising, which leads us to develop a model router in later sections.
3.1 No single model is best at diversity coverage for all questions
Model sets
We study models from four open-source model families with different parameter counts: Llama (Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B, Llama-3.3-70B), Qwen (Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B, Qwen3-14B, Qwen2.5-72B), OLMo (OLMo-2-0425-1B, OLMo-2-1124-7B, OLMo-2-1124-13B, OLMo-2-0325-32B), and Gemma (gemma-3-1b, gemma-3-4b, gemma-3-12b, gemma-3-27b).
Settings
For each model and query, we sample answers with a prompt that instructs the model to enumerate as many valid answers as possible (see Appendix B for the full template). This prompt encourages models to explore the space of possible responses rather than produce a single canonical answer. (Prior work (Zhang et al., 2025b) has found that this method elicits a more diverse answer set than sampling multiple single answers. We further compare different prompt templates, e.g., generating one or two answers in a single generation, in Appendix H.1, finding that generating all answers in a single prompt elicits the most diverse answer set.) We keep the decoding method fixed throughout the paper (described in Appendix D). For each query, we define the “dominant” model by the following two criteria: (1) the model achieves the highest diversity coverage, and (2) its score exceeds that of the answer set generated by every other model by at least a fixed margin.
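The two dominance criteria can be checked per query as in the sketch below. It is illustrative only: the function name and scores are hypothetical, and the margin is treated as an absolute difference in diversity coverage, which is an assumption about how the threshold is applied.

```python
from typing import Dict, Optional


def dominant_model(scores: Dict[str, float], margin: float) -> Optional[str]:
    """Return the model that is best by at least `margin`, or None if no model dominates."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_name, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    # Criterion (1): highest coverage. Criterion (2): beats the runner-up,
    # and therefore every other model, by at least the margin.
    if best_score - runner_up_score >= margin:
        return best_name
    return None


# Hypothetical per-model diversity coverage on two queries.
clear_winner = dominant_model({"model_a": 0.40, "model_b": 0.30, "model_c": 0.10}, margin=0.05)
no_winner = dominant_model({"model_a": 0.31, "model_b": 0.30, "model_c": 0.10}, margin=0.05)
print(clear_winner, no_winner)
```

Comparing against the runner-up is sufficient because any other model scores no higher than the runner-up.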
Results
On Simple Questions, we find no dominant model for any query. For NB-WildChat, this remains true for 30% of queries, and more than 5 models are each dominant on at least 5% of queries, suggesting that optimizing the model choice per question can be fruitful. Figure 1 compares how frequently each model achieves the best diversity coverage on NB-WildChat.
3.2 Oracle experiment: how much does picking the best model(s) per query improve?
Motivated by the finding that different models generate diverse outputs for different queries, we ask: can we ensemble outputs from multiple models to achieve more diverse outputs? We assume an oracle setting, where we have access to the diversity coverage scores of all LLMs on all queries. We compare three strategies to ensemble models with a fixed generation budget $k$, which is held at the same number of answers per question unless otherwise stated:
| | SQ | Curated | WildChat |
| Top overall model | | | |
| Top two overall models | | | |
| Random model / query | | | |
| Top model / query | 97.9% | 59.6% | 33.0% |
• Top overall model. We select the single model with the best average diversity coverage per dataset, representing the best possible performance without ensembling. The selected top models are, respectively: Llama-3.1-8B, Qwen3-14B and OLMo-2-0425-1B.
• Top two overall models. We select the two models with the highest average diversity coverage per dataset, then ensemble their outputs, generating $k/2$ answers per model. The selected model pairs are, respectively: (Llama-3.1-8B, Llama-3.3-70B), (Qwen3-14B, Llama-3.1-8B), (OLMo-2-0425-1B, OLMo-2-1124-7B).
• Top model per query. For each query, we select the model with the highest diversity coverage. This represents the oracle performance of always choosing the best LLM for the given query. We also report the performance of randomly choosing a model per query (Random model per query) as a baseline.
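The difference between dataset-level and query-level selection can be made concrete with a small score table. The sketch below is a toy illustration with hypothetical model names and coverage values; it computes the average coverage under the Top overall model and Top model per query strategies.

```python
from statistics import mean
from typing import Dict, Tuple

# query -> model -> diversity coverage (hypothetical values)
Scores = Dict[str, Dict[str, float]]


def top_overall(scores: Scores) -> Tuple[str, float]:
    """Pick the one model with the best average coverage and report that average."""
    models = list(next(iter(scores.values())))
    best = max(models, key=lambda m: mean(per_model[m] for per_model in scores.values()))
    return best, mean(per_model[best] for per_model in scores.values())


def top_per_query(scores: Scores) -> float:
    """Oracle: pick the best model separately for every query."""
    return mean(max(per_model.values()) for per_model in scores.values())


scores: Scores = {
    "q1": {"model_a": 0.5, "model_b": 0.2},
    "q2": {"model_a": 0.1, "model_b": 0.4},
    "q3": {"model_a": 0.3, "model_b": 0.2},
}
best_model, overall_cov = top_overall(scores)
oracle_cov = top_per_query(scores)
print(best_model, overall_cov, oracle_cov)
```

Query-level selection can never score below the best single model, since for each query it takes the maximum over a set that includes that model.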
Table 2 shows that query-level model selection (Top model per query) is consistently the best strategy across all three datasets. The gap increases as questions become more open-ended (on NB-Curated and NB-WildChat). For Simple Questions, using the single best LLM (Top overall model) already recovers a large share of the ground-truth targets. Open-ended questions, however, are more challenging, and choosing the best model per query yields non-trivial gains. This is evidenced by the results on NB-Curated and NB-WildChat, where selecting the top model per query (59.6% and 33.0%, respectively) yields a clear relative improvement over the second-best baseline.
4 Learning to ensemble multiple models for diverse outputs
Oracle routing significantly improves diversity coverage, but it is costly as we need to sample and evaluate outputs from all candidate LLMs. This motivates us to train a router to predict the most promising model without sampling the entire answer sets from all models.
4.1 Router
Problem setting
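Given a query $q$ and a suite of models $\{m_1, \dots, m_M\}$, a router ranks them by their predicted $\text{div-cov}(q, A_i)$, where $A_i$ is the answer set generated by $m_i$ for query $q$ under some budget $k$. The oracle model index for $q$ is defined as $i^*_q = \arg\max_{i \in \{1, \dots, M\}} \text{div-cov}(q, A_i)$. These indices, one per query, constitute the router training data $\mathcal{D} = \{(q, i^*_q)\}$.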
Classification Objectives
We compare two classification formulations for the router:
• $M$-way classification: the router is a single classifier which predicts the oracle best model index $i^*_q$ for each query $q$. Let $p_\theta(i \mid q)$ denote the predicted probability of selecting model $m_i$; we train the router with the cross-entropy loss: $$\mathcal{L}_{\text{CE}} = -\sum_{(q, i^*_q) \in \mathcal{D}} \log p_\theta(i^*_q \mid q)$$
• Binary classification: For each LLM $m_i$, we derive a binary training dataset $\mathcal{D}_i = \{(q, y_{q,i})\}$ from $\mathcal{D}$, where $y_{q,i} = \mathbb{1}[i = i^*_q]$ indicates whether $m_i$ is the oracle best model for query $q$. We then train a binary classifier $f_i$ to predict this label using the binary cross-entropy loss. At inference time, the router evaluates all $M$ binary classifiers and selects the model with the highest predicted score: $\hat{i}_q = \arg\max_{i} f_i(q)$.
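The two formulations differ mainly in how the oracle index is turned into training targets. The sketch below shows, for a single query, the M-way cross-entropy on a vector of logits, the derivation of one-vs-rest binary labels, and the argmax rule used at inference. It uses plain Python and hypothetical numbers; the helper names are illustrative and no particular router architecture is implied.

```python
import math
from typing import List


def m_way_cross_entropy(logits: List[float], oracle_index: int) -> float:
    """Negative log-probability of the oracle model under a softmax over M logits."""
    top = max(logits)  # subtract the max for numerical stability
    log_normalizer = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_normalizer - logits[oracle_index]


def binary_labels(oracle_index: int, num_models: int) -> List[int]:
    """One label per model: 1 for the oracle best model, 0 otherwise."""
    return [1 if i == oracle_index else 0 for i in range(num_models)]


def binary_cross_entropy(prob: float, label: int) -> float:
    """Loss of one binary classifier on one query."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def route(scores: List[float]) -> int:
    """Inference for either formulation: pick the highest-scoring model."""
    return max(range(len(scores)), key=lambda i: scores[i])


uniform_loss = m_way_cross_entropy([0.0, 0.0, 0.0], oracle_index=1)
labels = binary_labels(oracle_index=2, num_models=4)
chosen = route([0.1, 0.7, 0.2])
print(uniform_loss, labels, chosen)
```

With uniform logits over three models the M-way loss equals log 3, which is a convenient sanity check for an untrained router.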
Query encoding
We experiment with two input featurizations: (1) infly/inf-retriever-v1 (Yang et al., 2025), a retriever fine-tuned from Qwen-2-7B for information retrieval tasks. We refer to it as model-agnostic encodings (agn). (2) Model hidden states: we encode the query using each model and extract the representation from the final layer’s last hidden state. We hypothesize that this representation encodes rich information on how the model decodes its outputs. We refer to it as model-specific encodings (spec).
4.2 Experiment settings
Training and evaluation data
We split the NB-WildChat prompts from Zhang et al. (2025c) into train, validation and test sets containing 70%, 10% and 20% of the data, respectively. We conduct out-of-domain evaluation on NB-Curated questions.
Evaluation metrics
Diversity coverage jointly measures the diversity and quality of the generated answer set. To disentangle the two effects, we additionally report metrics that measure each aspect. Quality (Qual) measures the average quality score across all sampled answers: $\text{Qual} = \frac{1}{|A|} \sum_{a \in A} \text{quality}(q, a)$. Uniqueness (Unq) measures the number of semantically non-equivalent answers: $\text{Unq} = |\text{uniq}(q, A)|$. Unique Quality (Unq Qual) measures the average quality score over unique answers only: $\text{Unq Qual} = \frac{1}{|\text{uniq}(q, A)|} \sum_{a \in \text{uniq}(q, A)} \text{quality}(q, a)$. Together, these metrics reveal whether improvements in diversity coverage arise from generating more distinct answers, improving answer quality, or both.
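The three auxiliary metrics can be computed from the same ingredients as diversity coverage. The following sketch assumes exact-match equivalence and hypothetical quality scores, and simply reports the average quality over all answers, the count of unique answers, and the average quality over unique answers.

```python
from typing import Callable, Dict, List


def answer_set_metrics(
    answers: List[str],
    quality: Callable[[str], float],
    equivalent: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Compute Qual, Unq and Unq Qual for one answer set."""
    unique: List[str] = []
    for answer in answers:
        if not any(equivalent(answer, seen) for seen in unique):
            unique.append(answer)
    return {
        "qual": sum(quality(a) for a in answers) / len(answers),
        "unq": len(unique),
        "unq_qual": sum(quality(a) for a in unique) / len(unique),
    }


# Hypothetical quality scores; "a" is generated twice.
toy_quality = {"a": 8.0, "b": 6.0, "c": 4.0}
metrics = answer_set_metrics(["a", "b", "a", "c"], toy_quality.__getitem__, lambda x, y: x == y)
print(metrics)
```

In the toy example Qual (6.5) exceeds Unq Qual (6.0) because the duplicated answer happens to be the highest-scoring one, which shows why the two averages are reported separately.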
Baselines
We consider several non-routing baselines. For a fair comparison with our trained routers, we restrict these methods to access only training-set labels and evaluate them on the test set. We implement baselines from Section 3.2: Top overall, Top two overall, Random model per query. We also include a Frequency baseline, where models are sampled proportional to their frequency of reaching highest diversity coverage. We additionally compute Top model per query and Top two models per query as oracle performance on diversity coverage, using ground-truth labels on the test set. Specifically, Top model per query is implemented by selecting the best model per query. Top two models per query are the best pair over all model combinations. If two models are selected, we take half from each model.
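The Frequency baseline can be sketched as follows: estimate, from training labels only, how often each model is the oracle best, then sample a model for each test query in proportion to those frequencies. The label list, model names, and seed below are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def fit_frequency_baseline(train_labels: List[str]) -> Dict[str, float]:
    """Fraction of training queries on which each model reached the highest coverage."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    return {model: count / total for model, count in counts.items()}


def sample_model(frequencies: Dict[str, float], rng: random.Random) -> str:
    """Draw one model with probability proportional to its training frequency."""
    models = list(frequencies)
    weights = [frequencies[m] for m in models]
    return rng.choices(models, weights=weights, k=1)[0]


# Hypothetical oracle labels for four training queries.
frequencies = fit_frequency_baseline(["olmo", "qwen", "olmo", "llama"])
rng = random.Random(0)  # fixed seed so the sketch is reproducible
picked = [sample_model(frequencies, rng) for _ in range(5)]
print(frequencies, picked)
```

Unlike a router, this baseline ignores the query text entirely, so any gain over Random model per query comes only from the label prior.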
Router Models
We implement three types of router models, and describe their implementation details in Appendix E.
• KNN (Fix, 1985). This is a simple, non-parametric classifier, where predictions are obtained from the $N$ nearest neighbours in the training data, with $N \in \{1, 5\}$.
• BERT (Devlin et al., 2019). Following other routing literature (Ong et al., 2024; Zhang et al., 2025a), we fine-tune BERT with a classification head which makes a selection over the $M$ models following the $M$-way classification objective. (We did not experiment with BERT models for binary classification, given that fine-tuning BERT is computationally more expensive than fine-tuning the 2-layer MLP router.)
• MLP. We report results for training $M$ binary MLP classifiers and for training one MLP classifier for $M$-way classification. We report results for (1) using inf-retriever to encode the query for all classifiers: Binary MLP (agn) and M-way MLP (agn), and (2) using the last-layer hidden states of candidate model $m_i$ to encode the query for the respective classifier: Binary MLP (spec) and M-way MLP (spec).
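As an illustration of the simplest router above, the sketch below implements nearest-neighbour routing over query embeddings with a majority vote. It is a toy version under stated assumptions: cosine similarity is assumed as the distance measure, the 2-D vectors stand in for real query encodings, and the model labels are hypothetical; the actual implementation details are in Appendix E.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_route(
    query_vec: Sequence[float],
    train_vecs: List[Sequence[float]],
    train_labels: List[str],
    n_neighbors: int,
) -> str:
    """Route to the model most often labelled best among the nearest training queries."""
    ranked = sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query_vec, train_vecs[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings and oracle labels (hypothetical).
train_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.8, 0.2]]
train_labels = ["qwen", "qwen", "olmo", "olmo", "llama"]
one_nn = knn_route([1.0, 0.05], train_vecs, train_labels, n_neighbors=1)
three_nn = knn_route([1.0, 0.05], train_vecs, train_labels, n_neighbors=3)
other = knn_route([0.0, 1.0], train_vecs, train_labels, n_neighbors=1)
print(one_nn, three_nn, other)
```

Ties in the vote are broken by whichever label was counted first, which here means the label of the closer neighbour.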
| Method | NB-WildChat: #Unq | Qual | Unq Qual | Cov. | NB-Curated (OOD): #Unq | Qual | Unq Qual | Cov. |
| Top overall | 42.6 | 3.0 | 2.9 | 23.8% | 35.4 | 6.0 | 5.7 | 38.6% |
| Frequency | 33.1 | 3.8 | 3.6 | 21.0% | 28.2 | 7.2 | 7.1 | 39.6% |
| Random model per query | 27.8 | 3.7 | 3.6 | 18.1% | 27.8 | 7.2 | 7.0 | 37.5% |
| Top model per query (oracle) | 38.8 | 4.5 | 4.4 | 33.0% | 30.3 | 7.6 | 7.4 | 59.6% |
| KNN (N=1) | 34.3 | 3.7 | 3.6 | 23.1% | 28.2 | 7.3 | 7.1 | 39.7% |
| KNN (N=5) | 34.9 | 3.8 | 3.7 | 24.1% | 29.8 | 7.3 | 7.1 | 40.2% |
| M-way BERT | 40.3 | 3.3 | 3.2 | 24.4% | 35.0 | 6.3 | 6.2 | 40.3% |
| M-way MLP (agn) | 35.1 | 3.9 | 3.8 | 25.3% | 30.1 | 7.6 | 7.5 | 40.3% |
| M-way MLP (spec) | 39.3 | 3.5 | 3.4 | 25.9% | 34.6 | 6.3 | 6.1 | 40.2% |
| Binary MLP (agn) | 38.4 | 3.5 | 3.4 | 25.7%∗∗ | 32.8 | 7.1 | 7.0 | 40.7%∗∗ |
| Binary MLP (spec) | 38.1 | 3.6 | 3.5 | 26.3%∗∗ | 30.8 | 7.0 | 6.8 | 39.3%ns |
5 Results
5.1 Performance Evaluation
We report performance in Table 3. (We also report accuracy, i.e., how frequently the router predicts the ground-truth best model, in Table 6 in the Appendix.) Top overall is the best-performing non-routing baseline for in-domain evaluation on NB-WildChat. This indicates that the LLM chosen from training labels maintains strong diversity coverage on the test set. The Frequency baseline generalizes better to out-of-domain NB-Curated questions. KNN routers yield only marginal improvement. MLP-based routers outperform the other baselines. Specifically, binary routers with model-specific query encodings bring the greatest gains (26.3%), surpassing the Top overall baseline (23.8%). For MLP classifiers, model-specific query encodings (spec) provide more useful information than model-agnostic encodings (agn), but show worse generalization.
| Method | NB-WildChat: #Unq | Qual | Unq Qual | Cov. | NB-Curated (OOD): #Unq | Qual | Unq Qual | Cov. |
| Top 2 overall | 39.1 | 3.4 | 3.2 | 23.8% | 31.6 | 6.7 | 6.3 | 38.3% |
| Top 2 per query | 40.7 | 4.5 | 4.5 | 35.8% | 41.3 | 7.7 | 7.6 | 62.6% |
| Router | 38.4 | 3.8 | 3.6 | 26.7%∗∗ | 32.3 | 7.1 | 6.8 | 42.2%∗∗ |
A router trained to select a single model can be used to ensemble outputs from two models, which provides further gains.
We observe consistent gains when using our trained router (Binary MLP (spec)) to select two models, as presented in Table 4, both in-domain (26.7%, up from 26.3% with a single routed model in Table 3) and out-of-domain (42.2%, up from 39.3%). Moving from Top overall to Top 2 overall, the best diversity coverage of the non-routing baseline does not improve, whereas the oracle (Top per query) improves steadily from one to two models. Our best routers are significantly better than the Top overall and Top two overall baselines across 5 checkpoints trained under different random seeds. We further discuss how the number of selected models affects answer diversity in Section H.2.
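One plausible way to reuse a single-model router for two-model ensembling is sketched below: take the two highest-scoring models and fill the budget with half of the answers from each. This is an assumption about the mechanics rather than a description of the released code; the score values, model names, and the choice to keep the first answers of each list are hypothetical.

```python
from typing import Dict, List, Tuple


def select_top_two(router_scores: Dict[str, float]) -> Tuple[str, str]:
    """Return the two models with the highest router scores, best first."""
    ranked = sorted(router_scores, key=lambda name: router_scores[name], reverse=True)
    return ranked[0], ranked[1]


def mix_answer_sets(first: List[str], second: List[str], budget: int) -> List[str]:
    """Fill the budget with half of the answers from each selected model."""
    half = budget // 2
    # Assumption: keep the earliest answers from each model's list.
    return first[:half] + second[: budget - half]


# Hypothetical per-model scores from the binary classifiers for one query.
router_scores = {"olmo-1b": 0.62, "qwen-14b": 0.71, "llama-8b": 0.40}
pair = select_top_two(router_scores)
mixed = mix_answer_sets(["q1", "q2", "q3", "q4"], ["o1", "o2", "o3", "o4"], budget=4)
print(pair, mixed)
```

Because the total budget is unchanged, any gain from the second model has to come from it contributing answers that the first model would not have produced within its remaining half.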
Scaling training data size consistently produces a better router.
Would training on a larger dataset improve performance? On Infinity-Chat, we show in Figure 2 that router performance increases steadily as the training set grows from 500 to 1k to 2k examples. We further find in Appendix F that router performance also scales with training data on NB-WildChat and that training can yield generalization across the two datasets.
5.2 Efficiency Evaluation
In Figure 3, we show the inference-time efficiency of various methods for generating an answer set on NB-WildChat. We use 2 H200 GPUs for answer sampling and 1 H200 GPU for diversity coverage calculation. (We assume no parallelization in sampling generations. If two models are selected, sampling is performed sequentially, i.e., model by model.) We compare the latency of three methods: Top (Top overall), Router, which is the Binary MLP classifier (spec), and Oracle (Top model per query).
Inference with our router is slower than the Top baseline. The routing itself is not very costly, yet sampling becomes more expensive as the router often directs queries to a bigger LLM than the top baseline model (OLMo-2-0425-1B).
The Oracle setting, while showing the strongest performance, is also much more expensive than our router. This is because its routing involves brute-force computation of diversity coverage for all models to find the best candidate per query. In contrast, our router introduces only a fixed overhead that does not scale with the number of selected models.
6 Discussions: Different Prompt Templates
| Method | Prompt template: Gen 1 | Gen 2 | Gen All |
| Top overall | 18.5% | 19.7% | 23.8% |
| Random | 9.9% | 13.2% | 18.1% |
| Frequency | 15.6% | 17.1% | 21.0% |
| Oracle (G-1) | 25.6% | 22.3% | 20.4% |
| Oracle (G-2) | 19.5% | 28.3% | 21.0% |
| Oracle (G-All) | 14.7% | 18.8% | 33.0% |
| Router (G-1) | 19.1% | 20.4% | 14.4% |
| Router (G-2) | 19.7% | 21.6% | 19.0% |
| Router (G-All) | 14.8% | 18.1% | 26.2% |
A large body of work on diversity has focused on improving the prompt, while throughout this paper we used a fixed prompt template to sample answers and compute diversity coverage. In this last section, we explore two alternative prompt templates alongside our default, with the exact prompts provided in Appendix B:
• Generate one (G-1): to produce one random answer for the given question.
• Generate two (G-2): to provide two different answers for the given question.
• Generate all (G-All): to list all possible answers sequentially. This is our default.
Table 7 summarizes the results. In the first block, we use the same baselines as in Table 3. Comparing across the three prompts, we find that our default prompt (G-All) achieves the highest performance overall, as shown by the diversity coverage scores (Cov.) in the first block.
In the second and third blocks, we report the oracle (Top model per query) and router results. We use our best router (Binary MLP (spec)) for the experiment. Router (X) is a router trained under prompt type X. Oracle (X) denotes that we always use, as predictions, the ground-truth labels derived by sampling with prompt X. Training a router improves diversity for all prompts, as every router beats the Top overall baseline under its own prompt. However, we see little generalization across prompts for both the oracle and the trained router. For instance, when generating with the G-1 prompt, the oracle model chosen for the G-All prompt performs worse than the Top overall baseline under the G-1 prompt. Moreover, larger gains are observed when routing under better prompts. A more detailed comparison is given in Appendix H.1.
Degrading Answer Quality While Listing Multiple Answers
Should we always use the G-All prompt? Figure 5 plots average answer quality under two prompt strategies (G-1 and G-All). For the generate-all prompt, we plot the quality of answers at different positions within the same generation. For the generate-one prompt, which produces one answer per generation, the quality is plotted as a single dashed line. We find two trends: (1) the G-1 prompt consistently generates answers with higher average quality than the G-All prompt, and (2) under the G-All prompt, as the generation continues, the answer quality decreases and the variance of the quality scores increases. Therefore, when individual answer quality is more important, the G-1 prompt, although it makes diverse answers harder to elicit, can be more appropriate.
7 Related Work
Improving output diversity
Concerns about the output diversity of LLMs (Padmakumar and He, 2023; Anderson et al., 2024; West and Potts, 2025) have prompted two categories of solutions: methods that modify model weights (Lanchantin et al., 2025; Chung et al., 2025; Sorensen et al., 2025; Puri et al., 2026) and inference methods (Welleck et al., 2024; Levy et al., 2023; Meister et al., 2024; Xiao et al., 2025; Kambhatla et al., 2022; Santurkar et al., 2023; Hayati et al., 2024; Wang et al., 2025). A line of work proposes advanced prompting strategies, such as denial prompting (Lu et al., 2024b), probabilistic prompting (Wong et al., 2024), and verbalized sampling (Zhang et al., 2025b). All of these methods focus on improving the diversity of a single model, whereas we study a multi-LLM setting.
Routers for LLMs
Researchers find that combining multiple models is often better than relying on one (Jiang et al., 2023; Feng et al., 2024; 2025a; 2025b; 2026a; 2026b). Building on this insight, many works train a router that selects among multiple LLMs to achieve better task performance (Jiang et al., 2023; Zhang et al., 2025a; Lu et al., 2024a) or efficiency (Chen et al., 2024; Ding et al., 2024; Ong et al., 2024; Zhang et al., 2025a). Simple methods (Ding et al., 2024; Ong et al., 2024) demonstrate the effectiveness of routing by switching between a stronger and a weaker model, which balances cost and quality. Other works (Jiang et al., 2023; Lu et al., 2024a) train routers over many top-performing LLMs to leverage their complementary expertise. All of these methods aim to improve end performance as measured on a single generation per question, and none of them discusses how routing can benefit the diversity and quality of a set of answers. To the best of our knowledge, we are the first to propose a router that promotes diversity coverage by harnessing the complementary strengths of heterogeneous models.
8 Conclusion
In this paper, we study mixing outputs from multiple LLMs as a strategy to improve response diversity. We first formalize diversity as the coverage of high-quality responses and propose unified evaluation metrics that apply to both finite and open-ended answer spaces. To optimize these metrics, we introduce a router that dynamically selects the most suitable LLM(s) for each query, showing improved performance. Further scaling the training data consistently improves the router. We make a few simplifying assumptions: (1) when two models are selected, their outputs are mixed in equal proportions; and (2) only one or two models are used per query. Future research is encouraged to relax these limitations and explore efficiency-aware routing.
Acknowledgments
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise. The work is partially funded by NSF CAREER award 2443271.
References
- Homogenization effects of large language models on human creative ideation. In Proceedings of the 16th conference on creativity & cognition, pp. 413–425. Cited by: §7.
- RouterDC: query-based router by dual contrastive learning for assembling large language models. ArXiv abs/2409.19886. External Links: Link Cited by: §7.
- Modifying large language model post-training for diverse creative writing. arXiv preprint arXiv:2503.17126. Cited by: §1, §7.
- Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186. Cited by: 2nd item.
- Hybrid llm: cost-efficient and quality-aware query routing. ArXiv abs/2404.14618. External Links: Link Cited by: §7.
- MoCo: a one-stop shop for model collaboration research. External Links: 2601.21257, Link Cited by: §7.
- When one llm drools, multi-llm collaboration rules. External Links: 2502.04506, Link Cited by: §7.
- The single-multi evolution loop for self-improving model collaboration systems. External Links: 2602.05182, Link Cited by: §7.
- Modular pluralism: pluralistic alignment via multi-LLM collaboration. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 4151–4171. External Links: Link, Document Cited by: §7.
- Heterogeneous swarms: jointly optimizing model roles and weights for multi-llm systems. External Links: 2502.04510, Link Cited by: §7.
- Discriminatory analysis: nonparametric discrimination, consistency properties. Vol. 1, USAF school of Aviation Medicine. Cited by: 1st item.
- Towards an ai co-scientist. arXiv preprint arXiv:2502.18864. Cited by: §1.
- How far can we extract diverse perspectives from large language models?. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 5336–5366. External Links: Link, Document Cited by: §7.
- The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751. Cited by: §1.
- Unnatural instructions: tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689. Cited by: §1.
- LLM-blender: ensembling large language models with pairwise ranking and generative fusion. In Annual Meeting of the Association for Computational Linguistics, External Links: Link Cited by: §1, §7.
- Artificial hivemind: the open-ended homogeneity of language models (and beyond). External Links: Link Cited by: Appendix F, §2.2, Table 1.
- Surfacing racial stereotypes through identity portrayal. In Proceedings of the 2022 ACM conference on Fairness, Accountability, and Transparency, pp. 1604–1615. Cited by: §1, §7.
- Diverse preference optimization. arXiv preprint arXiv:2501.18101. Cited by: §7.
- Diverse demonstrations improve in-context compositional generalization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), A. Rogers, J. Boyd-Graber, and N. Okazaki (Eds.), Toronto, Canada, pp. 1401–1422. External Links: Link, Document Cited by: §7.
- User simulators bridge rl with real-world interaction. Note: https://jessylin.com/2025/07/10/user-simulators-1/ Cited by: §1.
- Skywork-reward: bag of tricks for reward modeling in llms. arXiv preprint arXiv:2410.18451. Cited by: footnote 7.
- Routing to the expert: efficient reward-guided ensemble of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 1964–1974. External Links: Link, Document Cited by: §1, §7.
- Benchmarking language model creativity: a case study on code generation. arXiv preprint arXiv:2407.09007. Cited by: §1, §7.
- Locally typical sampling. Transactions of the Association for Computational Linguistics (TACL) 11. Cited by: Appendix I.
- Benchmarking distributional alignment of large language models. arXiv preprint arXiv:2411.05403. Cited by: §1, §7.
- Turning up the heat: min-p sampling for creative and coherent llm outputs. arXiv preprint arXiv:2407.01082. Cited by: §1.
- Alphaevolve: a coding agent for scientific and algorithmic discovery. arXiv preprint arXiv:2506.13131. Cited by: §1.
- RouteLLM: learning to route llms with preference data. ArXiv abs/2406.18665. External Links: Link Cited by: 2nd item, §7.
- Does writing with language models reduce content diversity?. arXiv preprint arXiv:2309.05196. Cited by: Appendix I, §1, §7.
- Reaching beyond the mode: rl for distributional reasoning in language models. External Links: 2603.24844, Link Cited by: §7.
- Evaluating story generation systems using automated linguistic analyses. In SIGKDD 2017 Workshop on Machine Learning for Creativity, pp. 13–17. Cited by: Appendix I.
- Whose opinions do language models reflect?. In International Conference on Machine Learning, pp. 29971–30004. Cited by: §1, §7.
- Do massively pretrained language models make better storytellers?. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, pp. 843–861. External Links: Link, Document Cited by: Appendix I.
- Spectrum tuning: post-training for distributional coverage and in-context steerability. arXiv preprint arXiv:2510.06084. Cited by: Appendix I, §7.
- Evaluating the evaluation of diversity in natural language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 326–346. External Links: Link, Document Cited by: Appendix I.
- Multilingual prompting for improving llm generation diversity. arXiv preprint arXiv:2505.15229. Cited by: §7.
- From decoding to meta-generation: inference-time algorithms for large language models. arXiv preprint arXiv:2406.16838. Cited by: §7.
- Base models beat aligned models at randomness and creativity. arXiv preprint arXiv:2505.00047. Cited by: §7.
- Simplestrat: diversifying language model generation with stratification. arXiv preprint arXiv:2410.09038. Cited by: §7.
- Mode-conditioning unlocks superior test-time scaling. arXiv preprint arXiv:2512.01127. Cited by: §1.
- X-coder: advancing competitive programming with fully synthetic tasks, solutions, and tests. arXiv preprint arXiv:2601.06953. Cited by: §1.
- The role of diversity in in-context learning for large language models. arXiv preprint arXiv:2505.19426. Cited by: §7.
- Inf-retriever-v1 (revision 5f469d7). Hugging Face. External Links: Link, Document Cited by: §4.1.
- Router-r1: teaching llms multi-round routing and aggregation via reinforcement learning. ArXiv abs/2506.09033. External Links: Link Cited by: §1, 2nd item, §7.
- Verbalized sampling: how to mitigate mode collapse and unlock llm diversity. arXiv preprint arXiv:2510.01171. Cited by: §H.1, §1, §2.2, §7, footnote 2.
- NoveltyBench: evaluating language models for humanlike diversity. arXiv preprint arXiv:2504.05228. Cited by: Appendix C, Appendix I, Appendix I, 1st item, §2.2, Table 1, Table 1, §4.2, footnote 7.
- Forcing diffuse distributions out of language models. arXiv preprint arXiv:2404.10859. Cited by: Appendix I, §2.2, §2.2, Table 1.
Appendix A The distribution of most diverse models
Figure 6 shows how frequently each model is the best model when the threshold is set to .
Appendix B Prompts
Appendix C Diversity coverage calculation details on open-ended questions
Here we discuss the details of how we evaluate the quality and diversity of answers to open-ended questions. We follow the exact procedure of Zhang et al. (2025c) to partition the answer set and calculate the quality scores. To determine semantic equivalence, we apply their equivalence classifier (used in line 5 of the algorithm below) to all pairs of generations and retain a subset with no mutually equivalent pairs (see Algorithm 1 below). This classifier is fine-tuned on pairs of human-annotated generations conditioned on prompts sampled from NB-Curated and NB-WildChat. We then score the quality of each answer following their process: the score is first derived by a reward model and later mapped to the 1–10 scale. (Footnote 7: We use the Skywork-Reward-Gemma-2-27B-v0.2 model (Liu et al., 2024) as the reward model and the equivalence classifier released by Zhang et al. (2025c) at https://huggingface.co/yimingzhang/deberta-v3-large-generation-similarity.) Their mapping is calibrated by aligning the distribution of reward model scores (from 2,400 MT-Bench generations) with GPT-4-judged quality scores, using thresholds to map reward values to the 1–10 scale.
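The partition-then-score procedure can be illustrated with a minimal sketch. It rests on several assumptions that are not confirmed by the text: the subset is built greedily in generation order, the ideal answer set consists of the same number of unique answers each scoring the maximum of 10, and `toy_equivalent` and `TOY_SCORES` are made-up stand-ins for the equivalence classifier and the reward model.

```python
from typing import Callable, List, Sequence


def dedupe_answers(
    answers: Sequence[str],
    is_equivalent: Callable[[str, str], bool],
) -> List[str]:
    """Greedily keep answers that are not equivalent to any answer kept so far."""
    kept: List[str] = []
    for answer in answers:
        if not any(is_equivalent(answer, other) for other in kept):
            kept.append(answer)
    return kept


def diversity_coverage(
    unique_scores: Sequence[float],
    num_answers: int,
    max_score: float = 10.0,
) -> float:
    """Total quality of the unique answers relative to an ideal set of equal size.

    Assumption: the ideal set holds `num_answers` unique answers that each
    receive `max_score`.
    """
    if num_answers <= 0:
        raise ValueError("num_answers must be positive")
    return sum(unique_scores) / (max_score * num_answers)


# Hypothetical stand-in for the equivalence classifier.
def toy_equivalent(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()


# Hypothetical stand-in for mapped reward-model scores on the 1-10 scale.
TOY_SCORES = {"Paris": 8.0, "Lyon": 6.0}

answers = ["Paris", "paris ", "Lyon", "PARIS"]
unique = dedupe_answers(answers, toy_equivalent)
coverage = diversity_coverage([TOY_SCORES[a] for a in unique], len(answers))
```

In this toy case two of the four sampled answers are unique, so duplicates contribute nothing to the numerator while still counting against the answer budget in the denominator.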
Appendix D Decoding settings
We set the target number () of answers to if not otherwise stated. The temperature and top-p are fixed to and respectively. The maximum number of tokens is set to . We use 2 H200 GPUs for all models. The batch size is . We repeat the sampling process until answers are collected. The inference time varies by model size and family. We disable the thinking mode for Qwen models.
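The "repeat until enough answers are collected" loop can be sketched as follows. This is only an illustration: `collect_answers`, `toy_sampler`, and the `max_calls` safety cap are hypothetical names and choices, and a real sampler would call a model with the decoding settings above.

```python
import itertools
from typing import Callable, List


def collect_answers(
    sample_once: Callable[[], List[str]],
    target: int,
    max_calls: int = 1000,
) -> List[str]:
    """Call `sample_once` repeatedly until `target` answers are collected."""
    collected: List[str] = []
    for _ in range(max_calls):
        if len(collected) >= target:
            break
        # One generation may contain several answers (e.g. a listed response).
        collected.extend(sample_once())
    # Trim any overshoot from the final generation.
    return collected[:target]


# Hypothetical sampler that returns three answers per generation.
_counter = itertools.count()


def toy_sampler() -> List[str]:
    start = next(_counter) * 3
    return [f"answer-{start + i}" for i in range(3)]


answers = collect_answers(toy_sampler, target=10)
```

The cap on the number of calls is a defensive choice for the sketch, so that a sampler returning empty generations cannot loop forever.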
Appendix E Router implementation details
We use the Adam optimizer with a learning rate of . For the BERT classifier, we use the AdamW optimizer with a learning rate of . During training, we perform a grid search over the label type ({soft, one-hot}), weight decay, and hidden dimensions. Routers are selected based on the best scores on the validation set. We experiment with soft labels and one-hot labels as training signals. The soft labels are obtained by normalizing the diversity coverage scores against the most diverse model for the query. One-hot labels are derived by . We find that soft labels work best with the M-way MLP classifier, while one-hot labels work best with the Binary MLP classifier.
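As a small worked example of the soft labels, the sketch below divides each model's coverage score by the score of the most diverse model for that query. The function name, the handling of the all-zero case, and the coverage values are assumptions made for illustration.

```python
from typing import List, Sequence


def soft_labels(coverage_scores: Sequence[float]) -> List[float]:
    """Scale each model's coverage by the best model's coverage for one query."""
    best = max(coverage_scores)
    if best <= 0:
        # Assumption: if no model covers anything, give every model a zero target.
        return [0.0 for _ in coverage_scores]
    return [score / best for score in coverage_scores]


# Made-up coverage scores of three candidate models on a single query.
scores = [0.10, 0.25, 0.20]
labels = soft_labels(scores)
```

Under this normalization the most diverse model always receives a target of 1, and the other models receive targets proportional to how close they come to it.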
Appendix F Scaling router training data
Specifically, we experiment with training the router on 500 and 1K samples from NB-WildChat, and 500, 1K, and 2K samples from Infinity-Chat (Jiang et al., 2025). The results are shown in Table 5. Increasing NB-WildChat training data from 500 to 1K improves diversity coverage on the NB-WildChat test set, though the gain does not transfer to Infinity-Chat. In contrast, scaling Infinity-Chat data from 500 to 2K steadily improves performance on both the Infinity-Chat test set and the NB-WildChat test set, indicating stronger generalization. Finally, jointly training on a combination of NB-WildChat and Infinity-Chat further improves performance, slightly surpassing the best router ( vs. ) trained on K NB-WildChat data in Table 3.
| Method | Training Data | Size | NB-WildChat (eval) | Infinity-Chat (eval) |
| Random | | | 18.13% | 18.24% |
| Top Overall | | | 23.83% | 23.13% |
| Oracle | | | 33.04% | 30.50% |
| Router | NB-WildChat | 500 | 25.28% | 22.58% |
| Router | NB-WildChat | 1K | 26.27% | 22.58% |
| Router | Infinity-Chat | 500 | 23.98% | 22.54% |
| Router | Infinity-Chat | 1K | 24.95% | 23.54% |
| Router | Infinity-Chat | 2K | 25.13% | 23.78% |
| Router | NB-WildChat and Infinity-Chat | 1K and 1K | 26.05% | 23.36% |
| Router | NB-WildChat and Infinity-Chat | 1K and 2K | 26.40% | 23.55% |
Appendix G Router Performance
| Method | NB-WildChat Acc | NB-WildChat #U | NB-WildChat Q | NB-WildChat UQ | NB-WildChat Cov. | NB-Curated (OOD) Acc | NB-Curated (OOD) #U | NB-Curated (OOD) Q | NB-Curated (OOD) UQ | NB-Curated (OOD) Cov. |
| Top Overall | 19.5% | 42.6 | 3.0 | 2.9 | 23.8% | 3.4% | 35.4 | 6.0 | 5.7 | 38.6% |
| Random M / Q | 5.9% | 27.8 | 3.7 | 3.6 | 18.1% | 5.6% | 27.8 | 7.2 | 7.0 | 37.5% |
| Frequency | 12.0% | 33.1 | 3.8 | 3.6 | 21.0% | 9.0% | 28.2 | 7.2 | 7.1 | 39.6% |
| Top M / Q (oracle) | 100% | 38.8 | 4.5 | 4.4 | 33.0% | 100% | 30.3 | 7.6 | 7.4 | 59.6% |
| NN | 16.5% | 34.3 | 3.7 | 3.6 | 23.1% | 5.6% | 28.2 | 7.3 | 7.1 | 39.7% |
| NN | 17.5% | 34.9 | 3.8 | 3.7 | 24.1% | 12.4% | 29.8 | 7.3 | 7.1 | 40.2% |
| M-way BERT | 22.0% | 40.3 | 3.3 | 3.2 | 24.4% | 11.2% | 35.0 | 6.3 | 6.2 | 40.3% |
| M-way MLP(agn) | 24.0% | 35.1 | 3.9 | 3.8 | 25.3% | 12.4% | 30.1 | 7.6 | 7.5 | 40.3% |
| M-way MLP(spec) | 27.0% | 39.3 | 3.5 | 3.4 | 25.9% | 5.6% | 34.6 | 6.3 | 6.1 | 40.2% |
| Binary MLP (agn) | 23.9% | 38.4 | 3.5 | 3.4 | 25.7%∗∗ | 10.8% | 32.8 | 7.1 | 7.0 | 40.7%∗∗ |
| Binary MLP (spec) | 23.9% | 38.1 | 3.6 | 3.5 | 26.3%∗∗ | 13.3% | 30.8 | 7.0 | 6.8 | 39.3%ns |
Appendix H Discussion
H.1 Different Prompt Templates
Prompting methods affect generation diversity (Zhang et al., 2025b). We showed that model ensembling is effective for answers generated by sequential prompting, where models are asked to generate as many distinct answers as possible in one generation and later answers depend on previous ones. Does it also work for other prompting methods? We extend the study in Section 3 to compare three different prompt types (see Appendix B for the exact prompts):
- Generate one: The model is prompted to produce one random answer for the given question.
- Generate two: The model is prompted to provide two possible and different answers for the given question.
- Generate all (our default setting): The model is prompted to list all possible answers sequentially.
| Method | Gen 1 #Unq | Gen 1 Quality | Gen 1 Cov. | Gen 2 #Unq | Gen 2 Quality | Gen 2 Cov. | Gen All #Unq | Gen All Quality | Gen All Cov. |
| Random | 13.7 | 4.9 | 9.9% | 18.3 | 4.5 | 13.2% | 27.8 | 3.7 | 18.1% |
| Frequency | 25.4 | 3.9 | 15.6% | 25.2 | 4.1 | 17.1% | 33.1 | 3.8 | 21.0% |
| Top Overall | 31.8 | 3.2 | 18.5% | 34.8 | 3.1 | 19.7% | 42.6 | 3.0 | 23.8% |
| Oracle (G-1) | 32.8 | 4.1 | 25.6% | 32.1 | 3.7 | 22.3% | 31.0 | 3.3 | 20.4% |
| Oracle (G-2) | 24.1 | 4.3 | 19.5% | 32.3 | 4.7 | 28.3% | 31.6 | 3.3 | 21.0% |
| Oracle (G-All) | 16.4 | 5.0 | 14.7% | 23.1 | 4.6 | 18.8% | 38.8 | 4.5 | 33.0% |
| Router (G-1) | 33.1 | 3.2 | 19.1% | 33.3 | 3.3 | 20.4% | 25.8 | 2.8 | 14.4% |
| Router (G-2) | 26.5 | 4.2 | 19.7% | 30.5 | 3.9 | 21.6% | 29.2 | 3.2 | 19.0% |
| Router (G-All) | 18.3 | 4.7 | 14.8% | 24.4 | 4.3 | 18.1% | 37.5 | 3.7 | 26.2% |
Routing improves diversity for all prompts, yet a router trained on one prompt does not generalize to others.
We ablate the prompting strategies, retrain routers, and evaluate them on all types of prompts. The results are presented in Table 7. We find that the generate all prompt yields the highest diversity coverage, as shown by the coverage scores (Cov.) of the random and oracle baselines. Training a router in-domain consistently improves diversity coverage, yet neither oracle labels nor routers generalize across prompts. Finally, larger gains are observed when routing under better prompts.
| Gen 1 | Gen 2 | Gen All | ||||
| Cov. | Len | Cov. | Len | Cov. | Len | |
| Random | ||||||
| Top model | ||||||
| Top 2 models | ||||||
| Top model per query | ||||||
Tradeoff of the generate all prompt.
Although generate all is the best prompting method, it trades quality for diversity. Under the routing setting, we observe in Table 8 that the length of the answers decreases from generate one to generate two to generate all, by up to (from to ). In addition, as shown in Table 7, although the number of unique answers sampled increases, the average answer quality deteriorates from generate one to generate two to generate all. This is further supported by comparing average answer quality among different prompting methods in Figure 17, which shows that generate all has the lowest answer quality while generate one has the highest. These findings hold for models across different sizes and families. Interestingly, a closer look at the answer generation process suggests that answers generated at later positions have worse quality under the sequential generate all prompt (Figure 5).
H.2 Discussions: Other configurations/hyperparameters that we can vary
| Ratio | #Unq | Qual | Unq Qual | Cov. |
| Oracle model pair | ||||
| 0:50 | 42.50 | 4.16 | 3.95 | 35.40% |
| 5:45 | 43.90 | 4.11 | 3.94 | 36.82% |
| 10:40 | 45.00 | 4.13 | 3.99 | 37.56% |
| 15:35 | 45.50 | 4.09 | 3.88 | 37.10% |
| 20:30 | 46.30 | 3.93 | 3.80 | 36.46% |
| 25:25 | 45.30 | 4.00 | 3.87 | 36.40% |
| Top 2 model pair | ||||
| 0:50 | 36.70 | 3.74 | 3.53 | 24.36% |
| 5:45 | 38.10 | 3.73 | 3.46 | 25.58% |
| 10:40 | 39.20 | 3.58 | 3.34 | 25.60% |
| 15:35 | 40.50 | 3.40 | 3.20 | 25.34% |
| 20:30 | 42.10 | 3.33 | 3.15 | 26.28% |
| 25:25 | 43.60 | 3.26 | 3.12 | 26.98% |
| 30:20 | 44.30 | 3.12 | 3.04 | 26.54% |
| 35:15 | 45.60 | 3.06 | 3.02 | 27.48% |
| 40:10 | 46.20 | 2.99 | 2.98 | 27.62% |
| 45:5 | 46.70 | 2.91 | 2.90 | 27.10% |
| 50:0 | 47.20 | 2.85 | 2.88 | 27.10% |
| avg | 42.75 | 3.27 | 3.15 | 26.36% |
| Random model pair | ||||
| 0:50 | 35.93 | 3.24 | 3.23 | 20.86% |
| 5:45 | 36.56 | 3.26 | 3.24 | 21.51% |
| 10:40 | 37.19 | 3.23 | 3.21 | 21.72% |
| 15:35 | 37.49 | 3.22 | 3.20 | 21.79% |
| 20:30 | 37.73 | 3.23 | 3.20 | 21.86% |
| 25:25 | 37.72 | 3.22 | 3.19 | 21.89% |
| Strategy | #Unq | Qual | Unq Qual | Cov. |
| Oracle ratio | 44.80 | 3.66 | 3.55 | 32.58% |
| Overall best (40:10) | 46.20 | 2.99 | 2.98 | 27.62% |
| Half/half | 43.60 | 3.26 | 3.12 | 26.98% |
More flexible proportions of sampling per model
In the previous setting of routing to two models, we split the sampled answers equally (i.e., if two models are selected to generate 50 answers, each contributes 25 answers). Will a more flexible proportion lead to more diversity? Under the same setting of sampling 50 answers from two models, we experiment with a set of possible ratios (0.0:1.0, 0.1:0.9, 0.2:0.8, 0.3:0.7, 0.4:0.6, and the original 0.5:0.5) to divide the budget between the two models. We conduct two experiments: (1) pick one ratio for all questions and vary the model choices; (2) fix the two models to ensemble (the top 2 by individual performance) and vary the ratio for each question. We present the results in Table 9 and Table 10, respectively. We find that for oracle, random, and top-2 model pairs, different global ratios make little difference in output diversity. If we fix the 2 models to ensemble and optimize the ratio for each question, the score improves over rigid half/half mixing (32.58% vs 26.98%).
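The two ratio experiments can be contrasted with a short sketch: one ratio chosen globally versus the best ratio chosen per question. The helper names are hypothetical and the coverage values in `toy_cov` are made up purely for illustration; they are not results from the paper.

```python
from typing import Dict, List, Tuple


def split_budget(total: int, share_first: float) -> Tuple[int, int]:
    """Split `total` answers between two models; `share_first` lies in [0, 1]."""
    if not 0.0 <= share_first <= 1.0:
        raise ValueError("share_first must lie in [0, 1]")
    first = round(total * share_first)
    return first, total - first


def best_global_ratio(cov: Dict[str, List[float]]) -> str:
    """Ratio with the highest mean coverage when applied to every question."""
    return max(cov, key=lambda ratio: sum(cov[ratio]) / len(cov[ratio]))


def oracle_ratio_mean(cov: Dict[str, List[float]]) -> float:
    """Mean coverage when the best ratio is picked separately for each question."""
    best_per_question = [max(scores) for scores in zip(*cov.values())]
    return sum(best_per_question) / len(best_per_question)


# Made-up coverage of three questions under three budget ratios.
toy_cov = {
    "0:50": [0.20, 0.30, 0.10],
    "25:25": [0.25, 0.20, 0.15],
    "50:0": [0.10, 0.25, 0.30],
}

budget = split_budget(50, 0.2)
global_choice = best_global_ratio(toy_cov)
oracle_mean = oracle_ratio_mean(toy_cov)
```

Because the per-question oracle takes a maximum for every question, its mean coverage can never fall below that of any single global ratio, which is why it serves as an upper bound for ratio selection.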
| #Unq | Qual | Unq Qual | Cov. | |
| 1 | 42.50 | 4.16 | 3.95 | 35.40% |
| 2 | 45.30 | 4.00 | 3.87 | 36.08% |
| 3 | 44.00 | 4.04 | 3.86 | 35.72% |
| 4 | 44.20 | 4.13 | 3.79 | 35.96% |
| 5 | 43.80 | 4.13 | 3.79 | 35.74% |
| 6 | 44.30 | 3.93 | 3.58 | 35.04% |
| 7 | 43.40 | 4.08 | 3.76 | 35.12% |
| 8 | 43.30 | 3.94 | 3.70 | 34.02% |
| 9 | 42.90 | 4.02 | 3.73 | 34.00% |
| 10 | 43.10 | 3.92 | 3.65 | 33.36% |
| 11 | 42.40 | 3.87 | 3.66 | 31.82% |
| 12 | 42.40 | 3.80 | 3.56 | 30.92% |
| 13 | 40.40 | 3.83 | 3.65 | 29.56% |
| 14 | 40.10 | 3.80 | 3.63 | 28.88% |
| 15 | 38.80 | 3.70 | 3.58 | 27.38% |
| 16 | 37.90 | 3.66 | 3.53 | 26.10% |
| 17 | 37.30 | 3.73 | 3.54 | 25.66% |
| 18 | 36.60 | 3.67 | 3.56 | 24.32% |
| #Unq | Qual | Unq Qual | Cov. | |
| 1 | 47.20 | 2.85 | 2.88 | 27.10% |
| 2 | 43.60 | 3.26 | 3.12 | 26.98% |
| 3 | 41.80 | 3.21 | 3.06 | 24.90% |
| 4 | 41.90 | 3.31 | 3.07 | 25.40% |
| 5 | 40.60 | 3.47 | 3.27 | 25.42% |
| 6 | 40.70 | 3.53 | 3.34 | 26.60% |
| 7 | 40.90 | 3.59 | 3.42 | 27.00% |
| 8 | 40.00 | 3.65 | 3.54 | 27.26% |
| 9 | 38.40 | 3.85 | 3.60 | 26.48% |
| 10 | 38.90 | 3.84 | 3.64 | 27.20% |
| 11 | 37.70 | 3.81 | 3.52 | 25.70% |
| 12 | 38.50 | 3.62 | 3.30 | 25.00% |
| 13 | 35.20 | 3.70 | 3.48 | 23.22% |
| 14 | 37.20 | 3.57 | 3.35 | 23.96% |
| 15 | 36.40 | 3.60 | 3.44 | 23.98% |
| 16 | 35.80 | 3.67 | 3.48 | 23.82% |
| 17 | 36.40 | 3.38 | 3.28 | 21.98% |
| 18 | 34.90 | 3.57 | 3.52 | 22.50% |
| #Unq | Qual | Unq Qual | Cov. | |
| 1 | 35.93 | 3.24 | 3.23 | 20.86% |
| 2 | 37.72 | 3.22 | 3.19 | 21.85% |
| 3 | 38.63 | 3.23 | 3.17 | 22.38% |
| 4 | 39.03 | 3.21 | 3.13 | 22.60% |
| 5 | 39.11 | 3.22 | 3.13 | 22.89% |
| 6 | 38.89 | 3.29 | 3.18 | 23.36% |
| 7 | 38.53 | 3.34 | 3.22 | 23.58% |
| 8 | 38.33 | 3.37 | 3.26 | 23.70% |
| 9 | 38.35 | 3.44 | 3.32 | 24.08% |
| 10 | 38.05 | 3.47 | 3.34 | 24.26% |
| 11 | 37.82 | 3.46 | 3.32 | 23.66% |
| 12 | 37.45 | 3.49 | 3.36 | 23.80% |
| 13 | 37.44 | 3.53 | 3.39 | 23.70% |
| 14 | 36.93 | 3.54 | 3.41 | 23.49% |
| 15 | 36.46 | 3.58 | 3.44 | 23.49% |
| 16 | 35.94 | 3.59 | 3.45 | 23.32% |
| 17 | 36.66 | 3.63 | 3.51 | 23.92% |
| 18 | 36.60 | 3.67 | 3.56 | 24.32% |
| Strategy | #Unq | Qual | Unq Qual | Cov. |
| Oracle | 44.10 | 3.93 | 3.73 | 34.24% |
| Best overall () | 40.00 | 3.65 | 3.54 | 27.26% |
| Random | 39.23 | 3.53 | 3.35 | 25.25% |
Varying the number of models to ensemble from
In previous experiments, we fix the number of activated models to be 1 (routing to the best model per query) or 2 (routing to the two best models per query). Will sampling answers from more models, while keeping the total number of answers unchanged, improve diversity? We answer this question with two experiments: (1) fix the number of models for all questions and vary the selected models per question; (2) fix the order in which models are selected (ranked by individual performance) and vary the number of models per question. We present the results in Table 11, Table 12, Table 13, and Table 14. We find that routing to a custom model per question remains the most promising approach (under oracle settings). Routing to two models can offer further gains, but ensembling more than 2 models does not improve output diversity.
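Keeping the total number of answers fixed while activating more models requires dividing the budget among them. A minimal sketch, assuming an even split in which any remainder goes to the higher-ranked models (the text does not specify how remainders are handled, and the model names are placeholders):

```python
from typing import Dict, List


def allocate_evenly(total_answers: int, ranked_models: List[str]) -> Dict[str, int]:
    """Divide a fixed answer budget across models listed from best to worst."""
    if not ranked_models:
        raise ValueError("at least one model is required")
    base, extra = divmod(total_answers, len(ranked_models))
    # Assumption: the first `extra` (highest-ranked) models get one more answer.
    return {
        name: base + (1 if rank < extra else 0)
        for rank, name in enumerate(ranked_models)
    }


# Placeholder model names; 50 answers shared by three activated models.
allocation = allocate_evenly(50, ["model_a", "model_b", "model_c"])
```

With a budget of 50 and three models, this gives 17, 17, and 16 answers, so the total stays at 50 regardless of how many models are activated.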
Scaling the number of candidates and generations
In this paper, we study selecting models from a pool of 18 candidates. However, in a real-world setting, users can choose from hundreds of models. Future work can therefore explore a larger pool of LLMs to better harness their complementary strengths in uncovering more diverse answers. In addition, the space of valid answers to open-ended questions is effectively unbounded, and future work is encouraged to explore sample sizes beyond .
Appendix I Extended related work
Measuring output diversity
Traditional metrics for measuring lexical diversity and text style are based on token and POS n-gram statistics (Roemmele et al., 2017; See et al., 2019; Tevet and Berant, 2021; Meister et al., 2023) and embedding similarity between candidates (Padmakumar and He, 2023). Later works go beyond the distinctness of outputs and also measure the validity of each response. Zhang et al. (2024) propose to evaluate the diversity of LLMs by calculating the coverage of gold targets and the KL-divergence from the desired distribution. However, providing ground-truth distributions for open-ended questions is non-trivial. Closely related to our work, Zhang et al. (2025c) introduce the notion of user-perceived utility, which jointly models uniqueness and quality while accounting for user patience. In this framework, uniqueness is computed by partitioning sampled answers into non-equivalent groups, and answer quality is estimated using reward model scores. However, this metric penalizes later-generated responses, whereas our goal is to assess how well a set of answers covers the answer space regardless of generation order.
Similarly, Sorensen et al. (2025) evaluate how well a model covers an open-ended output space using validity and diversity metrics. However, their evaluation relies on expensive human annotations and is therefore applied only to a single model with four generations per prompt. In this work, we build on the framework of Zhang et al. (2025c) and propose diversity coverage, a metric that evaluates how well a set of generated answers covers the valid answer space across many generations and multiple LLMs without requiring additional human supervision.
Appendix J Verbalized Sampling
Similar to our baselines, Verbalized Sampling is a recent prompting technique that increases LLM output diversity. We decided not to include it in the main experiments since it performs similarly to (if not worse than) our generate all baseline. We include the evidence in Table 15 and Table 16:
| Prompt | Model | Cov. % | |||||
| 1 | 10 | 20 | 50 | 100 | 1000 | ||
| prompt_vanilla | Llama 8B | ||||||
| prompt_verbalized_all | |||||||
| system_vanilla | |||||||
| system_verbalized_all | |||||||
| prompt_vanilla | GPT-4o | ||||||
| prompt_verbalized_all | |||||||
| system_vanilla | |||||||
| system_verbalized_all | |||||||
| Prompt | Model | Cov. % | ||||||
| 1 | 5 | 10 | 20 | 50 | 100 | 200 | ||
| prompt_vanilla | Llama 8B | |||||||
| prompt_verbalized_all | ||||||||
| system_vanilla | ||||||||
| system_verbalized_all | ||||||||
| prompt_vanilla | GPT-4o | |||||||
| prompt_verbalized_all | ||||||||
| system_vanilla | ||||||||
| system_verbalized_all | ||||||||
Appendix K Generating diverse outputs out of a single model


K.1 Experiment settings
Decoding settings
For each prompting strategy and desired number of answers , we repeatedly sample generations from the model until we collect answers. For Simple Questions, we use . For NB-Curated, we use . We set the temperature to and top_p to . The max_len is set to 2048 by default. For the generate all setting on NB-Curated, we extend max_len to 4096 because generations cannot finish within 2048 tokens.
K.2 Results
Compare different prompting strategies
Figure 18 shows how different prompting strategies affect the diversity of combined answers. For all models on both datasets, sequential generation yields substantially more answer diversity than parallel methods. With the best prompt, models on Simple Questions saturate at a coverage rate of more than . This suggests that for easy diversity questions, nearly all models have good knowledge of the full answer space. On NB-Curated, answer diversity keeps growing as more generations are sampled. This reveals a large untapped potential for uncovering more unique and high-quality responses to open-ended queries.
How does model size affect diversity?
According to Figure 18, most models perform similarly on Simple Questions, except for the smallest model (Qwen 0.6B). On NB-Curated, medium-sized models (Llama 8B and Qwen 14B) consistently have higher overall diversity than extremely large or small ones. We hypothesize that these models best balance answer distinctness and quality, thereby achieving the highest diversity performance. Figure 20 shows that answer uniqueness decreases as model size grows, while Figure 17 shows that answer quality increases with model size. Finally, we also notice that model rankings are largely unchanged regardless of the number of collected answers.