Capturing Individual Human Preferences with Reward Features

Barreto, André; Dumoulin, Vincent; Mao, Yiran; Rowland, Mark; Perez-Nieves, Nicolas; Shahriari, Bobak; Dauphin, Yann; Precup, Doina; Larochelle, Hugo

arXiv:2503.17338 (cs)

[Submitted on 21 Mar 2025 (v1), last revised 19 Feb 2026 (this version, v2)]

Abstract: Reinforcement learning from human feedback usually models preferences using a reward function that does not distinguish between people. We argue that this is unlikely to be a good design choice in contexts with high potential for disagreement, like in the training of large language models. We formalise and analyse the problem of learning a reward model that can be specialised to a user. Using the principle of empirical risk minimisation, we derive a probably approximately correct (PAC) bound showing the dependency of the approximation error on the number of training examples, as usual, and also on the number of human raters who provided feedback on them. Based on our theoretical findings, we discuss how to best collect pairwise preference data and argue that adaptive reward models should be beneficial when there is considerable disagreement among users. We also propose a concrete architecture for an adaptive reward model. Our approach leverages the observation that individual preferences can be captured as a linear combination of a set of general reward features. We show how to learn such features and subsequently use them to quickly adapt the reward model to a specific individual, even if their preferences are not reflected in the training data. We present experiments with large language models illustrating our theoretical results and comparing the proposed architecture with a non-adaptive baseline. Consistent with our analysis, the benefits provided by our model increase with the number of raters and the heterogeneity of their preferences. We also show that our model compares favourably to adaptive counterparts, including those performing in-context personalisation.
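
The adaptation step mentioned in the abstract (a user's reward as a linear combination of shared reward features) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a Bradley-Terry likelihood for pairwise preferences, treats the reward features as already learned and frozen, and fits only the per-user weight vector by plain gradient descent. The names `adapt_user_weights` and `user_reward`, the random stand-in features, and all hyperparameters are hypothetical.

```python
import numpy as np


def adapt_user_weights(phi_a, phi_b, prefers_a, steps=500, lr=0.5):
    """Fit per-user weights w over frozen reward features (illustrative only).

    phi_a, phi_b: (n, d) feature arrays for the two responses in each pair.
    prefers_a:    (n,) array of 0/1 labels; 1 means the user preferred response a.
    Assumption: Bradley-Terry model, P(a preferred) = sigmoid(w . (phi_a - phi_b)).
    """
    diff = phi_a - phi_b
    w = np.zeros(diff.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))  # predicted P(a preferred)
        grad = diff.T @ (p - prefers_a) / len(prefers_a)  # mean logistic-loss gradient
        w -= lr * grad
    return w


def user_reward(phi, w):
    """User-specific reward: a linear combination of the shared features."""
    return phi @ w


# Synthetic demo: random vectors stand in for learned reward features.
rng = np.random.default_rng(0)
hidden_w = np.array([1.0, -2.0, 0.5, 0.0])  # hypothetical "true" user weights
feats_a = rng.normal(size=(200, 4))
feats_b = rng.normal(size=(200, 4))
labels = ((feats_a - feats_b) @ hidden_w > 0).astype(float)

w_user = adapt_user_weights(feats_a, feats_b, labels)
agreement = np.mean((user_reward(feats_a, w_user) > user_reward(feats_b, w_user)) == (labels == 1))
print("agreement with the user's training preferences:", agreement)
```

Because only the low-dimensional vector `w` is fitted, a small number of preference pairs from a new user can suffice in this toy setting, which is the intuition behind adapting quickly to individuals who were not among the training raters.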

Comments:	Published at NeurIPS 2025
Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:2503.17338 [cs.AI]
	(or arXiv:2503.17338v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2503.17338

Submission history

From: André Barreto
[v1] Fri, 21 Mar 2025 17:39:33 UTC (99 KB)
[v2] Thu, 19 Feb 2026 16:23:22 UTC (174 KB)
