Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts

Haoxiang Wang*¹, Wei Xiong*¹, Tengyang Xie², Han Zhao¹, Tong Zhang¹

¹ University of Illinois Urbana-Champaign, ² University of Wisconsin–Madison

Abstract

Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability, as humans cannot intuitively understand why an RM thinks a response is good or not. As RMs act as human preference proxies, it is desirable for them to be human-interpretable to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data, each dimension corresponding to a human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context. We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a benchmark evaluating RMs for language modeling. Notably, the performance of our model surpasses the LLM-as-a-judge method with GPT-4 judges by a margin, and approaches the performance of the much larger Nemotron-4 340B reward model. Our code and model are released at https://github.com/RLHFlow/RLHF-Reward-Modeling.

1 Introduction

In this paper, we explore the role of reward models (RMs) within the framework of Reinforcement Learning from Human Feedback (RLHF). RMs play a crucial role in aligning large language models (LLMs) as they provide a scalable way to integrate human preferences into the models’ training process, guiding the optimization of their policies. To provide more context, we first review the most standard and popular RLHF frameworks and the role of RMs in them. Arguably the dominant RLHF approach is a deep reinforcement learning (DRL)-based framework, as developed in key studies [Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022]. This framework operates in three stages: 1) Preference data collection; 2) Reward modeling based on the Bradley-Terry model [Bradley and Terry, 1952]; 3) Policy optimization using Proximal Policy Optimization (PPO) [Schulman et al., 2017] and the reward model constructed in stage 2. This framework has achieved tremendous success in the post-training of ChatGPT [Ouyang et al., 2022] and Claude [Bai et al., 2022]. These ideas also extend to other approaches, such as rejection sampling fine-tuning [Dong et al., 2023; Gulcehre et al., 2023] and iterative direct preference learning [Xiong et al., 2023; Guo et al., 2024; Xie et al., 2024]. In these approaches, the intermediate policy is typically deployed iteratively to collect new responses, the reward model labels those responses, and the model is then fine-tuned on the newly collected preference data. In all of these RLHF frameworks, the capacity of the reward model is crucial as it directly affects the quality of the aligned LLMs.


Figure 1: Architecture of our reward model. It consists of an LLM backbone, a regression layer for multi-objective reward modeling, and a gating layer that outputs coefficients to scalarize the reward objectives into a scalar score.

The most popular reward modeling approach is based on the maximum likelihood estimation (MLE) of the Bradley-Terry (BT) model [Bradley and Terry, 1952]. Despite its widespread use, the BT model is rather limited in its capacity to capture complicated human preferences [Munos et al., 2023; Swamy et al., 2024; Ye et al., 2024]. In addition to the capacity issue, common RMs, like the BT model, are typically black-box models that output scores or preferences without providing human-interpretable explanations, making them subject to the widely observed phenomenon of reward hacking [Skalse et al., 2022; Singhal et al., 2023; Chen et al., 2024], where the aligned LLMs generate high-reward responses (as rated by the RM) that do not align with actual human preferences [Gao et al., 2023; Lin et al., 2023; Coste et al., 2023]. A notable example of this is the verbosity bias, where aligned LLMs produce longer-than-necessary responses because the RM favors length, regardless of quality [Singhal et al., 2023; Wang et al., 2024a; Chen et al., 2024].

In this work, we aim to enhance reward models by making them more interpretable [Molnar, 2020] and steerable [Wong et al., 2021]. Using the aforementioned verbosity bias as an example, suppose the RM’s output is decomposable, meaning that it assigns a high score to a response due to two factors: 40% for its helpfulness and 60% for its length. In this case, we can see that the RM may suffer from the verbosity bias. Furthermore, if the RM is steerable, we could adjust its decision-making process to base its scoring 100% on helpfulness. This would be regardless of response length, thus mitigating the verbosity bias. Enhancing the interpretability of RMs also allows humans to verify whether RMs have similar internal decision processes to humans when acting as proxies for human preferences. We believe that this human-AI interaction process could ensure that RMs are consistent with human values and preferences, making RM-aligned LLMs more reliable and robust.

At a high level, we propose a two-stage approach that first trains a multi-objective RM and then learns a gating layer that scalarizes reward objectives in a mixture-of-experts way. We then empirically validate its effectiveness by training such an RM with Llama-3 8B [Meta, 2024], and obtain state-of-the-art performance on RewardBench, a benchmark to evaluate RMs.

2.1 RLHF Algorithms

The PPO-based RLHF framework was first popularized by Christiano et al. [2017] and further developed by Bai et al. [2022] and Ouyang et al. [2022] to build ChatGPT and Claude; it leverages a reward model to provide feedback during the RLHF process. However, getting PPO to work is challenging in the context of LLMs [Choshen et al., 2019; Engstrom et al., 2020]. Thus, much effort has been devoted to proposing alternatives to PPO, such as REINFORCE algorithm variants [Li et al., 2023; Shao et al., 2024]. Another popular approach is the reward-ranked fine-tuning algorithm (RAFT) [Dong et al., 2023; Gulcehre et al., 2023], which was used in LLaMA2 [Touvron et al., 2023], Llama-3 [Meta, 2024], Qwen2 [qwe, 2024] and Apple Intelligence. To implement rejection sampling, we typically sample $n$ responses per prompt and use a reward model to rank them according to some criteria. Then, we fine-tune the model on the high-rank responses (e.g., the one with the highest reward value). This algorithm is a strong baseline, especially in reasoning tasks [Aksitov et al., 2023; Havrilla et al., 2024]. All approaches mentioned above leverage external reward models to provide supervision signals during the RLHF process.
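The rejection-sampling step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in any of the cited systems: `generate` and `reward` are hypothetical stand-ins for a policy sampler and a reward model, and keeping only the single best response is one of several possible selection rules.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical policy sampler
    reward: Callable[[str, str], float],  # hypothetical reward model
    n: int = 8,
) -> str:
    """Sample n responses for a prompt and return the highest-reward one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```

The selected responses would then form the fine-tuning set for the next round.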

There is also a line of work studying direct preference learning algorithms [Zhao et al., 2023; Rafailov et al., 2023; Azar et al., 2023; Tang et al., 2024], which bypass traditional reward modeling to learn directly from preference datasets in a supervised manner (hence the name direct preference learning). Direct Preference Optimization (DPO) is the most representative one. However, the original DPO is an offline algorithm without further exploration of the environment. Subsequent studies demonstrate that online iterative variants surpass the original DPO by large margins [Xiong et al., 2023; Liu et al., 2023; Xu et al., 2023; Rosset et al., 2024; Guo et al., 2024; Xie et al., 2024; Zhang et al., 2024; Dong et al., 2024]. Specifically, we can iteratively deploy the intermediate policy to collect new responses, use the external reward model to label them, and further fine-tune the model on the newly collected preference data using the DPO objective.
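The labeling step in such iterative schemes can be sketched as below. The text only states that an external reward model labels the new responses; taking the highest- and lowest-reward responses as the (chosen, rejected) pair is an assumption made here for illustration, and `reward` is a hypothetical stand-in for the reward model.

```python
from typing import Callable, List, Tuple


def label_preference_pair(
    prompt: str,
    responses: List[str],
    reward: Callable[[str, str], float],  # hypothetical external reward model
) -> Tuple[str, str]:
    """Return (chosen, rejected) as the highest- and lowest-reward responses."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    ranked = sorted(responses, key=lambda y: reward(prompt, y))
    return ranked[-1], ranked[0]
```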

To summarize, all the existing popular RLHF algorithms require an external reward model to provide preference signals to achieve their best performance.

2.2 Reward modeling in RLHF

Traditionally, reward models in RLHF have utilized the Bradley-Terry (BT) model for preference estimation [Bradley and Terry, 1952; Ouyang et al., 2022; Bai et al., 2022; Wang et al., 2023b; Rafailov et al., 2023]. Despite its widespread use, the BT model’s inability to handle complex, intransitive preferences has been highlighted in recent studies [Munos et al., 2023; Swamy et al., 2024; Ye et al., 2024]. It is also argued that a DPO-aligned model can serve as a reward function to provide token-wise rewards [Rafailov et al., 2024; Zhong et al., 2024], which are still confined to the BT model. There are also works that drop the BT assumption and directly model the probability of one response being preferred over another [Jiang et al., 2023; Zhao et al., 2023; Liu et al., 2023; Dong et al., 2024]. These models are referred to as pairwise preference models, as they take two responses as the input. Another line of work explores multi-objective reward models that attempt to capture complicated human preferences more effectively [Touvron et al., 2023; Wang et al., 2023a, 2024a]. However, the integration of these multi-dimensional signals typically relies on naive methods such as linear combinations, indicating a need for more sophisticated techniques.
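For reference, the BT model in its standard form is written below. The notation is introduced here for illustration and is not taken from the text: $r(x, y)$ is a scalar reward and $\sigma$ is the logistic sigmoid.

$$P(y_1 \succ y_2 \mid x) = \frac{\exp(r(x, y_1))}{\exp(r(x, y_1)) + \exp(r(x, y_2))} = \sigma\big(r(x, y_1) - r(x, y_2)\big)$$

Because the preference probability depends only on the difference of two scalar rewards, the ordering it induces is always transitive, which is why the model cannot represent intransitive preferences.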

3 Methodology

3.1 Multi-Objective Reward Modeling

Most existing reward models for LLM alignment are trained with Bradley-Terry loss on pairwise data with annotated preferences [Bai et al., 2022; Touvron et al., 2023; Ouyang et al., 2022], using the same approach as InstructGPT [Ouyang et al., 2022]. The pairwise preference annotations are essentially binary labels, e.g., $\{0, 1\}$, indicating which response is preferred by the annotator. We call them relative ratings here. However, in some recent high-quality datasets, the relative ratings are converted from absolute ratings. For instance, UltraFeedback [Cui et al., 2023] is curated with 5-objective absolute ratings: Overall Score, Instruction Following, Truthfulness, Honesty, and Helpfulness (each objective has 5 distinct ratings based on pre-defined rubrics). The dataset is further binarized into pairwise comparisons, using the Overall Score, or the average score of the remaining 4 objectives, for training reward models or DPO. The original ratings are fine-grained, as each objective has consecutive integer rating scores (e.g., 1, 2, 3, 4, 5). However, the binarization process discards some of this fine-grained information. For example, a pair of examples with scores 1:5 is labeled in the same way as another pair with scores 2:3. There is no justification that discarding this fine-grained preference information is beneficial. Hence, we would like to include all fine-grained information for reward modeling.
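The information loss from binarization can be seen in a tiny example. The scores below are illustrative and not drawn from any dataset, and mapping ties to 0 is an arbitrary choice made for this sketch.

```python
def binarize(score_a: int, score_b: int) -> int:
    """Return 1 if response A is preferred over B, else 0 (ties map to 0 here)."""
    return int(score_a > score_b)


def margin(score_a: int, score_b: int) -> int:
    """Absolute-rating gap that the binary label discards."""
    return abs(score_a - score_b)


pairs = [(5, 1), (3, 2)]  # illustrative (score_A, score_B) pairs on a 1-5 scale
labels = [binarize(a, b) for a, b in pairs]
gaps = [margin(a, b) for a, b in pairs]
print(labels)  # both pairs receive the same binary label
print(gaps)    # but the underlying rating gaps differ
```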

As the training examples come with multi-objective ratings, the straightforward approach for learning with these ratings is multi-objective regression (this approach is also adopted in Directional Preference Alignment [Wang et al., 2024a] and HelpSteer [Wang et al., 2023a]). Here, we briefly introduce the training procedure. We consider each example to consist of a prompt $x$ (including contexts from previous conversation turns), a response $y$, and a $k$-dimensional rating vector $r \in \mathbb{R}^{k}$, where each dimension corresponds to a reward objective such as helpfulness or truthfulness. We take a pre-trained decoder-only LLM without the original output linear layer as the feature extractor $f_{\theta}$. We pass $x \oplus y$, the concatenation of $x$ and $y$, through the decoder layers and take the hidden state of the final decoder layer on the last token as a $d$-dimensional feature. We also attach a new linear regression layer $w \in \mathbb{R}^{d \times k}$ on top of $f_{\theta}$, which outputs a $k$-dimensional rating prediction. The model can be trained with a simple regression loss:

$$\min_{\theta, w} \; \mathbb{E}_{x, y, r \in D} \left\| w^{\top} f_{\theta}(x \oplus y) - r \right\|_{2}^{2} \tag{1}$$
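As a rough sketch of Eq. (1) in the special case where the backbone is frozen and only $w$ is fit (linear probing on saved features), an ordinary least-squares solve is enough. The snippet uses NumPy on synthetic features as a stand-in for real backbone outputs. It fits no intercept and does not handle examples with missing objectives, both of which are simplifications assumed here.

```python
import numpy as np


def fit_regression_head(features: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    """Least-squares fit of w (d x k) minimizing ||features @ w - ratings||^2."""
    w, *_ = np.linalg.lstsq(features, ratings, rcond=None)
    return w


def predict_rewards(features: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Predict k-dimensional rewards: one row of w^T f per input feature row f."""
    return features @ w


# Synthetic stand-in for backbone features: n examples, d dims, k objectives.
rng = np.random.default_rng(0)
n, d, k = 200, 16, 3
features = rng.normal(size=(n, d))
true_w = rng.normal(size=(d, k))
ratings = features @ true_w + 0.01 * rng.normal(size=(n, k))

w = fit_regression_head(features, ratings)
print(w.shape)  # (16, 3)
```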

3.2 Mixture-of-Experts Scalarization of Reward Objectives

An ArmoRM can predict multi-objective rewards for each response. However, the multi-dimensional outputs need to be reduced to a scalar for ranking or pairwise comparisons of test examples. A straightforward approach is to take a linear combination of multiple objectives [Hu et al., 2024] as in the literature of multitask learning. However, using fixed combination coefficients is too rigid for complex application scenarios. For instance, for prompts that could easily trigger unsafe responses, the safety objective should be assigned a large coefficient, as we wish the reward model to rank unsafe responses lower than safe ones. For prompts for math problem assistance, the safety objective becomes less relevant, and the helpfulness-related objectives should be the primary focus.

With the insight mentioned above, we propose a MoE-style scalarization of reward objectives, conditioned on the prompt $x$. At the architecture level, we just need to follow common MoE practice and add a gating layer, $g_{\phi}: \mathbb{R}^{d} \mapsto \{v \in \mathbb{R}^{k} \mid v_{i} \geq 0 \ \mathrm{and} \ \sum v_{i} = 1\}$, that outputs non-negative coefficients (summing to 1) for the reward objectives based on the feature extracted from the prompt, $f_{\theta}(x) \in \mathbb{R}^{d}$, i.e., the hidden state on the last token of $x$. Notice that $f_{\theta}(x)$ is provided for free in the forward pass of $f_{\theta}(x \oplus y)$, making the pipeline inference-efficient.

The gating layer $g_{\phi}$ can simply be a shallow MLP (i.e., fully-connected network) that takes the prompt feature $f_{\theta}(x)$ and outputs a $k$-dimensional vector, followed by a softmax function to ensure the elements of the output vector are non-negative and sum to 1.
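A forward pass of such a gating layer can be sketched in NumPy as follows. The layer sizes and random weights are placeholders chosen for illustration, and the ReLU activation is an assumption of this sketch; only the structure (shallow MLP followed by softmax) comes from the description above.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def gating_forward(prompt_feature: np.ndarray, weights, biases) -> np.ndarray:
    """Shallow ReLU MLP followed by softmax; returns k non-negative coefficients."""
    h = prompt_feature
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)  # hidden layer with ReLU
    logits = h @ weights[-1] + biases[-1]
    return softmax(logits)


# Illustrative sizes only: d-dim prompt feature, one hidden layer, k objectives.
rng = np.random.default_rng(0)
d, hidden, k = 8, 16, 4
weights = [rng.normal(size=(d, hidden)), rng.normal(size=(hidden, k))]
biases = [np.zeros(hidden), np.zeros(k)]

coeffs = gating_forward(rng.normal(size=d), weights, biases)
print(coeffs, coeffs.sum())
```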

However, most reward objectives are highly correlated with verbosity, which indicates a strong verbosity bias [Saito et al., 2023]. Using non-negative gating coefficients would make the final output inherit the bias. To resolve the issue, we adjust each reward objective, $r_{i}$, with a penalty using the verbosity reward objective,

$$r_{i}' \leftarrow r_{i} - \lambda_{i} \, r_{\mathrm{verbose}} \tag{2}$$

where the penalty coefficient $\lambda_{i}$ is chosen such that, for a proper correlation metric (e.g., Pearson or Spearman correlation coefficient) and a reference data distribution $\mathcal{D}$,

$$\mathrm{Corr}_{\mathcal{D}}(r_{i}', r_{\mathrm{verbose}}) = 0 \tag{3}$$

The adjusted reward vector is denoted as $r' \in \mathbb{R}^{k}$.
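For the Pearson case, the condition in Eq. (3) has a closed-form solution, $\lambda_{i} = \mathrm{Cov}(r_{i}, r_{\mathrm{verbose}}) / \mathrm{Var}(r_{\mathrm{verbose}})$, since the covariance of $r_{i} - \lambda_{i} r_{\mathrm{verbose}}$ with $r_{\mathrm{verbose}}$ then vanishes. The sketch below checks this on synthetic rewards. It is an illustration under that Pearson assumption only. A rank-based metric such as Spearman has no such closed form and would presumably need a one-dimensional search over $\lambda_{i}$ instead.

```python
import numpy as np


def pearson_penalty(r_i: np.ndarray, r_verbose: np.ndarray) -> float:
    """lambda_i = Cov(r_i, r_verbose) / Var(r_verbose); zeroes the Pearson correlation."""
    centered_v = r_verbose - r_verbose.mean()
    return float(((r_i - r_i.mean()) @ centered_v) / (centered_v @ centered_v))


# Synthetic rewards in which a "helpfulness" score partly tracks verbosity.
rng = np.random.default_rng(0)
r_verbose = rng.normal(size=1000)
r_help = 0.6 * r_verbose + rng.normal(size=1000)

lam = pearson_penalty(r_help, r_verbose)
r_help_adj = r_help - lam * r_verbose
print(np.corrcoef(r_help_adj, r_verbose)[0, 1])  # approximately 0
```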

Finally, we multiply the multi-objective rewards by the gating coefficients to obtain a scalar score $R$ for the response $y$ given prompt $x$,

$$R = g_{\phi}(f_{\theta}(x))^{\top} r' \tag{4}$$

To train the gating layer, we freeze the backbone and the regression layer, and only train the gating layer using the Bradley-Terry loss with an additional scaling variable, $\beta \in \mathbb{R}$,

$$\min_{\phi, \beta} \; \mathbb{E}\left[ -\log \frac{\exp(\beta R_{\mathrm{chosen}})}{\exp(\beta R_{\mathrm{chosen}}) + \exp(\beta R_{\mathrm{rejected}})} \right] \tag{5}$$

where $R_{\mathrm{chosen}}$ and $R_{\mathrm{rejected}}$ are the preference scores for the chosen and rejected responses in each pairwise example, $(x, y_{\mathrm{chosen}}, y_{\mathrm{rejected}})$.
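The scalarization in Eq. (4) and the per-pair loss in Eq. (5) can be sketched in plain Python. The loss uses the identity that the fraction in Eq. (5) equals a sigmoid of $\beta (R_{\mathrm{chosen}} - R_{\mathrm{rejected}})$, evaluated in a numerically stable form. The coefficient and reward values are made up for illustration.

```python
import math
from typing import Sequence


def scalarize(gating_coeffs: Sequence[float], adjusted_rewards: Sequence[float]) -> float:
    """Eq. (4): dot product of gating coefficients and adjusted rewards."""
    if len(gating_coeffs) != len(adjusted_rewards):
        raise ValueError("coefficient and reward vectors must have the same length")
    return sum(g * r for g, r in zip(gating_coeffs, adjusted_rewards))


def scaled_bt_loss(r_chosen: float, r_rejected: float, beta: float) -> float:
    """Eq. (5) for one pair: -log sigmoid(beta * (R_chosen - R_rejected))."""
    z = beta * (r_chosen - r_rejected)
    # log(1 + exp(-z)), written to avoid overflow for large |z|
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


coeffs = [0.7, 0.2, 0.1]                   # hypothetical gating output for one prompt
r_c = scalarize(coeffs, [0.9, 0.5, 0.1])   # chosen response
r_r = scalarize(coeffs, [0.2, 0.6, 0.3])   # rejected response
print(scaled_bt_loss(r_c, r_r, beta=100.0))
```

Both responses share the same coefficients because the gating layer depends only on the prompt.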

4 Experiment

Table 1: Performance comparison on RewardBench. The benchmark consists of four primary categories (weight 1.0) and one category of prior sets (weight 0.5). The weighted average accuracy is computed as the overall score.

Implementation of ArmoRM

We use the Llama-3 8B [Meta, 2024] architecture and initialize the model backbone with parameters from a Bradley-Terry RM of Llama-3 8B trained by Dong et al. [2024]. We append a linear layer to the backbone, and train it with regression loss while keeping the backbone frozen. The training involves 19 objectives (including helpfulness, correctness, verbosity, etc.) from 8 datasets, with details presented in Appendix A.

Implementation of MoE

The gating layer is a ReLU MLP of 3 hidden layers with 1024 hidden units. For the correlation metric $\mathrm{Corr}$ in Eq. (3), we adopt the Spearman correlation [Spearman, 1904], and use UltraFeedback [Cui et al., 2023] as the reference data distribution $\mathcal{D}$. The scaling variable $\beta$ is initialized with a value of 100, and the gating layer is trained with the LLM backbone kept frozen. The training is conducted on 10 pairwise preference datasets, with details in Appendix A.

Software

Our training code is built with PyTorch [Paszke et al., 2019], HuggingFace’s Transformers [Wolf et al., 2019] and Scikit-learn [Pedregosa et al., 2011].

Hardware

Training ArmoRM (the multi-objective reward modeling stage) only involves training the last linear layer (i.e., linear probing), so we save features extracted from the backbone locally and then conduct linear probing with Scikit-learn’s linear regression solver on a CPU. For the MoE stage, we also save features locally, and then train the gating layer on a single NVIDIA A6000 GPU.

Hyperparameters

The gating layer is trained using the AdamW optimizer [Loshchilov and Hutter, 2019] with a learning rate of 0.001 for 10,000 steps with a batch size of 1024. We also apply a cosine decay learning rate scheduler.
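One common form of a cosine decay schedule is sketched below with the stated learning rate and step count as defaults. The absence of warmup and the decay to a final rate of zero are assumptions of this sketch; the text does not specify them.

```python
import math


def cosine_lr(step: int, total_steps: int = 10_000, base_lr: float = 1e-3,
              min_lr: float = 0.0) -> float:
    """Cosine-decayed learning rate at a given step (no warmup)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```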

Evaluation Benchmark

RewardBench [Lambert et al., 2024] is the first benchmark constructed to evaluate reward models for language modeling. It consists of a diverse set of tasks designed to assess the performance of reward models for LLM alignment, including four primary categories (Chat, Chat Hard, Safety, Reasoning) and a category of prior sets. Each category consists of multiple datasets with pairwise preference data, where each pair includes a chosen and a rejected text response. The overall score is computed as a weighted average over the five categories, where the four primary categories have weights 1.0 and the prior-sets category has weight 0.5.
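The overall score described above is a simple weighted mean. The sketch below uses made-up category accuracies purely to show the arithmetic, and the category key strings are labels chosen here.

```python
from typing import Dict

# Category weights as described in the text above.
WEIGHTS: Dict[str, float] = {
    "Chat": 1.0,
    "Chat Hard": 1.0,
    "Safety": 1.0,
    "Reasoning": 1.0,
    "Prior Sets": 0.5,
}


def overall_score(category_accuracy: Dict[str, float]) -> float:
    """Weighted average accuracy over the five categories."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[c] * category_accuracy[c] for c in WEIGHTS) / total_weight


# Made-up accuracies, purely to illustrate the arithmetic.
example = {"Chat": 90.0, "Chat Hard": 70.0, "Safety": 80.0,
           "Reasoning": 85.0, "Prior Sets": 60.0}
print(overall_score(example))
```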

Evaluation Results

Table 1 compares the performance of our approach (ArmoRM + MoE) against other reward models. Several key observations can be made from these results:

5 Conclusion

In this work, we addressed the critical issue of interpretability in reward models for RLHF in the context of aligning LLMs with human preferences. We proposed a novel two-stage approach, consisting of an ArmoRM and a MoE strategy with a gating network. Our ArmoRM, trained with Llama-3 8B, achieved state-of-the-art performance on RewardBench, demonstrating the effectiveness of our reward modeling approach.

References

Appendix A Experimental Details

Licenses

The model we use and fine-tune follows the Meta Llama3 license. All the datasets we use are open-sourced and can be used for research purposes (some could be used for commercial purposes, such as HelpSteer [Wang et al., 2023a]).

Personally Identifying Info or Offensive Content

For all datasets used in this work, according to their data curation process descriptions, they do not contain any information that names or uniquely identifies individual people, except for some examples that contain celebrity names. However, BeaverTails [Ji et al., 2023], PKU-RLHF [Ji et al., 2023], and HH-RLHF [Bai et al., 2022; Ganguli et al., 2022] contain offensive content, which is deliberately selected to build human preference datasets that aim to teach LLMs which responses are safe to generate.

Multi-Objective Training Datasets

In the stage of multi-objective reward modeling, we use training datasets with corresponding reward objectives detailed below.

Multi-Objective Data Pre-processing

When merging multiple datasets with absolute ratings (e.g., UltraFeedback and HelpSteer), we observe some issues with the data. Here, we present the issues and our approach to tackle them:

Training Data of MoE

In the gating-layer training stage, we use the following preference datasets:

Preference Data Pre-processing

For datasets that are not binarized into response pairs (e.g., HelpSteer, UltraFeedback, SHP), we take the binarized versions pre-processed in Dong et al. [2024].