Model Extrapolation Expedites Alignment

Chujie Zheng1,2 Ziqi Wang3 Heng Ji3 Minlie Huang1 Nanyun Peng2
1The CoAI Group, DCST, BNRist, Tsinghua University
2University of California, Los Angeles 3University of Illinois Urbana-Champaign
chujiezhengchn@gmail.com aihuang@tsinghua.edu.cn violetpeng@cs.ucla.edu
Work done during Chujie Zheng’s visit to UCLA. Project repository: <github.com/chujiezheng/LLM-Extrapolation>. Corresponding authors: Minlie Huang and Nanyun Peng.

Abstract

Given the high computational cost of preference alignment training of large language models (LLMs), exploring efficient methods to reduce the training overhead remains an important and compelling research problem. Motivated by the observation that alignment training typically involves only small parameter changes without injecting new knowledge into models, we propose a straightforward method called ExPO (model extrapolation) to expedite LLMs’ alignment with human preferences. Given a partially-trained model and its initial SFT checkpoint, ExPO improves the implicit optimization objective of alignment training by simply amplifying the parameter change based on a first-order approximation, without any additional training overhead. Through controlled experiments, we demonstrate that ExPO boosts a DPO model trained with only 20% steps to outperform the fully-trained one. Moreover, we show that ExPO notably improves existing open-source LLMs (ranging from 1.8B to 70B parameters) on the leading AlpacaEval 2.0 and MT-Bench benchmarks, which highlights ExPO’s broader utility in efficiently enhancing LLM alignment.


1 Introduction

After conventional unsupervised pre-training on massive textual corpora and supervised fine-tuning (SFT) on high-quality demonstration data, large language models (LLMs) usually require a dedicated training stage to align with human preferences (OpenAI, 2022, 2023; Bai et al., 2022), as exemplified by the well-known Reinforcement Learning from Human Feedback (RLHF; Ouyang et al. 2022; Schulman et al. 2017) and Direct Preference Optimization (DPO; Rafailov et al. 2023). However, alignment training still requires expensive computational resources (Ji et al., 2024; Meng et al., 2024), particularly for the larger-sized LLMs (e.g., 70B parameters). This underscores the significance of exploring more efficient alignment methods to reduce the training overhead.

Our work is first motivated by the observation that preference alignment training typically does not inject new knowledge into models, thereby likely inducing only small changes of model parameters. We support this hypothesis through three arguments. First, mainstream alignment algorithms like RLHF and DPO incorporate a constraint term (e.g., the KL divergence term) to prevent excessive deviation from the initial SFT checkpoint. Second, in recent open-source LLM alignment projects (Tunstall et al., 2023; Wang et al., 2023; Ivison et al., 2023), preference alignment training usually adopts smaller learning rates (e.g., 5e-7) and fewer training steps (e.g., 400 to 500 steps) than SFT. Third, we take the zephyr-7b-dpo model (Tunstall et al., 2023) trained by HuggingFace as a specific instance. For any two among the pre-trained, SFT, and DPO checkpoints and for any corresponding parameter tensors $\mathbf{P}_1$ and $\mathbf{P}_2$, we compute the Frobenius norm $\|\mathbf{P}_1 - \mathbf{P}_2\|$ (and a normalized variant; see footnote 1). In Table 1, we show that the parameter change of alignment training (i.e., from SFT to DPO) is fairly small: its normalized Frobenius distance is merely $6.348 \times 10^{-6}$, which is also significantly smaller than that of SFT (i.e., from Pre-trained to SFT). Therefore, in this work we hypothesize that preference alignment training usually involves only small parameter changes.

Footnote 1: The Frobenius norm of tensor $\mathbf{P}$ is defined as $\|\mathbf{P}\| = \sqrt{\sum_i P_i^2}$, while the normalized variant is defined as $\|\mathbf{P}\| = \sqrt{\frac{1}{|\mathbf{P}|}\sum_i P_i^2}$, where $|\mathbf{P}|$ denotes the number of elements of $\mathbf{P}$.
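
The Frobenius distance and its normalized variant described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each parameter tensor is represented as a flat list of floats, the helper names `frobenius_distance` and `normalized_frobenius_distance` are hypothetical, and the toy values are not taken from any real checkpoint.

```python
import math


def frobenius_distance(p1, p2):
    """Frobenius norm of the element-wise difference of two flattened tensors."""
    if len(p1) != len(p2):
        raise ValueError("tensors must have the same number of elements")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


def normalized_frobenius_distance(p1, p2):
    """Normalized variant: the squared sum is divided by the element count
    before taking the square root (a root-mean-square difference)."""
    return frobenius_distance(p1, p2) / math.sqrt(len(p1))


# Toy tensors (assumed values, for illustration only).
sft_tensor = [0.0, 0.0, 0.0, 0.0]
dpo_tensor = [1.0, 1.0, 1.0, 1.0]

print(frobenius_distance(sft_tensor, dpo_tensor))
print(normalized_frobenius_distance(sft_tensor, dpo_tensor))
```

The normalized variant does not grow with tensor size, which makes distances between tensors of different shapes comparable.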

Table 1: Parameter changes of zephyr-7b-dpo.

Based on this hypothesis, we formally apply a first-order approximation to the implicit optimization objective of alignment training. We empirically justify the soundness of this approximation with open-source LLMs, where we show that an interpolated model between the DPO/RLHF model and the initial SFT checkpoint generally exhibits intermediate alignment performance compared to the original models. Building upon the first-order approximation, we propose a straightforward method called ExPO (model extrapolation) to expedite LLMs’ alignment with human preferences. ExPO amplifies the parameter change of alignment training to improve the implicit optimization objective, thus bypassing the additional training overhead to achieve better alignment performance.

We conduct controlled experiments to validate ExPO’s effectiveness. We show that ExPO notably boosts the DPO models using fewer training steps (e.g., only 20%) to outperform the fully-trained one, with an improvement of up to 8.4% in length-controlled win rate on AlpacaEval 2.0 (Li et al., 2023). We then conduct ablation studies to identify several key factors influencing ExPO’s efficacy, including training data quality, training hyperparameters, and optimizer. Furthermore, we extend ExPO’s application to twelve open-source LLMs ranging from 1.8B to 70B parameters, which have undergone varied alignment training such as offline DPO, iterative DPO, or online RLHF. We show that ExPO consistently improves these LLMs by up to 4.5% on AlpacaEval 2.0 and 0.37 on MT-Bench (Zheng et al., 2023b), suggesting that ExPO can also serve as a practical and efficient means to compensate for potential training inadequacy of existing, already-aligned LLMs. In summary, our work demonstrates the efficacy of model extrapolation in enabling efficient LLM alignment, which can inspire follow-up studies and broader applications in future work.

2 Methodology

2.1 Formulation

We denote the language model’s parameter space as $\boldsymbol{\Theta}$ and suppose that the alignment performance can be quantified by a continuous scalar function $\omega: \boldsymbol{\Theta} \to \mathbb{R}$, where a higher $\omega(\boldsymbol{\theta})$ indicates better alignment with human preferences. In other words, $\omega(\boldsymbol{\theta})$ is the implicit optimization objective of alignment training. Note that $\omega(\boldsymbol{\theta})$ may not have an analytic form. In practice, we can employ a reward model as a proxy to compare the relative values of $\omega(\boldsymbol{\theta})$ by calculating the expected reward score on a development set of instructions. We suppose that the model $\mathcal{M}_1$ (parameterized by $\boldsymbol{\theta}_1$) has undergone moderate alignment training, and denote its SFT checkpoint as $\mathcal{M}_0$ (parameterized by $\boldsymbol{\theta}_0$), which is used for initializing $\mathcal{M}_1$ and satisfies $\omega(\boldsymbol{\theta}_0) < \omega(\boldsymbol{\theta}_1)$.

2.2 First-order Approximation

Based on the aforementioned observation, we suppose that the parameter change from $\mathcal{M}_0$ to $\mathcal{M}_1$, denoted as $\|\boldsymbol{\theta}_1 - \boldsymbol{\theta}_0\| = \|\Delta\boldsymbol{\theta}\|$, is small. We can formally perform a Taylor expansion of $\omega$ at $\boldsymbol{\theta}_0$ and retain the first-order term:

$$\omega(\boldsymbol{\theta}_0 + \gamma\Delta\boldsymbol{\theta}) \approx \omega(\boldsymbol{\theta}_0) + \gamma\nabla\omega(\boldsymbol{\theta}_0)\cdot\Delta\boldsymbol{\theta}, \tag{1}$$

where we define $\gamma \in [0,1]$ to ensure that $\|\gamma\Delta\boldsymbol{\theta}\|$ remains small. In particular, setting $\gamma = 1$ gives:

\begin{align}
& \omega(\boldsymbol{\theta}_1) \approx \omega(\boldsymbol{\theta}_0) + \nabla\omega(\boldsymbol{\theta}_0)\cdot\Delta\boldsymbol{\theta}, \tag{2}\\
\Longrightarrow\quad & \nabla\omega(\boldsymbol{\theta}_0)\cdot\Delta\boldsymbol{\theta} \approx \omega(\boldsymbol{\theta}_1) - \omega(\boldsymbol{\theta}_0) > 0. \tag{3}
\end{align}

Thus, the first-order approximation (Equation 1) essentially predicts that $\omega(\boldsymbol{\theta}_0 + \gamma\Delta\boldsymbol{\theta})$ will improve as $\gamma \in [0,1]$ increases.
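
The monotonicity claim follows in one step from Equations 1 and 3. As a short worked derivation (it relies on the same first-order assumption and is not an additional result), differentiating the right-hand side of Equation 1 with respect to $\gamma$ gives a constant slope whose sign is fixed by Equation 3:

```latex
\frac{\mathrm{d}}{\mathrm{d}\gamma}
\Big[\,\omega(\boldsymbol{\theta}_0) + \gamma\,\nabla\omega(\boldsymbol{\theta}_0)\cdot\Delta\boldsymbol{\theta}\,\Big]
= \nabla\omega(\boldsymbol{\theta}_0)\cdot\Delta\boldsymbol{\theta}
\approx \omega(\boldsymbol{\theta}_1) - \omega(\boldsymbol{\theta}_0) > 0 .
```

Under the approximation, the predicted objective is therefore linear and increasing in $\gamma$.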

Figure 1: Interpolated models usually exhibit intermediate performance between the original DPO/RLHF models and the SFT checkpoints, while their performance improves with increasing $\gamma$ in Equation 1.

To verify this, we conduct experiments using several open-source DPO/RLHF LLMs (Tunstall et al., 2023; Cai et al., 2024; Zhu et al., 2023). We vary $\gamma$ within $[0,1]$ and construct interpolated models parameterized by $\boldsymbol{\theta}_0 + \gamma\Delta\boldsymbol{\theta} = (1-\gamma)\boldsymbol{\theta}_0 + \gamma\boldsymbol{\theta}_1$. Their alignment performance is evaluated on the UltraFeedback (Cui et al., 2023) development set using two open-source reward models: RM-Mistral-7B and FsfairX-LLaMA3-RM-v0.1 (detailed experimental setups are described in Section 3.1). Notably, when $\gamma = 0$ or $1$, the constructed models degenerate to the original SFT checkpoint $\mathcal{M}_0$ and the DPO/RLHF model $\mathcal{M}_1$, respectively. The results in Figure 1 show that the interpolated models constructed via $\boldsymbol{\theta}_0 + \gamma\Delta\boldsymbol{\theta}$ can generate fluent and coherent responses. Moreover, their alignment performance always lies between the original SFT model $\mathcal{M}_0$ and the DPO/RLHF model $\mathcal{M}_1$, and improves with increasing $\gamma$, which is consistent with the predictions of the first-order approximation. We thereby empirically justify the soundness of the first-order approximation.
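
Constructing an interpolated model is a purely element-wise operation on the two checkpoints. The following is a minimal sketch, assuming each checkpoint is a dict that maps parameter names to flat lists of floats; the function name `interpolate` and the toy parameter name `layer.weight` are hypothetical and not taken from the project repository.

```python
def interpolate(theta0, theta1, gamma):
    """Return theta0 + gamma * (theta1 - theta0) for every named tensor.

    theta0: SFT checkpoint, dict of name -> list of floats
    theta1: aligned (DPO/RLHF) checkpoint with the same names and sizes
    gamma:  interpolation weight, expected to lie in [0, 1]
    """
    if theta0.keys() != theta1.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1] for interpolation")
    return {
        name: [w0 + gamma * (w1 - w0) for w0, w1 in zip(theta0[name], theta1[name])]
        for name in theta0
    }


# Toy checkpoints (assumed values, for illustration only).
theta_sft = {"layer.weight": [0.0, 1.0]}
theta_dpo = {"layer.weight": [1.0, 3.0]}

for gamma in (0.0, 0.5, 1.0):
    print(gamma, interpolate(theta_sft, theta_dpo, gamma))
```

With `gamma=0.0` the sketch returns the SFT parameters and with `gamma=1.0` the aligned parameters, matching the two degenerate cases noted above.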

2.3 ExPO: Model Extrapolation

In the above first-order approximation, we constrain $\gamma \in [0,1]$ to maintain the approximation’s validity along the straight-line path between $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$. We now consider extending this approximation to the “extension” of the line connecting $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$ beyond $\boldsymbol{\theta}_1$. Let $\gamma > 1$ and define $\alpha = \gamma - 1 > 0$, denoting $\boldsymbol{\theta}_2 = \boldsymbol{\theta}_0 + \gamma\Delta\boldsymbol{\theta} = \boldsymbol{\theta}_0 + (1+\alpha)\Delta\boldsymbol{\theta}$. By choosing appropriate $\alpha$ such that $\|(1+\alpha)\Delta\boldsymbol{\theta}\|$ remains small, we can reformulate the first-order approximation as:

\begin{align}
\omega(\boldsymbol{\theta}_2) &\approx \omega(\boldsymbol{\theta}_0) + (1+\alpha)\nabla\omega(\boldsymbol{\theta}_0)\cdot\Delta\boldsymbol{\theta} && \text{(by Equation 1)} \tag{4}\\
&\approx \omega(\boldsymbol{\theta}_1) + \alpha\nabla\omega(\boldsymbol{\theta}_0)\cdot\Delta\boldsymbol{\theta}. && \text{(by Equation 2)} \tag{5}
\end{align}

Since $\alpha > 0$, Equation 3 implies that the last term of Equation 5 is positive, so we approximately have $\omega(\boldsymbol{\theta}_2) > \omega(\boldsymbol{\theta}_1)$. This suggests that, starting from a partially-aligned model $\mathcal{M}_1$ and its SFT checkpoint $\mathcal{M}_0$, by selecting appropriate $\alpha > 0$, we can construct a new model $\mathcal{M}_2$ parameterized by $\boldsymbol{\theta}_2$ through amplifying the parameter change $\Delta\boldsymbol{\theta}$:

$$\boldsymbol{\theta}_2 = \boldsymbol{\theta}_0 + (1+\alpha)\Delta\boldsymbol{\theta} = \boldsymbol{\theta}_1 + \alpha\Delta\boldsymbol{\theta}, \tag{6}$$

such that $\mathcal{M}_2$ achieves better alignment performance than $\mathcal{M}_1$. Consequently, we improve the implicit optimization objective $\omega(\boldsymbol{\theta})$ of alignment training without requiring additional training.

Since the process of Equation 6 essentially “extrapolates” the parameters of $\mathcal{M}_1$ along the line connecting $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$, we refer to the procedure defined by Equation 6 as ExPO (model extrapolation). Figure 2 illustrates the ExPO method, where the orange curve from $\boldsymbol{\theta}_0$ to $\boldsymbol{\theta}_1$ indicates the actual training trajectory from $\mathcal{M}_0$ to $\mathcal{M}_1$, and the straight orange line from $\boldsymbol{\theta}_1$ to $\boldsymbol{\theta}_2$ denotes the extrapolation from $\mathcal{M}_1$ to $\mathcal{M}_2$. In practice, the hyperparameter $\alpha$ in Equation 6 (controlling the extrapolation length) can be tuned using inference-level computational resources. For example, hyperparameter search for a 7B model requires only a single A10 24GB GPU, while a 70B model needs two A100 80GB GPUs. As high-performance LLM inference frameworks like vLLM (Kwon et al., 2023) and SGLang (Zheng et al., 2023c) continue to rapidly develop, the costs of hyperparameter search will keep decreasing.
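
Equation 6 amounts to one element-wise update per parameter tensor. The following is a minimal sketch, assuming both checkpoints are dicts that map parameter names to flat lists of floats; `expo_extrapolate` is a hypothetical helper name, not a function from the project repository, and the toy values are for illustration only.

```python
def expo_extrapolate(theta0, theta1, alpha):
    """Apply Equation 6: theta2 = theta1 + alpha * (theta1 - theta0).

    theta0: SFT checkpoint, dict of name -> list of floats
    theta1: partially aligned checkpoint with the same names and sizes
    alpha:  extrapolation length (alpha = 0 returns a copy of theta1)
    """
    if theta0.keys() != theta1.keys():
        raise ValueError("checkpoints must contain the same parameter names")
    if alpha < 0:
        raise ValueError("alpha must be non-negative for extrapolation")
    return {
        name: [w1 + alpha * (w1 - w0) for w0, w1 in zip(theta0[name], theta1[name])]
        for name in theta0
    }


# Toy checkpoints (assumed values, for illustration only).
theta_sft = {"w": [1.0, 2.0]}
theta_dpo = {"w": [1.5, 1.0]}

theta_expo = expo_extrapolate(theta_sft, theta_dpo, alpha=0.5)
print(theta_expo)
```

Because the update needs only the two checkpoints and no gradients, it can be run on CPU; the GPU cost mentioned above comes from evaluating candidate $\alpha$ values by inference.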

Figure 2: The orange curve indicates the training trajectory from $\boldsymbol{\theta}_0$ to $\boldsymbol{\theta}_1$, while the orange line denotes the extrapolation from $\boldsymbol{\theta}_1$ along $\Delta\boldsymbol{\theta}$, thus producing $\boldsymbol{\theta}_2$.

Connection to Model Averaging/Interpolation

It is worth noting that the idea of “model averaging” has been explored in prior work. Specifically, previous work has discovered that deep neural networks often exhibit mode connectivity (Garipov et al., 2018; Entezari et al., 2022; Zhao et al., 2020; Frankle et al., 2020). This property implies that between two local optima in the parameter space, there typically exists a path where model performance (e.g., validation accuracy or loss) does not degrade significantly during traversal. Empirical studies (Izmailov et al., 2018; Lin et al., 2024; Wortsman et al., 2022) have shown that even with simple linear interpolation paths between two local optima, the loss along the path remains low, and performance often lies between the original models, which is consistent with our observations in Figure 1. Recent LLM research (Lin et al., 2023; Yu et al., 2024; Akiba et al., 2024; Goddard et al., 2024) has further explored interpolation across multiple fine-tuned models (i.e., models initialized from the same pre-trained checkpoint but fine-tuned on different data) to create new models with combined capabilities. Note that Equation 6 can be rewritten as: $\boldsymbol{\theta}_2 = (1-\gamma)\boldsymbol{\theta}_0 + \gamma\boldsymbol{\theta}_1$, which means ExPO can be viewed as a generalized form of model interpolation with weights exceeding 1. Hence, the hypothesis we formulated based on the characteristics of preference alignment (i.e., small parameter changes) and the derived ExPO method essentially extend the weight range of traditional model interpolation (from $[0,1]$ to $(1,+\infty)$).

In the following sections, we will conduct extensive experiments to validate the effectiveness of ExPO in reducing the computational costs of preference alignment training.

3 Controlled Experiments

3.1 Setup and Evaluation Protocol

Models and Training Recipe

Our controlled experiments are based on the training recipe of the zephyr-7b-dpo model. Specifically, we use the UltraFeedback (Cui et al., 2023) dataset for model training, which contains diverse instruction-response pairs with GPT-4-annotated preference labels and is split into 61K and 1K data as the training and development sets, respectively. For DPO training, we use zephyr-7b-dpo’s SFT checkpoint for model initialization and as the reference model. We adopt the global batch size of 128, the learning rate of 5e-7, and the AdamW optimizer (Loshchilov and Hutter, 2019). Note that while zephyr-7b-dpo is trained for 478 steps in total (i.e., one epoch), in § 3.2 we will vary the training steps, or equivalently, the training data size. We train the models on 8 A100 80GB GPUs.

Inference Details

We employ the vLLM (Kwon et al., 2023) library for high-throughput model inference. We use top-$k$ ($k = 40$) and nucleus sampling (Holtzman et al., 2020) ($p = 0.9$) with a temperature of 0.7. To avoid repetition in generated texts, we set both the factors of presence penalty and frequency penalty to 0.1. We set the sampling random seed to 42.
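
For reference, the decoding settings above can be collected in a single configuration mapping. This is a sketch only: the source does not show the actual inference call, and the key names are chosen here on the assumption that they mirror the field names of vLLM's `SamplingParams` class, so that the mapping could be passed as `SamplingParams(**SAMPLING_CONFIG)` where vLLM is installed.

```python
# Decoding settings as stated in the text; key names are an assumption
# (chosen to match vLLM's SamplingParams field names).
SAMPLING_CONFIG = {
    "temperature": 0.7,        # softmax temperature
    "top_k": 40,               # top-k sampling
    "top_p": 0.9,              # nucleus sampling threshold
    "presence_penalty": 0.1,   # discourages reusing tokens already present
    "frequency_penalty": 0.1,  # discourages frequently repeated tokens
    "seed": 42,                # sampling random seed
}

for key, value in SAMPLING_CONFIG.items():
    print(f"{key} = {value}")
```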

To determine the optimal $\alpha$ value in ExPO, we use a combination of binary search and grid search with manually tuned intervals (see Appendix B for details). We select the $\alpha$ giving the highest expected reward on the UltraFeedback development set (1K instructions), as calculated by the reward model RM-Mistral-7B.
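
The selection rule (keep the $\alpha$ with the highest expected development-set reward) can be sketched as a plain grid search. This is a simplification, not the procedure of Appendix B: it omits the binary-search stage and the manually tuned intervals. The names `select_alpha` and `toy_expected_reward` are hypothetical, and the toy reward function merely stands in for "build the extrapolated model, generate on the development set, and score with a reward model".

```python
def select_alpha(candidates, expected_reward):
    """Return (best_alpha, best_reward) over the candidate values.

    expected_reward: callable alpha -> float. alpha = 0.0 (no extrapolation)
    is the baseline, so it is returned when no candidate improves on it.
    """
    best_alpha, best_reward = 0.0, expected_reward(0.0)
    for alpha in candidates:
        reward = expected_reward(alpha)
        if reward > best_reward:
            best_alpha, best_reward = alpha, reward
    return best_alpha, best_reward


def toy_expected_reward(alpha):
    """Toy stand-in (an assumption): reward peaks at alpha = 0.5, then drops."""
    return -(alpha - 0.5) ** 2


best_alpha, best_reward = select_alpha(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], toy_expected_reward
)
print(best_alpha)
```

Using the unextrapolated model as the baseline means the search never returns a model that scores worse than $\mathcal{M}_1$ on the development set.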

Evaluation Protocol

We resort to AlpacaEval 2.0 (Li et al., 2023) for model evaluation, which is a leading benchmark that assesses LLMs’ instruction-following ability and their alignment with human preferences. It contains a fixed set of 805 instructions chosen to be representative of real user cases. For each instruction, it calculates the probability that a GPT-4 Turbo evaluator prefers the output of the evaluated model over the GPT-4 baseline, thus providing an affordable and replicable alternative to human annotation. The win rate over the GPT-4 baseline is computed as the expected preference probability, while the length-controlled (LC) win rate (Dubois et al., 2024) alleviates the length bias of the GPT-4 Turbo evaluator (i.e., the prior preference toward longer responses).

In § 3.2, we report both the raw and LC win rates, as well as the expected reward score computed over the 805 instructions. For subsequent experiments, unless otherwise stated, we report the expected reward score on the UltraFeedback development set (1K instructions) for ease of analysis.

Table 2: Evaluation results on AlpacaEval 2.0 of applying ExPO to DPO models trained with varying steps ($\mathcal{M}_1^{*}$).

Figure 3: Reward distribution on UltraFeedback (development set) for the extrapolated models in Table 2.

Figure 4: $\mathcal{M}_2$’s reward scores and response lengths on UltraFeedback (development set) varying with $\alpha$ (x-axis) for the partially-trained DPO models in § 3.2. Dashed vertical lines correspond to the optimal $\alpha$ values. $\alpha = 0$ indicates that ExPO is not applied (i.e., $\mathcal{M}_1$).

3.2 Analysis of Varying Training Steps

We first investigate whether ExPO can enhance LLMs with limited alignment training. Given that the full training of zephyr-7b-dpo consists of 478 steps (one epoch over the UltraFeedback training data), we initialize from the same SFT checkpoint ($\mathcal{M}_0$) and use the aforementioned training configuration to train DPO models ($\mathcal{M}_1^{*}$) with 10%, 20%, and 40% of the full training steps. We directly use zephyr-7b-dpo as the 100%-step (full-training) model $\mathcal{M}_1^{100\%}$. For these DPO models, we apply ExPO to derive extrapolated models $\mathcal{M}_2^{*}$.

Main Results

As shown in Table 2, while fewer training steps generally yield lower alignment performance, ExPO effectively bridges the gap caused by reduced training steps. For example, ExPO raises the LC win rate from 10.4% ($\mathcal{M}_1^{10\%}$) to 16.3% ($\mathcal{M}_2^{10\%}$) and from 12.9% ($\mathcal{M}_1^{20\%}$) to 21.3% ($\mathcal{M}_2^{20\%}$), enabling these extrapolated models to match or even surpass the fully-trained $\mathcal{M}_1^{100\%}$.

Hyperparameter Search Analysis

The optimal $\alpha$ values for $\mathcal{M}_2^{10\%}$, $\mathcal{M}_2^{20\%}$, $\mathcal{M}_2^{40\%}$, and $\mathcal{M}_2^{100\%}$ are 8.0, 2.5, 0.5, and 0.3, respectively. Figure 3 illustrates the reward distributions of these extrapolated models, showing that their response distributions shift toward higher reward regions compared to the original $\mathcal{M}_1^{*}$ models. In Figure 4, we show that increasing $\alpha$ within a reasonable range consistently improves alignment performance. However, excessively large $\alpha$ causes sharp performance drops and abnormal response length increases (e.g., generating gibberish or failing to terminate). This indicates that overly large $\alpha$ violates the first-order approximation (Equation 4) as $\|(1+\alpha)\Delta\boldsymbol{\theta}\|$ becomes too large. Additionally, since more training steps lead to larger $\|\Delta\boldsymbol{\theta}\|$, smaller $\alpha$ values are required for models with more training steps (e.g., $\mathcal{M}_1^{100\%}$) to maintain the validity of Equation 4, which is consistent with our hyperparameter search results.

Table 3: Ablation results on UltraFeedback (development set) of adjusting training data quality. “N/A” denotes that the reward score does not improve after applying ExPO with the smallest $\alpha = 0.1$.

Computational Cost Analysis

The fully-trained model $\mathcal{M}_1^{100\%}$ requires about 12 GPU hours (A100 80GB). In contrast, the hyperparameter search for $\mathcal{M}_2^{20\%}$ takes about 0.5 GPU hours; combined with the roughly 2.5 GPU hours needed to train $\mathcal{M}_1^{20\%}$, the total cost is about 3 GPU hours, a 75% reduction compared to full training while achieving comparable or better alignment performance. Moreover, ExPO’s hyperparameter search, which only involves model inference, also significantly reduces hardware requirements, e.g., a 7B model requires only a single A10 24GB GPU for search, whereas training typically needs 8 A100 80GB GPUs. The above results reaffirm the soundness of the first-order approximation and demonstrate ExPO’s effectiveness in reducing computational costs for LLM alignment.
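
The 75% figure follows directly from the approximate GPU-hour numbers quoted above. As a quick check (the variable names are illustrative, and the inputs are the rounded values from the text rather than measured timings):

```python
# Approximate GPU-hour figures quoted in the text (A100 80GB).
full_training_gpu_hours = 12.0     # fully-trained DPO model
partial_training_gpu_hours = 2.5   # DPO model trained for 20% of the steps
alpha_search_gpu_hours = 0.5       # ExPO hyperparameter search

expo_total_gpu_hours = partial_training_gpu_hours + alpha_search_gpu_hours
cost_reduction = 1.0 - expo_total_gpu_hours / full_training_gpu_hours

print(f"ExPO total: {expo_total_gpu_hours} GPU hours")
print(f"Reduction vs. full training: {cost_reduction:.0%}")
```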

Other Observations

We also observe two other noteworthy phenomena: (1) Extrapolated alignment performance does not strictly increase with training steps. For example, $\mathcal{M}_2^{20\%}$ outperforms $\mathcal{M}_2^{100\%}$, suggesting ExPO’s efficacy depends on factors like training data and hyperparameters. We will explore these factors in § 3.3 and 3.4. (2) Even fully trained models like $\mathcal{M}_1^{100\%}$ benefit from ExPO (LC win rate increases by 2.8%), indicating that existing already-aligned models may not be fully optimized, and ExPO can fill this gap. We will apply ExPO to more existing, already-aligned models in § 4.1.

3.3 Analysis of Training Data Quality

In the previous section, we observed that alignment performance after model extrapolation does not strictly improve with increased training steps. We conjecture that this occurs because more training makes the model more prone to learning spurious features from data, such as length bias (Park et al., 2024). (Footnote 2: In the UltraFeedback training set, preferred and non-preferred responses have average lengths of 319 and 277 tokens, respectively.) According to Equation 6, under our controlled experimental setup where all $\mathcal{M}_{1}$ are initialized from the same SFT model $\mathcal{M}_{0}$ and ${\bm{\theta}}_{0}$, the highest achievable performance of the extrapolated model $\mathcal{M}_{2}$ is uniquely determined by $\Delta{\bm{\theta}}$. Hence, ExPO’s effectiveness requires $\Delta{\bm{\theta}}$ to indicate the direction that genuinely improves alignment performance. Learning spurious features like length bias degrades the “quality” of $\Delta{\bm{\theta}}$, thus undermining the extrapolation performance.
Figure 5 illustrates this phenomenon: as training steps increase (from ${\bm{\theta}}_{1}$ to ${\bm{\theta}}_{1}^{\prime}$), the model can learn spurious features from training data, leading to the degraded alignment performance of extrapolated models (e.g., ${\bm{\theta}}_{2}^{\prime}$ underperforms ${\bm{\theta}}_{2}$).


Figure 5: Increasing training steps (from ${\bm{\theta}}_{1}$ to ${\bm{\theta}}^{\prime}_{1}$) can make the model more prone to learning spurious features from training data, such as length bias. This consequently impairs the direction of $\Delta{\bm{\theta}}$ and the achievable performance of ExPO (e.g., ${\bm{\theta}}^{\prime}_{2}$ underperforms ${\bm{\theta}}_{2}$).

Table 4: Ablation results of the training epochs, learning rate, and optimizer on UltraFeedback (development set).

To analyze how training data quality affects ExPO’s effectiveness in a controlled manner, we take length bias as an example and manually inject it into the training data. Unlike the random sampling in § 3.2, we sort the UltraFeedback training data by the length difference between preferred and non-preferred responses in descending order. We then train models on the sorted samples in order, so that models prioritize learning from samples with larger length differences. From Table 3, while introducing length bias temporarily boosts reward scores ($\mathcal{M}_{1}^{10\%,\mathrm{b}}$ and $\mathcal{M}_{1}^{20\%,\mathrm{b}}$ outperform $\mathcal{M}_{1}^{10\%}$ and $\mathcal{M}_{1}^{20\%}$), extrapolated models consistently underperform ($\mathcal{M}_{2}^{10\%,\mathrm{b}}$ and $\mathcal{M}_{2}^{20\%,\mathrm{b}}$ are worse than $\mathcal{M}_{2}^{10\%}$ and $\mathcal{M}_{2}^{20\%}$). Moreover, the optimal $\alpha$ values for $\mathcal{M}_{2}^{10\%,\mathrm{b}}$ and $\mathcal{M}_{2}^{20\%,\mathrm{b}}$ are 0.2 and 0.4, which are far smaller than those for $\mathcal{M}_{2}^{10\%}$ (8.0) and $\mathcal{M}_{2}^{20\%}$ (2.5). For $\mathcal{M}_{1}^{40\%,\mathrm{b}}$, ExPO even fails to yield any improvement. These results demonstrate that training on biased or low-quality data (e.g., with length bias) causes $\Delta{\bm{\theta}}$ to fail to indicate the direction that genuinely improves alignment performance, thereby diminishing the benefits of model extrapolation.
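The length-bias injection described above amounts to reordering the preference pairs before training. The following is a minimal sketch of that sorting step, under two assumptions that are not stated in the text: the length difference is taken as signed (preferred minus non-preferred), and whitespace splitting stands in for the model tokenizer. The field names `chosen` and `rejected` are hypothetical.

```python
def sort_by_length_gap(pairs):
    """Order preference pairs so the largest (chosen - rejected) length gap comes first."""
    def gap(pair):
        # Assumption: whitespace tokens approximate tokenizer lengths.
        return len(pair["chosen"].split()) - len(pair["rejected"].split())
    return sorted(pairs, key=gap, reverse=True)

# Toy preference pairs (hypothetical data, for illustration only).
pairs = [
    {"chosen": "a b c", "rejected": "a b"},      # gap = 1
    {"chosen": "a b c d e f", "rejected": "a"},  # gap = 5
    {"chosen": "a", "rejected": "a b c"},        # gap = -2
]

ordered = sort_by_length_gap(pairs)
# Training on `ordered` without shuffling exposes the model to the
# pairs with the largest length gaps first.
```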

3.4 Analysis of Training Configurations

Next, we analyze how specific training hyperparameters influence ExPO’s effectiveness. Since ExPO amplifies the parameter change $\Delta{\bm{\theta}}$ from $\mathcal{M}_{0}$ to $\mathcal{M}_{1}$, we investigate whether ExPO is equivalent to directly increasing the magnitude of parameter changes, such as by raising the training epochs or learning rate. Additionally, since the training trajectory from $\mathcal{M}_{0}$ to $\mathcal{M}_{1}$ (and the resulting $\Delta{\bm{\theta}}$) is closely tied to the gradient descent algorithm, we also explore the impact of the optimizer on ExPO’s effectiveness. All experiments use the model trained with 20% steps in § 3.2 as the baseline and follow the default training data and configurations.
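To make "amplifying the parameter change" concrete, here is a minimal sketch of uniform extrapolation. It assumes the update takes the form $\bm{\theta}_{2} = \bm{\theta}_{1} + \alpha\,\Delta\bm{\theta}$ with $\Delta\bm{\theta} = \bm{\theta}_{1} - \bm{\theta}_{0}$. Plain dictionaries of float lists stand in for real model state dicts, and the function name `extrapolate` is hypothetical.

```python
def extrapolate(theta0, theta1, alpha):
    """Return theta2 = theta1 + alpha * (theta1 - theta0), parameter by parameter."""
    if theta0.keys() != theta1.keys():
        raise ValueError("M0 and M1 must share the same parameter names")
    theta2 = {}
    for name in theta1:
        theta2[name] = [
            w1 + alpha * (w1 - w0)
            for w0, w1 in zip(theta0[name], theta1[name])
        ]
    return theta2

# Toy parameters (hypothetical): M0 is the SFT checkpoint, M1 the aligned model.
theta0 = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
theta1 = {"layer.weight": [1.5, 2.0], "layer.bias": [-0.25]}

theta2 = extrapolate(theta0, theta1, alpha=0.5)
print(theta2)
```

With $\alpha = 0$ the sketch returns $\mathcal{M}_{1}$ unchanged. Unlike extra epochs or a larger learning rate, it only rescales the existing $\Delta\bm{\theta}$ and never computes a gradient.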

Training Epochs and Learning Rate

We increase the training epochs or learning rate for $\mathcal{M}_{1}$. Table 4 shows that while both adjustments improve $\mathcal{M}_{1}$’s performance, they also reduce the benefits of model extrapolation (lower $\mathcal{M}_{2}$ performance) and yield smaller optimal $\alpha$ values. Meanwhile, the $\mathcal{M}_{1}$ models trained with more epochs or larger learning rates generate significantly longer responses compared to the default setup. This suggests that both adjustments also make models prone to learning the length bias in training data, thereby degrading $\Delta{\bm{\theta}}$’s quality and the gains from ExPO. Notably, when training epochs are set to 3, $\mathcal{M}_{1}$ cannot benefit from ExPO, likely because the first-order approximation (Equation 4) no longer holds as $\left\|\Delta{\bm{\theta}}\right\|$ becomes too large.

Optimizer

We train $\mathcal{M}_{1}$ using three popular optimizers: AdamW (Loshchilov and Hutter, 2019) (default), AdaGrad (Duchi et al., 2011), and RMSprop (Hinton, 2012). Table 4 shows that while AdaGrad converges slowest (lowest $\mathcal{M}_{1}$ performance), it achieves the highest extrapolated alignment performance ($\mathcal{M}_{2}$), slightly surpassing AdamW. Conversely, RMSprop, while yielding the best $\mathcal{M}_{1}$ performance, results in the poorest $\mathcal{M}_{2}$ performance. AdamW, as the dominant optimizer in modern LLM training, strikes a balance between convergence efficiency and extrapolated performance. These results highlight that different optimizers significantly affect $\Delta{\bm{\theta}}$’s quality and extrapolation outcomes.

Table 5: Evaluation results on AlpacaEval 2.0 and MT-Bench of applying ExPO to existing DPO/RLHF LLMs.

4 Extended Applications of ExPO

4.1 Applying ExPO to More Existing, Already-aligned LLMs

In § 3.2, we observed that ExPO also brings noticeable performance improvements to the fully-trained zephyr-7b-dpo. This motivates us to apply ExPO to more existing, already-aligned LLMs. As hypothesized in § 1, the normally-trained models should also satisfy the first-order approximation premise, i.e., $\left\|\Delta{\bm{\theta}}\right\|$ is small. We select twelve open-source models from HuggingFace for experiments (see Appendix C for their model IDs):

These models cover a diverse range of model sizes (from 1.8B to 70B) and span three mainstream alignment algorithms widely used in practice.

Based on our hyperparameter search experience for zephyr-7b-dpo in § 3.2 (Appendix B), for the twelve models above, we conduct a simple grid search for the optimal $\alpha$, using an interval of 0.1 within [0.1, 0.5]. In addition to AlpacaEval 2.0, we also evaluate these models on MT-Bench (Zheng et al., 2023b), another leading benchmark for assessing instruction-tuned LLMs’ general and multi-turn ability. It contains a set of challenging multi-turn open-ended questions covering topics such as writing, role-playing, math, coding, and more. The model-generated answers are judged by GPT-4 via a scalar score (from 1 to 10).
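The grid search over $\alpha$ can be sketched as follows. The scoring function is a hypothetical stand-in: in the setting described here it would correspond to extrapolating with a given $\alpha$, generating responses on a development set, and scoring them. `toy_score` merely makes the example runnable and does not reflect any reported result.

```python
def grid_search_alpha(score_fn, low=0.1, high=0.5, step=0.1):
    """Evaluate score_fn on an evenly spaced alpha grid; return (best_alpha, best_score)."""
    n_points = round((high - low) / step) + 1
    # Round each grid point to suppress floating-point drift (e.g. 0.30000000000000004).
    grid = [round(low + i * step, 10) for i in range(n_points)]
    scored = [(alpha, score_fn(alpha)) for alpha in grid]
    return max(scored, key=lambda item: item[1])

def toy_score(alpha):
    # Hypothetical score curve peaking at alpha = 0.3 (illustration only).
    return -(alpha - 0.3) ** 2

best_alpha, best_score = grid_search_alpha(toy_score)
print(best_alpha)
```

With the defaults, the grid has five points, so each model requires five inference-only evaluations.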

In Table 5, we show that ExPO consistently improves the evaluated LLMs, with notable improvements of up to 10.1% win rate and 4.5% LC win rate on AlpacaEval 2.0 (for internlm2-20b and tulu2-70b, respectively) and 0.37 on MT-Bench (for llama3-8b-iter). This suggests that existing, already-aligned LLMs may still not have been trained to optimality or “saturation”. ExPO offers a practical and efficient means to compensate for potentially inadequate training of existing LLMs (or to squeeze more alignment performance out of these models), as it only requires inference-level hardware resources and bypasses the costly additional training overhead.

Table 6: Evaluation results on UltraFeedback of applying ExPO to models trained via different algorithms.

4.2 Applying ExPO to More Alignment Algorithms

So far, we have primarily applied ExPO to models trained via the dominant DPO or RLHF algorithms (§ 3 and 4.1). Since ExPO does not assume a specific training method for $\mathcal{M}_{1}$, we expect that ExPO can be applied to models trained via algorithms other than DPO or RLHF. To this end, we use a series of Mistral/LLaMA-3 models released by Meng et al. (2024), which are trained via various alignment algorithms and are all initialized from the same SFT checkpoints. These algorithms include: RRHF (Yuan et al., 2023), SLiC-HF (Zhao et al., 2023a), IPO (Azar et al., 2024), CPO (Xu et al., 2024), KTO (Ethayarajh et al., 2024), R-DPO (Park et al., 2024), and SimPO (Meng et al., 2024). We refer readers to Meng et al. (2024) for elaboration on these algorithms’ optimization objectives as well as the models’ training configurations. Following our previous experience, we search for the optimal $\alpha$ value within the range of [0.1, 0.5] with an interval of 0.1.

As shown in Table 6, ExPO effectively complements various alignment training algorithms. While these models have been carefully tuned according to Meng et al. (2024), they still benefit from model extrapolation. This indicates that ExPO does not rely on specific alignment algorithms but instead generalizes across diverse methods, showcasing its broad compatibility and practical utility.

4.3 Discussion on Failure Cases

Finally, we discuss the failure cases we encountered when applying ExPO to a wider variety of models. (1) ExPO supposes $\mathcal{M}_{0}$ is an SFT model and $\mathcal{M}_{1}$ is one that further undergoes alignment training. However, when we instead used a pre-trained model as $\mathcal{M}_{0}$ and an SFT one as $\mathcal{M}_{1}$, we found that model extrapolation usually cannot improve alignment performance and can even lead to model collapse (e.g., the extrapolated model struggles to generate the EOS token or mistakenly generates special tokens). We speculate that this is because SFT typically adopts a larger learning rate and more training steps, and serves to adapt models to the chat templates (Zheng, 2024), so new knowledge is actually injected into models. (2) Another type of failure case is also related to model overfitting. For example, the Storm-7B model (Liu et al., 2024a) is trained via iterative DPO for three iterations. When experimenting with this model, we found that applying ExPO with even the very small $\alpha=0.1$ results in severe model collapse, probably because the model overfits to its employed reward model during iterative DPO training.

In both cases, ExPO’s underlying first-order approximation can become invalidated as the resulting $\left\|\Delta{\bm{\theta}}\right\|$ is too large. Therefore, we suggest that more deliberate strategies are needed when applying ExPO to models with large parameter changes, e.g., by leveraging the intermediate checkpoints. We note that recent work has made promising exploration (Lin et al., 2025) and expect more follow-up studies in future work.
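One way to inspect how large a parameter change is before extrapolating is to measure $\left\|\Delta\bm{\theta}\right\|$ relative to $\left\|\bm{\theta}_{0}\right\|$. The sketch below is only an illustrative diagnostic: the choice of a relative L2 norm is an assumption, the helper name `relative_change` is hypothetical, and the text gives no threshold at which the first-order approximation breaks down.

```python
import math

def relative_change(theta0, theta1):
    """Return ||theta1 - theta0||_2 / ||theta0||_2 over all parameters."""
    diff_sq = 0.0
    base_sq = 0.0
    for name, weights0 in theta0.items():
        for w0, w1 in zip(weights0, theta1[name]):
            diff_sq += (w1 - w0) ** 2
            base_sq += w0 ** 2
    return math.sqrt(diff_sq) / math.sqrt(base_sq)

# Toy parameters (hypothetical values, for illustration only).
theta0 = {"w": [3.0, 4.0]}
theta1 = {"w": [3.0, 4.5]}
ratio = relative_change(theta0, theta1)
print(f"relative parameter change: {ratio:.3f}")
```

Comparing this ratio across checkpoint pairs (e.g., pre-trained to SFT versus SFT to aligned) could indicate which pairs involve larger parameter changes, though any cutoff would need to be established empirically.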

5 Conclusion

This work demonstrates the efficacy of the ExPO (model extrapolation) method in enabling more efficient LLM alignment with human preferences. ExPO builds upon the hypothesis that alignment training typically involves only small changes of model parameters. Given a partially-trained model $\mathcal{M}_{1}$ and its initial SFT checkpoint, ExPO improves the implicit optimization objective of alignment training by simply amplifying the parameter change based on a first-order approximation, thus directly achieving better alignment performance without additional training overhead. We empirically validate ExPO’s effectiveness through controlled experiments, where we show that the DPO model trained with 20% steps can be boosted to outperform the fully-trained one. Furthermore, we extend ExPO’s application to twelve existing, already-aligned LLMs, showing that ExPO consistently improves their performance on the mainstream LLM benchmarks AlpacaEval 2.0 and MT-Bench. This suggests that ExPO can also serve as a practical and efficient means to compensate for potentially inadequate alignment training of existing LLMs. Overall, our work highlights the utility of model extrapolation in efficient LLM alignment, which can inspire future research in this direction.

6 Limitations

Hyperparameter Search

The current ExPO adopts the simplest form of uniform extrapolation and requires manual hyperparameter search for $\alpha$. Future work could explore how to determine the optimal $\alpha$ automatically and adaptively (i.e., using different $\alpha$ values for different model modules). For example, the information from optimizer states and parameter gradients during the later phase of alignment training could be useful for this purpose.

Alignment Tax

While ExPO makes substantial improvements in instruction-following ability and alignment with human preferences, this does not seem to be “free” and can instead incur an additional alignment tax, a widely observed issue in human preference optimization algorithms (Ouyang et al., 2022; Dong et al., 2024; Meng et al., 2024), which refers to possible fluctuations or drops in downstream task performance after alignment training. We evaluate the models in § 3.2 and 4.1 on the six downstream tasks (Clark et al., 2018; Zellers et al., 2019; Hendrycks et al., 2021; Lin et al., 2022; Sakaguchi et al., 2021; Cobbe et al., 2021) from the Open LLM Leaderboard (v1; Beeching et al. 2023). (Footnote 3: We employ the evaluation implementation of Eleuther’s lm-evaluation-harness (version 0.4.4). Note that the mismatch of input templates used for chat-style evaluations (e.g., AlpacaEval 2.0 and MT-Bench) and for these downstream task evaluations could also contribute to the observed alignment tax, as discussed in Meng et al. (2024).) We find that in most cases, ExPO amplifies the alignment tax introduced by the alignment training (from $\mathcal{M}_{0}$ to $\mathcal{M}_{1}$). For example, for the partially-trained models in § 3.2 (Appendix D, Figure 6), the original DPO models ($\mathcal{M}_{1}$) show improvements over the initial SFT model ($\mathcal{M}_{0}$) on TruthfulQA and declines on GSM8K, while applying ExPO ($\mathcal{M}_{2}$) leads to further improvements or declines, respectively.
For the existing, already-aligned LLMs in § 4.1, the amplification of the alignment tax by ExPO is usually smaller as shown in Figure 7 in Appendix D, suggesting a trade-off between the alignment training overhead (from $\mathcal{M}_{0}$ to $\mathcal{M}_{1}$) and the additional alignment tax brought by ExPO (from $\mathcal{M}_{1}$ to $\mathcal{M}_{2}$).

Acknowledgements

We thank Sidi Lu, Yufei Tian, Zi-Yi Dou, and other members of the UCLA PlusLab & NLP group as well as anonymous reviewers for their constructive feedback and discussions.

This work was supported by an Amazon AGI Research Award through UCLA-Amazon Science Hub and a National Science Foundation CAREER award #2339766. This work was also supported by the National Science Foundation for Distinguished Young Scholars (with No. 62125604) and China Scholarship Council (with No. 202306210211).

References

LLM Alignment

Modern large language models (LLMs) are first pre-trained on massive textual corpora with the unsupervised language modeling objective (Brown et al., 2020; Touvron et al., 2023b; Dubey et al., 2024), and then fine-tuned to learn to follow human instructions (OpenAI, 2022, 2023; Ji et al., 2023). The current fine-tuning paradigm typically contains two steps: supervised fine-tuning (SFT) and human preference optimization. Our work focuses on the latter step, which aims to adjust the model’s response distribution to better align with human preferences. In this process, the model is usually trained on preference data (“A is better than B”; Zhao et al. 2023b; Zheng et al. 2023a), thus learning to assign higher probabilities to human-preferred responses over the disfavored ones. Common implementations for human preference optimization include Reinforcement Learning from Human Feedback (RLHF; Ouyang et al. 2022; Schulman et al. 2017), Direct Preference Optimization (DPO; Rafailov et al. 2023), and many other DPO variants or competitors (Azar et al., 2024; Xu et al., 2024; Ethayarajh et al., 2024; Park et al., 2024; Meng et al., 2024). Given LLMs’ enormous parameter counts, the processes from pre-training to SFT and the alignment training still require expensive computational resources. Therefore, exploring more efficient alignment methods to reduce training overhead has always been an important and compelling research challenge (Ji et al., 2024). To address this challenge, we propose the ExPO method, which has demonstrated promising efficacy in expediting LLM alignment.

There is another line of work that attempts to bypass the expensive alignment training by blending multiple models’ token predictions during the inference time (Liu et al., 2021; Lu et al., 2024; Liu et al., 2024b), usually referred to as inference-time alignment methods. In comparison to ExPO, these inference-time methods often require more complex and varied implementations of model inference, which are not typically supported by existing high-performance LLM inference infrastructures (e.g., vLLM). This inconvenience not only reduces the practical efficiency of model inference but also significantly increases the cost of their hyperparameter search processes. In contrast, ExPO only involves regular inference of a single model, which can be seamlessly supported by existing infrastructures, thereby inheriting the merit in inference efficiency.

Model Averaging/Interpolation

Model averaging/interpolation is a commonly used technique in machine learning. It utilizes multiple models trained with different random initializations or data subsets and interpolates the weights of these models to obtain a new model with stronger out-of-distribution generalization (Izmailov et al., 2018; Lin et al., 2024; Wortsman et al., 2022; Lin et al., 2023). This technique is based on the mode connectivity of neural networks (Garipov et al., 2018; Entezari et al., 2022; Zhao et al., 2020; Frankle et al., 2020). Specifically, prior work found that multiple local optima in the parameter space can often be connected by low-loss (linear) paths, particularly for models with residual connection structures (He et al., 2016). This can explain why model interpolation can produce new, functional models when applied to LLMs (as we observe in Figure 1), as residual connections have become a dominant architectural choice in modern LLMs like LLaMA (Touvron et al., 2023a). We notice that recent LLMs have widely adopted model interpolation, as exemplified by Gemma-2 (Gemma et al., 2024) and LLaMA-3 (Dubey et al., 2024), possibly also for further enhancement in out-of-distribution generalization.

Appendix B Hyperparameter Search Details

We use the experiments in Table 2 as an example to illustrate how we conduct hyperparameter search.

Starting with $\mathcal{M}_{2}^{10\%}$:

(1) First, with an interval of 5, we tried $\alpha=5$ and $\alpha=10$. We found that both significantly outperformed $\mathcal{M}_{1}$, but $(\alpha=5)>(\alpha=10)$. (2) Then, setting the search range to $[5,10]$ with an interval of 1, we applied binary search and tried $\alpha=7$ and $\alpha=8$. We found that $(\alpha=8)>(\alpha=7)$. We then tried $\alpha=9$ and found $(\alpha=8)>(\alpha=9)$. (3) We thus determined $\alpha=8$ as optimal.

Note that smaller search intervals might yield better results, but we deem this unnecessary in practice.

Then, for $\mathcal{M}_{2}^{20\%}$:

(1) With previous experience, we first tried $\alpha=2$ and $\alpha=4$ with an interval of 2. We found that $\alpha=2$ significantly outperformed $\mathcal{M}_{1}$, but $\alpha=4$ performed worse than $\mathcal{M}_{1}$. (2) Then, setting search ranges to $[1,2]$ and $[2,4]$ with an interval of 1, we applied binary search and tried $\alpha=1$ and $\alpha=3$. We found that $(\alpha=2)>(\alpha=3)>(\alpha=1)$. (3) Next, with an interval of 0.5 in $[2,3]$, we tried $\alpha=2.5$ and found $(\alpha=2.5)>(\alpha=2)$. (4) We thus determined $\alpha=2.5$ as optimal.

This took 5 searches in total, each taking about 5 minutes (using one A100 80GB, including inference on the development set and reward model scoring), totaling about 0.5 GPU hours.
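The search budget for $\mathcal{M}_{2}^{20\%}$ can be tallied directly from the steps listed above. The snippet below is a small bookkeeping sketch with illustrative variable names. It uses the approximate per-search time quoted in the text, so the result is itself an estimate.

```python
# Alpha values tried for M_2^{20%}, in the order described in steps (1)-(3).
tried_alphas = [2.0, 4.0, 1.0, 3.0, 2.5]
minutes_per_search = 5  # approximate; one A100 80GB, inference plus reward scoring

num_searches = len(tried_alphas)
total_minutes = num_searches * minutes_per_search
total_gpu_hours = total_minutes / 60

print(f"{num_searches} searches, about {total_minutes} minutes "
      f"(~{total_gpu_hours:.2f} GPU hours)")
```

The tally comes to roughly 25 minutes, in line with the rounded figure of about 0.5 GPU hours.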

Next, for $\mathcal{M}_{2}^{40\%}$:

(1) Based on previous experience, we first tried $\alpha=0.5$ and found it outperformed $\mathcal{M}_{0}$. (2) Then with an interval of 0.1, we applied grid search and tried $\alpha=0.6$ and $\alpha=0.4$. We found that $\alpha=0.6$ performed worse than $\mathcal{M}_{1}$, while $(\alpha=0.5)>(\alpha=0.4)$. (3) We thus determined $\alpha=0.5$ as optimal.

Note that the search experience for $\mathcal{M}_{2}^{40\%}$ is a key motivation for us to use [0.1, 0.5] as the search range with a 0.1 interval for $\mathcal{M}_{2}^{100\%}$ and the models in § 4.1.

Summary

Overall, in our experiments (as in common practice) we do not search blindly, but flexibly combine binary search, grid search, and dynamically adjusted search intervals. These strategies are simple, practical, and reflect common practice. It is also noteworthy that the above search only requires inference-level GPU hardware (e.g., A10 24GB). Therefore, given the reduction in training overhead (from 12 GPU hours for $\mathcal{M}_{1}^{100\%}$ to 2.5 GPU hours for $\mathcal{M}_{1}^{20\%}$) and in hardware requirements (from eight A100 80GB for training to one A10 24GB for search), the $\alpha$ search process in ExPO is economical and efficient.

Table 7: Hyperparameter search results for $\alpha$ in § 3.2 and 4.1.

| | | Search Interval | Optimal $\alpha$ |
|---|---|---|---|
| Models in § 3.2 (binary/grid search) | DPO (10% data) | 1.0 | 8.0 |
| | DPO (20% data) | 0.5 | 2.5 |
| | DPO (40% data) | 0.1 | 0.5 |
| | zephyr-7b-dpo | 0.1 | 0.3 |
| Models in § 4.1 (grid search within [0.1, 0.5]) | zephyr-7b-alpha/beta | 0.1 | 0.3/0.1 |
| | tulu2-7/13/70b | 0.1 | 0.5 |
| | snorkel-7b-iter | 0.1 | 0.3 |
| | llama3-8b-iter | 0.1 | 0.3 |
| | starling-7b-alpha/beta | 0.1 | 0.2/0.5 |
| | internlm2-1.8/7/20b | 0.1 | 0.5 |

Appendix C HuggingFace Models

| | | HuggingFace Model ID |
|---|---|---|
| Reward models | | weqweasdas/RM-Mistral-7B |
| | | sfairXC/FsfairX-LLaMA3-RM-v0.1 |
| zephyr-7b-dpo | $\mathcal{M}_{0}$ | alignment-handbook/zephyr-7b-sft-full |
| | $\mathcal{M}_{1}$ | alignment-handbook/zephyr-7b-dpo-full |
| zephyr-7b-{alpha/beta} | $\mathcal{M}_{0}$ | HuggingFaceH4/mistral-7b-sft-{alpha/beta} |
| | $\mathcal{M}_{1}$ | HuggingFaceH4/zephyr-7b-{alpha/beta} |
| tulu2-{7/13/70}b | $\mathcal{M}_{0}$ | allenai/tulu-2-{7/13/70}b |
| | $\mathcal{M}_{1}$ | allenai/tulu-2-dpo-{7/13/70}b |
| snorkel-7b-iter | $\mathcal{M}_{0}$ | mistralai/Mistral-7B-Instruct-v0.2 |
| | $\mathcal{M}_{1}$ | snorkelai/Snorkel-Mistral-PairRM-DPO |
| llama3-8b-iter | $\mathcal{M}_{0}$ | RLHFlow/LLaMA3-SFT |
| | $\mathcal{M}_{1}$ | RLHFlow/LLaMA3-iterative-DPO-final |
| starling-7b-alpha | $\mathcal{M}_{0}$ | openchat/openchat_3.5 |
| | $\mathcal{M}_{1}$ | berkeley-nest/Starling-LM-7B-alpha |
| starling-7b-beta | $\mathcal{M}_{0}$ | openchat/openchat-3.5-0106 |
| | $\mathcal{M}_{1}$ | Nexusflow/Starling-LM-7B-beta |
| internlm2-{1.8/7/20}b | $\mathcal{M}_{0}$ | internlm/internlm2-chat-{1_8/7/20}b-sft |
| | $\mathcal{M}_{1}$ | internlm/internlm2-chat-{1_8/7/20}b |
| Mistral-based SFT | $\mathcal{M}_{0}$ | alignment-handbook/zephyr-7b-sft-full |
| {RRHF, SLiC-HF, IPO, CPO, KTO, R-DPO, SimPO} | $\mathcal{M}_{1}$ | princeton-nlp/Mistral-7B-Base-SFT-{*} |
| LLaMA-3-based SFT | $\mathcal{M}_{0}$ | princeton-nlp/Llama-3-Base-8B-SFT |
| {RRHF, SLiC-HF, IPO, CPO, KTO, R-DPO, SimPO} | $\mathcal{M}_{1}$ | princeton-nlp/Llama-3-Base-8B-SFT-{*} |

Appendix D Supplementary Experimental Results of Alignment Tax (§ 6)


Figure 6: Evaluation results for the models in § 3.2 on downstream tasks. The x-axis denotes the proportions of training steps. As the “cost” of simply improving instruction-following ability and alignment with human preferences, ExPO can also amplify the alignment tax introduced by the alignment training.


Figure 7: Evaluation results for the LLMs in § 4.1 on downstream tasks. For these already-aligned models, the additional alignment tax brought by ExPO is usually smaller, suggesting a trade-off between the alignment training overhead (from $\mathcal{M}_{0}$ to $\mathcal{M}_{1}$) and the additional alignment tax brought by ExPO (from $\mathcal{M}_{1}$ to $\mathcal{M}_{2}$).