A Bi-Objective ϵ-Constrained Framework for Quality-Cost Optimization in Language Model Ensembles


Aditi Singla, Kanishk Kukreja
{aditya21373,aditi21372,kanishk21393}@iiitd.ac.in

Abstract

We propose an ensembling framework that uses diverse open-sourced Large Language Models (LLMs) to achieve high response quality while maintaining cost efficiency. We formulate a bi-objective optimization problem to represent the quality-cost tradeoff and then introduce an additional budget constraint that reduces the problem to a straightforward 0/1 knapsack problem. We empirically demonstrate that our framework outperforms the existing ensembling approaches in response quality while significantly reducing costs.

Large Language Models (LLMs) excel in traditional NLP problems (OpenAI (2023)), but their high inference costs hinder deployment in high-throughput applications (Anonymous (2023a)). Meanwhile, open-source models are less performant than their closed-source counterparts (Beeching et al. (2023)), but they typically offer lower inference costs (Kaplan et al. (2020)).

Due to the variations in the training datasets of open-source LLMs, we expect these models to have diverse domains of expertise. Jiang et al. empirically verify that no single open-source LLM dominates the competition and further demonstrate the potential of ensembling LLMs. While naive ensembles increase the response quality, their inference cost is $O(N)$, where $N$ is the number of models in the selection set.

Our work addresses this by a) modeling the tradeoff between response quality and inference cost as a bi-objective combinatorial optimization problem (Section 2.1), b) motivating an $\epsilon$-constraint on the bi-objective problem that transforms it into a 0/1 knapsack problem (Section 2.2), and c) introducing a framework that outperforms the naive ensemble at a fraction of the cost (Section 2.3).

To the best of the authors’ knowledge, three approaches to combining LLMs exist in the literature:
LLM-BLENDER (Jiang et al. (2023)) employs a pairwise text ranker and a generative fuser for combining top-k responses but suffers from high inference costs and latency due to the need for $N$ LLM invocations and $O(N^2)$ comparisons for ranking fusion. Hybrid LLM (Anonymous (2023b)) trains a router to allocate queries to a large or a small model based on difficulty. However, its robustness is compromised, as the failure of the lighter model results in an expensive model addressing all the queries, and the absence of an explicit cost function limits its generalization to $N$-model scenarios. FrugalGPT (Anonymous (2023a)) greedily selects LLMs through pairwise comparisons and queries them sequentially, using a text quality estimator to determine an optimal stopping point. Its drawbacks are that the chosen model permutation is sensitive to the query and that it can make up to $O(K)$ sequential queries in extreme scenarios.

2 Proposed Framework

We are given a query $\bm{q}$ and a set of $N$ LLMs $\mathbb{M} = \{m_1, \dots, m_N\}$, where each $m_i : \mathbb{Q} \rightarrow \mathbb{A}$ is a function from the Query Space $\mathbb{Q}$ to the Answer Space $\mathbb{A}$. In the ensembling problem, our goal is to choose a subset $\mathbb{H} \subset \mathbb{M}$ to maximize $\mathbb{E}_{m_i \in \mathbb{H}}[r(f(m_i(\bm{q})), \bm{q})]$, where $r$ is a quality function $r(\bm{a}, \bm{q}) : \mathbb{A} \times \mathbb{Q} \rightarrow \mathbb{R}$ that measures the quality of response $\bm{a}$ on the query $\bm{q}$, and $f$ is an aggregation function that fuses $k$ responses into one final response, $f : \mathbb{A}^k \rightarrow \mathbb{A}$, where $k$ is the dimension of the aggregation set.

2.1 Model Inference Cost and the Bi-Objective Optimization Problem

Kaplan et al. define the inference cost in FLOPs per token as $c_{forward} \approx 2N + 2 n_{layer} n_{ctx} d_{model}$, where $N$ here denotes the number of non-embedding parameters (not the number of models), $n_{layer}$ is the number of layers, $n_{ctx}$ is the number of tokens in the input context, and $d_{model}$ is the dimension of the residual stream. Our cost minimization objective is,

$$\min \sum_{m_i \in \mathbb{H}} c_i \cdot t_i(\bm{q}) \tag{1}$$

where $c_i$ is the inference cost of $m_i$ and $t_i : \mathbb{Q} \rightarrow \mathbb{R}$ maps $\bm{q}$ to the token count based on $m_i$.
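The cost objective can be illustrated with a minimal Python sketch. It evaluates the Kaplan et al. approximation and the sum in equation (1). The function names, the dictionary keys, and the example model sizes are hypothetical and chosen only for illustration; they are not the models or values used in the experiments.

```python
def forward_flops_per_token(n_params, n_layer, n_ctx, d_model):
    """Approximate forward-pass FLOPs per token: 2N + 2 * n_layer * n_ctx * d_model."""
    return 2 * n_params + 2 * n_layer * n_ctx * d_model


def ensemble_cost(models, token_counts):
    """Objective (1): sum of c_i * t_i(q) over the selected models.

    `models` is a list of dicts holding the four architecture numbers,
    `token_counts` holds t_i(q) for the same models, in the same order.
    """
    return sum(
        forward_flops_per_token(**model) * tokens
        for model, tokens in zip(models, token_counts)
    )


# Hypothetical architecture numbers, for illustration only.
large = dict(n_params=7_000_000_000, n_layer=32, n_ctx=512, d_model=4096)
small = dict(n_params=1_000, n_layer=2, n_ctx=10, d_model=8)

c_large = forward_flops_per_token(**large)  # 14_134_217_728
total = ensemble_cost([small], [5])         # 2_320 FLOPs/token * 5 tokens = 11_600
```

For large models the $2N$ term dominates, so under this approximation the cost of a query is roughly proportional to parameter count times token count.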

Moreover, our experiments suggest that a dependable approach to increase $\mathbb{E}_{m_i \in \mathbb{H}}[r(f(m_i(\bm{q})), \bm{q})]$ involves maximizing the sum of the individual models' response quality,

$$\max \sum_{m_i \in \mathbb{H}} r(m_i, \bm{q}) \tag{2}$$

Equations (1) and (2) form the bi-objective combinatorial optimization problem.

2.2 $\epsilon$-constraint to solve the bi-objective optimization problem

Haimes & Wismer introduced the $\epsilon$-constraint method for multi-objective optimization, which involves optimizing one function while limiting others. We reduce our problem to,

$$\begin{aligned} \max \quad & \sum_{m_i \in \mathbb{H}} r(m_i, \bm{q}) \\ \text{subject to} \quad & \sum_{m_i \in \mathbb{H}} c_i \cdot t_i(\bm{q}) \leq \epsilon \end{aligned} \tag{3}$$

Think of it as assigning a budget ($\epsilon$) to each query. This simplifies the problem into a 0/1 knapsack scenario with profits $r(m_i, \bm{q})$, costs $c_i \cdot t_i(\bm{q})$, and capacity $\epsilon$, efficiently solvable using a dynamic programming subroutine (see Appendix A.1).
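To make problem (3) concrete, the sketch below solves it by exhaustive enumeration for a tiny selection set. This is not the dynamic programming subroutine of the framework; it is only a reference check that is practical for small $N$. The function name and the numbers are hypothetical, and the qualities are assumed to be positive.

```python
from itertools import combinations


def best_subset(qualities, costs, epsilon):
    """Enumerate all subsets and return the feasible one with the highest total quality.

    qualities[i] plays the role of r(m_i, q), costs[i] the role of c_i * t_i(q).
    Returns (indices of the chosen models, total quality).
    """
    best, best_quality = (), 0.0
    for size in range(1, len(qualities) + 1):
        for subset in combinations(range(len(qualities)), size):
            cost = sum(costs[i] for i in subset)
            quality = sum(qualities[i] for i in subset)
            if cost <= epsilon and quality > best_quality:
                best, best_quality = subset, quality
    return best, best_quality


# Three hypothetical models: the single best model (index 2) fits the budget,
# but the two cheaper models together give a higher total quality.
chosen, quality = best_subset([3.0, 4.0, 5.0], [2, 3, 4], epsilon=5)
# chosen == (0, 1), quality == 7.0
```

Enumeration visits $2^N$ subsets, which is why a dynamic programming solution is preferable once the selection set grows.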

2.3 MODI: Model Orchestration using DeBERTa Inference

We employ a DeBERTa-based regression model (He et al. (2021)) to predict the response quality for models in our selection set. Appendix A.2 provides details on the regression architecture. The predicted quality scores, denoted as $\hat{r}(m_i(\bm{q}), \bm{q})$, guide the 0/1 knapsack subroutine. Ultimately, the selected model outputs are combined using the GEN-FUSER (Jiang et al. (2023)).

3 Experiments and Results

Table 1: MODI demonstrates superior performance compared to baseline LLMs and LLM-BLENDER on the MixInstruct task, achieving this at only 20% of the LLM-BLENDER cost.

Our preliminary experiments evaluate our approach using the MixInstruct dataset (Jiang et al. (2023)). We compare the responses of our model against individual LLM baselines and the LLM-BLENDER (Jiang et al. (2023)) results. Further details about the experiments are in Appendix A.3. The rationale for choosing BARTScore as our comparison metric can be found in Appendix A.4.

4 Conclusion

We introduce an LLM ensembling framework for Response Quality-Cost optimization. Formulating a bi-objective optimization problem, we apply an $\epsilon$-constrained approach to ensemble models within a user-defined budget. Our model surpasses existing ensembling methods while significantly reducing costs. This work establishes a foundation for cost-effective strategies to enhance language model capabilities, showcasing the efficacy of ensembling techniques.

References

Appendix A

A.1 Dynamic Programming Subroutine to Solve the 0/1 Knapsack Problem

The dynamic programming subroutine provided in Algorithm 1 is designed to solve the 0/1 knapsack problem efficiently. Since the BARTScores are negative, we apply the following transformation on the scores,

$$\text{Target Score} = \alpha + \text{BARTScore} \tag{4}$$

where $\alpha$ is a positive constant chosen such that,

$$\alpha > \max |\text{BARTScore}| \tag{5}$$
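The shift matters because, with negative profits, a knapsack solver would prefer the empty selection (total profit 0) over any model. A small sketch of the transformation follows. The helper name and the `margin` parameter are hypothetical, and computing $\alpha$ from the scores passed in is an assumption made for illustration; the text only requires $\alpha$ to be a positive constant satisfying equation (5).

```python
def to_target_scores(bart_scores, margin=1.0):
    """Shift negative BARTScores into positive target scores (equation 4).

    alpha is set to max|BARTScore| + margin so that alpha > max|BARTScore| holds.
    """
    alpha = max(abs(score) for score in bart_scores) + margin
    return alpha, [alpha + score for score in bart_scores]


# Hypothetical BARTScores for three models on one query.
alpha, targets = to_target_scores([-3.5, -2.0, -4.25])
# alpha == 5.25, targets == [1.75, 3.25, 1.0]
```

The ordering of the models is preserved, and every target score is strictly positive.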

The subroutine utilizes a dynamic programming approach to find the optimal selection of models within a given budget, maximizing the total target score.
The list "models" comprises objects that describe the cost and the target score associated with each model in the selection set $\mathbb{M}$.

Algorithm 1 Knapsack(models, budget)

    n ← length(models)
    dp ← 2D array of size (n + 1) × (budget + 1), initialized to 0
    for i from 1 to n do
        for j from 0 to budget do
            if models[i − 1]['cost'] ≤ j then
                dp[i][j] ← max(dp[i − 1][j], dp[i − 1][j − models[i − 1]['cost']] + models[i − 1]['target_score'])
            else
                dp[i][j] ← dp[i − 1][j]
            end if
        end for
    end for
    selected_models ← empty list
    j ← budget
    for i from n to 1 decrementing do
        if dp[i][j] ≠ dp[i − 1][j] then
            add models[i − 1] to selected_models
            j ← j − models[i − 1]['cost']
        end if
    end for
    return selected_models

The resulting selected_models list contains the optimal selection of models within the given budget, which is then passed to the GEN-FUSER (Jiang et al. (2023)).
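Algorithm 1 can be transcribed into Python roughly as follows. This is a sketch under two assumptions that are not stated in the text: costs and the budget are non-negative integers (for example, FLOPs quantized to a coarse unit so the table stays small), and the table is initialized to zero. The `name` field in the example is hypothetical.

```python
def knapsack(models, budget):
    """0/1 knapsack over models with integer 'cost' and positive 'target_score'."""
    n = len(models)
    # dp[i][j]: best total target score using the first i models within budget j.
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost = models[i - 1]["cost"]
        score = models[i - 1]["target_score"]
        for j in range(budget + 1):
            dp[i][j] = dp[i - 1][j]  # skip model i
            if cost <= j:            # or take it, if it fits
                dp[i][j] = max(dp[i][j], dp[i - 1][j - cost] + score)

    # Walk the table backwards to recover which models were taken.
    selected_models = []
    j = budget
    for i in range(n, 0, -1):
        if dp[i][j] != dp[i - 1][j]:
            selected_models.append(models[i - 1])
            j -= models[i - 1]["cost"]
    return selected_models


# Hypothetical selection set with quantized costs.
models = [
    {"name": "a", "cost": 2, "target_score": 3.0},
    {"name": "b", "cost": 3, "target_score": 4.0},
    {"name": "c", "cost": 4, "target_score": 5.0},
]
picked = sorted(m["name"] for m in knapsack(models, budget=5))
# picked == ["a", "b"]
```

The table has $(n + 1) \times (\text{budget} + 1)$ entries, so the running time depends on how finely the budget is quantized.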

A.2 Regression Model Architecture


Figure 1: Regression Model Architecture.

The model architecture is based on a DeBERTa-v3-large (He et al. (2021)) backbone. The output of the encoder is passed to an aggregation function. We experimented with multiple aggregation techniques, including average and max pooling of the hidden state embeddings and concatenating the last four-word embeddings of the hidden state. Ultimately, we found that the hidden state embeddings corresponding to the CLS token provide the best regression results. The embeddings are passed through a feedforward neural network, the architecture of which is shown in Figure 1.

The embeddings are first passed through a dropout layer (Srivastava et al. (2014)) with $p = 0.2$ to prevent overfitting. Then, a Gaussian Error Linear Unit (Hendrycks & Gimpel (2023)),

$$\text{GELU}(\bm{x}) = \bm{x}\,\Phi(\bm{x}), \tag{6}$$

is applied to the embeddings. The resulting tensors are passed through a Linear layer and then through a Gated Linear Unit (Dauphin et al. (2017)),

$$\text{GLU}(\mathbf{X}) = (\mathbf{X} \ast \mathbf{W} + \mathbf{b}) \otimes \sigma(\mathbf{X} \ast \mathbf{V} + \mathbf{c}). \tag{7}$$

Finally, the tensors are passed through a Linear layer with output dimensions equal to the number of models in the selection set $\mathbb{M}$ to give the predictions, $\hat{r}(m_i(\bm{q}), \bm{q})$.
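The two activations in equations (6) and (7) can be written out in plain Python as a sanity check. This sketch is not the training code. It assumes $\Phi$ is the standard normal CDF, treats $\ast$ as a matrix product on a single input vector, and uses hypothetical helper names and toy weights.

```python
import math


def gelu(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(x, weight, bias):
    """Compute x @ weight + bias for a vector x and a weight given as a list of rows."""
    return [
        sum(x_i * weight[i][j] for i, x_i in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]


def glu(x, w, b, v, c):
    """GLU(x) = (x W + b) elementwise-times sigmoid(x V + c)."""
    values = affine(x, w, b)
    gates = affine(x, v, c)
    return [value * sigmoid(gate) for value, gate in zip(values, gates)]


# Toy weights: identity for the value path, zeros for the gate path,
# so every gate equals sigmoid(0) = 0.5.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
gated = glu([1.0, 2.0], identity, [0.0, 0.0], zeros, [0.0, 0.0])
# gated == [0.5, 1.0]
```

The gate path scales each value by a factor between 0 and 1, which is what lets the unit suppress individual features.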

The model minimizes the Huber Loss (Huber (1964)) given by,

$$L_\delta(y, f(x)) = \begin{cases} 0.5\,(y - f(x))^2, & \text{if } |y - f(x)| \leq \delta \\ \delta\,(|y - f(x)| - 0.5\,\delta), & \text{otherwise.} \end{cases} \tag{8}$$

This choice of loss is motivated by the presence of several outlier queries in the training set, which can significantly degrade performance if an $L_2$ loss function is used.
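A short sketch of equation (8) shows how the loss treats outliers. The default `delta=1.0` is an illustrative assumption, not the value used in training.

```python
def huber_loss(y, prediction, delta=1.0):
    """Huber loss: quadratic for small residuals, linear beyond delta."""
    residual = abs(y - prediction)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)


small_error = huber_loss(0.0, 0.5)   # 0.125, same as the halved squared error
outlier = huber_loss(0.0, 3.0)       # 2.5, versus 4.5 for the halved squared error
```

For the residual of 3.0 the penalty grows linearly rather than quadratically, so a single outlier query contributes less to the gradient than it would under an $L_2$ loss.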

A.3 Experimental Setup and Hyperparameters

Table 2: Experiment Details

Dataset: We use the MixInstruct dataset introduced by Jiang et al. to benchmark LLM ensembles. The dataset includes 110K instruction-following tasks curated from four diverse sources. We trained our regression model on 10K randomly sampled queries and LLM responses from the training dataset. Our validation and test splits are the same as those of MixInstruct, consisting of 5K instruction examples each.

Evaluation Metric: We use BARTScore (Yang & Yang (2023)) as our quality metric. The rationale for using BARTScore and qualitative comparisons against LLM-BLENDER can be found in A.4.

Budget: We use different fractions of the total FLOPs required by an LLM-BLENDER response on the query as our budget.

Fusion Model: We use the Flan-T5-XL-based (Chung et al. (2022)) GEN-FUSER very generously open-sourced by Jiang et al. as our fusion model.

Baselines: We compare our model's response with the language models present in our selection set, a randomly chosen ensemble of models, and LLM-BLENDER.

The details about our training process, including the hardware involved, LLMs used in the selection set, Loss function, Optimizer used, and their specific hyperparameters, are included in Table 2.

A.4 Rationale for Using BARTScore as an Evaluation Metric

BARTScore (Yang & Yang (2023)) is computationally affordable compared to resource-intensive human and GPT-based evaluators. Jiang et al. empirically show a strong correlation between BARTScore and the GPT-based ranking metric. Further, recent research (Anonymous (2023b)) empirically demonstrates the correlation of BARTScore with human-based evaluations, indicating BARTScore to be a reliable and consistent evaluation approach. Qualitatively, our responses are better than or equivalent to LLM-BLENDER's, as seen in Table 3.

Table 3: Qualitative comparison of MODI responses with LLM-BLENDER