DR-RAG: Applying Dynamic Document Relevance to Retrieval-Augmented Generation for Question-Answering
Zijian Hei∗,1,2, Weiling Liu∗,1,3, Wenjie Ou∗,1,4, Juyi Qiao1, Junming Jiao1,
Guowen Song1 (corresponding author 1), Ting Tian2 (corresponding author 2), Yi Lin4 (corresponding author 2)
1 Li Auto Inc.,2 Sun Yat-sen University,3 Northeastern University, China,4 Sichuan University
songguowen@lixiang.com
tiant55@mail.sysu.edu.cn, yilin@scu.edu.cn
Abstract
∗ Contributed equally during an internship at Li Auto.
Retrieval-Augmented Generation (RAG) has recently improved the performance of Large Language Models (LLMs) on knowledge-intensive tasks such as Question-Answering (QA). RAG expands the query context by incorporating external knowledge bases to enhance response accuracy. However, calling LLMs multiple times for each query is inefficient, and retrieving all the relevant documents with a single query is unreliable. We have found that even when some critical documents have low relevance to the query, they can be retrieved by combining parts of the already retrieved documents with the query. To mine this relevance, a two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to improve document retrieval recall and answer accuracy while maintaining efficiency. Additionally, a compact classifier is applied in two different selection strategies to determine the contribution of the retrieved documents to answering the query and to keep the relevant documents. Meanwhile, DR-RAG calls the LLM only once, which significantly improves efficiency. The experimental results on multi-hop QA datasets show that DR-RAG significantly improves the accuracy of the answers and achieves new progress in QA systems.
1 Introduction
Large language models (LLMs) have recently made significant progress in the field of Natural Language Processing (NLP), especially in text generation tasks Brown et al. (2020); Achiam et al. (2023); Touvron et al. (2023b); Anil et al. (2023); Ouyang et al. (2022); Touvron et al. (2023a). Although LLMs excel in various application scenarios, challenges remain regarding the accuracy and timeliness of the generated text, especially in real-time domains. LLMs that rely on intrinsic parametric memory may generate inaccurate or even incorrect text when faced with up-to-date queries Min et al. (2023); Mallen et al. (2022); Muhlgay et al. (2023). This issue, known as hallucination, occurs when the text generated by LLMs fails to align with real-world knowledge Ji et al. (2023); Zhang et al. (2023); Kwiatkowski et al. (2019). Therefore, Retrieval-Augmented Generation (RAG) frameworks have been proposed to improve the accuracy of generated text by combining relevant information from an external knowledge base with the query Arora et al. (2023); Lewis et al. (2020); Borgeaud et al. (2022). RAG has effectively demonstrated its superiority in knowledge-intensive tasks such as open-domain Question-Answering (QA) and has further improved the performance of LLMs.
Figure 1: An example showing that the retriever easily introduces static-relevant documents due to their high relevance (red), but struggles to retrieve dynamic-relevant documents, which have low relevance (blue) but are critical for the answer. Stars indicate levels of retrieval difficulty.
However, irrelevant information reduces the quality of the generated text and further interferes with the ability of LLMs to answer the query in applications Shi et al. (2023). Moreover, the undifferentiated combining strategy in RAG can mix in irrelevant information Rony et al. (2022). Inconsistent or contradictory information introduced while combining documents may inject incorrect information and reduce the accuracy of the generated answers. During retrieval, we need to select both documents that are highly relevant and decisive for generating the answer (static-relevant documents) and documents that have low relevance but are also crucial for generating the answer (dynamic-relevant documents). As shown in Fig. 1, an example query is ‘Who is the spouse of the child of Peter Andreas Heiberg?’, which requires the two most relevant documents to obtain the correct answer. The static-relevant document is easy to retrieve because it is highly relevant to the query through ‘Peter Andreas Heiberg’ and ‘child/son’ (Fig. 1, red). However, the dynamic-relevant document is difficult to retrieve because it is related to the query only through ‘spouse/wife’ (Fig. 1, blue). Moreover, the knowledge base contains too much information about ‘spouse’, which may cause dynamic-relevant documents to be ranked lower in the retrieval process. The static- and dynamic-relevant documents are highly relevant to each other through ‘Johan Ludvig Heiberg’ and ‘wife’. If this link is taken into account together with ‘spouse/wife’ in the query, we can easily retrieve the dynamic-relevant document and get the answer.
Motivated by the above observations, a novel two-stage retrieval framework called Dynamic-Relevant Retrieval-Augmented Generation (DR-RAG) is proposed to mine the relevance between the query and documents. In the first-retrieval stage, a similarity matching (SM) method is used to obtain a certain percentage of the documents based on the query. Subsequently, the retrieved documents are concatenated with the query to uncover deeper relevance to dynamic-relevant documents. Moreover, we design a classifier that determines, using a predefined threshold, whether the retrieved documents contribute to the current query. To optimise the set of documents, we design two approaches, i.e., forward selection and reverse selection. We aim to ensure that the retrieved documents are highly relevant, thus avoiding redundant documents. Through two-stage retrieval and the classifier selection strategies, DR-RAG can retrieve sufficient relevant documents and address complex, multi-level problems. DR-RAG makes full use of the static and dynamic relevance of documents and enhances the model’s performance under diverse queries. To validate the effectiveness of DR-RAG, we conduct extensive experiments with different retrieval strategies on multi-hop QA datasets. The results show that our method can significantly improve the recall and accuracy of the answers.
In short, we summarize the key contributions of this work as follows:
- We design an effective RAG framework named DR-RAG for multi-hop QA. A two-stage retrieval strategy is proposed to significantly improve the recall and accuracy of the retrieval results.
- We design a classifier that determines whether the retrieved documents contribute to the current query by setting a predefined threshold. The mechanism effectively reduces redundant documents and ensures that the retrieved documents are concise and efficient.
- We conduct experiments on three multi-hop QA datasets to validate DR-RAG. The experimental results show that DR-RAG can improve recall by 86.75% and improve the three metrics (Acc, EM, F1) by 6.17%, 7.34%, and 9.36%, respectively. DR-RAG has significant advantages in complex and multi-hop QA and supports the performance of RAG frameworks in QA systems.
Table 1: The key mathematical notations.
2 Method
In this section, we will describe the DR-RAG framework and its design approach in detail. Specifically, in section 2.1 we will define relevant symbols comprehensively, and in section 2.2 we will describe the whole framework.
Figure 2: An overview of DR-RAG. In step 1, we retrieve static-relevant documents (SR-Documents) due to high relevance with the query. Then we concatenate SR-Documents with the query to retrieve multiple dynamic-relevant documents (DR-Documents) in step 2. Finally, we select each of DR-Documents in turn to concatenate with the query and SR-Documents and feed them into the classifier to select the most relevant DR-Document.
2.1 Preliminaries
To enrich the knowledge of LLMs, we need to retrieve multiple documents to provide comprehensive answers to complex queries. For clarity, we summarize the key notations in Table 1; the whole framework is shown in Fig. 2.
Our goal is to retrieve the most relevant documents $\boldsymbol{d}^{*}$ from the retrieved documents $\boldsymbol{d}$ to answer the query and to avoid missing key information in the additional knowledge provided to LLMs. However, it is difficult to retrieve all the static- and dynamic-relevant documents through the SM method during the retrieval process (Fig. 2). For clarity, we denote these two types of relevant documents as $\boldsymbol{d}_{stat}^{*}$ and $\boldsymbol{d}_{dyn}^{*}$, respectively.
A common approach is to increase the value of $k$ to raise the chance of retrieving $\boldsymbol{d}_{dyn}^{*}$. However, in MuSiQue, for instance, increasing $k$ from 3 to 6 only raises the recall rate from 58% to 76%, leaving many relevant documents unretrieved. Furthermore, irrelevant documents will provide LLMs with redundant information. Motivated by this problem, the main research objective of our work is to improve the document recall rate of $\boldsymbol{d}_{dyn}^{*}$ based on dynamic relevance with the same top-$k$.
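The recall figures quoted above can be made concrete with a small helper. This is a minimal sketch, assuming recall is measured per query as the fraction of gold (relevant) documents that appear among the top-$k$ retrieved ones; the section does not spell out the exact definition used in the experiments, and the document ids below are made up for illustration.

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of gold documents that appear among the top-k retrieved ids."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold) / len(gold)

# Hypothetical ranking for one query: d2 is static-relevant, d5 is dynamic-relevant.
ranking = ["d7", "d2", "d9", "d4", "d1", "d5"]
gold = {"d2", "d5"}

recall_small_k = recall_at_k(ranking, gold, 3)  # d5 is ranked too low to be found
recall_large_k = recall_at_k(ranking, gold, 6)  # found only by paying for a larger k
print(recall_small_k, recall_large_k)
```

In this toy ranking the dynamic-relevant document only enters the context once $k$ is doubled, which also brings in four documents that are not needed for the answer.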
2.2 DR-RAG
In this section, we give a comprehensive description of the DR-RAG framework, a new two-stage retrieval method that differs from traditional reranking methods NetEase Youdao (2023); Chen et al. (2024). As shown in Fig. 2, we retrieve $k_1$ documents through the SM method (first-retrieval stage) and employ a classifier $C$ to model the dynamic relevance between documents (selection process) to enhance the recall rate of the remaining $k_2$ documents. The role of the classifier $C$ is to assess the dynamic relevance between documents and thereby determine whether the information from the documents is crucial to answering the present query.
2.2.1 Query Documents Concatenation
As mentioned before, dynamic-relevant documents are difficult to retrieve because of their low relevance to the query. Moreover, the only relevant information they share with the query, ‘spouse/wife’, is also obscured by the mixed information in the knowledge base, because too many documents in $D$ contain ‘spouse’. Therefore, the Query Documents Concatenation (QDC) method uses the query concatenated with a retrieved document to match more useful and relevant information. After the first-retrieval stage, we obtain $k_1$ static-relevant documents and concatenate $\boldsymbol{q}$ with each document to form multiple $\langle \boldsymbol{q}, \boldsymbol{d}_{i} \rangle$ pairs, $i = 1, \ldots, k_1$. Dynamic-relevant documents can then be retrieved from $D$ with the corresponding $\langle \boldsymbol{q}, \boldsymbol{d}_{i} \rangle$ pair in the second-retrieval stage. In the example in Fig. 2, when $\boldsymbol{q}$ and $\boldsymbol{d}_{stat}^{*}$ are concatenated, the resulting query contains both ‘Johan Ludvig Heiberg’ and the relationship ‘spouse/wife’, which makes it essentially similar to $\boldsymbol{d}_{dyn}^{*}$. Therefore, $\boldsymbol{d}_{dyn}^{*}$ is more clearly related to the query and thus easily retrieved. The whole process is:
\begin{equation}
\begin{split}
Cnt &= \{\} \\
\{\boldsymbol{d}_{1}, \boldsymbol{d}_{2}, \ldots, \boldsymbol{d}_{k_{1}}\} &= \texttt{Retriever}(\boldsymbol{q}) \\
Cnt &= Cnt \cup \{\boldsymbol{d}_{1}, \boldsymbol{d}_{2}, \ldots, \boldsymbol{d}_{k_{1}}\} \\
\boldsymbol{q}_{i}^{*} &= \texttt{Concat}(\boldsymbol{q}, \boldsymbol{d}_{i}) \\
\{\boldsymbol{d}^{\prime}_{i,1}, \ldots, \boldsymbol{d}^{\prime}_{i,k_{2}}\} &= \texttt{Retriever}(\boldsymbol{q}_{i}^{*}) \\
Cnt &= Cnt \cup \{\boldsymbol{d}^{\prime}_{i,j} \mid \boldsymbol{d}^{\prime}_{i,j} \notin Cnt \land first\} \\
answer &= \texttt{LLM}(\texttt{Concat}(\boldsymbol{q}, Cnt))
\end{split}
\tag{1}
\end{equation}
where $k_{1} + k_{2}$ is equal to $k$. Retriever is a common SM method. $\boldsymbol{d}$ and $\boldsymbol{d}^{\prime}$ are the relevant documents retrieved from $D$ in the first- and second-retrieval stages, respectively. $Cnt$ is a context containing multiple documents. $Cnt = Cnt \cup \{\boldsymbol{d}^{\prime}_{i,j} \mid \boldsymbol{d}^{\prime}_{i,j} \notin Cnt \land first\}$ means that for a given $i$ (that is, among the candidates retrieved with $\boldsymbol{q}_{i}^{*}$), the first $\boldsymbol{d}^{\prime}_{i,j}$ in the second-retrieval stage that is not already part of $Cnt$ is placed into $Cnt$. LLM is a large language model used to obtain the answer. $answer$ is the output that answers the query.
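The steps of Eq. (1) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `retriever` below is a toy Jaccard token-overlap ranker standing in for the SM method, the four-sentence corpus is made up around the example query from Fig. 1, and the final LLM call is left out (the returned list plays the role of $Cnt$, which would be concatenated with the query and sent to the LLM once).

```python
import re

def tokens(text):
    """Lower-cased word set; a deliberately crude text representation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retriever(query, corpus, top_n):
    """Toy stand-in for the SM retriever: rank documents by Jaccard token overlap."""
    q = tokens(query)

    def score(doc):
        d = tokens(doc)
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(corpus, key=score, reverse=True)[:top_n]

def qdc_context(query, corpus, k1, k2):
    """Build Cnt as in Eq. (1): first-stage documents plus one new document per q_i*."""
    cnt = list(retriever(query, corpus, k1))         # first-retrieval stage
    for d_i in list(cnt):                            # only the k1 first-stage documents
        q_star = query + " " + d_i                   # q_i* = Concat(q, d_i)
        for cand in retriever(q_star, corpus, k2):   # second-retrieval stage
            if cand not in cnt:                      # the first candidate not yet in Cnt
                cnt.append(cand)
                break
    return cnt

# Illustrative corpus (assumption): one static-relevant, one dynamic-relevant, two distractors.
corpus = [
    "Johan Ludvig Heiberg was the son of Peter Andreas Heiberg.",
    "Johanne Luise Heiberg was the wife of Johan Ludvig Heiberg.",
    "The spouse of a monarch is called a consort.",
    "Copenhagen is the capital of Denmark.",
]
query = "Who is the spouse of the child of Peter Andreas Heiberg?"

plain_top2 = retriever(query, corpus, 2)            # single-query retrieval with k = 2
context = qdc_context(query, corpus, k1=1, k2=2)    # two-stage retrieval with k1 = 1
print(plain_top2)
print(context)
```

With the same budget of two documents, the single-query ranking picks the generic ‘spouse’ sentence as its second document, whereas the concatenated query pulls in the sentence about Johan Ludvig Heiberg's wife. A dense embedding retriever would replace `retriever` in practice; the control flow stays the same.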
2.2.2 Classifier for Selection
While the QDC method significantly improves document recall and answer accuracy, there are two key issues to consider: 1) there may be redundant information in the $k$ retrieved documents, which may affect the response of LLMs; 2) we need a way to determine whether a document retrieved in the second-retrieval stage is useful for the answer in order to further optimise document recall. Motivated by these issues, two pipelines are designed to mine deeper document relevance: 1) Classifier Inverse Selection (CIS): in this pipeline, after the second-retrieval stage we exclude some irrelevant documents from the $k$ retrieved documents; 2) Classifier Forward Selection (CFS): we apply a judgment condition to each document retrieved in the second-retrieval stage to filter out irrelevant documents that are useless or even harmful to the answer. In addition, we train a classifier $C$ based on a small model with millisecond-level runtime to prevent excessive delays in our pipelines. DR-RAG involves a small binary-classification model whose input consists of $\boldsymbol{q}$ and two documents. The training objective is to determine the potential contribution of the documents to answering $\boldsymbol{q}$. The specific settings are as follows:
\begin{equation}
\begin{split}
C(\boldsymbol{q}, \boldsymbol{d}^{*}, \boldsymbol{d}^{*}) &= positive \\
C(\boldsymbol{q}, \boldsymbol{d}^{*}, \boldsymbol{d}^{\Delta}) &= negative \\
C(\boldsymbol{q}, \boldsymbol{d}^{\Delta}, \boldsymbol{d}^{\Delta}) &= negative
\end{split}
\tag{2}
\end{equation}
where $C$ represents the classifier, and $positive$ and $negative$ indicate whether the two documents are critical for the query.
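One way to read Eq. (2) is as a labelling rule for training examples. The sketch below is a hypothetical construction, reading $\boldsymbol{d}^{*}$ as a document needed for the answer and $\boldsymbol{d}^{\Delta}$ as one that is not (an assumption, since Table 1 is not reproduced here). The function name, the integer labels, and the exhaustive pairing are illustrative choices; how the authors actually sample and balance pairs is not stated in this section.

```python
from itertools import combinations, product

def build_training_triples(query, gold_docs, other_docs):
    """Label (query, doc_a, doc_b) triples following Eq. (2): 1 = positive, 0 = negative."""
    examples = []
    # C(q, d*, d*) = positive: both documents are needed for the answer.
    for a, b in combinations(gold_docs, 2):
        examples.append((query, a, b, 1))
    # C(q, d*, d^Delta) = negative: one of the two documents does not contribute.
    for a, b in product(gold_docs, other_docs):
        examples.append((query, a, b, 0))
    # C(q, d^Delta, d^Delta) = negative: neither document contributes.
    for a, b in combinations(other_docs, 2):
        examples.append((query, a, b, 0))
    return examples

# Hypothetical document ids for one multi-hop question.
triples = build_training_triples(
    "Who is the spouse of the child of Peter Andreas Heiberg?",
    gold_docs=["d_son", "d_wife"],
    other_docs=["d_consort", "d_capital"],
)
positives = [t for t in triples if t[3] == 1]
negatives = [t for t in triples if t[3] == 0]
print(len(positives), len(negatives))
```

Exhaustive pairing yields far more negatives than positives as the number of distractors grows, so some form of negative sampling would likely be needed when training a real classifier.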
Table 2: Results on different datasets with Llama3-8B as LLM. Adaptive Retrieval and Self-RAG conduct the retrieval module only under specific conditions (unpopular query entities or special retrieval tokens), so their time overhead is much less than other methods. We emphasize our results in bold.
Classifier Inverse Selection In this approach, we selectively exclude some irrelevant documents from the $k$ retrieved documents to minimize document redundancy. Specifically, after obtaining the $k$ documents in the two stages, we pair them as $\langle \boldsymbol{q}, \boldsymbol{d}_{m}, \boldsymbol{d}_{n} \rangle$ and obtain $C_{k}^{2}$ pairs (all unordered pairs of the $k$ documents). The pairs, together with the current query $\boldsymbol{q}$, are fed into the classifier $C$. When the classification result for a document paired with each of the remaining $k-1$ documents is negative, we consider the document redundant and remove it. The whole process is:
\begin{equation}
\begin{split}
Cnt &= Cnt \cup \{\boldsymbol{d}^{\prime}_{i,j} \mid \boldsymbol{d}^{\prime}_{i,j} \notin Cnt \land first\} \\
P_{i,j} &= \begin{cases} 1 & \text{if } \exists i,\ C(\boldsymbol{q}, \boldsymbol{d}^{\prime}_{i,j}, \boldsymbol{d}_{i}) = positive \\ 0 & otherwise \end{cases} \\
Cnt &= Cnt - \{\boldsymbol{d}^{\prime}_{i,j} \mid P_{i,j} = 0\} \\
answer &= \texttt{LLM}(\texttt{Concat}(\boldsymbol{q}, Cnt))
\end{split}
\tag{3}
\end{equation}
where $-$ represents the set complement (set difference). $Cnt = Cnt - \{\boldsymbol{d}^{\prime}_{i,j} \mid P_{i,j} = 0\}$ means that if $\boldsymbol{d}^{\prime}_{i,j}$ from the second-retrieval stage is classified as $negative$ when combined with every $\boldsymbol{d}_{i}$ from the first-retrieval stage, then $\boldsymbol{d}^{\prime}_{i,j}$ is removed.
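The removal rule of Eq. (3) can be sketched as a filter over the second-stage documents. This is a minimal sketch that follows the equation (second-stage documents are tested against first-stage ones) rather than the all-pairs description in the prose. `make_toy_classifier` is a hypothetical lookup-table stub standing in for the trained classifier $C$, and the document ids and the 0.5 threshold are assumptions for illustration.

```python
def make_toy_classifier(positive_pairs, threshold=0.5):
    """Hypothetical stand-in for C: scores a document pair from a lookup table."""
    def classify(query, doc_a, doc_b):
        score = 1.0 if frozenset((doc_a, doc_b)) in positive_pairs else 0.0
        return score >= threshold  # True means "positive"
    return classify

def classifier_inverse_selection(query, first_stage, second_stage, classify):
    """Keep a second-stage document only if it is positive with some first-stage document."""
    cnt = list(first_stage)
    for d_prime in second_stage:
        # P = 1 if there exists a d_i with C(q, d', d_i) = positive, else 0.
        p = any(classify(query, d_prime, d_i) for d_i in first_stage)
        if p:
            cnt.append(d_prime)  # documents with P = 0 are left out of Cnt
    return cnt

classify = make_toy_classifier({frozenset(("d_son", "d_wife"))})
kept = classifier_inverse_selection(
    "Who is the spouse of the child of Peter Andreas Heiberg?",
    first_stage=["d_son"],
    second_stage=["d_wife", "d_consort"],
    classify=classify,
)
print(kept)
```

Because documents are only removed, the context passed to the LLM can end up with fewer than $k$ documents, which matches the "actual numbers" reported alongside recall in Table 5.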
Classifier Forward Selection Unlike the CIS method, the CFS method filters out irrelevant candidate documents during the second-retrieval stage. To achieve this goal, we search for a document $\boldsymbol{d}_{n}$ from $D$ according to the $\langle \boldsymbol{q}, \boldsymbol{d}_{m} \rangle$ pair, and feed the query and both documents into $C$. When the classification result is negative, we exclude this candidate from the current retrieved documents and search for the next candidate that can be classified as positive with $\boldsymbol{d}_{m}$. The whole process is:
\begin{equation}
\begin{split}
P_{i,j} &= \begin{cases} 1 & \text{if } C(\boldsymbol{q}, \boldsymbol{d}_{i}, \boldsymbol{d}^{\prime}_{i,j}) = positive \\ 0 & otherwise \end{cases} \\
Cnt &= Cnt \cup \{\boldsymbol{d}^{\prime}_{i,j} \mid P_{i,j} = 1 \land first\} \\
answer &= \texttt{LLM}(\texttt{Concat}(\boldsymbol{q}, Cnt))
\end{split}
\tag{4}
\end{equation}
where $Cnt = Cnt \cup \{\boldsymbol{d}^{\prime}_{i,j} \mid P_{i,j} = 1 \land first\}$ means that for a given $\boldsymbol{d}_{i}$, the first $\boldsymbol{d}^{\prime}_{i,j}$ in the second-retrieval stage that is classified as $positive$ when combined with $\boldsymbol{d}_{i}$ is considered a dynamic-relevant document and placed into $Cnt$.
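Eq. (4) amounts to walking down each ranked candidate list and stopping at the first candidate the classifier accepts. The sketch below illustrates that loop under stated assumptions: `make_toy_classifier` is a hypothetical lookup-table stub for $C$, `candidates_for` is a made-up mapping from each first-stage document to its ranked second-stage candidates, and skipping candidates that are already in the context is an added safeguard that Eq. (4) does not state explicitly.

```python
def make_toy_classifier(positive_pairs, threshold=0.5):
    """Hypothetical stand-in for C: scores a document pair from a lookup table."""
    def classify(query, doc_a, doc_b):
        score = 1.0 if frozenset((doc_a, doc_b)) in positive_pairs else 0.0
        return score >= threshold  # True means "positive"
    return classify

def classifier_forward_selection(query, first_stage, candidates_for, classify):
    """For each d_i, add the first ranked candidate d' with C(q, d_i, d') = positive."""
    cnt = list(first_stage)
    for d_i in first_stage:
        for d_prime in candidates_for[d_i]:      # candidates in retrieval order
            if d_prime in cnt:                   # assumption: never add a duplicate
                continue
            if classify(query, d_i, d_prime):    # P = 1
                cnt.append(d_prime)
                break                            # "first": stop after one accepted candidate
    return cnt

classify = make_toy_classifier({frozenset(("d_son", "d_wife"))})
selected = classifier_forward_selection(
    "Who is the spouse of the child of Peter Andreas Heiberg?",
    first_stage=["d_son"],
    candidates_for={"d_son": ["d_son", "d_consort", "d_wife"]},
    classify=classify,
)
print(selected)
```

In this toy run the higher-ranked but unhelpful candidate is rejected and the search continues to the next one, whereas the inverse-selection rule would simply have dropped the rejected document without looking further.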
3 Experiment Settings
The experimental details are described in this section. Due to space constraints, the descriptions of the implementation details, retrieval strategy and baselines can be found in Appendix A.1, A.2 and A.3.
3.1 Dataset
We verify the effectiveness of our proposed framework on three multi-hop QA datasets, including HotpotQA, 2Wiki and MuSiQue Yang et al. (2018); Ho et al. (2020); Trivedi et al. (2022b). The datasets require the system to comprehensively collect and contextualize information from multiple documents to answer more complex queries.
Table 3: Results on different LLMs and strategies compared to Adaptive-RAG. We set gpt-3.5-turbo and Llama3-8B as the base LLMs. We emphasize our best results in bold. Top-k means the total number of retrieved documents.
Table 4: Ablation study on HotpotQA by Llama3-8B.
4 Results and Analysis
Table 5: Recall rate and actual numbers under different retrieval strategies. Actual numbers are the numbers of documents that we actually feed into LLMs. A smaller number means fewer redundant documents.
4.1 Main Results
Tables 2 and 3 present the performance of DR-RAG in answering multi-hop queries and highlight the advantages of our approach over state-of-the-art RAG frameworks Jeong et al. (2024); Asai et al. (2024) across multiple metrics, which is in line with our expectations. Table 5 shows the performance of DR-RAG across various retrieval strategies.
As shown in Table 2, when retrieving the same $k$ documents, DR-RAG achieves a higher recall rate and a higher percentage of correct answers. DR-RAG achieves better performance than the other baseline RAG frameworks (Self-RAG and Adaptive-RAG) on all three metrics. Moreover, DR-RAG also requires fewer LLM responses and less time than the other RAG frameworks in QA systems.
4.2 Analysis
Ablation Study We propose two-stage retrieval and classifier selection strategies to mine the dynamic relevance of documents. As shown in Table 3, we apply two classification methods on top of QDC, and the experimental results improve further. Table 4 compares the effect of DR-RAG with and without QDC. Quantitatively, CIS and CFS improve DR-RAG’s performance by 2.3% and 4.7% on the Acc metric over QDC, while without QDC, DR-RAG’s performance on the Acc metric drops by 1.1% and 0.7%. The results demonstrate that the two strategies are able to efficiently extract document relevance and achieve more accurate answers.
Table 6: Results of different classifiers on the HotpotQA dataset with Llama3-8B as the LLM.
Table 7: Results of 500 samples sampled on HotpotQA dataset based on gpt-3.5-turbo and gpt-4-turbo.
Effects of Classifier and LLM Compared to gpt-3.5-turbo, gpt-4-turbo has better document comprehension and captures the critical information needed to answer a query more accurately. Its textual responses are also of higher quality and more accurate in content. Quantitatively, as shown in Table 7, gpt-4-turbo improves by an average of 9.07%, 10.63%, and 12.73% over gpt-3.5-turbo on the three metrics. As shown in Table 6, when switching to different kinds or sizes of classifiers, the difference in the metrics is negligible (the maximum difference in EM, F1, and Acc is less than 2%), which suggests that our approach is applicable to different classifiers and that the choice of classifier has little impact on our framework.
Effects of Recall Rate Whether LLMs can answer a domain-specific query correctly depends almost entirely on whether all the necessary information is included in the prompt context. When relevant information is missing, LLMs, which are prone to hallucination, struggle to answer the query accurately. Table 8 illustrates the answers to a query with and without sufficient information provided to LLMs. As seen in Table 5, on 2Wiki our retrieval strategy already achieves a recall rate of 98% when top-$k$ is 6. When we feed enough relevant information into LLMs, the accuracy of their answers improves accordingly. The CFS method achieves a recall rate 26.4% and 8.6% higher than the BM25 and SM methods, respectively, which demonstrates the feasibility of DR-RAG.
Table 8: Case study with Llama3-8B, where we present the factual error in red and the accurate information in blue.
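The recall rate discussed above can be illustrated with a minimal sketch. It assumes that recall is the fraction of gold supporting documents that appear among the retrieved documents, averaged over queries; the function names and the toy document ids are hypothetical and are not taken from our experiments.

```python
def retrieval_recall(retrieved_ids, gold_ids):
    """Fraction of gold supporting documents found in the retrieved set."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)


def mean_recall(examples):
    """Average per-query recall over (retrieved_ids, gold_ids) pairs."""
    return sum(retrieval_recall(r, g) for r, g in examples) / len(examples)


# Toy example with hypothetical document ids.
examples = [
    (["d1", "d4", "d7"], ["d1", "d7"]),  # both gold documents retrieved
    (["d2", "d3", "d9"], ["d3", "d5"]),  # one of two gold documents retrieved
]
print(mean_recall(examples))  # 0.75
```

Under this definition, a multi-hop query only reaches a recall of 1.0 when every hop's supporting document is retrieved, which is why a missing second-hop document directly limits answer accuracy.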
Effects of Redundant Information We hypothesise that if there is less redundant information in the contextual knowledge, LLMs can better understand the query and hallucinate less. The CIS method is devised to validate this hypothesis. Invalid information may increase by about 30% as the number of documents fed into LLMs increases, and LLMs fail to judge this information when answering. LLMs may therefore refer to redundant information and provide an answer containing incorrect information. The results validate our hypothesis that we should provide LLMs with as little redundant or incorrect information as possible throughout the RAG process. The CIS method is effective in removing redundant information, but it may reduce the quality of responses when the reduction in recall is too large. Conversely, even when we feed all the relevant documents into LLMs, they may still fail to produce the right answer. In Table 5, on the 2Wiki dataset, when the number of documents $k$ provided to LLMs is 4 or 6, there is only a slight increase in recall from CIS to CFS, accompanied by a decrease in the metrics. Therefore, the CFS method is proposed to balance redundant and relevant information.
Increase Recall with Fewer Documents In the CFS method, it is not always possible to find a match for every $\langle \boldsymbol{q}, \boldsymbol{d} \rangle$ pair in the second-retrieval stage, because the documents we need may already have been retrieved. Therefore, there will be cases where the total number of retrieved documents is less than $k$. For instance, on the HotpotQA dataset, when $k$ is set to 6, the average number of documents actually provided to LLMs is 5.35, which reduces irrelevant information to some extent. The CFS method in Table 5 achieves a higher recall rate while retrieving fewer documents than the QDC method. The CFS method also yields higher scores across the three metrics in our experiments and achieves stronger retrieval capability with less redundant input than the other methods.
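The filtering idea behind CIS can be sketched as follows. This is only an illustration under stated assumptions: the keep rule (a document survives if it forms at least one positive pair with another document) is our reading of the pairwise description, and `toy_classifier` is a hypothetical keyword rule standing in for the fine-tuned classifier.

```python
from itertools import combinations


def cis_filter(query, docs, classifier):
    """Keep documents that appear in at least one positively classified pair.

    The keep rule is an assumption made for illustration.
    """
    keep = set()
    for i, j in combinations(range(len(docs)), 2):
        if classifier(query, docs[i], docs[j]):
            keep.update((i, j))
    return [d for idx, d in enumerate(docs) if idx in keep]


def toy_classifier(query, doc_a, doc_b):
    """Hypothetical stand-in: positive if the two documents share a word
    that is not in the query (a crude proxy for a bridging entity)."""
    q = set(query.split())
    bridge = (set(doc_a.split()) & set(doc_b.split())) - q
    return bool(bridge)


query = "when was the cover artist of multiverse born"
docs = [
    "multiverse cover artist is bob eggleton",
    "artist born in paris france",  # redundant distractor
    "bob eggleton born 1960",
]
print(cis_filter(query, docs, toy_classifier))
```

In this toy run the distractor is dropped and the two bridging documents are kept. If no pair is classified as positive, the filter returns an empty list, which mirrors the recall risk noted above.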
Figure 3: QA performance (F1) and time for different RAG frameworks. We use the GPT-3.5-turbo as the base LLM on the multi-hop QA datasets (MuSiQue, HotpotQA and 2Wiki).
Time for One Response Compared to previous RAG frameworks, DR-RAG also achieves better time efficiency over the whole process. Other RAG frameworks may call LLMs multiple times, resulting in high computational cost. In practice, the inference time of LLMs is a worthwhile optimization target: a single LLM call is already slow, and calling LLMs multiple times incurs a prohibitive time overhead. Therefore, we design a small model with relatively few parameters instead of calling LLMs multiple times. In Fig. 3 and Table 2, compared to Adaptive-RAG, we achieve an average 74.2% reduction in time overhead. We conclude that DR-RAG is more efficient and that its low time overhead makes it valuable in applications.
Case Study We conduct a case study to qualitatively compare our DR-RAG against the traditional RAG. Table 8 demonstrates specific inference cases on the multi-hop datasets. For example, on the MuSiQue dataset, our DR-RAG identifies the answer to the query by only using the LLM’s parametric knowledge about ‘partner’. Traditional RAG sometimes generates incorrect responses due to the inclusion of irrelevant information about ‘sister’. Meanwhile, faced with a complex query, DR-RAG can first retrieve static-relevant documents based on ‘cover artist’ and ‘Multiverse: Exploring Poul Anderson’s Worlds’ to get the name ‘Bob Eggleton’. Then, in the second-retrieval stage, by combining the name ‘Bob Eggleton’ with ‘born’ in the query, dynamic-relevant documents can be retrieved to obtain the answer ‘1960’.
5 Conclusion
This paper presents DR-RAG, an innovative RAG framework designed to enhance document retrieval accuracy by leveraging the relevance of different documents in various QA scenarios. Throughout this research, we explore diverse retrieval strategies and conduct comprehensive experimental comparisons. Ultimately, we adopt CFS as the final framework, which not only reduces the number of redundant documents but also achieves the best performance. Additionally, we analyze the utilization of dynamic document relevance under constrained training resources. The experimental results demonstrate that DR-RAG significantly improves answer quality and reduces the time required for QA systems.
6 Limitations
While DR-RAG has demonstrated excellent performance across multiple datasets for multi-hop QA, its implementation requires the prior training of a distinct classifier. It is uncertain whether our classifier will be effective in niche domains. Therefore, DR-RAG can serve as an inspiration for training a classifier with private data. In the future, we will collect more comprehensive data to train a more broadly applicable classifier for various QA tasks.
7 Ethics Statement
DR-RAG substantiates its efficacy in real-world scenarios, which are characterized by diverse user queries. However, given the potential variability in user inputs, which may range from benign to offensive, it is imperative to consider scenarios where inputs might be detrimental. Such instances could facilitate the retrieval of objectionable content and lead to unsuitable responses by retrieval-augmented LLMs. Addressing this concern necessitates the development of robust methodologies to detect and mitigate offensive or inappropriate content in both user inputs and the documents retrieved within the RAG framework. This area represents a critical direction for future research.
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
- Arora et al. (2023) Daman Arora, Anush Kini, Sayak Ray Chowdhury, Nagarajan Natarajan, Gaurav Sinha, and Amit Sharma. 2023. Gar-meets-rag paradigm for zero-shot information retrieval. arXiv preprint arXiv:2310.20158.
- Asai et al. (2024) Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
- Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
- Chan et al. (2024) Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. Rq-rag: Learning to refine queries for retrieval augmented generation. arXiv preprint arXiv:2404.00610.
- Chen et al. (2024) Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. Preprint, arXiv:2402.03216.
- Chen et al. (2023) Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Dong Yu, and Hongming Zhang. 2023. Dense x retrieval: What retrieval granularity should we use? arXiv preprint arXiv:2312.06648.
- Cheng et al. (2023) Daixuan Cheng, Shaohan Huang, Junyu Bi, Yuefeng Zhan, Jianfeng Liu, Yujing Wang, Hao Sun, Furu Wei, Denvy Deng, and Qi Zhang. 2023. Uprise: Universal prompt retrieval for improving zero-shot evaluation. arXiv preprint arXiv:2303.08518.
- Cheng et al. (2024) Xin Cheng, Di Luo, Xiuying Chen, Lemao Liu, Dongyan Zhao, and Rui Yan. 2024. Lift yourself up: Retrieval-augmented text generation with self-memory. Advances in Neural Information Processing Systems, 36.
- Fan et al. (2021) Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2021. Augmenting transformers with knn-based composite memory for dialog. Transactions of the Association for Computational Linguistics, 9:82–99.
- Feng et al. (2024) Zhangyin Feng, Xiaocheng Feng, Dezhi Zhao, Maojin Yang, and Bing Qin. 2024. Retrieval-generation synergy augmented large language models. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 11661–11665. IEEE.
- Ho et al. (2020) Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. arXiv preprint arXiv:2011.01060.
- Jeong et al. (2024) Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. 2024. Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint arXiv:2403.14403.
- Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
- Ke et al. (2024) Zixuan Ke, Weize Kong, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. 2024. Bridging the preference gap between retrievers and llms. arXiv preprint arXiv:2401.06954.
- Khattab et al. (2022) Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. arXiv preprint arXiv:2212.14024.
- Khot et al. (2022) Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406.
- Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
- Liu et al. (2023) Ye Liu, Semih Yavuz, Rui Meng, Meghana Moorthy, Shafiq Joty, Caiming Xiong, and Yingbo Zhou. 2023. Exploring the integration strategies of retriever and large language models. arXiv preprint arXiv:2308.12574.
- Liu et al. (2024) Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, and Bryan Catanzaro. 2024. Chatqa: Surpassing gpt-4 on conversational qa and rag. arXiv preprint arXiv:2401.10225.
- Luo and Surdeanu (2023) Fan Luo and Mihai Surdeanu. 2023. Divide & conquer for entailment-aware multi-hop evidence retrieval. arXiv preprint arXiv:2311.02616.
- Ma et al. (2023) Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting for retrieval-augmented large language models. arXiv preprint arXiv:2305.14283.
- Mallen et al. (2022) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.
- Min et al. (2023) Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
- Muhlgay et al. (2023) Dor Muhlgay, Ori Ram, Inbal Magar, Yoav Levine, Nir Ratner, Yonatan Belinkov, Omri Abend, Kevin Leyton-Brown, Amnon Shashua, and Yoav Shoham. 2023. Generating benchmarks for factuality evaluation of language models. arXiv preprint arXiv:2307.06908.
- NetEase Youdao (2023) Inc. NetEase Youdao. 2023. Bcembedding: Bilingual and crosslingual embedding for rag. https://github.com/netease-youdao/BCEmbedding.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
- Pereira et al. (2023) Jayr Pereira, Robson Fidalgo, Roberto Lotufo, and Rodrigo Nogueira. 2023. Visconde: Multi-document qa with gpt-3 and neural reranking. In European Conference on Information Retrieval, pages 534–543. Springer.
- Press et al. (2022) Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A Smith, and Mike Lewis. 2022. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
- Ren et al. (2023) Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hao Tian, Hua Wu, Ji-Rong Wen, and Haifeng Wang. 2023. Investigating the factual knowledge boundary of large language models with retrieval augmentation. arXiv preprint arXiv:2307.11019.
- Rony et al. (2022) Md Rashad Al Hasan Rony, Ricardo Usbeck, and Jens Lehmann. 2022. Dialokg: Knowledge-structure aware task-oriented dialogue generation. arXiv preprint arXiv:2204.09149.
- Shao et al. (2023) Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. 2023. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. arXiv preprint arXiv:2305.15294.
- Shi et al. (2023) Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.
- Sun et al. (2022) Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, and Denny Zhou. 2022. Recitation-augmented language models. arXiv preprint arXiv:2210.01296.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Trivedi et al. (2022a) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022a. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509.
- Trivedi et al. (2022b) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022b. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.
- Trivedi et al. (2023) Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037, Toronto, Canada. Association for Computational Linguistics.
- Wang et al. (2024) Yu Wang, Nedim Lipka, Ryan A Rossi, Alexa Siu, Ruiyi Zhang, and Tyler Derr. 2024. Knowledge graph prompting for multi-document question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19206–19214.
- Wang et al. (2023) Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377.
- Xu et al. (2023) Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng, and Tat-Seng Chua. 2023. Search-in-the-chain: Towards accurate, credible and traceable large language models for knowledge-intensive tasks. CoRR, vol. abs/2304.14732.
- Yang et al. (2018) Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.
- Yu et al. (2022) Wenhao Yu, Dan Iter, Shuohang Wang, Yichong Xu, Mingxuan Ju, Soumya Sanyal, Chenguang Zhu, Michael Zeng, and Meng Jiang. 2022. Generate rather than retrieve: Large language models are strong context generators. arXiv preprint arXiv:2209.10063.
- Yu et al. (2023) Zichun Yu, Chenyan Xiong, Shi Yu, and Zhiyuan Liu. 2023. Augmentation-adapted retriever improves generalization of language models as generic plug-in. arXiv preprint arXiv:2305.17331.
- Zaheer et al. (2021) Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2021. Big bird: Transformers for longer sequences. Preprint, arXiv:2007.14062.
- Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren’s song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
- Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
Appendix A Appendix
A.1 Implementation Details
We follow the standard evaluation approach Jeong et al. (2024) and validate our DR-RAG for QA systems with multiple metrics, including F1, EM, and Accuracy (Acc). These metrics provide an objective measure of the agreement between the predictions and the ground truth. Efficiency is another issue we have to tackle. Most existing RAG frameworks Asai et al. (2024); Jeong et al. (2024) require multiple calls to LLMs for inference. Therefore, we also report the number of LLM inferences and the time required for responses. To eliminate the effects of different LLMs, we select gpt-3.5-turbo Achiam et al. (2023); Brown et al. (2020) and Llama3-8B Liu et al. (2024) as base LLMs, and obtain the answers to queries based on the retrieved documents. For the classifier $C$, we fine-tune bigbird-roberta-base Zaheer et al. (2021), which accommodates longer input sequences, on the entire training set. Due to the imbalance between positive and negative samples in the datasets, we sample positive and negative examples to construct the datasets with a ratio of 1:1. In addition, we sample about 2300 pieces of data from each dataset, which exceeds the sample size of the existing experiment Jeong et al. (2024).
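As a rough illustration of the three answer metrics, the sketch below uses the common SQuAD-style definitions: EM compares normalized strings, F1 measures token overlap, and Acc checks whether the normalized gold answer is contained in the prediction. These definitions and the helper names are assumptions for illustration and may differ in detail from the evaluation scripts we used.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))


def f1_score(pred, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def accuracy(pred, gold):
    """Assumed definition: 1.0 if the gold answer appears in the prediction."""
    return float(normalize(gold) in normalize(pred))


pred, gold = "He was born in 1960", "1960"
print(exact_match(pred, gold), round(f1_score(pred, gold), 3), accuracy(pred, gold))
```

For a verbose prediction such as the one above, EM is 0 and F1 is low while Acc is 1, which is one reason to report all three metrics together.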
A.2 Retrieval Strategy
DR-RAG aims to solve the problem of low recall in document retrieval. Therefore, five different retrieval strategies are designed to verify the effectiveness of our proposed DR-RAG.
- BM25: A method to measure the relevance between $\boldsymbol{q}$ and the documents.
- SM: The retrieval documents are embedded and stored in $D$, and the similarity between $\boldsymbol{q}$ and $D$ is calculated to extract the $k$ most relevant documents.
- QDC: We first retrieve $k_1$ documents from $D$, concatenate $\boldsymbol{q}$ with each of them to form multiple $\langle \boldsymbol{q}, \boldsymbol{d} \rangle$ pairs, and retrieve the $k_2$ most relevant documents for each pair until the number of retrieved documents reaches $k$.
- CIS: To minimize document redundancy in QDC, all $k$ retrieved documents are combined pairwise, concatenated with $\boldsymbol{q}$, and then fed into $C$ to filter out irrelevant documents.
- CFS: To remove irrelevant dynamic-relevant documents, after retrieving $k_1$ documents, the $\langle \boldsymbol{q}, \boldsymbol{d} \rangle$ pairs are matched one by one against the remaining documents by similarity. At the same time, each match is fed into $C$ for classification. If it is classified as negative, the process moves on to the next document. Otherwise, the positive instance is included in the document set.
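The QDC strategy in the list above can be sketched as a two-stage loop. The sketch is illustrative only: a bag-of-words cosine similarity stands in for the embedding model, the toy corpus paraphrases the ‘Bob Eggleton’ case study, and the names `embed`, `cosine`, `retrieve`, and `qdc` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, corpus, k, exclude=()):
    """Return the k documents most similar to the query, skipping `exclude`."""
    q = embed(query)
    candidates = [d for d in corpus if d not in exclude]
    return sorted(candidates, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def qdc(query, corpus, k1, k2, k):
    """Stage 1: top-k1 for q. Stage 2: top-k2 for each concat(q, d), up to k docs."""
    context = retrieve(query, corpus, k1)
    for d in list(context):  # iterate over first-stage documents only
        for cand in retrieve(f"{query} {d}", corpus, k2, exclude=context):
            if len(context) >= k:
                return context
            context.append(cand)
    return context


corpus = [
    "multiverse cover artist is bob eggleton",
    "bob eggleton born 1960",
    "poul anderson wrote science fiction",
    "artist born in paris france",
]
query = "when was the cover artist of multiverse born"
print(retrieve(query, corpus, 2))            # single-query retrieval picks the distractor
print(qdc(query, corpus, k1=1, k2=1, k=2))   # concatenated query reaches the second hop
```

In this toy setting, plain similarity retrieval with the query alone ranks the distractor above the ‘born 1960’ document, while concatenating the query with the first-stage document surfaces it, which is the dynamic relevance that QDC relies on.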
A.3 Baseline
We conduct a comprehensive comparison of our retrieval strategies against other RAG frameworks. In DR-RAG, we calculate the recall with different retrieval strategies and then evaluate the accuracy of the answers. Therefore, we select BM25 and SM methods Lewis et al. (2020); Chan et al. (2024) as baselines. Moreover, we choose self-RAG Asai et al. (2024) and Adaptive-RAG Jeong et al. (2024), which are effective RAG frameworks for multi-hop QA, to validate the performance of our DR-RAG. In addition, we add the experimental results of Non-retrieval, original RAG and multi-step approach Trivedi et al. (2023) to enrich our comparisons.
A.4 Related Works
A.4.1 RAG for Multi-hop QA
RAG is a popular framework for LLMs and has received much attention in many tasks, such as QA systems. RAG Lewis et al. (2020) combined a sequence-to-sequence model with external knowledge bases to significantly improve the quality of question-answering and summarization tasks. Decomposing a complex query Khattab et al. (2022); Press et al. (2022); Pereira et al. (2023); Khot et al. (2022); Zhou et al. (2022) into a series of simpler sub-queries typically requires multiple calls to LLMs, resulting in high computational cost. Adaptive-RAG Jeong et al. (2024) evaluated the complexity of the query with a classifier and selected the most appropriate retrieval strategy based on the classification results. RQ-RAG Chan et al. (2024) aimed to improve the performance of models by optimising the search query, including rewriting, decomposition and disambiguation. However, it would be inefficient to access LLMs multiple times for each query and unreliable to retrieve all dynamic-relevant documents by a single query.
A.4.2 Retriever in RAG
The retriever is the key component of a RAG system: it determines whether relevant and timely contexts can be obtained from external knowledge bases to alleviate the hallucination of LLMs. Fan et al. (2021) combined $K$ Nearest Neighbor (KNN) retrieval with a traditional transformer model to dynamically access historical data and provide enough information through a composite memory. Cheng et al. (2024) proposed Selfmem to make the generated text more relevant to the retrieved information through a self-memory mechanism. Recent research has highlighted the potential of LLMs to provide supervision signals for training retrieval components, or even to serve as retrieval components themselves. These findings provide us with new avenues for exploring the ability of retrievers to improve the efficiency of information retrieval based on document relevance. In our work, we retrieve multiple relevant documents based on the query with a two-stage strategy and design a classifier to determine whether the documents can answer the query; the remaining relevant documents are fed into LLMs with the query to obtain the answer.
A.5 More Analysis
Optimization under Resource Constraints The document relevance classifier $C$ requires certain hardware conditions and resources for data annotation. However, the QDC method indicates that dynamic document relevance can still be utilized without $C$. As seen in Table 5, compared to the SM method, across all datasets, when top-$k$ is 4 or 6, there is a significant increase in retrieval recall of 3.84%. When top-$k$ is 3, there is also a 6% increase on HotpotQA and a slight increase on the other two datasets. This suggests that, by making reasonable choices about top-$k$, retrieval performance can be optimized by leveraging the dynamic relevance of documents even when resources are limited, thus improving LLMs’ performance in QA tasks.
More Cases Table 9 shows the prompts we provide to LLMs. Contexts contain the documents (Document $i$) after retrieval and selection. Moreover, we show the case of the classifier for selection in the CIS and CFS methods in Table 10, the output cases compared to Adaptive-RAG in Table 11, the case of documents retrieved by the QDC and CFS methods in Table 12, and the case of documents retrieved by the QDC and CIS methods in Table 13.
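For illustration, the prompt layout of Table 9 (an instruction line, a Contexts section, and numbered Document $i$ entries) could be assembled as below. The instruction line is quoted from Table 9; the function name `build_prompt` and the final `Question:` line are assumptions made for this sketch.

```python
INSTRUCTION = (
    "You are a reading comprehension expert, and you need to complete "
    "a reading comprehension task."
)


def build_prompt(query, documents):
    """Assemble the instruction, the selected documents, and the query."""
    lines = [INSTRUCTION, "Contexts"]
    for i, doc in enumerate(documents, start=1):
        lines.append(f"Document {i}:")
        lines.append(doc)
    # Assumed suffix: the exact wording used to pose the question is hypothetical.
    lines.append(f"Question: {query}")
    return "\n".join(lines)


prompt = build_prompt(
    "when was the cover artist of multiverse born",
    ["multiverse cover artist is bob eggleton", "bob eggleton born 1960"],
)
print(prompt)
```

Because the documents are selected before the prompt is built, the LLM is called once per query with this single prompt.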
Algorithm 1 Classifier Forward Selection (CFS)
Input: classifier $C$; retrieval function $\texttt{Retriever}$; query $\boldsymbol{q}$
Output: generated response $answer$
Initialize empty context: $Cnt = \{\}$
Retrieve $k_1$ documents: $\{\boldsymbol{d}_1, \boldsymbol{d}_2, \ldots, \boldsymbol{d}_{k_1}\} = \texttt{Retriever}(\boldsymbol{q})$
Update context: $Cnt = Cnt \cup \{\boldsymbol{d}_1, \boldsymbol{d}_2, \ldots, \boldsymbol{d}_{k_1}\}$
for $i = 1$ to $k_1$ do
    Construct a new query: $\boldsymbol{q}_i^{*} = \texttt{concat}(\boldsymbol{q}, \boldsymbol{d}_i)$
end for
Retrieve full set of documents for each new query: $\{\boldsymbol{d}'_{i,1}, \boldsymbol{d}'_{i,2}, \ldots, \boldsymbol{d}'_{i,k_2}\} = \texttt{Retriever}(\boldsymbol{q}_i^{*})$, for $i = 1, 2, \ldots, k_1$
for $i = 1$ to $k_1$ do
    for $j = 1$ to $k_2$ do
        if $\boldsymbol{d}'_{i,j} \notin Cnt$ and $C(\boldsymbol{q}, \boldsymbol{d}_i, \boldsymbol{d}'_{i,j}) = positive$ then
            Update context: $Cnt = Cnt \cup \{\boldsymbol{d}'_{i,j}\}$
        end if
    end for
end for
Combine the input question with the updated context: $input = \texttt{concat}(\boldsymbol{q}, Cnt)$
Generate the answer using a large language model: $answer = \texttt{LLM}(input)$
return $answer$
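The selection loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than our implementation: the retriever and classifier are passed in as callables, and the toy versions below (word-overlap ranking and a keyword rule) are hypothetical stand-ins for the embedding retriever and the fine-tuned classifier $C$. The final LLM call is omitted.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]
Classifier = Callable[[str, str, str], bool]


def cfs_select(query: str, retriever: Retriever, classifier: Classifier,
               k1: int, k2: int) -> List[str]:
    """Build the context Cnt: first-stage documents plus accepted second-stage ones."""
    context = list(retriever(query, k1))
    for d in list(context):  # first-stage documents d_1..d_k1
        for cand in retriever(f"{query} {d}", k2):  # second stage on concat(q, d_i)
            if cand not in context and classifier(query, d, cand):
                context.append(cand)
    return context


CORPUS = [
    "multiverse cover artist is bob eggleton",
    "bob eggleton born 1960",
    "poul anderson wrote science fiction",
    "artist born in paris france",
]


def toy_retriever(query: str, k: int) -> List[str]:
    """Hypothetical retriever: rank documents by number of words shared with the query."""
    q = set(query.split())
    return sorted(CORPUS, key=lambda d: len(q & set(d.split())), reverse=True)[:k]


def toy_classifier(query: str, doc: str, cand: str) -> bool:
    """Hypothetical classifier: positive if doc and cand share a non-query word."""
    return bool((set(doc.split()) & set(cand.split())) - set(query.split()))


query = "when was the cover artist of multiverse born"
print(cfs_select(query, toy_retriever, toy_classifier, k1=1, k2=3))
```

In this toy run the second stage proposes three candidates, but only the ‘born 1960’ document is accepted, so the context holds two documents. This mirrors the observation that CFS may pass fewer than $k$ documents to the LLM.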
Table 9: A case of our prompt provided to LLMs.
| You are a reading comprehension expert, and you need to complete a reading comprehension task. |
|---|
| —————————————— |
| Contexts |
| Document 1: |
| Walk, Don’t Run is a 1966 Technicolor comedy film directed by Charles Walters and starring Cary Grant in his final film role, Samantha Eggar, and Jim Hutton. The film is a remake of the 1943 film "The More the Merrier" and is set during the Olympic Games |
| Document 2: |
| Douglas Sirk( born Hans Detlef Sierck; 26 April 1897 – 14 January 1987) was a German film director best known for his work in Hollywood melodramas of the 1950s. Sirk started his career in Germany as a stage and screen director, but he left to Hollywood in 1937 because his Jewish wife was persecuted by the Nazis. In the 1950s, he achieved his greatest commercial success with film melodramas like "Imitation of Life All That Heaven Allows Written on the WindMagnificent Obsession" and "A Time to Love and a Time to Die". While those films were initially panned by critics as sentimental women’s pictures, they are today widely regarded by film directors, critics and scholars as masterpieces. His work is seen as "critique of the bourgeoisie in general and of 1950s America in particular", while painting a" compassionate portrait of characters trapped by social conditions". Beyond the surface of the film, Sirk worked with complex mise enscenes and lush Technicolor colors to subtly underline his message. |
| Document 3: |
| The Mall, The Merrier is a 2019 Philippine musical family comedy film directed by Barry Gonzales, starring Vice Ganda and Anne Curtis. The film is co-produced by Star Cinema and Viva Films under the working title" Momalland". The film premiered in Philippine cinemas on December 25, 2019 as one of the official entries to the 2019 Metro Manila Film Festival. " The Mall, The Merrier" marks the first on- screen collaboration between Anne Curtis and Vice Ganda, both of whom are regular hosts in the noontime variety show" It’s Showtime". |
| Document 4: |
| Robert Wallace Russell( January 19, 1912 – February 11, 1992) was an American writer for movies, plays, and documentaries. He was nominated for two Academy Awards for Best Writing, Original Story and Best Writing, Screenplay on the 1943 film "The More the Merrier". He died in 1992 in New York City, shortly after his 80th birthday. |
| Document 5: |
| Sleep, My Love is a 1948 American film noir directed by Douglas Sirk and starring Claudette Colbert, Robert Cummings and Don Ameche. |
| Document 6: |
| The More the Merrier is a 1943 American comedy film made by Columbia Pictures which makes fun of the housing shortage during World War II, especially in Washington, D.C. The picture stars Jean Arthur, Joel McCrea and Charles Coburn. The movie was directed by George Stevens. The film was written by Richard Flournoy, Lewis R. Foster, Frank Ross, and Robert Russell, from" Two’s a Crowd", an original story by Garson Kanin( uncredited). This film was remade in 1966 as" Walk, Don’t Run", with Cary Grant, Samantha Eggar and Jim Hutton. |
| —————————————— |
| After reading the documents above, answering the following question. Reasoning step by step. At last, you should output the final result via the following format: |
| Answer: ; |
| Please answer the question directly. |
| —————————————— |
| Question |
| Which film has the director who died later, The More The Merrier or Sleep, My Love? |
| —————————————— |
| Give your analysis process first, and then output your answer in a specified format. |
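
A prompt with the layout of Table 9 could be assembled, and the final answer read back from the model output, roughly as in the sketch below. It is an illustration under stated assumptions, not the authors' code: `build_prompt`, `parse_answer`, and `SEPARATOR` are hypothetical names, the dashed separator is a stand-in for the rule lines in the table, and the regular expression assumes the model ends its response with a line of the form `Answer: ...;`.

```python
import re
from typing import List, Optional

SEPARATOR = "-" * 42  # stand-in for the horizontal rules shown in Table 9


def build_prompt(question: str, documents: List[str]) -> str:
    """Assemble a prompt in the order used in Table 9: role, contexts, instructions, question."""
    lines = [
        "You are a reading comprehension expert, and you need to complete "
        "a reading comprehension task.",
        SEPARATOR,
        "Contexts",
    ]
    # Number the retrieved documents from 1, as in the table.
    for n, doc in enumerate(documents, start=1):
        lines.append(f"Document {n}:")
        lines.append(doc)
    lines += [
        SEPARATOR,
        # Instruction text is copied verbatim from Table 9.
        "After reading the documents above, answering the following question. "
        "Reasoning step by step. At last, you should output the final result "
        "via the following format:",
        "Answer: ;",
        "Please answer the question directly.",
        SEPARATOR,
        "Question",
        question,
        SEPARATOR,
        "Give your analysis process first, and then output your answer "
        "in a specified format.",
    ]
    return "\n".join(lines)


def parse_answer(response: str) -> Optional[str]:
    """Return the text of the last 'Answer: ...;' occurrence, or None if there is none."""
    matches = re.findall(r"Answer:\s*(.*?)\s*;", response)
    return matches[-1] if matches else None
```

Taking the last match is a deliberate choice in this sketch: the model is asked to reason first, so earlier text may mention the word "Answer:" before the final formatted line.
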
Table 10: A case of the classifier we train for selection in the CIS and CFS methods. Relevant information that can influence the classification result is marked in blue.
Table 11: Cases in which the query is answered correctly by DR-RAG but not by Adaptive-RAG. Wrong answers are shown in red and right answers in blue.
Table 12: A case of documents retrieved by QDC and CFS on the MuSiQue dataset, where the necessary documents are in blue, and the top-$k$ is 4.
Table 13: A case of documents retrieved by QDC and CIS on the HotpotQA dataset, where the necessary documents are in blue, and the top-$k$ is 4.