RAG Architecture

Last Updated : 9 May, 2026

Retrieval-Augmented Generation (RAG) is an architecture that enhances LLMs by combining them with external knowledge sources. This gives the model access to up-to-date, domain-specific information, which makes responses more accurate and relevant and reduces hallucinations.

Figure: What is RAG

1. Retrieval Component

The retrieval component identifies the data most relevant to a query so that the generator can produce an accurate response. Dense Passage Retrieval (DPR) is a commonly used retrieval model.

2. Generative Component

After retrieval, the relevant data is passed to the generative model (like BART or GPT), which combines it with the query to generate the final response.

**FiD vs. FiE**

| Aspect | Fusion-in-Decoder (FiD) | Fusion-in-Encoder (FiE) |
|---|---|---|
| Fusion Point | Fusion occurs in the decoding phase. | Fusion happens in the encoding phase, before decoding. |
| Process Separation | Retrieval and generation are kept separate. | Retrieval and generation are processed together. |
| Efficiency | Slower due to separate retrieval and generation steps. | Faster due to simultaneous processing in the encoder phase. |
| Complexity | More complex | Simpler |
| Performance | Higher-quality responses | Quicker response generation |
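To make the "fusion point" difference more concrete, here is a minimal sketch of how the encoder inputs might be laid out in each case. It assumes the usual description of FiD, where each retrieved passage is paired with the question and encoded on its own, and a FiE-style setup where everything is concatenated into one input. No model is run, and the helper names `fid_encoder_inputs` and `fie_encoder_input` are hypothetical, not part of any library.

```python
# Toy illustration of where fusion happens. No model is run here;
# the helper names below are hypothetical, not a library API.

def fid_encoder_inputs(question, passages):
    """Fusion-in-Decoder style: one encoder input per passage.

    Each (question, passage) pair would be encoded on its own; the
    decoder would later attend over all encoded passages together.
    """
    return [f"question: {question} context: {p}" for p in passages]


def fie_encoder_input(question, passages):
    """Fusion-in-Encoder style: a single joint encoder input.

    All passages are concatenated with the question, so the encoder
    sees them together before decoding starts.
    """
    joined = " ".join(f"context: {p}" for p in passages)
    return f"question: {question} {joined}"


question = "What is RAG?"
passages = [
    "RAG combines retrieval and generation.",
    "FAISS enables fast similarity search.",
]

fid_inputs = fid_encoder_inputs(question, passages)
fie_input = fie_encoder_input(question, passages)

print(len(fid_inputs))  # one encoder input per passage
print(fie_input)        # one encoder input in total
```

The only point of the sketch is the shape of the inputs: a list of separate sequences in the FiD case versus one long sequence in the FiE case.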

Working

RAG follows a structured workflow where a query is processed, relevant information is retrieved and a final response is generated using both retrieved data and model knowledge.

Figure: Retrieval-Augmented Generation architecture

  1. **Query Processing:** The input query is first pre-processed and prepared for further steps, ensuring it is in a suitable form for embedding.
  2. **Embedding Model:** The query is passed through an embedding model that converts it into a vector capturing its semantic meaning.
  3. **Vector Database Retrieval:** This vector is used to search a vector database to find documents that are most similar to the query.
  4. **Retrieved Contexts:** The system retrieves the documents that are closest to the query. These documents are then forwarded to the generative model to help it craft a response.
  5. **LLM Response Generation:** The LLM combines the original query with the retrieved context to generate a coherent and accurate response.
  6. **Response:** The final response integrates both the model’s internal knowledge and the retrieved information, making it more relevant and up-to-date.
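The flow above can be sketched with nothing but the standard library. This is a toy under stated assumptions: the "embedding" is a plain word-count vector rather than a learned model, the "vector database" is a Python list, and the LLM call is left out, so the sketch stops at the prompt that would be sent to the generator. All function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Stand-in for a vector database: a small list of documents
DOCS = [
    "RAG combines retrieval and generation.",
    "FAISS enables fast similarity search.",
    "Embeddings represent semantic meaning.",
]


def preprocess(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Step 2: toy 'embedding' as a sparse word-count vector."""
    return Counter(preprocess(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, top_k=2):
    """Steps 3-4: rank documents by similarity and keep the closest."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, contexts):
    """Input to step 5: the query combined with the retrieved context."""
    return "Question: " + query + "\nContext: " + "\n".join(contexts) + "\nAnswer:"


contexts = retrieve("What is RAG?", DOCS, top_k=1)
prompt = build_prompt("What is RAG?", contexts)
print(prompt)
```

A real system would swap `embed` for a learned embedding model and pass `prompt` to an LLM, but the order of the stages stays the same.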

Implementation

This example demonstrates how RAG works by combining vector search with language models to generate accurate responses.

Step 1: Install Dependencies

We will install the required libraries and import the LangChain helpers used in later steps:

!pip install faiss-cpu
!pip install sentence-transformers
!pip install transformers
!pip install langchain==0.1.16

from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate

Step 2: Initialize Vector Index and Add Embeddings

Create a FAISS vector index and store the document embeddings in it.

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Sentence embedding model (produces 384-dimensional vectors)
embed_model = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "RAG combines retrieval and generation.",
    "It reduces hallucinations using external data.",
    "FAISS enables fast similarity search.",
    "Embeddings represent semantic meaning.",
    "RAG improves LLM accuracy."
]

# encode() returns a NumPy array of shape (num_documents, embedding_dim)
doc_embeddings = embed_model.encode(documents)

# Exact (brute-force) L2 index; FAISS expects float32 input
index = faiss.IndexFlatL2(doc_embeddings.shape[1])
index.add(np.array(doc_embeddings).astype('float32'))

print(f"Indexed {index.ntotal} documents.")

**Output:**

Indexed 5 documents.
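For intuition about what a flat L2 index does, the sketch below performs the same kind of exhaustive search in plain Python: it scores every stored vector against the query by squared Euclidean distance (the metric `IndexFlatL2` reports) and keeps the closest ones. The function names and the tiny 2-D vectors are invented for illustration; FAISS does this far more efficiently on real embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, top_k=3):
    """Exhaustive search: score every stored vector, keep the closest."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:top_k]
    distances = [d for d, _ in top]
    indices = [i for _, i in top]
    return distances, indices


# Hypothetical 2-D "embeddings", small enough to check by hand
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
distances, indices = flat_l2_search(vectors, [0.9, 0.1], top_k=2)

print(indices)    # positions of the two nearest stored vectors
print(distances)  # their squared distances, smallest first
```

Because every vector is compared with the query, the result is exact, at the cost of work that grows linearly with the number of stored vectors.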

Step 3: Define Semantic Search Function

def semantic_search(query_embedding, top_k=3):
    # index.search returns (distances, indices), each of shape (n_queries, top_k)
    distances, indices = index.search(query_embedding, top_k)
    return indices

Step 4: Query Embedding and Retrieval

query_embedding = embed_model.encode(["What is RAG?"]).astype('float32')
retrieved_indices = semantic_search(query_embedding)

print("Retrieved document indices for query:", retrieved_indices)

**Output:**

Retrieved document indices for query: [[0 4 1]]
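The indices are row positions in the `documents` list, one row per query. A small sketch of turning them back into text follows, using the indices from the output above as a plain list. The helper `indices_to_texts` is hypothetical; the `-1` check is a precaution, since FAISS pads a result row with `-1` when it finds fewer than `top_k` neighbours, and Python would otherwise read `-1` as "last element".

```python
documents = [
    "RAG combines retrieval and generation.",
    "It reduces hallucinations using external data.",
    "FAISS enables fast similarity search.",
    "Embeddings represent semantic meaning.",
    "RAG improves LLM accuracy.",
]

# One row per query, as in the printed output above
retrieved_indices = [[0, 4, 1]]


def indices_to_texts(row, docs):
    """Look up document texts, skipping -1 placeholders."""
    return [docs[i] for i in row if i != -1]


context_texts = indices_to_texts(retrieved_indices[0], documents)
for text in context_texts:
    print(text)
```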

Step 5: Initialize Tokenizer and LLM Model

from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Use a GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

**Output:**

Figure: Model Loading and Training

Step 6: Create Prompt with Retrieval Context

prompt_template = PromptTemplate(
    input_variables=["question", "context"],
    template="Question: {question}\nContext: {context}\nAnswer:"
)
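To see the text the model will actually receive, the template can be rendered by hand. The sketch below uses Python's built-in `str.format` as a stand-in, on the assumption that it fills this simple template the same way; the two context lines are taken from the `documents` list in Step 2.

```python
# Same template string as in the PromptTemplate above
TEMPLATE = "Question: {question}\nContext: {context}\nAnswer:"

# Retrieved documents are joined with newlines to form the context
context = "\n".join([
    "RAG combines retrieval and generation.",
    "RAG improves LLM accuracy.",
])

prompt = TEMPLATE.format(question="What is RAG?", context=context)
print(prompt)
```

The prompt ends with `Answer:` so that the language model continues the text from that point.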

Step 7: Initialize Memory and Build Chat Function

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=False
)

def chat(question):
    # Embed the question and retrieve the closest documents
    query_embedding = embed_model.encode([question]).astype("float32")
    retrieved_indices = semantic_search(query_embedding)

    context_texts = [documents[i] for i in retrieved_indices[0]]
    context = "\n".join(context_texts)

    # Earlier turns as a single string. The template from Step 6 has no
    # slot for it, so it is loaded here but not injected into the prompt.
    chat_history = memory.load_memory_variables({}).get("chat_history", "")

    prompt = prompt_template.format(
        question=question,
        context=context
    )

    inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)

    outputs = model.generate(
        inputs,
        max_new_tokens=80,
        pad_token_id=tokenizer.eos_token_id
    )

    # generate() returns the prompt tokens followed by the new tokens,
    # so slice off the prompt before decoding
    new_tokens = outputs[0][inputs.shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Record this turn in memory
    memory.chat_memory.add_user_message(question)
    memory.chat_memory.add_ai_message(response)

    return response
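In the chat function above, the history is read from memory, but the prompt only has slots for the question and the context. One possible way to let earlier turns influence the answer is sketched below with a hypothetical stand-in class, `SimpleBufferMemory`, which is not LangChain's implementation. The AI reply stored here is a hand-written placeholder, not model output.

```python
class SimpleBufferMemory:
    """Hypothetical stand-in for a conversation buffer (not LangChain's class)."""

    def __init__(self):
        self.turns = []  # list of (speaker, text) pairs in order

    def add_user_message(self, text):
        self.turns.append(("Human", text))

    def add_ai_message(self, text):
        self.turns.append(("AI", text))

    def as_text(self):
        """Return the whole conversation as one string."""
        return "\n".join(f"{who}: {text}" for who, text in self.turns)


def build_prompt(question, context, history):
    """Prepend the history, if any, to the question/context prompt."""
    parts = []
    if history:
        parts.append("Chat history:\n" + history)
    parts.append(f"Question: {question}\nContext: {context}\nAnswer:")
    return "\n".join(parts)


memory = SimpleBufferMemory()

# First turn: the buffer is empty, so the prompt has no history section
first = build_prompt(
    "What is RAG?", "RAG combines retrieval and generation.", memory.as_text()
)
memory.add_user_message("What is RAG?")
memory.add_ai_message("RAG pairs a retriever with a generator.")  # placeholder reply

# Second turn: the stored turns now appear at the top of the prompt
second = build_prompt(
    "How does retrieval help LLMs?",
    "It reduces hallucinations using external data.",
    memory.as_text(),
)
print(second)
```

With a small model such as GPT-2, the history competes with the retrieved context for a limited input window, so a real system would likely cap or summarize it.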

Step 8: Generate Response

We will now run the system on two questions. Each turn is also recorded in memory.

print(chat("What is RAG?"))
print(chat("How does retrieval help LLMs?"))

**Output:**

Figure: Result


Advantages