PDF Summarizer using RAG

Last Updated : 4 May, 2026

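A PDF summarizer automatically processes the text inside PDF files and produces concise summaries or answers to queries, saving users the time and effort required to read lengthy documents. This is useful for research papers, reports, manuals or any long-form content. We can use Retrieval-Augmented Generation (RAG), which integrates two AI concepts: retrieval, which finds the passages in the document that are relevant to a query, and generation, in which a language model writes the answer from those passages.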

This combination allows the system to provide more accurate, context-driven and up-to-date responses by grounding them in real document data rather than only relying on pre-trained model knowledge.

Workflow of PDF Summarizer

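Let's build a PDF Summarizer using RAG, but first let's look at its workflow.

In the workflow, the PDF is uploaded and loaded page by page, the text is split into overlapping chunks, each chunk is converted into an embedding and stored in a FAISS vector store, and at query time the most similar chunks are retrieved and passed to a local language model that generates the answer.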

Implementation

Step 1: Install the Dependencies

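We install the required packages. The accelerate package is included because the device_map argument used in Step 9 depends on it.

```python
!pip install langchain langchain-community pypdf sentence-transformers faiss-cpu transformers accelerate
```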

Step 2: Import Required Libraries and Configure Logging

We import all the library components needed for file uploads, document loading, text splitting, embedding generation, vector-based search, language model interaction and logging.

```python
import os
import logging

from google.colab import files
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# Log progress messages at INFO level
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
```

Step 3: Define the RAG System Class

We define a class to keep all components, i.e., documents, embeddings, vector store, language model and QA chain, organized and accessible.

```python
class LocalRAGSystem:
    def __init__(self):
        self.documents = []        # pages loaded from the PDFs
        self.document_chunks = []  # chunks produced by split_documents
        self.vector_store = None
        self.embeddings = None
        self.llm = None
        self.qa_chain = None
```

Step 4: Upload the PDFs

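We upload the PDFs to be summarized using Colab's file-upload widget. This method, like those in the following steps, is defined inside the LocalRAGSystem class.

```python
    def upload_pdfs(self):
        uploaded = files.upload()  # opens the Colab upload dialog
        pdf_paths = list(uploaded.keys())
        logger.info(f"Uploaded PDFs: {pdf_paths}")
        return pdf_paths
```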

**Output:**

[Figure: Upload]

Step 5: Load and Parse PDF Documents

```python
    def load_documents(self, pdf_paths):
        for pdf_path in pdf_paths:
            loader = PyPDFLoader(pdf_path)
            documents = loader.load()  # one Document per page
            self.documents.extend(documents)
        logger.info(f"Loaded {len(self.documents)} pages in total.")
```

Step 6: Split Documents into Chunks for Embeddings

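Here we split the loaded pages into chunks of at most chunk_size characters (1000 by default), with chunk_overlap characters (200 by default) shared between neighbouring chunks so that context is not lost at chunk boundaries.

```python
    def split_documents(self, chunk_size=1000, chunk_overlap=200):
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        self.document_chunks = text_splitter.split_documents(self.documents)
        logger.info(f"Split into {len(self.document_chunks)} chunks.")
```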
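To make the effect of chunk_size and chunk_overlap concrete, here is a minimal standard-library sketch of fixed-size chunking with overlap. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraph and line breaks); the function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Illustrative input: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 characters
```

A larger overlap preserves more context across boundaries but produces more chunks to embed and store.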

Step 7: Setup Embedding Model for Vector Store

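Here we load the sentence-transformers/all-MiniLM-L6-v2 model through HuggingFaceEmbeddings. It converts each chunk of text into a numeric vector so that chunks can be compared by similarity.

```python
    def setup_embeddings(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.embeddings = HuggingFaceEmbeddings(model_name=model_name)
        logger.info(f"Embedding model {model_name} loaded.")
```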

**Output:**

[Figure: Model Loading]

Step 8: Create a Vector Store Using FAISS

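Here we embed every chunk and build a FAISS index from the resulting vectors, so that the chunks closest to a query can be found quickly.

```python
    def create_vector_store(self):
        self.vector_store = FAISS.from_documents(
            self.document_chunks, self.embeddings)
        logger.info("Created the FAISS vector store.")
```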
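Conceptually, a vector store answers one question: which stored vectors are nearest to the query vector? The sketch below shows a brute-force version of that search using only the standard library. The three-dimensional vectors are made up for illustration (all-MiniLM-L6-v2 produces 384-dimensional vectors), the choice of Euclidean distance is an assumption, and the helper name top_k is hypothetical. FAISS performs the same kind of search with optimized index structures.

```python
import math


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Return the indices of the k stored vectors closest to the query."""
    scored = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort()  # smallest distance first
    return [idx for _, idx in scored[:k]]


# Made-up "embeddings" standing in for four document chunks
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.2, 0.0]

nearest = top_k(query, toy_vectors, k=2)
print(nearest)  # [1, 0]
```

The returned indices identify which chunks would be handed to the language model as context.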

Step 9: Setup a Local Language Model

```python
    def setup_local_llm(self, model_id="google/flan-t5-base", device="auto"):
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map=device)
        pipe = pipeline(
            "text2text-generation",
            model=model,
            tokenizer=tokenizer,
            max_new_tokens=512,
            do_sample=True,  # temperature only takes effect when sampling is enabled
            temperature=0.7,
        )
        self.llm = HuggingFacePipeline(pipeline=pipe)
        logger.info(f"Local LLM {model_id} ready.")
```

Step 10: Setup the RetrievalQA Chain

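Here we create a RetrievalQA chain in which the retriever returns the top k chunks (3 by default) and the "stuff" chain type places all of them into a single prompt for the language model.

```python
    def setup_qa_chain(self, k=3):
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=self.vector_store.as_retriever(search_kwargs={"k": k})
        )
        logger.info(f"Retrieval QA chain set with top {k} documents retrieved.")
```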
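The idea behind the "stuff" chain type can be sketched in a few lines: the retrieved chunks are joined together and placed in one prompt alongside the question. The template wording and the function name build_stuffed_prompt below are hypothetical and are not LangChain's actual prompt; the sketch only illustrates the shape of the input the model receives.

```python
def build_stuffed_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join all retrieved chunks into a single context block followed by the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the retriever's output
prompt = build_stuffed_prompt("What is RAG?", ["chunk one", "chunk two"])
print(prompt)
```

Because every retrieved chunk goes into the same prompt, its length grows with both k and chunk_size, so it is worth checking the combined context against the input limit of the chosen model.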

Step 11: Answer Questions Using the RAG System

```python
    def answer_question(self, question):
        answer = self.qa_chain.run(question)
        logger.info(f"Answered question: {question}")
        return answer
```

Step 12: Run the Setup

We execute all preparation steps in sequence: upload the PDFs, load them, split them into chunks, load the embedding model, build the vector store, load the language model and create the QA chain. This ensures the system is ready for immediate querying.

```python
    def run_setup(self, chunk_size=1000, chunk_overlap=200,
                  model_id="google/flan-t5-base", k=3):
        pdf_paths = self.upload_pdfs()
        self.load_documents(pdf_paths)
        self.split_documents(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        self.setup_embeddings()
        self.create_vector_store()
        self.setup_local_llm(model_id=model_id)
        self.setup_qa_chain(k=k)
        logger.info("RAG summarizer is ready to answer questions.")
```

Step 13: Example Usage

We initialize the RAG system, run the setup and then ask it a few questions:

```python
if __name__ == "__main__":
    rag = LocalRAGSystem()
    rag.run_setup()

    q1 = "What is the main topic of these documents?"
    print(f"Q: {q1}\nA: {rag.answer_question(q1)}")

    q2 = "Summarize the key points from the documents."
    print(f"Q: {q2}\nA: {rag.answer_question(q2)}")
```

**Output:**

[Figure: Result]


Advantages