GitHub - avnlp/rag-pipelines: Advanced RAG pipelines for medical (HealthBench, MedCaseReasoning, MetaMedQA, PubMedQA) and financial (FinanceBench, Earnings Calls) QA. LangGraph orchestration + BAML structured generation, Milvus hybrid search (Dense + BM25 + RRF), three-layer Metadata Enrichment, Contextual AI instruction-following reranker, and DeepEval evaluation.


This repository contains advanced Retrieval-Augmented Generation (RAG) pipelines for domain-specific question answering in the medical and financial domains.

The RAG pipelines follow a standardized architecture:

Each pipeline is configured through YAML files that control document processing, retrieval strategies, and generation parameters.

Datasets

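The project includes several domain-specific datasets: HealthBench, MedCaseReasoning, MetaMedQA, and PubMedQA for medical QA, and FinanceBench and Earnings Calls for financial QA.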

Pipeline Architecture

Each pipeline follows a consistent architecture split into two stages:

Indexing Pipeline

The indexing pipeline is an offline process run once per dataset to prepare the vector store:

  1. Load the dataset from Hugging Face Hub (or from PDF files for financial documents).
  2. Chunk documents using configurable strategies from the Unstructured library.
  3. Enrich each chunk with all three metadata layers (structural, dynamic, and fixed) via the Metadata Enricher.
  4. Store the enriched chunks in a Milvus vector database with both dense and sparse (BM25) indexing for hybrid search.
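Step 4 keeps both a dense and a sparse (BM25) index so that retrieval can combine the two result lists, and the repository description names RRF (Reciprocal Rank Fusion) as the fusion method. The following is a minimal plain-Python sketch of RRF, not the Milvus implementation used by the project. The function name, the sample document IDs, and the smoothing constant `k=60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. `k` is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense search and a BM25 search.
dense_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Documents that appear near the top of both lists accumulate the largest scores, which is why RRF needs no score normalization between the dense and sparse retrievers.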

RAG Evaluation Pipeline

The RAG evaluation pipeline is orchestrated by LangGraph and consists of the following nodes:

Components

Contextual Ranker

The Contextual Ranker uses instruction-following reranker models by Contextual AI to reorder documents based on their relevance to a given query.
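The reordering step itself can be sketched independently of any particular model. The example below is an illustration only: it does not call the Contextual AI reranker, and `rerank`, `toy_overlap_score`, and the sample documents are hypothetical names and data. A simple word-overlap scorer stands in for the model so the sketch runs without external services.

```python
def toy_overlap_score(query, document):
    """Stand-in scorer: fraction of query words that occur in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


def rerank(query, documents, score_fn, top_n=None):
    """Reorder documents by descending relevance score for the query."""
    scored = [(score_fn(query, doc), index, doc) for index, doc in enumerate(documents)]
    # Sort by score (descending); the original position breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    ranked = [doc for _, _, doc in scored]
    return ranked if top_n is None else ranked[:top_n]


docs = [
    "Ibuprofen side effects",
    "Aspirin dosage for adults",
    "Aspirin history",
]
ranked = rerank("aspirin dosage adults", docs, toy_overlap_score)
print(ranked)
```

In a real pipeline, `score_fn` would be replaced by a call to the reranker model, and `top_n` would limit how many documents are passed on to generation.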

Metadata Enricher

The Metadata Enricher automatically enriches documents and queries with structured metadata using a three-layer architecture designed for cost/quality tradeoffs.

Three-layer enrichment:

  1. Structural (Layer 1): Rule-based extraction with zero LLM cost - content hashing, word/character counts, language detection, page numbers, section titles, and heading hierarchy.
  2. Dynamic (Layer 2): User-defined fields extracted via a language model. Supports string, number, boolean, and enum field types. The schema is specified per-pipeline in the YAML configuration.
  3. Fixed (Layer 3): RAG-optimized fields automatically generated by the LLM - potential questions the chunk answers, a concise summary, keywords, content type classification, and a descriptive header.
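Layer 1 is the only layer that needs no LLM, so it is straightforward to illustrate. The sketch below computes three of the structural fields listed above (content hash, word count, character count) with the standard library. The function name, the field names, and the choice of SHA-256 are assumptions, not the project's actual implementation.

```python
import hashlib


def structural_metadata(text):
    """Compute rule-based (zero LLM cost) metadata for a single chunk."""
    return {
        # SHA-256 is an assumed choice; any stable hash supports deduplication.
        "content_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        # Whitespace-delimited token count.
        "word_count": len(text.split()),
        "char_count": len(text),
    }


meta = structural_metadata("Aspirin reduces fever.")
print(meta["word_count"], meta["char_count"])
```

Because these fields are deterministic, they can be recomputed cheaply on every indexing run, while the LLM-backed layers 2 and 3 carry the per-chunk cost.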

Key capabilities:

Unstructured Document Loaders and Chunker

The project includes document loading and chunking utilities built on the Unstructured library:

Key features:

BAML

BAML is a domain-specific language for defining LLM interactions. All LLM logic in this project — prompts, input/output schemas, client configurations, and test cases — is written in .baml files, fully separated from the Python application code. A Rust-based compiler then generates a typed Python client from these definitions, bridging the two layers.

Installation

The project uses uv for dependency management. First, ensure uv is installed:

Install uv (if not already installed)

pip install uv

Then install the project dependencies:

Install dependencies

uv sync

Activate the virtual environment

source .venv/bin/activate

Usage

Environment Setup

Create a .env file in the project root with the required environment variables:

GROQ_API_KEY=your_groq_api_key
MILVUS_URI=your_milvus_uri
MILVUS_TOKEN=your_milvus_token
UNSTRUCTURED_API_KEY=your_unstructured_api_key
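Before running a pipeline, it can help to confirm that all four variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical, and it assumes the .env file has already been loaded into the process environment by whatever mechanism the project uses.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = (
    "GROQ_API_KEY",
    "MILVUS_URI",
    "MILVUS_TOKEN",
    "UNSTRUCTURED_API_KEY",
)


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
```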

Indexing

Each dataset module includes an indexing script to process and store documents in the vector database:

Example for HealthBench:

cd src/rag_pipelines/healthbench
python healthbench_indexing.py

RAG Evaluation

Each dataset module includes a RAG evaluation script that runs the pipeline and evaluates its performance:

Example for HealthBench:

cd src/rag_pipelines/healthbench
python healthbench_rag.py

Contributing

Please see the CONTRIBUTING.md file for detailed contribution guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.