GitHub - Azure/doc-proc-solution-accelerator: Document Processing Solution Accelerator on Azure by Azure.

doc-proc-solution-accelerator

An enterprise-ready document processing solution built on Azure that lets organizations deploy and scale document processing workflows quickly. The accelerator combines Azure AI services, a cloud-native architecture, and modern development practices into a single platform for document ingestion, processing, and analysis.

🚀 What is this Solution Accelerator?

This solution accelerator provides a production-ready foundation for building document processing applications on Azure. It includes:

Getting Started →

✨ Key Benefits

🏗️ Architecture Overview

The solution consists of the following building blocks:

[Architecture diagram]

📦 Core Components

🔧 doc-proc-lib

The Processing Engine - A flexible Python library that serves as the heart of the document processing pipeline.

📖 View detailed documentation →
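
As a rough illustration of what a step-based pipeline engine can look like, the sketch below chains plain Python callables over a shared document dictionary. The names `Step`, `count_words`, `tag_length`, and `run_pipeline` are hypothetical and are not taken from doc-proc-lib; refer to the library documentation for its actual API.

```python
from typing import Any, Callable, Dict, List

# Hypothetical contract: a step receives the document state and returns an updated copy.
Step = Callable[[Dict[str, Any]], Dict[str, Any]]


def count_words(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Add a word count computed from the raw text."""
    text = str(doc.get("text", ""))
    return {**doc, "word_count": len(text.split())}


def tag_length(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Label the document as short or long based on the word count."""
    label = "short" if doc.get("word_count", 0) < 5 else "long"
    return {**doc, "length": label}


def run_pipeline(doc: Dict[str, Any], steps: List[Step]) -> Dict[str, Any]:
    """Apply each step in order, feeding the output of one step into the next."""
    for step in steps:
        doc = step(doc)
    return doc


result = run_pipeline({"text": "Invoice 42 is due"}, [count_words, tag_length])
print(result)
```

Keeping each step a pure function of the document state makes steps easy to reorder and to test in isolation, which is one common reason to structure a pipeline this way.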

🎨 doc-proc-web

The Management Interface - A modern React-based web application for managing and monitoring document processing workflows.

πŸ“– View detailed documentation β†’

πŸš€ doc-proc-api

The Backend API - FastAPI-based service providing RESTful APIs for document processing operations.

πŸ“– View detailed documentation β†’

⚑ doc-proc-worker

The Processing Engine - High-performance background processing service for executing document processing jobs at scale.

πŸ“– View detailed documentation β†’

πŸ” doc-proc-crawler

The Document Discovery Engine - Intelligent distributed crawler service for automated document discovery and ingestion from various sources.

πŸ“– View detailed documentation β†’

πŸ—οΈ doc-proc-deploy

Infrastructure as Code - Automated deployment templates and scripts for Azure resources.

πŸ“– View deployment documentation β†’

💡 Use Cases and Scenarios

This solution accelerator can be applied across various industries and document processing workflows. Below are common use cases with domain-specific examples and configuration patterns.

📄 Multi-Modal Document Processing

Enterprise Content Unification

Process diverse document types and formats in a unified workflow with intelligent format detection and specialized extraction:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Mixed Content  │───▶│  Format         │───▶│  Specialized    │───▶│  Unified Data   │
│  Repository     │    │  Detection      │    │  Extraction     │    │  Structure      │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
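
The flow in the diagram can be sketched as a dispatch table keyed by file extension, where every extractor returns the same record shape. This is a minimal sketch only: the extractor functions, the `EXTRACTORS` table, and the record fields are hypothetical placeholders, and real format detection would normally inspect file content rather than trust the extension.

```python
from pathlib import Path
from typing import Callable, Dict

Record = Dict[str, str]


# Hypothetical extractors: each one returns the same unified record shape.
def extract_pdf(path: Path) -> Record:
    return {"source": path.name, "kind": "pdf", "text": "<text from PDF extraction>"}


def extract_image(path: Path) -> Record:
    return {"source": path.name, "kind": "image", "text": "<text from OCR>"}


def extract_plain(path: Path) -> Record:
    return {"source": path.name, "kind": "text", "text": "<file contents>"}


# Format detection by extension (an assumption made for brevity).
EXTRACTORS: Dict[str, Callable[[Path], Record]] = {
    ".pdf": extract_pdf,
    ".png": extract_image,
    ".jpg": extract_image,
    ".txt": extract_plain,
}


def process(path_str: str) -> Record:
    """Detect the format, run the matching extractor, and return a unified record."""
    path = Path(path_str)
    extractor = EXTRACTORS.get(path.suffix.lower())
    if extractor is None:
        raise ValueError(f"unsupported format: {path.suffix or '(none)'}")
    return extractor(path)


print(process("report.PDF")["kind"])
```

Because every extractor returns the same keys, downstream steps do not need to know which format a document started in.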

Supported Document Types:

Intelligent Processing Pipeline:

Key Benefits:

🏒 Financial Services

Invoice Processing Automation

Streamline accounts payable workflows with automated invoice processing:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PDF/Email     │───▶│  Document AI    │───▶│   Validation    │───▶│   ERP System    │
│   Invoice       │    │   Extraction    │    │   & Approval    │    │   Integration   │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
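
To make the validation and approval stage concrete, here is a small sketch that checks an extracted invoice for internal consistency and then routes it. The `Invoice` and `LineItem` shapes, the `APPROVAL_LIMIT` threshold, and the routing labels are all assumptions for illustration; the accelerator's own validation rules may differ.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import List


@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: Decimal


@dataclass
class Invoice:
    number: str
    total: Decimal
    items: List[LineItem] = field(default_factory=list)


APPROVAL_LIMIT = Decimal("1000.00")  # hypothetical auto-approval threshold


def validate(invoice: Invoice) -> List[str]:
    """Return a list of problems; an empty list means the invoice is consistent."""
    problems = []
    if not invoice.number.strip():
        problems.append("missing invoice number")
    computed = sum((i.quantity * i.unit_price for i in invoice.items), Decimal("0"))
    if computed != invoice.total:
        problems.append(f"total {invoice.total} does not match line items {computed}")
    return problems


def route(invoice: Invoice) -> str:
    """Send inconsistent invoices to a person and large ones to an approver."""
    if validate(invoice):
        return "manual_review"
    return "auto_approve" if invoice.total <= APPROVAL_LIMIT else "needs_approval"


invoice = Invoice(
    number="INV-001",
    total=Decimal("50.00"),
    items=[
        LineItem("Paper", 2, Decimal("19.99")),
        LineItem("Toner", 1, Decimal("10.02")),
    ],
)
print(route(invoice))
```

`Decimal` is used instead of `float` so that currency arithmetic compares exactly.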

Loan Document Processing

Accelerate loan application processing with document verification:

Automated Document Ingestion

Enterprise-scale document discovery and ingestion with intelligent coordination:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   File Shares   │───▶│   Distributed   │───▶│   Document      │───▶│   Processing    │
│   Cloud Storage │    │    Crawler      │    │   Queue         │    │   Pipeline      │
│   API Endpoints │    │   Discovery     │    │   Management    │    │   Execution     │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
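
The discovery-to-queue handoff can be illustrated with a single-process sketch that walks a folder and enqueues each file once, using a content hash to skip duplicates. This is not the crawler's distributed coordination logic: the `discover` function, the in-memory `queue.Queue`, and the `seen` set are stand-ins chosen for illustration.

```python
import hashlib
import queue
import tempfile
from pathlib import Path
from typing import Set


def discover(root: Path, work: "queue.Queue[Path]", seen: Set[str]) -> int:
    """Walk root and enqueue every file whose content has not been seen before."""
    added = 0
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            continue  # same content already queued, possibly under another name
        seen.add(digest)
        work.put(path)
        added += 1
    return added


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("invoice 1")
    (root / "b.txt").write_text("invoice 2")
    (root / "copy-of-a.txt").write_text("invoice 1")  # duplicate content

    work: "queue.Queue[Path]" = queue.Queue()
    seen: Set[str] = set()
    first_pass = discover(root, work, seen)
    second_pass = discover(root, work, seen)  # nothing new on a re-crawl
    print(first_pass, second_pass, work.qsize())
```

In a distributed setting the `seen` set would have to live in shared storage so that several crawler instances do not enqueue the same document.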

Key Benefits:

🏥 Healthcare

Medical Records Digitization

Transform paper-based medical records into structured digital formats:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Scanned Chart  │───▶│      OCR +      │───▶│   FHIR Data     │───▶│      EHR        │
│   Documents     │    │   Medical AI    │    │   Mapping       │    │   Integration   │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
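
As an illustration of the FHIR data mapping stage, the sketch below turns a flat dictionary of extracted fields into a minimal FHIR `Patient` resource. The input keys (`mrn`, `first_name`, `last_name`, `date_of_birth`, `sex`) are hypothetical names for whatever the extraction step produces; the output uses standard `Patient` elements (`identifier`, `name`, `birthDate`, `gender`).

```python
import json
from typing import Any, Dict

# FHIR administrative gender codes are: male, female, other, unknown.
GENDER_CODES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def to_fhir_patient(fields: Dict[str, str]) -> Dict[str, Any]:
    """Map flat extracted fields (hypothetical keys) to a minimal FHIR Patient."""
    return {
        "resourceType": "Patient",
        "identifier": [{"value": fields["mrn"]}],
        "name": [{"family": fields["last_name"], "given": [fields["first_name"]]}],
        "birthDate": fields["date_of_birth"],  # expected as YYYY-MM-DD
        "gender": GENDER_CODES.get(fields.get("sex", "").strip().lower(), "unknown"),
    }


patient = to_fhir_patient(
    {
        "mrn": "12345",
        "first_name": "Ada",
        "last_name": "Lovelace",
        "date_of_birth": "1815-12-10",
        "sex": "F",
    }
)
print(json.dumps(patient, indent=2))
```

A production mapping would also validate dates and attach an identifier system; both are omitted here to keep the example short.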

Clinical Trial Document Processing

Process research documents and patient data for clinical trials:

Contract Analysis and Review

Automate contract review processes with AI-powered analysis:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Contract      │───▶│  Clause         │───▶│  Risk Analysis  │───▶│  Review         │
│   Document      │    │  Extraction     │    │  & Compliance   │    │  Dashboard      │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
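
The clause extraction and risk analysis stages can be approximated with a deliberately simple rule-based sketch: split the contract on numbered headings, then flag clauses by keyword. The `RISK_TERMS` table and its risk levels are invented for illustration, and the AI-powered analysis described above would replace the keyword match with a model call.

```python
import re
from typing import Dict, List

# Hypothetical keyword stems and the risk level assigned to each, highest first.
RISK_TERMS = {"indemnif": "high", "terminat": "medium", "confidential": "low"}


def split_clauses(text: str) -> List[str]:
    """Split on numbered headings such as '1.' or '2.' at the start of a line."""
    parts = re.split(r"(?m)^\s*\d+\.\s+", text)
    return [part.strip() for part in parts if part.strip()]


def flag_risks(clauses: List[str]) -> List[Dict[str, str]]:
    """Attach the first matching risk level to each clause, or 'none'."""
    flagged = []
    for clause in clauses:
        lowered = clause.lower()
        risk = next((level for stem, level in RISK_TERMS.items() if stem in lowered), "none")
        flagged.append({"clause": clause, "risk": risk})
    return flagged


contract = """1. The Supplier shall indemnify the Buyer.
2. Either party may terminate with notice.
3. Payment is due in 30 days."""

for row in flag_risks(split_clauses(contract)):
    print(row["risk"], "-", row["clause"])
```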

Process large volumes of documents for litigation support:

🏭 Manufacturing & Supply Chain

Quality Control Documentation

Process inspection reports, certificates, and compliance documents:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Inspection     │───▶│  Data           │───▶│  Compliance     │───▶│  Quality        │
│  Reports        │    │  Extraction     │    │  Verification   │    │  Dashboard      │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
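
For the compliance verification stage, a minimal sketch is to compare extracted measurements against a tolerance table and report every violation. The `SPEC` values, field names, and message format below are assumptions made for the example, not values from the accelerator.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tolerance:
    minimum: float
    maximum: float


# Hypothetical specification: allowed range per measured field.
SPEC = {
    "diameter_mm": Tolerance(9.95, 10.05),
    "weight_g": Tolerance(49.0, 51.0),
}


def verify(measurements: Dict[str, float]) -> List[str]:
    """Return one message per missing or out-of-range measurement."""
    failures = []
    for name, tol in SPEC.items():
        value = measurements.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif not tol.minimum <= value <= tol.maximum:
            failures.append(f"{name}: {value} outside [{tol.minimum}, {tol.maximum}]")
    return failures


print(verify({"diameter_mm": 10.0, "weight_g": 50.2}))
print(verify({"diameter_mm": 10.2}))
```

Returning all failures rather than stopping at the first one gives a dashboard enough detail to show the full state of a report.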

Supplier Document Management

Getting Started

Prerequisites

Before getting started, ensure you have the following:

Deployment/Development Environment:

Azure Resources:

Quick Start Options

Choose the deployment method that best fits your needs:

☁️ Azure Cloud Deployment

Production-ready deployment on Azure with full scalability:

Using PowerShell for deployment

0. Clone the repository

git clone https://github.com/Azure/doc-proc-solution-accelerator.git
cd doc-proc-solution-accelerator

1. Deploy Azure infrastructure (AI Foundry, Container Apps, Cosmos DB, Storage Account, etc.)

pwsh .\doc-proc-deploy\DeployAzureInfra.ps1 -ResourceGroup myResourceGroup -Location westus -p docproc

2. Build and push Docker images to Azure Container Registry

pwsh .\doc-proc-deploy\BuildAndPushImages.ps1 -Registry myregistry.azurecr.io -Tag latest

3. Deploy applications to Azure Container Apps

pwsh .\doc-proc-deploy\DeployApps.ps1 -ResourceGroup myResourceGroup

Using shell scripts for deployment

0. Clone the repository

git clone https://github.com/Azure/doc-proc-solution-accelerator.git
cd doc-proc-solution-accelerator

1. Deploy Azure infrastructure (AI Foundry, Container Apps, Cosmos DB, Storage Account, etc.)

./doc-proc-deploy/deploy-azure-infra.sh -g myResourceGroup -l westus -p docproc

2. Build and push Docker images to Azure Container Registry

./doc-proc-deploy/build-and-push-images.sh -r myregistry.azurecr.io -t latest

3. Deploy applications to Azure Container Apps

./doc-proc-deploy/deploy-apps.sh -g myResourceGroup
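
If you prefer to drive the three shell steps from a single script, the sequence can be sketched in Python as below. The script paths and flags are the ones shown in the commands above; the `deployment_commands` and `deploy` helpers are hypothetical wrappers, and the example defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def deployment_commands(
    resource_group: str, location: str, prefix: str, registry: str, tag: str
) -> List[List[str]]:
    """Build the three deployment commands in the order they must run."""
    return [
        ["./doc-proc-deploy/deploy-azure-infra.sh", "-g", resource_group, "-l", location, "-p", prefix],
        ["./doc-proc-deploy/build-and-push-images.sh", "-r", registry, "-t", tag],
        ["./doc-proc-deploy/deploy-apps.sh", "-g", resource_group],
    ]


def deploy(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(command, check=True)


commands = deployment_commands(
    "myResourceGroup", "westus", "docproc", "myregistry.azurecr.io", "latest"
)
deploy(commands)  # dry run: prints the commands without executing them
```

Call `deploy(commands, dry_run=False)` from the repository root to execute the steps.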

This creates:

🔧 Local Development

Once the Azure resources are deployed, you can run the solution services locally for development:

cd doc-proc-solution-accelerator

Configure environment variables for each service

Copy .env.example to .env and update with your Azure resource endpoints

cp doc-proc-api.env.example doc-proc-api.env
cp doc-proc-worker.env.example doc-proc-worker.env
cp doc-proc-crawler.env.example doc-proc-crawler.env
cp doc-proc-web.env.example doc-proc-web.env

Edit the .env files with your Azure resource information:

doc-proc-api.env - Add Azure App Configuration endpoint

doc-proc-worker.env - Add Azure App Configuration endpoint

doc-proc-crawler.env - Add Azure App Configuration endpoint

doc-proc-web.env - Update API base URL if different from http://localhost:8090

Start all services locally with auto-reload

pwsh .\doc-proc-deploy\StartServicesLocally.ps1

#./doc-proc-deploy/start-services-locally.sh # if using shell
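
Before starting the services, it can help to confirm that each .env file actually contains the values you edited. The sketch below parses simple `KEY=VALUE` text and reports keys that are absent or empty. The key name `APP_CONFIG_ENDPOINT` and the sample endpoint are hypothetical; use the variable names found in the repository's .env.example files.

```python
from typing import Dict, Iterable, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or set to an empty string."""
    return [key for key in required if not values.get(key)]


# Sample content with a hypothetical key name and endpoint.
sample = """# doc-proc-api settings
APP_CONFIG_ENDPOINT="https://example.azconfig.io"
EMPTY_VALUE=
"""

values = parse_env(sample)
print(missing_keys(values, ["APP_CONFIG_ENDPOINT", "EMPTY_VALUE", "NOT_PRESENT"]))
```

To check a real file, pass `Path("doc-proc-api.env").read_text()` (with `from pathlib import Path`) to `parse_env` in place of `sample`.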

Required Configuration Values:

This will start:

💡 For detailed instructions and additional options, see the comprehensive deployment guide →

📊 Monitoring and Scaling

Auto-Scaling Configuration

Container Apps are configured with intelligent auto-scaling:

Health Monitoring

All services include comprehensive health monitoring:

🔐 Security Features

🏷️ Repository Structure

doc-proc-solution-accelerator/
├── doc-proc-lib/              # 🔧 Core processing library and pipeline engine
│   ├── doc/                   # Processing modules and components
│   ├── examples/              # Example pipelines and usage patterns
│   ├── tests/                 # Unit and integration tests
│   ├── pipeline_config.yaml   # Pipeline configuration examples
│   ├── service_catalog.yaml   # Service definitions and configurations
│   └── step_catalog.yaml      # Step definitions and configurations
├── doc-proc-api/              # 🚀 FastAPI backend service
│   ├── app/                   # Application code
│   │   ├── db/               # Database models and operations
│   │   ├── models/           # Pydantic models and schemas
│   │   ├── routers/          # API route handlers
│   │   └── services/         # Business logic services
│   ├── infra/                # Infrastructure configuration for API
│   ├── Dockerfile            # Container configuration
│   └── requirements.txt      # Python dependencies
├── doc-proc-web/              # 🎨 React + TypeScript frontend
│   ├── src/                  # Source code
│   │   ├── components/       # Reusable UI components
│   │   ├── pages/           # Application pages
│   │   ├── services/        # API integration services
│   │   └── types/           # TypeScript type definitions
│   ├── infra/               # Infrastructure configuration for web
│   ├── Dockerfile           # Container configuration
│   └── package.json         # Node.js dependencies
├── doc-proc-worker/           # ⚡ Background processing worker
│   ├── app/                  # Worker application code
│   ├── demo/                 # Demo scripts and examples
│   ├── infra/               # Infrastructure configuration for worker
│   ├── tmp/                 # Temporary processing files
│   ├── Dockerfile           # Container configuration
│   └── requirements.txt     # Python dependencies
├── doc-proc-crawler/          # 🔍 Document discovery and crawling service
│   ├── app/                  # Crawler application code
│   │   ├── discovery/        # Distributed coordination and source discovery
│   │   ├── sources/          # Source connectors (filesystem, cloud, API, etc.)
│   │   ├── models/           # Data models for crawling and coordination
│   │   └── proxy/            # Azure service integration proxies
│   ├── infra/               # Infrastructure configuration for crawler
│   ├── DISTRIBUTED_ARCHITECTURE.md  # Distributed coordination documentation
│   ├── Dockerfile           # Container configuration
│   ├── run_crawler.py       # Main crawler entry point
│   └── requirements.txt     # Python dependencies
├── doc-proc-deploy/           # 🏗️ Infrastructure as Code and deployment
│   ├── infra/
│   │   └── bicep/           # Azure Bicep templates
│   │       ├── main.bicep   # Main infrastructure template
│   │       └── modules/     # Reusable Bicep modules
│   ├── deploy-azure-infra.sh     # Deploy infrastructure script
│   ├── build-and-push-images.sh  # Build and push Docker images
│   ├── deploy-apps.sh            # Deploy applications script
│   ├── start-services-locally.sh # Local development setup
│   └── DEPLOYMENT.md             # Detailed deployment guide
├── bicepconfig.json          # Bicep configuration
├── logo.svg                  # Solution logo
├── LICENSE                   # MIT license
└── README.md                 # This documentation

💡 Planned Features

Some of the great features planned for the next release:

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details on how to:

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

⚡ Ready to get started? Follow the Getting Started guide above or dive deep into the doc-proc-lib documentation.