GitHub - veeva/Vault-Direct-Data-API-Accelerators

Direct Data Accelerator

Introduction

Direct Data API is a new class of API that provides high-speed read-only data access to Vault. Direct Data API is a reliable, easy-to-use, timely, and consistent API for extracting Vault data. It is designed for organizations that wish to replicate large amounts of Vault data to an external database, data warehouse, or data lake.

The Direct Data accelerators are discrete groups of Python scripts, intended to facilitate the loading of data from Vault to external systems. Each accelerator provides a working example of using Direct Data API to connect Vault to an object storage system and a target data system.

accelerator-diagram

Overview

This project provides accelerator implementations that facilitate the loading of data from Vault to the following systems: Snowflake, Databricks, Redshift, SQL Database, Fabric Warehouse, and SQLite.

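These accelerators perform the following fundamental processes, each handled by one of the included scripts: transferring Direct Data files from Vault to object storage, downloading and unzipping those files, loading the data into the target system, and, optionally, extracting document content and retrieving document text.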

Architecture

The accelerators are designed to be extensible, so that developers can adapt them to their own needs and systems. The core components of each accelerator are described below.

Core Components

Classes:

The four fundamental classes that are being leveraged by each accelerator are:

These classes, except for the VaultService class, can be extended to support any target system.
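As an illustration of that extension point, the sketch below defines a hypothetical abstract base class for a target data system and a toy subclass that stores rows in memory. The names `DatabaseService`, `load_table`, and `InMemoryDatabaseService` are assumptions made for this example; the accelerators' actual base classes may expose a different interface.

```python
from abc import ABC, abstractmethod


class DatabaseService(ABC):
    """Hypothetical base class for a target data system (name is an assumption)."""

    def __init__(self, parameters: dict):
        # Fall back to the schema name used in the sample configuration.
        self.schema: str = parameters.get("schema", "direct_data")

    @abstractmethod
    def load_table(self, table_name: str, rows: list[dict]) -> int:
        """Load rows into a table and return the number of rows loaded."""


class InMemoryDatabaseService(DatabaseService):
    """Toy target system that keeps tables in a dictionary."""

    def __init__(self, parameters: dict):
        super().__init__(parameters)
        self.tables: dict[str, list[dict]] = {}

    def load_table(self, table_name: str, rows: list[dict]) -> int:
        qualified_name = f"{self.schema}.{table_name}"
        self.tables.setdefault(qualified_name, []).extend(rows)
        return len(rows)


service = InMemoryDatabaseService({"schema": "direct_data"})
loaded = service.load_table("product__v", [{"id": "00P000000000101"}])
```

A subclass for a real target system would replace the dictionary with calls to that system's driver while keeping the same method signature, so the calling scripts would not need to change.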

services

Configuration Files:

Each accelerator includes two configuration files that contain the parameters required to connect to Vault and the external systems. Each accelerator's config file has examples of the parameters required for that specific implementation. The files are described below.

{
  "authenticationType": "BASIC",
  "idpOauthAccessToken": "",
  "idpOauthScope": "openid",
  "idpUsername": "",
  "idpPassword": "",
  "vaultUsername": "integration.user@cholecap.com",
  "vaultPassword": "Password123",
  "vaultDNS": "cholecap.veevavault.com",
  "vaultSessionId": "",
  "vaultClientId": "Cholecap-Vault-",
  "vaultOauthClientId": "",
  "vaultOauthProfileId": "",
  "logApiErrors": true,
  "httpTimeout": null,
  "validateSession": true
}

{
  "convert_to_parquet": false,
  "extract_document_content": true,
  "retrieve_document_text": true,
  "direct_data": {
    "start_time": "2000-01-01T00:00Z",
    "stop_time": "2025-04-09T00:00Z",
    "extract_type": "incremental"
  },
  "s3": {
    "iam_role_arn": "arn:aws:iam::123456789:role/Direct-Data-Role",
    "bucket_name": "vault-direct-data-bucket",
    "direct_data_folder": "direct-data",
    "archive_filepath": "direct-data/201287-20250409-0000-F.tar.gz",
    "extract_folder": "201287-20250409-0000-F",
    "document_content_folder": "extracted_doc_content",
    "document_text_folder": "extracted_doc_text"
  },
  "redshift": {
    "host": "direct-data.123GUID.us-east-1.redshift.amazonaws.com",
    "port": "5439",
    "user": "user",
    "password": "password",
    "database": "database",
    "schema": "direct_data",
    "iam_redshift_s3_read": "arn:aws:iam::123456789:role/RedshiftS3Read"
  }
}
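Because a missing section or an inverted time window would only surface later as a failed extract, it can help to check the connector configuration before any service is created. The following is a minimal sketch of such a check and is not part of the accelerators: `validate_config`, `parse_time`, and `REQUIRED_SECTIONS` are hypothetical names, and the required sections shown are assumed from the Redshift sample above.

```python
import json
from datetime import datetime

# Sections assumed to be required, based on the Redshift sample configuration.
REQUIRED_SECTIONS = ("direct_data", "s3", "redshift")


def parse_time(value: str) -> datetime:
    # The sample configuration uses timestamps such as "2000-01-01T00:00Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%MZ")


def validate_config(config: dict) -> dict:
    """Raise ValueError if a section is missing or the time window is inverted."""
    missing = [name for name in REQUIRED_SECTIONS if name not in config]
    if missing:
        raise ValueError(f"Missing config sections: {missing}")
    window = config["direct_data"]
    if parse_time(window["start_time"]) >= parse_time(window["stop_time"]):
        raise ValueError("start_time must be earlier than stop_time")
    return config


sample = json.loads("""
{
  "direct_data": {
    "start_time": "2000-01-01T00:00Z",
    "stop_time": "2025-04-09T00:00Z",
    "extract_type": "incremental"
  },
  "s3": {"bucket_name": "vault-direct-data-bucket"},
  "redshift": {"schema": "direct_data"}
}
""")

config = validate_config(sample)
# Same derivation of the object storage root as in the accelerator script.
object_storage_root = f's3://{config["s3"]["bucket_name"]}'
```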

Scripts:

The logic that moves and transforms data between systems is handled in the included scripts.

import sys

# Make the repository root importable when the script is run from it.
sys.path.append('.')

from accelerators.redshift.services.redshift_service import RedshiftService
from common.scripts import (direct_data_to_object_storage,
                            download_and_unzip_direct_data_files,
                            extract_doc_content, load_data, retrieve_doc_text)
from common.services.aws_s3_service import AwsS3Service
from common.services.vault_service import VaultService
from common.utilities import read_json_file


def main():
    config_filepath: str = "path/to/connector_config.json"
    vapil_settings_filepath: str = "path/to/vapil_settings.json"

    # Read the connector configuration and split it into per-system sections.
    config_params: dict = read_json_file(config_filepath)
    direct_data_params: dict = config_params['direct_data']
    s3_params: dict = config_params['s3']
    redshift_params: dict = config_params['redshift']

    extract_document_content: bool = config_params.get('extract_document_content')
    retrieve_document_text: bool = config_params.get('retrieve_document_text')

    object_storage_root: str = f's3://{s3_params["bucket_name"]}'

    # Share the top-level settings with the individual services.
    s3_params['convert_to_parquet'] = config_params['convert_to_parquet']
    redshift_params['convert_to_parquet'] = config_params['convert_to_parquet']
    redshift_params['object_storage_root'] = object_storage_root

    s3_service: AwsS3Service = AwsS3Service(s3_params)
    redshift_service: RedshiftService = RedshiftService(redshift_params)
    vault_service: VaultService = VaultService(vapil_settings_filepath)

    # Move the Direct Data file from Vault to S3, unzip it, and load it into Redshift.
    direct_data_to_object_storage.run(vault_service=vault_service,
                                      object_storage_service=s3_service,
                                      direct_data_params=direct_data_params)

    download_and_unzip_direct_data_files.run(object_storage_service=s3_service)

    load_data.run(object_storage_service=s3_service,
                  database_service=redshift_service,
                  direct_data_params=direct_data_params)

    # Optional document steps, controlled by the connector configuration.
    if extract_document_content:
        extract_doc_content.run(object_storage_service=s3_service,
                                vault_service=vault_service)

    if retrieve_document_text:
        retrieve_doc_text.run(object_storage_service=s3_service,
                              vault_service=vault_service)


if __name__ == "__main__":
    main()

direct-data-to-object-storage

download-and-unzip-direct-data-files

load-data

extract-doc-content

retrieve_doc_text
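The download-and-unzip step works on a `.tar.gz` archive such as the `archive_filepath` in the sample configuration. As a rough local illustration of the unpacking part only, the sketch below extracts an archive with the Python standard library. It is not the accelerator's implementation: `unzip_archive` is a hypothetical helper, and the archive is a small stand-in built on the spot with dummy contents so that the example runs without Vault or S3.

```python
import tarfile
import tempfile
from pathlib import Path


def unzip_archive(archive_path: Path, destination: Path) -> list[str]:
    """Extract a .tar.gz archive and return the names of its members."""
    destination.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=destination)
    return names


# Build a tiny stand-in archive; the CSV contents are dummy data.
workdir = Path(tempfile.mkdtemp())
csv_file = workdir / "manifest.csv"
csv_file.write_text("extract,records\nObject.product__v,1\n")
archive_path = workdir / "201287-20250409-0000-F.tar.gz"
with tarfile.open(archive_path, mode="w:gz") as archive:
    archive.add(csv_file, arcname="manifest.csv")

extract_folder = workdir / "201287-20250409-0000-F"
extracted = unzip_archive(archive_path, extract_folder)
```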

Implementations

To use the Direct Data accelerators, see the following prerequisites. Individual implementations will have additional prerequisites.

The following accelerator implementations are currently available as working examples.

Snowflake Accelerator

This accelerator leverages the ability to integrate Snowflake with S3 and seamlessly load data directly from S3 into Snowflake. This process utilizes the COPY INTO command that allows for loading directly from a file in addition to inferring the table schema from the same file. Inferring schemas is generally recommended only when dealing with Full files.
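To make the two Snowflake steps concrete, the sketch below builds the SQL text for creating a table from an inferred schema and then loading it with COPY INTO. It only assembles strings and does not connect to Snowflake. The stage name `@direct_data_stage`, the named file format `direct_data_csv`, and the helper `build_snowflake_statements` are assumptions for illustration; the accelerator may generate different statements, and the exact options should be checked against the Snowflake documentation.

```python
def build_snowflake_statements(table: str, stage_path: str, file_format: str) -> list[str]:
    """Return SQL that creates a table from an inferred schema, then loads it."""
    # Step 1: let Snowflake infer the column definitions from the staged file.
    create_table = (
        f"CREATE OR REPLACE TABLE {table} USING TEMPLATE ("
        f"SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*)) FROM TABLE(INFER_SCHEMA("
        f"LOCATION => '{stage_path}', FILE_FORMAT => '{file_format}')))"
    )
    # Step 2: load the same file, matching file columns to table columns by name.
    copy_into = (
        f"COPY INTO {table} FROM {stage_path} "
        f"FILE_FORMAT = (FORMAT_NAME = '{file_format}') "
        f"MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE"
    )
    return [create_table, copy_into]


create_table, copy_into = build_snowflake_statements(
    table="product__v",
    stage_path="@direct_data_stage/product__v.csv",  # hypothetical stage path
    file_format="direct_data_csv",                   # hypothetical named file format
)
```

Replacing the table on each run suits Full files, where the file describes the complete data set; for incremental files the existing table definition would normally be kept.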

snowflake-accelerator

Pre-requisites

Supported File Formats

Databricks Accelerator

There are several ways to handle and load data into Databricks. The Databricks accelerator uses the COPY INTO command to load data directly from S3 to Delta Lake.

databricks-accelerator

Pre-requisites

Supported File Formats

Considerations

Redshift Accelerator

Similar to the other accelerators, the Redshift accelerator leverages the COPY command to load data into tables from S3.
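As a rough illustration, a Redshift COPY statement for one extracted file could be assembled as follows. This sketch only builds the SQL text. The helper `build_redshift_copy` and the file path under the extract folder are hypothetical, while the schema, bucket, and IAM role values are taken from the sample configuration.

```python
def build_redshift_copy(schema: str, table: str, s3_uri: str, iam_role_arn: str,
                        convert_to_parquet: bool = False) -> str:
    """Return a COPY statement that loads one S3 file into a Redshift table."""
    if convert_to_parquet:
        file_format = "FORMAT AS PARQUET"
    else:
        # Skip the header row of the CSV file.
        file_format = "FORMAT AS CSV IGNOREHEADER 1"
    return (
        f"COPY {schema}.{table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' {file_format};"
    )


statement = build_redshift_copy(
    schema="direct_data",
    table="product__v",
    # Hypothetical file location inside the extract folder.
    s3_uri="s3://vault-direct-data-bucket/direct-data/201287-20250409-0000-F/product__v.csv",
    iam_role_arn="arn:aws:iam::123456789:role/RedshiftS3Read",
)
```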

redshift-accelerator

Pre-requisites

Supported File Formats

Considerations

SQL Database Accelerator

The SQL Database accelerator leverages the BULK INSERT command to load data into tables from Blob Storage.

sql-database-accelerator

Pre-requisites

Supported File Formats

Considerations

Fabric Warehouse Accelerator

The Fabric accelerator leverages the COPY INTO command to load data into tables from Blob Storage.

fabric-accelerator

Pre-requisites

Supported File Formats

SQLite Accelerator

The SQLite Accelerator uses the pandas Python library and the DataFrame.to_sql() method to load data from the CSV files into a local SQLite database.
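A minimal sketch of that approach is shown below, assuming pandas is installed. The CSV contents, the column names, and the table name are made up for the example, and an in-memory database stands in for the local SQLite file; the accelerator's own loading logic may differ.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for one extracted CSV file; the rows are dummy data.
csv_data = io.StringIO(
    "id,name__v\n"
    "00P000000000101,Cholecap\n"
    "00P000000000102,Nyaxa\n"
)
# Read every column as text so that IDs are not converted to numbers.
frame = pd.read_csv(csv_data, dtype=str)

# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
frame.to_sql("product__v", connection, if_exists="replace", index=False)
row_count = connection.execute("SELECT COUNT(*) FROM product__v").fetchone()[0]
connection.close()
```

Using `if_exists="replace"` recreates the table on every load; `if_exists="append"` would add rows to an existing table instead.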

sqlite-accelerator

The SQLite Accelerator differs from the other accelerators because the files and database are stored locally. The specific implementation details are below.

Architecture

This accelerator performs the following fundamental processes:

The following classes are being leveraged by the SQLite Accelerator:

The logic that moves and transforms data between systems is handled in the included scripts.

Pre-requisites

Supported File Formats