Document AI Warehouse overview

Conceptual Overview

Document AI Warehouse is an integrated, cloud-based platform to store, search, organize, govern, and analyze documents and their structured metadata (called Properties). Documents can be structured (for example, forms and invoices) or unstructured (for example, contracts and research papers). Their Properties (metadata) include AI-extracted data from the documents and manually assigned or AI-assigned tags (for example, account number, loan ID, and document type).

Key Benefits and Features

Document AI Warehouse offers several advantages over legacy repositories. Following are some features and benefits:

*The UI is in Preview and expected to go GA soon.

**OCR and other document extractors are available in Document AI products but not included in Document AI Warehouse.

***These features are not part of Document AI Warehouse. These features are enabled by external open source components and scripts that customers can deploy or customize and are not implemented within Document AI Warehouse.

Disclaimers and Known Limitations

For more information, see Disclaimers and Known Limitations.

Terminology

The following terms are used in Document AI Warehouse. Alternative names appear in square brackets.

Document: A record in Document AI Warehouse that users can search, manage, and enforce access control on. It comprises the raw document and its associated metadata. Images stored in Document AI Warehouse are also referred to as "Documents".

Raw Document [Content]: The raw content file (PDF, image, binary, or blob) of the Document.

Schema [Document Type]: Each document has a document type, which is specified by a schema. For example, an Invoice schema contains properties such as Supplier Name, Vendor Name, and Invoice Amount.

Property [Metadata]: A field of the Document Schema whose value may either be extracted from the document or enriched (labeled) by users. Metadata currently supports the following types: free text, Enum, Numeric, Date, and Map (a JSON hierarchy of key-value pairs). Support for Boolean, Money, and other types is planned.

Doc extractors (DocAI and others): Documents may be processed by an AI extraction pipeline so that the extractions can be ingested and managed in Document AI Warehouse (as Metadata) along with the Raw Document. The extraction can be done by:
- Document AI specialized parsers (for Procurement forms, Lending forms, and others)
- OCR, AutoML, Forms parser (for images such as TIFF and PNG)
- Other custom models
- Text-extraction tools for specialized document formats such as PDFs, Office documents, and others
Note that Document AI Warehouse can work with any extraction pipeline that calls Document AI Warehouse APIs to ingest or update documents.

Folders: A folder is a virtual collection of documents (virtual because the same document can be contained in one or more folders). It has a "Document Type/Schema" and contains metadata and Access Control Lists, just like documents. To add a Document to a Folder, a user needs Edit permission on the Folder and View permission on the Document.

Links: Links are used to add documents to folders or to link related documents together. Links do not have a "Link Type".

Related Documents: Documents can be related by directional links from one document to another.

Link Permissions: To add a Document to a Folder, a user needs Edit permission on the Link-From object (for example, the Folder) and View permission on the Link-To object (for example, the Document).

Policy: A policy is evaluated when a document or folder is created or updated, and is used to validate or update document metadata or ACLs, or to add, move, or remove documents from folders. A policy comprises:
- Trigger, for example, upon DocUpdate/DocCreate
- Condition, for example, Invoice.Amount < $1000
- Action, for example, Update Doc Metadata, Return Condition Evaluation, Add Doc to Folder
A policy is typically associated with a Document Type. It is expressed in a low-code Common Expression Language (JSON format, specified later).

Notification Policy: A special type of policy where the Action publishes a message to a Pub/Sub topic when a certain condition is met. Consuming applications or workflows may use the message to trigger actions on the documents or other parts of a business workflow.

Policy Engine, Policy APIs:
- Engine: The server that evaluates policies and takes actions.
- API: The Admin API used to create, update, read, and delete policies.

Faceted Search: A Facet is a metadata filter used in a search query. For example, a search for Bank Statements with "Month = March 2021" and "Branch State = CA" filters the search results by these two facets. A Facet is typically an enumerated field. Date and Numeric facets will be supported in future releases. Facets for a Document type are specified in the Document Schema by Admins (via the Admin API).

Semantic Search: Semantic search supports synonyms or "semantically related" terms in the search query. For example, "Driver license" returns "driver permit".

Search Histogram: Histogram is a search API feature that returns the distribution (counts) of search results by facet. For example, the search results for Driver License return the histogram "CA 500, NV 150, …".

Universal Access vs Doc-level Access Control: Two access modes are supported in Document AI Warehouse for each project:
- Universal access: any user can access any document in the project. The API is access-controlled to user accounts or service accounts, but there are no document-level permissions.
- Doc-level ACL: users are granted document-level permissions. Each document has R/U/D permissions assigned to users or groups.
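
To make the Policy entry in the terminology above more concrete, the following is a minimal Python sketch of the trigger, condition, and action model. It is an illustration only and does not use the Document AI Warehouse API or its CEL/JSON policy format. The names Document, Policy, evaluate, and the "small-invoices" folder are hypothetical, and the condition is written as a plain Python callable rather than a Common Expression Language expression.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    # Hypothetical in-memory stand-in for a Warehouse document.
    doc_type: str
    properties: dict
    folders: set = field(default_factory=set)


@dataclass
class Policy:
    # Hypothetical policy shape: a trigger, a condition, and an action.
    doc_type: str                           # policies are typically tied to a document type
    trigger: str                            # "DocCreate" or "DocUpdate"
    condition: Callable[[Document], bool]   # e.g. Invoice.Amount < 1000
    action: Callable[[Document], None]      # e.g. add the document to a folder


def evaluate(policies, event, doc):
    """Run each policy whose type and trigger match and whose condition holds.

    Returns the number of policies that fired.
    """
    fired = 0
    for policy in policies:
        if (policy.doc_type == doc.doc_type
                and policy.trigger == event
                and policy.condition(doc)):
            policy.action(doc)
            fired += 1
    return fired


# Assumed example policy: file invoices under $1000 into a folder on creation.
small_invoice = Policy(
    doc_type="Invoice",
    trigger="DocCreate",
    condition=lambda d: d.properties.get("Amount", 0) < 1000,
    action=lambda d: d.folders.add("small-invoices"),
)

doc = Document(doc_type="Invoice", properties={"Amount": 250})
evaluate([small_invoice], "DocCreate", doc)
print(doc.folders)
```

In this sketch the engine is just a loop over policies. The Policy Engine described above is a server-side component, so its real evaluation order and action semantics may differ.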

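The Faceted Search and Search Histogram entries above can be illustrated with a small, self-contained Python sketch. It works on an in-memory list of dictionaries and is not the Document AI Warehouse search API. The function names faceted_search and histogram, the property names doc_type and state, and the sample data are assumptions made for this example.

```python
from collections import Counter

# Assumed sample data: each dict stands in for a document's properties.
documents = [
    {"doc_type": "Driver License", "state": "CA"},
    {"doc_type": "Driver License", "state": "CA"},
    {"doc_type": "Driver License", "state": "NV"},
    {"doc_type": "Bank Statement", "state": "CA"},
]


def faceted_search(docs, **facets):
    """Keep only documents whose properties equal every requested facet value."""
    return [d for d in docs if all(d.get(k) == v for k, v in facets.items())]


def histogram(results, facet):
    """Count search results by the value of one facet."""
    return Counter(d[facet] for d in results if facet in d)


licenses = faceted_search(documents, doc_type="Driver License")
print(histogram(licenses, "state"))
```

Each facet acts as an equality filter on an enumerated property, and the histogram is a count of the filtered results grouped by one facet, which mirrors the "CA 500, NV 150, …" example in the terminology.
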
Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2026-06-15 UTC.