Document Store

You can think of the Document Store as a database that stores your data and provides it to the Retriever at query time. Learn how to use a Document Store in a pipeline or how to create your own.

A Document Store is an object that stores your documents. In Haystack, a Document Store is different from a component, as it doesn’t have a run() method. You can think of it as an interface to your database: you put information there, or you look through it. This means that a Document Store is not a piece of a pipeline but rather a tool that the components of a pipeline have access to and can interact with.

👍
Work with Retrievers

The most common way to use a Document Store in Haystack is to fetch documents using a Retriever. A Document Store will often have a corresponding Retriever to get the most out of specific technologies. See more information in our Retriever documentation.

📘
How to choose a Document Store?

To learn about the different types of Document Stores and their strengths and weaknesses, head to the Choosing a Document Store page.

Document Stores in Haystack are designed to use the following methods as part of their protocol:

count_documents returns the number of documents stored in the given store as an integer.
filter_documents returns a list of documents that match the provided filters.
write_documents writes or overwrites documents into the given store and returns the number of documents that were written as an integer.
delete_documents deletes all documents with given document_ids from the Document Store.
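
As a rough illustration of the four methods listed above, the sketch below expresses them as a Python typing.Protocol. The names DocumentStoreLike and EmptyStore are hypothetical, and the signatures are simplified assumptions rather than Haystack's exact protocol definition.

```python
from typing import Any, Dict, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class DocumentStoreLike(Protocol):
    """Hypothetical, simplified stand-in for the Document Store protocol."""

    def count_documents(self) -> int: ...

    def filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Any]: ...

    def write_documents(self, documents: List[Any]) -> int: ...

    def delete_documents(self, document_ids: List[str]) -> None: ...


class EmptyStore:
    """Trivial store that satisfies the protocol without storing anything."""

    def count_documents(self) -> int:
        return 0

    def filter_documents(self, filters=None):
        return []

    def write_documents(self, documents):
        return 0  # nothing is ever written

    def delete_documents(self, document_ids):
        return None


# isinstance() on a runtime-checkable protocol only checks that the methods exist,
# not that their signatures or return types match.
conforms = isinstance(EmptyStore(), DocumentStoreLike)
```

Because the check is structural, EmptyStore does not need to inherit from anything; it only needs to provide the four methods.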

To use a Document Store in a pipeline, you must initialize it first.

See the installation and initialization details for each Document Store in the "Document Stores" section in the navigation panel on your left.

Before writing your data into a Document Store, convert it into Document objects. Each Document holds the content together with its metadata and a document ID.

The ID field is mandatory, so if you don’t choose a specific ID yourself, Haystack will do its best to come up with a unique ID based on the document’s information and assign it automatically. However, since Haystack uses the document’s contents to create an ID, two identical documents might have identical IDs. Keep this in mind when you update your documents, as the ID will not be updated automatically.

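# Requires the Chroma integration package (chroma-haystack).
from haystack import Document
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

DOCUMENT_NAME = "example_document"  # placeholder value

document_store = ChromaDocumentStore()
documents = [
    Document(
        id="document_unique_id",
        content="this is content",
        meta={"name": DOCUMENT_NAME},  # add further metadata fields as needed
    ),
    # ... more Document objects
]
document_store.write_documents(documents)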
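
The automatic ID assignment described earlier can be illustrated with a small standalone sketch. It assumes a content-hash scheme; ToyDocument and derive_id are hypothetical names, and the hashing details are an illustration, not Haystack's actual implementation.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict


def derive_id(content: str, meta: Dict[str, Any]) -> str:
    """Hypothetical helper: hash the document's fields into a stable ID."""
    payload = f"{content}|{sorted(meta.items())}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class ToyDocument:
    """Simplified stand-in for a Document; not the Haystack class."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # Assign an ID only when the caller did not provide one.
        if not self.id:
            self.id = derive_id(self.content, self.meta)


first = ToyDocument(content="My name is Jean and I live in Paris.")
second = ToyDocument(content="My name is Jean and I live in Paris.")
same_id = first.id == second.id  # identical contents produce identical IDs

# Changing the content afterwards does not recompute the ID.
old_id = first.id
first.content = "My name is Jean and I live in Lyon."
id_unchanged = first.id == old_id
```

If you rely on generated IDs and later edit a document's content, you may want to assign a fresh ID yourself, or set explicit IDs from the start.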

To write documents into the InMemoryDocumentStore, call its write_documents() method:

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
    Document(content="My name is Giorgio and I live in Rome."),
])

📘
DocumentWriter

See DocumentWriter component docs to write your documents into a Document Store in a pipeline.

DuplicatePolicy is an enum that defines how to handle documents whose ID already exists in a Document Store. The options described here are:

OVERWRITE: Indicates that if a document with the same ID already exists in the DocumentStore, it should be overwritten with the new document.
SKIP: If a document with the same ID already exists, the new document will be skipped and not added to the DocumentStore.
FAIL: Raises an error if a document with the same ID already exists in the DocumentStore. It prevents duplicate documents from being added.

Here is an example of how you could apply the policy to skip the existing document:

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
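
To make the OVERWRITE, SKIP, and FAIL behaviors concrete, here is a minimal sketch that applies them to a plain dictionary. ToyPolicy and write are hypothetical names, and the sketch raises a ValueError where a real Document Store would raise its own error type; it is not how Haystack implements the policies internally.

```python
from enum import Enum
from typing import Dict, List, Tuple


class ToyPolicy(Enum):
    """Hypothetical stand-in mirroring the three policies described above."""

    OVERWRITE = "overwrite"
    SKIP = "skip"
    FAIL = "fail"


def write(store: Dict[str, str], docs: List[Tuple[str, str]], policy: ToyPolicy) -> int:
    """Write (id, content) pairs into a dict and return how many were written."""
    written = 0
    for doc_id, content in docs:
        if doc_id in store:
            if policy is ToyPolicy.SKIP:
                continue  # keep the existing document
            if policy is ToyPolicy.FAIL:
                raise ValueError(f"Document with ID '{doc_id}' already exists")
            # OVERWRITE falls through and replaces the stored document
        store[doc_id] = content
        written += 1
    return written


store = {"1": "old"}
skipped = write(store, [("1", "new"), ("2", "other")], ToyPolicy.SKIP)
after_skip = dict(store)  # snapshot: document "1" is untouched, "2" was added
overwritten = write(store, [("1", "new")], ToyPolicy.OVERWRITE)
```

With SKIP, the returned count reflects only the documents that were actually written, which is why it can be lower than the number of documents passed in.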

All custom Document Stores must implement the protocol with four mandatory methods: count_documents, filter_documents, write_documents, and delete_documents.

The __init__ method should accept all the settings specific to the chosen database or vector store.
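
As a rough sketch of what such a class could look like, the example below backs the four methods with a plain dictionary. ToyKeyValueStore, index_name, and timeout_s are hypothetical; documents are plain dicts rather than Document objects, and the filter logic is a simplified exact match, not Haystack's filter syntax.

```python
from typing import Any, Dict, List, Optional


class ToyKeyValueStore:
    """Dict-backed sketch of a custom store; documents are plain dicts with an 'id' key."""

    def __init__(self, index_name: str = "default", timeout_s: float = 5.0) -> None:
        # Hypothetical backend settings; a real store would take its database's own options.
        self.index_name = index_name
        self.timeout_s = timeout_s
        self._docs: Dict[str, Dict[str, Any]] = {}

    def count_documents(self) -> int:
        return len(self._docs)

    def filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
        # Simplified filter: exact match on top-level keys.
        filters = filters or {}
        return [
            doc for doc in self._docs.values()
            if all(doc.get(key) == value for key, value in filters.items())
        ]

    def write_documents(self, documents: List[Dict[str, Any]]) -> int:
        for doc in documents:
            self._docs[doc["id"]] = doc  # overwrites on duplicate ID
        return len(documents)

    def delete_documents(self, document_ids: List[str]) -> None:
        for doc_id in document_ids:
            self._docs.pop(doc_id, None)  # ignore IDs that are not stored


store = ToyKeyValueStore(index_name="articles")
store.write_documents([{"id": "a", "city": "Paris"}, {"id": "b", "city": "Rome"}])
paris = store.filter_documents({"city": "Paris"})
store.delete_documents(["b"])
```

Keeping all backend-specific settings in the constructor means the rest of the pipeline only ever sees the four protocol methods.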

We also recommend having a custom corresponding Retriever to get the most out of a specific Document Store.

See the Creating Custom Document Stores page for more details.
