txtai - File To HTML


The File To HTML pipeline transforms files to HTML. It supports the following text extraction backends.

Apache Tika

Apache Tika detects and extracts metadata and text from over a thousand different file types. See this link for a list of supported document formats.

Apache Tika requires Java to be installed. Alternatively, start a separate Apache Tika service via this Docker Image and set these environment variables.

Docling

Docling parses documents and exports them to the desired format with ease and speed. The library has rapidly gained popularity since late 2024. Docling excels at parsing formatting elements from PDFs (tables, sections, etc.).

See this link for a list of supported document formats.

Example

The following shows a simple example using this pipeline.

```python
from txtai.pipeline import FileToHTML

# Create and run pipeline
html = FileToHTML()
html("/path/to/file")
```

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

```yaml
# Create pipeline using lower case class name
filetohtml:

# Run pipeline with workflow
workflow:
  html:
    tasks:
      - action: filetohtml
```

Run with Workflows

```python
from txtai import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("html", ["/path/to/file"]))
```

Run with API

```bash
CONFIG=config.yml uvicorn "txtai.api:app" &

curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"html", "elements":["/path/to/file"]}'
```
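The same request can be assembled from Python with only the standard library. The sketch below builds the POST request without sending it; the helper name `build_workflow_request` and its `base_url` parameter are illustrative and not part of txtai, and the endpoint and JSON body simply mirror the curl call above.

```python
import json
import urllib.request


def build_workflow_request(name, elements, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /workflow endpoint."""
    # Same JSON body as the curl example: workflow name plus input elements
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/workflow",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_workflow_request("html", ["/path/to/file"])

# To send it against a running API instance (assumed to be started as shown above):
# with urllib.request.urlopen(request) as response:
#     results = json.loads(response.read())
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call is made.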

Methods

Python documentation for the pipeline.

`__init__(backend='available')`

Creates a new File to HTML pipeline.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `backend` | | backend to use to extract content, supports "tika", "docling" or "available" (default) which finds the first available | `'available'` |

Source code in txtai/pipeline/data/filetohtml.py

```python
def __init__(self, backend="available"):
    """
    Creates a new File to HTML pipeline.

    Args:
        backend: backend to use to extract content, supports "tika", "docling" or "available" (default) which finds the first available
    """

    # Lowercase backend parameter
    backend = backend.lower() if backend else None

    # Check for available backend
    if backend == "available":
        backend = "tika" if Tika.available() else "docling" if Docling.available() else None

    # Create backend instance
    self.backend = Tika() if backend == "tika" else Docling() if backend == "docling" else None
```
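The selection order in the constructor can be traced without installing either backend. The following is a minimal standalone sketch of that logic: `resolve_backend` and its two boolean flags are hypothetical stand-ins for the `Tika.available()` and `Docling.available()` checks, and the function returns a backend name rather than an instance.

```python
def resolve_backend(backend="available", tika_available=False, docling_available=False):
    """Return the backend name the constructor logic would settle on, or None."""
    # Lowercase the parameter; empty values become None
    backend = backend.lower() if backend else None

    # "available" picks Tika first, then Docling, else nothing
    if backend == "available":
        backend = "tika" if tika_available else "docling" if docling_available else None

    # Only the two known names map to a backend
    return backend if backend in ("tika", "docling") else None


print(resolve_backend("available", tika_available=True, docling_available=True))  # tika
print(resolve_backend("available", docling_available=True))  # docling
print(resolve_backend("Docling"))  # docling
print(resolve_backend("other"))  # None
```

Note that an explicit name such as `"docling"` skips the availability checks in this method, and that Tika takes precedence when both backends are present.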

`__call__(path)`

Converts file at path to HTML. Returns None if no backend is available.
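Because the call can return None, callers may want to separate converted files from those that produced no output. The sketch below shows one way to do that; `convert_all` is a hypothetical helper, and `no_backend` is a stand-in callable used here in place of a real `FileToHTML` instance so the example runs without txtai installed.

```python
def convert_all(pipeline, paths):
    """Split paths into converted HTML and paths that produced no output."""
    converted, skipped = {}, []
    for path in paths:
        result = pipeline(path)
        if result is None:
            # No backend was available for this call
            skipped.append(path)
        else:
            converted[path] = result
    return converted, skipped


# Stand-in for a pipeline with no backend installed: always returns None
def no_backend(path):
    return None


converted, skipped = convert_all(no_backend, ["/path/to/file"])
```

With a real pipeline, passing `FileToHTML()` as `pipeline` should behave the same way, assuming it is called with one path at a time as in the example at the top of this page.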