ScrapeGraphAI/toonify: Compact data format reducing LLM token usage by 30-60%

TOON (Token-Oriented Object Notation)

A compact, human-readable serialization format designed for passing structured data to Large Language Models with significantly reduced token usage.

Overview

TOON achieves CSV-like compactness while adding explicit structure, making it ideal for:

Key Features

Installation

For development:

With Pydantic support:

pip install toonify[pydantic]

Quick Start

Python API

from toon import encode, decode

# Encode Python dict to TOON
data = {
    'products': [
        {'sku': 'LAP-001', 'name': 'Gaming Laptop', 'price': 1299.99},
        {'sku': 'MOU-042', 'name': 'Wireless Mouse', 'price': 29.99}
    ]
}

toon_string = encode(data)
print(toon_string)
# Output:
# products[2]{sku,name,price}:
#   LAP-001,Gaming Laptop,1299.99
#   MOU-042,Wireless Mouse,29.99

# Decode TOON back to Python
result = decode(toon_string)
assert result == data
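To make the round trip above more concrete, here is a minimal sketch of how a tabular block could be parsed by hand. It is not the library's decoder: `parse_tabular` and `_coerce` are hypothetical helpers, and the sketch assumes comma-delimited rows with no quoted values.

```python
import re

# Hypothetical helpers for illustration only; toonify's real decoder covers far more cases.
HEADER = re.compile(r'^(\w+)\[(\d+)\]\{([^}]*)\}:$')

def _coerce(text):
    """Turn a cell into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

def parse_tabular(block):
    """Parse 'key[N]{f1,f2}:' followed by N comma-separated rows."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    match = HEADER.match(lines[0])
    if match is None:
        raise ValueError('not a tabular header: ' + lines[0])
    key = match.group(1)
    count = int(match.group(2))
    fields = match.group(3).split(',')
    rows = lines[1:]
    if len(rows) != count:
        raise ValueError(f'header declares {count} rows, found {len(rows)}')
    return {key: [dict(zip(fields, map(_coerce, row.split(',')))) for row in rows]}

sample = """products[2]{sku,name,price}:
  LAP-001,Gaming Laptop,1299.99
  MOU-042,Wireless Mouse,29.99"""
parsed = parse_tabular(sample)
print(parsed['products'][0]['price'])  # 1299.99
```

The declared row count in the header doubles as a cheap integrity check: a truncated block fails loudly instead of silently returning fewer rows.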

Command Line

# Encode JSON to TOON
toon input.json -o output.toon

# Decode TOON to JSON
toon input.toon -o output.json

# Use with pipes
cat data.json | toon -e > data.toon

# Show token statistics
toon data.json --stats

Pydantic Integration

TOON supports direct conversion from Pydantic models:

from pydantic import BaseModel
from toon import encode_pydantic, decode_to_pydantic

# Define Pydantic models
class User(BaseModel):
    id: int
    name: str
    email: str

# Encode Pydantic models to TOON
users = [
    User(id=1, name='Alice', email='alice@example.com'),
    User(id=2, name='Bob', email='bob@example.com')
]

toon = encode_pydantic(users)
print(toon)
# Output:
# [2]{id,name,email}:
#   1,Alice,alice@example.com
#   2,Bob,bob@example.com

# Decode TOON back to Pydantic models
decoded_users = decode_to_pydantic(toon, User)
assert all(isinstance(u, User) for u in decoded_users)

Features:

See examples/pydantic_usage.py for more examples.

Response Structure Templates for LLM Prompts

TOON can generate response structure templates for inclusion in LLM prompts. A template tells the model exactly what format to return, without the need for examples containing actual data.

from toon import generate_structure

# Define the expected response structure
schema = {
    "name": "name of the person",
    "age": "age of the person",
    "occupation": "job description of the person"
}

# Generate the structure template
structure = generate_structure(schema)
print(structure)
# Output:
# name: <name of the person>
# age: <age of the person>
# occupation: <job description of the person>

# Use in your LLM prompt
prompt = f"""Extract person information from the text and return it in this format:
{structure}

Text: [your text here...]"""

For arrays and complex structures:

schema = {
    "products": [{
        "name": "product name",
        "price": "price in USD",
        "rating": "rating from 1-5"
    }]
}

structure = generate_structure(schema)
print(structure)
# Output:
# products[N]{name,price,rating}:
#   <product name>,<price in USD>,<rating from 1-5>
#   ...
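For intuition, the mapping from a schema to a template can be sketched in a few lines. This is a hedged illustration, not the implementation behind `generate_structure`: `sketch_structure` is a hypothetical name, and the sketch assumes descriptions are rendered as angle-bracket placeholders and that a list holding one dict denotes a tabular array of unknown length `N`.

```python
def sketch_structure(schema, indent=2):
    """Render {field: description} into a TOON-like template (illustrative only)."""
    pad = ' ' * indent
    lines = []
    for key, spec in schema.items():
        if isinstance(spec, list) and spec and isinstance(spec[0], dict):
            # A list with one example dict becomes a tabular header plus one placeholder row.
            fields = list(spec[0])
            lines.append(f"{key}[N]{{{','.join(fields)}}}:")
            lines.append(pad + ','.join(f'<{spec[0][f]}>' for f in fields))
            lines.append(pad + '...')
        else:
            # Anything else is treated as a scalar field with a description.
            lines.append(f'{key}: <{spec}>')
    return '\n'.join(lines)

template = sketch_structure({
    'products': [{
        'name': 'product name',
        'price': 'price in USD',
        'rating': 'rating from 1-5',
    }]
})
print(template)
```

Because the template carries only field names and descriptions, it costs a handful of tokens regardless of how many rows the model is expected to return.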

With Pydantic models:

from pydantic import BaseModel, Field
from toon import generate_structure_from_pydantic

class Product(BaseModel):
    name: str = Field(description="product name")
    price: float = Field(description="price in USD")
    in_stock: bool = Field(description="availability status")

# Generate structure from model
structure = generate_structure_from_pydantic(Product)
# Use in LLM prompts without providing examples

Benefits:

See examples/structure_template_usage.py for comprehensive examples.

TOON Format Specification

Basic Syntax

# Simple key-value pairs
title: Machine Learning Basics
chapters: 12
published: true

Arrays

Primitive arrays (inline):

temperatures: [72.5,68.3,75.1,70.8,73.2]
categories: [electronics,computers,accessories]
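As a rough sketch of the inline form, the two hypothetical helpers below format and split a primitive array. They assume plain scalars with no delimiter inside any item, and they leave parsed items as raw strings rather than guessing types.

```python
def inline_array(key, values):
    """Format a flat list of scalars as 'key: [a,b,c]'."""
    return f"{key}: [{','.join(str(v) for v in values)}]"

def parse_inline_array(line):
    """Split 'key: [a,b,c]' into the key and a list of raw string items."""
    key, _, rest = line.partition(': ')
    rest = rest.strip()
    if not (rest.startswith('[') and rest.endswith(']')):
        raise ValueError('not an inline array: ' + line)
    inner = rest[1:-1]
    return key, inner.split(',') if inner else []

line = inline_array('temperatures', [72.5, 68.3, 75.1, 70.8, 73.2])
print(line)  # temperatures: [72.5,68.3,75.1,70.8,73.2]
print(parse_inline_array('categories: [electronics,computers,accessories]'))
```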

Tabular arrays (uniform objects with header):

inventory[3]{sku,product,stock}:
  KB-789,Mechanical Keyboard,45
  MS-456,RGB Mouse Pad,128
  HD-234,USB Headset,67
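The tabular layout is where most of the savings come from, since field names appear once in the header instead of once per row. A minimal sketch of producing it by hand, assuming every row has the same keys in the same order and no value contains the delimiter (`to_tabular` is a hypothetical helper, not part of the `toon` API):

```python
def to_tabular(key, rows, indent=2):
    """Encode a list of same-shaped dicts as a TOON-style tabular array."""
    if not rows:
        raise ValueError('need at least one row')
    fields = list(rows[0])
    if any(list(row) != fields for row in rows):
        raise ValueError('rows are not uniform')
    # Header: key, declared row count, then the shared field names.
    header = f"{key}[{len(rows)}]{{{','.join(fields)}}}:"
    body = [' ' * indent + ','.join(str(row[f]) for f in fields) for row in rows]
    return '\n'.join([header] + body)

inventory = [
    {'sku': 'KB-789', 'product': 'Mechanical Keyboard', 'stock': 45},
    {'sku': 'MS-456', 'product': 'RGB Mouse Pad', 'stock': 128},
    {'sku': 'HD-234', 'product': 'USB Headset', 'stock': 67},
]
table = to_tabular('inventory', inventory)
print(table)
```

Rows that differ in shape raise an error here; in TOON they would fall back to the list-array form instead.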

List arrays (non-uniform or nested):

tasks[2]:
  Complete documentation
  Review pull requests

Nested Objects

server:
  hostname: api-prod-01
  config:
    port: 8080
    region: us-east
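Nesting by indentation is straightforward to reproduce. The sketch below uses a hypothetical `render` function that handles only dicts and scalars and prints values with `str()`, so a Python `True` would appear as `True` rather than TOON's `true`.

```python
def render(obj, indent=2, level=0):
    """Render nested dicts as indented 'key: value' lines (dicts and scalars only)."""
    pad = ' ' * (indent * level)
    lines = []
    for key, value in obj.items():
        if isinstance(value, dict):
            # A nested object gets a bare 'key:' line, then its children one level deeper.
            lines.append(f'{pad}{key}:')
            lines.extend(render(value, indent, level + 1))
        else:
            lines.append(f'{pad}{key}: {value}')
    return lines

server = {
    'server': {
        'hostname': 'api-prod-01',
        'config': {'port': 8080, 'region': 'us-east'},
    }
}
text = '\n'.join(render(server))
print(text)
```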

Quoting Rules

Strings are quoted only when necessary:

simple: ProductName
quoted: "Product, Description"
escaped: "Size: 15\" display"
multiline: "First feature\nSecond feature"
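One plausible reading of "only when necessary" is sketched below. The exact trigger set used by the library is not listed here, so the characters in `risky` are an assumption chosen to reproduce the four examples above, and `quote_if_needed` is a hypothetical name.

```python
def quote_if_needed(value, delimiter=','):
    """Quote a string only if it contains characters that could confuse a parser."""
    # Assumed trigger set: the active delimiter, colon, double quote, backslash, newline.
    risky = (delimiter, ':', '"', '\\', '\n')
    needs_quotes = (
        value == ''
        or value != value.strip()
        or any(ch in value for ch in risky)
    )
    if not needs_quotes:
        return value
    # Escape backslashes first so later escapes are not doubled.
    escaped = (
        value.replace('\\', '\\\\')
             .replace('"', '\\"')
             .replace('\n', '\\n')
    )
    return f'"{escaped}"'

for sample in ['ProductName', 'Product, Description',
               'Size: 15" display', 'First feature\nSecond feature']:
    print(quote_if_needed(sample))
```

Leaving safe strings bare is what keeps the format compact: each avoided pair of quotes is two characters saved per value.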

API Reference

encode(data, options=None)

Convert Python object to TOON string.

Parameters:

Example:

toon = encode(data, { 'delimiter': 'tab', 'indent': 4, 'key_folding': 'safe' })

decode(toon_string, options=None)

Convert TOON string to Python object.

Parameters:

Example:

data = decode(toon_string, { 'expand_paths': 'safe', 'strict': False })

encode_pydantic(model, options=None, exclude_unset=False, exclude_none=False, exclude_defaults=False, by_alias=False)

Convert Pydantic model(s) to TOON string.

Parameters:

Example:

from pydantic import BaseModel
from toon import encode_pydantic

class User(BaseModel):
    id: int
    name: str
    email: str | None = None

user = User(id=1, name='Alice')
toon = encode_pydantic(user, exclude_none=True)

decode_to_pydantic(toon_string, model_class, options=None)

Decode TOON string to Pydantic model(s).

Parameters:

Returns:

Example:

from pydantic import BaseModel
from toon import decode_to_pydantic

class User(BaseModel):
    id: int
    name: str

toon = "id: 1\nname: Alice"
user = decode_to_pydantic(toon, User)

generate_structure(schema, options=None)

Generate a TOON structure template from a schema definition for use in LLM prompts.

Parameters:

Returns:

Example:

from toon import generate_structure

schema = {
    "name": "name of the person",
    "age": "age of the person",
    "occupation": "job description"
}

structure = generate_structure(schema)
print(structure)
# Output:
# name: <name of the person>
# age: <age of the person>
# occupation: <job description>

# Use in LLM prompt:
prompt = f"Extract person info in this format:\n{structure}"

generate_structure_from_pydantic(model_class, options=None, include_descriptions=True)

Generate a TOON structure template from a Pydantic model for use in LLM prompts.

Parameters:

Returns:

Example:

from pydantic import BaseModel, Field
from toon import generate_structure_from_pydantic

class User(BaseModel):
    id: int = Field(description="user identifier")
    name: str = Field(description="full name")
    email: str = Field(description="email address")

structure = generate_structure_from_pydantic(User)
print(structure)
# Output:
# id: <user identifier>
# name: <full name>
# email: <email address>

CLI Usage

usage: toon [-h] [-o OUTPUT] [-e] [-d] [--delimiter {comma,tab,pipe}]
            [--indent INDENT] [--stats] [--no-strict]
            [--key-folding {off,safe}] [--flatten-depth DEPTH]
            [--expand-paths {off,safe}]
            [input]

TOON (Token-Oriented Object Notation) - Convert between JSON and TOON formats

positional arguments:
  input                 Input file path (or "-" for stdin)

optional arguments:
  -h, --help            show this help message and exit
  -o, --output OUTPUT   Output file path (default: stdout)
  -e, --encode          Force encode mode (JSON to TOON)
  -d, --decode          Force decode mode (TOON to JSON)
  --delimiter {comma,tab,pipe}
                        Array delimiter (default: comma)
  --indent INDENT       Indentation size (default: 2)
  --stats               Show token statistics
  --no-strict           Disable strict validation (decode only)
  --key-folding {off,safe}
                        Key folding mode (encode only)
  --flatten-depth DEPTH Maximum key folding depth (encode only)
  --expand-paths {off,safe}
                        Path expansion mode (decode only)

Advanced Features

Key Folding

Collapse single-key chains into dotted paths:

data = {
    'api': {
        'response': {
            'product': {
                'title': 'Wireless Keyboard'
            }
        }
    }
}

# With key_folding='safe'
toon = encode(data, {'key_folding': 'safe'})
# Output: api.response.product.title: Wireless Keyboard
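The folding step itself can be sketched on plain dicts. `fold_keys` is a hypothetical helper, and it performs none of the checks that the library's `safe` mode presumably applies (for example, around keys that already contain dots); it only shows the core idea of collapsing single-key chains.

```python
def fold_keys(obj):
    """Collapse chains of single-key dicts into dotted keys (no safety checks)."""
    folded = {}
    for key, value in obj.items():
        path = key
        # Keep descending while the value is a dict with exactly one entry.
        while isinstance(value, dict) and len(value) == 1:
            (child_key, value), = value.items()
            path = f'{path}.{child_key}'
        folded[path] = fold_keys(value) if isinstance(value, dict) else value
    return folded

nested = {'api': {'response': {'product': {'title': 'Wireless Keyboard'}}}}
print(fold_keys(nested))
# {'api.response.product.title': 'Wireless Keyboard'}
```

Folding stops as soon as a dict has more than one key, so branching structures keep their nesting from that point down.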

Path Expansion

Expand dotted keys into nested objects:

toon = 'store.location.zipcode: 10001'

# With expand_paths='safe'
data = decode(toon, {'expand_paths': 'safe'})
# Result: {'store': {'location': {'zipcode': 10001}}}
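Expansion is the inverse operation. A minimal sketch on an already-parsed flat dict follows, with a hypothetical `expand_paths` function. Raising on a scalar/object conflict is one guess at what a "safe" mode might guard against, not a statement of the library's behavior.

```python
def expand_paths(flat):
    """Turn {'a.b.c': v} into {'a': {'b': {'c': v}}}, refusing conflicting paths."""
    nested = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split('.')
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                # A scalar already occupies this path; do not overwrite it silently.
                raise ValueError(f'{dotted!r} conflicts with an existing scalar')
        node[leaf] = value
    return nested

print(expand_paths({'store.location.zipcode': 10001}))
# {'store': {'location': {'zipcode': 10001}}}
```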

Custom Delimiters

Choose the delimiter that best fits your data:

# Tab delimiter (better for spreadsheet-like data)
toon = encode(data, {'delimiter': 'tab'})

# Pipe delimiter (when data contains commas)
toon = encode(data, {'delimiter': 'pipe'})
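If the data is not known in advance, the choice can be automated by scanning the values first. The helper below is a hypothetical sketch; the mapping from the option names `comma`, `tab`, and `pipe` to literal characters is an assumption.

```python
# Assumed mapping from option names to the characters they stand for.
DELIMITERS = {'comma': ',', 'tab': '\t', 'pipe': '|'}

def pick_delimiter(rows):
    """Return the first delimiter name whose character appears in no cell."""
    cells = [str(value) for row in rows for value in row.values()]
    for name, char in DELIMITERS.items():
        if not any(char in cell for cell in cells):
            return name
    # Every candidate collides with the data; fall back and rely on quoting.
    return 'comma'

rows = [{'name': 'Laptop, 15 inch', 'note': 'black|silver'}]
print(pick_delimiter(rows))  # tab
```

The returned name can then be passed as the `delimiter` option to `encode`, which avoids quoting values that merely contain commas.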

Format Comparison

JSON vs TOON

JSON (247 bytes):

{
  "products": [
    {"id": 101, "name": "Laptop Pro", "price": 1299},
    {"id": 102, "name": "Magic Mouse", "price": 79},
    {"id": 103, "name": "USB-C Cable", "price": 19}
  ]
}

TOON (98 bytes, 60% reduction):

products[3]{id,name,price}:
  101,Laptop Pro,1299
  102,Magic Mouse,79
  103,USB-C Cable,19
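Size comparisons like this depend heavily on how the JSON side is formatted, so it is worth measuring against your own data. The sketch below uses only the standard library and a hand-written TOON string; `reduction` is a hypothetical helper, and the figures are UTF-8 byte counts, which only approximate token counts.

```python
import json

products = {
    'products': [
        {'id': 101, 'name': 'Laptop Pro', 'price': 1299},
        {'id': 102, 'name': 'Magic Mouse', 'price': 79},
        {'id': 103, 'name': 'USB-C Cable', 'price': 19},
    ]
}

toon_text = """products[3]{id,name,price}:
  101,Laptop Pro,1299
  102,Magic Mouse,79
  103,USB-C Cable,19"""

def reduction(before, after):
    """Percentage of UTF-8 bytes saved going from `before` to `after`."""
    b = len(before.encode('utf-8'))
    a = len(after.encode('utf-8'))
    return 100 * (b - a) / b

pretty = json.dumps(products, indent=2)                 # human-readable JSON
compact = json.dumps(products, separators=(',', ':'))   # minified JSON

print(f'vs pretty JSON:  {reduction(pretty, toon_text):.0f}%')
print(f'vs compact JSON: {reduction(compact, toon_text):.0f}%')
```

Comparing against both pretty-printed and minified JSON gives an upper and a lower bound on the savings for a given payload.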

When to Use TOON

Use TOON when:

Use JSON when:

Development

Setup

git clone https://github.com/ScrapeGraphAI/toonify.git
cd toonify
pip install -e .[dev]

Running Tests

pytest
pytest --cov=toon --cov-report=term-missing

Running Examples

python examples/basic_usage.py
python examples/advanced_features.py

Performance

Benchmarked across 50 diverse, real-world datasets:

💰 Cost Impact: At GPT-4 pricing, TOON saves $2,147 per million API requests and $5,408 per billion tokens.

Contributing

Contributions are welcome! We appreciate bug fixes, feature additions, documentation improvements, and more.

Quick Start:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes with tests
  4. Run tests (pytest)
  5. Commit your changes (git commit -m 'Add amazing feature')
  6. Push to the branch (git push origin feature/amazing-feature)
  7. Open a Pull Request

For detailed guidelines, please see our Contributing Guide.

License

MIT License - see LICENSE file for details.

Credits

Python implementation inspired by the TypeScript TOON library at toon-format/toon.


Made with love by the ScrapeGraph team
