allenai/duplodocus: Tooling for exact and MinHash deduplication of large-scale text datasets

A high-performance exact and fuzzy (MinHash) document deduplication tool, implemented in Rust for processing large-scale JSONL datasets.


Overview

This tool provides four deduplication strategies optimized for different dataset sizes and requirements:

Method           | Storage    | Best For
Exact + Memory   | In-memory  | Small datasets (<10GB), simple exact matching
Exact + Disk     | Disk-based | Large datasets, exact matching, distributed processing
MinHash + Memory | In-memory  | Small datasets (<10GB), fuzzy matching
MinHash + Disk   | Disk-based | Large datasets, fuzzy matching, distributed processing

Key Features

Theory

Notes on the theory behind this tooling, along with details about its internals, can be found in the primer.

Installation

Prerequisites

AWS EC2 Setup (Optional)

For large-scale processing on AWS i4i/i7i instances with NVMe drives:

Configure RAID0 array from NVMe drives

sudo yum install mdadm -y
sudo mdadm --create /dev/md0 --level=0 --raid-devices=8 \
    /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
    /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1
sudo mkfs.xfs /dev/md0
sudo mkdir /mnt/raid0
sudo mount /dev/md0 /mnt/raid0
sudo chown -R $USER /mnt/raid0
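
To confirm the array assembled and mounted as expected (the device names above assume eight NVMe drives; adjust the list and --raid-devices to match your instance type), you can check:

cat /proc/mdstat
df -h /mnt/raid0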

Install build dependencies

sudo yum install gcc gcc-c++ cmake openssl-devel htop git -y

Install s5cmd for fast S3 transfers

wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xvzf s5cmd_2.2.2_Linux-64bit.tar.gz
sudo mv s5cmd /usr/local/bin

Build from Source

Install Rust

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.bashrc

Clone and build

git clone git@github.com:allenai/duplodocus.git
cd duplodocus
cargo build --release

Binary will be at: ./target/release/dedup-tool
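
As a quick sanity check, assuming the CLI follows the usual convention of printing usage information with --help:

./target/release/dedup-tool --help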

Download Data (if using S3)

Configure AWS credentials

aws configure

Download JSONL files

s5cmd cp -sp s3://your-bucket/path/to/data/* /mnt/raid0/input_data/

Quick Start

Exact Deduplication (Small Dataset)

Remove documents with identical content:

cargo run --release -- exact-dedup-memory \
  --input-dir /data/documents \
  --output-dir /data/unique \
  --text-key "content"

Fuzzy Deduplication (Small Dataset)

Find and remove near-duplicates:

cargo run --release -- minhash-memory \
  --input-dir /data/documents \
  --storage-dir /tmp/work \
  --output-dir /data/deduped \
  --text-key "text" \
  --num-buckets 20 \
  --bucket-size 5 \
  --remove-duplicates true \
  --cleanup-storage

Deduplication Methods

Exact Deduplication

Memory-Based (Simple)

Best for datasets under 100GB. Processes everything in one pass:

cargo run --release -- exact-dedup-memory \
  --input-dir /data/docs \
  --output-dir /data/unique \
  --text-key "content" \
  --annotate-key "duplicate_info"   # Optional: annotate instead of remove

Options:

Disk-Based (Distributed)

For large datasets or distributed processing:

Step 1: Group documents by hash

cargo run --release -- exact-dedup-disk-group \
  --input-dir /data/docs \
  --storage-dir /scratch/work \
  --hash-key "doc_hash" \
  --num-bins 100

Step 2: Remove duplicates

cargo run --release -- exact-dedup-disk-prune \
  --storage-dir /scratch/work \
  --output-dir /data/unique \
  --hash-key "doc_hash"

Fuzzy Deduplication (MinHash)

Memory-Based (Simple)

All-in-one fuzzy deduplication for smaller datasets:

cargo run --release -- minhash-memory \
  --input-dir /data/docs \
  --storage-dir /tmp/work \
  --output-dir /data/deduped \
  --text-key "text" \
  --num-buckets 20 \
  --bucket-size 5 \
  --ngram-size 5 \
  --remove-duplicates true \
  --cleanup-storage

Key Parameters:

Disk-Based (Distributed)

For large-scale distributed processing across multiple machines:

Step 1: Build file map (run once)

cargo run --release -- mh-build-file-map \
  --input-dir /data/docs \
  --storage-dir /shared/work

Step 2: Hash documents (parallel across workers)

Worker 0

cargo run --release -- mh-hash-docs \
  --local-input /data/docs \
  --storage-dir /shared/work \
  --text-key "text" \
  --path-chunk 0 \
  --num-path-chunks 10 \
  --num-buckets 20 \
  --bucket-size 5

Worker 1

cargo run --release -- mh-hash-docs \
  --local-input /data/docs \
  --storage-dir /shared/work \
  --text-key "text" \
  --path-chunk 1 \
  --num-path-chunks 10 \
  --num-buckets 20 \
  --bucket-size 5

... repeat for workers 2-9
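
If a single machine has to cover every chunk (or to script one worker's share), the per-chunk invocations above can be driven with a simple loop. This is a minimal sketch that walks all ten path chunks sequentially, using only the flags shown above; a real distributed run would instead assign disjoint --path-chunk values to separate workers:

# Hash every path chunk from one machine (sequential sketch)
for CHUNK in $(seq 0 9); do
  cargo run --release -- mh-hash-docs \
    --local-input /data/docs \
    --storage-dir /shared/work \
    --text-key "text" \
    --path-chunk "$CHUNK" \
    --num-path-chunks 10 \
    --num-buckets 20 \
    --bucket-size 5
done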

Step 3: Gather edges (run once, requires all signatures)

cargo run --release -- mh-gather-edges \
  --storage-dir /shared/work

Step 4: Build Union-Find (run once on single machine)

cargo run --release -- mh-build-uf \
  --storage-dir /shared/work \
  --num-path-chunks 10

Step 5: Clean files (parallel across workers)

Worker 0

cargo run --release -- mh-clean-files \
  --input-dir /data/docs \
  --storage-dir /shared/work \
  --output-dir /data/deduped \
  --path-chunk 0 \
  --num-path-chunks 10 \
  --remove-duplicates true

Repeat for other workers...
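
As with the hashing step, the per-chunk cleaning commands can be scripted. A minimal sketch that sweeps all ten path chunks from a single machine (in a distributed run, each worker would take a disjoint subset of chunk indices):

# Clean every path chunk from one machine (sequential sketch)
for CHUNK in $(seq 0 9); do
  cargo run --release -- mh-clean-files \
    --input-dir /data/docs \
    --storage-dir /shared/work \
    --output-dir /data/deduped \
    --path-chunk "$CHUNK" \
    --num-path-chunks 10 \
    --remove-duplicates true
done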

Examples

Detailed examples with step-by-step instructions are available in the examples/ directory:

Configuration

YAML Configuration (Optional)

For complex setups, you can use a YAML config file:

minhash_config.yaml

minhash_params:
  num_buckets: 26
  bucket_size: 11
  ngram_size: 5
  permutation_seed: 42
  tokenizer: "cl100k_base"
eng_params:
  num_docs: 1000000
  max_lines_per_path: 100000
  num_sig_chunks: 8
output_params:
  annotate: false
  annotate_key: metadata.minhash   # minhash output data location
  remove_duplicates: true          # just annotate, don't remove
  delete_while_cleaning: false

Use with:

cargo run --release -- minhash-memory \
  --input-dir /data/docs \
  --storage-dir /tmp/work \
  --output-dir /data/deduped \
  --text-key "text" \
  --config minhash_config.yaml

System Requirements

Memory-Based Methods

Disk-Based Methods

Design Principles

Performance Tips

  1. Use RAID0 for NVMe drives on cloud instances for maximum I/O throughput
  2. Adjust --num-path-chunks based on available workers
  3. Monitor disk space - intermediate files can be 3-5x input size
  4. Use --cleanup-storage carefully in distributed settings
  5. Set appropriate --num-buckets and --bucket-size for your similarity threshold (see the worked example after this list)
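
Assuming the usual MinHash LSH banding scheme, where --num-buckets is the number of bands and --bucket-size is the number of hash functions per band, a pair of documents with n-gram Jaccard similarity s collides in at least one band with probability 1 - (1 - s^bucket_size)^num_buckets. With the values used in the examples above (20 buckets of size 5), that is roughly 0.9996 at s = 0.8 but only about 0.47 at s = 0.5, so the effective similarity threshold (the 50% detection point, approximately (1/num_buckets)^(1/bucket_size)) sits near 0.55. Raising --bucket-size pushes the threshold up (stricter matching); raising --num-buckets pulls it down.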

Troubleshooting

Out of memory errors: Use disk-based methods instead of memory-based

Slow performance: Ensure you're using fast local storage (NVMe/SSD), not network storage

Missing intermediate files: Ensure all parallel steps complete before running sequential steps