scientist-labs/spellkit: Fast, safe typo correction for Ruby. SymSpell-based spell checker with Rust performance, term protection via regex patterns, and hot-reloadable dictionaries. Sub-millisecond latency, zero dependencies.

spellkit

Fast, safe typo correction for search-term extraction. A Ruby gem with a native Rust implementation of the SymSpell algorithm.

SpellKit provides:

Why a custom implementation? Existing Rust SymSpell crates require lowercase dictionary entries, but SpellKit preserves canonical forms (NASA stays NASA, iPhone stays iPhone). We also needed domain-specific guards, hot-reload, and Aspell-style skip patterns - features not available in existing implementations.

Why SpellKit?

No Runtime Dependencies

SpellKit is a Ruby gem with a native Rust extension. Just run gem install spellkit and you're done. There is no need to install Aspell, Hunspell, or other system packages, which makes deployment simpler and more reliable across environments.

Fast Performance

Built on the SymSpell algorithm with Rust, SpellKit delivers:

See the Benchmarks section for detailed comparisons.

Production Ready

Installation

Add to your Gemfile:

    gem "spellkit"

Or install directly:

    gem install spellkit

Quick Start

SpellKit works with dictionaries from URLs or local files. Try it immediately:

    require "spellkit"

    # Load from URL (downloads and caches automatically)
    SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)

    # Or use a configure block (recommended for Rails)
    SpellKit.configure do |config|
      config.dictionary = SpellKit::DEFAULT_DICTIONARY_URL
      config.edit_distance = 1
    end

    # Or load from local file
    SpellKit.load!(dictionary: "path/to/dictionary.tsv")

    # Check if a word is spelled correctly
    puts SpellKit.correct?("hello")
    # => true

    # Get suggestions for a misspelled word
    suggestions = SpellKit.suggestions("helllo", 5)
    puts suggestions.inspect
    # => [{"term"=>"hello", "distance"=>1, "freq"=>...}]

    # Correct a typo
    corrected = SpellKit.correct("helllo")
    puts corrected
    # => "hello"

    # Batch correction
    tokens = %w[helllo wrld ruby teset]
    corrected_tokens = SpellKit.correct_tokens(tokens)
    puts corrected_tokens.inspect
    # => ["hello", "world", "ruby", "test"]

    # Check stats
    puts SpellKit.stats.inspect
    # => {"loaded"=>true, "dictionary_size"=>..., "edit_distance"=>1, "loaded_at"=>...}

Usage

Basic Correction

    require "spellkit"

    # Load from URL (auto-downloads and caches)
    SpellKit.load!(dictionary: "https://example.com/dict.tsv")

    # Or from local file
    SpellKit.load!(dictionary: "models/dictionary.tsv", edit_distance: 1)

    # Check if a word is correct
    SpellKit.correct?("hello")
    # => true

    # Get suggestions
    SpellKit.suggestions("lyssis", 5)
    # => [{"term"=>"lysis", "distance"=>1, "freq"=>2000}, ...]

    # Correct a typo
    SpellKit.correct("helllo")
    # => "hello"

    # Batch correction
    tokens = %w[helllo wrld ruby]
    SpellKit.correct_tokens(tokens)
    # => ["hello", "world", "ruby"]

Term Protection

Protect specific terms from correction using exact matches or regex patterns:

    # Load with exact-match protected terms
    SpellKit.load!(
      dictionary: "models/dictionary.tsv",
      protected_path: "models/protected.txt" # file with terms to protect
    )

    # Protect terms matching regex patterns
    SpellKit.load!(
      dictionary: "models/dictionary.tsv",
      protected_patterns: [
        /^[A-Z]{3,4}\d+$/,     # gene symbols like CDK10, BRCA1
        /^\d{2,7}-\d{2}-\d$/,  # CAS numbers like 7732-18-5
        /^[A-Z]{2,3}-\d+$/     # SKU patterns like ABC-123
      ]
    )

    # Or combine both
    SpellKit.load!(
      dictionary: "models/dictionary.tsv",
      protected_path: "models/protected.txt",
      protected_patterns: [/^[A-Z]{3,4}\d+$/]
    )

    # Protected terms are automatically respected
    SpellKit.correct("CDK10")
    # => "CDK10" (protected, never changed)

    # Batch correction with protection
    tokens = %w[helllo wrld ABC-123 for CDK10]
    SpellKit.correct_tokens(tokens)
    # => ["hello", "world", "ABC-123", "for", "CDK10"]

Multiple Instances

SpellKit supports multiple independent checker instances, useful for different domains or languages:

    # Create separate instances for different domains
    medical_checker = SpellKit::Checker.new
    medical_checker.load!(
      dictionary: "models/medical_dictionary.tsv",
      protected_path: "models/medical_terms.txt"
    )

    legal_checker = SpellKit::Checker.new
    legal_checker.load!(
      dictionary: "models/legal_dictionary.tsv",
      protected_path: "models/legal_terms.txt"
    )

    # Use them independently
    medical_checker.suggestions("lyssis", 5)
    legal_checker.suggestions("contractt", 5)

    # Each maintains its own state
    medical_checker.stats # Shows medical dictionary stats
    legal_checker.stats   # Shows legal dictionary stats

Configuration Block

Use the configure block pattern for Rails initializers:

    SpellKit.configure do |config|
      config.dictionary = "models/dictionary.tsv"
      config.protected_path = "models/protected.txt"
      config.protected_patterns = [/^[A-Z]{3,4}\d+$/]
      config.edit_distance = 1
      config.frequency_threshold = 10.0
    end

    # This becomes the default instance
    SpellKit.suggestions("word", 5) # Uses configured dictionary

Dictionary Format

Dictionary (required)

Whitespace-separated file with term and frequency (supports both space and tab delimiters):

hello	10000
world	8000
lysis	2000

Or space-separated:

hello 10000
world 8000
lysis 2000
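To make the format concrete, here is a minimal sketch of how such a file could be read in plain Ruby. The helper name parse_dictionary is hypothetical and not part of SpellKit, and treating frequencies as integers and silently skipping malformed lines are assumptions about the loader, not documented behaviour.

```ruby
# Parse dictionary text where each line is "term<whitespace>frequency".
# parse_dictionary is a hypothetical helper, not a SpellKit method.
def parse_dictionary(text)
  entries = {}
  text.each_line do |line|
    term, freq = line.split            # splits on any run of spaces or tabs
    count = Integer(freq, exception: false) if freq
    next if term.nil? || count.nil?    # skip blank or malformed lines (assumption)
    entries[term] = count
  end
  entries
end

DICTIONARY_SAMPLE = "hello\t10000\nworld 8000\n\nlysis   2000\n"
p parse_dictionary(DICTIONARY_SAMPLE)
```

Because String#split with no arguments splits on any whitespace, the same code handles both the tab-separated and space-separated layouts shown above.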

Protected Terms (optional)

One term per line. Terms are matched case-insensitively:

protected.txt

# Product codes
ABC-123
XYZ-999

# Technical terms
CDK10
BRCA1

# Brand names
MyBrand
SpecialTerm
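The case-insensitive matching described above can be pictured with a short plain-Ruby sketch. The helpers load_protected_terms and protected_term? are hypothetical, and treating lines that start with "#" as comments is an assumption drawn from the sample file rather than a documented rule.

```ruby
require "set"

# Build a lookup set from protected-terms text.
# Hypothetical helper; assumes "#" lines are comments and blanks are ignored.
def load_protected_terms(text)
  text.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?("#") }
      .map(&:downcase)
      .to_set
end

# Case-insensitive membership test: compare downcased forms only.
def protected_term?(terms, token)
  terms.include?(token.downcase)
end

PROTECTED_SAMPLE = "# Product codes\nABC-123\n\n# Brand names\nMyBrand\n"

terms = load_protected_terms(PROTECTED_SAMPLE)
puts protected_term?(terms, "mybrand")  # true
puts protected_term?(terms, "abc-124")  # false
```

Only the lookup key is downcased; the token itself is left untouched, which is how original casing can be preserved in the output.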

Dictionary Sources

SpellKit doesn't bundle dictionaries, but works with several sources:

English 80k word dictionary from SymSpell

SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)

Public Dictionary URLs

Build Your Own

See "Building Dictionaries" section below for creating domain-specific dictionaries.

Caching

Dictionaries downloaded from URLs are cached in ~/.cache/spellkit/ for faster subsequent loads.

Configuration

    SpellKit.load!(
      dictionary: "models/dictionary.tsv",      # required: path or URL
      protected_path: "models/protected.txt",   # optional
      protected_patterns: [/^[A-Z]{3,4}\d+$/],  # optional
      edit_distance: 1,                         # 1 (default) or 2
      frequency_threshold: 10.0,                # default: 10.0 (minimum frequency for corrections)

      # Skip pattern filters (all default to false)
      skip_urls: true,           # Skip URLs (http://, https://, www.)
      skip_emails: true,         # Skip email addresses
      skip_hostnames: true,      # Skip hostnames (example.com)
      skip_code_patterns: true,  # Skip code identifiers (camelCase, snake_case, etc.)
      skip_numbers: true         # Skip numeric patterns (versions, IDs, measurements)
    )

Frequency Threshold

The frequency_threshold parameter controls which corrections are accepted by correct and correct_tokens:

This prevents suggesting rare words as corrections for common typos.

Example:

    # With default threshold (10.0), suggest any correction with freq ≥ 10
    SpellKit.load!(dictionary: "dict.tsv")
    SpellKit.correct("helllo") # => "hello" (if freq ≥ 10)

    # With high threshold (1000.0), only suggest common corrections
    SpellKit.load!(dictionary: "dict.tsv", frequency_threshold: 1000.0)
    SpellKit.correct("helllo")  # => "hello" (if freq ≥ 1000)
    SpellKit.correct("rarword") # => "rarword" (no correction if freq < 1000)
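The effect of the threshold can be sketched in plain Ruby on top of a list of suggestion hashes. choose_correction is a hypothetical helper, and taking the first ranked suggestion that meets the threshold is an assumption about how the filter is applied, not a description of SpellKit's internals.

```ruby
# Pick a correction from ranked suggestions, honoring a minimum frequency.
# Hypothetical helper; suggestion hashes use the "term"/"distance"/"freq"
# keys shown in this README.
def choose_correction(word, suggestions, frequency_threshold: 10.0)
  best = suggestions.find { |s| s["freq"] >= frequency_threshold }
  best ? best["term"] : word   # fall back to the original word
end

RANKED = [{ "term" => "hello", "distance" => 1, "freq" => 500 }].freeze

puts choose_correction("helllo", RANKED)                               # hello
puts choose_correction("helllo", RANKED, frequency_threshold: 1000.0)  # helllo
```

Raising the threshold trades recall for safety: fewer typos get fixed, but rare dictionary entries are less likely to replace what the user typed.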

Skip Patterns

SpellKit can automatically skip certain patterns to avoid "correcting" technical terms, URLs, and other special content. Inspired by Aspell's filter modes, these patterns are automatically applied when configured.

Available skip patterns:

    SpellKit.load!(
      dictionary: "dict.tsv",
      skip_urls: true,           # Skip URLs: https://example.com, www.example.com
      skip_emails: true,         # Skip emails: user@domain.com, admin+tag@example.com
      skip_hostnames: true,      # Skip hostnames: example.com, api.example.com
      skip_code_patterns: true,  # Skip code: camelCase, snake_case, PascalCase, dotted.paths
      skip_numbers: true         # Skip numbers: 1.2.3, #123, 5kg, 100mb
    )

    # With skip patterns enabled, technical content is preserved
    SpellKit.correct("https://example.com") # => "https://example.com"
    SpellKit.correct("user@test.com")       # => "user@test.com"
    SpellKit.correct("getElementById")      # => "getElementById"
    SpellKit.correct("version-1.2.3")       # => "version-1.2.3"

    # Regular typos are still corrected
    SpellKit.correct("helllo") # => "hello"
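As a rough picture of what a skip filter does, the sketch below tests tokens against two illustrative regexes before any correction would run. The patterns and the skip_token? helper are assumptions written for this example; SpellKit's built-in patterns are not shown in this README and may match differently.

```ruby
# Illustrative approximations only, not SpellKit's actual skip patterns.
SKIP_SKETCH = {
  url:   %r{\A(?:https?://|www\.)\S+\z}i,
  email: /\A[^@\s]+@[^@\s]+\.[^@\s]+\z/
}.freeze

# A token is skipped (returned unchanged) if any pattern matches it.
def skip_token?(token)
  SKIP_SKETCH.any? { |_name, pattern| pattern.match?(token) }
end

%w[https://example.com www.example.com admin+tag@example.com helllo].each do |token|
  puts "#{token}: #{skip_token?(token) ? 'skip' : 'correct'}"
end
```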

What each skip pattern matches:

Combining with protected_patterns:

Skip patterns work alongside your custom protected_patterns:

    SpellKit.load!(
      dictionary: "dict.tsv",
      skip_urls: true,                       # Built-in URL skipping
      protected_patterns: [/^CUSTOM-\d+$/]   # Your custom patterns
    )

    # Both work together automatically
    SpellKit.correct("https://example.com") # => "https://example.com" (skip_urls)
    SpellKit.correct("CUSTOM-123")          # => "CUSTOM-123" (custom pattern)

API Reference

SpellKit.load!(**options)

Load or reload dictionaries. Thread-safe atomic swap. Accepts URLs (auto-downloads and caches) or local file paths.

Options:

Examples:

    # From URL (recommended for getting started)
    SpellKit.load!(dictionary: SpellKit::DEFAULT_DICTIONARY_URL)

    # With skip patterns for technical content
    SpellKit.load!(
      dictionary: SpellKit::DEFAULT_DICTIONARY_URL,
      skip_urls: true,
      skip_code_patterns: true
    )

    # From custom URL
    SpellKit.load!(dictionary: "https://example.com/dict.tsv")

    # From local file
    SpellKit.load!(dictionary: "/path/to/dictionary.tsv")
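The "thread-safe atomic swap" mentioned for load! is a general hot-reload pattern: build the new state off to the side, then replace a single reference under a lock. The sketch below shows that pattern in plain Ruby. ReloadableIndex is hypothetical and says nothing about how the Rust extension implements it.

```ruby
# Hot-reload sketch: prepare the new index first, then swap one reference.
# ReloadableIndex is a hypothetical class, not part of SpellKit.
class ReloadableIndex
  def initialize
    @lock = Mutex.new
    @index = {}.freeze
  end

  # Slow work (copying, freezing) happens outside the lock;
  # only the reference assignment is guarded.
  def load!(entries)
    fresh = entries.dup.freeze
    @lock.synchronize { @index = fresh }
    self
  end

  # Readers grab a snapshot, so they see the old or the new index, never a mix.
  def include?(term)
    snapshot = @lock.synchronize { @index }
    snapshot.key?(term)
  end
end

index = ReloadableIndex.new.load!({ "hello" => 10_000 })
puts index.include?("hello")   # true
index.load!({ "world" => 8_000 })
puts index.include?("hello")   # false
```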

SpellKit.correct?(word)

Check if a word is spelled correctly (exact dictionary match).

Parameters:

Returns: Boolean - true if word exists in dictionary, false otherwise

Performance: Very fast O(1) HashMap lookup. Use this instead of suggestions when you only need to check correctness.

Example:

    SpellKit.correct?("hello")  # => true
    SpellKit.correct?("helllo") # => false

SpellKit.suggestions(word, max = 5)

Get ranked suggestions for a word.

Parameters:

Returns: Array of hashes with "term", "distance", and "freq" keys

Example:

SpellKit.suggestions("helllo", 5)

=> [{"term"=>"hello", "distance"=>1, "freq"=>10000}, ...]
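To show where the "distance" value comes from and how a ranking might be produced, here is a small plain-Ruby sketch. It uses plain Levenshtein distance for brevity, whereas SymSpell implementations typically use Damerau-Levenshtein (transpositions count as one edit). Ordering by distance first and frequency second is the usual SymSpell convention and is assumed here. Both helper names are hypothetical.

```ruby
# Levenshtein edit distance with a rolling row (simplification: no transpositions).
def edit_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Rank candidates: closest distance first, then highest frequency (assumed order).
def rank_suggestions(word, dictionary, max = 5)
  dictionary
    .map { |term, freq| { "term" => term, "distance" => edit_distance(word, term), "freq" => freq } }
    .sort_by { |s| [s["distance"], -s["freq"]] }
    .first(max)
end

RANK_DICTIONARY = { "hello" => 10_000, "help" => 9_000, "hell" => 3_000 }.freeze
rank_suggestions("helllo", RANK_DICTIONARY, 2).each { |s| puts s.inspect }
```

This brute-force version scans the whole dictionary; it only illustrates the shape of the result, not how a SymSpell index finds candidates quickly.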

SpellKit.correct(word)

Return corrected word or original if no better match found. Respects frequency_threshold configuration. Protected terms and skip patterns are automatically applied when configured.

Parameters:

Behavior:

Example:

    SpellKit.correct("helllo") # => "hello"
    SpellKit.correct("hello")  # => "hello" (already correct)
    SpellKit.correct("CDK10")  # => "CDK10" (protected if configured)

SpellKit.correct_tokens(tokens)

Batch correction of an array of tokens. Respects frequency_threshold configuration. Protected terms and skip patterns are automatically applied when configured.

Parameters:

Returns: Array of corrected strings

SpellKit.stats

Get current state statistics.

Returns: Hash with:

SpellKit.healthcheck

Verify system is properly loaded. Raises error if not.

When configured, SpellKit automatically protects specific terms from correction:

Exact Matches

Terms in protected_path file are never corrected, even if similar dictionary words exist. Matching is case-insensitive, but original casing is preserved in output.

Pattern Matching

Terms matching any pattern in protected_patterns are protected. Patterns can be:

Examples

    # Protect specific terms
    protected_patterns: [
      /^[A-Z]{3,4}\d+$/,     # Gene symbols: CDK10, BRCA1
      /^\d{2,7}-\d{2}-\d$/,  # CAS numbers: 7732-18-5
      /^[A-Z]{2,3}-\d+$/     # Product codes: ABC-123
    ]
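Before handing patterns to SpellKit, it can help to check them against sample tokens in plain Ruby. The sketch below does that with the three patterns from this README; pattern_protected? is a hypothetical helper, and it shows ordinary Ruby Regexp matching rather than SpellKit's internal handling.

```ruby
EXAMPLE_PATTERNS = [
  /^[A-Z]{3,4}\d+$/,     # gene symbols
  /^\d{2,7}-\d{2}-\d$/,  # CAS numbers
  /^[A-Z]{2,3}-\d+$/     # product codes
].freeze

# True if any pattern matches the token (hypothetical helper).
def pattern_protected?(token, patterns = EXAMPLE_PATTERNS)
  patterns.any? { |re| re.match?(token) }
end

%w[CDK10 BRCA1 7732-18-5 ABC-123 helllo].each do |token|
  puts "#{token}: #{pattern_protected?(token)}"
end
```

In Ruby, ^ and $ anchor at line boundaries; for single tokens without newlines they behave the same as \A and \z.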

Rails Integration

    # config/initializers/spellkit.rb

    # Option 1: Use default dictionary (easiest)
    SpellKit.configure do |config|
      config.dictionary = SpellKit::DEFAULT_DICTIONARY_URL
    end

    # Option 2: Use local dictionary with full configuration
    SpellKit.configure do |config|
      config.dictionary = Rails.root.join("models/dictionary.tsv")
      config.protected_path = Rails.root.join("models/protected.txt")
      config.protected_patterns = [
        /^[A-Z]{3,4}\d+$/,    # Product codes
        /^\d{2,7}-\d{2}-\d$/  # Reference numbers
      ]
      config.edit_distance = 1
      config.frequency_threshold = 10.0
    end

    # Option 3: Multiple domain-specific instances
    # config/initializers/spellkit.rb
    module SpellCheckers
      MEDICAL = SpellKit::Checker.new.tap do |c|
        c.load!(
          dictionary: Rails.root.join("models/medical_dictionary.tsv"),
          protected_path: Rails.root.join("models/medical_terms.txt")
        )
      end

      LEGAL = SpellKit::Checker.new.tap do |c|
        c.load!(
          dictionary: Rails.root.join("models/legal_dictionary.tsv"),
          protected_path: Rails.root.join("models/legal_terms.txt")
        )
      end
    end

    # In your search preprocessing
    class SearchPreprocessor
      def self.correct_query(text)
        tokens = text.downcase.split(/\s+/)
        SpellKit.correct_tokens(tokens).join(" ")
      end
    end
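One interaction worth checking in a preprocessor like this: it downcases the query before correction, and an uppercase-only pattern such as the gene-symbol regex will not match a downcased token under ordinary Ruby Regexp rules. The sketch below demonstrates this in plain Ruby. Whether SpellKit applies protected_patterns with the same case sensitivity is an assumption to verify on your own setup; tokenize is a hypothetical helper mirroring the split used above.

```ruby
GENE_SYMBOL = /^[A-Z]{3,4}\d+$/            # pattern from the examples in this README
GENE_SYMBOL_ANY_CASE = /^[A-Z]{3,4}\d+$/i  # same pattern with the case-insensitive flag

# Same tokenization as SearchPreprocessor.correct_query.
def tokenize(text)
  text.downcase.split(/\s+/)
end

tokenize("Helllo CDK10").each do |token|
  strict = GENE_SYMBOL.match?(token)
  relaxed = GENE_SYMBOL_ANY_CASE.match?(token)
  puts "#{token}: strict=#{strict} any_case=#{relaxed}"
end
```

If the strict form fails for downcased input in practice, two options are to add the i flag to the pattern or to correct tokens before downcasing.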

Performance

SpellKit Standalone (M4 Max MacBook Pro, Ruby 3.3.0, 80k dictionary)

Single Word Suggestions:

Correction Performance:

Protection Performance:

Latency Distribution (10,000 iterations):

Raw Throughput: 16,192 ops/sec

Key Takeaways

  1. Consistent Performance: p95 latency of 66μs with 80k dictionary, p99 at 105μs
  2. Guards are Fast: Protected term checks improve performance by 3.2x by avoiding dictionary lookups
  3. High Throughput: Over 16k operations per second with 80k word dictionary
  4. Scales Well: Minimal performance difference between 1 vs 10 suggestions
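For readers who want to reproduce percentile figures like these on their own hardware, the following plain-Ruby sketch times a block repeatedly and reports nearest-rank percentiles. Both helpers are hypothetical and are not the project's benchmark script; the timed block here is a stand-in workload, to be replaced with a real call such as a suggestion lookup.

```ruby
# Nearest-rank percentile over an already sorted array.
def percentile(sorted, pct)
  return nil if sorted.empty?
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Run the block `iterations` times and return sorted latencies in microseconds.
def measure_latencies_us(iterations)
  Array.new(iterations) do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_microsecond)
    yield
    Process.clock_gettime(Process::CLOCK_MONOTONIC, :float_microsecond) - started
  end.sort
end

samples = measure_latencies_us(1_000) { "helllo".reverse }  # stand-in workload
[50, 95, 99].each do |pct|
  puts format("p%d: %.2f us", pct, percentile(samples, pct))
end
```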

Comparison with Aspell

SpellKit vs Aspell (M4 Max MacBook Pro, Ruby 3.3.0, 80k dictionary):

Suggestion Performance (13 misspelled words):

Spell Checking (correct? on 26 words):

Latency Distribution (10,000 single-word suggestions):

Both libraries provide high-quality spell checking, but SpellKit's SymSpell algorithm (O(1) lookup) offers significant performance advantages over Aspell's statistical approach, especially for high-throughput applications.
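The near-constant lookup cost comes from SymSpell's symmetric-delete idea: index every dictionary term under the strings produced by deleting up to N characters, then generate the same deletes for the input and look them up in a hash. The sketch below is a simplified plain-Ruby illustration of that idea, not SpellKit's Rust implementation. All three helper names are hypothetical.

```ruby
# All strings reachable from `word` by deleting up to `max_distance` characters.
def deletes(word, max_distance)
  results = [word]
  frontier = [word]
  max_distance.times do
    frontier = frontier.flat_map do |w|
      (0...w.length).map { |i| w[0...i] + w[(i + 1)..] }
    end.uniq
    results.concat(frontier)
  end
  results.uniq
end

# Index every delete variant of every dictionary term.
def build_delete_index(terms, max_distance = 1)
  terms.each_with_object(Hash.new { |h, k| h[k] = [] }) do |term, index|
    deletes(term, max_distance).each { |variant| index[variant] << term }
  end
end

# Candidates are dictionary terms that share a delete variant with the input.
def delete_candidates(index, word, max_distance = 1)
  deletes(word, max_distance).flat_map { |variant| index.fetch(variant, []) }.uniq
end

DELETE_INDEX = build_delete_index(%w[hello world help])
puts delete_candidates(DELETE_INDEX, "helllo").inspect
puts delete_candidates(DELETE_INDEX, "wrld").inspect
```

A full SymSpell implementation then verifies each candidate with a true edit-distance check, since two words sharing a delete variant can be further apart than the configured limit.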

Benchmarks

SpellKit includes comprehensive benchmarks to measure performance and compare with other spell checkers.

Running Benchmarks

Performance Benchmark - Comprehensive SpellKit performance analysis:

bundle exec ruby benchmark/performance.rb

Measures:

Aspell Comparison - Direct comparison with Aspell:

    # First install Aspell if needed:
    # macOS: brew install aspell
    # Ubuntu: sudo apt-get install aspell libaspell-dev

    bundle exec ruby benchmark/comparison_aspell.rb

Compares SpellKit with Aspell on:

See benchmark/README.md for detailed results and analysis.

Why These Benchmarks?

SpellKit vs Aspell: Both provide fuzzy matching and suggestions for misspelled words, but use different algorithms:

The comparison shows SpellKit's performance advantage while solving the same problem.

Building Dictionaries

Create your dictionary from your corpus:

    # example_builder.rb
    require "set"

    counts = Hash.new(0)

    # Read your corpus
    File.foreach("corpus.txt") do |line|
      line.downcase.split(/\W+/).each do |word|
        next if word.length < 3
        counts[word] += 1
      end
    end

    # Filter by minimum count and write, most frequent first
    min_count = 5
    File.open("dictionary.tsv", "w") do |f|
      counts.select { |_, count| count >= min_count }
            .sort_by { |_, count| -count }
            .each { |term, count| f.puts "#{term}\t#{count}" }
    end

Development

After checking out the repo:

    bundle install
    bundle exec rake compile
    bundle exec rake spec

To build the gem:

Platform Support

Pre-built gems available for:

Contributing

Bug reports and pull requests are welcome at https://github.com/scientist-labs/spellkit

License

MIT License - see LICENSE file for details.