GitHub - sumanpokhrel-11/symspell-c99: Fast spell-checking library - Pure C99 implementation of SymSpell algorithm (original) (raw)

A blazingly fast spell-checking library - Pure C99 implementation of Wolf Garbe's SymSpell algorithm.

License: MIT Language POSIX

By Suman Pokhrel


Why SymSpell?

SymSpell is 1 million times faster than traditional spell-checkers. Instead of generating all possible corrections for a misspelling, it pre-computes deletions of dictionary words, enabling lightning-fast lookups.

Original algorithm by Wolf Garbe


Features


Quick Start

make ./test_symspell dictionaries/dictionary.txt

Interactive mode:

Enter words to correct (or 'quit' to exit):
recieve
  Suggestions:
    receive (distance=1, iwf=7.234567, prob=0.001234, freq=234567)

Usage

#include "symspell.h"

// Create dictionary with max edit distance=2, prefix length=7 symspell_dict_t* dict = symspell_create(2, 7);

// Load dictionary (86k words optimized for spell-checking) symspell_load_dictionary(dict, "dictionaries/dictionary.txt", 0, 1);

// Lookup suggestions symspell_suggestion_t suggestions[5]; int count = symspell_lookup(dict, "speling", 2, suggestions, 5);

// Print results for (int i = 0; i < count; i++) { printf("%s (distance=%d)\n", suggestions[i].term, suggestions[i].distance); }

// Cleanup symspell_destroy(dict);


Performance

Average lookup time:     ~5ยตs (real-world usage)*
Misspelling correction:  ~30ยตs (worst case)
Correctly spelled:       ~0.7ยตs (fast path)
Correction accuracy:     82-84%
Dictionary size:         86,060 words
Memory usage:            ~45MB (with deletes index)

*Based on 15% real-world error rate in user-typed text (2-3% error per character ร— 4.7 average characters per word)

Tested on Apple M4, comparable results on x86.


Building

Requirements:

Compile:

make # Build test and benchmark programs make test # Build and run tests make benchmark # Build and run benchmark make clean # Remove built programs

Manual compilation:

gcc -std=c99 -O2 -Iinclude test/test_symspell.c src/symspell.c -o test_symspell


Understanding Results

Edit Distance Matters

SymSpell returns suggestions based on edit distance (number of character changes needed):

"helo" with max_distance=1

$ ./test_symspell dictionaries/dictionary.txt

helo held (distance=1) # One substitution: oโ†’d

"helo" with max_distance=2

$ ./test_symspell dictionaries/dictionary.txt --max-distance 2

helo hello (distance=2) # Two insertions: +l, +o held (distance=1) # Still shown as closer match

The algorithm prefers closer matches. If you want "hello" as a suggestion, increase max_edit_distance when calling symspell_lookup().

Why Not 100% Accuracy?

Real-world typos are ambiguous:

Even commercial spell-checkers (Microsoft Word, Google Docs) achieve 80-85% accuracy on standardized test sets because:

  1. Language is inherently ambiguous
  2. Context is needed for many corrections
  3. Trade-offs exist between precision and recall

Our 82-84% matches the original SymSpell paper and is competitive with commercial solutions.

Performance Context

"5ยตs average" means:

For most applications, this is more than fast enough.


Dictionaries

Main dictionary (dictionaries/dictionary.txt): 86,060 words

Extras dictionary (dictionaries/dictionary-extras.txt): 25,327 words

**Building custom dictionaries:**See dictionaries/reference/ for source materials and tools. Full documentation in docs/DICTIONARY.md.


Project Structure

symspell-c99/
โ”œโ”€โ”€ src/
โ”‚   โ””โ”€โ”€ symspell.c          # Implementation (~700 lines)
โ”œโ”€โ”€ include/
โ”‚   โ”œโ”€โ”€ symspell.h          # Public API
โ”‚   โ”œโ”€โ”€ hash.h              # Hash table implementation
โ”‚   โ”œโ”€โ”€ xxh3.h              # xxHash for fast hashing
โ”‚   โ””โ”€โ”€ posix.h             # POSIX compatibility layer
โ”œโ”€โ”€ test/
โ”‚   โ”œโ”€โ”€ test_symspell.c     # Interactive test program
โ”‚   โ”œโ”€โ”€ benchmark_symspell.c # Performance benchmarks
โ”‚   โ””โ”€โ”€ data/               # Test datasets
โ”œโ”€โ”€ dictionaries/
โ”‚   โ”œโ”€โ”€ dictionary.txt      # Main 86k word dictionary
โ”‚   โ””โ”€โ”€ reference/          # Dictionary building tools
โ””โ”€โ”€ docs/
    โ”œโ”€โ”€ SYMSPELL.md         # Algorithm details
    โ””โ”€โ”€ DICTIONARY.md       # Dictionary customization

Documentation


Algorithm

SymSpell uses Symmetric Delete spelling correction:

  1. Pre-computation: Generate all deletions up to edit distance N for each dictionary word
  2. Lookup: Generate deletions of the input word
  3. Match: Find dictionary words that share these deletions
  4. Rank: Sort by edit distance and word frequency

This trades memory for speed - pre-computing deletions makes lookups extremely fast.


Testing

Run the test program:

./test_symspell dictionaries/dictionary.txt

Run benchmarks:

./benchmark_symspell dictionaries/dictionary.txt test/data/symspell/misspellings/misspell-codespell.txt

Benchmark Test Datasets

The project includes standardized test datasets for accuracy and performance benchmarks. All test sets are located in test/data/symspell/misspellings/:

Test Files and Sources:

These test sets provide comprehensive coverage across different domains:

Accuracy Results: 82-84% correction rate across all test sets, matching the original SymSpell paper.


FAQ

Q: Why does "helo" suggest "held" instead of "hello"?

A: "held" has edit distance 1 (one character change), while "hello" has edit distance 2 (two insertions). SymSpell prefers closer matches. Increase max_edit_distance to get more suggestions.

Q: Why only 82-84% accuracy?

A: This is competitive with commercial spell-checkers and matches the original SymSpell paper. Perfect accuracy is impossible due to language ambiguity.

Q: Compilation fails with undefined reference to log

A: Add -lm flag to link the math library:

gcc -std=c99 -O2 -Iinclude test/test_symspell.c src/symspell.c -lm -o test_symspell


License

MIT License - See LICENSE file for details.

This implementation is independent but follows the algorithm described in:


Credits

Special thanks to Dr. James Freeman for his mentorship and guidance on this project.


Author

Suman Pokhrel


Contributing

This is the first pure C99 implementation of SymSpell. Contributions welcome:

Please open an issue or pull request on GitHub!


See Also


This is the first and only pure C99 implementation of SymSpell, designed for performance, portability, and clean code.