GitHub - sumanpokhrel-11/symspell-c99: Fast spell-checking library - Pure C99 implementation of SymSpell algorithm (original) (raw)
A blazingly fast spell-checking library - Pure C99 implementation of Wolf Garbe's SymSpell algorithm.
By Suman Pokhrel
Why SymSpell?
SymSpell is 1 million times faster than traditional spell-checkers. Instead of generating all possible corrections for a misspelling, it pre-computes deletions of dictionary words, enabling lightning-fast lookups.
Original algorithm by Wolf Garbe
Features
- โก Fast: Average lookup time ~5ยตs (real-world usage)*
- ๐ฏ Accurate: 82-84% correction rate on standard test sets
- ๐งน Clean: Zero compiler warnings, ~700 lines of well-structured code
- ๐ Portable: POSIX-compliant, works on any Unix system
- โ Complete: Includes tests, benchmarks, and dictionary tools
- ๐ฆ Self-contained: No external dependencies
Quick Start
make ./test_symspell dictionaries/dictionary.txt
Interactive mode:
Enter words to correct (or 'quit' to exit):
recieve
Suggestions:
receive (distance=1, iwf=7.234567, prob=0.001234, freq=234567)
Usage
#include "symspell.h"
// Create dictionary with max edit distance=2, prefix length=7 symspell_dict_t* dict = symspell_create(2, 7);
// Load dictionary (86k words optimized for spell-checking) symspell_load_dictionary(dict, "dictionaries/dictionary.txt", 0, 1);
// Lookup suggestions symspell_suggestion_t suggestions[5]; int count = symspell_lookup(dict, "speling", 2, suggestions, 5);
// Print results for (int i = 0; i < count; i++) { printf("%s (distance=%d)\n", suggestions[i].term, suggestions[i].distance); }
// Cleanup symspell_destroy(dict);
Performance
Average lookup time: ~5ยตs (real-world usage)*
Misspelling correction: ~30ยตs (worst case)
Correctly spelled: ~0.7ยตs (fast path)
Correction accuracy: 82-84%
Dictionary size: 86,060 words
Memory usage: ~45MB (with deletes index)
*Based on 15% real-world error rate in user-typed text (2-3% error per character ร 4.7 average characters per word)
Tested on Apple M4, comparable results on x86.
Building
Requirements:
- C99 compiler (GCC, Clang)
- Make
- POSIX-compliant system (Linux, macOS, BSD)
Compile:
make # Build test and benchmark programs make test # Build and run tests make benchmark # Build and run benchmark make clean # Remove built programs
Manual compilation:
gcc -std=c99 -O2 -Iinclude test/test_symspell.c src/symspell.c -o test_symspell
Understanding Results
Edit Distance Matters
SymSpell returns suggestions based on edit distance (number of character changes needed):
"helo" with max_distance=1
$ ./test_symspell dictionaries/dictionary.txt
helo held (distance=1) # One substitution: oโd
"helo" with max_distance=2
$ ./test_symspell dictionaries/dictionary.txt --max-distance 2
helo hello (distance=2) # Two insertions: +l, +o held (distance=1) # Still shown as closer match
The algorithm prefers closer matches. If you want "hello" as a suggestion, increase max_edit_distance when calling symspell_lookup().
Why Not 100% Accuracy?
Real-world typos are ambiguous:
- "helo" โ "held" or "hello"?
- "teh" โ "the" or "tea"?
- "there" โ correct word or "their"?
Even commercial spell-checkers (Microsoft Word, Google Docs) achieve 80-85% accuracy on standardized test sets because:
- Language is inherently ambiguous
- Context is needed for many corrections
- Trade-offs exist between precision and recall
Our 82-84% matches the original SymSpell paper and is competitive with commercial solutions.
Performance Context
"5ยตs average" means:
- โ 200x faster than Python implementations (~1000ยตs)
- โ 2000x faster than traditional approaches (~10000ยตs)
- โ Can check 1000 words in 5ms (imperceptible to users)
- โ Real-time spell-checking with zero UI lag
For most applications, this is more than fast enough.
Dictionaries
Main dictionary (dictionaries/dictionary.txt): 86,060 words
- Optimized for spell-checking accuracy (82-84%)
- Balanced across multiple test sets
- Recommended for general use
Extras dictionary (dictionaries/dictionary-extras.txt): 25,327 words
- Technical terms and acronyms
- For keyword extraction and IWF scoring
- DO NOT use for spell-checking (reduces accuracy)
**Building custom dictionaries:**See dictionaries/reference/ for source materials and tools. Full documentation in docs/DICTIONARY.md.
Project Structure
symspell-c99/
โโโ src/
โ โโโ symspell.c # Implementation (~700 lines)
โโโ include/
โ โโโ symspell.h # Public API
โ โโโ hash.h # Hash table implementation
โ โโโ xxh3.h # xxHash for fast hashing
โ โโโ posix.h # POSIX compatibility layer
โโโ test/
โ โโโ test_symspell.c # Interactive test program
โ โโโ benchmark_symspell.c # Performance benchmarks
โ โโโ data/ # Test datasets
โโโ dictionaries/
โ โโโ dictionary.txt # Main 86k word dictionary
โ โโโ reference/ # Dictionary building tools
โโโ docs/
โโโ SYMSPELL.md # Algorithm details
โโโ DICTIONARY.md # Dictionary customization
Documentation
- SYMSPELL.md - Algorithm explanation and implementation details
- DICTIONARY.md - Dictionary building and customization guide
- dictionaries/README.md - Dictionary overview
Algorithm
SymSpell uses Symmetric Delete spelling correction:
- Pre-computation: Generate all deletions up to edit distance N for each dictionary word
- Lookup: Generate deletions of the input word
- Match: Find dictionary words that share these deletions
- Rank: Sort by edit distance and word frequency
This trades memory for speed - pre-computing deletions makes lookups extremely fast.
Testing
Run the test program:
./test_symspell dictionaries/dictionary.txt
Run benchmarks:
./benchmark_symspell dictionaries/dictionary.txt test/data/symspell/misspellings/misspell-codespell.txt
Benchmark Test Datasets
The project includes standardized test datasets for accuracy and performance benchmarks. All test sets are located in test/data/symspell/misspellings/:
Test Files and Sources:
misspell-codespell.txt- From Codespell Project
Common misspellings found in source code and documentation (63,704 entries)misspell-wikipedia.txt- From Wikipedia: Lists of Common Misspellings/For Machines
Community-sourced misspellings from Wikipedia editors (4,305 entries)misspell-words.go.txt- From client9/misspell words.go
Go-based spelling corrector dataset, derived from Wikipedia + additional curated errors (31,139 entries)misspell-microsoft.txt- From Microsoft Research Spelling-Correction Data
Real human typing errors captured from keystroke logs via Amazon Mechanical Turk (23,031 entries)
These test sets provide comprehensive coverage across different domains:
- Technical/code (Codespell)
- General language (Wikipedia)
- Programming context (client9/misspell)
- Real typing errors (Microsoft Research)
Accuracy Results: 82-84% correction rate across all test sets, matching the original SymSpell paper.
FAQ
Q: Why does "helo" suggest "held" instead of "hello"?
A: "held" has edit distance 1 (one character change), while "hello" has edit distance 2 (two insertions). SymSpell prefers closer matches. Increase max_edit_distance to get more suggestions.
Q: Why only 82-84% accuracy?
A: This is competitive with commercial spell-checkers and matches the original SymSpell paper. Perfect accuracy is impossible due to language ambiguity.
Q: Compilation fails with undefined reference to log
A: Add -lm flag to link the math library:
gcc -std=c99 -O2 -Iinclude test/test_symspell.c src/symspell.c -lm -o test_symspell
License
MIT License - See LICENSE file for details.
This implementation is independent but follows the algorithm described in:
- SymSpell by Wolf Garbe (MIT License)
Credits
- Algorithm: Wolf Garbe - Creator of SymSpell
- Implementation: Suman Pokhrel - Pure C99 port
Special thanks to Dr. James Freeman for his mentorship and guidance on this project.
Author
Suman Pokhrel
- GitHub: @sumanpokhrel-11
- Email: 11pksuman@gmail.com
Contributing
This is the first pure C99 implementation of SymSpell. Contributions welcome:
- Bug reports and fixes
- Performance improvements
- Dictionary enhancements
- Documentation improvements
- Platform-specific optimizations
Please open an issue or pull request on GitHub!
See Also
- Original SymSpell (C#) - Wolf Garbe's original implementation
- SymSpell Ports - Implementations in other languages
This is the first and only pure C99 implementation of SymSpell, designed for performance, portability, and clean code.