HuggingFaceTB (Hugging Face Smol Models Research)
AI & ML interests
Exploring smol models (for text, vision, and video) and high-quality web and synthetic datasets
Hugging Face Smol Models Research
This is the home for smol models (SmolLM & SmolVLM) and high-quality pre-training datasets. We released:
News 🗞️
- The Smol Training Playbook: a comprehensive guide to training world-class LLMs (HuggingFaceTB/smol-training-playbook)
Past releases
- FineWeb-Edu: a filtered version of the FineWeb dataset focused on educational content; paper available here.
- Cosmopedia: the largest open synthetic dataset, with 25B tokens and 30M samples of synthetic textbooks, blog posts, and stories generated by Mixtral. Blog post available here.
- SmolLM-Corpus: the pre-training corpus of SmolLM, combining Cosmopedia v0.2, a deduplicated FineWeb-Edu, and Python-Edu. Blog post available here.
- FineMath: the best public math pretraining dataset, with 50B tokens of mathematical and problem-solving data.
- Stack-Edu: the best open code pretraining dataset, with educational code in 15 programming languages (a hedged loading sketch for these corpora follows this list).
- SmolLM2 models: a series of strong small models in three sizes: 135M, 360M, and 1.7B parameters.
- SmolVLM2: a family of small video and vision models in three sizes: 2.2B, 500M, and 256M. Blog post available here.
- SmolLM3: a SOTA 3B model with dual reasoning modes, support for 6 languages, long context, and strong function calling (see the inference sketch at the end of this page). The SmolLM3 Engineering Blueprint is available here.
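Below is a minimal sketch of streaming these corpora with the 🤗 Datasets library. The repo IDs follow the Hub naming above, but the config names ("cosmopedia-v2", "finemath-4plus") are assumptions; check each dataset card for the exact subsets.

```python
from datasets import load_dataset

# FineWeb-Edu is hosted under the HuggingFaceFW org.
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# SmolLM-Corpus has several subsets; "cosmopedia-v2" is an assumed config name.
smollm_corpus = load_dataset(
    "HuggingFaceTB/smollm-corpus", "cosmopedia-v2", split="train", streaming=True
)

# FineMath; "finemath-4plus" is an assumed config name.
finemath = load_dataset(
    "HuggingFaceTB/finemath", "finemath-4plus", split="train", streaming=True
)

# Peek at one document from each stream without downloading the full corpus.
for name, ds in [
    ("fineweb-edu", fineweb_edu),
    ("smollm-corpus", smollm_corpus),
    ("finemath", finemath),
]:
    sample = next(iter(ds))
    print(name, "->", sample["text"][:120].replace("\n", " "))
```

Streaming avoids materializing hundreds of gigabytes on disk, which matters for pre-training-scale corpora like these.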
🧠 SmolLM3: a smol, multilingual, long-context reasoner
- HuggingFaceTB/SmolLM3-3B (text generation, 3B)
- HuggingFaceTB/SmolLM3-3B-Base (text generation, 3B)
- ggml-org/SmolLM3-3B-GGUF (3B, GGUF conversion)
- HuggingFaceTB/SmolLM3-3B-ONNX (text generation, ONNX export)
SmolLM3 pretraining datasets: the collection of datasets used to pretrain SmolLM3.
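And here is a minimal inference sketch for SmolLM3-3B with 🤗 Transformers. The model ID comes from the collection above; the "/no_think" system-prompt toggle for switching off the extended-reasoning mode is an assumption, so check the model card for the exact control.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    # "/no_think" is assumed to disable extended thinking; by default the
    # model reasons step by step before answering.
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Give me a one-sentence summary of FineWeb-Edu."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

For GGUF or ONNX deployments, swap in the converted checkpoints listed in the collection above.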