The Rise of Small Language Models (SLMs)
This article has been updated from when it was originally published on February 14, 2024.
The impressive power of large language models (LLMs) has evolved substantially over the last couple of years. These versatile AI-powered tools are deep learning artificial neural networks trained on massive datasets, capable of leveraging billions of parameters (or machine learning variables) to perform various natural language processing (NLP) tasks.
These tasks run the gamut from generating, analyzing and classifying text, to producing rather convincing images from a text prompt, translating content into different languages, and powering chatbots that can hold human-like conversations. Well-known LLMs include proprietary models like OpenAI’s GPT-4, as well as a growing roster of open source contenders like Meta’s LLaMA.
But despite their considerable capabilities, LLMs present some significant disadvantages. Their sheer size often means they require hefty computational resources and energy to run, which can put them out of reach of smaller organizations that lack the deep pockets to bankroll such operations. Larger models also carry the risk of algorithmic bias introduced via datasets that are not sufficiently diverse, leading to faulty or inaccurate outputs, including the dreaded “hallucination,” as fabricated answers are called in the industry.
What Are Small Language Models?
Issues like these are among the factors behind the recent rise of small language models, or SLMs.
Small language models are slimmed-down versions of their larger cousins. For smaller enterprises with tighter budgets, SLMs are becoming a more attractive option: they are generally easier to train, fine-tune and deploy, and cheaper to run.
How Small Language Models Stack Up Next to LLMs
Small language models are essentially more streamlined versions of LLMs, with smaller neural networks and simpler architectures.
Compared to LLMs, SLMs have fewer parameters and don’t need as much data or time to be trained: think minutes or a few hours of training time, versus the many hours or even days needed to train an LLM. Because of their smaller size, SLMs are generally more efficient and more straightforward to implement on-site or on smaller devices.
How Small Language Models Work
Similar to their larger cousins, small language models utilize a type of deep learning neural network architecture known as the transformer model. Introduced by Google researchers back in 2017 via a paper titled “Attention Is All You Need,” transformers have revolutionized natural language processing (NLP) during the last few years, paving the way for the generative pre-trained transformers (GPTs) that underlie some of today’s most massive and powerful large language models.
Generally, these are the basic building blocks of the transformer model architecture:
- Encoder: This component transforms input tokens into a numerical representation called an embedding, which captures the context of each token relative to the entire sequence.
- Self-attention mechanism: This part gives the model the ability to “focus” on the most important parts of a sequence. It allows the model to weigh the relative importance of different parts of an input sequence, and to dynamically alter their influence on the resulting output, depending on the context.
- Decoder: This element uses the embeddings created by the encoder, along with the self-attention mechanism, to generate an output.
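The self-attention step described above can be sketched in a few lines of NumPy. This is a minimal single-head example of scaled dot-product attention; the toy dimensions and random projection matrices are purely illustrative, not taken from any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:  (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_k) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)        # each row is a distribution over tokens
    return weights @ V, weights               # output mixes values by attention weight

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # (4, 8)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Each output token is a weighted mixture of all the value vectors, which is exactly how the mechanism lets every position attend to every other position in the sequence.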
How Small Language Models Are Created
Small language models are typically derived from large language models using an approach called model compression, which yields smaller models that are more resource-efficient and performant, yet still relatively accurate.
Some techniques of model compression include:
- Knowledge distillation: Think of this technique as having the LLM function as a “teacher” that condenses and transfers its learned knowledge into a smaller “student” model. The result is a smaller language model that retains much of the accuracy and reasoning capability of its larger “teacher,” but without the computational cost of running the larger model.
- Pruning: Like pruning a plant so that it grows optimally, this method trims back any redundant parameters that aren’t crucial to performance, thus reducing the model size. However, pruned models will likely need to be fine-tuned afterward in order to compensate for any lost accuracy.
- Quantization: This technique shrinks a model by using fewer bits to store the model’s data, converting high-precision values into lower-precision ones. For example, weights can be stored as 8-bit values rather than 32-bit values. With this conversion, models become smaller and run faster (especially on smaller devices), usually with only a minor impact on accuracy. Quantization can be done either during model training (quantization-aware training) or after training (post-training quantization).
- Low-rank factorization: This method identifies redundant parameters of a deep neural network by “decomposing” a larger matrix of weights into smaller ones, thus simplifying the model’s operations. This reduces the size of the model so that it runs faster, but the factorization itself can require significant computational resources to perform. Additionally, fine-tuning is often required to make up for any loss in accuracy. Low-rank factorization can be done during training, which can help reduce training time, or after training.
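To make the quantization technique above concrete, here is a minimal sketch of symmetric post-training quantization of a weight matrix from 32-bit floats to 8-bit integers. This is a toy per-tensor scheme for illustration only; production quantizers typically work per-channel and handle activations as well:

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(weights).max() / 127.0           # map largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float weights
    return q.astype(np.float32) * scale

rng = np.random.default_rng(42)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.dtype)               # int8
print(w.nbytes // q.nbytes)  # 4: int8 storage is a quarter the size of float32
print(bool(np.abs(w - w_hat).max() <= scale))  # True: error bounded by one step
```

The memory saving comes directly from the narrower dtype, and the reconstruction error is bounded by the quantization step size, which is why accuracy typically degrades only slightly.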
Benefits and Limitations of Small Language Models
- Practical and easier to customize: Because SLMs can be tailored to narrower, more specific applications, they are more practical for companies that need a language model trained on a more limited dataset and fine-tuned for a particular domain.
- Enhanced security and privacy: Additionally, SLMs can be customized to meet an organization’s specific requirements for security and privacy. Thanks to their smaller codebases, the relative simplicity of SLMs also reduces their vulnerability to malicious attacks by presenting a smaller attack surface.
- Potential for reduced performance: On the flip side, the increased efficiency and agility of SLMs may translate to slightly reduced language processing abilities, depending on the benchmarks the model is being measured against.
Examples of Small Language Models
Nevertheless, despite these potential limitations, some SLMs, like Microsoft’s recently introduced 2.7 billion-parameter Phi-2, demonstrate performance in mathematical reasoning, common sense, language understanding and logical reasoning that is remarkably comparable to, and in some cases exceeds, that of much heftier LLMs. According to Microsoft, the efficiency of the transformer-based Phi-2 makes it an ideal choice for researchers who want to improve the safety, interpretability and ethical development of AI models.
Other SLMs of note include:
- DistilBERT: a lighter and faster version of Google’s BERT (Bidirectional Encoder Representations from Transformers), the pioneering deep learning NLP model introduced back in 2018. There are also Tiny, Mini, Small and Medium versions of BERT, which are scaled down and optimized for varying constraints, ranging in size from 4.4 million parameters in the Tiny version to 41 million in the Medium version. There is also MobileBERT, a version designed for mobile devices.
- Orca 2: Developed by Microsoft by fine-tuning Meta’s LLaMA 2 on synthetic data generated from a statistical model rather than drawn from real life. The result is enhanced reasoning ability, with performance in reasoning, reading comprehension, math problem solving and text summarization that can surpass that of models up to ten times larger.
- GPT-Neo and GPT-J: With 125 million and 6 billion parameters respectively, these models were designed by the open source AI research consortium EleutherAI as smaller, open source alternatives to OpenAI’s GPT models. These SLMs can be run on cheaper cloud computing resources, such as those from CoreWeave and the TensorFlow Research Cloud.
Use Cases for Small Language Models
Because of their smaller size and reduced computational and operational cost, businesses and institutions can more easily fine-tune and tailor small language models to a specific use.
For instance, SLMs could be used as chatbots to offer timely customer service, or to summarize content or create calendar events for users. These smaller models could also translate foreign languages in real time, generate programming code, or monitor and perform preventive maintenance on devices linked to the Internet of Things (IoT). Within automotive systems, SLMs can go a long way toward offering real-time traffic updates for smarter road navigation, or improving voice commands and hands-free calling.
The Future Ahead for Small Language Models
Ultimately, the emergence of small language models signals a potential shift from expensive and resource-heavy LLMs to more streamlined and efficient language models, arguably making it easier for more businesses and organizations to adopt and tailor generative AI technology to their specific needs. As language models evolve to become more versatile and powerful, it seems that going small may be the best way to go.