Introduction to Stemming


Stemming is an important text-processing technique that reduces words to their base or root form by removing prefixes and suffixes. This standardizes words, which helps improve the efficiency and effectiveness of various natural language processing (NLP) tasks.

In NLP, stemming simplifies words to their most basic form, making it easier to analyze and process text. For example, "chocolates" becomes "chocolate" and "retrieval" becomes "retrieve". This is important in the early stages of NLP tasks where words are extracted from a document and tokenized (broken into individual words).

It helps in tasks such as text classification, information retrieval and text summarization by reducing words to a base form. While it is effective, it can sometimes introduce drawbacks including potential inaccuracies and a reduction in text readability.

Examples of stemming for the word "like": "likes", "liked", "likely" and "liking" all reduce to the common stem "like".
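
A quick way to see this in practice is to tokenize a sentence and stem each token. The sketch below uses a plain whitespace split in place of a real tokenizer (so it runs without downloading the 'punkt' resource) together with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A simple whitespace split stands in for nltk.word_tokenize here.
sentence = "I liked the movie and will likely keep liking similar movies"
tokens = sentence.split()

stems = [stemmer.stem(token) for token in tokens]
print(stems)
# 'liked', 'likely' and 'liking' all collapse to 'like', while 'movie'
# and 'movies' both become 'movi' -- a stem that is not a real word.
```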

Types of Stemmer in NLTK

Python's NLTK (Natural Language Toolkit) provides several stemming algorithms, each suited to different scenarios and languages. Let's look at some of the most commonly used stemmers:

**1. Porter's Stemmer**

Porter's Stemmer is one of the most popular and widely used stemming algorithms. Proposed in 1980 by Martin Porter, this stemmer works by applying a series of rules to remove common suffixes from English words. It is well-known for its simplicity, speed and reliability. However, the stemmed output is not guaranteed to be a meaningful word and its applications are limited to the English language.

**Example:** "connection", "connected" and "connecting" are all reduced to the stem "connect".

**Advantages:** Simple, fast and reliable; it remains a common baseline for English text.

**Limitations:** The stem is not guaranteed to be a valid English word, and the algorithm supports English only.

Now let's implement Porter's Stemmer in Python using the NLTK library.

```python
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

words = ["running", "jumps", "happily", "running", "happily"]

stemmed_words = [porter_stemmer.stem(word) for word in words]

print("Original words:", words)
print("Stemmed words:", stemmed_words)
```

**Output:**

Original words: ['running', 'jumps', 'happily', 'running', 'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']

**2. Snowball Stemmer**

The Snowball Stemmer is an improved version of the Porter Stemmer, also developed by Martin Porter. It is often referred to as Porter2 and is faster and somewhat more aggressive than its predecessor. A key advantage is that it supports multiple languages, making it a multilingual stemmer.

**Example:** "fairly" is stemmed to "fair" (the original Porter Stemmer leaves it as "fairli") and "quickly" becomes "quick".

**Advantages:** More consistent than Porter's Stemmer and supports many languages besides English (e.g. French, German, Spanish, Russian).

**Limitations:** Still purely rule-based, so it can produce stems that are not valid words and it ignores context.

Now let's implement the Snowball Stemmer in Python using the NLTK library.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language='english')

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)
```

**Output:**

Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happili', 'quick', 'fox']
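
Because Snowball is multilingual, the same class can be pointed at other languages. A minimal sketch using NLTK's SnowballStemmer, which lists its supported languages on the class itself (the French and German words below are just example inputs; their stems depend on each language's rule set):

```python
from nltk.stem import SnowballStemmer

# The class exposes the languages it ships with.
print(SnowballStemmer.languages)

# Stemmers for other languages are constructed the same way as for English.
french_stemmer = SnowballStemmer(language='french')
german_stemmer = SnowballStemmer(language='german')

print(french_stemmer.stem("continuellement"))  # a French adverb
print(german_stemmer.stem("laufend"))          # a German participle
```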

**3. Lancaster Stemmer**

The Lancaster Stemmer is known for being more aggressive and faster than the other stemmers. However, it is also more destructive and can produce excessively shortened stems. It applies a table of rules iteratively, and the rule set can also be supplied externally.

**Example:** "happily" is stemmed to "happy", while "organization" is cut down to "org", which shows how aggressive the truncation can be.

**Advantages:** Very fast and produces very short stems, which can shrink the vocabulary considerably.

**Limitations:** Often over-stems, producing stems that are too short to be meaningful or recognizable.

Now let's implement the Lancaster Stemmer in Python using the NLTK library.

```python
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)
```

**Output:**

Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']
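
To see how the three rule-based stemmers differ in aggressiveness, it helps to run them side by side on the same words. A small comparison sketch using the NLTK stemmers shown above (the comment describes the typical pattern; run it to see the exact stems):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')
lancaster = LancasterStemmer()

words = ["fairly", "happily", "organization", "arguing"]

# One row per word: original, Porter, Snowball, Lancaster.
for word in words:
    print(f"{word:15} {porter.stem(word):15} "
          f"{snowball.stem(word):15} {lancaster.stem(word):15}")

# Typical pattern: Porter keeps 'fairli', Snowball trims it to 'fair',
# and Lancaster cuts 'organization' all the way down to 'org'.
```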

**4. Regexp Stemmer**

The Regexp Stemmer or Regular Expression Stemmer is a flexible stemming algorithm that allows users to define custom rules using regular expressions (regex). This stemmer can be helpful for very specific tasks where predefined rules are necessary for stemming.

**Example:** With the rule r'ing$', "running" is stemmed to "runn", because only the literal suffix "ing" is removed.

**Advantages:** Gives complete control over what is stripped, which is useful for domain-specific suffixes.

**Limitations:** Has no linguistic knowledge; the quality of the stems depends entirely on the regular expressions you write.

Now let's implement the Regexp Stemmer in Python using the NLTK library.

```python
from nltk.stem import RegexpStemmer

custom_rule = r'ing$'
regexp_stemmer = RegexpStemmer(custom_rule)

word = 'running'
stemmed_word = regexp_stemmer.stem(word)

print(f'Original Word: {word}')
print(f'Stemmed Word: {stemmed_word}')
```

**Output:**

Original Word: running
Stemmed Word: runn
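
Multiple suffixes can be combined in a single pattern, and the stemmer's `min` argument keeps very short words from being mangled. A small sketch (the suffix list here is only an illustration, not a recommended rule set):

```python
from nltk.stem import RegexpStemmer

# Strip a handful of common suffixes, but only from words with at
# least 4 characters (controlled by the `min` argument).
stemmer = RegexpStemmer(r'ing$|s$|ed$|able$', min=4)

for word in ['cars', 'running', 'agreed', 'is']:
    print(word, '->', stemmer.stem(word))
# 'is' is left untouched because it is shorter than min=4.
```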

**5. Krovetz Stemmer**

The Krovetz Stemmer was developed by Robert Krovetz in 1993. It is designed to be more linguistically accurate and tends to preserve meaning better than the other stemmers because it checks candidate stems against a dictionary. Its steps include converting plural forms to singular, converting past tense to present tense and removing "-ing" endings.

**Example:** Converting plurals to singular maps "studies" to "study", and removing "-ing" maps "running" to "run", with a dictionary check ensuring the result is a real word.

**Advantages:** Produces valid, readable words and preserves meaning better than purely rule-based stemmers.

**Limitations:** Slower than rule-based stemmers, depends on the coverage of its dictionary and is less aggressive at reducing vocabulary size.

**Note:** The Krovetz Stemmer is not natively available in the NLTK library, unlike stemmers such as Porter, Snowball or Lancaster.
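
Since NLTK does not ship a Krovetz implementation, the sketch below only illustrates the core idea behind it: strip a suffix, then keep the result only if it is a known word. The tiny `known_words` set and the single "-ing" rule are illustrative assumptions, not the real KSTEM algorithm or its lexicon.

```python
# A toy illustration of dictionary-validated stemming (the idea behind
# Krovetz), NOT the actual KSTEM algorithm or its dictionary.
known_words = {"run", "jump", "sing", "study"}   # stand-in dictionary

def krovetz_like_stem(word: str) -> str:
    if word.endswith("ing") and len(word) > 4:
        candidate = word[:-3]
        # Undo consonant doubling, e.g. 'running' -> 'runn' -> 'run'.
        if candidate not in known_words and candidate[-1] == candidate[-2]:
            candidate = candidate[:-1]
        # Only accept the stem if it is a real word; otherwise back off.
        if candidate in known_words:
            return candidate
    return word

for w in ["running", "jumping", "sing", "studying"]:
    print(w, "->", krovetz_like_stem(w))
# 'sing' survives thanks to the length guard, so short real words
# that happen to end in 'ing' are not damaged.
```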

Stemming vs. Lemmatization

Let's look at the differences between stemming and lemmatization side by side:

| Aspect | Stemming | Lemmatization |
|---|---|---|
| Definition | Reduces words to their root form, often producing non-valid words | Reduces words to their base form (lemma), which is always a valid word |
| Approach | Uses simple rules or algorithms | Uses vocabulary and linguistic analysis |
| Context awareness | Does not consider context | Considers context and part of speech |
| Output validity | May produce invalid words | Always produces valid words |
| Accuracy | Less accurate | More accurate |
| Speed | Faster | Slower compared to stemming |
| Example | "Better" → "bet" | "Better" → "good" |
| Use case | Quick text preprocessing | NLP tasks needing meaning preservation |
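
The "better" row can be reproduced with NLTK: the Lancaster stemmer truncates the word, while the WordNet lemmatizer (which needs the WordNet corpus downloaded and the part of speech supplied) maps it to its lemma:

```python
from nltk.stem import LancasterStemmer, WordNetLemmatizer

# Requires: nltk.download('wordnet') the first time it is used.
stemmer = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

word = "better"
print("Stemmed:", stemmer.stem(word))                      # 'bet'
print("Lemmatized:", lemmatizer.lemmatize(word, pos="a"))  # 'good' (adjective)
```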

Applications of Stemming

Stemming plays an important role in many NLP tasks. Some of its key applications include:

  1. **Information Retrieval:** It is used in search engines to improve the accuracy of search results. By reducing words to their root form, it ensures that documents containing different word forms such as "run", "running" and "runner" are grouped together.
  2. **Text Classification:** In text classification, it reduces the feature space by consolidating variations of a word into a single representation, which can improve the performance of machine learning algorithms (see the sketch after this list).
  3. **Document Clustering:** It helps group similar documents by normalizing word forms, making it easier to identify patterns across large text corpora.
  4. **Sentiment Analysis:** Before sentiment analysis, it is used to preprocess reviews and comments, allowing the system to analyze sentiment based on root words and handle word variations more consistently.
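
The feature-space reduction mentioned in the text-classification point can be made concrete by counting vocabulary sizes before and after stemming on a toy corpus (the three example documents below are made up purely for illustration):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A tiny made-up corpus: each document is a list of lowercase tokens.
docs = [
    "the runner was running and runs daily".split(),
    "she likes running shoes and liked the run".split(),
    "they run because running is liked by runners".split(),
]

raw_vocab = {token for doc in docs for token in doc}
stemmed_vocab = {stemmer.stem(token) for doc in docs for token in doc}

print("Vocabulary size before stemming:", len(raw_vocab))
print("Vocabulary size after stemming:", len(stemmed_vocab))
# Fewer distinct features means a smaller, denser input for a classifier.
```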

Challenges in Stemming

While stemming is beneficial, it also comes with some challenges:

  1. **Over-Stemming:** Occurs when words are reduced too aggressively, leading to a loss of meaning. For example, "arguing" becomes "argu", which is harder to interpret.
  2. **Under-Stemming:** Occurs when related words are not reduced to a common base form, causing inconsistencies. For example, "argument" and "arguing" may not share a stem.
  3. **Loss of Meaning:** Stemming ignores context, which can lead to incorrect interpretations in tasks like sentiment analysis.
  4. **Choosing the Right Stemmer:** Different stemmers produce different results, so the choice requires careful selection and testing for the task at hand.

These challenges can be solved by fine-tuning the stemming process or using lemmatization when necessary.
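
Both over- and under-stemming can be observed directly with NLTK's Porter stemmer: "arguing" is cut down to the non-word "argu", while "argument" keeps its full form, so the two related words never meet at a common stem:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem("arguing"))   # 'argu'     -- over-stemmed to a non-word
print(stemmer.stem("argument"))  # 'argument' -- left untouched
# Related words end up with different stems, so a search for one
# would not match documents containing the other.
```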

Advantages of Stemming

Stemming provides various benefits which are as follows:

  1. **Text Normalization:** By reducing words to their root form, it normalizes text, making it easier to analyze and process.
  2. **Improved Efficiency:** It reduces the dimensionality of text data, which can improve the performance of machine learning algorithms.
  3. **Information Retrieval:** It enhances search engine performance by ensuring that variations of the same word are treated as the same entity.
  4. **Facilitates Language Processing:** It simplifies text by collapsing word variations, which makes it easier to process and analyze large text datasets.
