Introduction to Stemming


Stemming is an important text-processing technique that reduces words to their base or root form by removing prefixes and suffixes. This standardizes words, which helps improve the efficiency and effectiveness of various natural language processing (NLP) tasks.

In NLP, stemming simplifies words to their most basic form, making it easier to analyze and process text. For example, "chocolates" becomes "chocolate" and "retrieval" becomes "retrieve". This is important in the early stages of NLP tasks where words are extracted from a document and tokenized (broken into individual words).

It helps in tasks such as text classification, information retrieval and text summarization by reducing words to a base form. While it is effective, it can sometimes introduce drawbacks including potential inaccuracies and a reduction in text readability.

Examples of stemming for the word "like": "likes", "liked", "likely" and "liking" all reduce to the common stem "like".
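
A quick way to see this in practice is to tokenize a sentence and stem each token. The sketch below uses a plain whitespace split in place of a real tokenizer (so it runs without downloading the 'punkt' resource) together with NLTK's Porter stemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A simple whitespace split stands in for nltk.word_tokenize here.
sentence = "I liked the movie and will likely keep liking similar movies"
tokens = sentence.split()

stems = [stemmer.stem(token) for token in tokens]
print(stems)
# 'liked', 'likely' and 'liking' all collapse to 'like', while 'movie'
# and 'movies' both become 'movi' -- a stem that is not a real word.
```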

Types of Stemmer in NLTK

Python's NLTK (Natural Language Toolkit) provides several stemming algorithms, each suited to different scenarios and languages. Let's look at some of the most commonly used stemmers:

**1. Porter's Stemmer**

Porter's Stemmer is one of the most popular and widely used stemming algorithms. Proposed in 1980 by Martin Porter, this stemmer works by applying a series of rules to remove common suffixes from English words. It is well-known for its simplicity, speed and reliability. However, the stemmed output is not guaranteed to be a meaningful word and its applications are limited to the English language.

**Example:** "connection", "connected" and "connecting" are all reduced to the stem "connect".

**Advantages:** Simple, fast and reliable; it remains a common baseline for English text.

**Limitations:** The stem is not guaranteed to be a valid English word, and the algorithm supports English only.

Now let's implement Porter's Stemmer in Python using the NLTK library.

```python
from nltk.stem import PorterStemmer

porter_stemmer = PorterStemmer()

words = ["running", "jumps", "happily", "running", "happily"]

stemmed_words = [porter_stemmer.stem(word) for word in words]

print("Original words:", words)
print("Stemmed words:", stemmed_words)
```

**Output:**

Original words: ['running', 'jumps', 'happily', 'running', 'happily']
Stemmed words: ['run', 'jump', 'happili', 'run', 'happili']

**2. Snowball Stemmer**

The Snowball Stemmer is an improved version of the Porter Stemmer, also developed by Martin Porter. It is often referred to as Porter2 and is faster and somewhat more aggressive than its predecessor. A key advantage is that it supports multiple languages, making it a multilingual stemmer.

**Example:** "fairly" is stemmed to "fair" (the original Porter Stemmer leaves it as "fairli") and "quickly" becomes "quick".

**Advantages:** More consistent than Porter's Stemmer and supports many languages besides English (e.g. French, German, Spanish, Russian).

**Limitations:** Still purely rule-based, so it can produce stems that are not valid words and it ignores context.

Now let's implement the Snowball Stemmer in Python using the NLTK library.

```python
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer(language='english')

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)
```

**Output:**

Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happili', 'quick', 'fox']
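
Because Snowball is multilingual, the same class can be pointed at other languages. A minimal sketch using NLTK's SnowballStemmer, which lists its supported languages on the class itself (the French and German words below are just example inputs; their stems depend on each language's rule set):

```python
from nltk.stem import SnowballStemmer

# The class exposes the languages it ships with.
print(SnowballStemmer.languages)

# Stemmers for other languages are constructed the same way as for English.
french_stemmer = SnowballStemmer(language='french')
german_stemmer = SnowballStemmer(language='german')

print(french_stemmer.stem("continuellement"))  # a French adverb
print(german_stemmer.stem("laufend"))          # a German participle
```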

**3. Lancaster Stemmer**

The Lancaster Stemmer is known for being more aggressive and faster than the other stemmers. However, it is also more destructive and can produce excessively shortened stems. It applies a table of rules iteratively, and the rule set can also be supplied externally.

**Example:** "happily" is stemmed to "happy", while "organization" is cut down to "org", which shows how aggressive the truncation can be.

**Advantages:** Very fast and produces very short stems, which can shrink the vocabulary considerably.

**Limitations:** Often over-stems, producing stems that are too short to be meaningful or recognizable.

Now let's implement the Lancaster Stemmer in Python using the NLTK library.

```python
from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

words_to_stem = ['running', 'jumped', 'happily', 'quickly', 'foxes']

stemmed_words = [stemmer.stem(word) for word in words_to_stem]

print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)
```

**Output:**

Original words: ['running', 'jumped', 'happily', 'quickly', 'foxes']
Stemmed words: ['run', 'jump', 'happy', 'quick', 'fox']
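
To see how the three rule-based stemmers differ in aggressiveness, it helps to run them side by side on the same words. A small comparison sketch using the NLTK stemmers shown above (the comment describes the typical pattern; run it to see the exact stems):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')
lancaster = LancasterStemmer()

words = ["fairly", "happily", "organization", "arguing"]

# One row per word: original, Porter, Snowball, Lancaster.
for word in words:
    print(f"{word:15} {porter.stem(word):15} "
          f"{snowball.stem(word):15} {lancaster.stem(word):15}")

# Typical pattern: Porter keeps 'fairli', Snowball trims it to 'fair',
# and Lancaster cuts 'organization' all the way down to 'org'.
```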

**4. Regexp Stemmer**

The Regexp Stemmer or Regular Expression Stemmer is a flexible stemming algorithm that allows users to define custom rules using regular expressions (regex). This stemmer can be helpful for very specific tasks where predefined rules are necessary for stemming.

**Example:** With the rule r'ing$', "running" is stemmed to "runn", because only the literal suffix "ing" is removed.

**Advantages:** Gives complete control over what is stripped, which is useful for domain-specific suffixes.

**Limitations:** Has no linguistic knowledge; the quality of the stems depends entirely on the regular expressions you write.

Now let's implement the Regexp Stemmer in Python using the NLTK library.

```python
from nltk.stem import RegexpStemmer

custom_rule = r'ing$'
regexp_stemmer = RegexpStemmer(custom_rule)

word = 'running'
stemmed_word = regexp_stemmer.stem(word)

print(f'Original Word: {word}')
print(f'Stemmed Word: {stemmed_word}')
```

**Output:**

Original Word: running
Stemmed Word: runn
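
Multiple suffixes can be combined in a single pattern, and the stemmer's `min` argument keeps very short words from being mangled. A small sketch (the suffix list here is only an illustration, not a recommended rule set):

```python
from nltk.stem import RegexpStemmer

# Strip a handful of common suffixes, but only from words with at
# least 4 characters (controlled by the `min` argument).
stemmer = RegexpStemmer(r'ing$|s$|ed$|able$', min=4)

for word in ['cars', 'running', 'agreed', 'is']:
    print(word, '->', stemmer.stem(word))
# 'is' is left untouched because it is shorter than min=4.
```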

**5. Krovetz Stemmer**

The Krovetz Stemmer was developed by Robert Krovetz in 1993. It is designed to be more linguistically accurate and tends to preserve meaning better than the other stemmers because it checks candidate stems against a dictionary. Its steps include converting plural forms to singular, converting past tense to present tense and removing "-ing" endings.

**Example:** Converting plurals to singular maps "studies" to "study", and removing "-ing" maps "running" to "run", with a dictionary check ensuring the result is a real word.

**Advantages:** Produces valid, readable words and preserves meaning better than purely rule-based stemmers.

**Limitations:** Slower than rule-based stemmers, depends on the coverage of its dictionary and is less aggressive at reducing vocabulary size.

**Note:** The Krovetz Stemmer is not natively available in the NLTK library, unlike stemmers such as Porter, Snowball or Lancaster.
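
Since NLTK does not ship a Krovetz implementation, the sketch below only illustrates the core idea behind it: strip a suffix, then keep the result only if it is a known word. The tiny `known_words` set and the single "-ing" rule are illustrative assumptions, not the real KSTEM algorithm or its lexicon.

```python
# A toy illustration of dictionary-validated stemming (the idea behind
# Krovetz), NOT the actual KSTEM algorithm or its dictionary.
known_words = {"run", "jump", "sing", "study"}   # stand-in dictionary

def krovetz_like_stem(word: str) -> str:
    if word.endswith("ing") and len(word) > 4:
        candidate = word[:-3]
        # Undo consonant doubling, e.g. 'running' -> 'runn' -> 'run'.
        if candidate not in known_words and candidate[-1] == candidate[-2]:
            candidate = candidate[:-1]
        # Only accept the stem if it is a real word; otherwise back off.
        if candidate in known_words:
            return candidate
    return word

for w in ["running", "jumping", "sing", "studying"]:
    print(w, "->", krovetz_like_stem(w))
# 'sing' survives thanks to the length guard, so short real words
# that happen to end in 'ing' are not damaged.
```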

Stemming vs. Lemmatization

Let's look at the differences between stemming and lemmatization side by side:

| Aspect | Stemming | Lemmatization |
|---|---|---|
| Definition | Reduces words to their root form, often producing non-valid words | Reduces words to their base form (lemma), which is always a valid word |
| Approach | Uses simple rules or algorithms | Uses vocabulary and linguistic analysis |
| Context awareness | Does not consider context | Considers context and part of speech |
| Output validity | May produce invalid words | Always produces valid words |
| Accuracy | Less accurate | More accurate |
| Speed | Faster | Slower compared to stemming |
| Example | "Better" → "bet" | "Better" → "good" |
| Use case | Quick text preprocessing | NLP tasks needing meaning preservation |
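
The "better" row can be reproduced with NLTK: the Lancaster stemmer truncates the word, while the WordNet lemmatizer (which needs the WordNet corpus downloaded and the part of speech supplied) maps it to its lemma:

```python
from nltk.stem import LancasterStemmer, WordNetLemmatizer

# Requires: nltk.download('wordnet') the first time it is used.
stemmer = LancasterStemmer()
lemmatizer = WordNetLemmatizer()

word = "better"
print("Stemmed:", stemmer.stem(word))                      # 'bet'
print("Lemmatized:", lemmatizer.lemmatize(word, pos="a"))  # 'good' (adjective)
```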

Applications of Stemming

Stemming plays an important role in many NLP tasks. Some of its key applications include:

  1. **Information Retrieval:** It is used in search engines to improve the accuracy of search results. By reducing words to their root form, it ensures that documents containing different word forms such as "run", "running" and "runner" are grouped together.
  2. **Text Classification:** In text classification, it reduces the feature space by consolidating variations of a word into a single representation, which can improve the performance of machine learning algorithms (see the sketch after this list).
  3. **Document Clustering:** It helps group similar documents by normalizing word forms, making it easier to identify patterns across large text corpora.
  4. **Sentiment Analysis:** Before sentiment analysis, it is used to preprocess reviews and comments, allowing the system to analyze sentiment based on root words and handle word variations more consistently.
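
The feature-space reduction mentioned in the text-classification point can be made concrete by counting vocabulary sizes before and after stemming on a toy corpus (the three example documents below are made up purely for illustration):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A tiny made-up corpus: each document is a list of lowercase tokens.
docs = [
    "the runner was running and runs daily".split(),
    "she likes running shoes and liked the run".split(),
    "they run because running is liked by runners".split(),
]

raw_vocab = {token for doc in docs for token in doc}
stemmed_vocab = {stemmer.stem(token) for doc in docs for token in doc}

print("Vocabulary size before stemming:", len(raw_vocab))
print("Vocabulary size after stemming:", len(stemmed_vocab))
# Fewer distinct features means a smaller, denser input for a classifier.
```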

Challenges in Stemming

While stemming is beneficial, it also comes with some challenges:

  1. **Over-Stemming:** Occurs when words are reduced too aggressively, leading to a loss of meaning. For example, "arguing" becomes "argu", which is harder to interpret.
  2. **Under-Stemming:** Occurs when related words are not reduced to a common base form, causing inconsistencies. For example, "argument" and "arguing" may not share a stem.
  3. **Loss of Meaning:** Stemming ignores context, which can lead to incorrect interpretations in tasks like sentiment analysis.
  4. **Choosing the Right Stemmer:** Different stemmers produce different results, so the choice requires careful selection and testing for the task at hand.

These challenges can be solved by fine-tuning the stemming process or using lemmatization when necessary.
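
Both over- and under-stemming can be observed directly with NLTK's Porter stemmer: "arguing" is cut down to the non-word "argu", while "argument" keeps its full form, so the two related words never meet at a common stem:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem("arguing"))   # 'argu'     -- over-stemmed to a non-word
print(stemmer.stem("argument"))  # 'argument' -- left untouched
# Related words end up with different stems, so a search for one
# would not match documents containing the other.
```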

Advantages of Stemming

Stemming provides various benefits which are as follows:

  1. **Text Normalization:** By reducing words to their root form, it normalizes text, making it easier to analyze and process.
  2. **Improved Efficiency:** It reduces the dimensionality of text data, which can improve the performance of machine learning algorithms.
  3. **Information Retrieval:** It enhances search engine performance by ensuring that variations of the same word are treated as the same entity.
  4. **Facilitates Language Processing:** It simplifies text by collapsing word variations, which makes it easier to process and analyze large text datasets.
