Text Preprocessing in NLP

Last Updated : 2 May, 2026

Raw text data is often unstructured, noisy and inconsistent, containing typos, punctuation, stopwords and irrelevant information. Text preprocessing converts this data into a clean, structured and standardized format, enabling effective feature extraction and improving model performance.

Implementation

Here we implement text preprocessing techniques in Python, showing how raw text is cleaned, transformed and prepared for NLP tasks.

Step 1: Preparing the Sample Corpus

Here we define a sample corpus containing a variety of text examples, including HTML tags, emojis, URLs, numbers, punctuation and typos. This corpus will be used to demonstrate each preprocessing step in detail.

```python
corpus = [
    "I can't wait for the new season of my favorite show! 😍",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "Welcome to the website!",
    "Python is a great programming language!!! ??",
    "Check out https://www.example.com for more info!",
    "He won 1st prize in the comp3tition!!!",
    "I luvv this movie sooo much!!!",
]
```

Step 2: Text Cleaning and Regular Expressions

Text cleaning removes noise and unwanted elements from raw text so that it is structured and easier for NLP models to analyze. Regular expressions (regex) are a useful tool for this, letting you find, match and manipulate patterns in text efficiently.

```python
import re
import string

from bs4 import BeautifulSoup


def clean_text(text):
    text = text.lower()                                    # normalize case
    text = BeautifulSoup(text, "html.parser").get_text()   # strip HTML tags
    text = re.sub(r'http\S+|www\S+', '', text)             # remove URLs
    text = re.sub(r'\d+', '', text)                        # remove digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = re.sub(r'\W+', ' ', text)                       # replace leftover non-word characters (e.g. emojis)
    text = re.sub(r'\s+', ' ', text).strip()               # collapse whitespace
    return text


cleaned_corpus = [clean_text(doc) for doc in corpus]
print("Cleaned Corpus:\n", cleaned_corpus)
```

**Output:**

Cleaned Corpus:

['i cant wait for the new season of my favorite show', 'the covid pandemic has affected millions of people worldwide', 'us stocks fell on friday after news of rising inflation', 'welcome to the website', 'python is a great programming language', 'check out for more info', 'he won st prize in the comptition', 'i luvv this movie sooo much']
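The effect of each cleaning step is easier to see when the intermediate strings are printed. The following is a minimal sketch using only the standard library: it mirrors the regex steps of `clean_text` but leaves out the BeautifulSoup step, and the names `STEPS` and `trace_cleaning` are hypothetical helpers introduced here for illustration.

```python
import re
import string

# Each entry is (label, function). The patterns mirror clean_text;
# the HTML-stripping step is omitted so no third-party package is needed.
STEPS = [
    ("lowercase", str.lower),
    ("remove URLs", lambda t: re.sub(r"http\S+|www\S+", "", t)),
    ("remove digits", lambda t: re.sub(r"\d+", "", t)),
    ("remove punctuation",
     lambda t: t.translate(str.maketrans("", "", string.punctuation))),
    ("collapse whitespace", lambda t: re.sub(r"\s+", " ", t).strip()),
]


def trace_cleaning(text):
    """Apply the steps in order and return (label, text_after_step) pairs."""
    trace = []
    for label, step in STEPS:
        text = step(text)
        trace.append((label, text))
    return trace


for label, result in trace_cleaning("Check out https://www.example.com for more info!"):
    print(f"{label:>20}: {result!r}")
```

Printing the trace makes it clear which pattern is responsible for each change, which helps when a step removes more (or less) than intended.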

Step 3: Tokenization

Tokenization is the process of breaking text into smaller units, such as words or sentences. This step converts raw text into a structured format that NLP models can analyze and process.

```python
import nltk
from nltk.tokenize import word_tokenize

# Download the tokenizer models (only needed once)
nltk.download('punkt')
nltk.download('punkt_tab')

tokenized_corpus = [word_tokenize(doc) for doc in cleaned_corpus]
print("Tokenized Corpus:\n", tokenized_corpus)
```

**Output:**

Tokenized Corpus:

[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language'], ['check', 'out', 'for', 'more', 'info'], ['he', 'won', 'st', 'prize', 'in', 'the', 'comptition'], ['i', 'luvv', 'this', 'movie', 'sooo', 'much']]
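To illustrate what a tokenizer does internally, here is a rough regex-based sketch that needs only the standard library. It is not how `word_tokenize` works; the pattern and the function names `regex_tokenize` and `split_sentences` are assumptions made for this example.

```python
import re

# A word (optionally with one internal apostrophe part, e.g. "can't"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\w\s]")


def regex_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(regex_tokenize("I can't wait for the new season!"))
print(split_sentences("Python is great. I use it daily! Do you?"))
```

The naive sentence rule would wrongly split after the abbreviation in "U.S. stocks fell", which is one reason trained tokenizers such as NLTK's punkt models are usually preferred for real text.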

Step 4: Stopword Removal

Stopwords are common words in a language (like “is”, “the”, “and”) that usually do not add significant meaning to text analysis. Removing them helps NLP models focus on the more meaningful words in the text.

```python
import nltk
from nltk.corpus import stopwords

# Download the stopword lists (only needed once)
nltk.download('stopwords')

# A set gives fast membership checks
stop_words = set(stopwords.words('english'))
filtered_corpus = [[word for word in doc if word not in stop_words] for doc in tokenized_corpus]
print("Stopword Removed Corpus:\n", filtered_corpus)
```

**Output:**

Stopword Removed Corpus:

[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'millions', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language'], ['check', 'info'], ['st', 'prize', 'comptition'], ['luvv', 'movie', 'sooo', 'much']]
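Stopword lists are not one-size-fits-all. NLTK's English list, for instance, includes negations such as "not" and "no", and dropping them can flip the meaning of a sentence in tasks like sentiment analysis. The sketch below uses a tiny hand-written list (an assumption for illustration, not NLTK's actual list) to show how negations can be kept.

```python
# A tiny hand-written stopword list for illustration only; real lists
# contain well over a hundred entries.
BASE_STOPWORDS = {"i", "the", "is", "a", "of", "this", "not", "no", "for", "in"}
NEGATIONS = {"not", "no", "never"}

# Keep negation words so that "not good" is not reduced to "good".
SENTIMENT_STOPWORDS = BASE_STOPWORDS - NEGATIONS


def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in the given stopword set."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = ["this", "movie", "is", "not", "good"]
print(remove_stopwords(tokens, BASE_STOPWORDS))       # ['movie', 'good']
print(remove_stopwords(tokens, SENTIMENT_STOPWORDS))  # ['movie', 'not', 'good']
```

The same set-difference idea applies to the `stop_words` set built from NLTK.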

Step 5: Stemming

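Stemming is the process of reducing words to their root or base form by stripping suffixes. It normalizes text by treating different forms of a word (e.g., “running”, “runs”) as the same word (“run”). Because it relies on heuristic rules, the resulting stem is not always a dictionary word (for example, “favorite” becomes “favorit”).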

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_corpus = [[stemmer.stem(word) for word in doc] for doc in filtered_corpus]
print("Stemmed Corpus:\n", stemmed_corpus)
```

**Output:**

Stemmed Corpus:

[['cant', 'wait', 'new', 'season', 'favorit', 'show'], ['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid'], ['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat'], ['welcom', 'websit'], ['python', 'great', 'program', 'languag'], ['check', 'info'], ['st', 'prize', 'comptit'], ['luvv', 'movi', 'sooo', 'much']]
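The idea behind rule-based stemming can be sketched with a handful of suffix rules. This is a deliberately tiny toy, not the Porter algorithm; the rule list and the name `toy_stem` are made up for this example.

```python
# Ordered (suffix, replacement) rules; the first rule that applies wins.
# This is a deliberately tiny rule set, not the Porter algorithm.
SUFFIX_RULES = [("sses", "ss"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem_len:
                return stem
    return word


for w in ["stocks", "affected", "classes", "rising", "news"]:
    print(w, "->", toy_stem(w))
```

The toy maps "rising" to "ris" and "news" to "new", whereas the Porter output above shows "rise" and "news". Real stemmers add many more rules and conditions to limit this kind of over-stemming.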

Step 6: Lemmatization

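Lemmatization is the process of converting a word to its meaningful base or dictionary form, called a lemma. Unlike stemming, it ensures that the result is an actual word in the language. Note that NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech argument is passed, so in the output of this step verb forms such as “affected” and “rising” are left unchanged, and “us” is reduced to “u” as if it were a plural noun.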

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Download the WordNet data (only needed once)
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
# lemmatize() treats each word as a noun by default
lemmatized_corpus = [[lemmatizer.lemmatize(word) for word in doc] for doc in filtered_corpus]
print("Lemmatized Corpus:\n", lemmatized_corpus)
```

**Output:**

Lemmatized Corpus:

[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'million', 'people', 'worldwide'], ['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation'], ['welcome', 'website'], ['python', 'great', 'programming', 'language'], ['check', 'info'], ['st', 'prize', 'comptition'], ['luvv', 'movie', 'sooo', 'much']]
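The lemma of a word depends on its part of speech. `WordNetLemmatizer.lemmatize` accepts a `pos` argument that defaults to `'n'` (noun). The toy lookup below illustrates why that matters; the table and the name `toy_lemmatize` are hypothetical, and a real lemmatizer consults a full lexicon such as WordNet.

```python
# Hypothetical lookup table: (word, part of speech) -> lemma.
# A real lemmatizer consults a full lexicon rather than five entries.
LEMMA_TABLE = {
    ("affected", "v"): "affect",
    ("rising", "v"): "rise",
    ("fell", "v"): "fall",
    ("millions", "n"): "million",
    ("stocks", "n"): "stock",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return LEMMA_TABLE.get((word, pos), word)


print(toy_lemmatize("rising"))        # looked up as a noun: unchanged
print(toy_lemmatize("rising", "v"))   # looked up as a verb: 'rise'
print(toy_lemmatize("fell", "v"))     # irregular verb form: 'fall'
```

In practice, the part of speech is usually obtained from a POS tagger and passed to the lemmatizer for each token.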

Step 7: Contractions Expansion

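Contraction expansion is the process of converting shortened forms of words (like “can’t”, “won’t”) into their full forms (“cannot”, “will not”). This helps NLP models better understand the meaning of the text. Because it relies on apostrophes, it should be applied to the raw text before punctuation is removed, which is why the code here works on the original corpus. In the output, “U.S.” becomes “YOU.S.”, most likely because the library also expands slang such as “u” to “you”, so the results should be checked before use.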

```python
import contractions

expanded_corpus = [contractions.fix(doc) for doc in corpus]
print("Expanded Corpus:\n", expanded_corpus)
```

**Output:**

Expanded Corpus:

['I cannot wait for the new season of my favorite show! 😍', 'The COVID-19 pandemic has affected millions of people worldwide.', 'YOU.S. stocks fell on Friday after news of rising inflation.', 'Welcome to the website!', 'Python is a great programming language!!! ??', 'Check out https://www.example.com for more info!', 'He won 1st prize in the comp3tition!!!', 'I luvv this movie sooo much!!!']
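Conceptually, contraction expansion is a dictionary lookup driven by a regular expression. The sketch below shows the idea with a small hand-written mapping. It is not the implementation of the `contractions` package, and `CONTRACTIONS` and `expand_contractions` are names assumed for this example.

```python
import re

# A small illustrative mapping; dedicated libraries ship far larger tables.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

# Match any known contraction as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their expanded forms."""
    # Normalize the curly apostrophe so both apostrophe styles are handled.
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I can't wait, and it won't be long."))
print(expand_contractions("U.S. stocks fell on Friday."))
```

With an explicit mapping, abbreviations such as "U.S." are left untouched. The trade-off is that contractions missing from the table are not expanded, and the replacement text is always lowercase.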

Step 8: Emoji Conversion

Emoji conversion is the process of converting emojis in text into descriptive text labels. This allows NLP models to understand the meaning conveyed by emojis.

```python
import emoji

emoji_corpus = [emoji.demojize(doc) for doc in corpus]
print("Emoji Converted Corpus:\n", emoji_corpus)
```

**Output:**

Emoji Converted Corpus:

["I can't wait for the new season of my favorite show! :smiling_face_with_heart-eyes:", 'The COVID-19 pandemic has affected millions of people worldwide.', 'U.S. stocks fell on Friday after news of rising inflation.', 'Welcome to the website!', 'Python is a great programming language!!! ??', 'Check out https://www.example.com for more info!', 'He won 1st prize in the comp3tition!!!', 'I luvv this movie sooo much!!!']
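When the `emoji` package is not available, a rough approximation can be built on Python's built-in `unicodedata` module. This sketch replaces every character in the Unicode category "So" (Symbol, other) with its official character name; the function name `describe_symbols` is an assumption for this example.

```python
import unicodedata


def describe_symbols(text):
    """Replace each 'Symbol, other' character with :its_unicode_name:."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Fall back to a placeholder if the character has no name.
            name = unicodedata.name(ch, "unknown_symbol")
            out.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            out.append(ch)
    return "".join(out)


print(describe_symbols("I can't wait for the new season of my favorite show! \U0001F60D"))
```

The labels come from Unicode character names, so they differ from those produced by the `emoji` package (here `:smiling_face_with_heart-shaped_eyes:` rather than `:smiling_face_with_heart-eyes:`). The category check also matches non-emoji symbols such as ©, and emojis built from several code points are not handled as a unit.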

Step 9: Spell Correction

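Spell correction is the process of identifying and correcting misspelled words in text, so that NLP models receive more consistent input. Automatic correction is not always reliable: in the output of this step, “comptition” is correctly fixed to “competition”, but “covid” is changed to “covin” and “sooo” to “soon”.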

```python
from spellchecker import SpellChecker

spell = SpellChecker()
# correction() can return None when no candidate is found,
# so fall back to the original token in that case
corrected_corpus = [[spell.correction(word) or word for word in doc] for doc in tokenized_corpus]
print("Spell Corrected Corpus:\n", corrected_corpus)
```

**Output:**

Spell Corrected Corpus:

[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covin', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'], ['welcome', 'to', 'the', 'website'], ['python', 'is', 'a', 'great', 'programming', 'language'], ['check', 'out', 'for', 'more', 'info'], ['he', 'won', 'st', 'prize', 'in', 'the', 'competition'], ['i', 'luvs', 'this', 'movie', 'soon', 'much']]
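One way to limit unwanted corrections is to check words against a vocabulary that includes domain terms. The sketch below uses the standard library's `difflib` for fuzzy matching. The vocabulary, the cutoff value and the name `correct_word` are assumptions for illustration, not how pyspellchecker works internally.

```python
from difflib import get_close_matches

# Hypothetical mini-vocabulary; a real checker uses a large frequency dictionary.
VOCABULARY = ["competition", "movie", "love", "prize", "much", "covid", "pandemic"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the input if nothing is close enough."""
    if word in vocabulary:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print([correct_word(w) for w in ["comptition", "covid", "httpswwwexamplecom"]])
```

Because "covid" is in the vocabulary it is left alone, and a token with no close match is returned unchanged. pyspellchecker offers a comparable mechanism for adding known words through `spell.word_frequency.load_words`.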

Step 10: Parts of Speech (POS) Tagging

POS tagging assigns grammatical labels (like noun, verb, adjective) to each word in a sentence. This helps NLP models understand the role of words and their relationships in the text.

```python
import nltk

# Download the tagger models (only needed once)
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

pos_tagged_corpus = [nltk.pos_tag(doc) for doc in tokenized_corpus]
print("POS Tagged Corpus:\n", pos_tagged_corpus)
```

**Output:**

POS Tagged Corpus:

[[('i', 'NN'), ('cant', 'VBP'), ('wait', 'NN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('season', 'NN'), ('of', 'IN'), ('my', 'PRP$'), ('favorite', 'JJ'), ('show', 'NN')], [('the', 'DT'), ('covid', 'NN'), ('pandemic', 'NN'), ('has', 'VBZ'), ('affected', 'VBN'), ('millions', 'NNS'), ('of', 'IN'), ('people', 'NNS'), ('worldwide', 'VBP')], [('us', 'PRP'), ('stocks', 'NNS'), ('fell', 'VBD'), ('on', 'IN'), ('friday', 'NN'), ('after', 'IN'), ('news', 'NN'), ('of', 'IN'), ('rising', 'VBG'), ('inflation', 'NN')], [('welcome', 'NN'), ('to', 'TO'), ('the', 'DT'), ('website', 'NN')], [('python', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('programming', 'NN'), ('language', 'NN')], [('check', 'VB'), ('out', 'RP'), ('for', 'IN'), ('more', 'JJR'), ('info', 'NN')], [('he', 'PRP'), ('won', 'VBD'), ('st', 'JJ'), ('prize', 'NN'), ('in', 'IN'), ('the', 'DT'), ('comptition', 'NN')], [('i', 'NN'), ('luvv', 'VBP'), ('this', 'DT'), ('movie', 'NN'), ('sooo', 'VBZ'), ('much', 'RB')]]
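The tags in the output follow the Penn Treebank convention, in which related tags share a prefix (for example NN and NNS for nouns, VB, VBD and VBZ for verbs). Downstream steps often need only a coarse category, so a small mapping like the one below can be handy. The mapping and the name `coarse_pos` are assumptions for this sketch and cover only a few tag families.

```python
# Map Penn Treebank tag prefixes to coarse categories (partial, for illustration).
COARSE_TAGS = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
    "PRP": "pronoun",
}


def coarse_pos(tag):
    """Return a coarse category for a Penn Treebank tag, or 'other'."""
    for prefix, label in COARSE_TAGS.items():
        if tag.startswith(prefix):
            return label
    return "other"


# One tagged sentence taken from the output above
tagged = [("python", "NN"), ("is", "VBZ"), ("a", "DT"), ("great", "JJ"),
          ("programming", "NN"), ("language", "NN")]
print([(word, coarse_pos(tag)) for word, tag in tagged])
```

A coarse category of this kind is also what a lemmatizer typically needs in order to pick the correct base form for verbs and adjectives.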


Applications

Advantages

Limitations