Tokenize text using NLTK in Python

Last Updated : 5 Aug, 2025

Tokenization is a fundamental Natural Language Processing (NLP) technique. It is the process of dividing text into smaller components, or tokens, such as sentences or words.

With Python’s popular library NLTK (Natural Language Toolkit), splitting text into meaningful units becomes both simple and extremely effective.

Basic Implementation

Let's see how to implement tokenization using NLTK in Python:

**Step 1: Install and Setup**

Install NLTK, then download the “punkt” tokenizer models needed for sentence and word tokenization.

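```python
# Notebook syntax; from a terminal, run `pip install nltk` instead
!pip install nltk

import nltk

nltk.download('punkt')      # Punkt sentence tokenizer models
nltk.download('punkt_tab')  # resource name that recent NLTK releases load instead
```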

**Step 2: Tokenize Sentences**

sent_tokenize() splits a string into a list of sentences, handling punctuation and abbreviations.

```python
from nltk.tokenize import sent_tokenize

text = "NLTK is a great NLP toolkit. It makes processing text easy!"
sentences = sent_tokenize(text)  # list of sentence strings
print(sentences)
```

**Output:**

['NLTK is a great NLP toolkit.', 'It makes processing text easy!']

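**Step 3: Tokenize Words**

word_tokenize() splits a sentence into word and punctuation tokens.

```python
from nltk.tokenize import word_tokenize

sentence = "Tokenization is easy with NLTK's word_tokenize."
words = word_tokenize(sentence)  # list of word and punctuation tokens
print(words)
```

**Output:**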

['Tokenization', 'is', 'easy', 'with', 'NLTK', "'s", 'word_tokenize', '.']

Let's look at some more examples:

**1. WordPunctTokenizer**

It splits text into runs of alphanumeric characters and runs of punctuation, so contractions and e-mail addresses are broken apart.

```python
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize(
    "Don't split contractions. E-mails: hello@example.com!")
print(tokens)
```

**Output:**

['Don', "'", 't', 'split', 'contractions', '.', 'E', '-', 'mails', ':', 'hello', '@', 'example', '.', 'com', '!']
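The NLTK documentation describes WordPunctTokenizer as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below checks that description on a made-up sample string by comparing it with RegexpTokenizer and with Python's own re.findall(); the constant name `PATTERN` and the sample text are illustrative.

```python
import re
from nltk.tokenize import RegexpTokenizer, WordPunctTokenizer

# Runs of word characters, or runs of characters that are neither word nor whitespace
PATTERN = r"\w+|[^\w\s]+"

text = "Price: $3.50!!"

wordpunct = WordPunctTokenizer().tokenize(text)
regexp = RegexpTokenizer(PATTERN).tokenize(text)
plain_re = re.findall(PATTERN, text)

print(wordpunct)                        # ['Price', ':', '$', '3', '.', '50', '!!']
print(wordpunct == regexp == plain_re)  # True
```

Consecutive punctuation marks such as "!!" stay together as one token, while "$3.50" is split into four pieces.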

**2. TreebankWordTokenizer**

It follows Penn Treebank conventions, which makes it suitable for linguistic analysis: it separates punctuation from words and splits contractions.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("Have a look at NLTK's tokenizers.")
print(tokens)
```

**Output:**

['Have', 'a', 'look', 'at', 'NLTK', "'s", 'tokenizers', '.']
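Two behaviours of this tokenizer are worth seeing in isolation: how it splits contractions, and the fact that it expects one sentence per call. The sketch below uses made-up sample sentences; the variable names are illustrative.

```python
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Contractions are split into a stem and a clitic such as "n't"
contraction_tokens = tokenizer.tokenize("They don't know.")
print(contraction_tokens)  # ['They', 'do', "n't", 'know', '.']

# Only the period at the very end of the string is separated;
# the one after "there" stays attached to the word
two_sentence_tokens = tokenizer.tokenize("Hello there. I am fine.")
print(two_sentence_tokens)  # ['Hello', 'there.', 'I', 'am', 'fine', '.']
```

For text with several sentences, it is therefore usually better to split into sentences first and tokenize each one separately.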

**3. Regex Tokenizer**

It tokenizes text according to a custom regular expression pattern.

```python
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # keep runs of word characters only
tokens = tokenizer.tokenize(
    "Custom rule: keep only words & numbers, drop punctuation!")
print(tokens)
```

**Output:**

['Custom', 'rule', 'keep', 'only', 'words', 'numbers', 'drop', 'punctuation']
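The pattern can describe either the tokens themselves or the gaps between them. As a rough sketch, the first tokenizer below uses the `gaps=True` option to split on whitespace, and the second uses an illustrative (deliberately simplified) pattern that tries e-mail-like strings before plain words. Both patterns and the variable names are assumptions made for this example, not rules that NLTK ships with.

```python
from nltk.tokenize import RegexpTokenizer

# gaps=True: the pattern matches the separators instead of the tokens
whitespace_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
gap_tokens = whitespace_tokenizer.tokenize("keep punctuation, split   on whitespace!")
print(gap_tokens)  # ['keep', 'punctuation,', 'split', 'on', 'whitespace!']

# Simplified e-mail pattern first, then runs of word characters.
# The group is non-capturing so that whole matches are returned.
email_aware = RegexpTokenizer(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\w+")
email_tokens = email_aware.tokenize("E-mails: hello@example.com!")
print(email_tokens)  # ['E', 'mails', 'hello@example.com']
```

Unlike WordPunctTokenizer, the second tokenizer keeps the address in one piece, which shows how the order of alternatives in the pattern shapes the result.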

NLTK provides a useful and user-friendly toolkit for tokenizing text in Python, supporting a range of tokenization needs from basic word and sentence splitting to advanced custom patterns.