Building Language Models in NLP

Last Updated : 23 Jul, 2025

Building language models is a fundamental task in natural language processing (NLP) that involves creating computational models capable of predicting the next word in a sequence of words. These models are essential for various NLP applications, such as machine translation, speech recognition, and text generation.

In this article, we will build a simple next-word language model using an LSTM network.

What is a Language Model?

A language model is a statistical model that assigns a probability to a sequence of words, usually by predicting each word from the words that come before it.
It learns the structure and patterns of a language from a text corpus and can then generate new text that resembles that corpus.
This is why language models sit at the core of many NLP tasks, such as machine translation, speech recognition, and text generation.
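To make "probability of a sequence of words" concrete, here is a minimal count-based sketch. It is not the LSTM model built in this article: it is a simple bigram model, and the helper names (train_bigram_counts, sequence_probability) and the lowercased, punctuation-free corpus are assumptions made for illustration. It approximates the probability of a sequence as the probability of the first word times the probability of each following word given the word before it.

```python
from collections import Counter


def train_bigram_counts(tokens):
    """Count single words and adjacent word pairs in a token list."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return unigram_counts, bigram_counts


def sequence_probability(tokens, unigram_counts, bigram_counts):
    """Approximate P(w1..wn) as P(w1) * product of P(w_i | w_{i-1})."""
    total = sum(unigram_counts.values())
    prob = unigram_counts[tokens[0]] / total
    for prev, curr in zip(tokens, tokens[1:]):
        if unigram_counts[prev] == 0:
            return 0.0  # the previous word never occurred in the corpus
        # Raw count ratio, with no smoothing for unseen pairs
        prob *= bigram_counts[(prev, curr)] / unigram_counts[prev]
    return prob


# Hypothetical pre-tokenized corpus (lowercased, punctuation removed)
corpus = "hello how are you i am doing well thank you for asking".split()
unigrams, bigrams = train_bigram_counts(corpus)

p_seen = sequence_probability(["how", "are", "you"], unigrams, bigrams)
p_unseen = sequence_probability(["you", "are", "how"], unigrams, bigrams)
print(p_seen, p_unseen)
```

A word order that appears in the corpus gets a non-zero probability, while an order that never appears gets zero. A neural model such as an LSTM replaces these raw counts with learned parameters, so it can assign probability to sequences it has not seen verbatim.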

Steps to Build a Language Model in NLP

The steps below build a small LSTM-based language model with TensorFlow and Keras.

Step 1: Importing Necessary Libraries

First, import the libraries needed to build the model.

    import tensorflow as tf
    from tensorflow.keras.layers import Embedding, LSTM, Dense
    from tensorflow.keras.models import Sequential

Step 2: Generate Sample Data

Next, define a short piece of sample text to train on.

text_data = "Hello, how are you? I am doing well. Thank you for asking."

Step 3: Preprocessing the Data

Preprocessing involves tokenizing the input text, creating input sequences, and padding the sequences so that they all have the same length.

    # Tokenize the text
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts([text_data])
    # Add 1 because index 0 is reserved for padding
    total_words = len(tokenizer.word_index) + 1
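To show what tokenization produces, here is a rough pure-Python stand-in for the idea behind the tokenizer. It is a sketch, not the Keras implementation: the helpers build_word_index and encode are hypothetical, and the regular expression is only an approximation of the punctuation filtering. It lowercases the text, strips punctuation, and gives each word an integer index, with more frequent words getting smaller indices.

```python
import re
from collections import Counter


def split_words(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(text):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter(split_words(text))
    # most_common() keeps first-seen order among words with equal counts
    ranked = [word for word, _ in counts.most_common()]
    return {word: idx for idx, word in enumerate(ranked, start=1)}


def encode(text, word_index):
    """Turn text into a list of word indices, skipping unknown words."""
    return [word_index[w] for w in split_words(text) if w in word_index]


text_data = "Hello, how are you? I am doing well. Thank you for asking."
word_index = build_word_index(text_data)
total_words = len(word_index) + 1  # index 0 is left free for padding

print(word_index)
print(encode("how are you", word_index))
```

The sample text contains 11 distinct words, so total_words is 12 in this sketch. The real Tokenizer has more options (custom filters, out-of-vocabulary tokens, and so on), so treat this only as an illustration of the mapping from words to integers.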

    # Create input sequences and labels
    input_sequences = []
    for line in text_data.split('.'):
        token_list = tokenizer.texts_to_sequences([line])[0]
        # Every prefix of length >= 2 becomes one training sequence
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

    # Pad sequences for equal length
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = tf.keras.preprocessing.sequence.pad_sequences(
        input_sequences, maxlen=max_sequence_len, padding='pre')
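The sequence creation and padding steps can be sketched in plain Python to show the shape of the data. The helpers prefix_sequences and pre_pad are hypothetical names, and the token ids are made up for illustration; the sketch assumes padding with 0 on the left, as padding='pre' does.

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2: [t0, t1], [t0, t1, t2], ..."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]


def pre_pad(sequences, maxlen, pad_value=0):
    """Left-pad each sequence with pad_value up to maxlen (no truncation)."""
    return [[pad_value] * (maxlen - len(seq)) + list(seq) for seq in sequences]


# Hypothetical token ids for a four-word sentence
tokens = [9, 1, 10, 11]

seqs = prefix_sequences(tokens)
padded = pre_pad(seqs, maxlen=max(len(s) for s in seqs))

for row in padded:
    print(row)
```

Each row ends with the word the model should predict, and everything before it is the context. Padding on the left keeps that target word in the last column of every row, which makes the later split into predictors and labels a simple slice.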

Step 4: One hot encoding

Each padded sequence is split into a predictor (every token except the last, stored in xs) and a label (the last token). The labels are then converted to one-hot vectors (ys).

    # Create predictors and label
    xs, labels = input_sequences[:,:-1], input_sequences[:,-1]

    # Convert labels to one-hot encoding
    ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
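The split and the one-hot encoding can be illustrated with NumPy alone. This is a sketch under assumptions: the padded array and the vocabulary size of 12 are made-up example values, and indexing an identity matrix with np.eye is used here as a stand-in for to_categorical.

```python
import numpy as np

# Hypothetical padded sequences; the last column holds the target word id
padded = np.array([
    [0, 0, 9, 1],
    [0, 9, 1, 10],
    [9, 1, 10, 11],
])
num_classes = 12  # assumed vocabulary size, including the padding index 0

xs_demo = padded[:, :-1]    # all columns but the last: the context
labels_demo = padded[:, -1]  # last column: the word to predict

# Row i of the identity matrix is the one-hot vector for class i
ys_demo = np.eye(num_classes, dtype="float32")[labels_demo]

print(xs_demo.shape, ys_demo.shape)
print(ys_demo[0])
```

Each row of ys_demo contains a single 1 at the position of the target word id. This is the format that the categorical_crossentropy loss expects.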

Step 5: Defining and Compiling the Model

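This code defines, compiles, and trains a simple LSTM-based language model using Keras. An Embedding layer turns word indices into dense vectors, an LSTM layer reads the sequence, and a Dense softmax layer outputs a probability for every word in the vocabulary.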

    # Define the model
    model = Sequential()
    model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
    model.add(LSTM(100))
    model.add(Dense(total_words, activation='softmax'))

    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Fit the model
    history = model.fit(xs, ys, epochs=100, verbose=1)
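To get a feel for the size of this model, the trainable parameters of each layer can be worked out by hand. The sketch below rests on two assumptions: a vocabulary size of 12 (the 11 distinct words in the sample text plus the padding index), and a standard LSTM cell with four gates and bias terms. The function names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


vocab_size = 12   # assumed: 11 distinct words + 1 for the padding index
embed_dim = 64
lstm_units = 100

n_embedding = embedding_params(vocab_size, embed_dim)
n_lstm = lstm_params(embed_dim, lstm_units)
n_dense = dense_params(lstm_units, vocab_size)

print(n_embedding, n_lstm, n_dense, n_embedding + n_lstm + n_dense)
```

Under these assumptions most of the parameters sit in the LSTM layer. If the assumptions hold, calling model.summary() on the model above should report comparable numbers.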

Step 6: Generating Text

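The generate_text function takes a seed_text and appends next_words words to it, one at a time. On each step it encodes and pads the current text, asks the model for next-word probabilities, and picks the most probable word. It relies on the tokenizer fitted in Step 3.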

    def generate_text(seed_text, next_words, model, max_sequence_len):
        for _ in range(next_words):
            # Encode the current text and left-pad it to the model's input length
            token_list = tokenizer.texts_to_sequences([seed_text])[0]
            token_list = tf.keras.preprocessing.sequence.pad_sequences(
                [token_list], maxlen=max_sequence_len-1, padding='pre')
            # Pick the most probable next word
            predicted_probs = model.predict(token_list, verbose=0)[0]
            predicted_index = tf.argmax(predicted_probs, axis=-1).numpy()
            # Map the predicted index back to its word
            output_word = ""
            for word, index in tokenizer.word_index.items():
                if index == predicted_index:
                    output_word = word
                    break
            seed_text += " " + output_word
        return seed_text

    # Generate text
    print(generate_text("how", 5, model, max_sequence_len))
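The generation loop is a form of greedy decoding: at every step it takes the single most probable next word. The sketch below isolates that loop from TensorFlow so it can be run on its own. The probability table in toy_next_word_probs is invented and stands in for model.predict, the names greedy_generate and index_word are hypothetical, and, unlike the function above, this version stops early when the predicted index has no word (for example the padding index 0).

```python
def greedy_generate(seed_words, next_words, next_word_probs, index_word):
    """Repeatedly append the most probable next word to the seed."""
    words = list(seed_words)
    for _ in range(next_words):
        probs = next_word_probs(words)
        # Index of the largest probability (argmax)
        predicted_index = max(range(len(probs)), key=probs.__getitem__)
        # A reverse lookup dict avoids scanning the whole vocabulary
        word = index_word.get(predicted_index, "")
        if not word:
            break  # no word for this index, e.g. the padding index
        words.append(word)
    return " ".join(words)


# Hypothetical vocabulary and probabilities, for illustration only
index_word = {1: "how", 2: "are", 3: "you"}
toy_table = {
    "how": [0.0, 0.1, 0.8, 0.1],
    "are": [0.0, 0.1, 0.1, 0.8],
    "you": [0.9, 0.05, 0.03, 0.02],
}


def toy_next_word_probs(words):
    """Stand-in for model.predict: looks only at the last word."""
    return toy_table[words[-1]]


result = greedy_generate(["how"], 5, toy_next_word_probs, index_word)
print(result)
```

Greedy decoding is simple and deterministic, but it always produces the same continuation for the same seed. Sampling from the predicted probabilities instead of taking the maximum is a common alternative when more varied text is wanted.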

Below is the complete implementation:

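    import tensorflow as tf
    from tensorflow.keras.layers import Embedding, LSTM, Dense
    from tensorflow.keras.models import Sequential

    # Sample text data
    text_data = "Hello, how are you? I am doing well. Thank you for asking."

    # Tokenize the text
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts([text_data])
    total_words = len(tokenizer.word_index) + 1

    # Create input sequences and labels
    input_sequences = []
    for line in text_data.split('.'):
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)

    # Pad sequences for equal length
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = tf.keras.preprocessing.sequence.pad_sequences(
        input_sequences, maxlen=max_sequence_len, padding='pre')

    # Create predictors and label
    xs, labels = input_sequences[:,:-1], input_sequences[:,-1]

    # Convert labels to one-hot encoding
    ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

    # Define the model
    model = Sequential()
    model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
    model.add(LSTM(100))
    model.add(Dense(total_words, activation='softmax'))

    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    # Fit the model
    history = model.fit(xs, ys, epochs=100, verbose=1)

    # Generate next_words words after seed_text, one word at a time
    def generate_text(seed_text, next_words, model, max_sequence_len):
        for _ in range(next_words):
            token_list = tokenizer.texts_to_sequences([seed_text])[0]
            token_list = tf.keras.preprocessing.sequence.pad_sequences(
                [token_list], maxlen=max_sequence_len-1, padding='pre')
            predicted_probs = model.predict(token_list, verbose=0)[0]
            predicted_index = tf.argmax(predicted_probs, axis=-1).numpy()
            output_word = ""
            for word, index in tokenizer.word_index.items():
                if index == predicted_index:
                    output_word = word
                    break
            seed_text += " " + output_word
        return seed_text

    # Generate text
    print(generate_text("how", 5, model, max_sequence_len))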

Output (the exact words may vary between runs, because the model weights are initialized randomly):

how are you i am doing

In summary, building a language model for natural language processing (NLP) involves several stages: tokenization, sequence creation, model construction, training, and text generation. Tokenization turns text into integer indices, and sequence creation turns those indices into input-output pairs for training. The model here consists of an Embedding layer and an LSTM layer, followed by a Dense softmax layer that predicts the next word. Training fits the model to the input sequences and their labels, and text generation uses the trained model to extend a seed text one word at a time. Overall, language models are vital for NLP tasks such as text generation, machine translation, and sentiment analysis, among others.