Vectors and Vectorization Techniques in NLP

Last Updated : 23 Jul, 2025

In Natural Language Processing (NLP), vectors play an important role in transforming human language into a format that machines can comprehend and process. These numerical representations enable computers to perform tasks such as sentiment analysis, machine translation and information retrieval with greater accuracy and efficiency.

Importance of Vectors in NLP

Vectorization Techniques in NLP

1. One-Hot Encoding

One-Hot Encoding represents each word as a vector whose length equals the vocabulary size: the position corresponding to the word's index in the vocabulary is set to 1 and all other positions are set to 0.
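As a minimal sketch of that definition, the snippet below builds one-hot vectors by hand in plain Python. The four-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for illustration, not part of any library. It also shows that the dot product of two different one-hot vectors is always zero, so the encoding says nothing about how similar two words are.

```python
# Toy vocabulary, chosen only for illustration (not from a real corpus).
vocabulary = ["cat", "dog", "mat", "sat"]

# Map each word to its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a list with a 1 at the word's index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

print(one_hot("dog"))                       # [0, 1, 0, 0]
print(dot(one_hot("cat"), one_hot("dog")))  # 0: distinct words never overlap
print(dot(one_hot("cat"), one_hot("cat")))  # 1
```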

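**Advantages of One-Hot Encoding:** simple to implement, and every word gets a unique, unambiguous vector.

**Disadvantages of One-Hot Encoding:** each vector is as long as the vocabulary, so the representation is sparse and grows with vocabulary size, and the vectors carry no information about similarity between words.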

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import string

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

# Lowercase each word, strip punctuation and build a sorted vocabulary
words = [word.lower().strip(string.punctuation) for doc in documents for word in doc.split()]
vocabulary = sorted(set(words))

# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
encoder = OneHotEncoder(sparse_output=False)
one_hot_vectors = encoder.fit_transform(np.array(vocabulary).reshape(-1, 1))

# Map each word to its one-hot vector
word_to_onehot = {vocabulary[i]: one_hot_vectors[i] for i in range(len(vocabulary))}

for word, vector in word_to_onehot.items():
    print(f"Word: {word}, One-Hot Encoding: {vector}")
```

**Output:**

[Figure: One-Hot Encoding output]

2. Bag of Words (BoW)

Bag of Words (BoW) converts text into a vector representing the frequency of words, disregarding grammar and word order. It counts the occurrences of each word in a document and generates a vector based on these counts.
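To make the "disregarding word order" point concrete, here is a small sketch using only the standard library. The helper `bag_of_words` is a hypothetical name, and its tokenization (lowercasing and stripping surrounding punctuation) is a simplifying assumption. Two sentences with the same words in a different order end up with identical counts.

```python
from collections import Counter
import string

def bag_of_words(text):
    """Lowercase, strip surrounding punctuation, and count each word."""
    words = [w.lower().strip(string.punctuation) for w in text.split()]
    return Counter(w for w in words if w)

a = bag_of_words("The cat sat on the mat.")
b = bag_of_words("The mat sat on the cat.")

print(a["the"])  # 2
print(a == b)    # True: same counts, different word order
```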

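**Advantages of Bag of Words (BoW):** simple to implement and cheap to compute, which makes it a reasonable baseline for simple text classification tasks.

**Disadvantages of Bag of Words (BoW):** it discards grammar and word order, treats every word as unrelated to every other word, and produces vectors as long as the vocabulary, so memory usage is high.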

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

# Learn the vocabulary and count word occurrences per document
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Rows are documents, columns are vocabulary words
print(X.toarray())
print(vectorizer.get_feature_names_out())
```

**Output:**

[Figure: Bag of Words (BoW) output]

3. Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is an extension of BoW that weighs the frequency of words by their importance across documents.

**1. Term Frequency (TF):** Measures the frequency of a word in a document.

$$TF(t,d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

**2. Inverse Document Frequency (IDF):** Measures the importance of a word across the entire corpus.

$$IDF(t) = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing term } t} \right)$$

The TF-IDF score is the product of TF and IDF.
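The two formulas can be worked through by hand. The sketch below is an illustration under stated assumptions: the helper names `tf`, `idf` and `tf_idf` are hypothetical, the logarithm is taken as the natural log (the formula does not fix a base), and tokens come from a plain whitespace split. It shows that a word appearing in fewer documents ("mat") can outscore a more frequent but more widespread word ("the").

```python
import math

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in documents]

def tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """log(N / df); assumes the term occurs in at least one document."""
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

first = tokenized[0]
print(round(tf("the", first), 3))                 # 0.333 (2 of 6 tokens)
print(round(idf("the", tokenized), 3))            # 0.405 (in 2 of 3 documents)
print(round(tf_idf("the", first, tokenized), 3))  # 0.135
print(round(tf_idf("mat", first, tokenized), 3))  # 0.183
```

Note that scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2-normalizes each row by default, so its numbers will not match this hand calculation exactly.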

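**Advantages of TF-IDF:** it down-weights words that appear in most documents and highlights terms that are distinctive for a document, which suits text classification, information retrieval and keyword extraction.

**Disadvantages of TF-IDF:** like BoW, it ignores word order and semantic similarity, and its vectors are as long as the vocabulary, so memory usage is high.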

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

# Learn the vocabulary and IDF weights, then compute TF-IDF scores
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Rows are documents, columns are vocabulary words
print(X_tfidf.toarray())
print(tfidf_vectorizer.get_feature_names_out())
```

**Output:**

[Figure: Term Frequency-Inverse Document Frequency (TF-IDF) output]

4. Count Vectorizer

Count Vectorizer is scikit-learn's implementation of the BoW model. It converts a collection of text documents into a matrix of token counts, where each row is a document, each column is a word in the vocabulary, and each element is the number of times that word occurs in that document.
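The document-to-matrix conversion can be sketched without scikit-learn. In the snippet below, `tokenize`, `fit_vocabulary` and `transform` are hypothetical helper names; the regular expression mirrors `CountVectorizer`'s default `token_pattern` (lowercased tokens of two or more word characters), and columns are assigned in sorted order. This is a simplified illustration, not the library's actual implementation.

```python
import re
from collections import Counter

# Same pattern as CountVectorizer's default token_pattern:
# tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_vocabulary(documents):
    """Collect all tokens and assign column indices in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(documents, vocabulary):
    """Build one row of counts per document; unknown tokens are skipped."""
    matrix = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:
                row[vocabulary[tok]] = n
        matrix.append(row)
    return matrix

docs = ["The cat sat on the mat.", "The dog sat on the log."]
vocab = fit_vocabulary(docs)
print(sorted(vocab))           # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(transform(docs, vocab))  # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]

# Words not seen while building the vocabulary are ignored.
print(transform(["A cat and a bird."], vocab))  # [[1, 0, 0, 0, 0, 0, 0]]
```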

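**Advantages of Count Vectorizer:** fast and simple, and well suited to tasks that depend mainly on word frequency.

**Disadvantages of Count Vectorizer:** raw counts let very frequent but uninformative words dominate, and the matrix has one column per vocabulary word, so memory usage is high.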

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]

# Build the document-term matrix of token counts
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(documents)

# Rows are documents, columns are vocabulary words
print(X_count.toarray())
print(count_vectorizer.get_feature_names_out())
```

**Output:**

[Figure: Count Vectorizer output]

Advanced Vectorization Techniques in Natural Language Processing (NLP)

1. Word Embeddings

Word embeddings are dense vector representations of words in a continuous vector space where semantically similar words are located closer to each other. These embeddings capture the context of a word, its syntactic role and semantic relationships with other words leading to better performance in various NLP tasks.
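"Located closer to each other" is usually measured with cosine similarity. The sketch below uses hand-picked three-dimensional vectors purely for illustration; they are not trained embeddings, and real models typically use hundreds of dimensions. The function name `cosine_similarity` is defined here, not imported from a library.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# With these toy vectors, "cat" is closer to "dog" than to "car".
print(sim_cat_dog > sim_cat_car)  # True
```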

[Figure: Word Embedding]

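**Advantages:** dense vectors capture semantic and syntactic relationships between words, which typically yields higher accuracy on tasks such as sentiment analysis, named entity recognition and machine translation.

**Disadvantages:** learning the embeddings takes more computation than count-based methods, and the quality of the vectors depends on the data they were trained on.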

2. Image Embeddings

Image embeddings transform images into numerical representations that a model can use for tasks such as image search, object recognition and image generation.
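Once images are represented as vectors, image search reduces to finding the stored vectors nearest to a query vector. The following is a rough sketch under loud assumptions: the file names and numbers are made up, the embeddings would in practice come from a trained model, and `euclidean` and `search` are hypothetical helpers.

```python
import math

# Hypothetical embeddings for a tiny image collection (values are made up).
image_index = {
    "beach.jpg":  [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "ocean.jpg":  [0.8, 0.2, 0.1],
}

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def search(query_vector, index, k=2):
    """Return the k image names whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda name: euclidean(query_vector, index[name]))
    return ranked[:k]

query = [0.9, 0.1, 0.05]  # assumed embedding of a query image
print(search(query, image_index))  # ['beach.jpg', 'ocean.jpg']
```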

[Figure: Image embedding]

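**Advantages:** compact numerical representations make tasks such as image classification, object detection and image retrieval possible by comparing vectors.

**Disadvantages:** producing the embeddings is computationally expensive, and memory usage is moderate to high.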

Comparison of Vectorization Techniques

Let's see a quick comparison between the different techniques:

| Technique | Accuracy | Computation Time | Memory Usage | Applicability |
|---|---|---|---|---|
| **Bag of Words (BoW)** | Low to Moderate | Low | High | Simple text classification tasks |
| **TF-IDF** | Moderate | Moderate | High | Text classification, information retrieval, keyword extraction |
| **Count Vectorizer** | Low to Moderate | Low | High | Tasks focusing on word frequency |
| **Word Embeddings** | High | High | Moderate to High | Sentiment analysis, named entity recognition, machine translation |
| **Image Embeddings** | High | High | Moderate to High | Image classification, object detection, image retrieval |

Choosing the right vectorization technique depends on the specific NLP task, available computational resources and the importance of capturing semantic and contextual information. Traditional techniques like BoW and TF-IDF are simpler and faster but may fall short in capturing the nuanced meaning of text. Advanced techniques like word embeddings and document embeddings provide richer, context-aware representations at the cost of increased computational complexity and memory usage.