What is Text Embedding?

Last Updated : 23 Jul, 2025

**Text embeddings** are vector representations of text that map the original text into a mathematical space where words or sentences with similar meanings are located near each other. In traditional one-hot encoding, each word is a sparse vector with a single '1' at that word's position and '0's everywhere else. Text embeddings are dense vectors instead, so they can capture relationships between words and shades of meaning.

Text embeddings are typically generated by neural network models that learn to map text to vectors such that texts with similar meanings have similar representations. Each vector acts as a compressed representation of the original text that preserves properties such as word similarity and syntactic relationships.
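As a rough illustration of the contrast with one-hot encoding, the sketch below compares cosine similarity for one-hot vectors and for small dense vectors. The three-dimensional dense vectors are hand-picked values for illustration, not output from any trained model, and `cosine_similarity` is a helper defined here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One-hot vectors over a toy 3-word vocabulary: ["cat", "dog", "car"]
one_hot = {
    "cat": [1, 0, 0],
    "dog": [0, 1, 0],
    "car": [0, 0, 1],
}

# Hypothetical dense embeddings (hand-picked numbers, not from a real model)
dense = {
    "cat": [0.90, 0.80, 0.10],
    "dog": [0.85, 0.75, 0.20],
    "car": [0.10, 0.20, 0.95],
}

for name, table in (("one-hot", one_hot), ("dense", dense)):
    cat_dog = cosine_similarity(table["cat"], table["dog"])
    cat_car = cosine_similarity(table["cat"], table["car"])
    print(f"{name:8s} cat~dog={cat_dog:.3f}  cat~car={cat_car:.3f}")
```

Every pair of distinct one-hot vectors has similarity 0, so the encoding carries no notion of relatedness. With the toy dense vectors, "cat" scores much closer to "dog" than to "car".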

Types of Text Embeddings

**1. Word Embeddings**

**2. Sentence Embeddings**

**3. Contextualized Embeddings**

Why Are Text Embeddings Important?

  1. **Capturing Semantic Relationships:** Traditional methods like one-hot encoding cannot capture relationships between words. For example, "cat" and "dog" are closely related in meaning, but one-hot encoding treats them as completely unrelated entities. In an embedding space, such related words are placed close to each other.
  2. **Efficient Representation:** Embeddings give a compact representation of text data. Instead of sparse vectors with one dimension per vocabulary word (as in one-hot encoding), dense vectors with a few hundred dimensions represent the text, reducing memory usage and computation time.
  3. **Transferability:** Pre-trained embeddings and models, such as Word2Vec, GloVe, and BERT, can be reused or fine-tuned for specific tasks. This enables transfer learning, where a model trained on a large dataset is adapted to a smaller, task-specific dataset, reducing the need for large amounts of labeled data.
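To put the efficiency point in rough numbers, the following sketch compares the storage needed for fully materialized one-hot vectors and for dense embeddings. The vocabulary size (50,000), embedding width (300), and 4 bytes per value are illustrative assumptions, not figures from any specific model.

```python
def vector_table_bytes(num_words, dims, bytes_per_value=4):
    """Bytes needed to store one `dims`-wide vector per word."""
    return num_words * dims * bytes_per_value

vocab_size = 50_000      # assumed vocabulary size
embedding_dim = 300      # assumed dense embedding width

# A one-hot vector needs one slot per vocabulary word.
one_hot_bytes = vector_table_bytes(vocab_size, vocab_size)
dense_bytes = vector_table_bytes(vocab_size, embedding_dim)

print(f"one-hot table: {one_hot_bytes / 1e9:.1f} GB")    # 10.0 GB
print(f"dense table:   {dense_bytes / 1e6:.1f} MB")      # 60.0 MB
print(f"reduction:     {one_hot_bytes // dense_bytes}x")  # 166x
```

In practice one-hot vectors are usually stored as indices or sparse matrices rather than materialized, so this is an upper bound on their cost. The dimensionality gap still matters for any layer that has to operate on the vectors.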

Text Embedding using Sentence Transformer (HuggingFace)

```python
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Load pre-trained model and tokenizer
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

texts = ["Hugging Face is great.", "I love NLP."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Extracting embeddings (mean pooling over real tokens only;
# the attention mask keeps padding positions out of the average)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings_np = embeddings.numpy().round(decimals=4)

# Output
print(f"\n{'='*50}")
print("Embedding Generation Report")
print(f"{'='*50}")
print(f"Model: {model_name}")
print(f"Input texts: {texts}")
print(f"\nGenerated embeddings (mean-pooled, dimension: {embeddings.shape[1]}):\n")

for i, (text, embedding) in enumerate(zip(texts, embeddings_np)):
    print(f"{'-'*50}")
    print(f"Text {i+1}: {text}")
    print(f"Embedding shape: {embedding.shape}")
    print(f"First 10 dimensions:\n{embedding[:10]}")
    print(f"Norm: {np.linalg.norm(embedding):.4f}")

print(f"\n{'='*50}")
print(f"Note: These {embeddings.shape[1]}-dimensional vectors can be used for")
print("semantic similarity, clustering, or other NLP tasks.")
print(f"{'='*50}")
```

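**Output:** The script prints a short report: the model name, the input texts, and, for each text, the embedding shape, its first 10 dimensions, and its norm. For `all-MiniLM-L6-v2`, each embedding has 384 dimensions.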
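Because `padding=True` pads shorter inputs up to the longest one in the batch, a plain mean over the token axis also averages in the padding positions. A common variant weights the mean by the attention mask. The sketch below shows the idea on hand-made numbers in plain Python; `masked_mean_pool` is a hypothetical helper, and the token vectors are illustrative values rather than model output.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dims = len(token_vectors[0])
    total = [0.0] * dims
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for d in range(dims):
                total[d] += vector[d]
    return [value / count for value in total]

# One toy sequence: two real tokens followed by two padding positions.
token_vectors = [
    [1.0, 2.0],   # real token
    [3.0, 4.0],   # real token
    [9.0, 9.0],   # padding position (should not affect the result)
    [9.0, 9.0],   # padding position
]
attention_mask = [1, 1, 0, 0]

plain_mean = [sum(col) / len(col) for col in zip(*token_vectors)]
masked_mean = masked_mean_pool(token_vectors, attention_mask)

print("plain mean: ", plain_mean)   # [5.5, 6.0]
print("masked mean:", masked_mean)  # [2.0, 3.0]
```

In PyTorch the same idea is usually written by multiplying `last_hidden_state` by the expanded attention mask, summing over the token axis, and dividing by the mask sum.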

How Are Text Embeddings Used?

Challenges and Limitations

  1. **Interpretability:** Text embeddings are often seen as a "black box" because it’s difficult to directly interpret what a specific dimension in the vector represents. This makes it hard to understand the reasoning behind a model’s decisions.
  2. **Biases:** Like any learned representation, text embeddings can inherit biases present in the training data. If the training data contains gender, racial, or other biases, the embeddings may reflect them, leading to biased predictions and decisions.
  3. **Context Limitations:** While models like BERT provide contextualized embeddings, they may still struggle to capture highly complex or domain-specific relationships. Fine-tuning is often required to adapt embeddings to specific tasks.