Embeddings in Machine Learning

Last Updated : 1 May, 2026

In machine learning, embeddings are a way of representing data as numerical vectors in a continuous space. They capture the meaning or relationship between data points, so that similar items are placed closer together while dissimilar ones are farther apart. This makes it easier for algorithms to work with complex data such as words, images or audio.

Figure: Word embeddings

In the above graph, we observe distinct clusters of related words.
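To make "closer together" concrete, the sketch below compares a few vectors with cosine similarity, one common way to measure closeness. The three-dimensional vectors are made up for illustration and do not come from any trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, invented for this example only
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "computer": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_computer = cosine_similarity(toy_embeddings["cat"], toy_embeddings["computer"])

print(f"cat vs dog:      {sim_cat_dog:.3f}")
print(f"cat vs computer: {sim_cat_computer:.3f}")
```

With these toy values, "cat" scores much higher against "dog" than against "computer", which is the behaviour we expect from a useful embedding.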

Important terms used for Embedding

These terms help understand how embeddings represent and organize data in machine learning.

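1. Vector

A vector is an ordered list of numbers, such as [0.2, 0.7, 0.1], that represents a single data point.

2. Dense Vector

A dense vector is a vector in which most values are non-zero, in contrast to sparse representations such as one-hot encoding.

3. Vector space

A vector space is the mathematical space that contains the vectors. For embeddings, distances and angles between vectors in this space reflect how related the items are.

4. Continuous Vector space

A continuous vector space is one whose coordinates can take any real value, so vectors can vary smoothly rather than being restricted to a fixed set of values.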
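The difference between a sparse and a dense vector can be sketched in a few lines. The vocabulary, the table size and the random values below are all assumptions chosen for illustration; a trained model would learn the table values.

```python
import numpy as np

vocabulary = ["cat", "dog", "computer", "robot"]

def one_hot(word):
    """Sparse representation: a single 1 at the word's index, 0 elsewhere."""
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(word)] = 1.0
    return vec

# Hypothetical dense embedding table: one row of 3 real values per word.
# The values are random here; training would adjust them.
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(vocabulary), 3))

def dense(word):
    """Dense representation: look up the word's row in the table."""
    return embedding_table[vocabulary.index(word)]

sparse_cat = one_hot("cat")
dense_cat = dense("cat")

print(sparse_cat)       # length grows with the vocabulary size
print(dense_cat.shape)  # length is fixed by the chosen dimensionality
```

Note that the dot product of two different one-hot vectors is always zero, so a one-hot encoding cannot express that "cat" is closer to "dog" than to "computer". Dense vectors in a continuous space can.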

Working

Embeddings convert data into numerical vectors that capture meaning and relationships, allowing models to compare and process different types of data effectively.

1. Define similarity signal

First, decide what we want the model to treat as “similar”.

2. Choose dimensionality

Select how many numbers (dimensions) will describe each item. Common choices are 64, 384, 768 or more. More dimensions can capture finer detail but cost more memory and compute.
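The cost of a dimensionality choice is easy to estimate. The helper below is a back-of-the-envelope sketch that assumes vectors are stored as 32-bit floats with no compression or index overhead; the function name is hypothetical.

```python
def index_size_mb(num_items, dims, bytes_per_value=4):
    """Approximate size in megabytes of a dense embedding matrix.

    Assumes float32 storage (4 bytes per value) and no index overhead.
    """
    return num_items * dims * bytes_per_value / 1e6

for dims in (64, 384, 768):
    print(f"{dims:>4} dims -> {index_size_mb(1_000_000, dims):,.0f} MB")
```

For one million items this works out to 256 MB at 64 dimensions, about 1.5 GB at 384 and about 3 GB at 768, so the dimensionality directly drives storage and search cost.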

3. Build the encoder

This is the model that turns our data into a list of numbers (a vector).
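As a rough illustration of what an encoder does, the sketch below maps each token to a row of an embedding table and averages the rows. This is a deliberately simple stand-in, not the architecture any particular model uses; the class name, vocabulary and random table are assumptions.

```python
import numpy as np

class MeanPoolingEncoder:
    """Toy text encoder: look up one vector per token, then average them."""

    def __init__(self, vocabulary, dims=8, seed=0):
        rng = np.random.default_rng(seed)
        self.token_to_id = {tok: i for i, tok in enumerate(vocabulary)}
        # Randomly initialised table; training would adjust these values.
        self.table = rng.normal(size=(len(vocabulary), dims))

    def encode(self, text):
        ids = [self.token_to_id[t] for t in text.lower().split()
               if t in self.token_to_id]
        if not ids:
            # No known tokens: fall back to an all-zero vector
            return np.zeros(self.table.shape[1])
        vec = self.table[ids].mean(axis=0)
        # Scale to unit length so dot products equal cosine similarity
        return vec / np.linalg.norm(vec)

encoder = MeanPoolingEncoder(["the", "cat", "sat", "dog", "ran"])
vector = encoder.encode("The cat sat")
print(vector.shape)
```

Real encoders replace the averaging step with a trained network, but the input-to-vector contract stays the same.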

4. Train with a metric learning objective
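The text does not say which objective is used, so the sketch below assumes triplet loss, one common metric learning objective. It pushes an anchor closer to a similar item (the positive) than to a dissimilar one (the negative) by at least a margin. The margin value and the example vectors are arbitrary.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin), Euclidean d."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_pos - d_neg + margin))

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # close to the anchor
negative = np.array([0.0, 1.0])   # far from the anchor

# Positive is already much closer than the negative: nothing to fix
well_separated = triplet_loss(anchor, positive, negative)

# Swapping the roles violates the margin and produces a positive loss
violated = triplet_loss(anchor, negative, positive)

print(well_separated, violated)
```

During training, the gradient of this loss with respect to the encoder's parameters moves the embeddings until most triplets satisfy the margin.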

5. Negative sampling and batching

Give the model tricky “hard negative” examples (things that seem alike but aren’t) so that it learns to tell them apart better.
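One simple way to find hard negatives inside a batch is to pick, for each item, the most similar item that carries a different label. The function below is a minimal sketch of that idea; the labels and the two-dimensional batch are invented for illustration.

```python
import numpy as np

def hardest_negatives(embeddings, labels):
    """For each item, return the index of the most similar item with a different label."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                         # pairwise cosine similarity
    same_label = labels[:, None] == labels[None, :]
    sims[same_label] = -np.inf                       # mask out positives and self
    return sims.argmax(axis=1)

# Hypothetical batch: two items with label 0 and two items with label 1
batch = np.array([[1.0, 0.0], [0.9, 0.1], [0.6, 0.8], [0.0, 1.0]])
labels = np.array([0, 0, 1, 1])

hard_idx = hardest_negatives(batch, labels)
print(hard_idx)
```

Here the item [0.6, 0.8] is selected as the hard negative for both label-0 items because it lies closer to them than [0.0, 1.0] does.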

6. Validate and Tune

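Test how well our embeddings work by checking retrieval quality on held-out data, for example whether an item's nearest neighbours are truly similar to it.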
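As one possible check (an assumption, since the text does not name a specific metric), recall@k measures how often the known relevant item appears among the k nearest neighbours of a query. The data below is a toy example.

```python
import numpy as np

def recall_at_k(query_vecs, item_vecs, relevant_idx, k=2):
    """Fraction of queries whose relevant item is among the top-k neighbours."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    x = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    # Sort items by descending cosine similarity and keep the first k
    top_k = np.argsort(-(q @ x.T), axis=1)[:, :k]
    hits = [rel in row for rel, row in zip(relevant_idx, top_k)]
    return float(np.mean(hits))

items = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
queries = np.array([[0.9, 0.1], [0.1, 0.9]])
relevant = [0, 1]  # index of the correct item for each query

score = recall_at_k(queries, items, relevant, k=1)
print(score)
```

A score near 1.0 means the embeddings rank the right items first; a low score suggests revisiting the training data, objective or dimensionality.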

7. Index for Fast Retrieval

Store our vectors in a vector database such as Qdrant, or in a similarity-search library such as FAISS, to quickly find the closest matches, even among millions of items.
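To show what such an index computes, the sketch below performs exact nearest-neighbour search with plain NumPy. It is a stand-in for illustration only and does not use the Qdrant or FAISS APIs; those tools add approximate search structures so that lookups stay fast at much larger scale. The class name and random data are assumptions.

```python
import numpy as np

class BruteForceIndex:
    """Exact cosine-similarity search over an in-memory matrix."""

    def __init__(self, vectors):
        vectors = np.asarray(vectors, dtype=np.float32)
        # Normalise once so that a dot product equals cosine similarity
        self.vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def search(self, query, k=3):
        query = np.asarray(query, dtype=np.float32)
        query = query / np.linalg.norm(query)
        scores = self.vectors @ query
        top = np.argsort(-scores)[:k]   # indices of the k highest scores
        return top, scores[top]

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 64))   # 1000 random 64-dimensional vectors

index = BruteForceIndex(database)
ids, scores = index.search(database[42], k=3)
print(ids, scores)
```

Querying with a vector that is already in the database returns that vector first with a similarity of about 1.0, which is a handy sanity check for any index.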

8. Use the embeddings

Once ready, embeddings can be used for tasks such as semantic search, classification and clustering.

Importance

Embeddings are widely used because they represent data in a meaningful and efficient way, helping models understand relationships and perform better across tasks.

Types of Data Represented with Embeddings

Embeddings can represent different types of data by converting them into dense vectors, making it easier for models to understand patterns, relationships and meaning.

1. Words

Word embeddings represent individual words as numeric vectors in which similar words are placed closer together, helping in tasks like sentiment analysis and translation.

2. Complete Text Document

Embedding models represent sentences or documents as vectors capturing overall meaning and context, useful for classification and semantic search.

3. Audio Data

Audio embeddings convert sound signals into vectors that capture acoustic features, enabling tasks like speech recognition and emotion detection. Popular audio embedding techniques include Wav2Vec.

4. Image Data

Image embeddings represent images as vectors, typically using CNN-based models, capturing visual features for tasks like classification and object detection.

5. Graph Data

Graph embeddings convert nodes and relationships into vectors, helping in tasks like link prediction and clustering.

6. Structured Data

Structured data such as feature vectors and tables can be embedded to help machine learning models capture underlying patterns. Common techniques include autoencoders.

Visualization using t-SNE

t-SNE is used to visualize high-dimensional word embeddings by reducing them to a 2D space, helping us understand how similar words are positioned relative to each other.

Step 1: Import Libraries

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import gensim.downloader as api
from gensim.models import Word2Vec
```

Step 2: Load Data and Train Word2Vec Model

This step loads the text8 sample corpus and uses it to train a Word2Vec model, which creates the word vectors.

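```python
# Download (on first use) and load the text8 corpus
corpus = api.load('text8')

# Train Word2Vec with default settings (100-dimensional vectors)
model = Word2Vec(corpus)
```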

Step 3: Select Words and Get Their Embeddings

```python
words = ['cat', 'dog', 'elephant', 'lion', 'bird', 'rat', 'wolf', 'cow',
         'goat', 'snake', 'rabbit', 'human', 'parrot', 'fox', 'peacock',
         'lotus', 'roses', 'marigold', 'jasmine', 'computer', 'robot',
         'software', 'vocabulary', 'machine', 'eye', 'vision', 'grammar',
         'words', 'sentences', 'language', 'verbs', 'noun', 'transformer',
         'embedding', 'neural', 'network', 'optimization']

# Keep only the words that exist in the model's vocabulary
words = [word for word in words if word in model.wv.key_to_index]

# Look up the vector for each remaining word
word_embeddings = [model.wv[word] for word in words]
embeddings = np.array(word_embeddings)
print("Original embedding vector shape", embeddings.shape)
```

Step 4: Reduce Dimensionality with t-SNE

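This step uses t-SNE from scikit-learn to shrink the high-dimensional word vectors into two dimensions for visualization.

```python
# perplexity must be smaller than the number of words being plotted
tsne = TSNE(n_components=2, perplexity=2)
embeddings_2d = tsne.fit_transform(embeddings)
print("After applying t-SNE embedding vector shape", embeddings_2d.shape)
```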

Step 5: Plot Embedding

Displays a scatter plot of the words in 2D space, labels each point with its word and displays the plot.

```python
plt.figure(figsize=(10, 7), dpi=1000)
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], marker='o')

# Label each point with its word
for i, word in enumerate(words):
    plt.text(embeddings_2d[i, 0], embeddings_2d[i, 1], word,
             fontsize=10, ha='left', va='bottom')

plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('Word Embedding Graph (t-SNE with Word2Vec)')
plt.grid(True)
plt.savefig('embedding.png')
plt.show()
```

Output:

Original embedding vector shape (37, 100)
After applying t-SNE embedding vector shape (37, 2)

Figure: Word Embedding Graph (t-SNE with Word2Vec)

Here we can see that snake, cow, bird and other animal words are grouped close together, showing their similarity, whereas computer and machine lie far from the animal cluster, showing their dissimilarity.


Applications

Limitations