UMAP: Uniform Manifold Approximation and Projection

Last Updated : 23 Jul, 2025

Uniform Manifold Approximation and Projection (UMAP) is a powerful dimension reduction technique that has gained significant traction in the fields of machine learning and data visualization. Developed by Leland McInnes, John Healy, and James Melville, UMAP is built on solid mathematical foundations, including Riemannian geometry and algebraic topology. This article delves into the technical aspects of UMAP, its underlying principles, implementation, and practical applications.

Introduction to UMAP

UMAP is a manifold learning technique that aims to reduce the dimensionality of data while preserving its topological structure. It is particularly useful for visualizing high-dimensional datasets in a low-dimensional space, typically two or three dimensions. UMAP is often compared to t-SNE (t-distributed Stochastic Neighbor Embedding) because both are used for data visualization, but UMAP tends to preserve more of the global data structure and usually runs faster.

Mathematical Foundations for UMAP

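UMAP is grounded in several key mathematical concepts, notably Riemannian geometry, which motivates treating the data as uniformly distributed on a manifold, and algebraic topology, which supplies the fuzzy simplicial sets used to represent the structure of the data.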

The UMAP Algorithm

The UMAP algorithm can be broken down into two main phases: constructing a fuzzy topological representation and optimizing the low-dimensional embedding.

1. Constructing the Fuzzy Topological Representation

  1. **Nearest Neighbor Search:** UMAP begins by finding the nearest neighbors for each data point. This is typically done using approximate nearest neighbor algorithms to speed up the process.
  2. **Fuzzy Simplicial Set:** A fuzzy simplicial set is constructed from the nearest neighbors. This set captures the local connectivity of the data points.
  3. **Fuzzy Membership Strengths:** Membership strengths are assigned to the edges of the simplicial set, representing the probability that two points are connected.
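The three steps above can be sketched in plain NumPy. This is a simplified illustration and not the umap-learn implementation: it uses a brute-force neighbor search instead of an approximate one, the function name `fuzzy_graph` and its parameters are hypothetical, and the bandwidth search range is an assumption that suits small, roughly unit-scale data.

```python
import numpy as np

def fuzzy_graph(X, k=5, n_iter=64):
    """Return a symmetric (n, n) matrix of fuzzy membership strengths (sketch)."""
    n = X.shape[0]
    # Step 1: brute-force pairwise Euclidean distances. Real implementations
    # use approximate nearest neighbor search instead.
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # Column 0 of the sorted order is the point itself (distance 0), so skip it.
    # This assumes the data contains no duplicate points.
    idx = np.argsort(dists, axis=1)[:, 1:k + 1]
    knn_d = np.take_along_axis(dists, idx, axis=1)

    target = np.log2(k)  # each point's strengths are calibrated to sum to log2(k)
    W = np.zeros((n, n))
    for i in range(n):
        rho = knn_d[i, 0]        # distance to the closest neighbor
        lo, hi = 1e-6, 1e3       # assumed search range for the bandwidth sigma
        for _ in range(n_iter):  # bisection: the sum grows with sigma
            sigma = 0.5 * (lo + hi)
            w = np.exp(-np.maximum(knn_d[i] - rho, 0.0) / sigma)
            if w.sum() > target:
                hi = sigma
            else:
                lo = sigma
        # Steps 2 and 3: directed edges from i to its neighbors, with strengths.
        W[i, idx[i]] = w
    # Combine the two directions of each edge with a fuzzy union: a + b - a*b.
    return W + W.T - W * W.T
```

Calling `fuzzy_graph(X)` on an array of shape `(n_samples, n_features)` gives a matrix whose entry `(i, j)` can be read as the strength of the connection between points `i` and `j`, with zeros for pairs that are not neighbors in either direction.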

2. Optimizing the Low-Dimensional Embedding

  1. **Cross-Entropy Optimization:** The low-dimensional embedding is optimized to minimize the cross-entropy between the fuzzy simplicial set in the high-dimensional space and the low-dimensional space.
  2. **Stochastic Gradient Descent:** UMAP uses stochastic gradient descent to perform the optimization. This involves iteratively adjusting the positions of the points in the low-dimensional space to better match the fuzzy topological structure.
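To make the optimization phase concrete, here is a rough NumPy sketch of the cross-entropy objective and one epoch of stochastic updates. Several things are assumptions made for illustration: the low-dimensional similarity uses fixed curve parameters `a = b = 1` (umap-learn fits these from `min_dist`), edges are sampled with probability equal to their strength, the loss is evaluated densely over all pairs, and the function names are hypothetical.

```python
import numpy as np

def fuzzy_cross_entropy(P, Y, eps=1e-12):
    """Embedding-dependent part of the cross-entropy between strengths P and embedding Y."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    Q = np.clip(1.0 / (1.0 + d2), eps, 1.0 - eps)  # low-dim similarity with a = b = 1
    off_diag = ~np.eye(len(P), dtype=bool)          # ignore self-pairs
    p, q = P[off_diag], Q[off_diag]
    return float(-(p * np.log(q) + (1.0 - p) * np.log(1.0 - q)).sum())

def sgd_epoch(P, Y, rng, lr=1.0):
    """One pass over the edges of P, updating the float array Y in place."""
    n = len(P)
    rows, cols = np.nonzero(P)
    for i, j in zip(rows, cols):
        if rng.random() > P[i, j]:
            continue  # assumed sampling rule: stronger edges are visited more often
        # Attractive step: pull i towards its neighbor j.
        diff = Y[i] - Y[j]
        d2 = diff @ diff
        Y[i] += lr * np.clip(-2.0 / (1.0 + d2) * diff, -4.0, 4.0)
        # Repulsive step: push i away from one randomly chosen point k.
        k = rng.integers(n)
        if k != i:
            diff = Y[i] - Y[k]
            d2 = diff @ diff
            Y[i] += lr * np.clip(2.0 / ((0.001 + d2) * (1.0 + d2)) * diff, -4.0, 4.0)
    return Y

# Toy usage: two groups of points that are fully connected internally.
rng = np.random.default_rng(0)
P = np.zeros((20, 20))
P[:10, :10] = 1.0
P[10:, 10:] = 1.0
np.fill_diagonal(P, 0.0)
Y = rng.normal(scale=5.0, size=(20, 2))  # random initial embedding
loss_before = fuzzy_cross_entropy(P, Y)
for epoch in range(100):
    sgd_epoch(P, Y, rng, lr=1.0 - epoch / 100)  # linearly decaying learning rate
loss_after = fuzzy_cross_entropy(P, Y)
print(loss_before, loss_after)
```

Clipping each update keeps a single step from throwing a point far away when two points nearly coincide, and the decaying learning rate lets the layout settle.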

Practical Implementation of UMAP

UMAP is implemented in Python and is compatible with the scikit-learn API, making it easy to integrate into existing machine learning pipelines. The following steps outline the basic usage of UMAP:

**Installation:** UMAP can be installed via conda or pip:

pip install umap-learn

Example 1: UMAP to generate Two-dimensional data

In this example we generate random 4-dimensional data and use UMAP to embed it in 2 dimensions.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import umap

# IPython/Jupyter magic: render plots inline (notebook only)
%matplotlib inline

sns.set(style='white', context='poster', rc={'figure.figsize': (14, 10)})

# 700 random points in 4 dimensions, each value in [0, 1)
np.random.seed(44)
data = np.random.rand(700, 4)

# Fit UMAP with default settings and time the call (%time is a notebook magic)
fit = umap.UMAP()
%time u = fit.fit_transform(data)
```

**Output:**

CPU times: user 20.8 s, sys: 696 ms, total: 21.5 s
Wall time: 32.9 s

The resulting value u is a 2-dimensional representation of the data. We can visualise it as a scatter plot of u, colouring each point by its 4-dimensional source value, which matplotlib interprets as an RGBA colour.

```python
plt.scatter(u[:, 0], u[:, 1], c=data)
plt.title('UMAP embedding of random colours');
```

**Output:**

[Figure: UMAP to generate two-dimensional data]

In the plot, points with similar colours sit close together, which shows that points that were close in the 4-dimensional space remain close in the 2-dimensional embedding.
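One way to check this impression numerically is to measure how many of each point's nearest neighbors are shared between the original space and the embedding. The helper below is a hypothetical sketch written for this article, not part of umap-learn, and it uses a brute-force search that is only practical for small datasets.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest neighbors of each row of X (brute force)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def neighbour_overlap(high, low, k=10):
    """Mean fraction of each point's k nearest neighbors shared by both spaces."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k
```

With the variables from the example above, `neighbour_overlap(data, u)` returns a value between 0 and 1, where higher values mean that more local neighborhoods survived the reduction.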

Example 2: UMAP to Reduce the Dimensionality

Here is a simple example of using UMAP to reduce the dimensionality of the digits dataset from scikit-learn:

```python
import numpy as np
import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load the digits dataset and standardise each feature
digits = load_digits()
data = digits.data
scaler = StandardScaler()
data = scaler.fit_transform(data)

# Apply UMAP
reducer = umap.UMAP()
embedding = reducer.fit_transform(data)

# Plot the 2-dimensional embedding, coloured by digit label
plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap='Spectral', s=5)
plt.colorbar(boundaries=np.arange(11)-0.5).set_ticks(np.arange(10))
plt.show()
```

**Output:**

[Figure: UMAP to reduce the dimensionality of the digits dataset]

Advantages of UMAP

Compared with other dimension reduction techniques such as t-SNE, UMAP tends to preserve more of the global data structure, computes faster, and scales to larger datasets.

Advanced Usage and Customization

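UMAP offers several advanced features and customization options, such as tuning the neighborhood size, the minimum distance between embedded points, the number of output dimensions, and the distance metric, as well as projecting new data into an existing embedding.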

Conclusion

UMAP is a powerful and versatile dimension reduction technique that has become a popular tool for data visualization and analysis. Its solid mathematical foundations, scalability, and ability to preserve both local and global data structures make it an excellent choice for a wide range of applications. Whether you are a data scientist, bioinformatician, or machine learning practitioner, UMAP can help you unlock new insights from your high-dimensional data.