Supervised Learning

What is supervised learning?

Supervised learning is a machine learning technique that uses labeled data sets to train artificial intelligence (AI) models to identify the underlying patterns and relationships. The goal of the learning process is to create a model that can predict correct outputs on new real-world data.

Labeled data sets consist of sample data points along with the correct outputs or answers. As input data is fed into the machine learning algorithm, it adjusts its parameters until the model has been fitted appropriately. Labeled training data provides a “ground truth,” explicitly teaching the model to identify the relationships between features and data labels.

Supervised machine learning helps organizations solve various real-world problems at scale, such as classifying spam or predicting stock prices. It can be used to build highly accurate machine learning models.

What is ground truth data?

Ground truth data is verified against real-world outcomes, often through human annotation or measurement, and is used to train, validate and test models. As the name implies, ground truth data has been confirmed to be true—it is reflective of real-world values and outcomes. Ground truth reflects the ideal outputs for any given input data.

Supervised learning relies on ground truth data to teach a model the relationships between inputs and outputs. The labeled datasets used in supervised learning are ground truth data. Trained models apply their understanding of that data to make predictions based on new, unseen data.


How supervised learning works

Supervised learning techniques use a labeled training dataset to understand the relationships between inputs and output data. Data scientists manually create ground truth training datasets containing input data along with the corresponding labels. Supervised learning trains the model to apply the correct outputs to unseen data in real-world use cases.

During training, the model’s algorithm processes large datasets to explore potential correlations between inputs and outputs. Then, model performance is evaluated on held-out test data to determine whether training was successful. Cross-validation extends this check by repeatedly testing the model on different portions of the dataset.
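The idea of cross-validation can be sketched in a few lines: the dataset is split into k folds, and each fold serves once as the test set while the remaining folds train the model. This is a minimal illustration on made-up sample indices, not a full validation pipeline.

```python
# Minimal k-fold cross-validation split sketch on hypothetical data.
# Each fold serves once as the test set while the rest would train the model.

def k_fold_splits(data, k):
    """Yield (train, test) partitions of `data` for k-fold cross-validation."""
    fold_size = len(data) // k
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

samples = list(range(10))          # stand-in for a labeled dataset
splits = list(k_fold_splits(samples, k=5))
```

Every sample appears in exactly one test fold, so the model is eventually evaluated on all of the data without ever testing on examples it trained on in that round.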

The gradient descent family of algorithms, including stochastic gradient descent (SGD), is the most commonly used set of optimization algorithms, or learning algorithms, for training neural networks and other machine learning models. The model’s optimization algorithm assesses accuracy through the loss function: an equation that measures the discrepancy between the model’s predictions and actual values.

The loss function measures how far off predictions are from actual values. Its gradient indicates the direction in which the model’s parameters should be adjusted to reduce error. Throughout training, the optimization algorithm updates the model’s parameters—its operating rules or “settings”—to optimize the model.
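The loop above can be made concrete with a toy example. The sketch below fits a line y = w*x + b to noiseless, made-up data using stochastic gradient descent on a squared-error loss; the data and learning rate are illustrative choices, not a recommended recipe.

```python
import random

# Sketch of stochastic gradient descent fitting y = w*x + b on toy data.
# The loss is squared error; its gradient tells us how to nudge w and b.

random.seed(0)
data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]    # ground truth: w=2, b=1

w, b = 0.0, 0.0
lr = 0.01                                            # learning rate
for epoch in range(200):
    random.shuffle(data)                             # "stochastic": one sample at a time
    for x, y in data:
        pred = w * x + b
        error = pred - y                             # gradient of 0.5*error**2 w.r.t. pred
        w -= lr * error * x                          # gradient step for the weight
        b -= lr * error                              # gradient step for the bias
```

After training, w and b land very close to the true values of 2 and 1: each parameter update moves the model a small step in the direction that reduces the loss.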

Because large datasets typically contain many features, data scientists can simplify this complexity through dimensionality reduction. This data science technique reduces the number of features to those most crucial for predicting data labels, which preserves accuracy while increasing efficiency.
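One simple form of dimensionality reduction is to drop features that barely vary, since a near-constant column carries little signal for predicting labels. The sketch below implements variance-threshold feature selection on hypothetical data; more powerful techniques such as principal component analysis follow the same goal of keeping fewer, more informative dimensions.

```python
# Variance-threshold feature selection: a simple dimensionality reduction
# sketch that drops columns whose values barely vary (hypothetical data).

def feature_variances(rows):
    """Per-column variance of a list of equal-length feature rows."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    return [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]

def select_features(rows, threshold):
    """Keep only columns whose variance meets `threshold`."""
    keep = [i for i, var in enumerate(feature_variances(rows)) if var >= threshold]
    return [[row[i] for i in keep] for row in rows], keep

# The third feature is constant, so it carries no predictive signal.
X = [[1.0, 10.0, 5.0],
     [2.0, 20.0, 5.0],
     [3.0, 30.0, 5.0]]
X_reduced, kept = select_features(X, threshold=0.1)
```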

An example of supervised learning in action

As an example of supervised learning, consider an image classification model created to recognize images of vehicles and determine which type of vehicle they are. Such a model can power the CAPTCHA tests many websites use to detect spam bots.

To train this model, data scientists prepare a labeled training dataset containing numerous vehicle examples along with the corresponding vehicle type: car, motorcycle, truck, bicycle and more. The model’s algorithm attempts to identify the patterns in the training data that cause an input—vehicle images—to receive a designated output—vehicle type.

The model’s guesses are measured against actual data values in a test set to determine whether it has made accurate predictions. If not, the training cycle continues until the model’s performance has reached a satisfactory level of accuracy. The principle of generalization refers to a trained model’s ability to make appropriate predictions on new data from the same distribution as its training data.

Types of supervised learning

Supervised learning tasks can be broadly divided into classification and regression problems:

Classification

Classification in machine learning uses an algorithm to sort data into categories. It recognizes specific entities within the dataset and attempts to determine how those entities should be labeled or defined. Common classification algorithms are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor (KNN), logistic regression and random forest.
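Of the algorithms listed, k-nearest neighbor is perhaps the easiest to see at a glance: a new point simply takes the majority label among its k closest labeled training samples. The points and labels below are invented for illustration.

```python
from collections import Counter
import math

# K-nearest-neighbor (KNN) classification sketch on made-up 2-D points:
# a new point takes the majority label among its k closest training samples.

def knn_predict(train, point, k=3):
    """train: list of ((x, y), label); return the majority label of the k nearest."""
    neighbors = sorted(train, key=lambda s: math.dist(s[0], point))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

train = [((0, 0), "cat"), ((0, 1), "cat"), ((1, 0), "cat"),
         ((5, 5), "dog"), ((5, 6), "dog"), ((6, 5), "dog")]
```

A query near the first cluster is labeled "cat" and one near the second is labeled "dog", with no explicit training step: KNN defers all work to prediction time.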

Neural networks excel at handling complex classification problems. A neural network is a deep learning architecture that processes training data with layers of nodes that mimic the human brain. Each node is made up of inputs, weights, a bias (or threshold) and an output. If an output value exceeds a preset threshold, the node “fires” or activates, passing data to the next layer in the network.
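A single node’s behavior can be written out directly. The sketch below uses a step activation and hypothetical weights chosen so the node computes a logical AND of two binary inputs; real networks learn their weights and typically use smoother activation functions.

```python
# Sketch of a single neural-network node: weighted inputs plus a bias pass
# through a step activation, and the node "fires" when the sum crosses zero.

def node_output(inputs, weights, bias):
    """Return 1 if the weighted sum plus bias exceeds zero, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0

# Hypothetical weights implementing a logical AND of two binary inputs:
# the node fires only when both inputs are 1 (1 + 1 - 1.5 > 0).
weights, bias = [1.0, 1.0], -1.5
```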

Regression

Regression is used to understand the relationship between dependent and independent variables. In regression problems, the output is a continuous value, and models attempt to predict the target output. Regression tasks include projections for sales revenue or financial planning.

Regression algorithms include linear regression, lasso regression, ridge regression and polynomial regression.
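Linear regression, the simplest of these, even has a closed-form solution. The sketch below fits a line to invented monthly revenue figures using the standard ordinary-least-squares formulas, standing in for the sales-projection use case mentioned above.

```python
# Ordinary least-squares linear regression sketch on made-up sales data:
# fit y = w*x + b using the standard closed-form formulas.

def fit_line(xs, ys):
    """Return slope w and intercept b minimizing squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

months = [1, 2, 3, 4, 5]
revenue = [12.0, 14.0, 16.0, 18.0, 20.0]   # perfectly linear for illustration
w, b = fit_line(months, revenue)
```

The fitted model predicts a continuous value for any month, which is exactly what distinguishes regression from classification’s discrete categories.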

Ensemble learning

Ensemble learning is a meta-approach to supervised learning in which multiple models are trained on the same classification or regression task. The results of all the models in the pool are aggregated to discover the best overall approach to solving the challenge.

The individual algorithms within the larger ensemble model are known as weak learners or base models. Some weak learners have high bias, while others have high variance. In theory, the results mitigate the bias-variance tradeoff by combining the best parts of each.
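The simplest aggregation rule is majority voting. Below, three hypothetical weak learners, each a crude hand-written rule rather than a trained model, vote on whether a message is spam; the ensemble’s answer is the majority label.

```python
from collections import Counter

# Ensemble sketch: three weak learners vote on a label and the majority wins.
# Each "learner" is a hypothetical hand-written rule, not a trained model.

def learner_a(text): return "spam" if "win" in text else "ham"
def learner_b(text): return "spam" if "$" in text else "ham"
def learner_c(text): return "spam" if len(text) < 20 else "ham"

def ensemble_predict(text, learners):
    """Return the majority vote across all learners in the pool."""
    votes = Counter(model(text) for model in learners)
    return votes.most_common(1)[0][0]

learners = [learner_a, learner_b, learner_c]
```

No single rule is reliable on its own, but their combined vote is harder to fool, which is the intuition behind mitigating the bias-variance tradeoff.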

Supervised learning algorithms

Optimization algorithms such as gradient descent are used to train the wide range of machine learning models that excel at supervised learning tasks.

Supervised learning versus other learning methods

Supervised learning is not the only learning method for training machine learning models. Other types of machine learning include:

Supervised versus unsupervised learning

The difference between supervised learning and unsupervised learning is that unsupervised machine learning uses unlabeled data without any objective ground truth. The model is left to discover patterns and relationships in the data on its own. Many generative AI models are initially trained with unsupervised learning and later with supervised learning to increase domain expertise.

Unsupervised learning can help solve for clustering or association problems in which common properties within a dataset are uncertain. Common clustering algorithms are hierarchical, K-means and Gaussian mixture models.
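K-means illustrates the contrast with supervised learning well: no labels appear anywhere. The sketch below, on made-up 2-D points with hand-picked starting centroids, alternates between assigning points to their nearest centroid and moving each centroid to the mean of its assigned points.

```python
import math

# Tiny k-means clustering sketch (unsupervised: no labels involved).
# Points join their nearest centroid; centroids then move to the mean of
# their assigned points, repeating for a fixed number of iterations.

def k_means(points, centroids, iterations=10):
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroid
                     for cluster, centroid in zip(clusters, centroids)]
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])
```

The algorithm discovers the two natural groups on its own, which is precisely the "patterns and relationships" discovery that unsupervised learning performs without ground truth.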


Supervised versus semi-supervised learning

Semi-supervised learning involves training a model on a small portion of labeled input data along with a larger portion of unlabeled data. Because it can be time-consuming and costly to rely on domain expertise to label data appropriately for supervised learning, semi-supervised learning can be an appealing alternative.


Supervised versus self-supervised learning

Self-supervised learning (SSL) is often described as bridging supervised and unsupervised learning. Rather than use the manually created labels of supervised learning datasets, SSL tasks are configured so that the model can generate its own supervisory signals—implicit or pseudo-labels—and discern ground truth from unstructured data. Then, the model’s loss function uses those labels in place of actual labels to assess model performance.
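A classic way a model generates its own supervisory signal is next-word prediction: raw text is turned into (context, target) pairs where the "label" is simply the word that follows. The helper below is a hypothetical illustration of that labeling step, not a full SSL training loop.

```python
# Sketch of how self-supervised learning manufactures its own labels:
# a next-word prediction task turns raw text into (context, target) pairs,
# where the pseudo-label is simply the following word in the corpus.

def make_pseudo_labels(text, context_size=2):
    """Return (context_words, next_word) training pairs from unlabeled text."""
    words = text.split()
    pairs = []
    for i in range(len(words) - context_size):
        context = tuple(words[i:i + context_size])
        target = words[i + context_size]       # the pseudo-label
        pairs.append((context, target))
    return pairs

pairs = make_pseudo_labels("the quick brown fox jumps")
```

No human annotated anything: the structure of the data itself supplies both the inputs and the "correct" outputs the loss function will score against.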

SSL is often used with transfer learning, a process in which a pretrained model is applied to a downstream task. Self-supervised learning sees widespread use in computer vision and natural language processing (NLP) tasks requiring large datasets that are prohibitively expensive and time-consuming to label.


Supervised versus reinforcement learning

Reinforcement learning trains autonomous agents, such as robots and self-driving cars, to make decisions through environmental interactions. Reinforcement learning does not use labeled data and also differs from unsupervised learning in that it teaches by trial-and-error and reward, not by identifying underlying patterns within datasets.
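The trial-and-error flavor can be shown in miniature with a two-armed bandit: an epsilon-greedy agent learns which of two slot-machine arms pays better purely from observed reward, with no labeled examples anywhere. The payout probabilities below are invented and hidden from the agent.

```python
import random

# Reinforcement-learning flavor in miniature: an epsilon-greedy agent learns
# which of two arms pays better purely from trial-and-error reward.

random.seed(42)
true_payout = {"A": 0.2, "B": 0.8}             # hidden from the agent
estimates = {"A": 0.0, "B": 0.0}               # the agent's learned value estimates
counts = {"A": 0, "B": 0}
epsilon = 0.1                                  # exploration rate

for step in range(2000):
    if random.random() < epsilon:
        arm = random.choice(["A", "B"])            # explore a random arm
    else:
        arm = max(estimates, key=estimates.get)    # exploit the current best guess
    reward = 1.0 if random.random() < true_payout[arm] else 0.0
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]   # running mean
```

Over time the agent pulls the better arm far more often and its value estimate converges toward the true payout rate: learning driven entirely by reward, not by dataset patterns or labels.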


Real-world supervised learning use cases

Supervised learning models can build and advance business applications such as spam classification, image recognition and stock price prediction.

Challenges of supervised learning

Although supervised learning can offer businesses advantages such as deep data insights and improved automation, it might not be the best choice for all situations.
