predict

Predict labels using classification tree model

Syntax

label = predict(tree,X)
label = predict(tree,X,Subtrees=subtrees)
[label,score,node,cnum] = predict(___)

Description

label = predict(tree,X) returns a vector of predicted class labels for the predictor data in the table or matrix X, based on the trained classification tree tree.


label = predict(tree,X,Subtrees=subtrees) also prunes tree to the level specified by subtrees before predicting labels.


[label,score,node,cnum] = predict(___) also returns a matrix of posterior probabilities (score), a vector of node numbers for the predicted classes (node), and a vector of class numbers corresponding to the predicted labels (cnum), using any of the input argument combinations in the previous syntaxes.


Examples


Examine predictions for a few rows in a data set left out of training.

Load Fisher's iris data set.

load fisheriris

Partition the data into training (50%) and validation (50%) sets.

n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true;
idxVal = idxTrn == false;

Grow a classification tree using the training set.

Mdl = fitctree(meas(idxTrn,:),species(idxTrn));

Predict labels for the validation data, and display several predicted labels. Count the number of misclassified observations.

label = predict(Mdl,meas(idxVal,:));
label(randsample(numel(label),5))

ans = 5×1 cell
    {'setosa'    }
    {'setosa'    }
    {'setosa'    }
    {'virginica' }
    {'versicolor'}

numMisclass = sum(~strcmp(label,species(idxVal)))

numMisclass = 3

The software misclassifies three out-of-sample observations.
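To see which classes the errors fall into, you can cross-tabulate the true and predicted labels. A minimal sketch, assuming the variables from this example are still in the workspace:

% Cross-tabulate true versus predicted classes for the validation set
C = confusionmat(species(idxVal),label)   % rows: true classes, columns: predicted classes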

Estimate class posterior probabilities using subtrees pruned to different levels.

Load Fisher's iris data set.

load fisheriris

Partition the data into training (50%) and validation (50%) sets.

n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true;
idxVal = idxTrn == false;

Grow a classification tree using the training set, and then view it.

Mdl = fitctree(meas(idxTrn,:),species(idxTrn));
view(Mdl,"Mode","graph")

Figure: Classification Tree Viewer displaying the trained tree.

The resulting tree has four levels.

Estimate posterior probabilities for the test set using subtrees pruned to levels 1 and 3. Display several posterior probabilities.

[~,Posterior] = predict(Mdl,meas(idxVal,:), ...
    Subtrees=[1 3]);
Mdl.ClassNames

ans = 3×1 cell
    {'setosa'    }
    {'versicolor'}
    {'virginica' }

Posterior(randsample(size(Posterior,1),5),:,:)

ans =

ans(:,:,1) =

1.0000         0         0
1.0000         0         0
1.0000         0         0
     0         0    1.0000
     0    0.8571    0.1429

ans(:,:,2) =

0.3733    0.3200    0.3067
0.3733    0.3200    0.3067
0.3733    0.3200    0.3067
0.3733    0.3200    0.3067
0.3733    0.3200    0.3067

The elements of Posterior are class posterior probabilities: rows correspond to observations in the validation set, columns correspond to the classes in Mdl.ClassNames, and pages correspond to the pruning levels in Subtrees (the first page corresponds to level 1 and the second page to level 3).

The subtree pruned to level 1 is more sure of its predictions than the subtree pruned to level 3 (that is, the root node).
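To quantify that difference, a short sketch (assuming this example's workspace) requests labels for both pruning levels and counts the misclassifications of each subtree; label has one column per element of Subtrees:

% Count misclassifications for the subtrees pruned to levels 1 and 3
labels13 = predict(Mdl,meas(idxVal,:),Subtrees=[1 3]);   % N-by-2 cell array of labels
nErr = [sum(~strcmp(labels13(:,1),species(idxVal))) ...
        sum(~strcmp(labels13(:,2),species(idxVal)))]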

Input Arguments


Predictor data to be classified, specified as a numeric matrix or a table.

Each row of X corresponds to one observation, and each column corresponds to one variable.

For a numeric matrix, the columns of X must be in the same order as the predictor variables that trained tree.

For a table, the predictor variables in X must have the same variable names and data types as the variables that trained tree.

Data Types: table | double | single

Pruning level, specified as a vector of nonnegative integers in ascending order or "all". predict prunes tree to each level in subtrees before predicting labels.

Data Types: single | double | char | string

Output Arguments


Predicted class labels, returned as a categorical or character array, logical or numeric vector, or cell array of character vectors. Each entry of label corresponds to the class with the minimal expected cost for the corresponding row of X.

Suppose subtrees is a numeric vector containing T elements, and X has N rows. Then label is an N-by-T array, and label(:,j) contains the labels predicted by the subtree pruned to level subtrees(j).

Posterior probabilities, returned as a numeric matrix of size N-by-K, where N is the number of observations (rows) in X, and K is the number of classes (in tree.ClassNames). score(i,j) is the posterior probability that row i in X is of class j in tree.ClassNames.

If subtrees has T elements, and X has N rows, then score is an N-by-K-by-T array, and node and cnum are N-by-T matrices.

Node numbers for the predicted classes, returned as a numeric vector. Each entry corresponds to the predicted node in tree for the corresponding row of X.

Class numbers corresponding to the predicted labels, returned as a numeric vector. Each entry of cnum corresponds to the predicted class number for the corresponding row of X.
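For a concrete view of these dimensions, the sketch below (assuming a tree trained on the Fisher iris data, as in the examples above) requests all four outputs for two pruning levels and checks their sizes:

% Inspect the shapes of the predict outputs for two pruning levels
load fisheriris
Mdl = fitctree(meas,species);
[label,score,node,cnum] = predict(Mdl,meas,Subtrees=[0 1]);
size(label)   % N-by-T predicted labels
size(score)   % N-by-K-by-T posterior probabilities
size(node)    % N-by-T node numbers
size(cnum)    % N-by-T class numbers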

More About


predict classifies by minimizing the expected misclassification cost:

ŷ = argmin over y = 1,…,K of  Σ_(j=1)^K P̂(j|x) C(y|j)

where:

ŷ is the predicted classification.

K is the number of classes.

P̂(j|x) is the posterior probability of class j for observation x.

C(y|j) is the cost of classifying an observation as y when its true class is j.

For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node.

For an example, see Posterior Probability Definition for Classification Tree.

The true misclassification cost is the cost of classifying an observation into an incorrect class.

You can set the true misclassification cost per class by using the Cost name-value argument when you create the classifier. Cost(i,j) is the cost of classifying an observation into class j when its true class is i. By default, Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j. In other words, the cost is 0 for correct classification and 1 for incorrect classification.
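As a hedged sketch of how a custom cost matrix enters this workflow (the penalty values below are arbitrary and only for illustration), you can pass Cost to fitctree when training, and predict then minimizes the expected cost under that matrix:

% Train with an asymmetric cost matrix (values are arbitrary)
load fisheriris
C = [0 1 1; 1 0 5; 1 1 0];   % C(i,j): cost of predicting class j when the true class is i
MdlCost = fitctree(meas,species,Cost=C, ...
    ClassNames={'setosa','versicolor','virginica'});
labelCost = predict(MdlCost,meas);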

The expected misclassification cost per observation is an averaged cost of classifying the observation into each class.

Suppose you have Nobs observations that you want to classify with a trained classifier, and you have K classes. You place the observations into a matrix X with one observation per row.

The expected cost matrix CE has size Nobs-by-K. Each row of CE contains the expected (average) cost of classifying the observation into each of the K classes. CE(n,k) is

Σ_(i=1)^K P̂(i|Xn) C(k|i)

where:

K is the number of classes.

P̂(i|Xn) is the posterior probability of class i for observation Xn.

C(k|i) is the true misclassification cost of classifying an observation as k when its true class is i.
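The sketch below, assuming a tree trained on the iris data, reproduces this computation from the predict outputs: the expected cost matrix is the posterior probability matrix times the cost matrix, and each predicted label corresponds to the column with the minimal expected cost.

% Recompute the expected misclassification cost from the posterior probabilities
load fisheriris
Mdl = fitctree(meas,species);
[label,score] = predict(Mdl,meas);
CE = score*Mdl.Cost;                  % CE(n,k) = sum over i of score(n,i)*Cost(i,k)
[~,kmin] = min(CE,[],2);              % column with the minimal expected cost
isequal(label,Mdl.ClassNames(kmin))   % expected to return logical 1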

The predictive measure of association is a value that indicates the similarity between decision rules that split observations. Among all possible decision splits that are compared to the optimal split (found by growing the tree), the best surrogate decision split yields the maximum predictive measure of association. The second-best surrogate split has the second-largest predictive measure of association.

Suppose xj and xk are predictor variables j and k, respectively, and j ≠ k. At node t, the predictive measure of association between the optimal split xj < u and a surrogate split xk < v is

λjk = ( min(PL,PR) − (1 − PLjLk − PRjRk) ) / min(PL,PR)

where PL and PR are the proportions of observations at node t that satisfy xj < u and xj ≥ u, respectively, and PLjLk and PRjRk are the proportions that satisfy both xj < u and xk < v, and both xj ≥ u and xk ≥ v, respectively. Observations with missing values for xj or xk do not contribute to these proportions.

λjk is a value in (–∞,1]. If λjk > 0, then xk < v is a worthwhile surrogate split for xj < u.

Algorithms

predict generates predictions by following the branches of tree until it reaches a leaf node or a missing value. If predict reaches a leaf node, it returns the classification of that node.

If predict reaches a node with a missing value for a predictor, its behavior depends on the setting of the Surrogate name-value argument when fitctree constructs tree. If Surrogate is "off" (the default), predict returns the label with the largest number of training samples that reach the node. If Surrogate is "on", predict uses the best surrogate split at the node; if all surrogate split variables with a positive predictive measure of association have missing values, predict falls back to the label with the largest number of training samples that reach the node.
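A minimal sketch of this behavior, assuming the iris data and an artificially introduced missing value:

% Predict for an observation with a missing predictor value
load fisheriris
MdlSurr = fitctree(meas,species,Surrogate="on");   % grow surrogate splits
Xnew = meas(1,:);
Xnew(3) = NaN;                                     % third predictor is missing
label = predict(MdlSurr,Xnew)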

Alternative Functionality

To integrate the prediction of a classification tree model into Simulink®, you can use the ClassificationTree Predict block in the Statistics and Machine Learning Toolbox™ library or a MATLAB® Function block with the predict function. For examples, see Predict Class Labels Using ClassificationTree Predict Block and Predict Class Labels Using MATLAB Function Block.

When deciding which approach to use, consider the following:

Extended Capabilities


This function fully supports tall arrays. You can use models trained on either in-memory or tall data with this function.

For more information, see Tall Arrays.
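A minimal sketch of tall-array prediction, assuming a tree trained on in-memory data:

% Predict on a tall array
load fisheriris
Mdl = fitctree(meas,species);   % model trained on in-memory data
tX = tall(meas);                % tall version of the predictor data
tLabel = predict(Mdl,tX);
label = gather(tLabel);         % evaluate the deferred computation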

Usage notes and limitations:

For more information, see Introduction to Code Generation.
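As a hedged sketch of the usual code generation workflow (the file and function names are illustrative), save the trained model with saveLearnerForCoder, load it inside an entry-point function with loadLearnerForCoder, and then generate code for that function:

% Save the trained model for code generation
load fisheriris
Mdl = fitctree(meas,species);
saveLearnerForCoder(Mdl,"treeModel");   % creates treeModel.mat

% Entry-point function (save as predictIris.m):
% function label = predictIris(X) %#codegen
% Mdl = loadLearnerForCoder("treeModel");
% label = predict(Mdl,X);
% end

codegen predictIris -args {coder.typeof(meas,[Inf 4],[1 0])}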

Usage notes and limitations:

For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
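A minimal sketch of GPU prediction, assuming a supported GPU and Parallel Computing Toolbox:

% Predict on gpuArray data
load fisheriris
Mdl = fitctree(meas,species);
gX = gpuArray(meas);         % move the predictor data to the GPU
label = predict(Mdl,gX);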

Version History

Introduced in R2011a