predict - Predict labels using classification tree model - MATLAB
Predict labels using classification tree model
Syntax

label = predict(tree,X)
label = predict(tree,X,Subtrees=subtrees)
[label,score,node,cnum] = predict(___)

Description
label = predict(tree,X) returns a vector of predicted class labels for the predictor data in the table or matrix X, based on the trained classification tree tree.
label = predict(tree,X,Subtrees=subtrees) also prunes tree to the level specified by subtrees before predicting labels.
[label,score,node,cnum] = predict(___) also returns the following, using any of the input argument combinations in the previous syntaxes:

- A matrix of classification scores (score) indicating the likelihood that a label comes from a particular class. For classification trees, scores are posterior probabilities. For each observation in X, the predicted class label corresponds to the minimum expected misclassification cost among all classes.
- A vector of predicted node numbers for the classification (node).
- A vector of predicted class numbers for the classification (cnum).
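As a minimal sketch of the four-output call (using Fisher's iris data, as in the examples below):

```matlab
load fisheriris                  % loads meas (predictors) and species (labels)
Mdl = fitctree(meas,species);
[label,score,node,cnum] = predict(Mdl,meas(1:5,:));
% label — predicted class names (here, a 5-by-1 cell array of character vectors)
% score — 5-by-3 matrix of class posterior probabilities
% node  — leaf node of Mdl reached by each observation
% cnum  — index of each predicted class in Mdl.ClassNames
```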
Examples
Examine predictions for a few rows in a data set left out of training.
Load Fisher's iris data set.
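```matlab
load fisheriris % loads the predictor matrix meas and the label vector species
```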
Partition the data into training (50%) and validation (50%) sets.
```matlab
n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true;
idxVal = idxTrn == false;
```
Grow a classification tree using the training set.
```matlab
Mdl = fitctree(meas(idxTrn,:),species(idxTrn));
```
Predict labels for the validation data, and display several predicted labels. Count the number of misclassified observations.
```matlab
label = predict(Mdl,meas(idxVal,:));
label(randsample(numel(label),5))
```
```
ans = 5×1 cell
    {'setosa'    }
    {'setosa'    }
    {'setosa'    }
    {'virginica' }
    {'versicolor'}
```
```matlab
numMisclass = sum(~strcmp(label,species(idxVal)))
```
The software misclassifies three out-of-sample observations.
Load Fisher's iris data set.
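```matlab
load fisheriris % loads the predictor matrix meas and the label vector species
```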
Partition the data into training (50%) and validation (50%) sets.
```matlab
n = size(meas,1);
rng(1) % For reproducibility
idxTrn = false(n,1);
idxTrn(randsample(n,round(0.5*n))) = true;
idxVal = idxTrn == false;
```
Grow a classification tree using the training set, and then view it.
```matlab
Mdl = fitctree(meas(idxTrn,:),species(idxTrn));
view(Mdl,"Mode","graph")
```
The resulting tree has four levels.
Estimate posterior probabilities for the test set using subtrees pruned to levels 1 and 3. Display several posterior probabilities.
```matlab
[~,Posterior] = predict(Mdl,meas(idxVal,:), ...
    Subtrees=[1 3]);
Mdl.ClassNames
```
```
ans = 3×1 cell
    {'setosa'    }
    {'versicolor'}
    {'virginica' }
```
```matlab
Posterior(randsample(size(Posterior,1),5),:,:)
```
```
ans(:,:,1) =

    1.0000         0         0
    1.0000         0         0
    1.0000         0         0
         0         0    1.0000
         0    0.8571    0.1429


ans(:,:,2) =

    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
    0.3733    0.3200    0.3067
```
The elements of Posterior are class posterior probabilities:

- Rows correspond to observations in the validation set.
- Columns correspond to the classes as listed in Mdl.ClassNames.
- Pages correspond to the subtrees.
The subtree pruned to level 1 is more sure of its predictions than the subtree pruned to level 3 (that is, the root node).
Input Arguments
Predictor data to be classified, specified as a numeric matrix or a table.
Each row of X corresponds to one observation, and each column corresponds to one variable.

For a numeric matrix:

- The variables that make up the columns of X must have the same order as the predictor variables used to train tree.
- If you train tree using a table (for example, Tbl), then X can be a numeric matrix if Tbl contains all numeric predictor variables. To treat numeric predictors in Tbl as categorical during training, identify categorical predictors using the CategoricalPredictors name-value argument of fitctree. If Tbl contains heterogeneous predictor variables (for example, numeric and categorical data types) and X is a numeric matrix, then predict issues an error.
For a table:

- predict does not support multicolumn variables or cell arrays other than cell arrays of character vectors.
- If you train tree using a table (for example, Tbl), then all predictor variables in X must have the same variable names and data types as those used to train tree (stored in tree.PredictorNames). However, the column order of X does not need to correspond to the column order of Tbl. Tbl and X can contain additional variables (response variables, observation weights, and so on), but predict ignores them.
- If you train tree using a numeric matrix, then the predictor names in tree.PredictorNames and the corresponding predictor variable names in X must be the same. To specify predictor names during training, use the PredictorNames name-value argument of fitctree. All predictor variables in X must be numeric vectors. X can contain additional variables (response variables, observation weights, and so on), but predict ignores them.
Data Types: table | double | single
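As a sketch of the table workflow (the variable names below are hypothetical), you can train on a table and then predict with a table whose predictor variables match by name and type:

```matlab
load fisheriris
Tbl = array2table(meas,'VariableNames',{'SL','SW','PL','PW'}); % hypothetical names
Tbl.Species = species;
Mdl = fitctree(Tbl,'Species');
Xnew = Tbl(1:5,{'SL','SW','PL','PW'}); % extra variables, if any, are ignored
label = predict(Mdl,Xnew)
```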
Pruning level, specified as a vector of nonnegative integers in ascending order or "all". If you specify "all", then predict operates on all subtrees (that is, the entire pruning sequence).

Data Types: single | double | char | string
Output Arguments
Predicted class labels, returned as a categorical or character array, logical or numeric vector, or cell array of character vectors. Each entry of label corresponds to the class with the minimal expected cost for the corresponding row of X.

Suppose subtrees is a numeric vector containing T elements, and X has N rows (a sketch after the following list illustrates these shapes).
- If the response data type is char and T = 1, then label is a character matrix containing N rows. Each row contains the predicted label produced by subtrees.
- If the response data type is char and T > 1, then label is an N-by-T cell array. Column j of label contains the vector of predicted labels produced by subtree subtrees(j).
- Otherwise, label is an N-by-T array that has the same data type as the response. Column j of label contains the vector of predicted labels produced by subtree subtrees(j). (The software treats string arrays as cell arrays of character vectors.)
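A minimal sketch of these shapes (the iris response is a cell array of character vectors, so the last case applies):

```matlab
load fisheriris
Mdl = fitctree(meas,species);
X = meas(1:6,:);
labelOne  = predict(Mdl,X);                % 6-by-1 cell array (T = 1)
labelMany = predict(Mdl,X,Subtrees=[0 1]); % 6-by-2 cell array; column j comes
                                           % from the subtree pruned to level
                                           % subtrees(j)
```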
Posterior probabilities, returned as a numeric matrix of size N-by-K, where N is the number of observations (rows) in X, and K is the number of classes (in tree.ClassNames). score(i,j) is the posterior probability that row i in X is of class j in tree.ClassNames.
If subtrees has T elements, and X has N rows, then score is an N-by-K-by-T array, and node and cnum are N-by-T matrices.
Node numbers for the predicted classes, returned as a numeric vector. Each entry corresponds to the predicted node in tree for the corresponding row of X.
Class numbers corresponding to the predicted labels, returned as a numeric vector. Each entry of cnum corresponds to the predicted class number for the corresponding row of X.
More About
predict classifies by minimizing the expected misclassification cost:

$$\hat{y} = \underset{y=1,\dots,K}{\operatorname{argmin}} \; \sum_{j=1}^{K} \hat{P}(j \mid x)\, C(y \mid j),$$

where:

- ŷ is the predicted classification.
- K is the number of classes.
- P̂(j|x) is the posterior probability of class j for observation x.
- C(y|j) is the cost of classifying an observation as y when its true class is j.
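A minimal numeric sketch of this rule (the posterior values are hypothetical):

```matlab
Phat = [0.2 0.5 0.3];   % hypothetical posteriors P^(j|x) for K = 3 classes
C = ones(3) - eye(3);   % default cost: 0 if correct, 1 otherwise
expCost = Phat*C;       % expCost(y) = sum over j of P^(j|x)*C(y|j)
[~,yhat] = min(expCost) % yhat = 2, the class with minimal expected cost
```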
For trees, the score of a classification of a leaf node is the posterior probability of the classification at that node. The posterior probability of the classification at a node is the number of training sequences that lead to that node with the classification, divided by the number of training sequences that lead to that node.
For an example, see Posterior Probability Definition for Classification Tree.
The true misclassification cost is the cost of classifying an observation into an incorrect class.

You can set the true misclassification cost per class by using the Cost name-value argument when you create the classifier. Cost(i,j) is the cost of classifying an observation into class j when its true class is i. By default, Cost(i,j)=1 if i~=j, and Cost(i,j)=0 if i=j. In other words, the cost is 0 for correct classification and 1 for incorrect classification.
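For example, a sketch of supplying a custom cost matrix at training time (the cost values here are illustrative):

```matlab
load fisheriris
C = ones(3) - eye(3);
C(3,:) = 10*C(3,:);     % make misclassifying true class 3 ('virginica') costly
Mdl = fitctree(meas,species,Cost=C);
label = predict(Mdl,meas(100:105,:));
```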
The expected misclassification cost per observation is an averaged cost of classifying the observation into each class.

Suppose you have Nobs observations that you want to classify with a trained classifier, and you have K classes. You place the observations into a matrix X with one observation per row.

The expected cost matrix CE has size Nobs-by-K. Each row of CE contains the expected (average) cost of classifying the observation into each of the K classes. CE(n,k) is

$$CE(n,k) = \sum_{i=1}^{K} \hat{P}(i \mid X(n))\, C(k \mid i),$$

where:

- K is the number of classes.
- P̂(i|X(n)) is the posterior probability of class i for observation X(n).
- C(k|i) is the true misclassification cost of classifying an observation as k when its true class is i.
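In matrix form, with Posterior of size Nobs-by-K and Cost(i,k) = C(k|i), the expected cost matrix is a single product. A sketch with hypothetical posteriors:

```matlab
Posterior = [0.9 0.1 0.0
             0.2 0.5 0.3];  % hypothetical posteriors for Nobs = 2, K = 3
Cost = ones(3) - eye(3);    % default true misclassification cost
CE = Posterior*Cost         % CE(n,k) = sum over i of P^(i|X(n))*C(k|i)
```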
The predictive measure of association is a value that indicates the similarity between decision rules that split observations. Among all possible decision splits that are compared to the optimal split (found by growing the tree), the best surrogate decision split yields the maximum predictive measure of association. The second-best surrogate split has the second-largest predictive measure of association.

Suppose xj and xk are predictor variables j and k, respectively, and j ≠ k. At node t, the predictive measure of association between the optimal split xj < u and a surrogate split xk < v is

$$\lambda_{jk} = \frac{\min(P_L, P_R) - \left(1 - P_{L_j L_k} - P_{R_j R_k}\right)}{\min(P_L, P_R)},$$

where:

- PL is the proportion of observations in node t, such that xj < u. The subscript L stands for the left child of node t.
- PR is the proportion of observations in node t, such that xj ≥ u. The subscript R stands for the right child of node t.
- PLjLk is the proportion of observations at node t, such that xj < u and xk < v.
- PRjRk is the proportion of observations at node t, such that xj ≥ u and xk ≥ v.
- Observations with missing values for xj or xk do not contribute to the proportion calculations.

λjk is a value in (–∞,1]. If λjk > 0, then xk < v is a worthwhile surrogate split for xj < u.
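A numeric sketch of this measure with hypothetical proportions at a node t:

```matlab
PL = 0.6;  PR = 0.4;        % proportions going left/right on the split xj < u
PLjLk = 0.5;  PRjRk = 0.3;  % proportions where the surrogate xk < v agrees
lambda = (min(PL,PR) - (1 - PLjLk - PRjRk))/min(PL,PR)
% lambda = 0.5 > 0, so xk < v would be a worthwhile surrogate for xj < u
```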
Algorithms
predict generates predictions by following the branches of tree until it reaches a leaf node or a missing value. If predict reaches a leaf node, it returns the classification of that node.
If predict reaches a node with a missing value for a predictor, its behavior depends on the setting of the Surrogate name-value argument when fitctree constructs tree.

- Surrogate="off" (default) — predict returns the label with the largest number of training samples that reach the node.
- Surrogate="on" — predict uses the best surrogate split at the node. If all surrogate split variables with positive predictive measure of association are missing, predict returns the label with the largest number of training samples that reach the node. For a definition, see Predictive Measure of Association.
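A minimal sketch of the surrogate behavior (using Fisher's iris data; the NaN marks a missing predictor value):

```matlab
load fisheriris
Mdl = fitctree(meas,species,Surrogate="on"); % store surrogate splits
Xmiss = meas(1:3,:);
Xmiss(2,1) = NaN;              % observation 2 is missing its first predictor
label = predict(Mdl,Xmiss)     % row 2 is routed using the best surrogate split
```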
Alternative Functionality
Simulink Block
To integrate the prediction of a classification tree model into Simulink®, you can use the ClassificationTree Predict block in the Statistics and Machine Learning Toolbox™ library or a MATLAB® Function block with the predict function. For examples, see Predict Class Labels Using ClassificationTree Predict Block and Predict Class Labels Using MATLAB Function Block.
When deciding which approach to use, consider the following:
- If you use the Statistics and Machine Learning Toolbox library block, you can use the Fixed-Point Tool (Fixed-Point Designer) to convert a floating-point model to fixed point.
- Support for variable-size arrays must be enabled for a MATLAB Function block with the predict function.
- If you use a MATLAB Function block, you can use MATLAB functions for preprocessing or post-processing before or after predictions in the same MATLAB Function block.
Extended Capabilities
This function fully supports tall arrays. You can use models trained on either in-memory or tall data with this function.
For more information, see Tall Arrays.
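A hedged sketch of the tall-array workflow (the file name is hypothetical; Mdl is an in-memory model trained on a table with matching variable names):

```matlab
ds = tabularTextDatastore("irisMeasurements.csv"); % hypothetical data file
tX = tall(ds);                 % tall table backed by the datastore
tLabel = predict(Mdl,tX);      % deferred: builds the computation graph
label = gather(tLabel);        % triggers evaluation and collects the results
```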
Usage notes and limitations:
- You can generate C/C++ code for both predict and update by using a coder configurer. Or, generate code only for predict by using saveLearnerForCoder, loadLearnerForCoder, and codegen.
  - Code generation for predict and update — Create a coder configurer by using learnerCoderConfigurer and then generate code by using generateCode. Then you can update model parameters in the generated code without having to regenerate the code.
  - Code generation for predict — Save a trained model by using saveLearnerForCoder. Define an entry-point function that loads the saved model by using loadLearnerForCoder and calls the predict function. Then use codegen (MATLAB Coder) to generate code for the entry-point function.
- To generate single-precision C/C++ code for predict, specify DataType="single" when you call the loadLearnerForCoder function.
- You can also generate fixed-point C/C++ code for predict. Fixed-point code generation requires an additional step that defines the fixed-point data types of the variables required for prediction. Create a fixed-point data type structure by using the data type function generated by generateLearnerDataTypeFcn, and then use the structure as an input argument of loadLearnerForCoder in an entry-point function. Generating fixed-point C/C++ code requires MATLAB Coder™ and Fixed-Point Designer™.
- This table contains notes about the arguments of predict. Arguments not included in this table are fully supported.

| Argument | Notes and Limitations |
| --- | --- |
| tree | For the usage notes and limitations of the model object, see Code Generation of the CompactClassificationTree object. |
| X | For general code generation, X must be a single-precision or double-precision matrix or a table containing numeric variables, categorical variables, or both. In the coder configurer workflow, X must be a single-precision or double-precision matrix. For fixed-point code generation, X must be a fixed-point matrix. The number of rows, or observations, in X can be a variable size, but the number of columns in X must be fixed. If you want to specify X as a table, then your model must be trained using a table, and your entry-point function for prediction must do the following: accept data as arrays, create a table from the data input arguments and specify the variable names in the table, and pass the table to predict. For an example of this table workflow, see Generate Code to Classify Data in Table. For more information on using tables in code generation, see Code Generation for Tables (MATLAB Coder) and Table Limitations for Code Generation (MATLAB Coder). |
| label | If the response data type is char and codegen cannot determine that subtrees is a scalar, then label is a cell array of character vectors. |
| subtrees | Names in name-value arguments must be compile-time constants. For example, to allow user-defined pruning levels in the generated code, include {coder.Constant("Subtrees"),coder.typeof(0,[1,n],[0,1])} in the -args value of codegen (MATLAB Coder), where n is max(tree.PruneList). The subtrees name-value argument is not supported in the coder configurer workflow. For fixed-point code generation, the subtrees value must be coder.Constant("all") or have an integer data type. |
For more information, see Introduction to Code Generation.
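As a sketch of the saveLearnerForCoder workflow (the file and function names are hypothetical):

```matlab
% Run once in MATLAB to save the trained model to disk:
%   saveLearnerForCoder(Mdl,"treeModel");

function label = predictTree(X) %#codegen
% Entry-point function: load the saved model and predict labels.
mdl = loadLearnerForCoder("treeModel");
label = predict(mdl,X);
end

% Generate code, allowing a variable number of observations (4 predictors):
%   codegen predictTree -args {coder.typeof(0,[Inf 4],[1 0])}
```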
Usage notes and limitations:
- The predict function does not support decision tree models trained with surrogate splits.
For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
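A minimal GPU sketch (requires Parallel Computing Toolbox and a supported GPU; per the limitation above, the model is trained without surrogate splits):

```matlab
load fisheriris
Mdl = fitctree(meas,species);       % no surrogate splits (default)
Xgpu = gpuArray(meas(1:5,:));       % move predictor data to the GPU
label = predict(Mdl,Xgpu)
```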
Version History
Introduced in R2011a