crossval - Cross-validate machine learning model - MATLAB

Cross-validate machine learning model

Syntax

Description

CVMdl = crossval(Mdl) returns a cross-validated (partitioned) machine learning model (CVMdl) from a trained model (Mdl). By default, crossval uses 10-fold cross-validation on the training data.


CVMdl = crossval(Mdl,Name=Value) specifies additional options using one or more name-value arguments. For example, you can specify the fraction of data to hold out for validation, or the number of folds to use in the cross-validated model.


Examples


Load the ionosphere data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad ('b') or good ('g').

load ionosphere
rng(1); % For reproducibility

Train a support vector machine (SVM) classifier. Standardize the predictor data and specify the order of the classes.

SVMModel = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'});

SVMModel is a trained ClassificationSVM classifier. 'b' is the negative class and 'g' is the positive class.

Cross-validate the classifier using 10-fold cross-validation.

CVSVMModel = crossval(SVMModel)

CVSVMModel =
  ClassificationPartitionedModel
    CrossValidatedModel: 'SVM'
         PredictorNames: {'x1'  'x2'  'x3'  'x4'  'x5'  'x6'  'x7'  'x8'  'x9'  'x10'  'x11'  'x12'  'x13'  'x14'  'x15'  'x16'  'x17'  'x18'  'x19'  'x20'  'x21'  'x22'  'x23'  'x24'  'x25'  'x26'  'x27'  'x28'  'x29'  'x30'  'x31'  'x32'  'x33'  'x34'}
           ResponseName: 'Y'
        NumObservations: 351
                  KFold: 10
              Partition: [1×1 cvpartition]
             ClassNames: {'b'  'g'}
         ScoreTransform: 'none'


CVSVMModel is a ClassificationPartitionedModel cross-validated classifier. During cross-validation, the software completes these steps:

  1. Randomly partition the data into 10 sets (folds) of equal size.
  2. Train an SVM classifier on nine of the folds, leaving one fold out.
  3. Repeat step 2 ten times, leaving out a different fold each time and training on the other nine folds.
  4. Combine the generalization statistics from the 10 validation folds.
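
These steps can be approximated manually with cvpartition. The following sketch is for illustration only; it is not the internal implementation of crossval, and the variable names are arbitrary:

% Rough manual equivalent of 10-fold cross-validation (illustration only)
cvp = cvpartition(Y,'KFold',10);       % stratified 10-fold partition of the labels
foldLoss = zeros(cvp.NumTestSets,1);
for k = 1:cvp.NumTestSets
    trIdx = training(cvp,k);           % logical index of training-fold observations
    teIdx = test(cvp,k);               % logical index of validation-fold observations
    mdlK = fitcsvm(X(trIdx,:),Y(trIdx),'Standardize',true,'ClassNames',{'b','g'});
    foldLoss(k) = loss(mdlK,X(teIdx,:),Y(teIdx));  % misclassification rate on the held-out fold
end
estimatedGenError = mean(foldLoss);    % combine the per-fold statistics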

Display the first model in CVSVMModel.Trained.

FirstModel = CVSVMModel.Trained{1}

FirstModel =
  CompactClassificationSVM
             ResponseName: 'Y'
    CategoricalPredictors: []
               ClassNames: {'b'  'g'}
           ScoreTransform: 'none'
                    Alpha: [78×1 double]
                     Bias: -0.2209
         KernelParameters: [1×1 struct]
                       Mu: [0.8888 0 0.6320 0.0406 0.5931 0.1205 0.5361 0.1286 0.5083 0.1879 0.4779 0.1567 0.3924 0.0875 0.3360 0.0789 0.3839 9.6066e-05 0.3562 -0.0308 0.3398 -0.0073 0.3590 -0.0628 0.4064 -0.0664 0.5535 -0.0749 0.3835 … ] (1×34 double)
                    Sigma: [0.3149 0 0.5033 0.4441 0.5255 0.4663 0.4987 0.5205 0.5040 0.4780 0.5649 0.4896 0.6293 0.4924 0.6606 0.4535 0.6133 0.4878 0.6250 0.5140 0.6075 0.5150 0.6068 0.5222 0.5729 0.5103 0.5061 0.5478 0.5712 0.5032 … ] (1×34 double)
           SupportVectors: [78×34 double]
      SupportVectorLabels: [78×1 double]


FirstModel is the first of the 10 trained classifiers. It is a CompactClassificationSVM classifier.

You can estimate the generalization error by passing CVSVMModel to kfoldLoss.
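
For example:

genError = kfoldLoss(CVSVMModel); % average misclassification rate over the 10 validation folds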

Specify a holdout sample proportion for cross-validation. By default, crossval uses 10-fold cross-validation to cross-validate a naive Bayes classifier. However, you have several other options for cross-validation. For example, you can specify a different number of folds or a holdout sample proportion.

Load the ionosphere data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad ('b') or good ('g').
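
load ionosphere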

Remove the first two predictors for stability.

X = X(:,3:end);
rng('default'); % For reproducibility

Train a naive Bayes classifier using the predictors X and class labels Y. A recommended practice is to specify the class names. 'b' is the negative class and 'g' is the positive class. By default, fitcnb assumes that each predictor is conditionally normally distributed given the class.

Mdl = fitcnb(X,Y,'ClassNames',{'b','g'});

Mdl is a trained ClassificationNaiveBayes classifier.

Cross-validate the classifier by specifying a 30% holdout sample.

CVMdl = crossval(Mdl,'Holdout',0.3)

CVMdl =
  ClassificationPartitionedModel
    CrossValidatedModel: 'NaiveBayes'
         PredictorNames: {'x1'  'x2'  'x3'  'x4'  'x5'  'x6'  'x7'  'x8'  'x9'  'x10'  'x11'  'x12'  'x13'  'x14'  'x15'  'x16'  'x17'  'x18'  'x19'  'x20'  'x21'  'x22'  'x23'  'x24'  'x25'  'x26'  'x27'  'x28'  'x29'  'x30'  'x31'  'x32'}
           ResponseName: 'Y'
        NumObservations: 351
                  KFold: 1
              Partition: [1×1 cvpartition]
             ClassNames: {'b'  'g'}
         ScoreTransform: 'none'


CVMdl is a ClassificationPartitionedModel cross-validated naive Bayes classifier.

Display the properties of the classifier trained using 70% of the data.

TrainedModel = CVMdl.Trained{1}

TrainedModel =
  CompactClassificationNaiveBayes
              ResponseName: 'Y'
     CategoricalPredictors: []
                ClassNames: {'b'  'g'}
            ScoreTransform: 'none'
         DistributionNames: {1×32 cell}
    DistributionParameters: {2×32 cell}


TrainedModel is a CompactClassificationNaiveBayes classifier.

Estimate the generalization error by passing CVMdl to kfoldLoss.
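
kfoldLoss(CVMdl)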

The out-of-sample misclassification error is approximately 21%.

Reduce the generalization error by choosing the five most important predictors.

idx = fscmrmr(X,Y);
Xnew = X(:,idx(1:5));

Train a naive Bayes classifier using the new predictors.

Mdlnew = fitcnb(Xnew,Y,'ClassNames',{'b','g'});

Cross-validate the new classifier by specifying a 30% holdout sample, and estimate the generalization error.

CVMdlnew = crossval(Mdlnew,'Holdout',0.3);
kfoldLoss(CVMdlnew)

The out-of-sample misclassification error is reduced from approximately 21% to approximately 14%.

Train a regression generalized additive model (GAM) by using fitrgam, and create a cross-validated GAM by using crossval and the holdout option. Then, use kfoldPredict to predict responses for validation-fold observations using a model trained on training-fold observations.

Load the patients data set.
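
load patients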

Create a table that contains the predictor variables (Age, Diastolic, Smoker, Weight, Gender, SelfAssessedHealthStatus) and the response variable (Systolic).

tbl = table(Age,Diastolic,Smoker,Weight,Gender,SelfAssessedHealthStatus,Systolic);

Train a GAM that contains linear terms for predictors.

Mdl = fitrgam(tbl,'Systolic');

Mdl is a RegressionGAM model object.

Cross-validate the model by specifying a 30% holdout sample.

rng('default') % For reproducibility
CVMdl = crossval(Mdl,'Holdout',0.3)

CVMdl =
  RegressionPartitionedGAM
       CrossValidatedModel: 'GAM'
            PredictorNames: {'Age'  'Diastolic'  'Smoker'  'Weight'  'Gender'  'SelfAssessedHealthStatus'}
     CategoricalPredictors: [3 5 6]
              ResponseName: 'Systolic'
           NumObservations: 100
                     KFold: 1
                 Partition: [1×1 cvpartition]
         NumTrainedPerFold: [1×1 struct]
         ResponseTransform: 'none'
    IsStandardDeviationFit: 0


The crossval function creates a RegressionPartitionedGAM model object CVMdl with the holdout option. During cross-validation, the software completes these steps:

  1. Randomly select and reserve 30% of the data as validation data, and train the model using the rest of the data.
  2. Store the compact, trained model in the Trained property of the cross-validated model object RegressionPartitionedGAM.

You can choose a different cross-validation setting by using the 'CrossVal', 'CVPartition', 'KFold', or 'Leaveout' name-value argument.
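
For example, the following sketch (continuing this example; the variable name CVMdl5 is arbitrary) creates a 5-fold cross-validated GAM instead of the holdout partition:

CVMdl5 = crossval(Mdl,'KFold',5); % 5-fold cross-validation of the same trained model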

Predict responses for the validation-fold observations by using kfoldPredict. The function predicts responses for the validation-fold observations by using the model trained on the training-fold observations. The function assigns NaN to the training-fold observations.

yFit = kfoldPredict(CVMdl);

Find the validation-fold observation indexes, and create a table containing the observation index, observed response values, and predicted response values. Display the first eight rows of the table.

idx = find(~isnan(yFit));
t = table(idx,tbl.Systolic(idx),yFit(idx), ...
    'VariableNames',{'Observation Index','Observed Value','Predicted Value'});
head(t)

Observation Index    Observed Value    Predicted Value
_________________    ______________    _______________

        1                 124              130.22     
        6                 121              124.38     
        7                 130              125.26     
       12                 115              117.05     
       20                 125              121.82     
       22                 123              116.99     
       23                 114                 107     
       24                 128              122.52     

Compute the regression error (mean squared error) for the validation-fold observations.
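
L = kfoldLoss(CVMdl) % mean squared error over the validation-fold observations (the default loss)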

Cross-validate an ECOC classifier with SVM binary learners, and estimate the generalized classification error.

Load Fisher's iris data set. Specify the predictor data X and the response data Y.

load fisheriris
X = meas;
Y = species;
rng(1); % For reproducibility

Create an SVM template, and standardize the predictors.

t = templateSVM('Standardize',true)

t =
Fit template for SVM.
    Standardize: 1

t is an SVM template. Most of the template object properties are empty. When training the ECOC classifier, the software sets the applicable properties to their default values.

Train the ECOC classifier, and specify the class order.

Mdl = fitcecoc(X,Y,'Learners',t, ...
    'ClassNames',{'setosa','versicolor','virginica'});

Mdl is a ClassificationECOC classifier. You can access its properties using dot notation.

Cross-validate Mdl using 10-fold cross-validation.
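
CVMdl = crossval(Mdl);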

CVMdl is a ClassificationPartitionedECOC cross-validated ECOC classifier.

Estimate the generalized classification error.

genError = kfoldLoss(CVMdl)

The generalized classification error is 4%, which indicates that the ECOC classifier generalizes fairly well.

Compute the quantile loss for a quantile neural network regression model, first partitioned using holdout validation and then partitioned using 5-fold cross-validation. Compare the two losses.

Load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s. Create a table containing the predictor variables Acceleration, Cylinders, Displacement, and so on, as well as the response variable MPG. View the first eight observations.

load carbig
cars = table(Acceleration,Cylinders,Displacement, ...
    Horsepower,Model_Year,Origin,Weight,MPG);
head(cars)

Acceleration    Cylinders    Displacement    Horsepower    Model_Year    Origin     Weight    MPG
____________    _________    ____________    __________    __________    _______    ______    ___

      12            8            307            130            70        USA         3504     18 
    11.5            8            350            165            70        USA         3693     15 
      11            8            318            150            70        USA         3436     18 
      12            8            304            150            70        USA         3433     16 
    10.5            8            302            140            70        USA         3449     17 
      10            8            429            198            70        USA         4341     15 
       9            8            454            220            70        USA         4354     14 
     8.5            8            440            215            70        USA         4312     14 

Remove rows of cars where the table has missing values.
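
cars = rmmissing(cars);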

Categorize the cars based on whether they were made in the USA.

cars.Origin = categorical(cellstr(cars.Origin));
cars.Origin = mergecats(cars.Origin,["France","Japan", ...
    "Germany","Sweden","Italy","England"],"NotUSA");

Partition the data using cvpartition. First, create a partition for holdout validation, using approximately 80% of the observations for the training data and 20% for the test data. Then, create a partition for 5-fold cross-validation.

rng(0,"twister") % For reproducibility holdoutPartition = cvpartition(height(cars),Holdout=0.20); kfoldPartition = cvpartition(height(cars),KFold=5);

Train a quantile neural network regression model using the cars data. Specify MPG as the response variable, and standardize the numeric predictors. Use the default 0.5 quantile (median).

Mdl = fitrqnet(cars,"MPG",Standardize=true);

Create the partitioned quantile regression models using crossval.

holdoutMdl = crossval(Mdl,CVPartition=holdoutPartition)

holdoutMdl =
  RegressionPartitionedQuantileModel
      CrossValidatedModel: 'QuantileNeuralNetwork'
           PredictorNames: {'Acceleration'  'Cylinders'  'Displacement'  'Horsepower'  'Model_Year'  'Origin'  'Weight'}
    CategoricalPredictors: 6
             ResponseName: 'MPG'
          NumObservations: 392
                    KFold: 1
                Partition: [1×1 cvpartition]
        ResponseTransform: 'none'
                Quantiles: 0.5000


kfoldMdl = crossval(Mdl,CVPartition=kfoldPartition)

kfoldMdl =
  RegressionPartitionedQuantileModel
      CrossValidatedModel: 'QuantileNeuralNetwork'
           PredictorNames: {'Acceleration'  'Cylinders'  'Displacement'  'Horsepower'  'Model_Year'  'Origin'  'Weight'}
    CategoricalPredictors: 6
             ResponseName: 'MPG'
          NumObservations: 392
                    KFold: 5
                Partition: [1×1 cvpartition]
        ResponseTransform: 'none'
                Quantiles: 0.5000


Compute the quantile loss for holdoutMdl and kfoldMdl by using the kfoldLoss object function.

holdoutL = kfoldLoss(holdoutMdl)

kfoldL = kfoldLoss(kfoldMdl)

holdoutL is the quantile loss computed using the single holdout validation set, while kfoldL is the average quantile loss over the five validation folds. Cross-validation metrics tend to be better indicators of a model's performance on unseen data.

Input Arguments


Machine learning model, specified as a full classification, regression, or quantile regression model object, as given in the following tables of supported models.

Classification Model Object

Quantile Regression Model Object

Name-Value Arguments


Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: crossval(Mdl,KFold=3) specifies to use three folds in the cross-validated model.

Data Types: double | single

Data Types: single | double

Data Types: char | string

Printout frequency, specified as a positive integer or "off".

To track the number of folds trained by the software so far, specify a positive integer m. The software displays a message to the command line every time it finishes training m folds.

If you specify "off", the software does not display a message when it completes training folds.

Example: NPrint=5

Data Types: single | double | char | string

Options for computing in parallel, specified as a structure. Create the Options structure using statset.

You need Parallel Computing Toolbox™ to run computations in parallel.

You can specify Options only if Mdl is a ClassificationECOC model object.

Example: Options=statset(UseParallel=true)

Data Types: struct

Output Arguments


Cross-validated machine learning model, returned as one of the cross-validated (partitioned) model objects in the following tables, depending on the input model Mdl.

Classification Model Object

Quantile Regression Model Object

Tips

Alternative Functionality

Instead of training a model and then cross-validating it, you can create a cross-validated model directly by using a fitting function and specifying one of these name-value arguments: CVPartition, Holdout, KFold, or Leaveout.
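
For example, the following sketch trains and cross-validates an SVM classifier in a single call, using the ionosphere data from the earlier examples:

load ionosphere
CVSVMModel = fitcsvm(X,Y,'Standardize',true,'KFold',10); % returns a ClassificationPartitionedModel directly
genError = kfoldLoss(CVSVMModel);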

Extended Capabilities

Version History

Introduced in R2012a


Starting in R2023b, a cross-validated regression neural network model is a RegressionPartitionedNeuralNetwork object. In previous releases, a cross-validated regression neural network model was a RegressionPartitionedModel object.

You can create a RegressionPartitionedNeuralNetwork object in two ways: create a cross-validated model from a RegressionNeuralNetwork model object by using crossval, or specify one of the cross-validation name-value arguments (CrossVal, CVPartition, Holdout, KFold, or Leaveout) when you call fitrnet.

Starting in R2022b, a cross-validated Gaussian process regression (GPR) model is a RegressionPartitionedGP object. In previous releases, a cross-validated GPR model was a RegressionPartitionedModel object.

You can create a RegressionPartitionedGP object in two ways: create a cross-validated model from a RegressionGP model object by using crossval, or specify one of the cross-validation name-value arguments (CrossVal, CVPartition, Holdout, KFold, or Leaveout) when you call fitrgp.

Regardless of whether you train a full or cross-validated GPR model first, you cannot specify an ActiveSet value in the call to fitrgp.