Clean Outlier Data - Find, fill, or remove outliers in the Live Editor - MATLAB

Find, fill, or remove outliers in the Live Editor

Description

The Clean Outlier Data task lets you interactively handle outliers in data. The task automatically generates MATLAB® code for your live script.

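Using this task, you can find, fill, or remove outliers in your data, choose the detection and fill methods, and visualize the input data, outliers, and cleaned data.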

Clean Outlier Data task in the Live Editor

Open the Task

To add the Clean Outlier Data task to a live script in the MATLAB Editor:

Examples


Interactively remove outliers from a table using the Clean Outlier Data task in the Live Editor.

Create a table using patient height and weight data from a sample file.

load("patients.mat","Height","Weight")
T = table(Height,Weight);
head(T)

Height    Weight
______    ______

  71       176  
  69       163  
  64       131  
  67       133  
  64       119  
  68       142  
  64       142  
  68       180  

Open the Clean Outlier Data task in the Live Editor. To clean the patient data, select T as the input data. Then, compute on the Height and Weight variables by selecting All supported variables.

The Clean Outlier Data task can fill or remove outlier data. To remove the table rows corresponding to patients with outlier height or weight measurements, use the Cleaning method field to select Remove outliers. Then, to define outliers as elements below the 10th percentile or above the 90th percentile, use the Detection method field to select Percentiles.

Then, to visualize the cleaned height and weight data, use the Variable to display field to select all variables.

Figure: Clean Outlier Data task output with two axes. The Height axes (title: Number of outliers cleaned: 8) and the Weight axes (title: Number of outliers cleaned: 18) each show the input data, cleaned data, outliers, points removed by other variables, and outlier thresholds.

This task returns a table of the cleaned data and a logical vector indicating the rows removed from the input table. Use outlierIndices to determine the number of rows removed from the table.

nrows = sum(outlierIndices)
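
For reference, the same cleaning step can be approximated at the command line. This is a minimal sketch, assuming the task settings in this example correspond to the rmoutliers function with the "percentiles" method and thresholds [10 90]. The code that the task actually generates may differ, and the output names cleanedData and outlierIndices are chosen here only to match the text above.

```matlab
% Load the sample data and build the table used in this example
load("patients.mat","Height","Weight")
T = table(Height,Weight);

% Remove rows where Height or Weight lies below the 10th percentile
% or above the 90th percentile. The second output is a logical vector
% that marks the rows removed from T.
[cleanedData,outlierIndices] = rmoutliers(T,"percentiles",[10 90]);

% Count the removed rows
nrows = sum(outlierIndices)
```

Because a row is removed when either variable contains an outlier, the number of removed rows can be smaller than the sum of the per-variable outlier counts shown in the plot titles.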

Parameters


This task operates on input data contained in a vector, table, or timetable. The data can be of type single or double.

For table or timetable input data, to clean all variables with type single or double, select All supported variables. To choose which single or double variables to clean, select Specified variables.
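
At the command line, restricting the operation to specific table variables can be sketched with the DataVariables name-value argument. This example assumes that Specified variables corresponds to DataVariables; the choice of Weight, linear interpolation, and percentile thresholds is illustrative only.

```matlab
load("patients.mat","Height","Weight")
T = table(Height,Weight);

% Fill outliers in Weight only, using linear interpolation.
% Outliers are values outside the 10th-90th percentile range.
% Height is passed through unchanged because it is not listed
% in DataVariables.
filled = filloutliers(T,"linear","percentiles",[10 90], ...
    "DataVariables","Weight");
```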

Specify the method for filling outliers as one of these options.

Fill Method | Description
Linear interpolation | Linear interpolation of neighboring, nonoutlier values
Constant value | Specified scalar value, which is 0 by default
Convert to missing | Convert to default definition of standard missing value
Center value | Center value determined by the detection method
Clip to threshold value | Lower threshold value for elements smaller than the lower threshold determined by the detection method; upper threshold value for elements larger than the upper threshold determined by the detection method
Previous value | Previous nonoutlier value
Next value | Next nonoutlier value
Nearest value | Nearest nonoutlier value
Spline interpolation | Piecewise cubic spline interpolation
Shape-preserving cubic interpolation (PCHIP) | Shape-preserving piecewise cubic spline interpolation
Modified Akima cubic interpolation | Modified Akima cubic Hermite interpolation
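
Several of these fill options have counterparts in the filloutliers function. The following is a sketch on a small made-up vector, assuming the default detection method of filloutliers (median); the mapping from task options to function arguments is an assumption, not a statement of the code the task generates.

```matlab
% Sample vector with two obvious outliers (100 and 300)
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

Blinear   = filloutliers(A,"linear");    % Linear interpolation
Bconstant = filloutliers(A,0);           % Constant value (scalar 0)
Bmissing  = filloutliers(A,NaN);         % One way to mark outliers as NaN
Bcenter   = filloutliers(A,"center");    % Center value (median for this method)
Bclip     = filloutliers(A,"clip");      % Clip to threshold value
Bprevious = filloutliers(A,"previous");  % Previous nonoutlier value
Bnext     = filloutliers(A,"next");      % Next nonoutlier value
Bnearest  = filloutliers(A,"nearest");   % Nearest nonoutlier value
Bspline   = filloutliers(A,"spline");    % Spline interpolation
Bpchip    = filloutliers(A,"pchip");     % Shape-preserving cubic interpolation
Bmakima   = filloutliers(A,"makima");    % Modified Akima cubic interpolation
```

Only the elements flagged as outliers change; all other elements of each result equal the corresponding elements of A.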

Specify the detection method for finding outliers as one of these options.

Method | Description
Moving median | Define outliers as elements more than the specified threshold of local scaled median absolute deviations (MAD) from the local median over a specified window. The default threshold is 3.
Median | Define outliers as elements more than the specified threshold of scaled MAD from the median. The default threshold is 3. For input data A, the scaled MAD is defined as c*median(abs(A-median(A))), where c=-1/(sqrt(2)*erfcinv(3/2)).
Mean | Define outliers as elements more than the specified threshold of standard deviations from the mean. The default threshold is 3. This method is faster but less robust than Median.
Quartiles | Define outliers as elements more than the specified threshold of interquartile ranges above the upper quartile (75 percent) or below the lower quartile (25 percent). The default threshold is 1.5. This method is useful when the input data is not normally distributed.
Grubbs | Define outliers using Grubbs’ test, which removes one outlier per iteration based on hypothesis testing. This method assumes that the input data is normally distributed.
Generalized extreme studentized deviate (GESD) | Define outliers using the generalized extreme studentized deviate test for outliers. This iterative method is similar to Grubbs but can perform better when multiple outliers are masking each other.
Moving mean | Define outliers as elements more than the specified threshold of local standard deviations from the local mean over a specified window. The default threshold is 3.
Percentiles | Define outliers as elements outside of the percentile range specified by an upper and lower threshold. The default lower percentile threshold is 10, and the default upper percentile threshold is 90. Valid threshold values are in the interval [0, 100].
Range (since R2024b) | Define outliers as elements outside of the range specified by an upper and lower threshold. Specify the thresholds as scalars or vectors matching the width of the input data.
Workspace variable (since R2024b) | Define outlier locations using a workspace variable. Specify a logical array or table with logical variables, where elements with a value of 1 (true) correspond to outliers.
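
To make the Median rule concrete, the sketch below computes the scaled MAD from the formula above and compares a manual threshold test against the isoutlier function. The sample vector is made up for illustration, and the calls for the other methods assume that each task option corresponds to the isoutlier method of the same name.

```matlab
% Sample vector with two obvious outliers (100 and 300)
A = [57 59 60 100 59 58 57 58 300 61 62 60 62 58 57];

% Scaled MAD from the formula given for the Median method
c = -1/(sqrt(2)*erfcinv(3/2));              % approximately 1.4826
scaledMAD = c*median(abs(A - median(A)));

% Manual test: more than 3 scaled MAD from the median
tfManual = abs(A - median(A)) > 3*scaledMAD;

% Same rule through isoutlier, which also returns the lower threshold,
% upper threshold, and center value
[tfMedian,lower,upper,center] = isoutlier(A,"median");

% Other detection methods, each with its default threshold
tfMean      = isoutlier(A,"mean");
tfQuartiles = isoutlier(A,"quartiles");
tfGrubbs    = isoutlier(A,"grubbs");
tfGesd      = isoutlier(A,"gesd");
tfPct       = isoutlier(A,"percentiles",[10 90]);
```

Comparing tfMean with tfMedian on data like this is a quick way to see why the Mean method is described as less robust: a large outlier inflates the standard deviation that the Mean rule relies on.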

Specify the window type and size when the method for detecting outliers is Moving median or Moving mean.

Window | Description
Centered | Specified window length centered about the current point
Asymmetric | Specified window containing the number of elements before the current point and the number of elements after the current point

Window sizes are relative to the X-axis variable units.
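
The window options can be sketched with the "movmedian" method of isoutlier. This example uses made-up data and assumes that a scalar window corresponds to Centered, a two-element window to Asymmetric, and the SamplePoints name-value argument to the X-axis variable.

```matlab
% Smooth signal with one injected outlier at index 10
x = 1:20;
A = sin(x/3);
A(10) = 5;

% Centered window of length 5 around each point
tfCentered = isoutlier(A,"movmedian",5);

% Asymmetric window: 3 elements before and 1 element after each point
tfAsymmetric = isoutlier(A,"movmedian",[3 1]);

% Window length expressed in the units of the sample points.
% With samples spaced 0.5 apart, a window of 2.5 spans five samples.
t = 0.5*x;
tfSampled = isoutlier(A,"movmedian",2.5,"SamplePoints",t);
```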

Version History

Introduced in R2019b


You can define outliers as elements outside of a range defined by an upper and lower threshold or as elements indicated by a value of 1 (true) in a workspace variable. Select the Range or Workspace variable detection method, respectively.

Simultaneously plot multiple table variables in the display of this Live Editor task. For table or timetable data, to visualize all selected table variables at once in a tiled chart layout, set the field.

You can convert outlier data to missing data indicated by the value NaN. Set the field to Fill outliers and select the Convert to missing option.

Append input table variables with table variables containing cleaned data. For table or timetable input data, to append the cleaned data, set the field.

This Live Editor task does not run automatically if the inputs have more than 1 million elements. In previous releases, the task always ran automatically for inputs of any size. If the inputs have a large number of elements, then the code generated by this task can take a noticeable amount of time to run (more than a few seconds).

When a task does not run automatically, the Autorun indicator is disabled. You can either run the task manually when needed or choose to enable the task to run automatically.

This Live Editor task can operate on multiple table variables at the same time. For table or timetable input data, to operate on multiple variables, select All supported variables or Specified variables. Return all of the variables or only the modified variables, and specify which variable to visualize.

Visualize results with a histogram plot for most detection methods. The histogram can summarize the input data, outliers, cleaned data with outliers filled, and outlier detection thresholds and center value.