turicreate.dbscan.create — Turi Create API 6.4.1 documentation

turicreate.dbscan.create(dataset, features=None, distance=None, radius=1.0, min_core_neighbors=10, verbose=True)

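Create a DBSCAN clustering model. The DBSCAN method partitions the input dataset into three types of points, based on the estimated probability density at each point:

- Core points have at least min_core_neighbors neighbors within distance radius.
- Boundary points are within distance radius of a core point, but do not have enough neighbors of their own to be core points.
- Noise points are all remaining points; they are not assigned to any cluster.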

Clusters are formed by connecting core points that are neighbors of each other, then assigning boundary points to their nearest core neighbor’s cluster.
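The clustering rule above can be sketched in plain Python. This is a minimal illustration, not Turi Create's implementation: the function name `dbscan_sketch` is hypothetical, it uses a brute-force Euclidean neighbor search, and it assumes a point is not counted as its own neighbor.

```python
import math
from collections import deque


def dbscan_sketch(points, radius=1.0, min_core_neighbors=10):
    """Return (cluster_ids, kinds) for a list of numeric tuples.

    Illustrative only: brute-force search, Euclidean distance, and a
    neighbor count that excludes the point itself (an assumption).
    """
    n = len(points)
    # Radius neighborhood of every point, excluding the point itself.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    is_core = [len(nbrs) >= min_core_neighbors for nbrs in neighbors]

    cluster_ids = [None] * n
    next_id = 0
    # Connect core points that are neighbors of each other (breadth-first).
    for i in range(n):
        if not is_core[i] or cluster_ids[i] is not None:
            continue
        cluster_ids[i] = next_id
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if is_core[q] and cluster_ids[q] is None:
                    cluster_ids[q] = next_id
                    queue.append(q)
        next_id += 1

    # Assign boundary points to the cluster of their nearest core neighbor;
    # points with no core neighbor are noise and keep cluster id None.
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
            continue
        core_nbrs = [j for j in neighbors[i] if is_core[j]]
        if core_nbrs:
            nearest = min(core_nbrs,
                          key=lambda j: math.dist(points[i], points[j]))
            cluster_ids[i] = cluster_ids[nearest]
            kinds.append("boundary")
        else:
            kinds.append("noise")
    return cluster_ids, kinds
```

Calling `dbscan_sketch(points, radius=1.5, min_core_neighbors=2)` returns two parallel lists: a cluster id per point (`None` for noise) and a type label per point.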

Parameters:

dataset : SFrame
    Training data, with each row corresponding to an observation. Must include all features specified in the features parameter, but may have additional columns as well.

features : list[str], optional
    Names of the columns with features to use in comparing records. ‘None’ (the default) indicates that all columns of the input dataset should be used to train the model. All features must be numeric, i.e. integer or float types.

distance : str or list[list], optional
    Function to measure the distance between any two input data rows. This may be one of two types:

    - String: the name of a standard distance function. One of ‘euclidean’, ‘squared_euclidean’, ‘manhattan’, ‘levenshtein’, ‘jaccard’, ‘weighted_jaccard’, ‘cosine’, or ‘transformed_dot_product’.
    - Composite distance: the weighted sum of several standard distance functions applied to various features. This is specified as a list of distance components, each of which is itself a list containing three items:

        1. list or tuple of feature names (str)
        2. standard distance name (str)
        3. scaling factor (int or float)

    For more information about Turi Create distance functions, please see the distances module.

    For sparse vectors, missing keys are assumed to have value 0.0.

    If ‘distance’ is left unspecified, a composite distance is constructed automatically based on feature types.

radius : int or float, optional
    Size of each point’s neighborhood, with respect to the specified distance function.

min_core_neighbors : int, optional
    Number of neighbors that must be within distance radius of a point in order for that point to be considered a “core point” of a cluster.

verbose : bool, optional
    If True, print progress updates and model details during model creation.

Returns:

out : DBSCANModel
    A model containing a cluster label for each row in the input dataset. Also contains the indices of the core points, cluster boundary points, and noise points.
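
The composite distance format described under the distance parameter (a list of [feature names, distance name, scaling factor] components, combined as a weighted sum) can be illustrated in plain Python. This is a sketch only: `STANDARD_DISTANCES` and `composite_distance` are hypothetical names, only three of the listed distances are implemented, and the feature names `x1`, `x2`, `x3` are made up for the example.

```python
import math

# Hypothetical lookup table with a few of the standard distances, written
# out by hand for illustration; this is not the library's distances module.
STANDARD_DISTANCES = {
    "euclidean": lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))),
    "squared_euclidean": lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)),
    "manhattan": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
}


def composite_distance(spec, row_a, row_b):
    """Weighted sum of standard distances, each applied to its own features."""
    total = 0.0
    for feature_names, distance_name, weight in spec:
        dist_fn = STANDARD_DISTANCES[distance_name]
        a = [row_a[name] for name in feature_names]
        b = [row_b[name] for name in feature_names]
        total += weight * dist_fn(a, b)
    return total


# A composite distance in the three-item component format: Euclidean
# distance on (x1, x2) plus half the Manhattan distance on x3.
spec = [
    [["x1", "x2"], "euclidean", 1.0],
    [["x3"], "manhattan", 0.5],
]
```

A list shaped like `spec` is the kind of value the distance parameter accepts in its composite form; the evaluation function is included only to show what "weighted sum" means here.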

Examples

>>> import turicreate
>>> sf = turicreate.SFrame({
...     'x1': [0.6777, -9.391, 7.0385, 2.2657, 7.7864, -10.16, -8.162,
...            8.8817, -9.525, -9.153, 2.0860, 7.6619, 6.5511, 2.7020],
...     'x2': [5.6110, 8.5139, 5.3913, 5.4743, 8.3606, 7.8843, 2.7305,
...            5.1679, 6.7231, 3.7051, 1.7682, 7.4608, 3.1270, 6.5624]})
...
>>> model = turicreate.dbscan.create(sf, radius=4.25, min_core_neighbors=3)
>>> model.cluster_id.print_rows(15)
+--------+------------+----------+
| row_id | cluster_id |   type   |
+--------+------------+----------+
|   8    |     0      |   core   |
|   7    |     2      |   core   |
|   0    |     1      |   core   |
|   2    |     2      |   core   |
|   3    |     1      |   core   |
|   11   |     2      |   core   |
|   4    |     2      |   core   |
|   1    |     0      | boundary |
|   6    |     0      | boundary |
|   5    |     0      | boundary |
|   9    |     0      | boundary |
|   12   |     2      | boundary |
|   10   |     1      | boundary |
|   13   |     1      | boundary |
+--------+------------+----------+
[14 rows x 3 columns]
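
As a rough illustration of reading this output, the rows can be tallied by point type and by cluster. The sketch below copies the rows from the printed table into plain Python dicts; in the library the table is an SFrame, and the helper name `summarize` is hypothetical.

```python
from collections import Counter

# Rows copied from the example output, as plain dicts that mirror the
# row_id / cluster_id / type columns (not an actual SFrame).
rows = [
    {"row_id": 8, "cluster_id": 0, "type": "core"},
    {"row_id": 7, "cluster_id": 2, "type": "core"},
    {"row_id": 0, "cluster_id": 1, "type": "core"},
    {"row_id": 2, "cluster_id": 2, "type": "core"},
    {"row_id": 3, "cluster_id": 1, "type": "core"},
    {"row_id": 11, "cluster_id": 2, "type": "core"},
    {"row_id": 4, "cluster_id": 2, "type": "core"},
    {"row_id": 1, "cluster_id": 0, "type": "boundary"},
    {"row_id": 6, "cluster_id": 0, "type": "boundary"},
    {"row_id": 5, "cluster_id": 0, "type": "boundary"},
    {"row_id": 9, "cluster_id": 0, "type": "boundary"},
    {"row_id": 12, "cluster_id": 2, "type": "boundary"},
    {"row_id": 10, "cluster_id": 1, "type": "boundary"},
    {"row_id": 13, "cluster_id": 1, "type": "boundary"},
]


def summarize(rows):
    """Count points per type and per cluster id."""
    by_type = Counter(r["type"] for r in rows)
    by_cluster = Counter(r["cluster_id"] for r in rows)
    return by_type, by_cluster
```

For this table, every row is either a core or a boundary point, so no row is labeled as noise.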