turicreate.text_classifier.create — Turi Create API 6.4.1 documentation

turicreate.text_classifier.create(dataset, target, features=None, drop_stop_words=True, word_count_threshold=2, method='auto', validation_set='auto', max_iterations=10, l2_penalty=0.2)

Create a model that classifies text from a collection of documents. The model is a LogisticClassifier trained using a bag-of-words representation of the text dataset.

Parameters:

dataset : SFrame
    Contains one or more columns of text data. This can be unstructured text data, such as that appearing in forums, user-generated reviews, etc.

target : str
    The column name containing class labels for each document.

features : list[str], optional
    The column names of interest containing text data. Each provided column must be of type str. Defaults to using all columns of type str.

drop_stop_words : bool, optional
    Ignore very common words, e.g. "the", "a", "is". For the complete list of stop words, see text_classifier.drop_words().

word_count_threshold : int, optional
    Words that occur fewer than this many times in the entire dataset are ignored.

method : str, optional
    Method to use for feature engineering and modeling. Currently only bag-of-words with a logistic classifier ('bow-logistic') is available.

validation_set : SFrame, optional
    A dataset for monitoring the model's generalization performance. For each row of the progress table, the chosen metrics are computed for both the provided training dataset and the validation_set. The format of this SFrame must be the same as the training set. By default this argument is set to 'auto' and a validation set is automatically sampled and used for progress printing. If validation_set is set to None, no additional metrics are computed.

max_iterations : int, optional
    The maximum number of allowed passes through the data. More passes over the data can result in a more accurately trained model. Consider increasing this (the default value is 10) if the training accuracy is low and the Grad-Norm in the display is large.

l2_penalty : float, optional
    Weight on the L2 regularization of the model. The larger this weight, the more the model coefficients shrink toward 0. This introduces bias into the model but decreases variance, potentially leading to better predictions. The default value is 0.2; setting this parameter to 0 corresponds to unregularized logistic regression. See the ridge regression reference for more detail. (A sketch combining several of these options appears after the Returns entry below.)
Returns: out : TextClassifier
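
As an illustrative sketch of the tuning parameters described above, the options can be passed directly to create. The toy SFrame and its column names ('rating', 'text') are assumptions for illustration only; the exact training behavior depends on your data.

import turicreate as tc

# Toy dataset; column names are illustrative.
data = tc.SFrame({'rating': [1, 5, 1, 5],
                  'text': ['hate it', 'love it', 'awful product', 'great value']})

# Keep stop words, count every word (threshold of 1), train for more passes,
# use a stronger L2 penalty, and skip the automatic validation split.
model = tc.text_classifier.create(data, 'rating',
                                  features=['text'],
                                  drop_stop_words=False,
                                  word_count_threshold=1,
                                  max_iterations=25,
                                  l2_penalty=0.5,
                                  validation_set=None)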

See also

text_classifier.stop_words, text_classifier.drop_words

Examples

import turicreate as tc
dataset = tc.SFrame({'rating': [1, 5], 'text': ['hate it', 'love it']})

m = tc.text_classifier.create(dataset, 'rating', features=['text'])
m.predict(dataset)
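The trained model can also be applied to new, unlabeled documents. As a sketch, the new SFrame below is an assumption for illustration; it only needs the same feature column ('text') used at training time.

new_docs = tc.SFrame({'text': ['really love it', 'hate the design']})
predictions = m.predict(new_docs)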

You may also evaluate predictions against data with known labels.

metrics = m.evaluate(dataset)
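
The returned metrics object is a dictionary of evaluation results. As a sketch (the exact keys depend on the metrics computed, but accuracy is typically among them):

# Print one of the evaluation results; 'accuracy' is assumed to be present.
print(metrics['accuracy'])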