turicreate.text_analytics.count_ngrams - Turi Create API 6.4.1 documentation

turicreate.text_analytics.count_ngrams(text, n=2, method='word', to_lower=True, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'], ignore_punct=True, ignore_space=True)

Return an SArray of dict type where each element contains the count for each of the n-grams that appear in the corresponding input element. The n-grams can be specified to be either character n-grams or word n-grams. The input SArray could contain strings, dicts with string keys and numeric values, or lists of strings.

Parameters:
text : SArray[str | dict | list]
    Input text data.
n : int, optional
    The number of words in each n-gram (or the number of characters, if method is "character"). An n value of 1 returns word counts.
method : {"word", "character"}, optional
    If "word", the function counts word n-grams. If "character", it counts character n-grams.
to_lower : bool, optional
    If True, all words are converted to lower case before counting.
delimiters : list[str], None, optional
    If method is "word", input strings are tokenized using the characters in this list as delimiters. Each entry in this list must contain a single character. If set to None, a Penn treebank-style tokenization is used, which contains smart handling of punctuation. If method is "character", this option is ignored.
ignore_punct : bool, optional
    If method is "character", indicates whether punctuation between words is ignored when forming n-grams. For instance, with the input SArray element "fun.games", if this parameter is set to False one tri-gram would be 'n.g'. If ignore_punct is set to True, there would be no such tri-gram (there would still be 'nga'). This parameter has no effect if method is set to "word".
ignore_space : bool, optional
    If method is "character", indicates whether spaces between words are ignored when forming n-grams. For instance, with the input SArray element "fun games", if this parameter is set to False one tri-gram would be 'n g'. If ignore_space is set to True, there would be no such tri-gram (there would still be 'nga'). This parameter has no effect if method is set to "word".

Returns:
out : SArray[dict]
    An SArray of dictionary type, where each key is the n-gram string and each value is its count.
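
To make the word n-gram behaviour described above concrete, here is a minimal pure-Python sketch for a single input string. It is not Turi Create's implementation: the names `word_ngrams_sketch` and `DEFAULT_DELIMITERS` are hypothetical, and the sketch only covers the `n`, `to_lower`, and `delimiters` parameters (the Penn treebank-style tokenization used when `delimiters` is None is not reproduced).

```python
import re
import string
from collections import Counter

# Hypothetical constant: ASCII whitespace plus ASCII punctuation,
# mirroring the default delimiter list in the signature above.
DEFAULT_DELIMITERS = tuple(string.whitespace + string.punctuation)


def word_ngrams_sketch(text, n=2, to_lower=True, delimiters=DEFAULT_DELIMITERS):
    """Count word n-grams in one string (illustrative sketch only)."""
    if to_lower:
        text = text.lower()
    # Split on any run of delimiter characters and drop empty tokens.
    pattern = "[" + re.escape("".join(delimiters)) + "]+"
    tokens = [tok for tok in re.split(pattern, text) if tok]
    # Join each window of n consecutive tokens with a single space.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return dict(Counter(grams))


counts = word_ngrams_sketch("I like big dogs. I LIKE BIG DOGS.", n=3)
```

Here `counts` holds four tri-grams: 'i like big' and 'like big dogs' with a count of 2 each, and 'big dogs i' and 'dogs i like' with a count of 1 each.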

Notes

Ignoring case (with to_lower) involves a full string copy of the SArray data. To increase speed for large documents, set to_lower to False.
Punctuation and spaces are both delimiters by default when counting word n-grams. When counting character n-grams, one may choose to ignore punctuation, spaces, neither, or both.
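
The effect of ignore_punct and ignore_space on character n-grams can be illustrated with a small pure-Python sketch. This is an approximation under stated assumptions, not the library's code: the helper name `char_ngrams_sketch` is hypothetical, and "punctuation" is assumed here to mean the ASCII characters in `string.punctuation`.

```python
import string
from collections import Counter


def char_ngrams_sketch(text, n=2, to_lower=True,
                       ignore_punct=True, ignore_space=True):
    """Count character n-grams in one string (illustrative sketch only)."""
    if to_lower:
        text = text.lower()
    # Drop punctuation and/or whitespace before sliding the window.
    kept = "".join(
        ch for ch in text
        if not (ignore_punct and ch in string.punctuation)
        and not (ignore_space and ch.isspace())
    )
    return dict(Counter(kept[i:i + n] for i in range(len(kept) - n + 1)))


with_punct = char_ngrams_sketch("fun.games", n=3, ignore_punct=False)
without_punct = char_ngrams_sketch("fun.games", n=3, ignore_punct=True)
```

With ignore_punct=False the tri-gram 'n.g' is present in `with_punct`. With ignore_punct=True it is absent from `without_punct`, while 'nga' appears instead.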

Examples

import turicreate

Counting word n-grams:

sa = turicreate.SArray(['I like big dogs. I LIKE BIG DOGS.'])
turicreate.text_analytics.count_ngrams(sa, 3)
dtype: dict
Rows: 1
[{'big dogs i': 1, 'like big dogs': 2, 'dogs i like': 1, 'i like big': 2}]

Counting character n-grams:

sa = turicreate.SArray(['Fun. Is. Fun'])
turicreate.text_analytics.count_ngrams(sa, 3, "character")
dtype: dict
Rows: 1
[{'fun': 2, 'nis': 1, 'sfu': 1, 'isf': 1, 'uni': 1}]

Run count_ngrams with dictionary input:

sa = turicreate.SArray([{'alice bob': 1, 'Bob alice': 0.5}, {'a dog': 0, 'a dog cat': 5}])
turicreate.text_analytics.count_ngrams(sa)
dtype: dict
Rows: 2
[{'bob alice': 0.5, 'alice bob': 1}, {'dog cat': 5, 'a dog': 5}]
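
The dictionary output above is consistent with each key being tokenized on its own and each resulting n-gram being credited with that key's numeric value. The sketch below encodes that reading. It is an assumption inferred from the example, not a documented rule: the helper names are hypothetical, tokenization is simplified to whitespace splitting, and how the library treats n-grams whose total weight is 0 is not covered.

```python
from collections import Counter


def _ngrams(text, n):
    # Simplified tokenization (assumption): lower-case, split on whitespace.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def weighted_ngrams_sketch(weights, n=2):
    """Sum each key's value into every n-gram found in that key."""
    totals = Counter()
    for key, weight in weights.items():
        for gram in _ngrams(key, n):
            totals[gram] += weight
    return dict(totals)


first = weighted_ngrams_sketch({"alice bob": 1, "Bob alice": 0.5})
second = weighted_ngrams_sketch({"a dog": 0, "a dog cat": 5})
```

Under this reading, 'a dog' in the second element receives 0 from the key 'a dog' and 5 from the key 'a dog cat', which matches the total of 5 shown in the output above.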

Run count_ngrams with list input:

sa = turicreate.SArray([['one', 'bar bah'], ['a dog', 'a dog cat']])
turicreate.text_analytics.count_ngrams(sa)
dtype: dict
Rows: 2
[{'bar bah': 1}, {'dog cat': 1, 'a dog': 2}]
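
In the list output above, no n-gram spans two list entries: 'one' and 'bar bah' do not produce 'one bar'. A minimal sketch of that behaviour follows, assuming each string in the list is tokenized separately and the counts are then summed. The helper name `list_ngrams_sketch` is hypothetical and tokenization is simplified to whitespace splitting.

```python
from collections import Counter


def list_ngrams_sketch(strings, n=2):
    """Count n-grams per list entry, then sum the counts (sketch only)."""
    totals = Counter()
    for text in strings:
        # Simplified tokenization (assumption): lower-case, split on whitespace.
        tokens = text.lower().split()
        # Windows never cross from one list entry into the next.
        totals.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return dict(totals)


first = list_ngrams_sketch(["one", "bar bah"])
second = list_ngrams_sketch(["a dog", "a dog cat"])
```

The single-word entry 'one' contributes no bi-gram at all, which is why the first output element contains only 'bar bah'.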