turicreate.text_analytics.count_words — Turi Create API 6.4.1 documentation

turicreate.text_analytics.count_words(text, to_lower=True, delimiters=['\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~'])

If text is an SArray of strings or an SArray of lists of strings, the occurrences of each word are counted for each row in the SArray.

If text is an SArray of dictionaries, each key is tokenized and its value is used as the count for every word in that key. Counts for the same word within the same row are added together.

This output is commonly known as the “bag-of-words” representation of text data.
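
As a rough illustration of the bag-of-words output for string and list rows, the pure-Python sketch below counts tokens per row. It is not Turi Create's implementation: count_words_sketch and DEFAULT_DELIMS are hypothetical names, plain Python lists stand in for an SArray, and the delimiter set is assumed to be whitespace plus ASCII punctuation, mirroring the default list in the signature.

```python
import re
import string
from collections import Counter

# Assumption: whitespace plus ASCII punctuation, mirroring the documented default.
DEFAULT_DELIMS = string.whitespace + string.punctuation


def count_words_sketch(rows, to_lower=True, delimiters=DEFAULT_DELIMS):
    """Return one {word: count} dict per row (a row is a str or a list of str)."""
    # A character class that matches any run of delimiter characters.
    splitter = re.compile("[" + re.escape("".join(delimiters)) + "]+")
    bags = []
    for row in rows:
        # Treat a plain string as a one-element list of strings.
        pieces = [row] if isinstance(row, str) else row
        counts = Counter()
        for piece in pieces:
            if to_lower:
                piece = piece.lower()
            # Skip empty tokens left by leading or trailing delimiters.
            counts.update(tok for tok in splitter.split(piece) if tok)
        bags.append(dict(counts))
    return bags


print(count_words_sketch(["The cat saw the other cat"]))
# [{'the': 2, 'cat': 2, 'saw': 1, 'other': 1}]
print(count_words_sketch([["one", "bar bah"], ["a dog", "a dog cat"]]))
# [{'one': 1, 'bar': 1, 'bah': 1}, {'a': 2, 'dog': 2, 'cat': 1}]
```

Each row is counted independently, so the result has one dictionary per input row.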

Parameters:

text : SArray[str | dict | list]
    SArray of type string, dict, or list.

to_lower : bool, optional
    If True, all strings are converted to lower case before counting.

delimiters : list[str], None, optional
    Input strings are tokenized using the delimiter characters in this list. Each entry in this list must contain a single character. If set to None, Penn treebank-style tokenization is used, which handles punctuation intelligently.

Returns:

out : SArray[dict]
    An SArray with the same length as the `text` input. For each row, the keys of the dictionary are the words and the values are the corresponding counts.
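
The default delimiters list in the signature is long, so it can help to see that it contains the same characters as Python's string.whitespace and string.punctuation. The sketch below checks that and adds a small pre-flight check for the single-character rule. validate_delimiters is a hypothetical helper written for this illustration; it is not part of Turi Create.

```python
import string

# The default list, copied from the signature above.
DOCUMENTED_DEFAULT = [
    '\r', '\x0b', '\n', '\x0c', '\t', ' ', '!', '#', '$', '%', '&', "'", '"',
    '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@',
    '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~',
]

# The same characters, built from the standard library constants.
COMPACT_DEFAULT = list(string.whitespace + string.punctuation)


def validate_delimiters(delimiters):
    """Hypothetical helper: enforce the documented single-character rule."""
    if delimiters is None:
        # None selects Penn treebank-style tokenization, so nothing to check.
        return
    bad = [d for d in delimiters if not isinstance(d, str) or len(d) != 1]
    if bad:
        raise ValueError("each delimiter must be a single character: %r" % bad)


print(set(COMPACT_DEFAULT) == set(DOCUMENTED_DEFAULT))  # True
validate_delimiters(list(string.whitespace))  # passes silently
```

A custom list such as list(string.whitespace) would be passed as the delimiters argument; with whitespace-only delimiters, punctuation would presumably stay attached to the neighboring word.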

Examples

import turicreate

Create input data

sa = turicreate.SArray(["The quick brown fox jumps.", "Word word WORD, word!!!word"])

Run count_words

turicreate.text_analytics.count_words(sa)
dtype: dict
Rows: 2
[{'quick': 1, 'brown': 1, 'the': 1, 'fox': 1, 'jumps': 1},
 {'word': 5}]

Run count_words with Penn treebank-style tokenization to handle punctuation

turicreate.text_analytics.count_words(sa, delimiters=None)
dtype: dict
Rows: 2
[{'brown': 1, 'jumps': 1, 'fox': 1, '.': 1, 'quick': 1, 'the': 1},
 {'word': 3, 'word!!!word': 1, ',': 1}]

Run count_words with dictionary input

sa = turicreate.SArray([{'alice bob': 1, 'Bob alice': 0.5},
                        {'a dog': 0, 'a dog cat': 5}])
turicreate.text_analytics.count_words(sa)
dtype: dict
Rows: 2
[{'bob': 1.5, 'alice': 1.5},
 {'a': 5, 'dog': 5, 'cat': 5}]
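
The dictionary case is the least obvious one, so here is a minimal pure-Python sketch of how those totals can arise: each key is lower-cased and tokenized, and the key's value is added to the running total of every token it contains. The names tokenize and count_words_dict_sketch are hypothetical, plain Python dicts stand in for SArray rows, and the whitespace-plus-punctuation delimiter set is an assumption based on the documented default. How the library treats a token that repeats inside a single key is not stated in the source; this sketch adds the value once per occurrence.

```python
import string
from collections import defaultdict

# Assumption: same characters as the documented default delimiters.
DELIMS = set(string.whitespace + string.punctuation)


def tokenize(text):
    """Split lower-cased text on any delimiter character, dropping empties."""
    tokens, current = [], []
    for ch in text.lower():
        if ch in DELIMS:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def count_words_dict_sketch(rows):
    """For each dict row, add every key's value to each token in that key."""
    out = []
    for row in rows:
        totals = defaultdict(float)
        for key, value in row.items():
            for tok in tokenize(key):
                totals[tok] += value
        out.append(dict(totals))
    return out


rows = [{'alice bob': 1, 'Bob alice': 0.5}, {'a dog': 0, 'a dog cat': 5}]
print(count_words_dict_sketch(rows))
# [{'alice': 1.5, 'bob': 1.5}, {'a': 5.0, 'dog': 5.0, 'cat': 5.0}]
```

In the first row, 'alice' and 'bob' each receive 1 from the first key and 0.5 from the second, which gives the 1.5 shown in the example output.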

Run count_words with list input

sa = turicreate.SArray([['one', 'bar bah'], ['a dog', 'a dog cat']])
turicreate.text_analytics.count_words(sa)
dtype: dict
Rows: 2
[{'bar': 1, 'bah': 1, 'one': 1},
 {'a': 2, 'dog': 2, 'cat': 1}]