normalizeWords - Stem or lemmatize words - MATLAB

Syntax

updatedDocuments = normalizeWords(documents)
updatedWords = normalizeWords(words)
updatedWords = normalizeWords(words,'Language',language)
___ = normalizeWords(___,'Style',style)

Description

Use normalizeWords to reduce words to a root form. To lemmatize English words (reduce them to their dictionary forms), set the 'Style' option to 'lemma'.

The function supports English, Japanese, German, and Korean text.

updatedDocuments = normalizeWords(documents) reduces the words in documents to a root form. For English and German text, the function, by default, stems the words using the Porter stemmer. For Japanese and Korean text, the function, by default, lemmatizes the words using the MeCab tokenizer.


updatedWords = normalizeWords(words) reduces each word in the string array words to a root form.


updatedWords = normalizeWords(words,'Language',language) reduces the words and also specifies the word language.

___ = normalizeWords(___,'Style',style) also specifies the normalization style. For example, normalizeWords(documents,'Style','lemma') lemmatizes the words in the input documents.
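You can combine the name-value options. The following minimal sketch (the input words are illustrative and not from the shipped examples) lemmatizes English words in a string array:

words = ["building" "buildings" "built"];
updatedWords = normalizeWords(words,'Style','lemma','Language','en')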


Examples


Stem the words in a document array using the Porter stemmer.

documents = tokenizedDocument([
    "a strongly worded collection of words"
    "another collection of words"]);
newDocuments = normalizeWords(documents)

newDocuments = 2×1 tokenizedDocument:

6 tokens: a strongli word collect of word
4 tokens: anoth collect of word

Stem the words in a string array using the Porter stemmer. Each element of the string array must be a single word.

words = ["a" "strongly" "worded" "collection" "of" "words"]; newWords = normalizeWords(words)

newWords = 1×6 string
    "a"    "strongli"    "word"    "collect"    "of"    "word"
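Because a character vector is treated as a single word, you can also stem one word at a time. Based on the output above, the following call should return the stemmed form 'strongli':

newWord = normalizeWords('strongly')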

Lemmatize the words in a document array.

documents = tokenizedDocument([
    "I am building a house."
    "The building has two floors."]);
newDocuments = normalizeWords(documents,'Style','lemma')

newDocuments = 2×1 tokenizedDocument:

6 tokens: i be build a house .
6 tokens: the build have two floor .

To improve the lemmatization, first add part-of-speech details to the documents using the addPartOfSpeechDetails function. For example, if the documents contain part-of-speech details, then normalizeWords reduces only the verb "building" and not the noun "building".

documents = addPartOfSpeechDetails(documents);
newDocuments = normalizeWords(documents,'Style','lemma')

newDocuments = 2×1 tokenizedDocument:

6 tokens: i be build a house .
6 tokens: the building have two floor .
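To inspect the part-of-speech details that drive this behavior, view the token details. This is a brief sketch; after calling addPartOfSpeechDetails, the table returned by tokenDetails includes part-of-speech information for each token.

tdetails = tokenDetails(documents);
head(tdetails)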

Tokenize Japanese text using the tokenizedDocument function. The function automatically detects Japanese text.

str = [ "空に星が輝き、瞬いている。" "空の星が輝きを増している。" "駅までは遠くて、歩けない。" "遠くの駅まで歩けない。"]; documents = tokenizedDocument(str);

Lemmatize the tokens using normalizeWords.

documents = normalizeWords(documents)

documents = 4×1 tokenizedDocument:

10 tokens: 空 に 星 が 輝く 、 瞬く て いる 。
10 tokens: 空 の 星 が 輝き を 増す て いる 。
 9 tokens: 駅 まで は 遠い て 、 歩ける ない 。
 7 tokens: 遠く の 駅 まで 歩ける ない 。

Tokenize German text using the tokenizedDocument function. The function automatically detects German text.

str = [ "Guten Morgen. Wie geht es dir?" "Heute wird ein guter Tag."]; documents = tokenizedDocument(str);

Stem the tokens using normalizeWords.

documents = normalizeWords(documents)

documents = 2×1 tokenizedDocument:

8 tokens: gut morg . wie geht es dir ?
6 tokens: heut wird ein gut tag .
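For plain string input, the language is not detected automatically, so specify it using the 'Language' option. A minimal sketch (the words are illustrative); based on the output above, this should return the stemmed forms "gut" and "morg":

words = ["guten" "morgen"];
newWords = normalizeWords(words,'Language','de')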

Input Arguments

documents — Input documents
Input documents, specified as a tokenizedDocument array.

words — Input words
Input words, specified as a string vector, character vector, or cell array of character vectors. If you specify words as a character vector, then the function treats the argument as a single word.

Data Types: string | char | cell

Normalization style, specified as one of the following:

'stem' — Reduce words using the Porter stemmer. This option supports English and German text only.
'lemma' — Reduce words to their dictionary forms. This option supports English, Japanese, and Korean text only.

The function normalizes only tokens with type 'letters' and 'other'. For more information on token types, see tokenDetails.

Tip

For English text, to improve lemmatization of words in documents, first add part-of-speech details using the addPartOfSpeechDetails function.
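To see which tokens normalizeWords will change, inspect the token types first. The following is a short sketch (the example text is illustrative); complex tokens such as email addresses have a type other than 'letters' or 'other' and are therefore left unchanged:

documents = tokenizedDocument("emailing user@example.com");
tdetails = tokenDetails(documents);
tdetails.Type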

Word language, specified as one of the following:

'en' — English
'ja' — Japanese
'de' — German
'ko' — Korean

If you do not specify language, then the software detects the language automatically. To lemmatize Japanese or Korean text, use tokenizedDocument input.

Data Types: char | string

Output Arguments

updatedDocuments — Updated documents
Updated documents, returned as a tokenizedDocument array.

updatedWords — Updated words
Updated words, returned as a string array, character vector, or cell array of character vectors. words and updatedWords have the same data type.

Algorithms


tokenizedDocument objects contain details about the tokens, including language details. The language details of the input documents determine the behavior of normalizeWords. The tokenizedDocument function, by default, automatically detects the language of the input text. To specify the language details manually, use the Language option of tokenizedDocument. To view the token details, use the tokenDetails function.
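For example, to override automatic detection, set the language when tokenizing. A minimal sketch (the input text is illustrative):

documents = tokenizedDocument("Guten Morgen",'Language','de');
documents = normalizeWords(documents)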

Version History

Introduced in R2017b


Starting in R2018b, for tokenizedDocument input, normalizeWords normalizes only tokens with type 'letters' or 'other'. This behavior prevents the function from affecting complex tokens such as URLs and email addresses.

In previous versions, normalizeWords normalized all tokens. To reproduce this behavior, use the command updatedDocuments = docfun(@(str) normalizeWords(str),documents).
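For instance, the following sketch (the input text is illustrative) applies normalizeWords through docfun so that every token, including the URL, is normalized as in pre-R2018b versions:

documents = tokenizedDocument("visit https://www.mathworks.com");
updatedDocuments = docfun(@(str) normalizeWords(str),documents)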