addPartOfSpeechDetails - Add part-of-speech tags to documents - MATLAB

Add part-of-speech tags to documents

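Syntax

updatedDocuments = addPartOfSpeechDetails(documents)
updatedDocuments = addPartOfSpeechDetails(documents,Name,Value)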

Description

Use addPartOfSpeechDetails to add part-of-speech tags to documents.

The function supports English, Japanese, German, and Korean text.

updatedDocuments = addPartOfSpeechDetails(documents) detects parts of speech in documents and updates the token details. By default, the function retokenizes the text for part-of-speech tagging. For example, the function splits the word "you're" into the tokens "you" and "'re". To get the part-of-speech details from updatedDocuments, use tokenDetails.

updatedDocuments = addPartOfSpeechDetails(documents,Name,Value) specifies additional options using one or more name-value pair arguments.
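The following is a minimal sketch of the default behavior described above, assuming Text Analytics Toolbox is installed. The sample sentence is made up for illustration; the tags assigned to each token depend on the tagger and are not shown here.

```matlab
% Tokenize a short English sentence that contains a contraction.
documents = tokenizedDocument("I think you're right.");

% Add part-of-speech details. By default, this also retokenizes the text,
% so "you're" is split into separate tokens.
updatedDocuments = addPartOfSpeechDetails(documents);

% View each token next to its part-of-speech tag.
tdetails = tokenDetails(updatedDocuments);
disp(tdetails(:,["Token" "PartOfSpeech"]))
```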

Tip

Use addPartOfSpeechDetails before using the lower, upper, erasePunctuation, normalizeWords, removeWords, and removeStopWords functions, because addPartOfSpeechDetails uses information that these functions remove.
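As a rough illustration of this ordering, the sketch below tags the text first and only then applies two of the listed preprocessing functions. The sample sentence is an assumption chosen for illustration.

```matlab
% Tokenize the raw text, keeping case, punctuation, and stop words.
documents = tokenizedDocument("The quick brown fox jumps over the lazy dog.");

% Tag parts of speech while the full sentence context is still available.
documents = addPartOfSpeechDetails(documents);

% Apply the preprocessing steps that remove information afterward.
documents = removeStopWords(documents);
documents = erasePunctuation(documents);

% The remaining tokens keep the tags assigned before preprocessing.
tdetails = tokenDetails(documents);
head(tdetails)
```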

Examples

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetsPreprocessed.txt";
str = extractFileText(filename);
textData = split(str,newline);
documents = tokenizedDocument(textData);

View the token details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)

   Token       DocumentNumber    LineNumber     Type      Language
___________    ______________    __________    _______    ________

"fairest"            1               1         letters       en   
"creatures"          1               1         letters       en   
"desire"             1               1         letters       en   
"increase"           1               1         letters       en   
"thereby"            1               1         letters       en   
"beautys"            1               1         letters       en   
"rose"               1               1         letters       en   
"might"              1               1         letters       en

Add part-of-speech details to the documents using the addPartOfSpeechDetails function. This function first adds sentence information to the documents, and then adds the part-of-speech tags to the table returned by tokenDetails. View the updated token details of the first few tokens.

documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);
head(tdetails)

   Token       DocumentNumber    SentenceNumber    LineNumber     Type      Language     PartOfSpeech 
___________    ______________    ______________    __________    _______    ________    ______________

"fairest"            1                 1               1         letters       en       adjective     
"creatures"          1                 1               1         letters       en       noun          
"desire"             1                 1               1         letters       en       noun          
"increase"           1                 1               1         letters       en       noun          
"thereby"            1                 1               1         letters       en       adverb        
"beautys"            1                 1               1         letters       en       noun          
"rose"               1                 1               1         letters       en       noun          
"might"              1                 1               1         letters       en       auxiliary-verb

Tokenize Japanese text using tokenizedDocument.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"
    "駅までは遠くて、歩けない。"
    "遠くの駅まで歩けない。"
    "すもももももももものうち。"];
documents = tokenizedDocument(str);

For Japanese text, you can get the part-of-speech details using tokenDetails. For English text, you must first use addPartOfSpeechDetails.

tdetails = tokenDetails(documents);
head(tdetails)

 Token     DocumentNumber    LineNumber       Type        Language    PartOfSpeech     Lemma       Entity  
_______    ______________    __________    ___________    ________    ____________    _______    __________

"恋"             1               1         letters           ja       noun            "恋"       non-entity
"に"             1               1         letters           ja       adposition      "に"       non-entity
"悩み"           1               1         letters           ja       verb            "悩む"      non-entity
"、"             1               1         punctuation       ja       punctuation     "、"       non-entity
"苦しむ"          1               1         letters           ja       verb            "苦しむ"    non-entity
"。"             1               1         punctuation       ja       punctuation     "。"       non-entity
"恋"             2               1         letters           ja       noun            "恋"       non-entity
"の"             2               1         letters           ja       adposition      "の"       non-entity

Tokenize German text using tokenizedDocument.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)

documents = 2×1 tokenizedDocument:

8 tokens: Guten Morgen . Wie geht es dir ?
6 tokens: Heute wird ein guter Tag .

To get the part-of-speech details for German text, first use addPartOfSpeechDetails.

documents = addPartOfSpeechDetails(documents);

To view the part-of-speech details, use the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)

 Token      DocumentNumber    SentenceNumber    LineNumber       Type        Language    PartOfSpeech
________    ______________    ______________    __________    ___________    ________    ____________

"Guten"           1                 1               1         letters           de       adjective   
"Morgen"          1                 1               1         letters           de       noun        
"."               1                 1               1         punctuation       de       punctuation 
"Wie"             1                 2               1         letters           de       adverb      
"geht"            1                 2               1         letters           de       verb        
"es"              1                 2               1         letters           de       pronoun     
"dir"             1                 2               1         letters           de       pronoun     
"?"               1                 2               1         punctuation       de       punctuation

Input Arguments

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'DiscardKnownValues',true specifies to discard previously computed details and recompute them.

RetokenizeMethod: Method to retokenize documents, specified as one of the following:

'part-of-speech' – Transform the tokens for part-of-speech tagging. This is the default. The function performs these tasks:
- Split compound words. For example, split the compound word "wanna" into the tokens "want" and "to". This includes compound words containing apostrophes. For example, the function splits the word "don't" into the tokens "do" and "n't".
- Merge periods that do not end sentences with preceding tokens. For example, merge the tokens "Mr" and "." into the token "Mr.".
- For German text, merge abbreviations that span multiple tokens. For example, merge the tokens "z", ".", "B", and "." into the single token "z. B.".
- Merge runs of periods into ellipses. For example, merge three instances of "." into the single token "...".
'none' – Do not retokenize the documents.
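To see the effect of the two retokenization methods side by side, a sketch along these lines may help. It assumes the option is passed under the name 'RetokenizeMethod', and the sample sentence is made up for illustration.

```matlab
% Tokenize a sentence containing an abbreviation and a contraction.
documents = tokenizedDocument("Mr. Smith said you don't need it.");

% Default method: retokenize for part-of-speech tagging.
docsRetok = addPartOfSpeechDetails(documents);

% Alternative: tag the existing tokens without changing them.
docsNone = addPartOfSpeechDetails(documents,'RetokenizeMethod','none');

% Convert each scalar document to a string vector of tokens to compare.
tokensOriginal = string(documents);
tokensRetok = string(docsRetok);
tokensNone = string(docsNone);

disp(tokensRetok)
disp(tokensNone)
```

With 'none', the token list is left as produced by tokenizedDocument, which can be useful when downstream code relies on the original token boundaries.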

Abbreviations: List of abbreviations for sentence detection, specified as a string array, character vector, cell array of character vectors, or a table.

If the input documents do not contain sentence details, then the function first runs the addSentenceDetails function and specifies the abbreviation list given by 'Abbreviations'. To specify more options for sentence detection (for example, sentence starters), use the addSentenceDetails function before using addPartOfSpeechDetails.

If Abbreviations is a string array, character vector, or cell array of character vectors, then the function treats these as regular abbreviations. If the next word is a capitalized sentence starter, then the function breaks at the trailing period. The function ignores any differences in the letter case of the abbreviations. To specify the sentence starters, use the Starters name-value argument of the addSentenceDetails function.

To specify different behaviors when splitting sentences at abbreviations, specify Abbreviations as a table. The table must have variables named Abbreviation and Usage, where Abbreviation contains the abbreviations, and Usage contains the type of each abbreviation. The following table describes the possible values of Usage, and the behavior of the function when passed abbreviations of these types.

Usage	Behavior	Example Abbreviation	Example Text	Detected Sentences
regular	If the next word is a capitalized sentence starter, then break at the trailing period.	"appt."	"Book an appt. We'll meet then."	"Book an appt." "We'll meet then."
regular	If the next word is not a capitalized sentence starter, then do not break at the trailing period.	"appt."	"Book an appt. today."	"Book an appt. today."
inner	Do not break after trailing period.	"Dr."	"Dr. Smith."	"Dr. Smith."
reference	If the next token is a number, then do not break at the trailing period.	"fig."	"See fig. 3."	"See fig. 3."
reference	If the next token is not a number, then break at a trailing period.	"fig."	"Try a fig. They are nice."	"Try a fig." "They are nice."
unit	If the previous word is a number and the following word is a capitalized sentence starter, then break at a trailing period.	"in."	"The height is 30 in. The width is 10 in."	"The height is 30 in." "The width is 10 in."
unit	If the previous word is a number and the following word is not capitalized, then do not break at a trailing period.	"in."	"The item is 10 in. wide."	"The item is 10 in. wide."
unit	If the previous word is not a number, then break at a trailing period.	"in."	"Come in. Sit down."	"Come in." "Sit down."

The default value is the output of the abbreviations function. For Japanese and Korean text, abbreviations do not usually impact sentence detection.
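One way to pass a table to 'Abbreviations' is to start from the output of the abbreviations function and filter it, as in the sketch below. The choice to keep only the regular abbreviations and the sample sentence are assumptions made for illustration.

```matlab
% Start from the default English abbreviation table, which has the
% variables Abbreviation and Usage.
tbl = abbreviations;

% Keep only the abbreviations whose usage type is "regular".
tblRegular = tbl(tbl.Usage == "regular",:);

% Use the filtered table for the sentence detection step.
documents = tokenizedDocument("Book an appt. We'll meet then.");
documents = addPartOfSpeechDetails(documents,'Abbreviations',tblRegular);

% The token details now include sentence numbers and part-of-speech tags.
tdetails = tokenDetails(documents);
head(tdetails)
```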

Tip

By default, the function treats single-letter abbreviations, such as "V.", or tokens with mixed single letters and periods, such as "U.S.A.", as regular abbreviations. You do not need to include these abbreviations in Abbreviations.

Data Types: char | string | table | cell

DiscardKnownValues: Option to discard previously computed details and recompute them, specified as true or false.

Data Types: logical
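A brief sketch of recomputing details on documents that were already tagged, using the 'DiscardKnownValues' option from the example above. The sample text is reused from the German example.

```matlab
% Tokenize and tag the documents once.
documents = tokenizedDocument("Guten Morgen. Wie geht es dir?");
documents = addPartOfSpeechDetails(documents);

% Discard the previously computed details and compute them again.
documents = addPartOfSpeechDetails(documents,'DiscardKnownValues',true);

tdetails = tokenDetails(documents);
head(tdetails)
```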

Output Arguments

More About

The addPartOfSpeechDetails function adds part-of-speech tags to the table returned by the tokenDetails function. The function tags each token with a categorical tag with one of the following class names:

adjective — Adjective
adposition — Adposition
adverb — Adverb
auxiliary-verb — Auxiliary verb
coord-conjunction — Coordinating conjunction
determiner — Determiner
interjection — Interjection
noun — Noun
numeral — Numeral
particle — Particle
pronoun — Pronoun
proper-noun — Proper noun
punctuation — Punctuation
subord-conjunction — Subordinating conjunction
symbol — Symbol
verb — Verb
other — Other
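
Because the tags are categorical, they can be compared against a class name to select tokens. The sketch below, using a made-up sentence, extracts the tokens tagged as nouns and prints the tag counts; which tokens receive which tag depends on the tagger.

```matlab
% Tokenize and tag a sample sentence.
documents = tokenizedDocument("The quick brown fox jumps over the lazy dog.");
documents = addPartOfSpeechDetails(documents);
tdetails = tokenDetails(documents);

% Select the tokens whose tag has the class name "noun".
isNoun = tdetails.PartOfSpeech == "noun";
nouns = tdetails.Token(isNoun);
disp(nouns)

% Print how many tokens received each tag.
summary(tdetails.PartOfSpeech)
```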

Algorithms

If the input documents do not contain sentence details, then the function first runs addSentenceDetails.

Version History

Introduced in R2018b