MecabOptions - Options for MeCab tokenization - MATLAB
Options for MeCab tokenization
Description
A mecabOptions object specifies additional options for tokenizing Japanese and Korean text.
To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.
Creation
Syntax
Description
`options = mecabOptions`
creates a MeCab tokenization option set with the default values for tokenizing Japanese.
`options = mecabOptions(PropertyName=Value)`
additionally sets properties using one or more name-value arguments.
Properties
Model
Path to trained model (MeCab dictionary), specified as a string scalar or a character vector.
The default value is a path to the internal dictionary for Japanese tokenization.
Example: "C:\myDict"
Data Types: char | string
UserModel
Files containing model extensions (MeCab user dictionary .dic files), specified as a string array, a character vector, or a cell array of character vectors.
Example: "C:\myFile.dic"
Data Types: char | string | cell
LemmaExtractor
Function extracting lemmata from the MeCab reply, specified as a function handle.
The function must have the form lemmata = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:
Feature – String vector of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.
The output lemmata is a string array of the same size as words containing the extracted lemmata.
The default lemma extractor is the textanalytics.ja.mecabToLemma function.
Data Types: function_handle
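As a minimal sketch of the lemmata = fun(words,info) contract described above, the following defines a pass-through extractor that returns each token's surface form as its lemma, which effectively disables lemmatization. The name surfaceAsLemma is hypothetical and chosen only for illustration; the sketch ignores info entirely and makes no assumption about the contents of info.Feature.

```matlab
% Hypothetical pass-through lemma extractor (illustration only).
% It follows the required form lemmata = fun(words,info): the output is a
% string array with the same size as words. The info struct is accepted
% but not used.
surfaceAsLemma = @(words, info) reshape(string(words), size(words));

% Call it directly on sample tokens to check the shape of the output.
sampleWords = ["恋" "に" "悩み"];
sampleInfo = struct('Feature', ["" "" ""], 'PartOfSpeech', [0 0 0]);
lemmata = surfaceAsLemma(sampleWords, sampleInfo);
```

To use such an extractor, pass the handle when creating the options, for example mecabOptions('LemmaExtractor',surfaceAsLemma).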
POSExtractor
Function extracting part-of-speech information from the MeCab reply, specified as a function handle.
The function must have the form posTags = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:
Feature – String vector of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.
The output posTags is a categorical array of the same size as words containing the extracted part-of-speech tags from the following categories:
adjective
adposition
adverb
auxiliary-verb
coord-conjunction
determiner
interjection
noun
numeral
pronoun
proper-noun
punctuation
symbol
verb
other
The default part-of-speech information extractor is the textanalytics.ja.mecabToPOS function.
Data Types: function_handle
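The posTags = fun(words,info) contract above can be sketched as follows. This is an illustration under a loud assumption: that each entry of info.Feature begins with an IPADIC-style Japanese part-of-speech name followed by a comma (for example "名詞,一般,..."). The actual layout of info.Feature depends on the dictionary and is not confirmed here, so inspect it before relying on this mapping. The names posCategories, jaNames, tagNames, firstField, lookupIndex, and simplePOS are all hypothetical.

```matlab
% Allowed output categories, as listed in the property description.
posCategories = ["adjective" "adposition" "adverb" "auxiliary-verb" ...
    "coord-conjunction" "determiner" "interjection" "noun" "numeral" ...
    "pronoun" "proper-noun" "punctuation" "symbol" "verb" "other"];

% ASSUMPTION: Feature entries start with one of these IPADIC-style names.
% tagNames has one extra trailing entry, "other", for unrecognized names.
jaNames  = ["名詞" "動詞" "形容詞" "副詞" "助詞" "助動詞"];
tagNames = ["noun" "verb" "adjective" "adverb" "adposition" ...
    "auxiliary-verb" "other"];

% Text before the first comma of each feature line.
firstField = @(feature) extractBefore(string(feature) + ",", ",");

% Index of each name in jaNames; unknown names get the final "other" slot.
lookupIndex = @(names) arrayfun(@(n) find([jaNames == n, true], 1), names);

% Extractor with the required form posTags = fun(words,info). The result
% is a categorical array with the same size as words.
simplePOS = @(words, info) reshape( ...
    categorical(tagNames(lookupIndex(firstField(info.Feature))), posCategories), ...
    size(words));
```

A handle like this could then be supplied as mecabOptions('POSExtractor',simplePOS). Tokens whose leading field is not in jaNames are tagged other rather than guessed.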
NERExtractor
Function extracting named entity information from the MeCab reply, specified as a function handle.
The function must have the form entities = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:
Feature – String vector of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.
The output entities is a categorical array of the same size as words containing the extracted entities from the following categories:
non-entity
person
organization
location
other
The default named entity information extractor is the textanalytics.ja.mecabToNER function.
Data Types: function_handle
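One way to satisfy the entities = fun(words,info) contract above is a simple word-list lookup, sketched below. The list knownLocations and the names entityCategories, labels, and gazetteerNER are hypothetical and exist only for illustration. The sketch looks at the tokens alone and ignores info, so it is far cruder than a dictionary-driven extractor.

```matlab
% Allowed output categories, as listed in the property description.
entityCategories = ["non-entity" "person" "organization" "location" "other"];

% Hypothetical word list: tokens found here are tagged as locations.
knownLocations = ["東京" "大阪"];

% labels(1) is used for tokens not in the list, labels(2) for tokens in it.
labels = ["non-entity" "location"];

% Extractor with the required form entities = fun(words,info). The result
% is a categorical array with the same size as words.
gazetteerNER = @(words, info) reshape( ...
    categorical(labels(1 + ismember(words, knownLocations)), entityCategories), ...
    size(words));

% Call it directly on sample tokens.
entities = gazetteerNER(["東京" "に" "行く"], struct());
```

A handle like this could then be supplied as mecabOptions('NERExtractor',gazetteerNER).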
Examples
Create a MecabOptions object containing the default options for Japanese tokenization.
options = mecabOptions
options = MecabOptions with properties:
Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
UserModel: ""
LemmaExtractor: @textanalytics.ja.mecabToLemma
POSExtractor: @textanalytics.ja.mecabToPOS
NERExtractor: @textanalytics.ja.mecabToNER
Tokenize Japanese text using custom MeCab options.
Create a string array of Japanese text.
str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
Create a MecabOptions object and specify a user model as a .dic file using the 'UserModel' option.
options = mecabOptions('UserModel','myFile.dic')
options = MecabOptions with properties:
Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
UserModel: "myFile.dic"
LemmaExtractor: @textanalytics.ja.mecabToLemma
POSExtractor: @textanalytics.ja.mecabToPOS
NERExtractor: @textanalytics.ja.mecabToNER
Tokenize the text with the specified options by passing them to the 'TokenizeMethod' option of tokenizedDocument.
documents = tokenizedDocument(str,'TokenizeMethod',options)
documents = 4×1 tokenizedDocument:
6 tokens: 恋 に 悩み 、 苦しむ 。
6 tokens: 恋 の 悩み で 苦しむ 。
10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
10 tokens: 空 の 星 が 輝き を 増し て いる 。
Version History
Introduced in R2019b