MecabOptions - Options for MeCab tokenization - MATLAB

Options for MeCab tokenization

Description

A mecabOptions object specifies additional options for tokenizing Japanese and Korean text.

To tokenize using the specified MeCab tokenization options, use the 'TokenizeMethod' option of tokenizedDocument.

Creation

Syntax

options = mecabOptions
options = mecabOptions(PropertyName=Value)

Description

`options` = mecabOptions creates a MeCab tokenization option set with the default values for tokenizing Japanese.

`options` = mecabOptions(PropertyName=Value) additionally sets properties using one or more name-value arguments.

Properties

Model – Path to trained model (MeCab dictionary), specified as a string scalar or a character vector.

The default value is a path to the internal dictionary for Japanese tokenization.

Example: "C:\myDict"

Data Types: char | string

UserModel – Files containing model extensions (MeCab user dictionary .dic files), specified as a string array, a character vector, or a cell array of character vectors.

Example: "C:\myFile.dic"

Data Types: char | string | cell

LemmaExtractor – Function for extracting lemmata from the MeCab reply, specified as a function handle.

The function must have the form lemmata = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

Feature – String vector of tokens of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output lemmata is a string array of the same size as words containing the extracted lemmata.

The default lemma extractor is the textanalytics.ja.mecabToLemma function.

Data Types: function_handle
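As an illustration of the required form, the following is a minimal sketch of a custom lemma extractor. The function name myLemmaExtractor is hypothetical, and the sketch assumes that each Feature entry is a comma-separated string whose seventh field holds the base form, as in the IPADIC feature layout. Check the Feature strings your dictionary actually produces before relying on this field position.

```matlab
function lemmata = myLemmaExtractor(words,info)
% Hypothetical custom lemma extractor (not part of the toolbox).
% Assumption: info.Feature(k) is a comma-separated string and its 7th
% field is the base form of words(k), as in the IPADIC layout.
    baseFormField = 7;

    % Start from the surface forms so the output has the same size as
    % words and unparsed tokens fall back to the token itself.
    lemmata = words;

    for k = 1:numel(words)
        fields = split(info.Feature(k),",");
        hasField = numel(fields) >= baseFormField;
        % "*" marks an unavailable field in the assumed layout.
        if hasField && fields(baseFormField) ~= "*" ...
                && strlength(fields(baseFormField)) > 0
            lemmata(k) = fields(baseFormField);
        end
    end
end
```

With the function saved on the MATLAB path, it could be passed as mecabOptions('LemmaExtractor',@myLemmaExtractor).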

POSExtractor – Function for extracting part-of-speech information from the MeCab reply, specified as a function handle.

The function must have the form posTags = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

Feature – String vector of tokens of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output posTags is a categorical array of the same size as words containing the extracted part-of-speech tags from the following categories:

adjective
adposition
adverb
auxiliary-verb
coord-conjunction
determiner
interjection
noun
numeral
pronoun
proper-noun
punctuation
symbol
verb
other

The default part-of-speech information extractor is the textanalytics.ja.mecabToPOS function.

Data Types: function_handle
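The following sketch shows one way a custom part-of-speech extractor of this form might look. The function name myPOSExtractor is hypothetical, and the mapping from Japanese tag names to the categories above is a rough illustration. It assumes that the first comma-separated field of each Feature entry is the top-level IPADIC part-of-speech name; tokens with any other value are tagged as other.

```matlab
function posTags = myPOSExtractor(words,info)
% Hypothetical custom part-of-speech extractor (not part of the toolbox).
% Assumption: the first comma-separated field of info.Feature(k) is the
% top-level IPADIC part-of-speech name of words(k).
    valueSet = ["adjective" "adposition" "adverb" "auxiliary-verb" ...
        "coord-conjunction" "determiner" "interjection" "noun" ...
        "numeral" "pronoun" "proper-noun" "punctuation" "symbol" ...
        "verb" "other"];

    % Illustrative mapping only; a real mapping would also inspect the
    % finer-grained fields (for example, to separate proper nouns).
    jaNames = ["名詞" "動詞" "形容詞" "副詞" "助詞" "助動詞" ...
        "接続詞" "連体詞" "感動詞" "記号"];
    enNames = ["noun" "verb" "adjective" "adverb" "adposition" ...
        "auxiliary-verb" "coord-conjunction" "determiner" ...
        "interjection" "symbol"];

    tags = repmat("other",size(words));
    for k = 1:numel(words)
        % Append a comma so extractBefore always finds a delimiter.
        topLevel = extractBefore(info.Feature(k) + ",",",");
        idx = find(jaNames == topLevel,1);
        if ~isempty(idx)
            tags(k) = enNames(idx);
        end
    end

    % Fix the category set so every output shares the same categories.
    posTags = categorical(tags,valueSet);
end
```

Passing valueSet to categorical keeps all 15 categories defined even when a given document uses only a few of them.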

NERExtractor – Function for extracting named entity information from the MeCab reply, specified as a function handle.

The function must have the form entities = fun(words,info), where words is a string vector of tokens and info is a struct with the following fields:

Feature – String vector of tokens of the same size as words containing the MeCab output lines in ChaSen format without the split tokens themselves.
PartOfSpeech – Numerical code used inside the dictionary for the part-of-speech classification.

The output entities is a categorical array of the same size as words containing the extracted entities from the following categories:

non-entity
person
organization
location
other

The default named entity extractor is the textanalytics.ja.mecabToNER function.

Data Types: function_handle
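A custom named entity extractor follows the same pattern. The sketch below uses the hypothetical name myNERExtractor and assumes IPADIC-style Feature strings in which proper nouns begin with 名詞,固有名詞 followed by a subtype such as 人名 (person), 組織 (organization), or 地域 (location). Other dictionaries may encode this information differently.

```matlab
function entities = myNERExtractor(words,info)
% Hypothetical custom named entity extractor (not part of the toolbox).
% Assumption: info.Feature(k) uses the IPADIC layout, where proper
% nouns start with "名詞,固有名詞," followed by a subtype.
    valueSet = ["non-entity" "person" "organization" "location" "other"];

    labels = repmat("non-entity",size(words));
    for k = 1:numel(words)
        feature = info.Feature(k);
        if startsWith(feature,"名詞,固有名詞,人名")
            labels(k) = "person";
        elseif startsWith(feature,"名詞,固有名詞,組織")
            labels(k) = "organization";
        elseif startsWith(feature,"名詞,固有名詞,地域")
            labels(k) = "location";
        elseif startsWith(feature,"名詞,固有名詞")
            % Proper noun without a recognized subtype.
            labels(k) = "other";
        end
    end

    entities = categorical(labels,valueSet);
end
```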

Examples

Create a MecabOptions object containing the default options for Japanese tokenization.

options = mecabOptions

options = MecabOptions with properties:

         Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
     UserModel: ""
LemmaExtractor: @textanalytics.ja.mecabToLemma
  POSExtractor: @textanalytics.ja.mecabToPOS
  NERExtractor: @textanalytics.ja.mecabToNER

Tokenize Japanese text using custom MeCab options.

Create a string array of Japanese text.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];

Create a MecabOptions object and specify a user model as a .dic file using the 'UserModel' option.

options = mecabOptions('UserModel','myFile.dic')

options = MecabOptions with properties:

         Model: "C:\Program Files\MATLAB\R2023a\sys\share\dict-ipadic"
     UserModel: "myFile.dic"
LemmaExtractor: @textanalytics.ja.mecabToLemma
  POSExtractor: @textanalytics.ja.mecabToPOS
  NERExtractor: @textanalytics.ja.mecabToNER

Tokenize the text by passing the options object as the value of the 'TokenizeMethod' option.

documents = tokenizedDocument(str,'TokenizeMethod',options)

documents = 4×1 tokenizedDocument:

 6 tokens: 恋 に 悩み 、 苦しむ 。
 6 tokens: 恋 の 悩み で 苦しむ 。
10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
10 tokens: 空 の 星 が 輝き を 増し て いる 。

Version History

Introduced in R2019b