GitHub - paithiov909/audubon: An R package for Japanese text processing

audubon

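audubon is a set of Japanese text processing tools for filling Japanese iteration marks (odori-ji), character class conversion, segmentation by phrase, and text normalization.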

Some features above are not implemented in ‘ICU’ (i.e., the stringi package), and the goal of the audubon package is to provide these additional features.

Installation

remotes::install_github("paithiov909/audubon")

Usage

Fill Japanese iteration marks (Odori-ji)

strj_fill_iter_mark repeats the previous character and replaces the iteration marks if the element has more than 5 characters. You can use this feature with strj_normalize or strj_rewrite_as_def.

strj_fill_iter_mark(c(
  "あいうゝ〃かき",
  "金子みすゞ",
  "のたり〳〵かな",
  "しろ/″\\とした"
))
#> [1] "あいうううかき" "金子みすず" "のたりたりかな" "しろじろとした"

strj_fill_iter_mark("いすゞエルフトラック") |>
  strj_normalize()
#> [1] "いすずエルフトラック"
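The other combination mentioned above, with strj_rewrite_as_def, can be sketched as follows. This is a minimal illustration, not output from the package documentation: the sample strings and the variable names are chosen here for demonstration, and strj_rewrite_as_def is called with its default rule set.

```r
library(audubon)

# Sample inputs containing the iteration mark "ゞ" (illustrative only).
x <- c("いすゞエルフトラック", "金子みすゞ")

# Step 1: expand the iteration marks.
filled <- strj_fill_iter_mark(x)

# Step 2: apply the default '*.def' rewriting rules to the expanded text.
rewritten <- strj_rewrite_as_def(filled)

# Put each stage side by side for inspection.
result <- data.frame(original = x, filled = filled, rewritten = rewritten)
print(result)
```

Keeping the intermediate vectors makes it easy to see which step changed which characters.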

Character class conversion

Character class conversion uses hakatashi/japanese.js.

strj_hiraganize("あのイーハトーヴォのすきとおった風")
#> [1] "あのいーはとーゔぉのすきとおった風"
strj_katakanize("あのイーハトーヴォのすきとおった風")
#> [1] "アノイーハトーヴォノスキトオッタ風"
strj_romanize("あのイーハトーヴォのすきとおった風")
#> [1] "anoīhatōvonosukitōtta"
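As a rough sketch of how the three converters might be used together, the snippet below builds a small lookup table from a character vector. It assumes that each function accepts a character vector and returns one string per element; the sample words and the column names are made up for illustration.

```r
library(audubon)

# Illustrative inputs mixing katakana and hiragana.
words <- c("イーハトーヴォ", "すきとおった", "カゼ")

# One row per word, one column per target character class.
# Assumption: each converter returns a vector of the same length as its input.
conversions <- data.frame(
  word     = words,
  hiragana = strj_hiraganize(words),
  katakana = strj_katakanize(words),
  romaji   = strj_romanize(words)
)
print(conversions)
```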

Segmentation by phrase

strj_tokenize splits Japanese text into phrases using google/budoux, TinySegmenter, or other tokenizers.

strj_tokenize("あのイーハトーヴォのすきとおった風", engine = "budoux")
#> $`1`
#> [1] "あの" "イーハトーヴォの" "すきとおった" "風"
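Because the result is a list with one element per document, base R helpers can summarize it. The sketch below assumes strj_tokenize accepts a character vector of several documents; the second sentence and the variable names are illustrative.

```r
library(audubon)

# Two sample documents (illustrative only).
texts <- c(
  "あのイーハトーヴォのすきとおった風",
  "夏でも底に冷たさをもつ青いそら"
)

# Segment each document into phrases with the BudouX engine.
phrases <- strj_tokenize(texts, engine = "budoux")

# Number of phrases found in each document.
phrase_counts <- lengths(phrases)
print(phrase_counts)

# Re-join the phrases with a visible separator to inspect the boundaries.
joined <- vapply(phrases, paste, character(1), collapse = " / ")
print(joined)
```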

Japanese text normalization

strj_normalize normalizes text following rules based on the NEologd style.

strj_normalize("――南アルプスの 天然水- Sparking* Lemon+ レモン一絞り")
#> [1] "ー南アルプスの天然水-Sparking* Lemon+レモン一絞り"
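To see how this differs from plain Unicode normalization, one option is to run the same string through stringi::stri_trans_nfkc and compare. This is only a sketch for side-by-side inspection; the variable names are illustrative and no particular output is claimed for the NFKC result.

```r
library(audubon)

x <- "――南アルプスの 天然水- Sparking* Lemon+ レモン一絞り"

# Plain Unicode NFKC normalization from stringi.
nfkc <- stringi::stri_trans_nfkc(x)

# NEologd-style normalization from audubon.
neologd <- strj_normalize(x)

# Print both results next to each other.
comparison <- c(nfkc = nfkc, neologd = neologd)
print(comparison)
```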

strj_rewrite_as_def is an R port of SudachiCharNormalizer that typically normalizes characters following a ’*.def’ file.

The audubon package contains several ’*.def’ files, so you can use them or write your own ‘rewrite.def’ file as follows.

# single characters will **never** be normalized.
…
# if two characters are separated with a tab,
# left side forms are always rewritten to right side forms
# before normalized.
斎   斉
齋   斉
齊   斉
# only rewriting a single character to a single character is supported,
# i.e., the following entry does not work.
アッ  ア

This feature is more powerful than stringi::stri_trans_* because it lets users control which characters are normalized. For instance, this function can be used to convert kyuji-tai characters to shinji-tai characters.

stringi::stri_trans_nfkc("Ⅹⅳ")
#> [1] "Xiv"
strj_rewrite_as_def("Ⅹⅳ")
#> [1] "Ⅹⅳ"
strj_rewrite_as_def("惡と假面のルール", read_rewrite_def(system.file("def/kyuji.def", package = "audubon")))
#> [1] "悪と仮面のルール"
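Writing your own rule file might look like the sketch below. It saves a few tab-separated pairs to a temporary file and passes the parsed rules to strj_rewrite_as_def, in the same way as the kyuji.def example. The file contents, the sample sentence, and the variable names are hypothetical, and it is assumed that read_rewrite_def accepts any path to a UTF-8 file in the format described above.

```r
library(audubon)

# Hypothetical rules: "before<TAB>after", one pair per line.
def_lines <- c(
  "# rewrite variant forms to a single form",
  "斎\t斉",
  "齋\t斉",
  "齊\t斉"
)

# Write the rules to a temporary file as UTF-8.
def_path <- tempfile(fileext = ".def")
con <- file(def_path, open = "w", encoding = "UTF-8")
writeLines(def_lines, con)
close(con)

# Parse the file and apply the rules to a sample sentence.
my_def <- read_rewrite_def(def_path)
rewritten <- strj_rewrite_as_def("齋藤さんと斎藤さん", my_def)
print(rewritten)
```

Using tempfile() keeps the example from touching the working directory; in a real project the rule file would normally live alongside the analysis code.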

License

© 2024 Akiru Kato

Licensed under the Apache License, Version 2.0.

Icons made by iconixar from flaticon.