GitHub - pommedeterresautee/unine: Unine light stemmer for French, German, Italian, Spanish, Portuguese, Finnish, Swedish

UNINE

License: MIT

Implementation of "light" stemmers for French, German, Italian, Spanish, Portuguese, Finnish and Swedish.
They are based on the same work as the "light" stemmers found in Solr and Elasticsearch.
A "light" stemmer removes inflections from nouns and adjectives only.
For these languages, indexing verbs matters less than indexing nouns and adjectives.

The procedures used in this stemmer are described below:

Online tests are available on this website.

Installation

You can install the released version of unine from CRAN with:

install.packages("unine")

... or the latest version from GitHub:

devtools::install_github("pommedeterresautee/unine")

Example

Below are some examples for French and a comparison with the Porter French stemmer.

french_stemmer(words = c("complète", "caissière"))

[1] "complet" "caisier"

Note that double letters are deduplicated in the output above: caissière -> caisier
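To make the deduplication step concrete, here is a toy sketch in base R. It is not the package's implementation: `dedup_letters` is a hypothetical helper that only collapses runs of a repeated character, whereas `french_stemmer` also applies suffix and accent rules (which is why it returns "caisier" and not "caisière").

```r
# Hypothetical helper, for illustration only (not part of unine):
# collapse any run of the same character into a single occurrence.
dedup_letters <- function(words) {
  # (.) captures one character, \\1+ matches one or more repeats of it
  gsub("(.)\\1+", "\\1", words)
}

dedup_letters(c("personnel", "caissière"))
# [1] "personel" "caisière"
```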

french_stemmer(words = c("tester", "testament", "chevaux", "aromatique", "personnel", "folle"))

[1] "test" "testament" "cheval" "aromat" "personel" "fou"

Note that double letters are deduplicated in the output above: personnel -> personel

Look at how "testament" and "tester" have been stemmed above: they keep distinct stems.

Now with the Porter stemmer:

SnowballC::wordStem(c("testament", "tester"), language = "french")

[1] "test" "test"
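The two stemmers can also be compared side by side. The sketch below assumes both unine and SnowballC are installed, and only uses the calls shown above; the `comparison` data frame and its column names are just illustrative.

```r
library(unine)
library(SnowballC)

words <- c("tester", "testament")

# One row per input word, one column per stemmer
comparison <- data.frame(
  word   = words,
  unine  = french_stemmer(words = words),
  porter = wordStem(words, language = "french"),
  stringsAsFactors = FALSE
)

print(comparison)
```

With the outputs shown above, the unine column keeps "test" and "testament" apart, while the Porter column conflates both words to "test".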

References

Please cite [1] if using this R package.

[1] J. Savoy, A stemming procedure and stopword list for general French corpora

@article{savoy1999stemming,
  title={A stemming procedure and stopword list for general French corpora},
  author={Savoy, Jacques},
  journal={Journal of the American Society for Information Science},
  volume={50},
  number={10},
  pages={944--952},
  year={1999}
}