tokenizers: Fast, Consistent Tokenization of Natural Language Text
Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
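As a quick illustration of the consistent interface described above, here is a minimal sketch using functions from the package's documented API (the sample text is invented for the example): each tokenizer takes a character vector and returns a list of character vectors, one element per input document.

``` r
library(tokenizers)

text <- "The tokenizers share one interface: a character vector goes in, and a list of character vectors comes out."

tokenize_words(text)               # word tokens, lowercased by default
tokenize_ngrams(text, n = 2)       # shingled bigrams
tokenize_sentences(text)           # sentence tokens
count_words(text)                  # count words without tokenizing by hand
chunk_text(text, chunk_size = 10)  # split a longer text into 10-word documents
```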
Version: | 0.3.0 |
---|---|
Depends: | R (≥ 3.1.3) |
Imports: | stringi (≥ 1.0.1), Rcpp (≥ 0.12.3), SnowballC (≥ 0.5.1) |
LinkingTo: | Rcpp |
Suggests: | covr, knitr, rmarkdown, stopwords (≥ 0.9.0), testthat |
Published: | 2022-12-22 |
DOI: | 10.32614/CRAN.package.tokenizers |
Author: | Lincoln Mullen |
Maintainer: | Lincoln Mullen |
BugReports: | https://github.com/ropensci/tokenizers/issues |
License: | MIT + file LICENSE |
URL: | https://docs.ropensci.org/tokenizers/, https://github.com/ropensci/tokenizers |
NeedsCompilation: | yes |
Citation: | tokenizers citation info |
Materials: | README NEWS |
In views: | NaturalLanguageProcessing |
CRAN checks: | tokenizers results |
Reverse dependencies:
Reverse imports: | covfefe, deeplr, DeepPINCS, DramaAnalysis, pdfsearch, proustr, rslp, textrecipes, tidypmc, tidytext, ttgsea, wactor, WhatsR |
---|---|
Reverse suggests: | edgarWebR, torchdatasets |
Reverse enhances: | quanteda |
Linking:
Please use the canonical form https://CRAN.R-project.org/package=tokenizers to link to this page.