Review analyzers by kermitt2 · Pull Request #990 · grobidOrg/grobid
- make it possible to subtokenize by separating digits from letters (based on Unicode character class)
- review Korean tokenizer
- complete retokenization for CJK
- apply the subtokenization to the citation parser
This makes it possible to obtain numerical tokens even when digits are mixed with letters, for instance in Korean:
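As a rough illustration of the idea only (this is not GROBID's analyzer code), the digit/letter separation could be sketched in Java as below. The class name `DigitLetterSubTokenizer` and its `split` method are hypothetical, and the sketch assumes that the "unicode class" check can be approximated with the standard `Character.isDigit` and `Character.isLetter` tests on code points. The Korean input strings are made-up examples.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of GROBID's API.
final class DigitLetterSubTokenizer {

    private enum Kind { DIGIT, LETTER, OTHER }

    // Classify a code point using the JDK's Unicode-aware character tests.
    private static Kind kindOf(int codePoint) {
        if (Character.isDigit(codePoint)) return Kind.DIGIT;
        if (Character.isLetter(codePoint)) return Kind.LETTER;
        return Kind.OTHER;
    }

    // Split a token into sub-tokens wherever the character class changes.
    static List<String> split(String token) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Kind previous = null;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            Kind kind = kindOf(cp);
            // A change of class closes the current sub-token.
            if (previous != null && kind != previous) {
                result.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            previous = kind;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("2023년"));    // prints [2023, 년]
        System.out.println(split("제12권3호")); // prints [제, 12, 권, 3, 호]
    }
}
```

With this kind of split, the digits in a string such as `제12권3호` become standalone numerical tokens that a downstream model, for example a citation parser, can treat as candidate volume or issue numbers.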

