Grapheme segmenter
This tool parses a string and shows extended grapheme cluster boundaries (except for Korean jamo and emoji character sequences). Mouse over or click/tap on the segments to see their composition and reveal the character names.
This app segments text in three different ways:
- Unicode grapheme clusters are an approximation of user-perceived graphemes, where the boundaries are established by rules applied to code point sequences according to UAX #29. The rules tend to be biased towards producing the units of text needed for cursor positioning.
- CCSs (combining character sequences) start with a base character and add all following combining marks. They don't extend the segment across viramas or stackers, which means that conjunct graphemes are split into separate parts. It also means that Myanmar right-rendered vowel signs and tone marks don't create a new segment on their own.
- Orthographic syllables string together grapheme clusters that should not be broken during edit operations such as hyphenation, letter-spacing, first-letter selection, etc. The app uses home-grown algorithms to handle (so far) Hindi, Bengali, Gurmukhi, Sinhala, Tamil, Malayalam, Balinese, Javanese, and Burmese. It may handle other scripts reasonably well, especially non-Indic ones, but you should check the results.
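The combining character sequence rule described above can be sketched in a few lines of Python. This is only an illustration, not the app's own code: it assumes that "combining mark" means any character whose Unicode general category is Mn, Mc or Me, and the function name is hypothetical. Format characters such as ZWJ (category Cf) start a new segment here, which may differ from what the app does.

```python
import unicodedata


def combining_character_sequences(text: str) -> list[str]:
    """Split text into a base character plus all following combining marks.

    Assumption: a combining mark is any character in general category
    Mn, Mc or Me. This is a sketch, not the app's actual algorithm.
    """
    segments: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and segments:
            # Attach the mark to the segment that is currently open.
            segments[-1] += ch
        else:
            # A base character (or a leading, orphaned mark) opens a new segment.
            segments.append(ch)
    return segments


# Devanagari KA + VIRAMA + SSA + VOWEL SIGN I: the conjunct is split
# after the virama, as the description of CCSs says.
print(combining_character_sequences("\u0915\u094D\u0937\u093F"))
```

With this input the sketch yields two segments, KA + VIRAMA and SSA + VOWEL SIGN I, whereas a grapheme cluster or orthographic syllable segmentation may keep the conjunct together.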
If you check Explode, the output is separated into individual characters, which is often useful for seeing what the actual components are.
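A rough equivalent of exploding a segment and listing its character names, using Python's standard unicodedata module. The function name explode and the placeholder for unnamed characters are assumptions made for this sketch; the name data comes from the Unicode version bundled with your Python installation.

```python
import unicodedata


def explode(segment: str) -> list[tuple[str, str]]:
    """Return a (code point, character name) pair for each character."""
    return [
        # unicodedata.name() raises for unnamed characters unless a default is given.
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
        for ch in segment
    ]


for code_point, name in explode("e\u0301"):
    print(code_point, name)
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT
```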
To pass a string in the URL, use one of:
?gc=<string>
?ccs=<string>
?os=<string>
To indicate in the URL the font you want to use for the display, add &font=<font_name>.
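If you generate such links programmatically, the query string can be built as in the sketch below. The parameter names gc, ccs, os and font are the ones listed above; the function name is hypothetical, and percent-encoding the values as UTF-8 is an assumption about what the page accepts. Prepend the address of the app page yourself.

```python
from typing import Optional
from urllib.parse import quote


def segmenter_query(text: str, mode: str = "gc", font: Optional[str] = None) -> str:
    """Build the query string for one of the documented parameters.

    Assumption: the page accepts percent-encoded UTF-8 values.
    """
    if mode not in ("gc", "ccs", "os"):
        raise ValueError("mode must be 'gc', 'ccs' or 'os'")
    query = f"?{mode}={quote(text)}"
    if font is not None:
        # The font name is appended as a second parameter.
        query += f"&font={quote(font)}"
    return query


print(segmenter_query("\u0915", mode="ccs", font="Noto Sans"))
# ?ccs=%E0%A4%95&font=Noto%20Sans
```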
See also the ICU line-break segmenter.
Updated 24 January, 2023