Grapheme Usage

Draft

The goal is to allow the appropriate grapheme clusters to be used for a given task in a given language. See http://unicode.org/cldr/trac/ticket/2142. Please leave any feedback as comments on that ticket.

The idea is to define explicit boundaries that represent certain common behaviors (such as codepoint breaks or legacy grapheme cluster breaks), and, for each language, to associate each function with the explicit boundaries that language should use for that function.

Here is a proposal for the structure in LDML:

extended <!-- when counting ‘user characters’ -->

legacy <!-- paragraph drop-caps -->

aksara <!-- selection boundaries: highlighting, keyboard arrows, cut&paste -->

codepoint <!-- delete previous character -->

extended <!-- delete next character -->

The above would be tailorable per locale.
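The association between functions and boundaries can be pictured as a lookup table with per-locale overrides. The following is only a minimal sketch under stated assumptions: the function keys (`count`, `drop-caps`, and so on), the `xx` locale tailoring, and the simple truncation fallback are all hypothetical and are not part of the proposal. Only the boundary names and their root assignments come from the list above.

```python
# Root associations, taken from the proposal above.
# The dictionary keys are hypothetical names for the functions.
ROOT_USAGE = {
    "count": "extended",             # counting 'user characters'
    "drop-caps": "legacy",           # paragraph drop-caps
    "selection": "aksara",           # highlighting, keyboard arrows, cut & paste
    "delete-previous": "codepoint",  # delete previous character
    "delete-next": "extended",       # delete next character
}

# Hypothetical per-locale tailoring, invented purely for illustration.
LOCALE_USAGE = {
    "xx": {"delete-previous": "legacy"},
}


def parent_locales(locale):
    """Yield the locale, then its truncated parents: 'xx_YY' -> 'xx_YY', 'xx'."""
    parts = locale.split("_")
    while parts:
        yield "_".join(parts)
        parts.pop()


def boundary_for(function, locale="root"):
    """Return the boundary name for a function, preferring locale tailorings."""
    for candidate in parent_locales(locale):
        tailored = LOCALE_USAGE.get(candidate, {})
        if function in tailored:
            return tailored[function]
    # No tailoring found anywhere in the chain: fall back to root.
    return ROOT_USAGE[function]
```

With this sketch, `boundary_for("delete-previous", "xx_YY")` picks up the hypothetical `xx` tailoring, while `boundary_for("selection", "xx_YY")` falls back to the root value. Real locale inheritance is more involved than plain truncation, so the fallback here is a simplification.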

In segments/root.xml we have GraphemeClusterBreak. We interpret that as extended grapheme clusters for compatibility. We then add rules for:

These would also be tailorable per locale (except CodePoint), but such tailoring should be needed more rarely.
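To see why different functions might want different boundaries, compare codepoint breaks with cluster breaks on a base letter followed by a combining mark. The sketch below is illustrative only: `simple_cluster_breaks` is a deliberately simplified stand-in that merely refuses to break before a nonspacing or enclosing mark. It is not the GraphemeClusterBreak rule set, and all function names are hypothetical.

```python
import unicodedata


def codepoint_breaks(text):
    """A boundary before and after every code point."""
    return list(range(len(text) + 1))


def simple_cluster_breaks(text):
    """Simplified cluster boundaries: never break before an Mn or Me mark.

    This is an assumption-laden approximation, not the real segmentation rules.
    """
    if not text:
        return [0]
    breaks = [0]
    for i in range(1, len(text)):
        if unicodedata.category(text[i]) not in ("Mn", "Me"):
            breaks.append(i)
    breaks.append(len(text))
    return breaks


def delete_previous(text, pos, breaks):
    """Delete back from pos to the nearest earlier boundary (requires pos > 0)."""
    start = max(b for b in breaks if b < pos)
    return text[:start] + text[pos:]


def delete_next(text, pos, breaks):
    """Delete forward from pos to the nearest later boundary (requires pos < len(text))."""
    end = min(b for b in breaks if b > pos)
    return text[:pos] + text[end:]


sample = "e\u0301x"  # 'e' + COMBINING ACUTE ACCENT + 'x'

# Deleting backwards by code point removes only the accent.
print(delete_previous(sample, 2, codepoint_breaks(sample)))   # prints "ex"
# Deleting forwards by cluster removes the letter and its accent together.
print(delete_next(sample, 0, simple_cluster_breaks(sample)))  # prints "x"
```

This mirrors the root associations proposed earlier, where deleting the previous character uses codepoint boundaries and deleting the next character uses extended clusters.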

Clients like ICU would add new constants for getting BreakIterators (or equivalents). These would correspond both to the new explicit rules:

And to the new ‘function-based’ breaks: