UnicodeExtras
Installation
julia> Pkg.clone("git://github.com/nolta/UnicodeExtras.git")
Usage
File encoding
julia> using UnicodeExtras
julia> b = encode("Ålborg", "iso-8859-1")
6-element Array{Uint8,1}:
0xc5
0x6c
0x62
0x6f
0x72
0x67
julia> decode(b, "iso-8859-1")
"Ålborg"
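As a rough illustration of why the result above has 6 bytes, here is a minimal sketch in plain modern Julia (1.x) that does not use this package. The helpers `latin1_encode` and `latin1_decode` are hypothetical names invented for this example. They rely only on the fact that ISO-8859-1 maps the code points U+0000 through U+00FF directly to single bytes.

```julia
# Assumption: Julia 1.x without UnicodeExtras; both helper names are hypothetical.
# ISO-8859-1 stores each code point in U+0000..U+00FF as one byte of the same value.
# Encoding throws an InexactError for any character above U+00FF.
latin1_encode(s::AbstractString) = UInt8[UInt8(codepoint(c)) for c in s]
latin1_decode(bytes::AbstractVector{UInt8}) = join(Char(b) for b in bytes)

word = "\u00c5lborg"          # "Ålborg" with a precomposed Å
b = latin1_encode(word)

println(b[1] == 0xc5)         # Å becomes the single byte 0xc5
println(length(b))            # 6 bytes in ISO-8859-1
println(sizeof(word))         # 7 bytes in Julia's native UTF-8
println(latin1_decode(b) == word)
```

Unlike `encode` and `decode` above, this sketch handles only one encoding and is meant to show the byte layout, not to replace the package functions.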
Case handling
This package extends Julia's built-in uppercase and lowercase functions, and adds titlecase and foldcase.
julia> uppercase("testingß")
"TESTINGß"
julia> using UnicodeExtras
julia> uppercase("testingß")
"TESTINGSS"
julia> set_locale("tr") # set locale to Turkish
"tr"
julia> uppercase("testingß")
"TESTİNGSS"
Note that "ß" gets converted to "SS" after UnicodeExtras is loaded, and "i" gets converted to "İ" (dotted capital I) after the locale is set to Turkish.
UnicodeText
In Julia, a string is conceptually an array of Unicode code points. While well defined, this occasionally causes confusion because a single code point doesn't necessarily correspond to what people perceive as a single "character".
Take the following example:
julia> s = "noe\u0308l"
"noël"
julia> length(s)
5
Here, the ë "character" consists of two code points, 'e' and '\u0308', so the length of the string is 5, not 4.
julia> noel1 = UnicodeText("noe\u0308l")
"noël"
julia> noel2 = UnicodeText("noël")
"noël"
julia> noel1.data
5-element Array{Uint16,1}:
0x006e
0x006f
0x0065
0x0308
0x006c
julia> noel2.data
4-element Array{Uint16,1}:
0x006e
0x006f
0x00eb
0x006c
julia> noel1 == noel2
true
julia> length(noel1) == 4 == length(noel2)
true
julia> noel1[1:3]
"noë"
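For comparison, the distinction between code points and perceived characters can also be seen with the `Unicode` standard library that ships with Julia 1.x. This is an assumption about a newer Julia than the one the transcripts here were recorded on, and it is not part of UnicodeExtras. The sketch counts grapheme clusters and uses NFC normalization to show why the decomposed and precomposed spellings can be treated as equal.

```julia
using Unicode  # standard library in Julia 1.x (assumption: newer than the Julia used in this README)

decomposed  = "noe\u0308l"   # 'e' followed by the combining diaeresis U+0308
precomposed = "no\u00ebl"    # single precomposed 'ë' (U+00EB)

println(length(decomposed))                                  # 5 code points
println(length(graphemes(decomposed)))                       # 4 user-perceived characters
println(decomposed == precomposed)                           # false: different code point sequences
println(Unicode.normalize(decomposed, :NFC) == precomposed)  # true after canonical composition
```

Plain strings still compare code point by code point, which is why the two spellings differ until one of them is normalized.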
UnicodeText comparisons are locale sensitive:
julia> set_locale("de") # German
"de"
julia> UnicodeText("Köpfe") < UnicodeText("Kypper")
true
julia> set_locale("sv") # Swedish
"sv"
julia> UnicodeText("Köpfe") < UnicodeText("Kypper")
false
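The different results above come from the two languages ordering "ö" differently: German sorts it alongside "o", while Swedish places it after "z". The following toy sketch in plain Julia mimics that with hand-written alphabets. The names `ranks`, `sortkey`, `german`, and `swedish` are hypothetical, and the tables are a deliberate simplification; real locale collation uses multi-level rules that this does not attempt to reproduce.

```julia
# Toy illustration only: hypothetical helpers, not how UnicodeText compares strings.
# Map each letter of an alphabet to its position in that alphabet.
ranks(alphabet) = Dict(c => i for (i, c) in enumerate(alphabet))

# Simplified orderings: German keeps ö next to o, Swedish puts å, ä, ö after z.
german  = ranks("a\u00e4bcdefghijklmno\u00f6pqrs\u00dftu\u00fcvwxyz")
swedish = ranks("abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6")

# Turn a word into a vector of ranks; vectors compare lexicographically.
# Throws a KeyError for characters missing from the alphabet.
sortkey(word, table) = [table[c] for c in lowercase(word)]

a, b = "K\u00f6pfe", "Kypper"   # "Köpfe" with a precomposed ö
println(sortkey(a, german)  < sortkey(b, german))    # true:  ö sorts before y
println(sortkey(a, swedish) < sortkey(b, swedish))   # false: ö sorts after z
```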