Check if a Character Vector is Validly Encoded

validUTF8 {base} R Documentation

Description

Check if each element of a character vector is valid in its implied encoding.

Usage

validUTF8(x)
validEnc(x)

Arguments

x: a character vector.

Details

These use similar checks to those used by functions such as [grep](../../base/help/grep.html).

validUTF8 ignores any marked encoding (see [Encoding](../../base/help/Encoding.html)) and so checks directly whether the bytes in each string are valid UTF-8. (For the validity of ‘noncharacters’ see the help for [intToUtf8](../../base/help/intToUtf8.html).)
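
As a rough illustration of the point about marked encodings, the sketch below declares the same kind of string as Latin-1 and then checks it. The variable names `s` and `l1` are made up for this example; only the bytes should influence the result.

```r
# Bytes that form valid UTF-8 ("zürich"), written with explicit escapes.
s <- "z\xc3\xbcrich"
Encoding(s) <- "latin1"   # deliberately (mis)declare the encoding
validUTF8(s)              # TRUE: only the bytes are inspected

# A lone byte 0xFC is "ü" in Latin-1 but is not valid UTF-8.
l1 <- "z\xfcrich"
Encoding(l1) <- "latin1"
validUTF8(l1)             # FALSE, even though the mark is correct
```

So a FALSE from validUTF8 means "these bytes are not UTF-8", not necessarily "this string is corrupt".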

validEnc regards character strings as validly encoded unless their encodings are marked as UTF-8 or they are unmarked and the R session is in a UTF-8 or other multi-byte locale. (The checks in other multi-byte locales depend on the OS and, as with [iconv](../../base/help/iconv.html), not all invalid inputs may be detected.)
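
The rule above can be sketched by giving one invalid byte sequence different encoding marks. The names `bad`, `u`, `l` and `b` are hypothetical, and the result for the unmarked string is left uncommented because it depends on the session's locale.

```r
bad <- "\xfa\xb4\xbf\xbf\x9f"        # not valid UTF-8

u <- bad; Encoding(u) <- "UTF-8"     # marked as UTF-8: the bytes are checked
l <- bad; Encoding(l) <- "latin1"    # marked as Latin-1: regarded as valid
b <- bad; Encoding(b) <- "bytes"     # marked as bytes: regarded as valid

validEnc(c(u, l, b))                 # FALSE TRUE TRUE

## For the unmarked string the answer depends on the locale:
l10n_info()[c("MBCS", "UTF-8")]      # is this a multi-byte / UTF-8 locale?
validEnc(bad)
```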

Value

A logical vector of the same length as x. NA elements are regarded as validly encoded.
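
A small sketch of the return value, assuming a hypothetical vector `v` that mixes a valid string, an NA and an invalid byte:

```r
v <- c("abc", NA, "\xff")   # "\xff" can never occur in valid UTF-8
ok <- validUTF8(v)

length(ok) == length(v)     # TRUE: one result per element
ok                          # TRUE TRUE FALSE: the NA counts as valid
which(!ok)                  # 3: position of the invalid element
```

Because NA is reported as valid, the result can be used directly for subsetting (for example `v[!ok]`) without special handling of missing values.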

Note

It would be possible to check for the validity of character strings in a Latin-1 encoding, but extensions such as CP1252 are widely accepted as ‘Latin-1’ and 8-bit encodings rarely need to be checked for validity.

Examples

x <-
  ## from example(text)
c("Jetz", "no", "chli", "z\xc3\xbcrit\xc3\xbc\xc3\xbctsch:",
  "(noch", "ein", "bi\xc3\x9fchen", "Z\xc3\xbc", "deutsch)",
   ## from a CRAN check log
   "\xfa\xb4\xbf\xbf\x9f")
validUTF8(x)
validEnc(x) # depends on the locale
Encoding(x) <- "UTF-8"
validEnc(x) # typically the last, x[10], is invalid

## Maybe advantageous to declare it "unknown":
G <- x ; Encoding(G[!validEnc(G)]) <- "unknown"
try( substr(x, 1,1) ) # gives 'invalid multibyte string' error in a UTF-8 locale
try( substr(G, 1,1) ) # works in a UTF-8 locale
nchar(G) # fine, too
## but it is not "more valid" typically:
all.equal(validEnc(x),
          validEnc(G)) # typically TRUE

[Package _base_ version 4.6.0 Index]