Character Types and Character Traits

C++ Report, April 1998
Klaus Kreft & Angelika Langer

Character types have an impact on various classes in the standard library. Strings, iostreams, and facets are abstractions in the library that manipulate characters or character sequences. All of them are implemented as class templates that take the character type as a template argument. The motivation for this design is the desire to keep abstractions like strings and streams independent of the characters they handle. Consider string operations: Concatenation of two strings, for instance, is mostly independent of the type of the characters that form the two strings in question. Hence it is possible to implement entirely generic strings that can handle sequences of any kind of characters. That at least is the idea behind class templates like basic_string, basic_fstream, basic_streambuf, ctype, num_get, and num_put, to name but a few of them.

In practice it turns out that the character type alone does not provide enough information for all tasks that strings and iostreams perform. Streams, for instance, need to know how they can recognize the end of a file; after all, the end-of-file value might differ between character types. Also, strings must know how characters are compared, because the character type alone does not imply it. This additional information about a character type is encapsulated in yet another abstraction: the character traits. For use in the standard library each character type must be accompanied by an associated traits type. Otherwise the character type cannot be used for instantiating the string and stream class templates, because they require a second template argument, which is the traits type associated with the character type.
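The relationship between a character type and its traits type can be checked directly in code. The following is a minimal sketch, assuming a compiler that supports C++11 (for static_assert and <type_traits>); the helper concat() is a hypothetical name introduced for illustration only and is not part of the library.

```cpp
#include <string>
#include <type_traits>

// std::string and std::wstring are typedefs for basic_string instantiated
// with a built-in character type and its associated traits type.
static_assert(std::is_same<std::string,
                           std::basic_string<char, std::char_traits<char> > >::value,
              "std::string uses char_traits<char>");
static_assert(std::is_same<std::wstring,
                           std::basic_string<wchar_t, std::char_traits<wchar_t> > >::value,
              "std::wstring uses char_traits<wchar_t>");

// Hypothetical helper: concatenation written once, for any character type
// and any traits type.
template <class charT, class traits>
std::basic_string<charT, traits>
concat(const std::basic_string<charT, traits>& a,
       const std::basic_string<charT, traits>& b)
{
    return a + b;
}
```

A call such as concat(std::string("ab"), std::string("cd")) deduces both the character type and the traits type from its arguments.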

Before we examine in detail how the character and traits type are used in strings, iostreams, and facets, we want to take a closer look at character types and character traits types in general.

Character Types

What is a character? We all know intuitively that it is something like an 'A' or a 'B'. The notion that we share is the abstraction of a character. Such an abstract character has numerous properties: visual representations (glyphs), binary representations (codes), and many more. Now, what do we mean when we talk of characters in the context of the standard library? Let's get it straight: As programmers, we deal with characters inside our C++ programs. In that context a character is an object, in the sense that it is an instance of a type, either a built-in type or a user-defined type like a class. The built-in character types in (C and) C++ are type char for narrow characters and type wchar_t for wide characters. Like other objects in our program, character objects not only have a type, but also an individual object state. A character object's state is the content of the character, i.e. its binary representation. It is the bit pattern stored inside a storage unit of type char or wchar_t, for instance. The content of a character is also called a character code. A code usually belongs to a character encoding, which is a set of character codes along with rules for their interpretation. In this article we are talking of character objects. Keep in mind that whenever we mention a "character" in the following text we mean a "character object inside a C++ program".

Character Type vs. Character Encoding

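A character has two aspects that are relevant in a C++ program: its character type, which describes the storage unit that holds the character, and its character encoding, which determines how the code contained in that storage unit is interpreted.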

As you can see, there is no 1:1 relationship between the character type that describes the storage units used for storing a character, and the character encoding used to represent the code contained in that storage unit. Instead, a character sequence of a given encoding is stored in an array of units that have the minimum size required to hold any character of the encoding. The types typically used for storage of characters are the built-in character types char and wchar_t; one-byte character encodings are stored as char, and wide character encodings are stored as wchar_t. The table below shows examples of narrow and wide character encodings and the character type that is typically used for storing and processing them:

character type    character encoding
char              ASCII
char              EBCDIC
char              ISO 8859-2
wchar_t           Unicode
wchar_t           ISO 10646

Requirements to Character Types

Let us return to the character type parameter of strings, iostreams, and facets. Potential candidates for the character type are, of course, the built-in types char and wchar_t. User-defined types are allowed, too. The Japanese delegation on the ISO committee standardizing C++ brought up the notion of Jchar, a character type that encapsulates information specific to the processing of Japanese character representations. Jchar would be a user-defined type that can be used for instantiation of class templates like basic_string, basic_fstream, ctype, etc. Naturally, not just any type can serve as a character type. User-defined character types must meet the following requirements:

For certain purposes (details below) the character type must provide additional functionality:

Character Traits Types

A character traits type provides information associated with a character type. In order to be used for instantiation of string or iostreams class templates a traits type has to provide a set of member typedefs and functions, some of which are predominantly used by strings, others being mostly used by iostreams.

Requirements to Character Traits Types

Below we give an overview of the member typedefs and functions required of a character traits type, grouped by topics.

Copying, Finding, and Comparing Characters

The character traits are required to provide a number of member functions for typical operations on characters and character sequences. These are:

assign(), move(), copy()    assigning and copying of characters
find()                      finding a character in a character sequence
eq()                        equality of characters
lt(), compare()             comparison of characters and character sequences
length()                    length of a character sequence

They are mostly used by the string classes in the standard library.
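To give a feel for these member functions, here is a small sketch that uses the predefined traits type char_traits<char>. The typedef char_tr and the helpers count_char() and copy_terminated() are hypothetical names made up for this illustration.

```cpp
#include <cstddef>
#include <string>

typedef std::char_traits<char> char_tr;

// Hypothetical helper: counts how often c occurs in the null-terminated
// sequence s, using only length() and find() from the traits.
std::size_t count_char(const char* s, char c)
{
    const std::size_t n = char_tr::length(s);
    const char* const end = s + n;
    std::size_t hits = 0;
    for (const char* p = char_tr::find(s, n, c); p != 0;
         p = char_tr::find(p + 1, static_cast<std::size_t>(end - (p + 1)), c))
        ++hits;
    return hits;
}

// Hypothetical helper: copies s into buf (capacity cap) and terminates it.
// Returns false if s does not fit.
bool copy_terminated(char* buf, std::size_t cap, const char* s)
{
    const std::size_t n = char_tr::length(s);
    if (n + 1 > cap)
        return false;
    char_tr::copy(buf, s, n);
    char_tr::assign(buf[n], char());   // end-of-string character
    return true;
}
```

Note that neither helper applies an operator or a C library function to the characters directly; everything goes through the traits.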

Handling the end-of-file Character

The character traits are required to provide types and functions for handling the end-of-file value of a character type. This information is used by the iostreams classes in the standard library. Here is an overview:

int_type, eof()                 type and value of the end-of-file value
char_type                       the character type itself
eq(), eq_int_type()             equality of characters and int_type values
to_int_type(), to_char_type()   conversions between character and int_type values
not_eof()                       returns a value different from the end-of-file value

The end-of-file character is a special character that is different from all other character values. Historically, the end-of-file value was EOF, which is a constant of type int that is different from all character values of type char. In the standard iostreams this principle was generalized. The end-of-file character of a character type is provided by the character traits in the form of a static member function eof(). The end-of-file value's type is defined as a type nested in the character traits called int_type. Note that it usually is different from the character type itself, which is defined as char_type.

Two values of type char_type or int_type cannot simply be compared by means of the built-in equality operator, because char_type and int_type can be any arbitrary type. Instead they are compared via the eq() and eq_int_type() member functions, respectively. The traits also have to provide functions for conversion between values of char_type and int_type and a convenience function for certain stream operations that returns an int_type value that is guaranteed to be different from the end-of-file value.
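The interplay of these members can be sketched with the predefined traits type char_traits<char>. The typedef tr and the helpers is_eof() and char_or() are hypothetical names introduced only for this example.

```cpp
#include <string>

typedef std::char_traits<char> tr;

// Hypothetical helper: true if v is the end-of-file value.
bool is_eof(tr::int_type v)
{
    return tr::eq_int_type(v, tr::eof());
}

// Hypothetical helper: converts v back to a character, or returns
// fallback if v is the end-of-file value.
char char_or(tr::int_type v, char fallback)
{
    return is_eof(v) ? fallback : tr::to_char_type(v);
}
```

Every character value, once converted with to_int_type(), is distinguishable from eof(), and not_eof() leaves such a value unchanged while mapping eof() itself to some other value.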

Conversion State and Stream Positions

A character traits type has to provide typedefs related to character code conversions and stream positioning. These are:

state_type    type of the conversion state maintained by a stream buffer
pos_type      type of an absolute stream position
off_type      type of an offset to a specified stream position

A discussion of character code conversions and stream positioning is beyond the scope of this article.
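Without going into that discussion, the concrete types behind these typedefs in the predefined traits can at least be inspected. The sketch below assumes a compiler that supports C++11; offset_between() is a hypothetical helper that merely shows that the difference of two pos_type values is an off_type value.

```cpp
#include <cwchar>
#include <ios>
#include <string>
#include <type_traits>

typedef std::char_traits<char> narrow_traits;

static_assert(std::is_same<narrow_traits::state_type, std::mbstate_t>::value,
              "conversion state of char_traits<char>");
static_assert(std::is_same<narrow_traits::pos_type, std::streampos>::value,
              "absolute stream position of char_traits<char>");
static_assert(std::is_same<narrow_traits::off_type, std::streamoff>::value,
              "stream offset of char_traits<char>");

// Hypothetical helper: the distance between two absolute positions
// is an offset.
narrow_traits::off_type
offset_between(narrow_traits::pos_type from, narrow_traits::pos_type to)
{
    return to - from;
}
```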

The Predefined Character Traits

The standard library provides two predefined traits types for the built-in character types char and wchar_t. These two types are specializations of a class template called char_traits. Here are the declarations of these predefined traits types as they appear in the header file <string>:

template<class charT> struct char_traits;
template<> struct char_traits<char>;
template<> struct char_traits<wchar_t>;

The traits types char_traits<char> and char_traits<wchar_t> are the default traits types associated with the built-in character types char and wchar_t. If you never specify character traits, the standard library classes will use these defaults.

Interestingly, the char_traits template itself is an empty class template. Its sole purpose is to serve as a primary template for specializations. It is not supposed to be instantiated for any character types. A traits type for a user-defined character type would be a specialization, not an instantiation of the char_traits template. The empty character traits class template is used as a default template argument for the class templates requiring a traits type argument. Let's take a look at some typical examples:

template <class charT, class traits = char_traits<charT>, class Allocator = allocator<charT> >
class basic_string;
template <class charT, class traits = char_traits<charT> >
class basic_fstream;

The traits type parameter of string and iostreams classes has a default value. This is for the convenience of users of these classes so that they need not specify the traits type argument. The default value for the traits type naturally depends on the character type: it is a specialization of the predefined character traits class template char_traits. This way the predefined specializations for char and wchar_t are used as defaults, and even for user-defined types there is a natural default value:

Imagine you define a new character type myChar. Then you have to provide an associated traits type. If you define it as a specialization of the char_traits template, i.e. as char_traits<myChar>, then the default applies and a myChar string is of type basic_string<myChar>. Alternatively you could give the traits type a name of its own, say myCharTraits. In that case the traits type argument cannot be omitted, i.e. you have to say basic_string<myChar,myCharTraits> instead of just basic_string<myChar>. For this reason it is advisable to define the traits type associated with a character type as a specialization of the char_traits template. There is only one situation when you would want to define traits types that are not specializations of the char_traits template: when you define more than one traits type for the same character type.
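A well-known case of a second traits type for an existing character type is a traits type that compares characters of type char without regard to case. The following is a minimal sketch under that assumption; ci_char_traits and ci_string are hypothetical names, and the case mapping relies on std::toupper in the current C locale.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical second traits type for char: compares case-insensitively.
// Everything not redefined here is inherited from char_traits<char>.
struct ci_char_traits : public std::char_traits<char> {
    static char up(char c)
    { return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); }

    static bool eq(char a, char b) { return up(a) == up(b); }
    static bool lt(char a, char b) { return up(a) < up(b); }

    static int compare(const char* a, const char* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }

    static const char* find(const char* s, std::size_t n, char c)
    {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return 0;
    }
};

// The traits argument must be spelled out here; omitting it would
// select the default char_traits<char>.
typedef std::basic_string<char, ci_char_traits> ci_string;
```

With these definitions ci_string("Hello") and ci_string("hELLO") compare equal. Note that ci_string and std::string are distinct, unrelated types, so the usual stream inserters and extractors for std::string do not apply to ci_string.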

Usage of Character Types and Traits Types

We mentioned earlier that strings and iostreams rely on different parts of the character traits and that facets do not rely on the character traits at all, but impose additional requirements on the character type. To aid understanding of these differences, let us get a rough impression of the way the character traits are used in the implementation of strings, iostreams, and facets.

Strings

The implementation of the basic_string class template uses the traits member typedefs and functions whenever it manipulates characters and character sequences. For instance, traits::eq() is used for comparison in string functions such as find() and rfind(); traits::compare() is used for the implementation of the string compare() functions, and so on. Here is a simplified example that shows the use of traits::copy() and traits::assign() in one of the append() member functions of basic_string:

    template<class charT, class traits, class allocator>
    class basic_string {
    public:
        typedef typename allocator::size_type size_type;
    private:
        charT *_ptr;        // the character sequence
        size_type _len;     // its current length
    public:
        basic_string& append(const charT *s, size_type m)
        {
            // ...
            size_type n;
            if (0 < m && _extend(n = _len + m)) {
                traits::copy(_ptr + _len, s, m);            // add the new characters
                traits::assign(_ptr[_len = n], charT(0));   // add the end-of-string character
            }
            return (*this);
        }
    };

You can see that the basic_string template is entirely independent of the actual character type and its specific properties: The character sequence that is to be added to the string's end is copied using traits::copy(), and the end-of-string character charT(0) is added by means of traits::assign(). No assumptions are made about the character type itself or the way objects of that type are copied and assigned. One assumption that is made is that the end-of-string character can be obtained by constructing a character object from the numeric value 0. This is one of the requirements on the character type.
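The same technique can be tried out in isolation. The sketch below is not the library's implementation; it is a hypothetical free function append_chars() that keeps a terminated character sequence in a std::vector and appends to it using nothing but traits::copy(), traits::assign(), and charT(0).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not taken from any library: appends the m characters
// at s to buf. buf is either empty or holds a sequence that ends with the
// end-of-string character charT(0). s must not point into buf.
template <class charT, class traits>
void append_chars(std::vector<charT>& buf, const charT* s, std::size_t m)
{
    const std::size_t len = buf.empty() ? 0 : buf.size() - 1;  // length without terminator
    buf.resize(len + m + 1);                   // room for new characters and terminator
    if (m > 0)
        traits::copy(&buf[len], s, m);
    traits::assign(buf[len + m], charT(0));    // end-of-string character
}
```

The function works unchanged for char and wchar_t, for instance as append_chars<char, std::char_traits<char> >(buf, "ab", 2).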

IOStreams

Iostreams classes rely on the character traits, too. They use the traits member typedefs and functions that have to do with the end-of-file value. Below is a typical example. Numerous member functions of stream and stream buffer classes return the end-of-file value in case of failure and a valid character in case of success. The get() functions for unformatted input of characters show the principle:

    template<class charT, class traits>
    class basic_istream : virtual public basic_ios<charT, traits> {
    public:
        typedef typename traits::int_type int_type;
        int_type get()
        {
            int_type c;
            // ...
            if (!_Ok)
                c = traits::eof();                  // failure: return the end-of-file value
            else {
                // ...
                c = rdbuf()->sbumpc();
                // ...
            }
            return (c);
        }
        basic_istream<charT, traits>& get(charT& x)
        {
            int_type c = get();
            if (!traits::eq_int_type(traits::eof(), c))
                x = traits::to_char_type(c);        // success: convert back to a character
            return (*this);
        }
    };

You can see that the first get() function has to return a value of type traits::int_type, because the return value is either the end-of-file value, which is of type traits::int_type, or a character of type charT. The second get() function demonstrates the need for conversions between both types and for comparison of values of those types.
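From the user's side, the same members are needed to interpret the result of get() in generic code. Here is a small sketch; count_chars() is a hypothetical helper that reads until get() returns the end-of-file value.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: counts the characters available from in by calling
// get() until it returns the end-of-file value of the stream's traits.
template <class charT, class traits>
std::size_t count_chars(std::basic_istream<charT, traits>& in)
{
    typedef typename traits::int_type int_type;
    std::size_t n = 0;
    for (int_type c = in.get();
         !traits::eq_int_type(c, traits::eof());
         c = in.get())
        ++n;
    return n;
}
```

Because the loop compares with traits::eq_int_type() rather than with a hard-coded EOF constant, it works for narrow and wide streams alike.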

Iostreams classes additionally use the traits members that relate to stream positioning and code conversion.

Note that the standard does not guarantee that iostreams classes restrict themselves to the traits member typedefs and functions related to the end-of-file value, stream positioning, and code conversion. They are allowed to also make use of member functions like compare(), assign(), etc. Similarly, strings could theoretically use eof(), not_eof(), pos_type, off_type, etc., although this is unlikely in practice. The principle is that strings and iostreams are permitted to rely on the full set of properties that are required of character types and their traits.

Facets

The facet class templates do not have a traits parameter. This is because many of the facets really do not manipulate characters. Think of the ctype facet for instance: It classifies characters according to their properties, i.e. whether they are digits, white spaces, printable, lower case, upper case letters, etc. There is no need for ever really touching a character object, copying it, or comparing it.

This is different for the parsing and formatting facets like num_get, num_put, money_get, money_put, time_get, and time_put. They have to compare characters. Just think of the parsing of numeric values: the num_get facet has to recognize that an input character is the radix character or the thousands separator. It will therefore compare an input character to the respective symbols defined in the numpunct facet. Comparison is defined by the character traits in the form of the eq() function, but the facets do not know anything about the associated traits type. Instead of using traits::eq(), they perform the comparison of two characters by means of operator==(). Here is a code snippet that could be part of the num_get facet's do_get() function:

    template <class charT, class InputIterator = istreambuf_iterator<charT> >
    class num_get : public locale::facet {
    protected:
        virtual iter_type
        do_get(iter_type beg, iter_type end, ios_base& iob,
               ios_base::iostate& err, long& v) const
        {
            // ...
            char_type ct = *beg;
            char c;
            // characters are compared with operator==(), not with traits::eq()
            if (ct == use_facet<numpunct<charT> >(iob.getloc()).decimal_point())
                c = '.';
            bool discard =
                (ct == use_facet<numpunct<charT> >(iob.getloc()).thousands_sep()
                 &&
                 use_facet<numpunct<charT> >(iob.getloc()).grouping().length() != 0);
            // ...
        }
    };

As you can see, the comparison of a character of type charT is performed using an operator==() for that character type. This explains why facets impose additional requirements on the character type.
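The comparisons against the numpunct symbols can be tried out with the classic "C" locale. The following sketch uses two hypothetical helper functions, is_decimal_point() and is_discardable_separator(); like a facet, they compare characters with operator==() and never see a traits type.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: true if ct is the decimal point defined by the
// numpunct facet of loc.
template <class charT>
bool is_decimal_point(charT ct, const std::locale& loc)
{
    return ct == std::use_facet<std::numpunct<charT> >(loc).decimal_point();
}

// Hypothetical helper: true if ct is a thousands separator that a parser
// may discard, i.e. grouping is actually in effect in loc.
template <class charT>
bool is_discardable_separator(charT ct, const std::locale& loc)
{
    const std::numpunct<charT>& np = std::use_facet<std::numpunct<charT> >(loc);
    return ct == np.thousands_sep() && np.grouping().length() != 0;
}
```

In the classic locale the decimal point is '.', and since that locale defines no grouping, a ',' is never reported as a discardable separator there.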

An interesting side effect is that iostreams classes generally use the traits::eq() function for comparison of characters, but for parsing and formatting of numeric values they use operator==(). This is because parsing and formatting of numeric values is delegated to the numeric facets of the stream's locale, and we've seen above that facets do not use character traits. It follows that it is advisable to implement an operator==() for a user-defined character type that has the same semantics as the eq() function of the character type's traits. One problem remains: You can only have one operator==() for a given character type, but several character traits types associated with that character type. What if the traits types have different eq() functions? Iostreams might yield "interesting" results under these circumstances. However, this problem is unlikely to occur in practice.

Consider also that iostreams classes not only rely on certain properties of the character type and the character traits type, but additionally require facets for that character type. Iostreams need the numeric facets, as we've mentioned above. They also need conversions between the character type and the built-in type char, which have to be defined in the ctype facet in the form of the member functions narrow() and widen(). They also need character classification functions from the ctype facet in order to identify white-space characters. A code conversion facet is needed for file streams. It turns out that you have to provide all standard facets for a new character type, because facets are generally allowed to be interdependent.
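For the built-in character types these ctype services already exist and can be called directly. The sketch below wraps them in three hypothetical helpers, is_space(), widen_char(), and narrow_char(), to show what iostreams expect a ctype facet for a new character type to deliver.

```cpp
#include <locale>

// Hypothetical helper: white-space classification via the ctype facet.
template <class charT>
bool is_space(charT c, const std::locale& loc)
{
    return std::use_facet<std::ctype<charT> >(loc).is(std::ctype_base::space, c);
}

// Hypothetical helper: conversion from char to the character type.
template <class charT>
charT widen_char(char c, const std::locale& loc)
{
    return std::use_facet<std::ctype<charT> >(loc).widen(c);
}

// Hypothetical helper: conversion from the character type to char;
// dflt is returned for characters that have no char equivalent.
template <class charT>
char narrow_char(charT c, char dflt, const std::locale& loc)
{
    return std::use_facet<std::ctype<charT> >(loc).narrow(c, dflt);
}
```

For a user-defined character type, use_facet would throw std::bad_cast unless a ctype facet for that type has been installed in the locale.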

Summary

Character types play a significant role in the standard library as template arguments to strings, iostreams, and facets. Character types have to meet certain requirements and must be accompanied by an associated character traits type and by the standard facets for that character type.

Acknowledgements

We thank Nathan Myers, Bill Plauger, and Jerry Schwarz for their willingness to answer our questions and to help in interpreting the draft standard, and Kevlin Henney for his thorough review.