Digit Separators
ISO/IEC JTC1 SC22 WG21 N3499 - 2012-12-19
Lawrence Crowl, crowl@google.com, Lawrence@Crowl.org
Problem
Solution
Constraints
Program Ambiguity
Lexical Language Compatibility
Extension Language Compatibility
Existing Grammar
2.3 Character sets [lex.charset]
2.5 Preprocessing tokens [lex.pptoken]
2.10 Preprocessing numbers [lex.ppnumber]
2.11 Identifiers [lex.name]
2.14.2 Integer literals [lex.icon]
2.14.4 Floating literals [lex.fcon]
2.14.8 User-defined literals [lex.ext]
16 Preprocessing directives [cpp]
Approaches
Remove User-Defined Literals
Typographic
Grave Accent
Single Quote
Underscore
Double Underscore
Scope Operator
Non-Digit Literal Suffix
Spacing
Double Radix Point
Backslash
Proposal
2.10 Preprocessing numbers [lex.ppnumber]
2.14.2 Integer literals [lex.icon]
2.14.4 Floating literals [lex.fcon]
2.14.8 User-defined literals [lex.ext]
References
Problem
Numeric literals of more than a few digits are hard to read. Consider the following tasks.
- Pronounce 7237498123.
- Compare 237498123 with 237499123 for equality.
- Decide whether 237499123 or 20249472 is larger.
Solution
The problem has a long-established solution in writing and typography: digit separators. In the English-speaking world, commas are usually used to separate digits.
- Pronounce 7,237,498,123.
- Compare 237,498,123 with 237,499,123 for equality.
- Decide whether 237,499,123 or 20,249,472 is larger.
We wish to introduce digit separators into C++. The exact syntax is still open. The remainder of this paper discusses various approaches to the solution.
Constraints
Constraints on digit separators arise from three distinct sources.
Program Ambiguity
Adding digit separators introduces the potential for ambiguous C++ programs. We would prefer to avoid ambiguity, and failing that would prefer to have usable rules for disambiguating the source. In particular, the interaction with user-defined literals [N2747] [N2765] should be carefully considered.
Lexical Language Compatibility
The lexical structure of C++ is shared with C, Objective C/C++, and other tools through the preprocessor. Any introduction of digit separators should carefully consider compatibility with the existing lexical structure of these languages.
Richard Smith questions the value of compatibility here.
This problem only arises if:
- Someone is attempting to write a file which is to be shared between C++14 and other languages, and
- They include text in that header which simply does not work in those other languages.
I find it hard to believe that this will be a real problem, and it seems like a clear case of user error. (If you're writing a header which works in C and C++, the burden is on you to make sure it works in C).
This is not a new issue. The same problem already exists with C++11's raw string literals, and to a lesser extent with user-defined-literals and with C's hex floats (which allow 'p+' within pp-numbers).
Extension Language Compatibility
C++ is often used as the basis for extended languages, notably Objective C/C++, but also many languages that are smaller and less widely used. Invalidating those extension languages has costs that are hard to predict.
Existing Grammar
The existing grammar provides both constraints and opportunities.
2.3 Character sets [lex.charset]
Paragraph 1 is as follows.
The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters: [Footnote: The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files. —end footnote]
a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9 _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ! = , \ " '
Of particular note, the only printable ASCII characters not used in the C++ basic character set are $ (dollar), @ (commercial at sign), and ` (grave accent, back tick). All of these characters have been used for extension characters. Dollar has also been used as an identifier character, e.g. in VAX/VMS system function names.
2.5 Preprocessing tokens [lex.pptoken]
The grammar is as follows.
preprocessing-token:
header-name
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
each non-white-space character that cannot be one of the above
Paragraph two is of special note.
A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. The categories of preprocessing token are: header names, identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single non-white-space characters that do not lexically match the other preprocessing token categories. If a ' or a " character matches the last category, the behavior is undefined. Preprocessing tokens can be separated by white space; this consists of comments (2.8), or white-space characters (space, horizontal tab, new-line, vertical tab, and form-feed), or both. As described in Clause 16, in certain circumstances during translation phase 4, white space (or the absence thereof) serves as more than preprocessing token separation. White space can appear within a preprocessing token only as part of a header name or between the quotation characters in a character literal or string literal.
The implication here is that no valid C++ program should have an isolated single or double quote character. Unfortunately, that information is less useful than it might appear, because an isolated single quote could be in use to signal an extension language interpretation.
2.10 Preprocessing numbers [lex.ppnumber]
The grammar is as follows.
pp-number:
digit
. digit
pp-number digit
pp-number nondigit
pp-number e sign
pp-number E sign
pp-number .
We would like numeric literals to fit within this syntax, as it would require the least change to existing tools, e.g. editor syntax highlighting and mouse word grabbing.
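The pp-number grammar above can be sketched as a greedy scanner. This is only an illustration under stated assumptions: the function name pp_number_length is hypothetical, universal-character-names are ignored, and the classification of letters relies on the default "C" locale. It is not taken from any real preprocessor.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the length of the longest pp-number at the
// start of `text`, or 0 if `text` does not begin with a pp-number.
std::size_t pp_number_length(const std::string& text) {
    auto is_digit = [](char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };
    auto is_nondigit = [](char c) {
        return c == '_' || std::isalpha(static_cast<unsigned char>(c)) != 0;
    };

    std::size_t i = 0;
    // Base cases: "digit" or ". digit".
    if (!text.empty() && is_digit(text[0])) {
        i = 1;
    } else if (text.size() >= 2 && text[0] == '.' && is_digit(text[1])) {
        i = 2;
    } else {
        return 0;
    }

    // Recursive cases: keep extending the pp-number as far as possible.
    while (i < text.size()) {
        char c = text[i];
        bool exponent_sign = (c == 'e' || c == 'E') && i + 1 < text.size() &&
                             (text[i + 1] == '+' || text[i + 1] == '-');
        if (exponent_sign) {
            i += 2;  // pp-number e sign, pp-number E sign
        } else if (is_digit(c) || is_nondigit(c) || c == '.') {
            i += 1;  // pp-number digit, pp-number nondigit, pp-number .
        } else {
            break;   // any other character ends the pp-number
        }
    }
    return i;
}
```

Under this sketch, 12_345 and 123_km are each consumed as a single pp-number, because the underscore is a nondigit. By contrast, 3.141'593 and 3.141`593 both stop after 3.141, which is why a quote or grave accent separator would require extending the pp-number syntax.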
2.11 Identifiers [lex.name]
The grammar is as follows.
nondigit: one of
a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z _
digit: one of
0 1 2 3 4 5 6 7 8 9
The implication in this grammar is that ignored code must still be made up of valid tokens.
2.14.2 Integer literals [lex.icon]
The grammar is as follows.
integer-literal:
decimal-literal integer-suffix opt
octal-literal integer-suffix opt
hexadecimal-literal integer-suffix opt
decimal-literal:
nonzero-digit
decimal-literal digit
octal-literal:
0
octal-literal octal-digit
hexadecimal-literal:
0x hexadecimal-digit
0X hexadecimal-digit
hexadecimal-literal hexadecimal-digit
nonzero-digit: one of
1 2 3 4 5 6 7 8 9
octal-digit: one of
0 1 2 3 4 5 6 7
hexadecimal-digit: one of
0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F
This syntax is entirely contained within the pp-number syntax.
2.14.4 Floating literals [lex.fcon]
The grammar is as follows.
floating-literal:
fractional-constant exponent-part opt floating-suffix opt
digit-sequence exponent-part floating-suffix opt
fractional-constant:
digit-sequence opt . digit-sequence
digit-sequence .
exponent-part:
e sign opt digit-sequence
E sign opt digit-sequence
sign: one of
+ -
digit-sequence:
digit
digit-sequence digit
This syntax is entirely contained within the pp-number syntax.
2.14.8 User-defined literals [lex.ext]
The grammar is as follows.
user-defined-literal:
user-defined-integer-literal
user-defined-floating-literal
user-defined-string-literal
user-defined-character-literal
user-defined-integer-literal:
decimal-literal ud-suffix
octal-literal ud-suffix
hexadecimal-literal ud-suffix
user-defined-floating-literal:
fractional-constant exponent-part opt ud-suffix
digit-sequence exponent-part ud-suffix
user-defined-string-literal:
string-literal ud-suffix
user-defined-character-literal:
character-literal ud-suffix
ud-suffix:
identifier
16 Preprocessing directives [cpp]
The grammar is as follows.
text-line:
pp-tokens opt new-line
pp-tokens:
preprocessing-token
pp-tokens preprocessing-token
The implication here is that #if-ignored program source must still be made up of valid preprocessor tokens, not arbitrary text. Many preprocessors will skip arbitrary text, though.
Approaches
There are several approaches to the solution. We evaluate them in turn.
Remove User-Defined Literals
At least Daveed Vandevoorde and N.M. Maclaren have suggested removing user-defined literals. However, removing a feature that we just introduced could be difficult.
Typographic
There are three primary typographic conventions for digit separators: a comma, a base-line dot, and a (thin) space.
C++ already uses the comma for an operator, and using it for a digit separator would introduce ambiguities in expressions such as ++a-3,4-b++, or even more simply, f(12,345).
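The existing meaning of f(12,345) can be checked directly. The following is a minimal sketch; the helper names count_args and second_of are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>

// Hypothetical helper: reports how many arguments a call received.
template <typename... Args>
constexpr std::size_t count_args(Args...) {
    return sizeof...(Args);
}

// Hypothetical helper: returns its second argument.
constexpr int second_of(int, int second) {
    return second;
}

// Today the comma separates two arguments; it does not join digits.
static_assert(count_args(12,345) == 2, "12,345 is two arguments");
static_assert(second_of(12,345) == 345, "the second argument is 345");
static_assert(count_args(12345) == 1, "12345 is one argument");
```

Any rule that read 12,345 as a single number would silently change the meaning of calls like these, which is why the comma is not a viable separator.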
C++ already uses the base-line dot as a radix point, and so it is essentially not usable as a digit separator.
Bjarne Stroustrup has suggested using a space as a separator.
- Pronounce 7 237 498 123.
- Compare 237 498 123 with 237 499 123 for equality.
- Decide whether 237 499 123 or 20 249 472 is larger.
While this approach is consistent with one common typographic style, it suffers from some compatibility problems.
- It does not match the syntax for a pp-number, and would minimally require extending that syntax.
- More importantly, there would be some syntactic ambiguity when a hexadecimal digit in the range [a-f] follows a space. The preprocessor would not know whether to perform symbol substitution starting after the space.
- It would likely make editing tools that grab "words" less reliable.
Grave Accent
Ville Voutilainen, among others, suggests using a grave accent (`) (back tick) as a digit separator.
- Pronounce 7`237`498`123.
- Compare 237`498`123 with 237`499`123 for equality.
- Decide whether 237`499`123 or 20`249`472 is larger.
This character is not part of the C++ basic source character set. The proposal has the advantage that introducing it for this purpose cannot yield any ambiguity with existing C++ code. There are two disadvantages. First, using this character in the language invalidates any meta-languages using this character to distinguish between the C++ base layer and any meta information. Second, existing preprocessors would not recognize the grave accent as part of a preprocessor number, and may thus yield incorrect results.
Single Quote
Daveed Vandevoorde suggests using a single quote [N2747]. The single quote can be thought of as an "upper comma".
- Pronounce 7'237'498'123.
- Compare 237'498'123 with 237'499'123 for equality.
- Decide whether 237'499'123 or 20'249'472 is larger.
There are two problems with this approach. First, an odd number of single quotes would result in a line of text that does not meet the preprocessor syntax for a token. While most preprocessors do not tokenize lines that are ignored in #if/#else, some preprocessors are known to emit errors for such cases. Second, existing preprocessors would not recognize the single quote as part of a preprocessor number, and may thus yield incorrect results.
Daveed Vandevoorde explains the incompatibility in more detail.
For example:
#if defined(__cplusplus)
double pie = 3.141'593;
#endif
In C, the preprocessor-tokens that are #if'ed out are (not including the double quotes) "double", "pie", "=", "3.141", "'", "593", and ";". However, single and double quotes that aren't part of a larger preprocessor-token are deemed undefined behavior (C99, 6.4/3).
Typical C compilers (GCC, clang, EDG, and MSVC for example) have no problem with it (presumably they don't try to tokenize #if'ed-out lines), but James Dennett mentioned at least one older C compiler didn't like it.
Pete Becker points out that many tools, such as syntax highlighting in editors, rely on quotes being paired. The adaptability of the tools to new expressions is an open issue.
N.M. Maclaren suggests that single quote will lead to very bad error messages with some macro-based libraries.
Underscore
The Ada programming language uses an underscore (technically, a low line) for the digit separator [AdaLRMnumlit] [AdaRDnumlit]. This approach seems to be used in VHDL and Verilog, also possibly in Algol68. (VHDL also appears to have literal suffixes.) This approach has been proposed more than once for C++, going at least as far back as 1993 [N0259].
- Pronounce 7_237_498_123.
- Compare 237_498_123 with 237_499_123 for equality.
- Decide whether 237_499_123 or 20_249_472 is larger.
In all known cases, the primary proposal has been to permit only a single underscore between digits [N0259] [N2281] [N3342]. However, [N0259] presents an option to permit underscores between the digit sequence and any prefix or suffix.
Underscores work well as a digit separator for C++03 [N0259] [N2281]. But with C++11, there exists a potential ambiguity with user-defined literals [N2747]. While the likely resolution will be some form of "max munch" rule, some mechanism must be present to disambiguate when max munch is too much. We use the term suffix separator to indicate this mechanism.
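The ambiguity can be made concrete with a small C++11 program. The suffix _2 and the namespace demo are hypothetical, chosen only to show that an underscore followed by digits is already a valid user-defined literal suffix.

```cpp
namespace demo {
// Hypothetical suffix for illustration only: doubles its operand.
constexpr unsigned long long operator""_2(unsigned long long value) {
    return value * 2;
}
}  // namespace demo

using demo::operator""_2;

// Under C++11 rules, 1_2 is the literal 1 with the suffix _2, so it yields 2.
// If the underscore were read as a digit separator, 1_2 would instead mean 12.
static_assert(1_2 == 2, "1_2 calls the literal operator with the value 1");
static_assert(7_2 == 14, "7_2 calls the literal operator with the value 7");
```

Any underscore-based separator therefore needs a rule deciding which of the two readings applies, and a way to request the other one.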
Double Underscore
[N2747] suggests a double underscore as a suffix separator.
Mike Miller provides more detail.
... one possibility that occurs to me would be to allow a trailing underscore in an integer literal. The ambiguity with user-defined literals would be resolved in favor of the plain integer literal; a user could disambiguate a user-defined literal by ending the integer part with a trailing underscore. (Double underscores would not be permitted in an integer literal.) Thus:
1_ => 1
1_2 => 12
1__2 => value 1 passed to operator "" _2
0xdead_bee_f => 0xdeadbeef
0xdead_bee__f => value 0xdeadbee passed to operator "" _f
The ambiguity with this approach arises when the suffix begins with one or more underscores.
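Mike Miller's rule can be sketched as a small splitting function. The names SplitLiteral and split_trailing_underscore are hypothetical. The treatment of an underscore followed by a non-digit (as in 12_km) is an assumption of this sketch, not something the quoted text specifies. Only decimal and hexadecimal integer spellings are handled, and built-in suffixes such as LL are simply reported as the suffix.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct SplitLiteral {
    std::string digits;  // prefix and digits, with separators removed
    std::string suffix;  // text of the suffix, empty if there is none
};

SplitLiteral split_trailing_underscore(const std::string& text) {
    SplitLiteral out;
    std::size_t i = 0;
    const bool hex = text.size() >= 2 && text[0] == '0' &&
                     (text[1] == 'x' || text[1] == 'X');
    if (hex) {
        out.digits = text.substr(0, 2);  // keep the 0x or 0X prefix
        i = 2;
    }
    auto is_digit = [hex](char c) {
        const unsigned char u = static_cast<unsigned char>(c);
        return hex ? std::isxdigit(u) != 0 : std::isdigit(u) != 0;
    };

    while (i < text.size()) {
        const char c = text[i];
        const bool has_next = i + 1 < text.size();
        if (is_digit(c)) {
            out.digits += c;
            ++i;
        } else if (c == '_' && has_next && is_digit(text[i + 1])) {
            ++i;  // single underscore before a digit: a separator, skipped
        } else if (c == '_' && !has_next) {
            ++i;  // trailing underscore: ends the literal, no suffix
        } else if (c == '_' && has_next && text[i + 1] == '_') {
            // The first underscore ends the literal; the suffix starts at the second.
            out.suffix = text.substr(i + 1);
            break;
        } else {
            // Assumption of this sketch: anything else begins the suffix.
            out.suffix = text.substr(i);
            break;
        }
    }
    return out;
}
```

With this sketch, 0xdead_bee_f yields the digits 0xdeadbeef and no suffix, while 0xdead_bee__f yields the digits 0xdeadbee and the suffix _f, matching the examples above. The sketch has no means of selecting a suffix that itself begins with two underscores, which is the ambiguity noted above.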
John Spicer suggests something slightly different.
At some point I had suggested using underscore and having a special lookup rule so that something like 0xabc_de would look for the "de" user-defined literal operator, and if not found, would treat the "de" as part of the hex literal. If you wanted to force the use of the operator, you could write 0xabc__de. If you wanted to force the use of a _de operator, you would have to write 0xabc___de.
Another alternative would be to look for the "de" form and then the "_de" form if the first was not found. That way would only require the use of three underscores in cases where you had both a "de" and "_de" operator and wanted to force use of the second.
Scope Operator
[N2747] suggests the scope operator (::) as a potential suffix separator. The scope operator would be a pure syntactic extension, as it could not otherwise follow a literal. However, it would make substrings of a literal separately subject to preprocessor symbol substitution.
Non-Digit Literal Suffix
[N3342] suggests disallowing a leading underscore followed by a digit as a user-defined literal suffix. The intent was to make a suffix separator unnecessary. However, [N3448] points out that [N3342] fails to disambiguate hexadecimal digits, particularly in the example 0xdead_beef_db, where db could be either decibel or the hexadecimal digits d and b.
One could simply not allow user-defined literals with hexadecimal literals. However, this restriction is not desirable.
Spacing
Discussions in the October 2012 standards meeting settled on using whitespace as the suffix separator. Unfortunately, that approach causes parsing problems for Objective C/C++.
Richard Smith explains.
An Objective-C message send works like this:
message-expression:
[ expression message-selector ]
message-selector:
identifier
keyword-arguments
keyword-arguments:
identifier opt : expression keyword-arguments opt
In particular, this is a valid Objective-C message send:
[self setValue: 0xff units: "cm"]
Hence any proposal which folds a pp-number followed by an identifier into a single literal will break a significant quantity of Objective-C code.
Doug Gregor elaborates.
There are two issues with allowing spaces between a literal and its suffix for Objective-C. One is a true ambiguity and one is a problem for error recovery.
The true ambiguity occurs because one can omit a parameter name from the method declaration, in which case there is no identifier before the ':' in the call. For example, one could have a message send that looks like this:
[a method:10 :11]
which calls the method "method::". Now, consider
[a method:10 _suffix:11]
Currently, this parses (unambiguously) as a message send to "method:_suffix:", i.e., it's parsed as
[a method:(10) _suffix:11] // _suffix is the name of the second argument; calls method:_suffix:
However, if we allow a space between a literal and its suffix, there is a second potential parse:
[a method:(10_suffix) :11] // _suffix is a suffix to the literal 10; calls method::
which is completely ambiguous.
The error-recovery issue is that Objective-C(++) parsers tend to rely heavily on the fact that an expression in C/C++ cannot be immediately followed by an identifier. If we see an expression followed by an identifier in an expression context, it's fairly likely that this is a message send for which the '[' has been dropped. For example, Clang detects these cases and automatically inserts the '[' for the user; this was one of the top error-recovery requests, and a regression here would be considered a major problem for our users.
Double Radix Point
Jeremiah Willcock suggests using ".." as the suffix separator. This notation is already permitted by the pp-number syntax. It is also not presently permitted by any numeric literal. Its primary disadvantage seems to be that it is unfamiliar.
Backslash
Clark Nelson suggests using "\" as the suffix separator. This notation is not permitted by the pp-number syntax. It is also not presently permitted by any numeric literal.
Proposal
In this section we present likely wording edits, parameterized by the possible choices.
2.10 Preprocessing numbers [lex.ppnumber]
Edit the grammar as follows. Note that the additional rule for pp-number may not be necessary, depending on the specific chosen format.
digit-separator:
to be determined
pp-number:
digit
. digit
pp-number digit
pp-number nondigit
pp-number e sign
pp-number E sign
pp-number .
pp-number digit-separator
2.14.2 Integer literals [lex.icon]
Edit the grammar as follows.
integer-literal:
decimal-literal integer-suffix opt
octal-literal integer-suffix opt
hexadecimal-literal integer-suffix opt
decimal-literal:
nonzero-digit
decimal-literal digit-separator opt digit
octal-literal:
0
octal-literal digit-separator opt octal-digit
hexadecimal-literal:
0x hexadecimal-digit
0X hexadecimal-digit
hexadecimal-literal digit-separator opt hexadecimal-digit
nonzero-digit: one of
1 2 3 4 5 6 7 8 9
octal-digit: one of
0 1 2 3 4 5 6 7
hexadecimal-digit: one of
0 1 2 3 4 5 6 7 8 9
a b c d e f
A B C D E F
Edit paragraph 1 as follows. Note that each **?** will be replaced by the actual chosen digit separator character(s).
An integer literal is a sequence of digits that has no period or exponent part, with optional digit separators. These separators are ignored when determining its value. .... [Example: The number twelve can be written 12, 014, or 0XC. The literals 1048576, 1**?**048**?**576, 0X100000, 0x10**?**0000, and 0**?**004**?**000**?**000 all have the same value. —end example]
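The rule that separators are ignored when determining the value can be sketched as follows. The names ParsedInteger, digit_value, and parse_integer are hypothetical, and the separator is passed as a parameter because the actual character is still to be determined. The sketch follows the edited grammar above, in which a separator may appear only between digits. It ignores suffixes and overflow.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct ParsedInteger {
    bool ok;                   // false if the text does not match the grammar
    unsigned long long value;  // meaningful only when ok is true
};

// Value of digit `c` in the given base (8, 10, or 16), or -1 if not a digit.
int digit_value(char c, int base) {
    int v = -1;
    if (c >= '0' && c <= '9') v = c - '0';
    else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
    return v < base ? v : -1;
}

ParsedInteger parse_integer(const std::string& text, char separator) {
    const ParsedInteger bad = {false, 0};
    int base = 10;
    std::size_t i = 0;
    bool have_digit = false;
    if (text.size() >= 2 && text[0] == '0' && (text[1] == 'x' || text[1] == 'X')) {
        base = 16;  // hexadecimal-literal: a digit must follow the prefix
        i = 2;
    } else if (!text.empty() && text[0] == '0') {
        base = 8;   // octal-literal: the leading 0 counts as the first element
        i = 1;
        have_digit = true;
    }

    unsigned long long value = 0;
    bool after_separator = false;
    for (; i < text.size(); ++i) {
        const char c = text[i];
        if (c == separator) {
            // A separator must follow a digit and must not be doubled.
            if (!have_digit || after_separator) return bad;
            after_separator = true;
            continue;
        }
        const int d = digit_value(c, base);
        if (d < 0) return bad;
        value = value * base + d;  // the separator itself contributes nothing
        have_digit = true;
        after_separator = false;
    }
    if (!have_digit || after_separator) return bad;  // no trailing separator
    return {true, value};
}
```

Using '?' as the stand-in separator, the five spellings in the example above all parse to 1048576, while spellings such as 1? or 0x?1 are rejected.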
2.14.4 Floating literals [lex.fcon]
Edit the grammar as follows.
floating-literal:
fractional-constant exponent-partoptfloating-suffixopt
digit-sequence exponent-part floating-suffixopt
fractional-constant:
digit-sequenceopt
.
digit-sequencedigit-sequence
.
exponent-part:
e
signopt digit-sequence
E
signopt digit-sequencesign: one of
+ -
digit-sequence:
digit
digit-sequence digit-separatoroptdigit
Edit within paragraph 1 as follows. Note that each **?** will be replaced by the actual chosen digit separator character(s).
.... The integer and fraction parts both consist of a sequence of decimal (base ten) digits, with optional digit separators. These separators are ignored when determining its value. [Example: The literals 1.602**?**176**?**565e-19 and 1.602176565e-19 have the same value. —end example] ....
2.14.8 User-defined literals [lex.ext]
Edit the grammar as follows. In the integer and floating productions, separated-suffix replaces ud-suffix.
user-defined-literal:
user-defined-integer-literal
user-defined-floating-literal
user-defined-string-literal
user-defined-character-literal
user-defined-integer-literal:
decimal-literal separated-suffix
octal-literal separated-suffix
hexadecimal-literal separated-suffix
user-defined-floating-literal:
fractional-constant exponent-part opt separated-suffix
digit-sequence exponent-part separated-suffix
user-defined-string-literal:
string-literal ud-suffix
user-defined-character-literal:
character-literal ud-suffix
separated-suffix:
literal-separator opt ud-suffix
literal-separator:
to be determined
ud-suffix:
identifier
Edit paragraph 1 as follows. Note that each **?** will be replaced by the actual chosen digit separator character(s) and each **??** will be replaced by the actual chosen literal separator character(s).
If a token matches both user-defined-literal and another literal kind, it is treated as the latter. [Example: 123_km and 123**??**km are user-defined-literals, but 123**?**456 and 12LL are integer-literals. —end example] ....
References
[N0259]
A proposal to allow Binary Literals, and some other small changes to Chapter 2: Lexical Conventions, John Max Skaller, ISO/IEC JTC1 SC22 WG21 N0259, 1993-03-26
[N2281]
Digit Separators, Lawrence Crowl, ISO/IEC JTC1 SC22 WG21 N2281, 2007-05-02
[N2747]
Ambiguity and Insecurity with User-Defined Literals, Lawrence Crowl, ISO/IEC JTC1 SC22 WG21 N2747, 2008-08-24
[N2765]
User-defined Literals (aka. Extensible Literals (revision 5)), Ian McIntosh, Michael Wong, Raymond Mak, Robert Klarer, Jens Maurer, Alisdair Meredith, Bjarne Stroustrup, David Vandevoorde, ISO/IEC JTC1 SC22 WG21 N2765, 2008-09-18
[N3250]
US-18: Removing User-Defined Literals, Douglas Gregor, ISO/IEC JTC1 SC22 WG21 N3250, 2011-02-28
[N3402]
User-defined Literals for Standard Library Types, Peter Sommerlad, ISO/IEC JTC1 SC22 WG21 N3402, 2012-09-07
[N3342]
Digit Separators coming back, Jens Maurer, ISO/IEC JTC1 SC22 WG21 N3342, 2012-01-09
[N3448]
Painless Digit Separation, Daveed Vandevoorde, ISO/IEC JTC1 SC22 WG21 N3448, 2012-09-21
[N3472]
Binary Literals in the C++ Core Language, James Dennett, ISO/IEC JTC1 SC22 WG21 N3472, 2012-10-19
[AdaLRMnumlit]
Ada '83 Language Reference Manual, Section 2.4 Numeric Literals, http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#2.4
[AdaRDnumlit]
Rationale for the Design of the Ada Programming Language, Section 2.1 Lexical Structure, http://archive.adaic.com/standards/83rat/html/ratl-02-01.html#2.1