msg47201 - (view) |
Author: Mike Brown (mike_j_brown) |
Date: 2004-10-31 07:25 |
The attached diff may be applied against v1.175 of libfuncs.tex -- http://cvs.sourceforge.net/viewcvs.py/*checkout*/python/python/dist/src/Doc/lib/libfuncs.tex?content-type=text%2Fplain&rev=1.175

chr(): A str is not in any particular encoding, so don't talk about ASCII, which does not apply to arguments > 127 anyway. Also make reference to unichr().

ord(): A str is not in any particular encoding, so don't talk about ASCII. Describe what the return value represents for each type of string (str, unicode), and mention the TypeError that will be raised on narrow unicode builds of Python.

unichr(): Mention the restrictions on the argument depending on whether Python was built with wide or narrow unicode.

The precedent in unicode() is to refer to str objects as "8-bit strings", so the wording of the above changes was chosen accordingly. |
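For readers who want to try the functions named in this patch on a current interpreter, here is a minimal sketch. It assumes Python 3.3 or later, not the Python 2.4 under discussion: there, chr() plays the role of unichr() and the narrow/wide build distinction no longer exists. The helper name describe_limits is hypothetical.

```python
import sys

def describe_limits():
    """Return the largest valid chr() argument and the error types seen."""
    errors = {}
    try:
        chr(sys.maxunicode + 1)   # one past the last valid code point
    except ValueError as exc:
        errors["chr_out_of_range"] = type(exc).__name__
    try:
        ord("ab")                 # ord() accepts only a string of length 1
    except TypeError as exc:
        errors["ord_length_2"] = type(exc).__name__
    return sys.maxunicode, errors

limit, errors = describe_limits()
print(hex(limit), errors)
```

On Python 2 narrow (UCS2) builds the limits differed, which is what the proposed wording for ord() and unichr() was meant to document.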
|
|
msg47202 - (view) |
Author: Raymond Hettinger (rhettinger) *  |
Date: 2004-10-31 07:38 |
Logged In: YES user_id=80475 The attachment didn't make it. Try again. And, FWIW, I think the documentation is perfectly clear as is. Though the ASCII reference is not strict, I think taking it out would be a mistake. Though many encodings are possible, there is a strong relationship between the number 97 and the letter 'a'. Mentioning ASCII makes that relationship clear. IOW, I am -1 on changing it until a new bytes type is introduced. |
|
|
msg47203 - (view) |
Author: Mike Brown (mike_j_brown) |
Date: 2004-10-31 07:51 |
Logged In: YES user_id=371366 That kind of resistance to using accurate, strict terminology just perpetuates common misunderstandings about the relationship between characters and encodings. |
|
|
msg47204 - (view) |
Author: Mike Brown (mike_j_brown) |
Date: 2004-10-31 08:23 |
Logged In: YES user_id=371366 Also note that I did not suggest removing the example with the letter "a". I just suggested removing the reference to "ASCII" in particular. Ideally, IMHO, the documentation for sequence types is where one should mention the strong association between strings and ASCII. It currently doesn't even really describe what a string or Unicode string is. It should state that non-Unicode strings are an abstraction in which each member of the sequence is a "character" that is actually an 8-bit value, as in Standard C, intended to represent a character in an arbitrary encoding, and that there is an _informal_ convention, in documentation, of referring to these values as being ASCII values, in part due to the notational conventions of string literals, such as using "\t", "\n", and "\r" to represent decimal values 9, 10, and 13, respectively (associations that only make sense in ASCII or ASCII-based encodings), and in part because it is easier to talk about the lower 128 values in terms of their ASCII equivalents (e.g. "chr(97) produces the string 'a'"). Likewise, the unicode type could be described as being an abstraction of 16-bit ("narrow") or 32-bit ("wide") code units, depending on how Python was built, and so on... I would see making such unambiguous statements to be a reasonable alternative to just deleting mentions of ASCII from the library docs, although I think making all of the changes would be best, as people already have preconceived notions of what a 'string' is and I know from experience that they tend to not worry about straightening out their understanding of such nuances until they get burned by assumptions built around statements like "ord() gives you the ASCII value". |
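The distinction drawn above between 8-bit values and encodings can be illustrated with a short sketch. It assumes Python 3, where the closest analogue of the 8-bit str discussed in this thread is the bytes type. The variable names and the choice of latin-1 and cp437 as sample encodings are illustrative only.

```python
# The escapes \t, \n and \r denote the values 9, 10 and 13.
control_values = [ord(c) for c in "\t\n\r"]

# A value below 128 reads the same in any ASCII-based encoding...
low = bytes([97])
low_texts = {enc: low.decode(enc) for enc in ("ascii", "latin-1", "cp437")}

# ...but a value above 127 means nothing until an encoding is chosen.
high = bytes([0xE9])
high_texts = {enc: high.decode(enc) for enc in ("latin-1", "cp437")}

print(control_values, low_texts, high_texts)
```

The same byte 0xE9 decodes to two different characters, which is why describing ord() as returning "the ASCII value" breaks down for values above 127.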
|
|
msg47205 - (view) |
Author: Mike Brown (mike_j_brown) |
Date: 2004-10-31 18:17 |
Logged In: YES user_id=371366 Oops, didn't mean to remove the assignment to fdrake when adding previous comment. |
|
|
msg47206 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2004-11-01 11:11 |
Logged In: YES user_id=38388 The new wording is indeed better than the old one. +1 on that change. However, you should use the term "code point" consistently and perhaps add a footnote explaining the difference between code point, glyph and character (Unicode strings are arrays of code points - not characters). Another note: I don't particularly like the terms "narrow" and "wide" Unicode builds. If possible, these terms should be replaced by the more accurate technical terms UCS2 and UCS4 - since the error messages relating to this difference also mention these technical terms rather than narrow or wide builds. |
|
|
msg47207 - (view) |
Author: Mike Brown (mike_j_brown) |
Date: 2004-11-02 06:56 |
Logged In: YES user_id=371366 You're right re: UCS2/UCS4. I can work up another patch.

I think you know this, but "code point" is not accurate UTR#17-conformant terminology, as it just refers to the single integer number from the code space that is available to Unicode (0x0-0xD7FF and 0xE000-0x10FFFF), bearing in mind that not all code points correspond to characters (all those whose hex values end in FFFE and FFFF, for example).

If we are just talking about what a Unicode string is in a general sense, we say it is just a sequence of characters -- a character being a unit like, say, "Latin small letter z", or "plus sign", in a writing system ("script") like Latin/Roman, Cyrillic, Hiragana, etc.

If we are talking about what the unicode type is in Python, to be accurate, we should say it is a sequence of UCS2 or UCS4 "code values", depending on how Python was compiled, and note that in its printable representation, the unicode type displays, for characters outside the ASCII range, the "code points" represented by those code values. It does this using the same syntax as for string literals, but treats surrogate pairs of code values as being representative of a single code point (e.g., a unicode object consisting of code value 0xD800 followed by 0xDC00 is printably represented by u'\U00010000' even though it's still a string of length 2 in both UCS2 and UCS4 builds of Python).

Is there a recommendation for how to refer unambiguously to an instance of a unicode type? Is it a "unicode object"? How about an instance of the str type? Is it an "8-bit string"? I notice we say "byte string" a lot but apparently not everyone is happy about that. |
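The relationship between the code values 0xD800, 0xDC00 and the code point U+10000 mentioned above follows the standard UTF-16 surrogate arithmetic. The following is a minimal sketch, assuming Python 3 (where str no longer exposes surrogate code units directly, so the pair is checked against the utf-16-be codec instead). The function name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 (high, low) code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

high, low = to_surrogate_pair(0x10000)

# Cross-check against the interpreter's own UTF-16 encoder.
encoded = "\U00010000".encode("utf-16-be")
units = [int.from_bytes(encoded[i:i + 2], "big")
         for i in range(0, len(encoded), 2)]

print(hex(high), hex(low), units == [high, low])
```

Two 16-bit code values therefore stand for one code point, which is the distinction the message draws between "code values" and "code points".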
|
|
msg47208 - (view) |
Author: Fred Drake (fdrake)  |
Date: 2005-01-19 04:52 |
Logged In: YES user_id=3066 Is the patch here finished, or was additional work needed? |
|
|
msg47209 - (view) |
Author: Mike Brown (mike_j_brown) |
Date: 2005-01-19 06:42 |
Logged In: YES user_id=371366 I was just waiting for someone to answer my question about terminology. (1) Is there a recommendation for how to refer unambiguously to an instance of a unicode type? Is it a "unicode object"? (2) How about an instance of the str type? Is it an "8-bit string"? I notice we say "byte string" a lot but apparently not everyone is happy about that. |
|
|
msg47210 - (view) |
Author: Fred Drake (fdrake)  |
Date: 2005-01-19 06:59 |
Logged In: YES user_id=3066 Ah, ok, here's some answers, then: (1) "unicode object" is right. (2) I'm happy with either "8-bit string" or "byte string", so whichever you find makes more sense in context is good. |
|
|
msg47211 - (view) |
Author: Mike Brown (mike_j_brown) |
Date: 2005-01-19 10:50 |
Logged In: YES user_id=371366 Thanks. I've attached a new copy of the patch, with minor substitutions made (UCS2 and UCS4 instead of narrow and wide, mainly). |
|
|
msg47212 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2005-01-24 06:22 |
Logged In: YES user_id=593130 I strongly prefer byte string to 8-bit string both because the former is easier to think/say and because it is more accurate. 8-bits, or rather, 256 different possible values, is a minimum but not a maximum. If, for instance, Python were ported to old machines with 6-bit chars, it would likely use 12-bit bytes (double machine bytes) with code similar to UCS2 (double 8-bit byte) unicode builds. And, given that there are no bit operations on the bytes of a byte string, the machine implementation in terms of bits is not really relevant. |
|
|
msg47213 - (view) |
Author: Fred Drake (fdrake)  |
Date: 2005-08-23 04:35 |
Logged In: YES user_id=3066 The portion of this that applies to the ord() documentation has been committed; the remainder of this patch is no longer necessary due to other changes to the documentation. Relevant portion committed to Doc/lib/libfuncs.tex revisions 1.188, 1.175.2.8. |
|
|