Issue 4153: Unicode HOWTO up to date?

Created on 2008-10-20 18:04 by terry.reedy, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (28)

msg74999 - (view)

Author: Terry J. Reedy (terry.reedy) * (Python committer)

Date: 2008-10-20 18:03

The Unicode HOWTO begins with "Warning This HOWTO has not yet been updated for Python 3000’s string object changes."

Without reading in detail, it appears it has been updated, at least somewhat, and certainly more than I feared from the warning. "The String Type Since Python 3.0, the language features a str type that contain Unicode characters" and then a section "Converting to Bytes" and a later reference to bytearrays.

So perhaps the warning is obsolete and should be removed. Also, the revision history should have at least one more entry for the 3.0 updates, which certainly were entered since 2005.

msg76240 - (view)

Author: Georg Brandl (georg.brandl) * (Python committer)

Date: 2008-11-22 10:27

Thanks for noting this! The most basic changes had been done, but I had to revise some sections for changes. Done in r67338.

msg121444 - (view)

Author: Alexander Belopolsky (belopolsky) * (Python committer)

Date: 2010-11-18 06:12

Reopening because it looks like the fix was reverted in r82301.

""" This HOWTO discusses Python 2.x’s support for Unicode, and explains various problems that people commonly encounter when trying to work with Unicode. (This HOWTO has not yet been updated to cover the 3.x versions of Python.) """ http://docs.python.org/dev/howto/unicode.html

msg121466 - (view)

Author: Alexander Belopolsky (belopolsky) * (Python committer)

Date: 2010-11-18 16:03

The changes added in r82301 are misleading because code examples in this HOWTO have been converted to 3.x. I am attaching a patch that removes "has not yet been updated to cover the 3.x" warning and makes some minor stylistic changes.

I have bumped the release version to 1.12, but I would like to remove the revision history which is largely irrelevant.

msg121474 - (view)

Author: Alexander Belopolsky (belopolsky) * (Python committer)

Date: 2010-11-18 16:45

r82301 appears to be a blind merge of r82120 from the trunk. It is fairly obvious that it was not intentional.

msg121488 - (view)

Author: Terry J. Reedy (terry.reedy) * (Python committer)

Date: 2010-11-18 19:41

Thanks for persisting with this. Looking at the patch:

@@ -65,7 +63,7 @@
 goal was to have Unicode contain the alphabets for every single human language. It turns out that even 16 bits isn't enough to meet that goal, and the modern Unicode specification uses a wider range of codes, 0-1,114,111 (0x10ffff in
-base-16).
+base 16).

I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111. In your recent (and excellent) chr/ord doc patch, you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard? I think I prefer the former.

-character with value 0x12ca (4810 decimal). The Unicode standard contains a lot
+character with value 0x12ca (4,810 decimal). The Unicode standard contains a lot

I prefer without the added comma.
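For reference, the figures under discussion can be checked in an interpreter. This is a minimal sketch, assuming Python 3.3 or later (where sys.maxunicode is 0x10ffff on every build); the variable names are illustrative and not taken from the patch.

```python
import sys

# Upper bound of the Unicode code space, written in hexadecimal.
max_code_point = 0x10FFFF
print(max_code_point)           # 1114111
print(f"{max_code_point:,}")    # 1,114,111 (with thousands separators)

# On Python 3.3+ the interpreter reports the same limit on every build.
print(sys.maxunicode == max_code_point)

# The example character from the patch: 0x12ca is 4810 in decimal.
print(0x12CA)                   # 4810

# chr() accepts the whole range and rejects anything beyond it.
last = chr(max_code_point)
try:
    chr(max_code_point + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```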

 >>> b'\x80abc'.decode("utf-8", "replace")

'\ufffdabc'

'ï¿½abc'

Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong. With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something just went wrong to patch from your file to my browser window.

@@ -281,10 +279,10 @@
 built-in :func:`ord` function that takes a one-character Unicode string and returns the code point value::

You fixed chr/ord doc, need to fix references thereto in this doc.

-point. The \U escape sequence is similar, but expects 8 hex digits, not 4::
+point. The \U escape sequence is similar, but expects eight base 16
+digits, not four::

I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.

 >>> s = "a\xac\u1234\u20ac\U00008000"
           ^^^^ two-digit hex escape
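The three escape forms in the string above can be compared side by side. The following is a small sketch rather than text from the HOWTO; the names code_points and round_trip are illustrative.

```python
s = "a\xac\u1234\u20ac\U00008000"

# \x takes two hex digits, \u four and \U eight, but each escape
# still produces exactly one character.
print(len(s))                       # 5

# ord() maps a one-character string to its code point; chr() is the inverse.
code_points = [ord(ch) for ch in s]
print([hex(cp) for cp in code_points])
# ['0x61', '0xac', '0x1234', '0x20ac', '0x8000']

round_trip = "".join(chr(cp) for cp in code_points)
print(round_trip == s)              # True
```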

msg121490 - (view)

Author: Alexander Belopolsky (belopolsky) * (Python committer)

Date: 2010-11-18 20:00

On Thu, Nov 18, 2010 at 2:41 PM, Terry J. Reedy <report@bugs.python.org> wrote: ..

> I visually parse 0-1,114,111 as 0-1, 114, 111. So I think either the commas should be removed or extra spaces are needed: 0-1114111 or 0 - 1,114,111.

What about "0 through 1,114,111"?

> you used (or stayed with) 'hexadecimal' versus 'base 16'. Do we have a standard? I think I prefer the former.

I prefer 'base 16'. I thought about changing 'hexadecimal' to 'base 16' in chr/ord docs, but decided to leave it because the term 'hexadecimal' is used elsewhere on the same page notably in hex() function description where it is quite appropriate. No, we don't have a standard. I've also seen "base-16" used elsewhere which I really don't like.

> 'ï¿½abc'
>
> Three replacements (i with diaeresis, upside-down ?, 1/2) for one bad char looks wrong.

That must be UTF-8 misinterpreted as Latin-1. Won't affect the commit.

> With IDLE I get '�abc' (? in hexagon, codepoint 65533). Perhaps something just went wrong to patch from your file to my browser window.

Yes. I get the same on the terminal window and that's what it should look like.
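The "UTF-8 misinterpreted as Latin-1" explanation can be reproduced in a few lines. This is a sketch of the effect only; it is not a claim about which tool in the chain did the misreading.

```python
bad = b"\x80abc"

# 0x80 is not a valid UTF-8 start byte, so it becomes a single U+FFFD.
decoded = bad.decode("utf-8", "replace")
print(ascii(decoded))               # '\ufffdabc'

# U+FFFD occupies three bytes in UTF-8 ...
utf8_bytes = decoded.encode("utf-8")
print(utf8_bytes)                   # b'\xef\xbf\xbdabc'

# ... and reading those bytes as Latin-1 yields three separate characters:
# i with diaeresis, inverted question mark, and one half.
mojibake = utf8_bytes.decode("latin-1")
print(ascii(mojibake))              # '\xef\xbf\xbdabc'
```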

> built-in :func:`ord` function that takes a one-character Unicode string and returns the code point value::
>
> You fixed chr/ord doc, need to fix references thereto in this doc.

I don't understand. I think "one-character Unicode string" is fine here because "Unicode string" means an abstract Unicode string, not :class:`str`.

> -point. The \U escape sequence is similar, but expects 8 hex digits, not 4::
> +point. The \U escape sequence is similar, but expects eight base 16
> +digits, not four::
>
> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.

I am fine with "hexadecimal" here. I did not like "hex".

msg121491 - (view)

Author: Alexander Belopolsky (belopolsky) * (Python committer)

Date: 2010-11-18 20:12

On Thu, Nov 18, 2010 at 3:00 PM, Alexander Belopolsky <report@bugs.python.org> wrote: ..

>> I really think of them as hex or hexadecimal digits, just as 0-9 are decimal, not base 10 digits.
>
> I am fine with "hexadecimal" here. I did not like "hex".

If you think about it, "hexadecimal digit" is a twice oxymoron because both "decimal" and "digit" imply base 10. :-) It does look like the most widely used term, nevertheless.

msg121495 - (view)

Author: Terry J. Reedy (terry.reedy) * (Python committer)

Date: 2010-11-18 20:47

0 through ... is fine with me.

Yes, hex numeral would be more accurate than hex digit.

msg121499 - (view)

Author: Raymond Hettinger (rhettinger) * (Python committer)

Date: 2010-11-19 01:02

> Yes, hex numeral would be more accurate than hex digit.

Stick with hex digit. We've used that phraseology for a long time. See string.hexdigits for example. And "hex numeral" just sounds weird -- it makes me do a double-take to see if there was some special implied meaning.
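For readers unfamiliar with the constant mentioned above, string.hexdigits is a plain string of the sixteen digits in both cases. The helper below is hypothetical and only illustrates how the constant might be used to check the body of a \u or \U escape.

```python
import string

print(string.hexdigits)             # 0123456789abcdefABCDEF

def is_hex_escape_body(text, width):
    """Return True if text is exactly `width` hex digits (hypothetical helper)."""
    return len(text) == width and all(ch in string.hexdigits for ch in text)

print(is_hex_escape_body("20ac", 4))        # True: four hex digits
print(is_hex_escape_body("0001F600", 8))    # True: eight hex digits
print(is_hex_escape_body("20ag", 4))        # False: 'g' is not a hex digit
```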

msg121547 - (view)

Author: Alexander Belopolsky (belopolsky) * (Python committer)

Date: 2010-11-19 16:22

Committed in revision 86530. Thanks Terry and Raymond for your comments. I would like to keep this issue open (at a low priority) because the question in the title is still relevant. There are many new 3.x features that are not covered, such as the surrogateescape error handler. Such topics may or may not be appropriate for a HOWTO. There are also some stylistic changes that I would like to consider:

1. Replace verbatim URLs with properly formatted hyperlinked titles of the referenced resources.
2. I couldn't figure out who the original author was. With first person passages, such as "I remember looking at Apple ][ BASIC programs, .." it may be appropriate to list the original author at the top even if the text has been changed by others over the years. At the very least the Acknowledgements section should start with "This article was originally written by X [on an occasion Y.]"
3. Examples should be properly marked up to allow sphinx to run them and check the output.
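To make the last item concrete, here is a minimal sketch of a machine-checked example using only the standard library's doctest module. It does not show the Sphinx doctest markup itself, and the function code_point_label is a hypothetical stand-in for an example from the HOWTO.

```python
import doctest

def code_point_label(ch):
    r"""Return the conventional U+XXXX label for a one-character string.

    >>> code_point_label("a")
    'U+0061'
    >>> code_point_label("\u20ac")
    'U+20AC'
    """
    return f"U+{ord(ch):04X}"

# Collect the examples embedded in the docstring and run them.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(code_point_label, "code_point_label",
                        globs={"code_point_label": code_point_label}):
    runner.run(test)

# summarize() returns a (failed, attempted) result tuple.
results = runner.summarize(verbose=False)
print(results.failed, results.attempted)
```

The docstring is a raw string so that the backslash escape reaches doctest unchanged and is interpreted when the example itself is compiled.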

msg121548 - (view)

Author: Éric Araujo (eric.araujo) * (Python committer)

Date: 2010-11-19 16:30

Agreed on 1 and 3. Regarding 2, looking at the early history of the file makes me suspect that amk is the author.

msg143310 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2011-09-01 08:04

After the recent discussions on python-dev I went through the Unicode howto and fixed a few things, then I found this issue so I'm attaching the patch here. The patch addresses mostly markup issues, but it also removes the usage of 'byte string'. A few more things that should be done:

- clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.);
- mention the differences between narrow and wide builds, including:
  - a discussion about the UCS-2/UTF-16 implementation of narrow builds;
  - something about surrogates and surrogate pairs;
  - effects of slicing and indexing on narrow builds;
  - functions/methods that (don't) accept non-BMP chars on narrow builds;
- something about Unicode support in the re module (this probably can wait after the 'regex' inclusion).

Also the codecs doc has a section about Unicode and encodings that might be moved to the howto.
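The distinction between code points and code units, and the surrogate pairs raised in the list above, can be sketched as follows. The sketch assumes Python 3.3 or later, where the narrow/wide build distinction no longer exists and len() counts code points; to_surrogate_pair is an illustrative helper, not a standard library function.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

ch = "\U0001F600"                       # one code point outside the BMP
high, low = to_surrogate_pair(ord(ch))
print(hex(high), hex(low))              # 0xd83d 0xde00

# The same two code units appear when the character is encoded as UTF-16.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])          # ['0xd83d', '0xde00']

# Code points, UTF-16 code units and UTF-8 bytes for the same string.
print(len(ch), len(units), len(ch.encode("utf-8")))    # 1 2 4
```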

msg143317 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2011-09-01 10:57

I also left a few comments on rietveld about other things that can be improved. Please reply and comment there.

msg143421 - (view)

Author: Éric Araujo (eric.araujo) * (Python committer)

Date: 2011-09-02 17:13

> something about Unicode support in the re module (this probably can wait after the 'regex' inclusion).

I’d prefer documentation for the re module now.

msg143422 - (view)

Author: Éric Araujo (eric.araujo) * (Python committer)

Date: 2011-09-02 17:38

> it also removes the usage of 'byte string'.

I see you’ve replaced it with “byte object”. I’m -0, as “byte[s] string” is not ambiguous IMO.

msg143424 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2011-09-02 17:44

There was some discussion a while ago on python-dev about it. AFAIR the outcome was that using "bytes strings" should be avoided because bytes are bytes, and not strings (until they get decoded at least). Using 'string' for both might lead people to think that there are two kinds of strings, bytes and Unicode (like in Python 2), while they should think that there are only Unicode strings and they can be converted to a bytes object (or simply to 'bytes').
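The separation described here shows up directly in the interpreter. The snippet below is a small illustration of the Python 3 behaviour, not part of the patch; the variable names are arbitrary.

```python
text = "caf\u00e9"                      # str: a sequence of Unicode code points
data = text.encode("utf-8")             # bytes: a sequence of integers 0-255

print(type(text).__name__, len(text))   # str 4
print(type(data).__name__, len(data))   # bytes 5
print(data[0])                          # 99: indexing bytes gives an int

# The two types do not mix implicitly.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Decoding is the explicit step that turns bytes back into text.
restored = data.decode("utf-8")
print(restored == text)                 # True
```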

msg143426 - (view)

Author: Éric Araujo (eric.araujo) * (Python committer)

Date: 2011-09-02 17:58

Ah, I see: you’re equating “string” with “text string” or “character string”, whereas I read “bytes string” as “finite sequence of bytes”. With this definition, there are two string types in Python 3, it’s just that they’re much more divorced than in 2.x.

> they should think that there are only Unicode strings

I’d say they should think that text processing should only happen with the one type dedicated to text, i.e. str.

> they can be converted to a bytes object (or simply to 'bytes')

Okay, +0 to use only “bytes object” (or “bytes” when it sounds better).

msg180283 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2013-01-20 10:19

New changeset 260a9afd999a by Ezio Melotti in branch '3.2': #4153: update the Unicode howto. http://hg.python.org/cpython/rev/260a9afd999a

New changeset 572ca3d35c2f by Ezio Melotti in branch '3.3': #4153: merge with 3.2. http://hg.python.org/cpython/rev/572ca3d35c2f

New changeset 034e1e076c77 by Ezio Melotti in branch 'default': #4153: merge with 3.3. http://hg.python.org/cpython/rev/034e1e076c77

msg180284 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2013-01-20 10:31

I committed the attached patch with some minor modifications, but there are still comments that should be addressed on Rietveld.

msg180738 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-01-27 03:27

The section in the HOWTO on Python's unicode support also misses the fact that the easiest way to include a Unicode character in a string literal in Python 3 is to include that character in the string literal (since source code is now treated as UTF-8 by default).
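A short sketch of the point being made, assuming the source file is saved as UTF-8 (the Python 3 default source encoding): the literal character, the hex escape and the named escape all produce the same string.

```python
# All three literals denote the same one-character string.
direct = "é"                                        # typed directly in the source
escaped = "\u00e9"                                  # four-digit hex escape
named = "\N{LATIN SMALL LETTER E WITH ACUTE}"       # Unicode character name

print(direct == escaped == named)
print(len(direct), hex(ord(direct)))
```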

msg180820 - (view)

Author: Ezio Melotti (ezio.melotti) * (Python committer)

Date: 2013-01-28 02:16

As discussed in #13997, the HOWTO should be reorganized to start with a basic introduction and then expand on more advanced topics.

See also for a couple of essays that could be linked as "see also" or integrated in the HOWTO.

msg190820 - (view)

Author: A.M. Kuchling (akuchling) * (Python committer)

Date: 2013-06-08 19:46

Continuing my tour of the howtos, here's a patch making many of the changes discussed here and on . Changes made:

state that python3 source encoding is UTF-8, and give examples
mention surrogateescape in the 'tips and tricks' section, and backslashreplace in the "Python's Unicode Support" section.
default filesystem encoding is now UTF-8, not ascii.
link to Nick Coghlan's and Ned Batchelder's notes/presentations.
remove revision history
remove usage of "I think", "I'm not going to", etc.
update acks section
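As an illustration of the two error handlers named in the list above (this is not the wording or example used in the patch), surrogateescape lets undecodable bytes survive a decode/encode round trip, while backslashreplace substitutes an escape sequence for characters that cannot be encoded.

```python
raw = b"caf\xe9"        # Latin-1 bytes; not valid UTF-8

# surrogateescape carries each undecodable byte through as a lone surrogate ...
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))                                  # 'caf\udce9'

# ... so encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
print(restored == raw)                              # True

# backslashreplace writes an escape sequence for unencodable characters.
escaped = "caf\u00e9 \u20ac".encode("ascii", "backslashreplace")
print(escaped)                                      # b'caf\\xe9 \\u20ac'
```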

Things I did not do, though they were suggested:

Move tip "Software should only work with Unicode strings internally" from the last section to somewhere earlier and more prominent. Perhaps it could go somewhere in the "Python's Unicode Support" section.
mention codecs.StreamRecoder and StreamReaderWriter (I could put this in 'tips and tricks').
Examples should be properly marked up to allow sphinx to run them and check the output. (May not be possible.)
mention unicode support in re module
clarify some more terms (e.g. codepoints, code units, characters, possibly scalar values etc.) -- I don't see why they matter, since we don't use them.
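For context on the StreamRecoder item above, here is a minimal sketch of what codecs.StreamRecoder does, using an in-memory io.BytesIO as a stand-in for a real file. The choice of Latin-1 and UTF-8 is an assumption made for the example.

```python
import codecs
import io

# Stand-in for a file that is stored as Latin-1 on disk.
latin1_stream = io.BytesIO("caf\u00e9".encode("latin-1"))

recoder = codecs.StreamRecoder(
    latin1_stream,
    # encode/decode: the encoding seen by the caller (UTF-8 here).
    codecs.getencoder("utf-8"), codecs.getdecoder("utf-8"),
    # Reader/Writer: the encoding of the underlying stream (Latin-1 here).
    codecs.getreader("latin-1"), codecs.getwriter("latin-1"),
)

# read() decodes the Latin-1 data and hands it back re-encoded as UTF-8.
utf8_data = recoder.read()
print(utf8_data)                    # b'caf\xc3\xa9'
```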

msg190835 - (view)

Author: A.M. Kuchling (akuchling) * (Python committer)

Date: 2013-06-08 22:33

Updated version of my patch, which adds two more todo items and handles Ezio's review comments:

Switch from Greek examples to French, and remove non-Latin-1 characters.
Change language for bytes.decode to "but supports a few more possible handlers".
Describe Unicode support in the re module.
Describe StreamRecoder. I don't see why StreamReaderWriter would need to be mentioned.
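As a small illustration of the Unicode support in the re module referred to in the list above (not the text added by the patch): for str patterns in Python 3, \w matches Unicode word characters unless re.ASCII is given.

```python
import re

text = "caf\u00e9 na\u00efve"

# For str patterns, \w matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)                # ['café', 'naïve']

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)                  # ['caf', 'na', 've']
```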

I do not intend to do the remaining items on the todo list (clarify some more terms; make it work with doctest).

msg190841 - (view)

Author: Alyssa Coghlan (ncoghlan) * (Python committer)

Date: 2013-06-09 04:06

amk's latest patch looks like a very nice improvement to me.

One suggested wording tweak for the aside about the simplified history: s/The average Python programmer doesn't need to know the historical details/The precise historical details aren't relevant to understanding how to use Unicode effectively/ (and then continue with "; if you're curious ..." as it does now)

msg191511 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2013-06-20 13:46

New changeset 1dbbed06a163 by Andrew Kuchling in branch '3.3': #4153: update Unicode howto for Python 3.3 http://hg.python.org/cpython/rev/1dbbed06a163

msg191513 - (view)

Author: A.M. Kuchling (akuchling) * (Python committer)

Date: 2013-06-20 14:16

As far as I can tell, there are no other outstanding suggestions for howto updates, so I'll now close this item. Feel free to re-open or file a new item if there are further improvements that can be made.

msg191638 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2013-06-22 09:11

Most of the changes are applicable to Python 2 too. Do you want to backport part of your patch to 2.7?

History

Date | User | Action | Args
2022-04-11 14:56:40 | admin | set | github: 48403
2013-06-22 09:11:32 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: +
2013-06-20 14:16:18 | akuchling | set | status: open -> closed; resolution: fixed; messages: +; stage: commit review -> resolved
2013-06-20 13:46:27 | python-dev | set | messages: +
2013-06-09 04:06:14 | ncoghlan | set | messages: +
2013-06-08 22:33:21 | akuchling | set | files: + unicode-howto.txt; messages: +
2013-06-08 19:46:46 | akuchling | set | files: + unicode-howto.txt; messages: +
2013-01-28 02:20:15 | ezio.melotti | link | issue13997 superseder
2013-01-28 02:16:25 | ezio.melotti | set | messages: +
2013-01-27 03:27:52 | ncoghlan | set | messages: +
2013-01-27 02:39:05 | cvrebert | set | nosy: + cvrebert
2013-01-20 10:31:46 | ezio.melotti | set | messages: +
2013-01-20 10:20:29 | python-dev | set | nosy: + python-dev; messages: +
2012-09-26 17:45:54 | ezio.melotti | set | assignee: ezio.melotti
2011-09-17 16:38:18 | ezio.melotti | set | nosy: + ncoghlan
2011-09-02 17:58:08 | eric.araujo | set | messages: +
2011-09-02 17:44:30 | ezio.melotti | set | messages: +
2011-09-02 17:38:27 | eric.araujo | set | messages: +
2011-09-02 17:13:56 | eric.araujo | set | messages: +
2011-09-01 10:57:45 | ezio.melotti | set | messages: +
2011-09-01 08:04:12 | ezio.melotti | set | files: + issue4153-2.diff; versions: + Python 3.3; messages: +; assignee: georg.brandl -> (no value); resolution: fixed -> (no value); stage: commit review
2010-11-19 16:30:39 | eric.araujo | set | messages: +
2010-11-19 16:22:09 | belopolsky | set | priority: normal -> low; messages: +
2010-11-19 01:02:47 | rhettinger | set | nosy: + rhettinger; messages: +
2010-11-18 20:47:05 | terry.reedy | set | messages: +
2010-11-18 20:12:20 | belopolsky | set | messages: +
2010-11-18 20:00:24 | belopolsky | set | messages: +
2010-11-18 19:42:12 | ezio.melotti | set | nosy: + ezio.melotti
2010-11-18 19:41:54 | terry.reedy | set | messages: +
2010-11-18 16:52:37 | eric.araujo | set | nosy: + eric.araujo
2010-11-18 16:48:26 | belopolsky | set | nosy: + akuchling
2010-11-18 16:45:37 | belopolsky | set | messages: +
2010-11-18 16:38:50 | belopolsky | set | files: - issue4153.diff
2010-11-18 16:38:40 | belopolsky | set | files: + issue4153.diff
2010-11-18 16:03:45 | belopolsky | set | files: + issue4153.diff; keywords: + patch; messages: +
2010-11-18 06:12:58 | belopolsky | set | status: closed -> open; versions: + Python 3.2, - Python 3.0; nosy: + belopolsky; messages: +
2008-11-22 10:27:32 | georg.brandl | set | status: open -> closed; resolution: fixed; messages: +
2008-10-20 18:04:00 | terry.reedy | create |