cpython: changeset 1dbbed06a163

--- a/Doc/howto/unicode.rst
+++ b/Doc/howto/unicode.rst
@@ -28,15 +28,15 @@ which required accented characters could
 as 'naïve' and 'café', and some publications have house styles which
 require spellings such as 'coöperate'.)
 
-For a while people just wrote programs that didn't display accents. I remember
-looking at Apple ][ BASIC programs, published in French-language publications in
-the mid-1980s, that had lines like these::
+For a while people just wrote programs that didn't display accents.
+In the mid-1980s an Apple II BASIC program written by a French speaker
+might have lines like these::
 
    PRINT "FICHIER EST COMPLETE."
    PRINT "CARACTERE NON ACCEPTE."
 
-Those messages should contain accents, and they just look wrong to someone who
-can read French.
+Those messages should contain accents (completé, caractère, accepté),
+and they just look wrong to someone who can read French.
 
 In the 1980s, almost all personal computers were 8-bit, meaning that bytes
 could hold values ranging from 0 to 255.  ASCII codes only went up to 127, so some
@@ -69,9 +69,12 @@ There's a related ISO standard, ISO 1064
 originally separate efforts, but the specifications were merged with the 1.1
 revision of Unicode.
 
-(This discussion of Unicode's history is highly simplified.  I don't think the
-average Python programmer needs to worry about the historical details; consult
-the Unicode consortium site listed in the References for more information.)
+(This discussion of Unicode's history is highly simplified.  The
+precise historical details aren't necessary for understanding how to
+use Unicode effectively, but if you're curious, consult the `Wikipedia
+entry for Unicode <http://en.wikipedia.org/wiki/Unicode#History>`_ or
+the Unicode consortium site listed in the References
+for more information.)
 
 
 Definitions
@@ -216,10 +219,8 @@ Unicode character tables.
 Another `good introductory article
 <http://www.joelonsoftware.com/articles/Unicode.html>`_
 was written by Joel Spolsky.
-If this introduction didn't make things clear to you, you should try reading this
-alternate article before continuing.
-
-.. Jason Orendorff XXX http://www.jorendorff.com/articles/unicode/ is broken
+If this introduction didn't make things clear to you, you should try
+reading this alternate article before continuing.
 
 Wikipedia entries are often helpful; see the entries for "`character encoding
 <http://en.wikipedia.org/wiki/Character_encoding>`_" and `UTF-8
@@ -239,8 +240,31 @@ Since Python 3.0, the language features
 characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
 rocks!'``, or the triple-quoted string syntax is stored as Unicode.
 
-To insert a non-ASCII Unicode character, e.g., any letters with
-accents, one can use escape sequences in their string literals as such::
+The default encoding for Python source code is UTF-8, so you can simply
+include a Unicode character in a string literal::
+
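The patch's literal example is missing from this capture. As an illustrative sketch (not the patch's exact listing), a UTF-8 source file can simply contain the character:

```python
# The source file itself is UTF-8, so an accented character can appear
# directly in a string literal (illustrative sketch, not the elided example).
s = "café"
print(len(s))                   # 4 -- four code points...
print(len(s.encode("utf-8")))   # 5 -- ...but five bytes, since 'é' is two bytes in UTF-8
```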

+
+You can use a different encoding from UTF-8 by putting a specially-formatted
+comment as the first or second line of the source code::
+
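The comment format follows PEP 263 (``# -*- coding: name -*-``). The patch's example is elided here; as a sketch, the effect of the cookie can be demonstrated by compiling Latin-1-encoded source bytes that declare their own encoding (``compile`` honors the declaration when given bytes):

```python
# Illustrative sketch, not the elided example: a PEP 263 coding declaration
# tells Python how to decode the source bytes that follow it.
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"  # 0xE9 is 'é' in Latin-1
namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)
print(namespace["name"])  # café
```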

+
+Side note: Python 3 also supports using Unicode characters in identifiers::
+
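The identifier example is elided in this capture; a minimal sketch (hypothetical variable name) might be:

```python
# Illustrative sketch: non-ASCII letters are legal in Python 3 identifiers.
répertoire = "/tmp/records.log"
print(répertoire)  # /tmp/records.log
```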

+
+If you can't enter a particular character in your editor or want to
+keep the source code ASCII-only for some reason, you can also use
+escape sequences in string literals. (Depending on your system,
+you may see the actual capital-delta glyph instead of a \u escape.) ::
 
    >>> "\N{GREEK CAPITAL LETTER DELTA}"  # Using the character name
    '\u0394'
@@ -251,7 +275,7 @@ accents, one can use escape sequences in
 In addition, one can create a string using the :func:`~bytes.decode` method of
 :class:`bytes`.  This method takes an *encoding* argument, such as ``UTF-8``,
-and optionally, an *errors* argument.
+and optionally an *errors* argument.
 
 The *errors* argument specifies the response when the input string can't be
 converted according to the encoding's rules.  Legal values for this argument are
@@ -295,11 +319,15 @@ Converting to Bytes
 The opposite method of :meth:`bytes.decode` is :meth:`str.encode`,
 which returns a :class:`bytes` representation of the Unicode string, encoded in the
-requested encoding.  The *errors* parameter is the same as the parameter of
-the :meth:`~bytes.decode` method, with one additional possibility; as well as
-``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case inserts a
-question mark instead of the unencodable character), you can also pass
-``'xmlcharrefreplace'`` which uses XML's character references.
+requested encoding.
+
+The *errors* parameter is the same as the parameter of the
+:meth:`~bytes.decode` method but supports a few more possible handlers. As well as
+``'strict'``, ``'ignore'``, and ``'replace'`` (which in this case
+inserts a question mark instead of the unencodable character), there is
+also ``'xmlcharrefreplace'`` (inserts an XML character reference) and
+``'backslashreplace'`` (inserts a ``\uNNNN`` escape sequence).
+
 The following example shows the different results::
 
    >>> u = chr(40960) + 'abcd' + chr(1972)
@@ -316,16 +344,15 @@ The following example shows the differen
    b'?abcd?'
    >>> u.encode('ascii', 'xmlcharrefreplace')
    b'&#40960;abcd&#1972;'
-
-.. XXX mention the surrogate* error handlers
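The decode-side error handlers described above can be sketched the same way (this demonstration is not part of the patch):

```python
# Sketch of the decode() error handlers: a lone 0x80 byte is not valid UTF-8.
data = b"\x80abc"
print(data.decode("utf-8", "replace"))  # '\ufffdabc' -- U+FFFD REPLACEMENT CHARACTER
print(data.decode("utf-8", "ignore"))   # 'abc' -- the bad byte is dropped
try:
    data.decode("utf-8", "strict")      # the default handler raises
except UnicodeDecodeError as exc:
    print("strict raised:", exc.reason)
```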

-The low-level routines for registering and accessing the available encodings are
-found in the :mod:`codecs` module.  However, the encoding and decoding functions
-returned by this module are usually more low-level than is comfortable, so I'm
-not going to describe the :mod:`codecs` module here.  If you need to implement a
-completely new encoding, you'll need to learn about the :mod:`codecs` module
-interfaces, but implementing encodings is a specialized task that also won't be
-covered here.  Consult the Python documentation to learn more about this module.
+The low-level routines for registering and accessing the available
+encodings are found in the :mod:`codecs` module.  Implementing new
+encodings also requires understanding the :mod:`codecs` module.
+However, the encoding and decoding functions returned by this module
+are usually more low-level than is comfortable, and writing new encodings
+is a specialized task, so the module won't be covered in this HOWTO.
 
 
 Unicode Literals in Python Source Code
@@ -415,12 +442,50 @@ These are grouped into categories such a
 from the above output, ``'Ll'`` means 'Letter, lowercase', ``'No'`` means
 "Number, other", ``'Mn'`` is "Mark, nonspacing", and ``'So'`` is "Symbol,
 other".  See
-http://www.unicode.org/reports/tr44/#General_Category_Values for a
+the `General Category Values section of the Unicode Character Database
+documentation <http://www.unicode.org/reports/tr44/#General_Category_Values>`_ for a
 list of category codes.
+
+
+Unicode Regular Expressions
+---------------------------
+
+The regular expressions supported by the :mod:`re` module can be provided
+either as bytes or strings.  Some of the special character sequences such as
+``\d`` and ``\w`` have different meanings depending on whether
+the pattern is supplied as bytes or a string.  For example,
+``\d`` will match the characters ``[0-9]`` in bytes but
+in strings will match any character that's in the ``'Nd'`` category.
+
+The string in this example has the number 57 written in both Thai and
+Arabic numerals::
+
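The example itself is elided in this capture; a sketch along the same lines (Thai digits U+0E55 and U+0E57 are in the ``'Nd'`` category) would be:

```python
import re

# Sketch of the elided example: "57" written in both Thai and Arabic numerals.
p = re.compile(r"\d+")
s = "Over \u0e55\u0e57 57,000 flavours"
print(p.findall(s))                              # Thai numerals match too
print(re.compile(r"\d+", re.ASCII).findall(s))   # only [0-9] with re.ASCII
```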

+When executed, ``\d+`` will match the Thai numerals and print them
+out.  If you supply the :const:`re.ASCII` flag to
+:func:`~re.compile`, ``\d+`` will match the substring "57" instead.
+
+Similarly, ``\w`` matches a wide variety of Unicode characters but
+only ``[a-zA-Z0-9_]`` in bytes or if :const:`re.ASCII` is supplied,
+and ``\s`` will match either Unicode whitespace characters or
+``[ \t\n\r\f\v]``.
+
 
 References
 ----------
 
+.. comment should these be mentioned earlier, e.g. at the start of the "introduction to Unicode" first section?
+
+Some good alternative discussions of Python's Unicode support are:
+
+* `Processing Text Files in Python 3 <http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html>`_, by Nick Coghlan.
+* `Pragmatic Unicode <http://nedbatchelder.com/text/unipain.html>`_, a PyCon 2012 presentation by Ned Batchelder.
+
 The :class:`str` type is described in the Python library reference at
 :ref:`textseq`.
@@ -428,12 +493,10 @@ The documentation for the :mod:`unicoded
 
 The documentation for the :mod:`codecs` module.
 
-Marc-André Lemburg gave a presentation at EuroPython 2002 titled "Python and
-Unicode".  A PDF version of his slides is available at
-<http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>, and is an
-excellent overview of the design of Python's Unicode features (based on Python
-2, where the Unicode string type is called ``unicode`` and literals start with
-``u``).
+Marc-André Lemburg gave a presentation titled `"Python and Unicode" (PDF slides)
+<http://downloads.egenix.com/python/Unicode-EPC2002-Talk.pdf>`_ at
+EuroPython 2002.  The slides are an excellent overview of the design
+of Python 2's Unicode features (where the Unicode string type is
+called ``unicode`` and literals start with ``u``).
 
 
 Reading and Writing Unicode Data
@@ -512,7 +575,7 @@ example, Mac OS X uses UTF-8 while Windo
 Windows, Python uses the name "mbcs" to refer to whatever the currently
 configured encoding is.  On Unix systems, there will only be a filesystem
 encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
-you haven't, the default encoding is ASCII.
+you haven't, the default encoding is UTF-8.
 
 The :func:`sys.getfilesystemencoding` function returns the encoding to use on
 your current system, in case you want to do the encoding manually, but there's
@@ -527,13 +590,13 @@ automatically converted to the right enc
 Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
 filenames.
 
-Function :func:`os.listdir`, which returns filenames, raises an issue: should it return
+The :func:`os.listdir` function returns filenames and raises an issue: should it return
 the Unicode version of filenames, or should it return bytes containing
 the encoded versions?  :func:`os.listdir` will do both, depending on whether you
 provided the directory path as bytes or a Unicode string.  If you pass a
 Unicode string as the path, filenames will be decoded using the filesystem's
 encoding and a list of Unicode strings will be returned, while passing a byte
-path will return the bytes versions of the filenames.  For example,
+path will return the filenames as bytes.  For example,
 assuming the default filesystem encoding is UTF-8, running the following
 program::
 
@@ -548,13 +611,13 @@ program::
 
 will produce the following output::
 
    amk:~$ python t.py
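The program and its output are elided in this capture. A sketch of the same idea (hypothetical file name, temporary directory instead of the patch's example):

```python
import os
import tempfile

# Sketch: os.listdir returns str names for a str path, bytes names for a
# bytes path.  os.fsencode converts a path using the filesystem encoding.
d = tempfile.mkdtemp()
open(os.path.join(d, "sample.txt"), "w").close()
print(os.listdir(d))                 # ['sample.txt']  -- Unicode strings
print(os.listdir(os.fsencode(d)))    # [b'sample.txt'] -- encoded bytes
```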

 
 The first list contains UTF-8-encoded filenames, and the second list contains
 the Unicode versions.
 
-Note that in most occasions, the Unicode APIs should be used.  The bytes APIs
+Note that on most occasions, the Unicode APIs should be used.  The bytes APIs
 should only be used on systems where undecodable file names can be present,
 i.e. Unix systems.
 
@@ -585,65 +648,69 @@ data also specifies the encoding, since
 clever way to hide malicious text in the encoded bytestream.
 
 
+Converting Between File Encodings
+'''''''''''''''''''''''''''''''''
+
+The :class:`~codecs.StreamRecoder` class can transparently convert between
+encodings, taking a stream that returns data in encoding #1
+and behaving like a stream returning data in encoding #2.
+
+For example, if you have an input file *f* that's in Latin-1, you
+can wrap it with a :class:`StreamRecoder` to return bytes encoded in UTF-8::
+

+
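The wrapping example is elided here; a self-contained sketch of the same technique, with ``io.BytesIO`` standing in for the Latin-1 input file *f*:

```python
import codecs
import io

# Sketch: wrap a Latin-1 byte stream so that read() returns UTF-8 bytes.
f = io.BytesIO("caf\u00e9".encode("latin-1"))
new_f = codecs.StreamRecoder(
    f,
    # en/decoder: used by read() to encode its results and
    # by write() to decode its input.
    codecs.getencoder("utf-8"), codecs.getdecoder("utf-8"),
    # reader/writer: used to read from and write to the underlying stream.
    codecs.getreader("latin-1"), codecs.getwriter("latin-1"))
print(new_f.read())  # b'caf\xc3\xa9' -- the UTF-8 encoding of 'café'
```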

+
+
+Files in an Unknown Encoding
+''''''''''''''''''''''''''''
+
+What can you do if you need to make a change to a file, but don't know
+the file's encoding?  If you know the encoding is ASCII-compatible and
+only want to examine or modify the ASCII parts, you can open the file
+with the ``surrogateescape`` error handler::
+

+

+
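The patch's example is elided in this capture; a round-trip sketch of the technique (hypothetical file contents) might look like:

```python
import os
import tempfile

# Sketch: edit only the ASCII parts of a file whose encoding is unknown.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"Copyright \xa9 2012\n")       # 0xA9 is not decodable as ASCII

with open(path, encoding="ascii", errors="surrogateescape") as f:
    text = f.read()                          # 0xA9 arrives as U+DCA9

with open(path, "w", encoding="ascii", errors="surrogateescape") as f:
    f.write(text.replace("2012", "2013"))    # change only the ASCII parts

with open(path, "rb") as f:
    print(f.read())  # b'Copyright \xa9 2013\n' -- the original byte survived
```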

+
+The ``surrogateescape`` error handler will decode any non-ASCII bytes
+as code points in a special range running from U+DC80 to
+U+DCFF (low surrogate values, which never occur in ordinary decoded
+text).  These code points will then be turned back into the
+same bytes when the ``surrogateescape`` error handler is used when
+encoding the data and writing it back out.
+
+
 References
 ----------
 
-The PDF slides for Marc-André Lemburg's presentation "Writing Unicode-aware
-Applications in Python" are available at
-<http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>
-and discuss questions of character encodings as well as how to internationalize
+One section of `Mastering Python 3 Input/Output
+<http://pyvideo.org/video/289/pycon-2010--mastering-python-3-i-o>`_,
+a PyCon 2010 talk by David Beazley, discusses text processing and binary
+data handling.
+
+The PDF slides for `Marc-André Lemburg's presentation "Writing Unicode-aware
+Applications in Python"
+<http://downloads.egenix.com/python/LSM2005-Developing-Unicode-aware-applications-in-Python.pdf>`_
+discuss questions of character encodings as well as how to internationalize
 and localize an application.  These slides cover Python 2.x only.
 
+`The Guts of Unicode in Python
+<http://pyvideo.org/video/1768/the-guts-of-unicode-in-python>`_
+is a PyCon 2013 talk by Benjamin Peterson that discusses the internal
+Unicode representation in Python 3.3.
+
 
 Acknowledgements
 ================
 
-Thanks to the following people who have noted errors or offered suggestions on
-this article: Nicholas Bastin, Marius Gedminas, Kent Johnson, Ken Krugler,
-Marc-André Lemburg, Martin von Löwis, Chad Whitacre.
-
-.. comment

-.. comment Describe Python 3.x support (new section? new document?)
-.. comment Describe use of codecs.StreamRecoder and StreamReaderWriter
-
-.. comment

+Thanks to the following people who have noted errors or offered
+suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
+Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
+Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.