PEP 3138 – String representation in Python 3000

Author: Atsuo Ishimoto
Status: Final
Type: Standards Track
Created: 05-May-2008
Python-Version: 3.0
Post-History: 05-May-2008, 05-Jun-2008


Abstract

This PEP proposes a new string representation form for Python 3000. In Python prior to Python 3000, the repr() built-in function converted arbitrary objects to printable ASCII strings for debugging and logging. For Python 3000, a wider range of characters, based on the Unicode standard, should be considered ‘printable’.

Motivation

The current repr() converts 8-bit strings to ASCII using the following algorithm.

For Unicode strings, the following additional conversions are done.

This algorithm converts any string to printable ASCII, and repr() is used as a handy and safe way to print strings for debugging or for logging. Although all non-ASCII characters are escaped, this does not matter when most of the string’s characters are ASCII. But for other languages, such as Japanese where most characters in a string are not ASCII, this is very inconvenient.

We can use print(aJapaneseString) to get a readable string, but we don’t have a similar workaround for printing strings from collections such as lists or tuples. print(listOfJapaneseStrings) uses repr() to build the string to be printed, so the resulting strings are always hex-escaped. Likewise, when open(japaneseFilename) raises an exception, the error message is something like IOError: [Errno 2] No such file or directory: '\u65e5\u672c\u8a9e', which isn’t helpful.
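
As a rough illustration of the difference, the sketch below compares the two representations of a list on a Python 3 interpreter, where repr() already keeps printable non-ASCII characters. The variable names are illustrative only, and the Python 3 built-in ascii() is used here to reproduce the older, fully escaped form.

```python
# Illustrative names; not taken from any real code base.
list_of_japanese_strings = ["日本語", "テスト"]

# Unicode-aware repr(): printable non-ASCII characters are kept as-is.
readable = repr(list_of_japanese_strings)

# ascii() escapes every non-ASCII character, like the older repr().
escaped = ascii(list_of_japanese_strings)

# Only the escaped form is printed, so this runs on any console encoding.
print(escaped)
print(readable == escaped)  # False: the two forms differ
```

The readable form is what a developer working with Japanese text would want to see in a debugger or log; the escaped form is what print(listOfJapaneseStrings) produced before this change.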

Python 3000 has a lot of nice features for non-Latin users such as non-ASCII identifiers, so it would be helpful if Python could also progress in a similar way for printable output.

Some users might be concerned that such output will mess up their console if they print binary data like images. But this is unlikely to happen in practice because bytes and strings are different types in Python 3000, so printing an image to the console won’t mess it up.

This issue was once discussed by Hye-Shik Chang [1], but was rejected.

Specification

Rationale

The repr() in Python 3000 should be Unicode-based, not ASCII-based, just like Python 3000 strings. Also, conversion should not be affected by the locale setting, because the locale is not necessarily the same as the output device’s locale. For example, it is common for a daemon process to be invoked in an ASCII setting but to write UTF-8 to its log files. Also, web applications might want to report the error information in a more readable form based on the HTML page’s encoding.

Characters not supported by the user’s console could be hex-escaped on printing, by the Unicode encoder’s error-handler. If the error-handler of the output file is ‘backslashreplace’, such characters are hex-escaped without raising UnicodeEncodeError. For example, if the default encoding is ASCII, print('Hello ¢') will print ‘Hello \xa2’. If the encoding is ISO-8859-1, ‘Hello ¢’ will be printed.
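
The effect of the error-handler can be seen without a console by encoding the same string directly. This is a minimal sketch using str.encode(); the variable names are illustrative.

```python
text = "Hello ¢"

# ASCII cannot represent U+00A2, so 'backslashreplace' substitutes an escape.
as_ascii = text.encode("ascii", errors="backslashreplace")

# ISO-8859-1 can represent U+00A2 directly as the single byte 0xA2,
# so the error-handler is never invoked.
as_latin1 = text.encode("iso-8859-1", errors="backslashreplace")

print(as_ascii)   # b'Hello \\xa2'  (a literal backslash followed by "xa2")
print(as_latin1)  # b'Hello \xa2'   (one byte with value 0xA2)
```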

The default error-handler for sys.stdout is ‘strict’. Other applications reading the output might not understand hex-escaped characters, so unsupported characters should be trapped when writing. If unsupported characters must be escaped, the error-handler should be changed explicitly. Unlike sys.stdout, sys.stderr doesn’t raise UnicodeEncodeError by default, because its default error-handler is ‘backslashreplace’. So printing error messages containing non-ASCII characters to sys.stderr will not raise an exception. Also, information about uncaught exceptions (exception object, traceback) is printed by the interpreter without raising exceptions.
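
The contrast between the two handlers can be reproduced with in-memory streams instead of the real sys.stdout and sys.stderr. The sketch below assumes an ASCII-only output device; write_with_handler is a hypothetical helper written for this illustration.

```python
import io

def write_with_handler(text, errors):
    """Write text to an in-memory ASCII stream using the given error-handler."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# A 'backslashreplace' stream (the sys.stderr default) escapes the character.
escaped = write_with_handler("Hello ¢", "backslashreplace")

# A 'strict' stream (the sys.stdout default) raises instead.
try:
    write_with_handler("Hello ¢", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(escaped, strict_failed)
```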

Alternate Solutions

To help debugging in non-Latin languages without changing repr(), other suggestions were made.

Backwards Compatibility

Changing repr() may break some existing code, especially testing code. Five of Python’s regression tests fail with this modification. If you need repr() strings without non-ASCII characters, as in Python 2, you can use the following function.

    def repr_ascii(obj):
        return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")
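
A short usage sketch of this helper follows. The sample value is illustrative, and the comparison with the Python 3 built-in ascii() is shown only for this input, not as a general equivalence.

```python
def repr_ascii(obj):
    # Same helper as above: escape everything outside ASCII in repr(obj).
    return str(repr(obj).encode("ASCII", "backslashreplace"), "ASCII")

sample = "日本語"  # illustrative value

print(repr_ascii(sample))                   # '\u65e5\u672c\u8a9e'
print(repr_ascii(sample) == ascii(sample))  # True
```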

For logging or for debugging, the following code can raise UnicodeEncodeError.

    log = open("logfile", "w")
    log.write(repr(data))  # UnicodeEncodeError will be raised
                           # if data contains unsupported characters.

To avoid exceptions being raised, you can explicitly specify the error-handler.

    log = open("logfile", "w", errors="backslashreplace")
    log.write(repr(data))  # Unsupported characters will be escaped.

For a console that uses a Unicode-based encoding, for example, en_US.utf8 or de_DE.utf8, the backslashreplace trick doesn’t work, and no printable characters are escaped. This causes a problem with similar-looking characters in Western, Greek and Cyrillic languages. These languages use similar (but different) alphabets (descended from a common ancestor) and contain letters that look alike but have different character codes. For example, it is hard to distinguish Latin ‘a’, ‘e’ and ‘o’ from Cyrillic ‘а’, ‘е’ and ‘о’. (The visual representation, of course, very much depends on the fonts used, but usually these letters are almost indistinguishable.) To avoid the problem, the user can adjust the terminal encoding to get a result suitable for their environment.
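
When such look-alike characters need to be told apart while debugging, one possible approach (not something this PEP prescribes) is to inspect the code points explicitly. The sketch below uses the built-in ascii() and the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

latin = "a"          # U+0061
cyrillic = "\u0430"  # U+0430, renders almost identically to Latin 'a'

for ch in (latin, cyrillic):
    # ascii() exposes the code point; unicodedata.name() gives its Unicode name.
    print(ascii(ch), unicodedata.name(ch))
```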

Rejected Proposals

Implementation

The author wrote a patch in http://bugs.python.org/issue2630; this was committed to the Python 3.0 branch in revision 64138 on 11-Jun-2008.

References

This document has been placed in the public domain.