[Python-Dev] PEP 383 update: utf8b is now the error handler

Zooko O'Whielacronx zookog at gmail.com
Tue May 5 17:18:29 CEST 2009


On Tue, May 5, 2009 at 8:57 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:

2. The specification should state, and the discussion emphasize, that strings which were produced by surrogate replacement must not be used in data interchange with systems that do not specifically accept such strings, and that this is the responsibility of the application.[2]

That sounds like a useful statement to make. How would an application make sure that it was producing only valid Unicode? How about adding an option to os.listdir() named "errors", with default value 'utf8b' (or 'surrogate-replace', or whatever the name is)? Then applications which need to produce only valid Unicode strings could pass errors='strict', errors='ignore', or errors='replace'. (If anyone really wants behavior like Python 3.0, then we could perhaps also add a new one just for os.listdir(), named errors='skipfilename'.)
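To make the choice of handlers concrete, here is a minimal sketch of how each one would treat a filename containing a byte that is not valid UTF-8. It assumes a later Python 3 in which the handler called 'utf8b' in this thread is spelled 'surrogateescape'; the sample bytes and the helper name is_interchange_safe are made up for illustration and are not part of any proposal.

```python
# Hypothetical raw filename: "caf" followed by a lone 0xE9 byte
# (Latin-1 "e acute"), which is not valid UTF-8.
raw = b"caf\xe9.txt"

def decode_with(handler):
    """Decode the raw name as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors=handler)

ignored = decode_with("ignore")            # undecodable byte is dropped
replaced = decode_with("replace")          # undecodable byte becomes U+FFFD
escaped = decode_with("surrogateescape")   # undecodable byte becomes U+DCE9

# 'strict' refuses to decode the name at all.
try:
    decode_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

def is_interchange_safe(name):
    """Return True if name encodes as strict UTF-8 (no lone surrogates)."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```

A check along the lines of is_interchange_safe() is one way an application could take on the responsibility described above: a surrogate-escaped name fails it, while a name decoded with 'replace' or 'ignore' passes, at the cost of no longer round-tripping to the original bytes.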

My most recent plan for Tahoe, as of the letter that I sent last night, is to emulate the behavior of Nautilus and GNU ls by using the 'replace' error handler and (emulating Nautilus) to append " (invalid encoding)" to the end of the string. (screenshot: http://zooko.com/Nautilus_vs_invalid_encoding.png )

So if I could ask os.listdir() to return filenames with U+FFFD in place of undecodable bytes, then I could subsequently do something like:

    for f in os.listdir(d, errors='replace'):
        if u"\ufffd" in f:
            f += " (invalid encoding)"

(On top of that I would have to check for collisions, but that's out of scope.)
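Since the errors argument to os.listdir() is only a suggestion here, the same effect can be approximated by listing the directory as bytes and decoding each name by hand. This is a rough sketch, assuming a Python 3 in which os.listdir() returns bytes when given a bytes path and os.fsencode() is available; label_name and listdir_labelled are hypothetical helper names, and collision checking is still left out.

```python
import os
import sys

REPLACEMENT = "\ufffd"
SUFFIX = " (invalid encoding)"

def label_name(raw_name, encoding=None):
    """Decode one raw filename, marking it if any bytes were undecodable."""
    encoding = encoding or sys.getfilesystemencoding()
    name = raw_name.decode(encoding, errors="replace")
    if REPLACEMENT in name:
        name += SUFFIX
    return name

def listdir_labelled(directory, encoding=None):
    """List a directory, returning display names with bad ones labelled."""
    # Passing a bytes path makes os.listdir() return undecoded bytes names.
    raw_names = os.listdir(os.fsencode(directory))
    return [label_name(raw, encoding) for raw in raw_names]
```

Note that a filename which legitimately contains U+FFFD would also be labelled by this check, so the suffix is a hint for display rather than a reliable test.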

Regards,

Zooko


