[Python-Dev] Bytes path related questions for Guido

Nick Coghlan ncoghlan at gmail.com
Sun Aug 24 15:04:31 CEST 2014


On 24 August 2014 14:44, Nick Coghlan <ncoghlan at gmail.com> wrote:

> 2. Should we add some additional helpers to the string module for dealing with surrogate escaped bytes and other techniques for smuggling arbitrary binary data as text?
>
> My proposal [3] is to add:
>
> * string.escapedsurrogates (constant with the 128 escaped code points)
> * string.clean(s): replaces surrogates with '\ufffd' or another specified code point
> * string.redecode(s, encoding): encodes a string back to bytes and then decodes it again using the specified encoding (the old encoding defaults to 'latin-1' to match the assumptions in WSGI)
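For illustration, here is a rough sketch of what the constant and the redecode helper from that proposal might look like. The names ESCAPED_SURROGATES and old_encoding are hypothetical, and none of this exists in the string module; it only assumes that the 128 escaped code points are U+DC80 to U+DCFF, which is what the surrogateescape error handler emits for bytes 0x80 to 0xFF.

```python
# Hypothetical stand-in for the proposed string.escapedsurrogates:
# the 128 lone surrogates that surrogateescape uses for bytes 0x80..0xFF.
ESCAPED_SURROGATES = ''.join(map(chr, range(0xDC80, 0xDD00)))


def redecode(s, encoding, old_encoding='latin-1'):
    """Sketch of the proposed string.redecode (not a real stdlib function).

    Encodes s back to bytes with old_encoding (a parameter name assumed
    here), then decodes those bytes again with the requested encoding.
    """
    return s.encode(old_encoding).decode(encoding)


# WSGI-style case: UTF-8 bytes that were decoded as latin-1.
mojibake = 'café'.encode('utf-8').decode('latin-1')
fixed = redecode(mojibake, 'utf-8')
```

Both steps use strict error handling here; a real helper would probably need an errors argument as well.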

Serhiy & Ezio convinced me to scale this one back to a proposal for "codecs.clean_surrogate_escapes(s)", which replaces surrogates that may be produced by surrogateescape. That's what string.clean() above was supposed to be, but my description was not correct, and the name was too vague for that error to be obvious to the reader.

"s != codecs.clean_surrogate_escapes(s)" would then become the check for "does this string contain any surrogate escaped bytes?"
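A minimal sketch of how that could behave, written as plain functions since codecs.clean_surrogate_escapes is only a proposal at this point. The repl parameter and the has_surrogate_escapes wrapper are assumptions made for illustration; the code point range U+DC80 to U+DCFF is what the surrogateescape error handler produces.

```python
# Map every code point surrogateescape can produce (U+DC80..U+DCFF)
# to a replacement. Other lone surrogates are deliberately left alone.
_ESCAPE_RANGE = range(0xDC80, 0xDD00)


def clean_surrogate_escapes(s, repl='\ufffd'):
    """Sketch of the proposed helper; repl is an assumed parameter."""
    return s.translate(dict.fromkeys(_ESCAPE_RANGE, repl))


def has_surrogate_escapes(s):
    """The check described above: did cleaning change the string?"""
    return s != clean_surrogate_escapes(s)


# A latin-1 encoded filename read on a UTF-8 system: the byte 0xE9 is
# not valid UTF-8, so surrogateescape smuggles it through as U+DCE9.
name = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')
cleaned = clean_surrogate_escapes(name)
```

Note that the comparison only works when the replacement is not itself one of the escaped surrogates.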

Regards, Nick.

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia


