[Python-Dev] File system path encoding on Windows

Steve Dower steve.dower at python.org
Fri Aug 19 15:33:58 EDT 2016


On 19Aug2016 1225, Daniel Holth wrote:

> #1 sounds like a great idea. I suppose surrogatepass solves approximately the same problem as Rust's WTF-8, which is a way to round-trip bad UCS-2? https://simonsapin.github.io/wtf-8/

Yep.
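
As a rough illustration of that round-trip (a minimal sketch; the file name is made up for the example, and this is plain codec behaviour rather than the proposed filesystem change itself):

```python
# A str containing a lone surrogate, as can come from a Windows file
# name that is not valid UTF-16. The name itself is hypothetical.
name = "bad\udc80name.txt"

# Strict UTF-8 refuses to encode a lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogatepass the surrogate is written out as a three-byte
# sequence and can be read back unchanged.
encoded = name.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")

print(strict_ok, encoded, decoded == name)
```

The encoded bytes are not valid UTF-8 in the strict sense, which is the same trade-off WTF-8 makes.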

> #2 sounds like it would leave several problems, since mbcs is not the same as a normal text encoding; IIUC it depends on the active code page. So if your active code page is Russian you might not be able to encode Japanese characters into MBCS.

That's correct. In 99% (or more) of cases, mbcs is going to be the same as what we currently have. The difference is that when we encode/decode in CPython we can use a different error handler than 'replace' and at least prevent the silent data loss.
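
To make the data-loss point concrete, here is a small sketch. The 'mbcs' codec only exists on Windows, so this assumes cp1251 (a Russian code page) as a stand-in for the active code page; the file name is invented for the example.

```python
# Stand-in for the active code page; on Windows this would be 'mbcs'.
CODE_PAGE = "cp1251"

# Hypothetical file name mixing Cyrillic and Japanese characters.
text = "файл_日本語.txt"

# With 'replace', the Japanese characters silently become '?'.
lossy = text.encode(CODE_PAGE, "replace")
print(lossy.decode(CODE_PAGE))

# With 'strict', the same call raises instead of losing data.
try:
    text.encode(CODE_PAGE, "strict")
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)
```

The encoding is just as limited either way; the only thing the handler changes is whether the caller finds out.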

> Solution #2a: Modify Windows so utf-8 is a valid value for the current MBCS code page.

Presumably a joke, but it won't happen: too many applications assume that the active code page uses one byte per character. It doesn't, but it's close enough that most of the time you never notice. (Incidentally, utf-16 has the same problem, since many applications assume that it's always one wchar_t per character and get away with it. At least with utf-8 you encounter multi-byte sequences often enough that you are basically forced to deal with them.)
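
A quick sketch of where the "one unit per character" assumption breaks. The sample characters are arbitrary, and cp932 is used here only as an example of a double-byte code page:

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units (wchar_t on Windows) needed for s."""
    return len(s.encode("utf-16-le")) // 2

bmp_char = "\u00e9"         # e with acute accent, inside the BMP
astral_char = "\U0001F600"  # an emoji, outside the BMP
kanji = "\u65e5"            # a Japanese character

# utf-16: usually one unit per character, so the bug hides well.
print(utf16_units(bmp_char), utf16_units(astral_char))

# utf-8: multi-byte sequences show up as soon as you leave ASCII.
print(len(bmp_char.encode("utf-8")), len(astral_char.encode("utf-8")))

# A double-byte code page: one character, two bytes.
print(len(kanji.encode("cp932")))
```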

Cheers, Steve


