[Python-Dev] Unicode and Windows

M.-A. Lemburg mal@lemburg.com
Tue, 21 Mar 2000 01:25:09 +0100


Mark Hammond wrote:

I would like to discuss Unicode on the Windows platform, and how it relates to the MBCS that Windows uses. My main goal here is to ensure that Unicode on Windows can make a round-trip to and from native Unicode stores. As an example, let's take the registry - a Windows user should be able to read a Unicode value from the registry then write it back. The value written back should be identical to the value read.

Ditto for the file system: If the filesystem is Unicode, then I would expect the following code:

    for fname in os.listdir():
        f = open(fname + ".tmp", "w")

to create filenames on the filesystem with the exact base name even when the basename contains non-ASCII characters.

However, the Unicode patches do not appear to make this possible. open() uses PyArg_ParseTuple(args, "s..."); PyArg_ParseTuple() will automatically convert a Unicode object to UTF-8, so we end up passing a UTF-8 encoded string to the C runtime fopen function.

Right. The idea with open() was to write a special version (using #ifdefs) for use on Windows platforms which does all the needed magic to convert Unicode to whatever the native format and locale is...

Using parser markers for this is obviously not the right way to get to the core of the problem. Basically, you will have to write a helper which takes a string, Unicode or some other "t" compatible object as name object and then converts it to the system's view of things.
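A helper of the kind described above might look roughly like the following sketch. It is written in present-day Python rather than C, the name to_native_name and its encoding parameter are hypothetical, and it assumes the "system's view" is simply a byte string in some named encoding:

```python
import sys


def to_native_name(name, encoding=None):
    """Return `name` as bytes in the encoding the platform expects.

    Hypothetical helper: text is encoded, while byte strings and other
    buffer-like objects are assumed to be in the native form already.
    """
    if encoding is None:
        # Assumption: the interpreter's file system encoding stands in for
        # "whatever the native format and locale is".
        encoding = sys.getfilesystemencoding()
    if isinstance(name, str):
        return name.encode(encoding)
    if isinstance(name, (bytes, bytearray, memoryview)):
        return bytes(name)
    raise TypeError("name must be text or a bytes-like object")


# Usage: the same text yields different bytes depending on the target encoding.
native = to_native_name("caf\u00e9", "cp1252")   # b'caf\xe9'
portable = to_native_name("caf\u00e9", "utf-8")  # b'caf\xc3\xa9'
```

Current Python versions provide os.fsencode() for much the same purpose; the sketch only illustrates the shape of such a helper.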

I think we had a private discussion about this a few months ago: there was some way to convert Unicode to a platform independent format which then got converted to MBCS -- don't remember the details though.

The end result of all this is that we end up with UTF-8 encoded names in the registry/on the file system. It does not seem possible to get a true Unicode string onto either the file system or in the registry.
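To make the mismatch concrete, here is a small sketch in present-day Python. It assumes a Western European ANSI code page (cp1252) as the platform's native encoding and uses a made-up file name; neither comes from the patches themselves:

```python
# Hypothetical file name containing one non-ASCII character ("café").
name = "caf\u00e9"

utf8_bytes = name.encode("utf-8")    # what a UTF-8 default hands to fopen()
ansi_bytes = name.encode("cp1252")   # what a cp1252 system expects instead

# A system that reads the UTF-8 bytes as cp1252 sees two characters
# where the original name had one.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes)      # b'caf\xc3\xa9'
print(ansi_bytes)      # b'caf\xe9'
print(ascii(misread))  # 'caf\xc3\xa9'
```

The name that ends up on disk is therefore five characters long rather than four, which is the round-trip failure described above.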

Unfortunately, I'm not experienced enough to know the full ramifications, but it appears that on Windows the default "unicode to string" translation should be done via the WideCharToMultiByte() API. This will then pass an MBCS encoded ascii string to Windows, and the "right thing" should magically happen. Unfortunately, MBCS encoding is dependent on the current locale (i.e., one MBCS sequence will mean completely different things depending on the locale). I don't see a portability issue here, as the documentation could state that "Unicode->ASCII conversions use the most appropriate conversion for the platform. If the platform is not Unicode aware, then UTF-8 will be used."
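The locale dependence can be illustrated with a short sketch in present-day Python. The two code pages used here, cp1252 (Western European) and cp1251 (Cyrillic), are chosen only as examples of what "the current locale" might select:

```python
# One and the same byte, as it might be stored in an MBCS file name.
raw = b"\xe9"

western = raw.decode("cp1252")    # LATIN SMALL LETTER E WITH ACUTE
cyrillic = raw.decode("cp1251")   # CYRILLIC SMALL LETTER SHORT I

print(ascii(western))   # '\xe9'
print(ascii(cyrillic))  # '\u0439'

# The reverse direction is lossy as well: a character that the target
# code page lacks cannot be encoded at all.
try:
    cyrillic.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(encodable)  # False
```

So a byte string written under one locale does not identify the same name under another, and some Unicode names have no representation in a given code page.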

No, no, no... :-) The default should be (and is) UTF-8 on all platforms -- whether the platform supports Unicode or not. If a platform uses a different encoding, an encoder should be used which applies the needed transformation.

This issue is the final one before I release the win32reg module. It seems critical to me that if Python supports Unicode and the platform supports Unicode, then Python unicode values must be capable of being passed to the platform. For the win32reg module I could quite possibly hack around the problem, but the more general problem (categorized by the open() example above) still remains...

Any thoughts?

Can't you use the wchar_t interfaces for the task (see the unicodeobject.h file for details)? Perhaps you can first transfer Unicode to wchar_t and then on to MBCS using a win32 API?
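The two-step route (Unicode to wchar_t, then wchar_t to MBCS through a Win32 call) can be sketched with ctypes in present-day Python. This is only an illustration: the function name unicode_to_mbcs is hypothetical, and the fallback used off Windows is an assumption added so the sketch runs elsewhere.

```python
import ctypes
import locale
import sys

CP_ACP = 0  # Win32 constant: the system default ANSI code page


def unicode_to_mbcs(text):
    """Convert text to bytes in the platform's multi-byte encoding."""
    if sys.platform != "win32":
        # Assumption for this sketch: without the Win32 API, fall back to
        # the locale's preferred encoding.
        return text.encode(locale.getpreferredencoding(False))

    convert = ctypes.windll.kernel32.WideCharToMultiByte
    convert.argtypes = [
        ctypes.c_uint,     # CodePage
        ctypes.c_ulong,    # dwFlags
        ctypes.c_wchar_p,  # lpWideCharStr (ctypes passes str as wchar_t*)
        ctypes.c_int,      # cchWideChar; -1 means NUL-terminated input
        ctypes.c_char_p,   # lpMultiByteStr (output buffer, or NULL)
        ctypes.c_int,      # cbMultiByte (buffer size, or 0 to query)
        ctypes.c_char_p,   # lpDefaultChar
        ctypes.c_void_p,   # lpUsedDefaultChar
    ]
    convert.restype = ctypes.c_int

    # First call: ask how many bytes are needed, including the trailing NUL.
    size = convert(CP_ACP, 0, text, -1, None, 0, None, None)
    if size == 0:
        raise ctypes.WinError()

    # Second call: perform the conversion into a buffer of that size.
    buf = ctypes.create_string_buffer(size)
    if convert(CP_ACP, 0, text, -1, buf, size, None, None) == 0:
        raise ctypes.WinError()
    return buf.value
```

On Windows, CPython also exposes this conversion as the "mbcs" codec, so text.encode("mbcs") is the shorter spelling there.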

-- Marc-Andre Lemburg


Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/