[Python-Dev] String encoding

Fredrik Lundh fredrik@pythonware.com
Tue, 23 May 2000 13:38:41 +0200


M.-A. Lemburg wrote:

The recent discussion about repr() et al. brought up the idea of a locale based string encoding again.

before proceeding down this (not very slippery but slightly unfortunate, imho) slope, I think we should decide whether

assert eval(repr(s)) == s

should be true for strings.

if this isn't important, nothing stops you from changing 'repr' to use isprint, without having to make sure that you can still parse the resulting string.

but if it is important, you cannot really change 'repr' without addressing the big issue.
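as an illustration only, here is a minimal sketch of what checking that assertion could look like. the helper name and the sample strings are made up for this example, and it assumes a current Python where repr() escapes whatever it needs to:

```python
def repr_round_trips(s):
    """Return True if evaluating repr(s) gives back an equal string."""
    # eval() is only applied to repr() output of our own literals here
    return eval(repr(s)) == s


# hypothetical samples: plain ASCII, a Latin-1 character, control
# characters and mixed quotes
samples = ["plain ascii", "caf\xe9", "tab\there", "quote ' and \" mixed", "\x00\x7f"]
results = [repr_round_trips(s) for s in samples]
```

a repr that decides what is printable based on the locale would have to keep passing a check like this for every string, under every locale.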

so assuming that the assertion must hold, and that changing 'repr' to be locale-dependent is a good idea, let's move on:

A support module for querying the encoding used in the current locale together with the experimental hook to set the string encoding could yield a compromise which satisfies ASCII, Latin-1 and UTF-8 proponents.

agreed.

The idea is to use the site.py module to customize the interpreter from within Python (rather than making the encoding a compile time option). This is easily doable using the (yet to be written) support module and the sys.setstringencoding() hook.

agreed.

note that parsing LANG (etc) variables on a POSIX platform is easy enough to do in Python (either in site.py or in locale.py). no need for external support modules for Unix, in other words.
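to give an idea of how little code that takes, here is a rough sketch. the function name is hypothetical, and it assumes the usual POSIX conventions: LC_ALL overrides LC_CTYPE, which overrides LANG, and values look like "language_TERRITORY.codeset@modifier":

```python
import os


def encoding_from_environ(environ=None):
    """Return the codeset named by the POSIX locale variables, or None."""
    if environ is None:
        environ = os.environ
    # the first variable that is set and non-empty wins
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name)
        if not value:
            continue
        # drop an optional "@modifier" suffix, then look for ".codeset"
        value = value.split("@", 1)[0]
        if "." in value:
            return value.split(".", 1)[1].lower()
        return None  # locale is set, but names no codeset (e.g. "C")
    return None
```

for example, a mapping with LANG set to "en_US.UTF-8" gives "utf-8", while LANG set to "C" gives None, so the caller can apply its own fallback.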

for windows, I suggest adding GetACP() to the _locale module, and letting the glue layer (site.py or locale.py) do:

if sys.platform == "win32":
    sys.setstringencoding("cp%d" % GetACP())

on mac, I think you can determine the encoding by inspecting the system font, and fall back to "macroman" if that doesn't work out. but figuring out the right way to do that is best left to anyone who actually has access to a Mac. in the meantime, just make it:

elif sys.platform == "mac":
    sys.setstringencoding("macroman")

The default encoding would be 'ascii' and could then be changed to whatever the user or administrator wants it to be on a per site basis.

Tcl defaults to "iso-8859-1" on all platforms except the Mac. assuming that the vast majority of non-Mac platforms are either modern Unixes or Windows boxes, that makes a lot more sense than US ASCII...

in other words:

else:
    # try to determine encoding from POSIX locale environment
    # variables; "encoding" stays None if that fails
    encoding = None
    ...
    sys.setstringencoding(encoding or "iso-latin-1")
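put together, the platform branches amount to a small decision function. the sketch below is only an illustration: the function and its parameters are hypothetical, it returns the encoding name instead of calling the proposed sys.setstringencoding() hook, and it spells the fallback "iso-8859-1":

```python
def guess_string_encoding(platform, acp=None, posix_encoding=None):
    """Pick a default string encoding name for the given platform.

    All three parameters are hypothetical stand-ins: 'platform' mirrors
    sys.platform, 'acp' is the number a Windows GetACP() call would
    return, and 'posix_encoding' is whatever was parsed from the POSIX
    locale variables (None if nothing could be determined).
    """
    if platform == "win32" and acp is not None:
        # Windows ANSI code pages map to "cpNNNN" encoding names
        return "cp%d" % acp
    if platform == "mac":
        return "macroman"
    if posix_encoding:
        return posix_encoding
    # nothing better known: fall back to Latin-1 rather than US ASCII
    return "iso-8859-1"
```

keeping the decision separate from the call that applies it makes each branch easy to check without touching interpreter state.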

Furthermore, the encoding should be settable on a per thread basis inside the interpreter (Python threads do not seem to inherit any per-thread globals, so the encoding would have to be set for all new threads).

is the C/POSIX locale setting thread specific?

if not, I think the default encoding should be a global setting, just like the system locale itself. otherwise, you'll just be addressing a real problem (thread/module/function/class/object specific locale handling), but not really solving it...

better use unicode strings and explicit encodings in that case.
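a short sketch of what "explicit encodings" means in practice, written against a current Python where byte strings and unicode strings are separate types. the sample data is made up for the example:

```python
# bytes as they might arrive from a Latin-1 source (hypothetical data)
raw = b"caf\xe9"

# decode with an explicitly named encoding, not a thread or
# process-wide default
text = raw.decode("iso-8859-1")

# re-encode explicitly for a consumer that expects UTF-8
utf8 = text.encode("utf-8")
```

nothing here depends on a global or per-thread default, so the result is the same no matter which locale the process runs under.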

Minor nit: due to the implementation, the C parser markers "s" and "t" and the hash() value calculation will still need to work with a fixed encoding which still is UTF-8.

can this be fixed? or rather, what changes to the buffer api are required if we want to work around this problem?

C APIs which want to support Unicode should be fixed to use "es" or query the object directly and then apply proper, possibly OS dependent conversion.

for convenience, it might be a good idea to have a "wide system encoding" too, and special parser markers for that purpose.

or can we assume that all wide system APIs use unicode all the time?

unproductive-ly yrs /F
