Issue 9335: LC_CTYPE system setting not respected by setlocale()
Created on 2010-07-23 02:38 by antlong, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (29)
Author: Anthony Long (antlong)
Date: 2010-07-23 02:38
On mac 10.5, python 2.6.4 (via mac ports) performing
len(string.letters) will produce 117 instead of 52.
from terminal:
along-mb:~ along$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
This appears to be related to:
locale.setlocale(locale.LC_CTYPE) not being respected.
len(string.letters) should produce 52.
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 02:59
I can reproduce this in Apple's idle, but not in trunk or 2.7 versions. I'll leave it open in case Ronald is interested. Antlong also reports that this happens on windows, but I cannot verify that.
Here is my session copied from idle:
Python 2.5.3c1 (release25-maint, Dec 17 2008, 21:50:37)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "copyright", "credits" or "license()" for more information.
****************************************************************
Personal firewall software may warn about the connection IDLE
makes to its subprocess using this computer's internal loopback
interface. This connection is not visible on any external
interface and no data is sent to or received from the Internet.
****************************************************************
IDLE 1.2.3c1
>>> from string import letters
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> len(letters)
117
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> print _
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzᆰᄉᄎÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
>>> letters.isalpha()
True
>>> import locale
>>> locale.getlocale()
('en_US', 'UTF8')
>>> locale.setlocale(locale.LC_CTYPE)
'en_US.UTF-8'
Author: Anthony Long (antlong)
Date: 2010-07-23 03:16
Also: windows 64x, python 2.7
- Python 2.7 (r27:82525, Jul 4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on win32
- Type "copyright", "credits" or "license()" for more information.
import string
string.letters
- 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\x9a\x9c\x9e\x9f\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
import locale
locale.getdefaultlocale()
- ('en_US', 'cp1252')
Author: Jeremy Kloth (jkloth) *
Date: 2010-07-23 03:20
Note that this behavior is only present when running IDLE. Python command-line does not show this oddity.
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 03:23
Here is a simpler test: in idle2.6,
>>> '\xff'.isalpha()
True
but in idle2.7 and plain python prompt, it is False.
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 03:42
Here is a way to reproduce this from command line:
$ python2.6
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xff'.isalpha()
False
>>> import idlelib.run
>>> '\xff'.isalpha()
True
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 03:47
Or even simpler:
$ python2.6
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import Tkinter
>>> '\xff'.isalpha()
True
Author: Anthony Long (antlong)
Date: 2010-07-23 04:02
Windows 64 bit, python 2.7:
>>> '\xff'.isalpha()
False
>>> import idlelib.run
>>> '\xff'.isalpha()
False
And on Windows 32 bit, python 2.6: both False.
Author: Anthony Long (antlong)
Date: 2010-07-23 04:02
Mac 10.5.6: py 2.6.4 - broken
Python 2.6.4 (r264:75706, Mar 18 2010, 14:58:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xff'.isalpha()
False
>>> import Tkinter
>>> '\xff'.isalpha()
True
Author: Anthony Long (antlong)
Date: 2010-07-23 04:17
Python 2.6.4, Mac 10.5:
>>> from string import letters
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF8')
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 04:18
This is clearly a Tkinter rather than Mac issue, so I am unassigning this from Ronald. This appears to be the same problem as the one Mark described in .
>>> import locale
>>> locale.nl_langinfo(locale.CODESET)
'US-ASCII'
>>> import _tkinter
>>> locale.nl_langinfo(locale.CODESET)
'UTF-8'
This happens in both 2.6 and 2.7, but seems to be deliberate. As Mark wrote in :
""" There's still the issue of the Tkinter import changing the locale, but that seems to be out of Python's control. As far as I can tell, it happens when the module initialization calls Tcl_FindExecutable, which is part of the Tcl library itself. This may well be deliberate: see
http://www.tcl.tk/cgi-bin/tct/tip/66.html """
What is still unclear to me is why, after CODESET changes to 'UTF-8', 2.6 thinks that '\xff' is a letter but 2.7 does not.
Of course, '\xff' makes little sense in 'UTF-8', but why does the answer change between versions?
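The before-and-after comparison above can be repeated for any module. What follows is a minimal sketch, assuming Python 3; the helper name `ctype_locale_around_import` is hypothetical, and it only reports the LC_CTYPE setting around an import without claiming which modules change it:

```python
import importlib
import locale


def ctype_locale_around_import(module_name):
    """Return the LC_CTYPE setting before and after importing a module.

    Hypothetical helper, not part of the locale module. Calling
    setlocale() with no second argument only queries the setting.
    """
    before = locale.setlocale(locale.LC_CTYPE)
    importlib.import_module(module_name)
    after = locale.setlocale(locale.LC_CTYPE)
    return before, after


if __name__ == "__main__":
    # "json" is a stand-in; substitute "_tkinter" to repeat the
    # experiment from this message on a build that has Tk available.
    before, after = ctype_locale_around_import("json")
    print("before:", before)
    print("after: ", after)
```

If the two values differ, the import changed the process-wide locale as a side effect.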
Author: Anthony Long (antlong)
Date: 2010-07-23 04:26
After import _tkinter, I wound up getting this, which is totally different than before:
>>> letters
'abcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xffABCDEFGHIJKLMNOPQRSTUVWXYZ\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde'
Author: Anthony Long (antlong)
Date: 2010-07-23 04:40
A bit more info:
Python 2.6.4 (r264:75706, Mar 18 2010, 14:58:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.nl_langinfo(locale.CODESET)
'US-ASCII'

along-mb:~ along$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
along-mb:~ along$
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 04:50
In 3.x, it is different:
>>> locale.nl_langinfo(locale.CODESET)
'UTF-8'
Victor,
This looks like your cup of tea.
Author: Martin v. Löwis (loewis) *
Date: 2010-07-23 08:03
I fail to see the bug in this report. '\xff' is a letter because the C library says it is. If you think the result is wrong, file a bug report with the OS vendor.
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 14:13
On Fri, Jul 23, 2010 at 4:03 AM, Martin v. Löwis <report@bugs.python.org> wrote: ..
> I fail to see the bug in this report. '\xff' is a letter because the C library says it is.
This does not explain the difference between 2.6 and 2.7. With the attached issue9335-test.py,
$ cat issue9335-test.py
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
print(chr(255).isalpha())
$ python2.7 issue9335-test.py
False
$ python2.6 issue9335-test.py
True
$ python2.5 issue9335-test.py
True
Since chr(255) = '\xff' is not a valid UTF-8 byte sequence, it makes little sense to ask whether it is a letter or not in a locale that uses UTF-8 encoding. Nevertheless the behavior changed between revisions and it is not mentioned in "what's new in 2.7". (I suspect this was introduced in (r72040), but I have not verified.)
There are two possible action items here:
1. New behavior needs to be documented. I believe 2.7 is correct because when isalpha is used to sanitize untrusted input, it is better to reject in the case of uncertainty.
2. Arguably, this is a security issue and thus eligible for backporting to 2.6.
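The claim that '\xff' is not a valid UTF-8 byte sequence can be checked directly. This is a small sketch in Python 3 syntax (the thread itself uses 2.x); `is_valid_utf8` is a hypothetical helper name:

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\xff"))      # False: 0xFF never occurs in UTF-8
print(is_valid_utf8(b"\xc3\xbf"))  # True: two-byte encoding of U+00FF
```

The byte 0xFF on its own therefore has no character to classify under a UTF-8 locale, which is why a per-byte isalpha() answer for it is questionable either way.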
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 14:20
Another issue that may be worth revisiting is whether or not it is OK for _tkinter to set the locale.
""" 21.2.2. For extension writers and programs that embed Python
Extension modules should never call setlocale(), except to find out what the current locale is. """
http://docs.python.org/dev/library/locale.html#for-extension-writers-and-programs-that-embed-python
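An application that cannot avoid a library with this side effect could save and restore the setting itself. The following is only a sketch under that assumption, in Python 3 syntax; `preserve_locale` is a hypothetical name, not an existing API:

```python
import contextlib
import locale


@contextlib.contextmanager
def preserve_locale(category=locale.LC_CTYPE):
    """Restore the locale for `category` when the block exits.

    Hypothetical helper. The locale is process-wide, so this is not
    safe if other threads depend on the locale at the same time.
    """
    saved = locale.setlocale(category)  # query only, no change
    try:
        yield saved
    finally:
        locale.setlocale(category, saved)


if __name__ == "__main__":
    original = locale.setlocale(locale.LC_CTYPE)
    with preserve_locale():
        # Stand-in for a library call that changes the locale.
        locale.setlocale(locale.LC_CTYPE, "C")
    print(locale.setlocale(locale.LC_CTYPE) == original)
```

Wrapping the import of the offending module in `with preserve_locale():` would undo whatever the module initialization did to LC_CTYPE.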
Author: Ronald Oussoren (ronaldoussoren) *
Date: 2010-07-23 14:28
This might be caused by the fix for (which is mentioned in the NEWS file).
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 15:27
> This might be caused by the fix for .
Ronald,
You are absolutely right. Reverting r80178 in the trunk restores the old behavior.
: Fixed in r80178 (trunk), r80180 (2.6), r80182 (3.2), r80183 (3.1)
I think this can be closed as out of date, but I am giving it back to you to decide whether security implications are important enough to backport to 2.5.
Anthony,
Please open a separate issue for Tkinter if you want it considered. It was rejected once already [], but even if Tkinter behavior is deemed appropriate, I think it should at least be documented.
Author: Ronald Oussoren (ronaldoussoren) *
Date: 2010-07-23 15:45
Why do you think this may have security implications?
I'm closing this as out of date because the issue is fixed and the fix is imho inappropriate for a backport to 2.6 due to the change in behaviour.
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-23 15:53
Accepting binary input where only letters are expected by an application is a very common source of security holes. An application that relies on s.isalpha() to guarantee that s does not contain non-ASCII characters when UTF-8 locale is in use may have a security hole if it is run with python 2.5.
Author: Martin v. Löwis (loewis) *
Date: 2010-07-24 10:33
If an application uses .isalpha for a security-relevant check, this is a security issue in the application, not in Python.
Author: Anthony Long (antlong)
Date: 2010-07-24 10:35
I disagree. It's expected that the function will return valid data. This doesn't return valid data so isalpha() is compromised.
Author: Ronald Oussoren (ronaldoussoren) *
Date: 2010-07-24 10:41
I agree with Martin that the security problem would be in the application, not python itself.
Testing with isalpha is generally not the right thing to do anyway; it is much better to restrict input to a known-good set of data, such as by using regular expressions. For multi-byte encodings like UTF-8 you cannot rely on per-byte calls to isalpha anyway. The situation is even worse for an encoding like Shift-JIS where you need context to know if a byte is part of a multi-byte value.
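One way to read the known-good-set suggestion is an explicit allow-list that does not consult the locale at all. This is a minimal sketch in Python 3 syntax; the pattern and the name `is_ascii_letters` are illustrative assumptions, not something proposed verbatim in this thread:

```python
import re

# Explicit allow-list: ASCII letters only, independent of the locale.
_ASCII_LETTERS = re.compile(r"[A-Za-z]+")


def is_ascii_letters(text):
    """Return True if `text` is a non-empty string of ASCII letters."""
    return _ASCII_LETTERS.fullmatch(text) is not None


print(is_ascii_letters("Python"))     # True
print(is_ascii_letters("Pyth\xffn"))  # False
print(is_ascii_letters(""))           # False
```

Because the character class is spelled out, the result cannot change when a library alters LC_CTYPE behind the application's back.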
Author: Martin v. Löwis (loewis) *
Date: 2010-07-24 11:37
> I disagree. It's expected that the function will return valid data. This doesn't return valid data so isalpha() is compromised.
What is "valid data"? The function (isalpha) should return a boolean, and it does. So the result is certainly "valid".
The documentation says "For 8-bit strings, this method is locale-dependent." So it is correct if it returns what the OS vendor says to return.
Author: Anthony Long (antlong)
Date: 2010-07-24 12:23
The locale is set incorrectly though, so it is not valid data. Valid data is a-Z, nothing more, nothing less, and the locale and the alphabet should not be changed.
Author: STINNER Victor (vstinner) *
Date: 2010-07-25 23:27
> Victor, This looks like your cup of tea.
Unicode is my cup of tea, but not programs considering that bytes are characters.
.isalpha() doesn't mean anything to me :-)
This issue is more a question about the C library than about Python :-) So try the attached program "isalpha.c" if you would like to test your libc.
Results on my Linux box (Debian Sid, eglibc 2.11.2):
$ ./isalpha C
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz (52)
$ ./isalpha fr_FR.UTF-8
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz (52)
$ ./isalpha fr_FR.iso88591
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff (117)
$ ./isalpha fr_FR.iso885915@euro
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xa6\xa8\xaa\xb4\xb5\xb8\xba\xbc\xbd\xbe\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff (124)
If your libc considers that \xff is a valid UTF-8 character, you should change your OS for a better one :-)
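The contents of isalpha.c are not shown in this thread. As a rough equivalent, the same question can be put to the C library from Python through ctypes. This is a sketch only, assuming Python 3 on a Unix-like platform where libc can be loaded this way; `libc_alpha_bytes` is a hypothetical name:

```python
import ctypes
import ctypes.util
import locale


def libc_alpha_bytes(locale_name):
    """List the byte values for which the C library's isalpha() is true.

    Sketch only: this is not the attached isalpha.c, and loading libc
    through ctypes like this assumes a Unix-like platform.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    libc.isalpha.argtypes = [ctypes.c_int]
    libc.isalpha.restype = ctypes.c_int

    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return [c for c in range(256) if libc.isalpha(c)]
    finally:
        # Put the process-wide locale back the way it was.
        locale.setlocale(locale.LC_CTYPE, saved)


if __name__ == "__main__":
    letters = libc_alpha_bytes("C")
    print(bytes(letters), "(%d)" % len(letters))
```

Passing another locale name, such as one of those in the results above, requires that locale to be installed; otherwise setlocale() raises locale.Error.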
--
len(letters) 117 ... locale.setlocale(locale.LC_CTYPE) 'en_US.UTF-8'
It looks like Mac OS X uses ISO-8859-1 instead of UTF-8.
--
string.letters is built using strop.lowercase + strop.uppercase which are built using the C functions islower() and isupper(). locale.setlocale() regenerates strop/string.lowercase, strop/string.uppercase and string.letters for LC_CTYPE and LC_ALL categories.
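The construction described above can be illustrated with the two predicates passed in as parameters. This is a sketch in Python 3 syntax, not the strop source; `build_letters` and the ASCII-only predicates are hypothetical stand-ins for the C functions:

```python
def build_letters(islower, isupper):
    """Collect lowercase and uppercase byte values, then join them.

    `islower` and `isupper` are hypothetical stand-ins for the C
    library predicates; each takes an int in range(256).
    """
    lowercase = bytes(c for c in range(256) if islower(c))
    uppercase = bytes(c for c in range(256) if isupper(c))
    return lowercase + uppercase


def ascii_islower(c):
    return ord("a") <= c <= ord("z")


def ascii_isupper(c):
    return ord("A") <= c <= ord("Z")


letters = build_letters(ascii_islower, ascii_isupper)
print(len(letters))  # 52
```

With predicates that accept bytes above 0x7F, as a Latin-1 style LC_CTYPE does, the same loop yields a longer string, which is how len(string.letters) can grow beyond 52.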
--
You don't need to run IDLE or import Tkinter to set the locale:
import locale; locale.setlocale(locale.LC_ALL, '')
is enough.
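The one-liner can be wrapped so that the effect is visible. This is a minimal sketch in Python 3 syntax; `adopt_user_locale` is a hypothetical name, and the values printed depend entirely on the environment:

```python
import locale


def adopt_user_locale():
    """Switch to the locale named by the environment and report LC_CTYPE.

    Returns None when the environment names a locale that the C
    library does not support.
    """
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        return None
    return locale.setlocale(locale.LC_CTYPE)


if __name__ == "__main__":
    print("LC_CTYPE before:", locale.setlocale(locale.LC_CTYPE))
    print("LC_CTYPE after: ", adopt_user_locale())
```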
--
A library should not change the locale (only the application).
$ python2.6
>>> import locale
>>> locale.getlocale()
(None, None)
>>> import Tkinter
>>> locale.getlocale()
('fr_FR', 'UTF8')
=> Tkinter is a horrible library! (The bug is in the C library, not in the Python wrapper)
Use a better one like Gtk or Qt ;-)
$ python
>>> import locale
>>> import pygtk
>>> locale.getlocale()
(None, None)
>>> import PyQt4
>>> locale.getlocale()
(None, None)
(IDLE is based on Tkinter)
--
I don't understand why Alexander gets different results on Python 2.6 and Python 2.7.
@belopolsky: Are both programs linked to (built with?) the same C library? (same library version)
Author: Alexander Belopolsky (belopolsky) *
Date: 2010-07-26 00:20
On Sun, Jul 25, 2010 at 7:27 PM, STINNER Victor <report@bugs.python.org> wrote: ..
> Unicode is my cup of tea, but not programs considering that bytes are characters.
What I called "your cup of tea" was 3.x returning 'UTF-8' from locale.nl_langinfo(locale.CODESET) where 2.x returned 'US-ASCII'. (In both cases this was the first call to locale module functions.)
> I don't understand why Alexander gets different results on Python 2.6 and Python 2.7.
It looks like you have missed most of the discussion under this issue. Sorry that you had to reinvestigate. Ronald explained the difference in . He introduced a workaround for broken OSX C library isalpha in r80178.
Author: STINNER Victor (vstinner) *
Date: 2010-07-26 00:26
Oops, the issue is already closed /o\
History
Date | User | Action | Args
2022-04-11 14:57:04 | admin | set | github: 53581
2010-07-26 00:26:23 | vstinner | set | messages: +
2010-07-26 00:20:55 | belopolsky | set | messages: +
2010-07-25 23:27:20 | vstinner | set | files: + isalpha.c; messages: +
2010-07-24 12:23:24 | antlong | set | messages: +
2010-07-24 11:37:47 | loewis | set | messages: +
2010-07-24 10:41:11 | ronaldoussoren | set | messages: +
2010-07-24 10:35:29 | antlong | set | messages: +
2010-07-24 10:33:54 | loewis | set | messages: +
2010-07-23 15:53:12 | belopolsky | set | messages: +
2010-07-23 15:45:08 | ronaldoussoren | set | status: pending -> closed; messages: +
2010-07-23 15:27:01 | belopolsky | set | status: open -> pending; assignee: belopolsky -> ronaldoussoren; components: + macOS; versions: - Python 2.6; nosy: loewis, ronaldoussoren, mark.dickinson, belopolsky, vstinner, eric.smith, jkloth, eric.araujo, antlong; messages: +; resolution: out of date; stage: resolved
2010-07-23 14:28:00 | ronaldoussoren | set | messages: +
2010-07-23 14:27:09 | eric.araujo | set | nosy: + eric.araujo
2010-07-23 14:20:21 | belopolsky | set | nosy: + eric.smith; messages: +; components: + Interpreter Core, - macOS; type: behavior
2010-07-23 14:13:26 | belopolsky | set | files: + issue9335-test.py; messages: +
2010-07-23 08:03:38 | loewis | set | nosy: + loewis; messages: +
2010-07-23 04:50:34 | belopolsky | set | nosy: + vstinner; messages: +
2010-07-23 04:40:35 | antlong | set | messages: +
2010-07-23 04:26:50 | antlong | set | messages: +
2010-07-23 04:20:29 | belopolsky | set | assignee: ronaldoussoren -> belopolsky
2010-07-23 04:20:10 | belopolsky | set | nosy: + mark.dickinson
2010-07-23 04:18:52 | belopolsky | set | messages: +
2010-07-23 04:17:47 | antlong | set | messages: +
2010-07-23 04:02:35 | antlong | set | messages: +
2010-07-23 04:02:09 | antlong | set | messages: +
2010-07-23 03:47:01 | belopolsky | set | nosy: ronaldoussoren, belopolsky, jkloth, antlong; messages: +; components: + Tkinter
2010-07-23 03:42:15 | belopolsky | set | messages: +
2010-07-23 03:23:01 | belopolsky | set | messages: +
2010-07-23 03:20:25 | jkloth | set | nosy: + jkloth; messages: +
2010-07-23 03:16:12 | antlong | set | nosy: ronaldoussoren, belopolsky, antlong; type: behavior -> (no value); messages: +; components: - IDLE
2010-07-23 02:59:44 | belopolsky | set | nosy: + belopolsky; messages: +; components: + IDLE; type: behavior
2010-07-23 02:38:29 | antlong | create |