Issue 2562: Cannot use non-ascii letters in disutils if setuptools is used.
Created on 2008-04-06 09:47 by tarek, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (30)
Author: Tarek Ziadé (tarek) *
Date: 2008-04-06 09:47
If I try to put my name in the Author field as a string, it will break, because distutils assumes that the fields are byte strings encoded in ASCII: it decodes them into unicode, then encodes them in utf8 to send the data.
See in distutils.command.register.post_to_server :
value = unicode(value).encode("utf-8")
One way to avoid this error is to provide unicode for all fields, but this fails further along if setuptools is used, because that package assumes that the fields are strings::
    self.run_command('egg_info')
    ... distutils/dist.py", line 1047, in write_pkg_info
    pkg_info.write('Author: %s\n' % self.get_contact() )
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 18: ordinal not in range(128)
So I guess distutils shouldn't guess that it receives ascii strings and do a raw unicode() call, and should make the assumption that it receives unicode fields only.
Since many packages out there use strings, I have left a unicode() call in my patch, together with a warning.
test provided.
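To make the failure mode concrete, here is a minimal sketch, written for Python 3 so it can be run today, that emulates what Python 2's unicode(value) does with a byte string: it decodes with the default codec, assumed here to be ASCII. The helper names py2_unicode and post_value are hypothetical and are not part of distutils.

```python
def py2_unicode(value, default_encoding="ascii"):
    """Emulate Python 2's unicode(value) (hypothetical helper)."""
    if isinstance(value, bytes):
        # Python 2 decodes byte strings implicitly with the default codec.
        return value.decode(default_encoding)
    return str(value)


def post_value(value):
    """Mirror of the line: value = unicode(value).encode("utf-8")"""
    return py2_unicode(value).encode("utf-8")


ascii_author = b"Tarek Ziade"                 # ASCII-only byte string
utf8_author = "Tarek Ziadé".encode("utf-8")   # non-ASCII byte string
text_author = "Tarek Ziadé"                   # text ("unicode" in Python 2)

print(post_value(ascii_author))   # ASCII bytes survive the round trip
print(post_value(text_author))    # text is encoded to UTF-8

try:
    post_value(utf8_author)
except UnicodeDecodeError as exc:
    print("non-ASCII byte string fails:", exc.reason)
```

Only the non-ASCII byte string fails, which is the case reported above.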
Author: Martin v. Löwis (loewis) *
Date: 2008-04-06 13:21
The official supported way for non-ASCII characters in distutils is to use Unicode strings. If anything else fails, that's not a bug.
IIUC, in this case, it's setuptools that fails, not distutils. Assuming I understood correctly, I'm closing this as won't-fix/3rd party.
Author: Tarek Ziadé (tarek) *
Date: 2008-04-06 13:43
In that case, distutils should not do a unicode() call over each field passed before .encode('utf8') is called, because it makes the assumption that string type can be used.
Author: Martin v. Löwis (loewis) *
Date: 2008-04-06 13:59
I don't understand. It is certainly allowed to use byte strings for these data, as long as they are ASCII. The Unicode requirement exists only for non-ASCII characters, and distutils makes explicit, deliberate use of the default encoding here (hoping that nobody changed it away from ASCII).
There are tons of setup.py files out there that use plain byte strings, and there is no reason to break them, e.g. by mandating that the string is Unicode already.
Author: Tarek Ziadé (tarek) *
Date: 2008-04-06 14:14
ok I see what you mean, thanks for the explanation
Author: Tarek Ziadé (tarek) *
Date: 2008-04-06 14:41
oh, hold on, it is more complicated in fact :)
setuptools calls DistributionMetadata.dist.write_pkg_file() method to write the .egg-info file.
This method assumes that the metadata fields are strings, so it is not setuptools' fault.
This code fails the same way:
dist = Distribution(attrs={'author': u'Mister Café'})
dist.metadata.write_pkg_file(file)
So I guess the patch needs to be done in distutils.dist.DistributionMetadata, so it checks upon the type of field before it runs:
file.write('Author: %s\n' % self.get_contact() )
That is what I meant when I said that distutils should decide whether it works with unicode or str for these fields.
I can re-write a new patch if you agree on this
Author: Martin v. Löwis (loewis) *
Date: 2008-04-06 17:09
I agree there is a bug in distutils. Before we proceed, I think distutils-sig needs to be consulted. My proposal would be the one I suggested earlier: all strings should either be Unicode or ASCII-only byte strings. This contradicts the documentation, which says that none of the strings must be Unicode, so it would be an incompatible change (and would indeed likely break packages that currently use UTF-8, and sdist, but never register).
Author: Martin v. Löwis (loewis) *
Date: 2008-04-06 17:17
As a follow-up: for compatibility, it might be possible to support either Unicode or arbitrary plain strings in write_pkg_file. In 3k, such support can then be dropped.
As that constitutes a new feature, it shouldn't be applied to 2.5.
Author: Tarek Ziadé (tarek) *
Date: 2008-04-07 08:06
ok, I'll summarize this in distutils-sig sometime today.
If we do use Unicode, I think we might need an extra meta-data, "encoding", that would default to "utf8", and that could be used when the class needs to serialize the data in a file.
Author: Tarek Ziadé (tarek) *
Date: 2008-04-07 08:17
adding a sample patch to show a possible implementation, and to point the problem to people
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-04-07 15:51
Note that
value = unicode(value).encode("utf-8")
will also work if value is already Unicode, so a backwards compatible fix would be to allow passing in:
- ASCII encoded strings
- Unicode objects
for the meta data keyword parameters and then apply unicode() to all the meta-data arguments.
I don't think that we should support non-ASCII encodings for meta-data strings passed to setup().
If setuptools is broken in this respect, it needs to be fixed. Ditto for other 3rd party tools.
Author: Martin v. Löwis (loewis) *
Date: 2008-04-07 19:37
> If we do use Unicode, I think we might need an extra meta-data, "encoding", that would default to "utf8", and that could be used when the class needs to serialize the data in a file.
I don't think so. Whenever the data is written to a file, the file format should specify the encoding.
Regards, Martin
Author: Martin v. Löwis (loewis) *
Date: 2008-04-07 19:39
> I don't think that we should support non-ASCII encodings for meta-data strings passed to setup().
> If setuptools is broken in this respect, it needs to be fixed. Ditto for other 3rd party tools.
We do need to support non-ASCII files, as distutils didn't previously even support Unicode strings, and people still wanted to get their names right. It's not about setuptools, and not about other 3rd party tools. It's about distutils packages which we need to continue to support.
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-04-07 19:49
Agreed, but any change will target the package authors who can easily upgrade their packages to use Unicode for e.g. names.
If the change were to address distutils users, we'd have to be a lot more careful.
In any case, if UTF-8 is the de facto standard used in older packages, then we should probably use that as a fallback solution if the ASCII assumption doesn't work out:
try:
    value = unicode(value)
except UnicodeDecodeError:
    value = unicode(value, 'utf-8')
value = value.encode('utf-8')
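The fallback above can be sketched in Python 3 terms as follows. This is an illustration under stated assumptions, not code from any patch on this issue: the function name to_utf8 is hypothetical, and bytes stands in for the Python 2 str type.

```python
def to_utf8(value):
    """Return UTF-8 bytes for text, ASCII bytes, or UTF-8 bytes.

    Hypothetical helper: try the documented ASCII assumption first,
    then fall back to UTF-8 as the de facto encoding of older packages.
    """
    if isinstance(value, bytes):
        try:
            text = value.decode("ascii")
        except UnicodeDecodeError:
            text = value.decode("utf-8")
    else:
        text = str(value)
    return text.encode("utf-8")


print(to_utf8(b"Tarek Ziade"))                  # ASCII byte string
print(to_utf8("Tarek Ziadé".encode("utf-8")))   # UTF-8 byte string
print(to_utf8("Tarek Ziadé"))                   # text
```

A byte string in some other encoding (Latin-1, for example) that is not valid UTF-8 would still raise UnicodeDecodeError in this sketch, so the fallback only helps packages that already use UTF-8.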
Author: Martin v. Löwis (loewis) *
Date: 2008-04-07 20:07
> Agreed, but any change will target the package authors who can easily upgrade their packages to use Unicode for e.g. names.
They can't: that would break their 2.5-and-earlier compatibility.
> If the change were to address distutils users, we'd have to be a lot more careful.
We do address distutils users: what else? Why should we be more careful?
> In any case, if UTF-8 is the de facto standard used in older packages, then we should probably use that as a fallback solution if the ASCII assumption doesn't work out:
> try:
>     value = unicode(value)
> except UnicodeDecodeError:
>     value = unicode(value, 'utf-8')
> value = value.encode('utf-8')
For writing the metadata, we don't need to make any assumptions. We can just write the bytes as-is. This is how distutils has behaved for many releases now, and this is how users have been using it.
Of course, we (probably) agree that this is conceptually wrong, as we won't be able to know what the encoding of the metadata file is, and we (probably) also agree that the metadata should have the fixed encoding of UTF-8. However, I don't think we should deliberately break packages before 3.0 (even if they chose to use some other encoding); instead, such packages will silently start doing the right thing with 3.0, when their strings become Unicode strings.
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-04-07 20:22
With "distutils users" I'm referring to people that are told to run "python setup.py install". Changes affecting the way this line behaves need to be carefully considered.
OTOH, when upgrading a package to a new Python version (and distutils version), package authors will have to modify their packages anyway, so it is well possible to ask them to use Unicode strings for meta-information.
Supporting pre-2.6 Python versions is also not much of a problem, since authors could set up the strings in question to be either Unicode or 8-bit strings depending on the Python version.
This change would be really minor (compared to e.g the Py_ssize_t change ;-).
That said, I don't think it's a good idea to make package data more complicated by allowing multiple encodings. The meta-data file should have a fixed pre-defined encoding, preferably UTF-8.
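A setup.py that picks the string type by interpreter version, as suggested above, might look roughly like this. It is a sketch under assumptions: the name author_for is hypothetical, the (2, 6) cut-off is taken from the discussion, and the code is written for Python 3 so it can be run today (on Python 2.5 and earlier the b prefix would have to be dropped).

```python
import sys

# UTF-8 bytes for "Tarek Ziadé"; escapes keep the source file pure ASCII.
AUTHOR_UTF8 = b"Tarek Ziad\xc3\xa9"


def author_for(version_info):
    """Return text on 2.6 and later, the raw 8-bit string before that."""
    if tuple(version_info[:2]) >= (2, 6):
        return AUTHOR_UTF8.decode("utf-8")
    return AUTHOR_UTF8


# The value that would be passed as the author keyword of setup().
author = author_for(sys.version_info)
print(repr(author))
```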
Author: Tarek Ziadé (tarek) *
Date: 2008-04-08 13:26
> For writing the metadata, we don't need to make any assumptions. We can just write the bytes as-is. This is how distutils has behaved for many releases now, and this is how users have been using it.
But write_pkg_file will use ascii encoding if we don't indicate it here:
pkg_info.write('Author: %s\n' % self.get_contact() )
So wouldn't a light fix in write_pkg_file() be sufficient when a unicode(field) fails, as MAL mentioned, by trying utf8?
>>> try:
...     pkg_info.write('Author: %s\n' % self.get_contact() )
... except UnicodeEncodeError:
...     pkg_info.write('Author: %s\n' % self.get_contact().encode('utf8') )
As far as I know, this simple change will not impact people and will just make it possible to use Unicode. And everything will be fine under Py3K as it is now.
But I don't know yet how this would impact third-party software that reads the egg-info file. As MAL said, it will have to be fixed as well.
Author: Martin v. Löwis (loewis) *
Date: 2008-04-08 20:16
> But write_pkg_file will use ascii encoding if we don't indicate it here:
> pkg_info.write('Author: %s\n' % self.get_contact() )
Why do you say that it uses ascii? It uses whatever encoding the string returned by get_contact uses. See the attached P1-1.0.tar.gz for an example. This doesn't use ASCII, and doesn't use UTF-8, and works with 2.4.
> So wouldn't a light fix in write_pkg_file() be sufficient when a unicode(field) fails, as MAL mentioned, by trying utf8?
> >>> try:
> ...     pkg_info.write('Author: %s\n' % self.get_contact() )
> ... except UnicodeEncodeError:
> ...     pkg_info.write('Author: %s\n' % self.get_contact().encode('utf8') )
That would work - although I fail to see what this has to do with a failing unicode(field). Instead, it has rather to do with a failing .write().
Author: Tarek Ziadé (tarek) *
Date: 2008-04-08 20:39
> > pkg_info.write('Author: %s\n' % self.get_contact() )
> Why do you say that it uses ascii? It uses whatever encoding the string returned by get_contact uses. See the attached P1-1.0.tar.gz for an example. This doesn't use ASCII, and doesn't use UTF-8, and works with 2.4.
This happens of course only when get_contact returns a unicode object. It uses the ascii codec by default. Here's an example:
>>> contact = u'Barnabé'
>>> f = open('/tmp/test', 'w')
>>> f.write('Author: %s' % contact)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 14: ordinal not in range(128)
> That would work - although I fail to see what this has to do with a failing unicode(field). Instead, it has rather to do with a failing .write().
Absolutely, I was focusing on write_pkg_file() method that fails when the egg-info file is written.
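The failing write and the proposed fallback can be reproduced in a small Python 3 sketch. Python 3 files do not encode implicitly with ASCII, so the implicit step is emulated here with an explicit .encode("ascii"); the names py2_style_write and write_author are hypothetical and do not come from distutils.

```python
import io


def py2_style_write(stream, text):
    """Write text to a byte stream via ASCII, as a Python 2 file would."""
    stream.write(text.encode("ascii"))


def write_author(stream, contact):
    """Write the Author line, falling back to UTF-8 for non-ASCII text."""
    line = "Author: %s\n" % contact
    try:
        py2_style_write(stream, line)
    except UnicodeEncodeError:
        # Fallback proposed in the thread: encode explicitly as UTF-8.
        stream.write(line.encode("utf-8"))


buf = io.BytesIO()
write_author(buf, "Barnabe")   # ASCII contact takes the first path
write_author(buf, "Barnabé")   # non-ASCII contact takes the fallback
print(buf.getvalue())
```

Because the encoding step fails before anything reaches the stream, the fallback does not leave a partial line behind.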
Author: Tarek Ziadé (tarek) *
Date: 2008-04-20 16:27
I suppose the simplest way to deal with the problem is to force utf-8 encoding for the concerned fields, since this problem will disappear in 3k.
Here's a simplified patch, that does it, so write_pkg_file behaves as expected.
Author: Tarek Ziadé (tarek) *
Date: 2008-05-10 13:35
I think this should also be fixed in 2.5
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-08-25 15:16
Is this still an issue in 2.6 ?
AFAIK, there have been a few changes both to setuptools and PyPI that make it easy to just use Unicode objects in the setup() call for non-ASCII values.
Author: Tarek Ziadé (tarek) *
Date: 2008-08-25 16:05
The problem is in distutils code, not in setuptools or PyPI.
As far as I can see, the problem remains in the trunk. It is dead simple to reproduce: put a unicode name containing a non-ascii character for the author in a plain setup.py (for example my name ;))
Here's an up-to-date patch that includes a test that reproduces the problem.
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-08-25 17:03
Here's an updated patch that applies the same logic to all meta-data fields, instead of just a few. This simplifies the code somewhat.
I've tested it with the test you provided and also with eGenix packages using Unicode author names (ie. my name ;-)).
I guess we need at least one more reviewer to commit this change.
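Applying the same logic to every metadata field could look roughly like the Python 3 sketch below. It is not the attached patch: the names PKG_INFO_ENCODING, write_field and write_pkg_info are assumptions made for illustration, and bytes stands in for the Python 2 str type.

```python
import io

PKG_INFO_ENCODING = "utf-8"  # assumed fixed encoding for the metadata file


def write_field(stream, name, value):
    """Write one 'Name: value' line (hypothetical helper)."""
    if not isinstance(value, bytes):
        # Text and other objects are encoded with the fixed encoding.
        value = str(value).encode(PKG_INFO_ENCODING)
    # Byte strings are written as-is, whatever their encoding.
    stream.write(name.encode("ascii") + b": " + value + b"\n")


def write_pkg_info(stream, fields):
    """Write every (name, value) pair through the same code path."""
    for name, value in fields:
        write_field(stream, name, value)


buf = io.BytesIO()
write_pkg_info(buf, [
    ("Name", "P1"),
    ("Author", "Marc-André Lemburg"),   # text field
    ("Maintainer", b"Caf\xe9"),         # legacy byte string, passed through
])
print(buf.getvalue())
```

Routing all fields through one helper means a non-ASCII summary or maintainer name is handled the same way as a non-ASCII author.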
Author: Tarek Ziadé (tarek) *
Date: 2008-08-26 08:48
ok I will ask for this on the ML
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-09-03 10:48
Removing Python 2.5 from the version list, since the patch may in some cases (e.g. using a different encoding than UTF-8) cause problems with existing setup.py files out there.
The patch is not compatible with Python 3.0 for obvious reasons, but there shouldn't be any issue for Python 3.0 anyway.
Given that no one has volunteered to review the patch in addition to Tarek and myself, I think we're good to go.
Tarek, if you're fine with this, please let me know and I'll check in the patch (together with a note in NEWS).
Author: Tarek Ziadé (tarek) *
Date: 2008-09-03 10:58
Sure, sounds fine to me, thanks for the help on this issue
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-09-03 11:28
Checked in as r66181 on trunk.
Author: Benjamin Peterson (benjamin.peterson) *
Date: 2008-09-08 21:44
Does this need to be merged into py3k? If so, can someone who handled this bug do it? I met a few test failures in my attempt...
Author: Marc-Andre Lemburg (lemburg) *
Date: 2008-09-09 09:37
On 2008-09-08 23:45, Benjamin Peterson wrote:
> Benjamin Peterson <musiccomposition@gmail.com> added the comment:
> Does this need to be merged into py3k? If so, can someone who handled this bug do it. I met a few test failures in my attempt...
As mentioned in the ticket discussion, this does not need to be forward patched to 3.0.
History
Date | User | Action | Args
2022-04-11 14:56:33 | admin | set | github: 46814
2010-11-25 23:56:05 | jwilk | set | nosy: + jwilk
2008-09-09 09:37:01 | lemburg | set | messages: +; title: Cannot use non-ascii letters in disutils if setuptools is used. -> Cannot use non-ascii letters in disutils if setuptools is used.
2008-09-08 21:44:04 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: +
2008-09-03 11:28:59 | lemburg | set | status: open -> closed; messages: +
2008-09-03 10:58:04 | tarek | set | messages: +
2008-09-03 10:48:28 | lemburg | set | messages: +; versions: - Python 2.5
2008-08-26 08:48:48 | tarek | set | messages: +
2008-08-25 17:04:00 | lemburg | set | files: + distutils-unicode-metadata.patch; messages: +
2008-08-25 16:06:11 | tarek | set | files: - distutils.unicode.simplified.patch
2008-08-25 16:06:04 | tarek | set | files: - unicode.metadata.patch
2008-08-25 16:06:01 | tarek | set | files: - unicode.patch
2008-08-25 16:05:45 | tarek | set | files: + distutils.unicode.patch; messages: +
2008-08-25 15:16:56 | lemburg | set | messages: +
2008-08-24 22:28:00 | nnorwitz | set | type: crash -> behavior
2008-05-10 13:35:03 | tarek | set | messages: +; versions: + Python 2.5
2008-04-20 16:27:12 | tarek | set | files: + distutils.unicode.simplified.patch; messages: +
2008-04-12 18:29:44 | georg.brandl | link |
2008-04-08 20:39:50 | tarek | set | messages: +
2008-04-08 20:16:06 | loewis | set | files: + P1-1.0.tar.gz; messages: +
2008-04-08 13:26:16 | tarek | set | messages: +
2008-04-07 20:22:32 | lemburg | set | messages: +
2008-04-07 20:07:40 | loewis | set | messages: +
2008-04-07 19:49:38 | lemburg | set | messages: +
2008-04-07 19:39:26 | loewis | set | messages: +
2008-04-07 19:37:53 | loewis | set | messages: +
2008-04-07 15:51:06 | lemburg | set | nosy: + lemburg; messages: +
2008-04-07 08:17:05 | tarek | set | files: + unicode.metadata.patch; messages: +
2008-04-07 08:06:26 | tarek | set | messages: +
2008-04-06 17:17:38 | loewis | set | messages: +
2008-04-06 17:09:05 | loewis | set | status: closed -> open; resolution: wont fix ->; messages: +; versions: + Python 2.6, - 3rd party
2008-04-06 14:41:30 | tarek | set | messages: +
2008-04-06 14:14:12 | tarek | set | messages: +
2008-04-06 13:59:08 | loewis | set | messages: +
2008-04-06 13:43:33 | tarek | set | messages: +
2008-04-06 13:21:05 | loewis | set | status: open -> closed; resolution: wont fix; messages: +; nosy: + loewis; versions: + 3rd party, - Python 2.6
2008-04-06 10:14:38 | tarek | set | files: + unicode.patch
2008-04-06 10:14:28 | tarek | set | files: - unicode.patch
2008-04-06 09:47:18 | tarek | create |