Issue 2562: Cannot use non-ascii letters in disutils if setuptools is used.

Created on 2008-04-06 09:47 by tarek, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (30)

msg65028 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-06 09:47

If I try to put my name in the Author field as a string, it will break, because distutils assumes that the fields are ASCII-encoded strings: it decodes each one into unicode, then encodes it in utf8 to send the data.

See in distutils.command.register.post_to_server :

value = unicode(value).encode("utf-8")

One way to avoid this error is to provide unicode for all fields, but it will fail further on if setuptools is used, because this other package makes the assumption that the fields are strings::

    self.run_command('egg_info')
    ... distutils/dist.py", line 1047, in write_pkg_info
    pkg_info.write('Author: %s\n' % self.get_contact() )
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 18: ordinal not in range(128)

So I guess distutils shouldn't guess that it receives ascii strings and do a raw unicode() call, and should make the assumption that it receives unicode fields only.

Since many packages out there use strings, I have left a unicode() call in my patch, together with a warning.

test provided.
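
For readers on Python 3, the behaviour described in this message can be approximated as follows. This is only a sketch: `post_value` is a hypothetical name, and `bytes.decode("ascii")` stands in for the implicit default-codec decoding that Python 2's `unicode(value)` performed on byte strings.

```python
# Python 3 approximation of the Python 2 expression
#     value = unicode(value).encode("utf-8")
# post_value is a hypothetical helper, not a distutils function.

def post_value(value):
    if isinstance(value, bytes):
        # Python 2's unicode() decoded byte strings with the default
        # codec, which is ASCII unless someone changed it.
        value = value.decode("ascii")
    return value.encode("utf-8")

print(post_value(b"Tarek Ziade"))       # ASCII bytes are accepted
print(post_value("Tarek Ziad\u00e9"))   # text is encoded as UTF-8

try:
    # UTF-8 encoded bytes with a non-ASCII character are rejected.
    post_value("Tarek Ziad\u00e9".encode("utf-8"))
except UnicodeDecodeError as exc:
    print("decoding failed:", exc.reason)
```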

msg65032 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-04-06 13:21

The official supported way for non-ASCII characters in distutils is to use Unicode strings. If anything else fails, that's not a bug.

IIUC, in this case, it's setuptools that fails, not distutils. Assuming I understood correctly, I'm closing this as won't-fix/3rd party.

msg65033 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-06 13:43

In that case, distutils should not do a unicode() call over each field passed before .encode('utf8') is called, because it makes the assumption that string type can be used.

msg65035 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-04-06 13:59

I don't understand. It is certainly allowed to use byte strings for these data, as long as they are ASCII. The Unicode requirement exists only for non-ASCII characters, and distutils makes explicit, deliberate use of the default encoding here (hoping that nobody changed it away from ASCII).

There are tons of setup.py files out there that use plain byte strings, and there is no reason to break them, e.g. by mandating that the string is Unicode already.

msg65038 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-06 14:14

ok I see what you mean, thanks for the explanation

msg65040 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-06 14:41

oh, hold on, it is more complicated in fact :)

setuptools calls the DistributionMetadata.dist.write_pkg_file() method to write the .egg-info file.

This method makes the assumption that the metadata fields are strings, so it is not setuptools' fault.

This code fails the same way:

dist = Distribution(attrs={'author': u'Mister Café'})
dist.metadata.write_pkg_file(file)

So I guess the patch needs to be done in distutils.dist.DistributionMetadata, so it checks upon the type of field before it runs:

file.write('Author: %s\n' % self.get_contact() )

That is what I meant when I said that distutils should decide whether it works with unicode or str for these fields.

I can re-write a new patch if you agree on this

msg65046 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-04-06 17:09

I agree there is a bug in distutils. Before we proceed, I think distutils-sig needs to be consulted. My proposal would be the one I suggested earlier: all strings should either be Unicode or ASCII-only byte strings. This contradicts the documentation, which says that none of the strings must be Unicode, so it would be an incompatible change (and would indeed likely break packages that currently use UTF-8, and sdist, but never register).

msg65047 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-04-06 17:17

As a follow-up: for compatibility, it might be possible to support either Unicode or arbitrary plain strings in write_pkg_file. In 3k, such support can then be dropped.

As that constitutes a new feature, it shouldn't be applied to 2.5.

msg65069 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-07 08:06

ok, I'll summarize this in distutils-sig sometime today.

If we do use Unicode, I think we might need an extra meta-data, "encoding", that would default to "utf8", and that could be used when the class needs to serialize the data in a file.

msg65070 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-07 08:17

adding a sample patch to show a possible implementation, and to point the problem to people

msg65076 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2008-04-07 15:51

Note that

value = unicode(value).encode("utf-8")

will also work if value is already Unicode, so a backwards compatible fix would be to allow passing in:

for the meta data keyword parameters and then apply unicode() to all the meta-data arguments.

I don't think that we should support non-ASCII encodings for meta-data strings passed to setup().

If setuptools is broken in this respect, it needs to be fixed. Ditto for other 3rd party tools.

msg65102 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-04-07 19:37

> If we do use Unicode, I think we might need an extra meta-data, "encoding", that would default to "utf8", and that could be used when the class needs to serialize the data in a file.

I don't think so. Whenever the data is written to a file, the file format should specify the encoding.

Regards, Martin
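
To illustrate the point that the file format itself can fix the encoding, here is a minimal Python 3 sketch. The constant `PKG_INFO_ENCODING` and the helper names are assumptions made for this example; nothing in this thread says distutils uses them.

```python
import os
import tempfile

# Assumption for this sketch: the metadata file format mandates UTF-8,
# so no separate "encoding" metadata field is needed.
PKG_INFO_ENCODING = "utf-8"

def write_metadata(path, author):
    # The writer always uses the encoding fixed by the format.
    with open(path, "w", encoding=PKG_INFO_ENCODING) as f:
        f.write("Author: %s\n" % author)

def read_metadata(path):
    # The reader uses the same fixed encoding, so it never has to guess.
    with open(path, "r", encoding=PKG_INFO_ENCODING) as f:
        return f.read()

def round_trip(author):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_metadata(path, author)
        return read_metadata(path)
    finally:
        os.remove(path)

# ascii() keeps the output printable on any terminal encoding.
print(ascii(round_trip("Martin v. L\u00f6wis")))
```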

msg65103 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-04-07 19:39

> I don't think that we should support non-ASCII encodings for meta-data strings passed to setup().
>
> If setuptools is broken in this respect, it needs to be fixed. Ditto for other 3rd party tools.

We do need to support non-ASCII files, as distutils didn't previously even support Unicode strings, and people still wanted to get their names right. It's not about setuptools, and not about other 3rd party tools. It's about distutils packages which we need to continue to support.

msg65108 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2008-04-07 19:49

Agreed, but any change will target the package authors who can easily upgrade their packages to use Unicode for e.g. names.

If the change were to address distutils users, we'd have to be a lot more careful.

In any case, if UTF-8 is the defacto standard used in older packages, then we should probably use that as fallback solution if the ASCII assumption doesn't work out:

try:
    value = unicode(value)
except UnicodeDecodeError:
    value = unicode(value, 'utf-8')
value = value.encode('utf-8')
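
The same fallback can be written for Python 3, where byte strings and text are distinct types. This is a sketch under that assumption; `to_utf8` is a hypothetical name, and the explicit ASCII attempt is kept only to mirror the two-step logic above (ASCII is a subset of UTF-8, so a single UTF-8 decode would accept the same inputs).

```python
def to_utf8(value):
    """Return UTF-8 bytes for text, ASCII bytes, or UTF-8 bytes."""
    if isinstance(value, bytes):
        try:
            # First assume plain ASCII, as distutils did.
            value = value.decode("ascii")
        except UnicodeDecodeError:
            # Fall back to UTF-8, the de facto standard in older packages.
            value = value.decode("utf-8")
    return value.encode("utf-8")

print(to_utf8(b"Mister Cafe"))        # ASCII bytes
print(to_utf8(b"Mister Caf\xc3\xa9")) # UTF-8 bytes
print(to_utf8("Mister Caf\u00e9"))    # text

try:
    to_utf8(b"Mister Caf\xe9")        # Latin-1 bytes are still rejected
except UnicodeDecodeError as exc:
    print("decoding failed:", exc.reason)
```

Byte strings in any other encoding still raise an error here, which is the incompatibility discussed in the following messages.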

msg65113 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-04-07 20:07

> Agreed, but any change will target the package authors who can easily upgrade their packages to use Unicode for e.g. names.

They can't: that would break their 2.5-and-earlier compatibility.

> If the change were to address distutils users, we'd have to be a lot more careful.

We do address distutils users: what else? Why should we be more careful?

> In any case, if UTF-8 is the defacto standard used in older packages, then we should probably use that as fallback solution if the ASCII assumption doesn't work out:
>
> try:
>     value = unicode(value)
> except UnicodeDecodeError:
>     value = unicode(value, 'utf-8')
> value = value.encode('utf-8')

For writing the metadata, we don't need to make any assumptions. We can just write the bytes as-is. This is how distutils has behaved for many releases now, and this is how users have been using it.

Of course, we (probably) agree that this is conceptually wrong, as we won't be able to know what the encoding of the metadata file is, and we (probably) also agree that the metadata should have the fixed encoding of UTF-8. However, I don't think we should deliberately break packages before 3.0 (even if they chose to use some other encoding); instead, such packages will silently start doing the right thing with 3.0, when their strings become Unicode strings.

msg65118 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2008-04-07 20:22

With "distutils users" I'm referring to people that are told to run "python setup.py install". Changes affecting the way this line behaves need to be carefully considered.

OTOH, when upgrading a package to a new Python version (and distutils version), package authors will have to modify their packages anyway, so it is well possible to ask them to use Unicode strings for meta-information.

Supporting pre-2.6 Python versions is also not much of a problem, since authors could set up the strings in question to be either Unicode or 8-bit strings depending on the Python version.

This change would be really minor (compared to e.g. the Py_ssize_t change ;-).

That said, I don't think it's a good idea to make package data more complicated by allowing multiple encodings. The meta-data file should have a fixed pre-defined encoding, preferably UTF-8.
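
A version-dependent setup string of the kind suggested above might look like the following sketch. The helper name `author_for` and the (2, 6) cut-off are assumptions taken from the "pre-2.6" wording in this message, not from any actual setup.py.

```python
import sys

AUTHOR = u"Tarek Ziad\u00e9"

def author_for(version_info):
    """Pick the author value to pass to setup() for a given Python version."""
    if version_info >= (2, 6):
        # Newer distutils: pass the Unicode string directly.
        return AUTHOR
    # Older distutils: pass an 8-bit string, here encoded as UTF-8.
    return AUTHOR.encode("utf-8")

author = author_for(sys.version_info[:2])
print(type(author).__name__)
```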

msg65158 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-08 13:26

> For writing the metadata, we don't need to make any assumptions. We can just write the bytes as-is. This is how distutils has behaved for many releases now, and this is how users have been using it.

But write_pkg_file will use ascii encoding if we don't indicate it here:

pkg_info.write('Author: %s\n' % self.get_contact() )

So wouldn't a light fix in write_pkg_file() be sufficient when a unicode(field) fails, as MAL mentioned? By trying utf8:

try:
    pkg_info.write('Author: %s\n' % self.get_contact() )
except UnicodeEncodeError:
    pkg_info.write('Author: %s\n' % self.get_contact().encode('utf8') )

As far as I know, this simple change will not impact people and will just make it possible to use Unicode. And everything will be fine under Py3K as it is now.

But I don't know yet how this would impact third-party software that reads the egg-info file. But like MAL said, it will have to get fixed as well.
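
The proposed fallback can be mimicked in Python 3 by making the implicit ASCII encoding explicit. This is a sketch only: `write_author` is a hypothetical helper, and the explicit `.encode("ascii")` stands in for what Python 2's file.write() did when handed a unicode object.

```python
import io

def write_author(stream, contact):
    """Write an Author line to a binary stream, falling back to UTF-8."""
    line = "Author: %s\n" % contact
    try:
        # Stand-in for Python 2's implicit ASCII encoding on write().
        data = line.encode("ascii")
    except UnicodeEncodeError:
        # Fallback proposed above: encode the value as UTF-8 instead.
        data = line.encode("utf-8")
    stream.write(data)

buf = io.BytesIO()
write_author(buf, "Barnab\u00e9")
print(buf.getvalue())
```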

msg65213 - (view)

Author: Martin v. Löwis (loewis) * (Python committer)

Date: 2008-04-08 20:16

> But write_pkg_file will use ascii encoding if we don't indicate it here:
>
> pkg_info.write('Author: %s\n' % self.get_contact() )

Why do you say that it uses ascii? It uses whatever encoding the string returned by get_contact uses. See the attached P1-1.0.tar.gz for an example. This doesn't use ASCII, and doesn't use UTF-8, and works with 2.4.

> So wouldn't a light fix in write_pkg_file() be sufficient when a unicode(field) fails, as MAL mentioned? By trying utf8:
>
> try:
>     pkg_info.write('Author: %s\n' % self.get_contact() )
> except UnicodeEncodeError:
>     pkg_info.write('Author: %s\n' % self.get_contact().encode('utf8') )

That would work - although I fail to see what this has to do with a failing unicode(field). Instead, it has rather to do with a failing .write().

msg65214 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-08 20:39

> pkg_info.write('Author: %s\n' % self.get_contact() )
>
> Why do you say that it uses ascii? It uses whatever encoding the string returned by get_contact uses. See the attached P1-1.0.tar.gz for an example. This doesn't use ASCII, and doesn't use UTF-8, and works with 2.4.

This happens of course only when get_contact returns a unicode object. The write then uses the ascii codec by default. Here's an example:

contact = u'Barnabé'
f = open('/tmp/test', 'w')
f.write('Author: %s' % contact)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 14: ordinal not in range(128)

> That would work - although I fail to see what this has to do with a failing unicode(field). Instead, it has rather to do with a failing .write().

Absolutely, I was focusing on write_pkg_file() method that fails when the egg-info file is written.

msg65650 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-04-20 16:27

I suppose the simplest way to deal with the problem is to force utf-8 encoding for the concerned fields, since this problem will disappear in 3k.

Here's a simplified patch, that does it, so write_pkg_file behaves as expected.

msg66518 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-05-10 13:35

I think this should also be fixed in 2.5

msg71936 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2008-08-25 15:16

Is this still an issue in 2.6 ?

AFAIK, there have been a few changes both to setuptools and PyPI that make it easy to just use Unicode objects in the setup() call for non-ASCII values.

msg71943 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-08-25 16:05

The problem is in distutils code, not in setuptools or PyPI.

As far as I can see, the problem remains in the trunk. It is dead simple to reproduce: put a unicode name with a non-ASCII character for the author in a plain setup.py. (for example my name ;))

Here's an up-to-date patch that includes a test that reproduces the problem.

msg71944 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2008-08-25 17:03

Here's an updated patch that applies the same logic to all meta-data fields, instead of just a few. This simplifies the code somewhat.

I've tested it with the test you provided and also with eGenix packages using Unicode author names (ie. my name ;-)).

I guess we need at least one more reviewer to commit this change.
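
The idea of applying one rule to every meta-data field can be sketched in Python 3 as below. This is not the attached patch: `encode_field`, `write_pkg_file` as a free function, and `PKG_INFO_ENCODING` are names assumed for illustration, and the field list is made up.

```python
import io

# Assumption for this sketch: the metadata file is always UTF-8.
PKG_INFO_ENCODING = "utf-8"

def encode_field(value):
    """Apply the same rule to every field value."""
    if isinstance(value, bytes):
        # Byte strings are written as-is, as older packages expect.
        return value
    # Text (and anything else, via str()) is encoded with the fixed encoding.
    return str(value).encode(PKG_INFO_ENCODING)

def write_pkg_file(stream, fields):
    # One code path for all fields instead of special-casing a few.
    for name, value in fields:
        stream.write(name.encode("ascii") + b": " + encode_field(value) + b"\n")

buf = io.BytesIO()
write_pkg_file(buf, [
    ("Name", "example"),
    ("Version", 1.0),
    ("Author", "Mister Caf\u00e9"),
    ("Maintainer", b"plain bytes"),
])
print(buf.getvalue())
```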

msg71971 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-08-26 08:48

ok I will ask for this on the ML

msg72383 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2008-09-03 10:48

Removing Python 2.5 from the version list, since the patch may in some cases (e.g. using a different encoding than UTF-8) cause problems with existing setup.py files out there.

The patch is not compatible with Python 3.0 for obvious reasons, but there shouldn't be any issue for Python 3.0 anyway.

Given that no one has volunteered to review the patch in addition to Tarek and myself, I think we're good to go.

Tarek, if you're fine with this, please let me know and I'll check in the patch (together with a note in NEWS).

msg72384 - (view)

Author: Tarek Ziadé (tarek) * (Python committer)

Date: 2008-09-03 10:58

Sure, sounds fine to me, thanks for the help on this issue

msg72385 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2008-09-03 11:28

Checked in as r66181 on trunk.

msg72790 - (view)

Author: Benjamin Peterson (benjamin.peterson) * (Python committer)

Date: 2008-09-08 21:44

Does this need to be merged into py3k? If so, can someone who handled this bug do it. I met a few test failures in my attempt...

msg72834 - (view)

Author: Marc-Andre Lemburg (lemburg) * (Python committer)

Date: 2008-09-09 09:37

On 2008-09-08 23:45, Benjamin Peterson wrote:

> Benjamin Peterson <musiccomposition@gmail.com> added the comment:
>
> Does this need to be merged into py3k? If so, can someone who handled this bug do it. I met a few test failures in my attempt...

As mentioned in the ticket discussion, this does not need to be forward patched to 3.0.

History

Date | User | Action | Args
2022-04-11 14:56:33 | admin | set | github: 46814
2010-11-25 23:56:05 | jwilk | set | nosy: + jwilk
2008-09-09 09:37:01 | lemburg | set | messages: +; title: Cannot use non-ascii letters in disutils if setuptools is used. -> Cannot use non-ascii letters in disutils if setuptools is used.
2008-09-08 21:44:04 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: +
2008-09-03 11:28:59 | lemburg | set | status: open -> closed; messages: +
2008-09-03 10:58:04 | tarek | set | messages: +
2008-09-03 10:48:28 | lemburg | set | messages: +; versions: - Python 2.5
2008-08-26 08:48:48 | tarek | set | messages: +
2008-08-25 17:04:00 | lemburg | set | files: + distutils-unicode-metadata.patch; messages: +
2008-08-25 16:06:11 | tarek | set | files: - distutils.unicode.simplified.patch
2008-08-25 16:06:04 | tarek | set | files: - unicode.metadata.patch
2008-08-25 16:06:01 | tarek | set | files: - unicode.patch
2008-08-25 16:05:45 | tarek | set | files: + distutils.unicode.patch; messages: +
2008-08-25 15:16:56 | lemburg | set | messages: +
2008-08-24 22:28:00 | nnorwitz | set | type: crash -> behavior
2008-05-10 13:35:03 | tarek | set | messages: +; versions: + Python 2.5
2008-04-20 16:27:12 | tarek | set | files: + distutils.unicode.simplified.patch; messages: +
2008-04-12 18:29:44 | georg.brandl | link | issue1721241 superseder
2008-04-08 20:39:50 | tarek | set | messages: +
2008-04-08 20:16:06 | loewis | set | files: + P1-1.0.tar.gz; messages: +
2008-04-08 13:26:16 | tarek | set | messages: +
2008-04-07 20:22:32 | lemburg | set | messages: +
2008-04-07 20:07:40 | loewis | set | messages: +
2008-04-07 19:49:38 | lemburg | set | messages: +
2008-04-07 19:39:26 | loewis | set | messages: +
2008-04-07 19:37:53 | loewis | set | messages: +
2008-04-07 15:51:06 | lemburg | set | nosy: + lemburg; messages: +
2008-04-07 08:17:05 | tarek | set | files: + unicode.metadata.patch; messages: +
2008-04-07 08:06:26 | tarek | set | messages: +
2008-04-06 17:17:38 | loewis | set | messages: +
2008-04-06 17:09:05 | loewis | set | status: closed -> open; resolution: wont fix -> ; messages: +; versions: + Python 2.6, - 3rd party
2008-04-06 14:41:30 | tarek | set | messages: +
2008-04-06 14:14:12 | tarek | set | messages: +
2008-04-06 13:59:08 | loewis | set | messages: +
2008-04-06 13:43:33 | tarek | set | messages: +
2008-04-06 13:21:05 | loewis | set | status: open -> closed; resolution: wont fix; messages: +; nosy: + loewis; versions: + 3rd party, - Python 2.6
2008-04-06 10:14:38 | tarek | set | files: + unicode.patch
2008-04-06 10:14:28 | tarek | set | files: - unicode.patch
2008-04-06 09:47:18 | tarek | create |