Message 97535 - Python tracker (original) (raw)

Amaury Forgeot d'Arc wrote:

Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

I don't see the point in changing the various conversion APIs in the unicode database to return Py_UCS4 when there are no conversions that map code points between BMP and non-BMP.

For consistency: if Py_UNICODE_ISPRINTABLE is changed to take Py_UCS4, Py_UNICODE_TOLOWER should also take Py_UCS4, and must return the same type.

In order to solve the problem in question (unicode_repr() failing), we should change the various property checking APIs to accept Py_UCS4 input data. This needlessly increases the type database size without real benefit. [I'm not sure to understand. For me the 'real benefit' is that it solves the problem in question.]

The problem in question is already solved by just changing the property checking APIs. Changing the conversion APIs fixes a non-problem, since there are no mappings that would require Py_UCS4 on a UCS2 build.

Yes this increases the type database: there are 300 more "case" statements in _PyUnicode_ToNumeric(), and the PyUnicode_TypeRecords array needs 1068 more bytes. On Windows, VS9.0 release build, unicodectype.obj grows from 86Kb to 94Kb; python32.dll is exactly 1.5Kb larger (from 2219Kb to 2221.5Kb); the memory usage of the just-started interpreter is about 32K larger (around 5M). These look reasonable figures to me.

For that to work properly we'll have to either make sure that extensions get recompiled if they use these changed APIs, or we provide an additional set of UCS2 APIs that extend the Py_UNICODE input value to a Py_UCS4 value before calling the underlying Py_UCS4 API.

Extensions that use these changed APIs need to be recompiled, or they won't load: existing modules link with symbols like _PyUnicodeUCS2_IsPrintable, when the future interpreter will define _PyUnicode_IsPrintable.

Hmm, that's a good point.

OK, you got me convinced: let's go for it then.