Use sys.maxunicode and MAX_UNICODE constants by vstinner · Pull Request #15067 · python/cpython



Conversation


vstinner

Modify unicodeobject.c and test_unicode.py to use MAX_UNICODE and
sys.maxunicode constants, rather than hardcoded constants.


@vstinner

This change should enhance the readability of the code.

It also makes Python a little more future-proof, in case the Unicode Consortium ever extends the Unicode character set beyond 0x10FFFF.

This change is a follow-up of a discussion in PR #15019 about sys.maxunicode+1 vs 0x110000.
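For illustration, here is a small sketch of the Python-level constant the PR builds on. On all CPython builds since Python 3.3 (PEP 393), `sys.maxunicode` is fixed at 0x10FFFF, and `chr()` accepts exactly that range:

```python
import sys

# The highest code point Python can represent: 0x10FFFF on every
# build since Python 3.3 (PEP 393 removed narrow builds).
print(hex(sys.maxunicode))  # 0x10ffff

# chr() accepts exactly the range [0, sys.maxunicode].
assert chr(sys.maxunicode) == "\U0010ffff"
```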

@vstinner

0x10ffff and 0x110000 constants are still used in other files, but I chose to only modify files directly related to the implementation of Unicode.

$ grep -l -i -E '10ffff|110000' $(find -name "*.c" -o -name "*.h")
./Include/cpython/unicodeobject.h
./Include/unicodeobject.h
./Modules/_io/winconsoleio.c
./Modules/cjkcodecs/_codecs_cn.c
./Modules/expat/xmltok_impl.c
./Modules/expat/xmltok.c
./Modules/unicodedata.c
./Modules/unicodedata_db.h
./Objects/stringlib/ucs4lib.h
./Objects/stringlib/find_max_char.h
./Objects/stringlib/codecs.h
./Objects/unicodectype.c
./Objects/unicodeobject.c
./Python/clinic/bltinmodule.c.h
./Python/formatter_unicode.c
./Python/bltinmodule.c
./Python/sysmodule.c

Modifying the other files would require exposing MAX_UNICODE in a header file.

@vstinner

methane

serhiy-storchaka


I am not sure there is any benefit to this change. The limit 0x10FFFF is determined by the UTF-16 codec. It is very unlikely to change in the next century.

Some occurrences of 0x10FFFF in the tests denote not the maximal code point, but an arbitrary UCS4 code point.
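As a quick sketch of why UTF-16 pins the limit at 0x10FFFF: a surrogate pair contributes 10 bits from the high surrogate and 10 bits from the low surrogate, added on top of U+10000:

```python
# UTF-16 encodes code points above U+FFFF as a surrogate pair:
# high surrogate (U+D800-U+DBFF) carries 10 bits, low surrogate
# (U+DC00-U+DFFF) carries another 10 bits, offset from U+10000.
HIGH_BITS = 0x3FF  # ten payload bits from the high surrogate
LOW_BITS = 0x3FF   # ten payload bits from the low surrogate

max_pairable = 0x10000 + ((HIGH_BITS << 10) | LOW_BITS)
assert max_pairable == 0x10FFFF  # hence sys.maxunicode / MAX_UNICODE
```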

@@ -220,10 +220,10 @@ STRINGLIB(utf8_decode)(const char **inptr, const char *end,
}
ch = (ch << 18) + (ch2 << 12) + (ch3 << 6) + ch4 -
((0xF0 << 18) + (0x80 << 12) + (0x80 << 6) + 0x80);
-        assert ((ch > 0xFFFF) && (ch <= 0x10FFFF));
+        assert ((ch > 0xFFFF) && (ch <= MAX_UNICODE));


Note that the current UTF-8 codec would not work for MAX_UNICODE > 0x001FFFFF.


That's the UTF-8 decoder: it has to ensure that it doesn't produce a code point not supported by Python. (ch <= MAX_UNICODE) is the correct test here.
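To make the two observations above concrete: a 4-byte UTF-8 sequence carries 3 + 6 + 6 + 6 = 21 payload bits, so the raw encoding scheme tops out at 0x1FFFFF, which is why the decoder's explicit `ch <= MAX_UNICODE` check is what actually enforces the 0x10FFFF limit:

```python
# Maximum value representable by a 4-byte UTF-8 sequence:
# lead byte 0xF7 contributes 3 bits, each continuation byte 6 bits.
utf8_4byte_max = (0x07 << 18) | (0x3F << 12) | (0x3F << 6) | 0x3F

assert utf8_4byte_max == 0x1FFFFF
# The raw bit layout exceeds Unicode's limit, so the decoder must
# reject code points above 0x10FFFF (MAX_UNICODE) explicitly.
assert utf8_4byte_max > 0x10FFFF
```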

@vstinner

@serhiy-storchaka: I addressed your comments; would you mind reviewing the updated PR?

gnprice


(I've only looked at the test changes; I may come back to look at the rest later.)

invalid_char = 0x10ffff+1
for before in "a\xe9\u20ac\U0010ffff":
mapping = str.maketrans({before: invalid_char})
for before in FULL_UCS_CHARS:


This reads to me like "all characters in UCS" or "in Unicode". In fact I assumed it meant that (here and in similar loops below), and was a bit surprised you were making all these loops so comprehensive 🙂, before I scrolled back up to the top and read their definitions.

How about putting something like "EXAMPLE" in the names? That's what they really are -- arbitrary lists of example characters for testing.

self.assertEqual('axb'.translate(str.maketrans({'a': None, 'b': '123'})),
"x123")
self.assertEqual('axb'.translate(str.maketrans({'a': None, 'b': '\xe9'})),
"x\xe9")
for ch in UCS1_ONLY_CHARS:


Is this really specific to characters up to U+00FF? If not, better to just use the full list FULL_UCS_CHARS.

# Test characters of the Basic Multilingual Plane (U+0000-U+ffff)
BMP_CHARS = UCS1_CHARS + ("\u0100", UCS2_CHAR, UCS2_LAST_CHAR)
# Test characters of the full Unicode Character Set (U+0000-U+10ffff)
FULL_UCS_CHARS = BMP_CHARS + ('\U00010000', UCS4_CHAR, UCS4_LAST_CHAR)


It looks like you don't end up using most of these sequences (except to define the next ones) -- mostly the all-inclusive one FULL_UCS_CHARS, and in one place UCS1_ONLY_CHARS (but perhaps that too should be the full one, see below.)

Which makes sense -- for basically all of the tests that use these sequences, the desired behavior is something like "it works correctly for every codepoint", and Python code shouldn't have to really know or care whether it fits in the BMP or ASCII or whatever.

In that case I think it would be clearest to just write one sequence. Combined with my naming suggestion below, that might look like

# Test characters spanning the full range.
EXAMPLE_CHARS = (
    "\x00", EXAMPLE_ASCII_CHAR, ASCII_LAST_CHAR,
    "\x80", EXAMPLE_UCS1_CHAR, UCS1_LAST_CHAR,
# ...
)
UCS2_CHAR = "\u20ac" # euro sign
UCS2_LAST_CHAR = "\uffff"
# Characters in UCS4 ("Astral") range (U+0000-U+FFFF),
# but not in the UCS2 range


"Astral" means specifically "not the BMP", so this isn't quite right. (Hm, also it should say "U+10FFFF".)

Perhaps just # Characters outside the BMP ("astral" characters).

UCS1_CHAR = "\xe9"
UCS1_LAST_CHAR = "\xff"
# Characters in UCS2 ("BMP") range (U+0000-U+FFFF),
# but not in the UCS1 range


In the normative Unicode jargon, UCS-2 isn't a set of characters -- it's an encoding. This range of characters (*) is called the BMP ("Basic Multilingual Plane"). UCS-2 is a partial encoding, which only works for the BMP.

See https://unicode.org/glossary/

This is all really about characters (or code points), not encodings. So I think the talk of UCS-2 is confusing here -- it's best to just say "BMP", which is the proper name of this range of characters.

(*) in fact of code points, but that doesn't matter here
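A small sketch of the distinction being drawn here: UCS-2 is a fixed-width 16-bit encoding, so it can express any BMP character but has no representation at all for astral (non-BMP) characters, which UTF-16 handles with surrogate pairs:

```python
# A BMP character fits in a single 16-bit code unit (2 bytes)...
assert len("\u20ac".encode("utf-16-be")) == 2  # U+20AC, euro sign

# ...while an astral (non-BMP) character needs a surrogate pair
# (4 bytes) in UTF-16 -- exactly what UCS-2 cannot express.
assert len("\U00010000".encode("utf-16-be")) == 4
```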

@@ -1355,7 +1390,7 @@ def test_formatting(self):
self.assertEqual('%c' % 0x1234, '\u1234')
self.assertEqual('%c' % 0x21483, '\U00021483')
-        self.assertRaises(OverflowError, "%c".__mod__, (0x110000,))
+        self.assertRaises(OverflowError, "%c".__mod__, (INVALID_CODEPOINT,))


FIRST_INVALID_CODEPOINT would I think capture the intent more precisely. (This is similar to a remark @serhiy-storchaka made elsewhere.)
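For reference, the boundary this test exercises can be sketched as follows — 0x10FFFF is the last valid code point, and both `chr()` and `'%c'` formatting reject the value one past it (as the quoted `assertRaises` line shows for `'%c'`):

```python
# The last valid code point is accepted...
assert chr(0x10FFFF) == "\U0010ffff"

# ...while the first invalid one, 0x110000, is rejected:
# chr() raises ValueError, '%c' formatting raises OverflowError.
try:
    chr(0x110000)
except ValueError:
    pass

try:
    "%c" % 0x110000
except OverflowError:
    pass
```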

@serhiy-storchaka

A few years ago I tried to replace all occurrences of 0x10ffff and 0x110000 with a named constant, but abandoned that change because I did not see a clear benefit from it. I'm still not sure that it's worth doing.

@gnprice

A few years ago I tried to replace all occurrences of 0x10ffff and 0x110000 with a named constant, but abandoned that change because I did not see a clear benefit from it. I'm still not sure that it's worth doing.

Victor has already heard this from me (on #15019 ), but FWIW my view is similar -- I think in the bulk of these examples the code is clearer with the explicit 0x10ffff or 0x110000 rather than MAX_UNICODE.

One reason is that in a lot of these there are other constants around like 0xff and 0xffff, and understanding the code involves seeing the relationships between them. When some or all of them are behind names like MAX_UNICODE, that adds an indirection for the reader to make.

If these lines are going to use a named constant, though, these changes look good to me as a way to do it.

And in the test code, I think giving names to the sequences of example characters is a helpful direction. Particularly because that makes it easy to make more of the tests test more combinations, as this change already does.

@vstinner

I failed to convince anyone with this change, so I'm abandoning it ;-)