gh-98836: Extend PyUnicode_FromFormat() by serhiy-storchaka · Pull Request #98838 · python/cpython (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation19 Commits15 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

Support for conversion specifiers o (octal) and X (uppercase hexadecimal).
Support for length modifiers j (intmax_t) and t (ptrdiff_t).
Length modifiers are now applied to all integer conversions.
Support for wchar_t C strings (%ls and %lV).
Support for variable width and precision (*).
Support for flag - (left alignment).
Issue: Extend PyUnicode_FromFormat() #98836

Support for conversion specifiers o (octal) and X (uppercase hexadecimal).
Support for length modifiers j (intmax_t) and t (ptrdiff_t).
Length modifiers are now applied to all integer conversions.
Support for wchar_t C strings (%ls and %lV).
Support for variable width and precision (*).
Support for flag - (left alignment).

:c:func:`PyUnicode_FromFormatV` now sets a :exc:`SystemError`.
In previous versions it caused all the rest of the format string to be
copied as-is to the result string, and any extra arguments discarded.
(Contributed by Serhiy Storchaka in :gh:`95781`.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this paragraph was duplicated by mistake?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True.

@@ -2637,7 +2637,8 @@ def test_from_format(self):
c_char_p,
pythonapi, py_object, sizeof,
c_int, c_long, c_longlong, c_ssize_t,
c_uint, c_ulong, c_ulonglong, c_size_t, c_void_p)
c_uint, c_ulong, c_ulonglong, c_size_t, c_void_p,
sizeof, c_wchar, c_wchar_p)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please a comment somewhere to list formats which are not tested and the reason why?

I see at least %j and %t which are not tested. Please mention that _testcapi.test_string_from_format() has a wider coverage of all formats, and test these formats.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

check_format(' 0000123', b'%10.7i', c_int(123))
check_format('0000000123', b'%010.7i', c_int(123))
check_format('0000123 ', b'%-10.7i', c_int(123))
check_format('0000123 ', b'%-010.7i', c_int(123))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to generate these tests? The code is similar for the 6 groups of tests.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not so similar. It is different for signed and unsigned types, but I'll try to generalize it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just suggested to generate these tests. It's ok to leave them as they are if it's too complicated to generate them.

b'%V', None, b'xyz')

# test %ls
check_format('abc', b'%ls', c_wchar_p('abc'))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can please add one non-ASCII character in wchar tests, %ls and %lV? In addition to tests truncating to 2 wchar_t.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? Isn't the following line enough?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking at adding a test with non-ASCII characters which fit into UCS-2 (16-bit wchar_t) characters. Something like:

check_format('a\xe9\u20ac', b'%ls', c_wchar_p('a\xe9\u20ac'))

Some bugs are only triggered depending on the maximum code point, and wchar_t code can take different code path depending if there is a surrogate character or not.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -2880,10 +2989,11 @@ def check_format(expected, format, *args):
# check for crashes

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is misleading. Python doesn't crash but reject invalid format string with a SyntaxError. Would you mind to rephrase the comment? Like: # test invalid format strings?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Restored the part of the older comment.

| | | | :c:func:`PyObject_Repr`. | | | ------------------------------------------------------------------------------ | | --------------------------- | | | +-------------------+---------------------+----------------------------------+ | | | | | ASCII-encoded string. | | | | | | | | |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add something like [[fill]align][sign][z][#][0][width][grouping_option][.precision][type]? Example taken from: https://docs.python.org/dev/library/string.html#format-specification-mini-language

The printf format is complex, and it's not obvious in which order each part should be written. I understand that it's something like:

%[conversion flags][minimum width][.precision][length modifier][conversion type]

Or if you prefer a shorter version:

%[conversion][width][.precision][modifier][conversion]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

check_format(' 0000123', b'%10.7i', c_int(123))
check_format('0000000123', b'%010.7i', c_int(123))
check_format('0000123 ', b'%-10.7i', c_int(123))
check_format('0000123 ', b'%-010.7i', c_int(123))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just suggested to generate these tests. It's ok to leave them as they are if it's too complicated to generate them.

# Length modifiers "j" and "t" are not tested here because ctypes does
# not expose types for intmax_t and ptrdiff_t.
# _testcapi.test_string_from_format() has a wider coverage of all
# formats.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this comment is useful to me at least :-)

b'%V', None, b'xyz')

# test %ls
check_format('abc', b'%ls', c_wchar_p('abc'))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking at adding a test with non-ASCII characters which fit into UCS-2 (16-bit wchar_t) characters. Something like:

check_format('a\xe9\u20ac', b'%ls', c_wchar_p('a\xe9\u20ac'))

Some bugs are only triggered depending on the maximum code point, and wchar_t code can take different code path depending if there is a surrogate character or not.

Feel free to ignore my suggestions, the PR LGTM.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.