What do you think of deprecating C APIs to modify immutable strings?
vstinner (Victor Stinner) October 8, 2024, 2:48pm 1
Hi,
The Python C API allows modifying immutable strings. This is a common pattern for creating new strings. Examples of such functions:
PyUnicode_New()
PyUnicode_FromStringAndSize(NULL, size)
PyUnicode_Resize()
PyUnicode_WriteChar()
PyUnicode_WRITE()
PyUnicode_CopyCharacters()
PyUnicode_Fill()
The problem is that PyUnicode_New() is designed for PEP 393: it requires a “maximum character”. If Python switches to UTF-8 internally tomorrow, computing the maximum character becomes pure overhead, since a UTF-8 representation has no use for it. By the way, PyPy already faces this problem today since it uses UTF-8 internally. PyPy developers asked me a long time ago to get rid of the PEP 393 C APIs. We should try to hide these implementation details.
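To make the “maximum character” point concrete, here is a minimal sketch that observes PEP 393 from the Python side. It assumes CPython (it relies on `sys.getsizeof`, which is an implementation detail and may be unavailable or behave differently elsewhere, for example on PyPy), and `bytes_per_char` is a hypothetical helper name used only for illustration.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate how many bytes CPython stores per code point for a string
    made only of `ch`, by comparing two strings of different lengths."""
    short = ch * 1000
    long = ch * 2000
    # The fixed object header cancels out; only the character data differs.
    return (sys.getsizeof(long) - sys.getsizeof(short)) // 1000


if __name__ == "__main__":
    samples = [
        ("ASCII", "a"),
        ("Latin-1", "\xe9"),
        ("BMP", "\u20ac"),
        ("non-BMP", "\U0001f600"),
    ]
    for label, ch in samples:
        print(label, bytes_per_char(ch))
```

On CPython 3.3 or later this should report 1, 1, 2 and 4 bytes per code point: the widest character in the string decides the storage width. That is why PyUnicode_New() needs the maximum character up front, and why the parameter carries no useful information for a UTF-8 based implementation.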
Python 3.14 has a new PyUnicodeWriter C API which avoids writing into immutable strings. It’s available on Python 3.6-3.13 using the pythoncapi-compat project.
What do you think of deprecating C APIs which modify immutable strings? I don’t think that the PyUnicodeWriter API is complete enough; we might need to add other APIs to create strings. These APIs have to be designed.
I don’t know the performance cost. PyUnicodeWriter was designed with performance in mind. It can overallocate its internal buffer if needed, for example.
Victor
storchaka (Serhiy Storchaka) October 8, 2024, 3:44pm 2
How would you implement PyUnicodeWriter without such an API? Or other high-performance code, like str.replace, decoders, etc.?
pitrou (Antoine Pitrou) October 8, 2024, 4:00pm 3
- What is the performance story for PyUnicodeWriter vs. the legacy APIs you want to deprecate?
- Have you tried to migrate the stdlib to PyUnicodeWriter to validate the approach?
encukou (Petr Viktorin) October 9, 2024, 8:41am 4
I’d soft-deprecate them and add them to PEP 743.
Since the replacement was just added, I’d rather wait a release before making such an important API raise deprecation warnings.
vstinner (Victor Stinner) October 9, 2024, 8:54am 5
To be honest, I don’t know.
I feel like there is more and more pressure on using these functions versus willingness to change Unicode internals. So we should think about replacement APIs to hide implementation details.
I don’t know.
I’m not sure that the stdlib is a good candidate, since we like to abuse internals to get the best performance. Using PyUnicodeWriter in the stdlib extensions would only be acceptable if there is no performance overhead.
pitrou (Antoine Pitrou) October 9, 2024, 12:01pm 6
Doesn’t it precisely make the stdlib a good testing ground to check that PyUnicodeWriter can be a complete replacement for the legacy APIs?
Intuitively, I see two possible problems with the PyUnicodeWriter API:
- It seems that PyUnicodeWriter_Create/PyUnicodeWriter_Finish add a malloc/free pair in addition to the actual PyUnicodeObject allocation. This might be eliminated using clever tricks, though.
- The presizing/overallocation behavior is not documented. Even the length parameter to PyUnicodeWriter_Create isn’t documented (is it a number of code points? a number of UTF-8 bytes?).
An additional concern, perhaps temporary, is that PyUnicodeWriter is not part of the limited API (yet?).
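The ambiguity about the length parameter matters because the two candidate units disagree as soon as a string contains non-ASCII characters. The snippet below only illustrates that gap from Python; it does not say which unit PyUnicodeWriter_Create actually uses, and the sample text is arbitrary.

```python
# "héllo €": five ASCII characters, one space, one Latin-1 letter, one BMP symbol.
text = "h\u00e9llo \u20ac"

n_code_points = len(text)                 # number of Unicode code points
n_utf8_bytes = len(text.encode("utf-8"))  # number of bytes once encoded as UTF-8

print(n_code_points, n_utf8_bytes)
```

Here the string has 7 code points but needs 10 bytes in UTF-8, so a caller who presizes a buffer must know which of the two the API expects.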
vstinner (Victor Stinner) October 10, 2024, 9:34am 7
I created issue gh-125196 to use the public PyUnicodeWriter API in the stdlib.
da-woods (Da Woods) October 10, 2024, 8:45pm 8
The only place where Cython uses these that wouldn’t be easy with the PyUnicodeWriter API is its attempt to optimize in-place addition of unicode strings:
s = ""
for other_string in list_of_strings:
    s += other_string
It does basically the same thing that Python does internally: if the reference count is 1, it tries to resize the string rather than creating a new object.
Obviously that isn’t good code, but it’s nice to be able to optimize it.
In principle, it should be possible (maybe easier?) to keep doing the same thing in a UTF-8 world. But I don’t think it’s easily expressed with PyUnicodeWriter in anything but the simplest cases.
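For comparison, here is the loop from this post next to the form usually recommended in Python code. Both function names are hypothetical and exist only for this sketch. The first version relies on the runtime (or Cython) resizing the string in place when it can; the second never needs that optimization.

```python
def concat_in_place(strings):
    """Repeated += on a str: fast only if the runtime can resize in place."""
    s = ""
    for other_string in strings:
        s += other_string
    return s


def concat_join(strings):
    """Single pass that builds the result once, with no in-place resize needed."""
    return "".join(strings)


if __name__ == "__main__":
    parts = ["abc", "\u00e9", "\u20ac", "xyz"]
    print(concat_in_place(parts) == concat_join(parts))
```

The two produce the same string; the difference is only in how much the implementation has to do behind the scenes to keep the first one from becoming quadratic.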
malemburg (Marc-André Lemburg) October 11, 2024, 7:49am 9
Moving to the new PyUnicodeWriter API internally is a good idea, provided the performance stays the same, but I don’t think we’ll be able to deprecate the mentioned C API for quite a while: the basic idea “allocate, fill in data, then do a final resize” has been a common approach for strings in Python since the beginning, so people using the Python C API will have internalized it.
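As a rough analogy for the “allocate, fill in data, then do a final resize” pattern, here is a sketch in pure Python using a bytearray as the mutable buffer. It is not how the C API functions are implemented, and `escape_bytes` is a hypothetical example function: it allocates for the worst case, writes at a moving position, then trims the buffer to the part actually used.

```python
def escape_bytes(data: bytes) -> str:
    """Render bytes as ASCII text, writing non-printable bytes as \\xNN."""
    # 1. Allocate for the worst case: every input byte expands to 4 characters.
    buf = bytearray(4 * len(data))
    pos = 0
    # 2. Fill in data at the current write position.
    for b in data:
        if 0x20 <= b < 0x7F and b != 0x5C:  # printable ASCII except backslash
            buf[pos] = b
            pos += 1
        else:
            buf[pos:pos + 4] = b"\\x%02x" % b
            pos += 4
    # 3. Final resize: drop the unused tail of the buffer.
    del buf[pos:]
    return buf.decode("ascii")


if __name__ == "__main__":
    print(escape_bytes(b"a\nb"))
```

The pattern needs random-access writes into a buffer whose final length is only known at the end, which is the part that an append-only interface does not obviously cover.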
vstinner (Victor Stinner) October 25, 2024, 9:32am 10
The PyUnicodeWriter API is good for append-only functions. Maybe we need a different API to “allocate a buffer, fill in data, then do a final resize”.
pitrou (Antoine Pitrou) October 25, 2024, 5:03pm 11
Why would it be a different API?