Issue 22649: Use _PyUnicodeWriter in case_operation()

case_operation() in Objects/unicodeobject.c implements the case operations: lower, upper, casefold, etc.

Currently, the function uses a buffer of Py_UCS4 and overallocates it by 300%, sizing for the worst case: one character replaced with 3 characters.
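The 300% figure comes from Unicode special casing, where a single code point can expand to up to three. A quick standalone Python check (not part of the patch) shows the worst case:

```python
# Some case mappings expand one code point to several:
# U+0390 ('ΐ') uppercases to three code points (U+0399 U+0308 U+0301),
# and U+00DF ('ß') uppercases to two ('SS').
src = '\u0390'
print(len(src), len(src.upper()))   # 1 3
print('\u00df'.upper())             # SS
```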

I propose using the _PyUnicodeWriter API to be able to optimize the most common case: each character is replaced by exactly one other character, and the output string uses the same Unicode kind (UCS1, UCS2 or UCS4) as the input.

The patch preallocates the writer using the kind of the input string, but in some cases the result uses a lower kind (e.g. Latin-1 => ASCII). "Special" characters from the unit tests take the slow path.
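The kind-lowering case can be observed from Python with a small standalone check (not part of the patch): a Latin-1 (UCS1, non-ASCII) input whose result is pure ASCII, so preallocating with the input's kind overestimates.

```python
# 'ß' (U+00DF) is in the Latin-1 range, but its uppercase 'SS' is pure ASCII,
# so the result string can use a lower kind than the input.
s = '\u00df'
assert max(map(ord, s)) > 127           # input needs UCS1, not ASCII
assert max(map(ord, s.upper())) < 128   # 'SS' is pure ASCII
```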

The writer only overallocates when a replaced character expands to more than one character. Bad cases, where the length changes:

Add tests for:
- 'µ' or 'ÿ' (upper maps UCS1 to UCS2)
- 'ΐ' and similar (upper maps one UCS2 character to 3 UCS2 characters)
- 'ﬃ' or 'ﬄ' (upper maps UCS2 to 3 ASCII characters)
- 'İ' (the only character whose lower doesn't map to 1 character)
- 'Å' (lower maps UCS2 to UCS1)
- any Deseret or Warang Citi characters (UCS4)
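These mappings can be spot-checked from Python with standard Unicode case data (a standalone check, not the patch's actual tests; it assumes the 'Å' in question is the Angstrom sign U+212B, the UCS2 code point whose lowercase falls in Latin-1):

```python
# Spot-check the case mappings the proposed tests cover:
assert '\u00b5'.upper() == '\u039c'     # 'µ' MICRO SIGN -> GREEK CAPITAL MU (UCS1 -> UCS2)
assert '\u00ff'.upper() == '\u0178'     # 'ÿ' -> 'Ÿ' (UCS1 -> UCS2)
assert len('\u0390'.upper()) == 3       # 'ΐ' -> 3 UCS2 code points
assert '\ufb03'.upper() == 'FFI'        # 'ﬃ' ligature -> 3 ASCII characters
assert '\u0130'.lower() == 'i\u0307'    # 'İ' lowers to 2 code points
assert '\u212b'.lower() == '\u00e5'     # ANGSTROM SIGN -> 'å' (UCS2 -> UCS1)
assert '\U00010400'.lower() == '\U00010428'  # Deseret capital -> small (UCS4)
```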

It looks like it's cheaper to overallocate than to add overflow checks at each loop iteration.

I expected the temporary Py_UCS4 buffer plus the final conversion to a Unicode object (Py_UCS1, Py_UCS2 or Py_UCS4) to be more expensive than _PyUnicodeWriter, but it turns out the writer version is slower.

I tried to optimize the code but I didn't see how to make it really faster than the current code.

--

Currently, the code uses:

    for (j = 0; j < n_res; j++) {
        *maxchar = Py_MAX(*maxchar, mapped[j]);
        res[k++] = mapped[j];
    }

where res is a Py_UCS4* buffer and mapped is an array of 3 Py_UCS4.

I replaced it with a call to case_operation_write() which calls _PyUnicodeWriter_WriteCharInline().

_PyUnicodeWriter_WriteCharInline() is maybe more expensive than a plain "res[k++] = mapped[j];".