Hum, Serhiy is better than me at reviewing such hardcore C code :-) I'll let him review the patch ;-)

> short one, a little regression

In stringlib, the usual solution is to use a threshold: use a dummy loop for fewer than N bytes, otherwise use the ultra-optimized loop. Serhiy even implemented a "dynamic" threshold in some functions, when too many false positives are found. I don't recall where.

"And since str.replace may also go through the code path involving count, it's somewhat affected: (...) 1.07x faster"

I'm not really excited about optimizing str.count, since I don't think that this function is commonly used. But if str.replace is made faster, I'm interested :-) I understand that count() is only used when the old and new patterns of str.replace() have a different length.
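Something like this, schematically (the function name, the cutoff value and the memchr-based fast path are only illustrative, not the actual stringlib code):

#include <stddef.h>
#include <string.h>

#define SMALL_INPUT_CUTOFF 16  /* illustrative threshold */

size_t
count_byte(const unsigned char *s, size_t n, unsigned char ch)
{
    size_t count = 0;

    if (n < SMALL_INPUT_CUTOFF) {
        /* small input: a dummy loop avoids the setup cost of the fast path */
        for (size_t i = 0; i < n; i++)
            count += (s[i] == ch);
        return count;
    }

    /* large input: hand the scanning to an optimized primitive
       (stand-in for the ultra-optimized stringlib loop) */
    const unsigned char *p = s;
    const unsigned char *end = s + n;
    while ((p = memchr(p, ch, (size_t)(end - p))) != NULL) {
        count++;
        p++;
    }
    return count;
}

The "dynamic" variant additionally adjusts the cutoff at runtime when the optimized path keeps scanning without finding matches.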
> I understand that count() is only used when the old and new patterns of str.replace() have a different length.

Yes. I thought it wouldn't help much since str.replace involves many operations. But for long strings, it looks good:

./python3 -m perf timeit --compare-to ~/cpython/python -s 's="abcdefghihijklmnopqrstuvwxyz~!@##$%^&*()-=_+{}|"*100' 's.replace("a", "bc")'
python: ..................... 7.36 us +- 0.04 us
python3: ..................... 4.91 us +- 0.04 us
Median +- std dev: [python] 7.36 us +- 0.04 us -> [python3] 4.91 us +- 0.04 us: 1.50x faster  # 50% ??!! how?

And this patch also applies to bytes since they share the code.
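To make the dependency on count() concrete: when old and new have different lengths, the result size is not known up front, so the implementation first counts the occurrences to size the output buffer and only then copies. A rough sketch with made-up names (not the actual CPython code), for the single-byte case:

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Replace every 'old' byte in s[0..n) with repl[0..repl_len).
   Returns a malloc'ed buffer and stores its length in *out_len. */
unsigned char *
replace_byte(const unsigned char *s, size_t n,
             unsigned char old, const unsigned char *repl, size_t repl_len,
             size_t *out_len)
{
    /* first pass: count occurrences, because the result size depends on it */
    size_t occurrences = 0;
    for (size_t i = 0; i < n; i++)
        occurrences += (s[i] == old);

    *out_len = n - occurrences + occurrences * repl_len;
    unsigned char *res = malloc(*out_len ? *out_len : 1);
    if (res == NULL)
        return NULL;

    /* second pass: copy, substituting as we go */
    unsigned char *dst = res;
    for (size_t i = 0; i < n; i++) {
        if (s[i] == old) {
            memcpy(dst, repl, repl_len);
            dst += repl_len;
        }
        else {
            *dst++ = s[i];
        }
    }
    return res;
}

So any speedup in the counting pass shows up directly in replace() when the lengths differ, which is why the long-string case above improves so much.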
The code looks too complex. I don't think it is worth complicating the code so much to optimize just a few non-critical string operations. Many optimizations were rejected in the past due to high cost and low benefit.
> The code looks too complex.

It is if you look at the patch alone. But the technique is already used across stringlib, and the only new thing is the bitwise operation. str/bytes.count is not critical, but it wouldn't hurt to optimize it, especially since the effect is significant for long data. I am not in favour of rejecting it now, but let's set the priority to low.
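To illustrate the kind of bitwise operation I mean (this is only a generic word-at-a-time sketch, not the patch itself; alignment handling and the per-character width of str are ignored):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count bytes equal to ch in s[0..n), reading one machine word at a time
   and counting matches with bit tricks instead of a branch per byte. */
size_t
count_byte_swar(const unsigned char *s, size_t n, unsigned char ch)
{
    const uint64_t lo7  = UINT64_C(0x7F7F7F7F7F7F7F7F);
    const uint64_t hi   = UINT64_C(0x8080808080808080);
    const uint64_t ones = UINT64_C(0x0101010101010101);
    uint64_t pattern = ones * ch;   /* ch broadcast to every byte */
    size_t count = 0;

    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);        /* unaligned-safe load */
        uint64_t x = w ^ pattern;       /* bytes equal to ch become 0 */
        /* high bit of each byte of t is set iff that byte of x is nonzero;
           the addition cannot carry across byte boundaries */
        uint64_t t = ((x & lo7) + lo7) | x;
        uint64_t zeros = ~t & hi;       /* 0x80 in each byte where x == 0 */
        /* horizontal sum of the (at most 8) match flags */
        count += (size_t)(((zeros >> 7) * ones) >> 56);
        s += sizeof(uint64_t);
        n -= sizeof(uint64_t);
    }

    /* tail: plain loop for the remaining bytes */
    while (n--)
        count += (*s++ == ch);
    return count;
}

The rest of the patch is the same threshold/dispatch machinery that stringlib already uses elsewhere, which is why it looks bigger than the new idea actually is.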