Issue 25267: Optimize UTF-8 encoder with error handlers (original) (raw)

Issue25267

Created on 2015-09-29 11:30 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
utf8_encoder_errors.patch vstinner,2015-09-29 11:30 review
bench.py vstinner,2015-10-01 11:58
utf8_encoder_errors-2.patch vstinner,2015-10-01 12:01 review
utf8_encoder_errors-3.patch vstinner,2015-10-01 12:47 review
Messages (6)
msg251845 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-09-29 11:30
Attached patch optimizes the UTF-8 encoder for error handlers: ignore, replace, surrogateescape, surrogatepass. It is based on the patch faster_surrogates_hadling.patch written by Serhiy Storchaka in the issue #24870. It also modifies unicode_encode_ucs1() to use memset() for the replace error handler. It should be faster for long sequences of unencodable characters, but it may be slower for short sequences of unencodable characters. The patch adds new unit tests and fix unit tests to ensure that utf-8-sig codec is also well tested. TODO: write a benchmark. See also the issue #25227 which optimized ASCII and latin1 encoders with the surrogateescape error handlers.
msg252021 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-01 12:01
Oh, there is a bug in utf8_encoder() (not in my patch!), newpos was not used after calling the error handler. It's now fixed in the new patch.
msg252022 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-01 12:04
Benchmark results. Sorry for the very long output. There are some (corner?) cases where the patched Python is a little bit slower. I consider that it's ok since it's *much* faster in the other cases. What do you think? Common platform: Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09) Bits: int=32, long=64, long long=64, size_t=64, void*=64 CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz Python unicode implementation: PEP 393 Timer: time.perf_counter SCM: hg revision=10efb1797e7b+ tag=tip branch=default date="2015-10-01 13:16 +0200" Platform: Linux-4.1.6-200.fc22.x86_64-x86_64-with-fedora-22-Twenty_Two CFLAGS: -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes Platform of campaign before: Date: 2015-10-01 13:30:07 Timer precision: 61 ns Python version: 3.6.0a0 (default:10efb1797e7b, Oct 1 2015, 13:30:06) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] Platform of campaign after: Timer precision: 63 ns Date: 2015-10-01 13:54:14 Python version: 3.6.0a0 (default:10efb1797e7b+, Oct 1 2015, 13:53:51) [GCC 5.1.1 20150618 (Red Hat 5.1.1-4)] --------------------------+-------------+---------------- ignore: "\udcff" * length | before after --------------------------+-------------+---------------- length=10**1 3.16 us (*) 279 ns (-91%) length=10**3 241 us (*) 1.08 us (-100%) length=10**2 23.9 us (*) 346 ns (-99%) length=10**4 2.39 ms (*) 6.48 us (-100%) --------------------------+-------------+---------------- Total 2.66 ms (*) 8.19 us (-100%) --------------------------+-------------+---------------- --------------------------------+-------------+--------------- ignore: "a" * length + "\udcff" before after --------------------------------+-------------+--------------- length=10**1 1.12 us (*) 295 ns (-74%) length=10**3 2.2 us (*) 1.57 us (-29%) length=10**2 1.21 us (*) 408 ns (-66%) length=10**4 10.4 us (*) 12.3 us (+18%) --------------------------------+-------------+--------------- Total 15 us (*) 14.6 us --------------------------------+-------------+--------------- --------------------------------------------+-------------+--------------- ignore: ("a" * 99 + "\udcff" * 99) * length before after --------------------------------------------+-------------+--------------- length=10**1 238 us (*) 2.46 us (-99%) length=10**3 23.7 ms (*) 234 us (-99%) length=10**2 2.38 ms (*) 20.8 us (-99%) length=10**4 238 ms (*) 2.56 ms (-99%) --------------------------------------------+-------------+--------------- Total 265 ms (*) 2.82 ms (-99%) --------------------------------------------+-------------+--------------- ---------------------------------------+-------------+---------------- ignore: ("\udcff" * 99 + "a") * length before after ---------------------------------------+-------------+---------------- length=10**1 239 us (*) 1.29 us (-99%) length=10**3 23.8 ms (*) 80.9 us (-100%) length=10**2 2.4 ms (*) 8.44 us (-100%) length=10**4 236 ms (*) 839 us (-100%) ---------------------------------------+-------------+---------------- Total 263 ms (*) 930 us (-100%) ---------------------------------------+-------------+---------------- --------------------------------+-------------+--------------- ignore: "\udcff" + "a" * length before after --------------------------------+-------------+--------------- length=10**1 1.09 us (*) 297 ns (-73%) length=10**3 2.19 us (*) 1.58 us (-28%) length=10**2 1.19 us (*) 409 ns (-66%) length=10**4 10.5 us (*) 12.3 us (+17%) --------------------------------+-------------+--------------- Total 14.9 us (*) 14.6 us --------------------------------+-------------+--------------- ---------------------------+-------------+---------------- replace: "\udcff" * length before after ---------------------------+-------------+---------------- length=10**1 3.47 us (*) 317 ns (-91%) length=10**3 263 us (*) 1.07 us (-100%) length=10**2 26.4 us (*) 383 ns (-99%) length=10**4 2.65 ms (*) 6.75 us (-100%) ---------------------------+-------------+---------------- Total 2.94 ms (*) 8.52 us (-100%) ---------------------------+-------------+---------------- ---------------------------------+-------------+--------------- replace: "a" * length + "\udcff" before after ---------------------------------+-------------+--------------- length=10**1 1.16 us (*) 319 ns (-72%) length=10**3 2.25 us (*) 1.62 us (-28%) length=10**2 1.25 us (*) 432 ns (-65%) length=10**4 13.4 us (*) 12.4 us (-7%) ---------------------------------+-------------+--------------- Total 18 us (*) 14.7 us (-18%) ---------------------------------+-------------+--------------- ---------------------------------------------+-------------+--------------- replace: ("a" * 99 + "\udcff" * 99) * length before after ---------------------------------------------+-------------+--------------- length=10**1 267 us (*) 2.52 us (-99%) length=10**3 26.2 ms (*) 210 us (-99%) length=10**2 2.63 ms (*) 21.3 us (-99%) length=10**4 264 ms (*) 2.98 ms (-99%) ---------------------------------------------+-------------+--------------- Total 293 ms (*) 3.21 ms (-99%) ---------------------------------------------+-------------+--------------- ----------------------------------------+-------------+---------------- replace: ("\udcff" * 99 + "a") * length before after ----------------------------------------+-------------+---------------- length=10**1 263 us (*) 1.29 us (-100%) length=10**3 26.1 ms (*) 86.6 us (-100%) length=10**2 2.63 ms (*) 9.02 us (-100%) length=10**4 261 ms (*) 925 us (-100%) ----------------------------------------+-------------+---------------- Total 290 ms (*) 1.02 ms (-100%) ----------------------------------------+-------------+---------------- ---------------------------------+-------------+--------------- replace: "\udcff" + "a" * length before after ---------------------------------+-------------+--------------- length=10**1 1.14 us (*) 317 ns (-72%) length=10**3 2.24 us (*) 1.6 us (-28%) length=10**2 1.23 us (*) 428 ns (-65%) length=10**4 10.5 us (*) 12.3 us (+17%) ---------------------------------+-------------+--------------- Total 15.1 us (*) 14.7 us ---------------------------------+-------------+--------------- -----------------------------------+-------------+--------------- surrogateescape: "\udcff" * length before after -----------------------------------+-------------+--------------- length=10**1 3.48 us (*) 281 ns (-92%) length=10**3 267 us (*) 1.77 us (-99%) length=10**2 26.7 us (*) 424 ns (-98%) length=10**4 2.67 ms (*) 13.9 us (-99%) -----------------------------------+-------------+--------------- Total 2.97 ms (*) 16.3 us (-99%) -----------------------------------+-------------+--------------- -----------------------------------------+-------------+--------------- surrogateescape: "a" * length + "\udcff" before after -----------------------------------------+-------------+--------------- length=10**1 1.14 us (*) 277 ns (-76%) length=10**3 2.32 us (*) 1.57 us (-32%) length=10**2 1.24 us (*) 391 ns (-68%) length=10**4 10.6 us (*) 12.3 us (+17%) -----------------------------------------+-------------+--------------- Total 15.3 us (*) 14.6 us -----------------------------------------+-------------+--------------- -----------------------------------------------------+-------------+--------------- surrogateescape: ("a" * 99 + "\udcff" * 99) * length before after -----------------------------------------------------+-------------+--------------- length=10**1 266 us (*) 3.26 us (-99%) length=10**3 26.4 ms (*) 285 us (-99%) length=10**2 2.65 ms (*) 28.9 us (-99%) length=10**4 266 ms (*) 3.73 ms (-99%) -----------------------------------------------------+-------------+--------------- Total 295 ms (*) 4.04 ms (-99%) -----------------------------------------------------+-------------+--------------- ------------------------------------------------+-------------+--------------- surrogateescape: ("\udcff" * 99 + "a") * length before after ------------------------------------------------+-------------+--------------- length=10**1 265 us (*) 2.04 us (-99%) length=10**3 26.2 ms (*) 165 us (-99%) length=10**2 2.64 ms (*) 17 us (-99%) length=10**4 263 ms (*) 1.75 ms (-99%) ------------------------------------------------+-------------+--------------- Total 292 ms (*) 1.93 ms (-99%) ------------------------------------------------+-------------+--------------- -----------------------------------------+-------------+--------------- surrogateescape: "\udcff" + "a" * length before after -----------------------------------------+-------------+--------------- length=10**1 1.12 us (*) 278 ns (-75%) length=10**3 2.25 us (*) 1.59 us (-29%) length=10**2 1.21 us (*) 389 ns (-68%) length=10**4 10.5 us (*) 12.3 us (+17%) -----------------------------------------+-------------+--------------- Total 15.1 us (*) 14.6 us -----------------------------------------+-------------+--------------- ---------------------------------+-------------+--------------- surrogatepass: "\udcff" * length before after ---------------------------------+-------------+--------------- length=10**1 3.71 us (*) 306 ns (-92%) length=10**3 289 us (*) 2.61 us (-99%) length=10**2 28.9 us (*) 532 ns (-98%) length=10**4 2.88 ms (*) 22.4 us (-99%) ---------------------------------+-------------+--------------- Total 3.2 ms (*) 25.8 us (-99%) ---------------------------------+-------------+--------------- ---------------------------------------+-------------+--------------- surrogatepass: "a" * length + "\udcff" before after ---------------------------------------+-------------+--------------- length=10**1 1.16 us (*) 299 ns (-74%) length=10**3 2.36 us (*) 1.59 us (-32%) length=10**2 1.27 us (*) 413 ns (-68%) length=10**4 10.6 us (*) 12.3 us (+16%) ---------------------------------------+-------------+--------------- Total 15.4 us (*) 14.6 us (-5%) ---------------------------------------+-------------+--------------- ---------------------------------------------------+-------------+--------------- surrogatepass: ("a" * 99 + "\udcff" * 99) * length before after ---------------------------------------------------+-------------+--------------- length=10**1 289 us (*) 3.99 us (-99%) length=10**3 28.5 ms (*) 362 us (-99%) length=10**2 2.86 ms (*) 36.7 us (-99%) length=10**4 287 ms (*) 5.18 ms (-98%) ---------------------------------------------------+-------------+--------------- Total 319 ms (*) 5.59 ms (-98%) ---------------------------------------------------+-------------+--------------- ----------------------------------------------+-------------+--------------- surrogatepass: ("\udcff" * 99 + "a") * length before after ----------------------------------------------+-------------+--------------- length=10**1 288 us (*) 2.91 us (-99%) length=10**3 28.5 ms (*) 242 us (-99%) length=10**2 2.86 ms (*) 24.7 us (-99%) length=10**4 284 ms (*) 2.53 ms (-99%) ----------------------------------------------+-------------+--------------- Total 316 ms (*) 2.8 ms (-99%) ----------------------------------------------+-------------+--------------- ---------------------------------------+-------------+--------------- surrogatepass: "\udcff" + "a" * length before after ---------------------------------------+-------------+--------------- length=10**1 1.13 us (*) 301 ns (-73%) length=10**3 2.3 us (*) 1.59 us (-31%) length=10**2 1.24 us (*) 409 ns (-67%) length=10**4 10.6 us (*) 12.1 us (+15%) ---------------------------------------+-------------+--------------- Total 15.2 us (*) 14.4 us (-5%) ---------------------------------------+-------------+--------------- ------------------------------------+-------------+--------------- backslashreplace: "\udcff" * length before after ------------------------------------+-------------+--------------- length=10**1 4.28 us (*) 1.58 us (-63%) length=10**3 320 us (*) 11.1 us (-97%) length=10**2 32.3 us (*) 2.56 us (-92%) length=10**4 3.17 ms (*) 96.6 us (-97%) ------------------------------------+-------------+--------------- Total 3.52 ms (*) 112 us (-97%) ------------------------------------+-------------+--------------- ------------------------------------------+-------------+--------------- backslashreplace: "a" * length + "\udcff" before after ------------------------------------------+-------------+--------------- length=10**1 1.44 us (*) 1.47 us length=10**3 2.43 us (*) 2.77 us (+14%) length=10**2 1.52 us (*) 1.64 us (+8%) length=10**4 10.6 us (*) 13.3 us (+25%) ------------------------------------------+-------------+--------------- Total 16 us (*) 19.2 us (+20%) ------------------------------------------+-------------+--------------- ------------------------------------------------------+-------------+--------------- backslashreplace: ("a" * 99 + "\udcff" * 99) * length before after ------------------------------------------------------+-------------+--------------- length=10**1 316 us (*) 16 us (-95%) length=10**3 31.3 ms (*) 1.46 ms (-95%) length=10**2 3.14 ms (*) 147 us (-95%) length=10**4 313 ms (*) 15.3 ms (-95%) ------------------------------------------------------+-------------+--------------- Total 347 ms (*) 16.9 ms (-95%) ------------------------------------------------------+-------------+--------------- -------------------------------------------------+-------------+--------------- backslashreplace: ("\udcff" * 99 + "a") * length before after -------------------------------------------------+-------------+--------------- length=10**1 317 us (*) 14.7 us (-95%) length=10**3 31.3 ms (*) 1.34 ms (-96%) length=10**2 3.17 ms (*) 135 us (-96%) length=10**4 313 ms (*) 13.8 ms (-96%) -------------------------------------------------+-------------+--------------- Total 347 ms (*) 15.3 ms (-96%) -------------------------------------------------+-------------+--------------- ------------------------------------------+-------------+--------------- backslashreplace: "\udcff" + "a" * length before after ------------------------------------------+-------------+--------------- length=10**1 1.43 us (*) 1.45 us length=10**3 2.36 us (*) 2.58 us (+9%) length=10**2 1.51 us (*) 1.57 us length=10**4 10.5 us (*) 13.2 us (+26%) ------------------------------------------+-------------+--------------- Total 15.8 us (*) 18.8 us (+19%) ------------------------------------------+-------------+--------------- ------------------------------------------------------+--------------+---------------- Summary before after ------------------------------------------------------+--------------+---------------- ignore: "\udcff" * length 2.66 ms (*) 8.19 us (-100%) ignore: "a" * length + "\udcff" 15 us (*) 14.6 us ignore: ("a" * 99 + "\udcff" * 99) * length 265 ms (*) 2.82 ms (-99%) ignore: ("\udcff" * 99 + "a") * length 263 ms (*) 930 us (-100%) ignore: "\udcff" + "a" * length 14.9 us (*) 14.6 us replace: "\udcff" * length 2.94 ms (*) 8.52 us (-100%) replace: "a" * length + "\udcff" 18 us (*) 14.7 us (-18%) replace: ("a" * 99 + "\udcff" * 99) * length 293 ms (*) 3.21 ms (-99%) replace: ("\udcff" * 99 + "a") * length 290 ms (*) 1.02 ms (-100%) replace: "\udcff" + "a" * length 15.1 us (*) 14.7 us surrogateescape: "\udcff" * length 2.97 ms (*) 16.3 us (-99%) surrogateescape: "a" * length + "\udcff" 15.3 us (*) 14.6 us surrogateescape: ("a" * 99 + "\udcff" * 99) * length 295 ms (*) 4.04 ms (-99%) surrogateescape: ("\udcff" * 99 + "a") * length 292 ms (*) 1.93 ms (-99%) surrogateescape: "\udcff" + "a" * length 15.1 us (*) 14.6 us surrogatepass: "\udcff" * length 3.2 ms (*) 25.8 us (-99%) surrogatepass: "a" * length + "\udcff" 15.4 us (*) 14.6 us (-5%) surrogatepass: ("a" * 99 + "\udcff" * 99) * length 319 ms (*) 5.59 ms (-98%) surrogatepass: ("\udcff" * 99 + "a") * length 316 ms (*) 2.8 ms (-99%) surrogatepass: "\udcff" + "a" * length 15.2 us (*) 14.4 us (-5%) backslashreplace: "\udcff" * length 3.52 ms (*) 112 us (-97%) backslashreplace: "a" * length + "\udcff" 16 us (*) 19.2 us (+20%) backslashreplace: ("a" * 99 + "\udcff" * 99) * length 347 ms (*) 16.9 ms (-95%) backslashreplace: ("\udcff" * 99 + "a") * length 347 ms (*) 15.3 ms (-96%) backslashreplace: "\udcff" + "a" * length 15.8 us (*) 18.8 us (+19%) ------------------------------------------------------+--------------+---------------- Total 3.04 sec (*) 54.9 ms (-98%) ------------------------------------------------------+--------------+----------------
msg252024 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-01 12:47
Oh, the default handler for errror handlers uses a loop to check for non-ASCII characters. It can be replaced with PyUnicode_IS_ASCII(str) which has a complexity O(1). Done in new patch.
msg252058 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2015-10-01 21:20
New changeset 2b5357b38366 by Victor Stinner in branch 'default': Issue #25267: The UTF-8 encoder is now up to 75 times as fast for error https://hg.python.org/cpython/rev/2b5357b38366
msg252060 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-10-01 21:27
I pushed my optimization.
History
Date User Action Args
2022-04-11 14:58:21 admin set github: 69454
2015-10-01 21:27:18 vstinner set status: open -> closedresolution: fixedmessages: +
2015-10-01 21:20:10 python-dev set nosy: + python-devmessages: +
2015-10-01 12:47:04 vstinner set files: + utf8_encoder_errors-3.patchmessages: +
2015-10-01 12:04:37 vstinner set messages: +
2015-10-01 12:01:26 vstinner set files: + utf8_encoder_errors-2.patchmessages: +
2015-10-01 11:58:15 vstinner set files: + bench.py
2015-09-29 11:30:34 vstinner create