gh-99108: Add HACL* Blake2 implementation to hashlib by msprotz · Pull Request #119316 · python/cpython
So I computed some benchmarks on a variety of machines, comparing HACL's performance with libb2. This is using the benchmarking infrastructure from HACL-packages: https://github.com/cryspen/hacl-packages/blob/main/benchmarks/blake.cc
To read these benchmarks, bear in mind that:
- libb2 automatically picks the "best" implementation (via CPUID)
- libb2 has portable C code (which is what gets picked on ARM), and AVX2 code (which is what gets picked for the benchmarks below on Intel machines)
- HACL* never does runtime CPU detection
- HACL's Blake2s comes in a regular (a.k.a. "32") version and a 128-bit vectorized version (which runs on either NEON or AVX)
- HACL's Blake2b comes in a regular (a.k.a. "32") version and a 256-bit vectorized version (which runs on AVX2)
- we benchmark various input sizes, from 0 bytes up to 16 MiB (a minimal sketch of what each measurement does is shown right below)
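The harness linked above appears to be a Google Benchmark driver; each row below gives the benchmark name with the input size in bytes, wall-clock time, CPU time, and the iteration count the harness settled on. For readers who want the shape of the measurement without the C++ harness, here is a minimal stand-alone C sketch. `blake2b_oneshot` is a placeholder for whichever one-shot implementation (HACL* or libb2) is under test, and the timing loop is deliberately simpler than what the real harness does.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Placeholder: swap in the HACL* or libb2 one-shot call being measured. */
static void blake2b_oneshot(uint8_t out[64], const uint8_t *in, size_t inlen)
{
    (void)in;
    (void)inlen;
    memset(out, 0, 64);
}

/* Time `iterations` one-shot hashes of an `inlen`-byte input, in ns/op. */
static double bench_ns_per_op(size_t inlen, int iterations)
{
    uint8_t *in = calloc(inlen ? inlen : 1, 1);
    uint8_t out[64];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        blake2b_oneshot(out, in, inlen);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(in);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / iterations;
}

int main(void)
{
    /* Same input sizes as the tables below: 0 bytes up to 16 MiB. */
    const size_t sizes[] = {0, 16, 256, 4096, 65536, 1048576, 16777216};
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        printf("oneshot/%-8zu %10.0f ns/op\n",
               sizes[i], bench_ns_per_op(sizes[i], 1000));
    }
    return 0;
}
```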
**M3 Max (ARM)**

```
HACL_blake2b_32_oneshot/0 121 ns 120 ns 5839173
HACL_blake2b_32_oneshot/16 121 ns 120 ns 5758236
HACL_blake2b_32_oneshot/256 226 ns 225 ns 3098757
HACL_blake2b_32_oneshot/4096 3413 ns 3396 ns 206182
HACL_blake2b_32_oneshot/65536 54123 ns 53818 ns 12987
HACL_blake2b_32_oneshot/1048576 867868 ns 863094 ns 815
HACL_blake2b_32_oneshot/16777216 13993602 ns 13929400 ns 50
libb2_blake2b_oneshot/0 148 ns 147 ns 4757374
libb2_blake2b_oneshot/16 146 ns 145 ns 4812849
libb2_blake2b_oneshot/256 276 ns 275 ns 2551783
libb2_blake2b_oneshot/4096 4224 ns 4202 ns 167284
libb2_blake2b_oneshot/65536 67478 ns 67179 ns 10367
libb2_blake2b_oneshot/1048576 1079958 ns 1073158 ns 644
libb2_blake2b_oneshot/16777216 17322373 ns 17220756 ns 41
HACL_blake2s_32_oneshot/0 101 ns 100 ns 6901311
HACL_blake2s_32_oneshot/16 101 ns 101 ns 6980107
HACL_blake2s_32_oneshot/256 370 ns 368 ns 1894196
HACL_blake2s_32_oneshot/4096 5771 ns 5740 ns 121353
HACL_blake2s_32_oneshot/65536 91417 ns 90835 ns 7762
HACL_blake2s_32_oneshot/1048576 1478896 ns 1469956 ns 475
HACL_blake2s_32_oneshot/16777216 23471740 ns 23337633 ns 30
HACL_blake2s_vec128_oneshot/0 177 ns 176 ns 4077567
HACL_blake2s_vec128_oneshot/16 186 ns 185 ns 3864926
HACL_blake2s_vec128_oneshot/256 685 ns 682 ns 1044433
HACL_blake2s_vec128_oneshot/4096 10868 ns 10819 ns 65872
HACL_blake2s_vec128_oneshot/65536 174311 ns 173302 ns 4041
HACL_blake2s_vec128_oneshot/1048576 2795034 ns 2779949 ns 255
HACL_blake2s_vec128_oneshot/16777216 44968086 ns 44750562 ns 16
libb2_blake2s_oneshot/0 177 ns 175 ns 3965174
libb2_blake2s_oneshot/16 176 ns 175 ns 3977815
libb2_blake2s_oneshot/256 689 ns 684 ns 1034218
libb2_blake2s_oneshot/4096 10729 ns 10686 ns 65111
libb2_blake2s_oneshot/65536 171900 ns 170899 ns 4119
libb2_blake2s_oneshot/1048576 2762123 ns 2746686 ns 255
libb2_blake2s_oneshot/16777216 44150450 ns 43912125 ns 16
```
interpretation:
- HACL* 20% faster than libb2 (Blake2b)
- HACL* 47% faster than libb2 (Blake2s), as long as you don't use HACL's NEON version, which suffers from bad latency for vector shift instructions (see weidai11/cryptopp#367, "BLAKE2b NEON suffers poor performance on ARMv8/Aarch64 with Cortex-A57")
**Intel AVX2 (8-Core Intel Core i9, MacBook Pro, 2019)**

```
HACL_blake2b_32_oneshot/0 189 ns 188 ns 3681808
HACL_blake2b_32_oneshot/16 192 ns 192 ns 3721267
HACL_blake2b_32_oneshot/256 334 ns 334 ns 2107862
HACL_blake2b_32_oneshot/4096 4774 ns 4772 ns 151013
HACL_blake2b_32_oneshot/65536 76518 ns 76485 ns 9405
HACL_blake2b_32_oneshot/1048576 1180256 ns 1180031 ns 609
HACL_blake2b_32_oneshot/16777216 18999771 ns 18996081 ns 37
HACL_blake2b_vec256_oneshot/0 144 ns 144 ns 5012424
HACL_blake2b_vec256_oneshot/16 142 ns 142 ns 4864185
HACL_blake2b_vec256_oneshot/256 247 ns 246 ns 2801502
HACL_blake2b_vec256_oneshot/4096 3457 ns 3456 ns 202513
HACL_blake2b_vec256_oneshot/65536 54684 ns 54660 ns 12402
HACL_blake2b_vec256_oneshot/1048576 878910 ns 878574 ns 793
HACL_blake2b_vec256_oneshot/16777216 14755193 ns 14751612 ns 49
libb2_blake2b_oneshot/0 162 ns 162 ns 4324084
libb2_blake2b_oneshot/16 175 ns 175 ns 4303296
libb2_blake2b_oneshot/256 282 ns 282 ns 2425427
libb2_blake2b_oneshot/4096 4318 ns 4313 ns 176491
libb2_blake2b_oneshot/65536 63438 ns 63411 ns 10497
libb2_blake2b_oneshot/1048576 987715 ns 987456 ns 708
libb2_blake2b_oneshot/16777216 16262424 ns 16262024 ns 41
HACL_blake2s_32_oneshot/0 137 ns 136 ns 5224504
HACL_blake2s_32_oneshot/16 133 ns 133 ns 5199706
HACL_blake2s_32_oneshot/256 447 ns 446 ns 1590439
HACL_blake2s_32_oneshot/4096 6792 ns 6789 ns 104370
HACL_blake2s_32_oneshot/65536 103446 ns 103393 ns 6687
HACL_blake2s_32_oneshot/1048576 1770925 ns 1769228 ns 403
HACL_blake2s_32_oneshot/16777216 27090139 ns 27082875 ns 24
HACL_blake2s_vec128_oneshot/0 121 ns 121 ns 5760132
HACL_blake2s_vec128_oneshot/16 127 ns 127 ns 5784359
HACL_blake2s_vec128_oneshot/256 378 ns 378 ns 1834978
HACL_blake2s_vec128_oneshot/4096 5727 ns 5725 ns 119742
HACL_blake2s_vec128_oneshot/65536 88395 ns 88384 ns 7822
HACL_blake2s_vec128_oneshot/1048576 1462898 ns 1462569 ns 490
HACL_blake2s_vec128_oneshot/16777216 24048063 ns 24040862 ns 29
libb2_blake2s_oneshot/0 102 ns 102 ns 6673913
libb2_blake2s_oneshot/16 105 ns 105 ns 6419663
libb2_blake2s_oneshot/256 367 ns 367 ns 1891554
libb2_blake2s_oneshot/4096 5540 ns 5537 ns 121671
libb2_blake2s_oneshot/65536 88250 ns 88211 ns 7868
libb2_blake2s_oneshot/1048576 1326051 ns 1325750 ns 519
libb2_blake2s_oneshot/16777216 22597565 ns 22584281 ns 32
```
interpretation:
- for Blake2b, HACL*/AVX2 10% faster than libb2, HACL*/portable C 16% slower than libb2
- for Blake2s, HACL*/AVX2 10% slower than libb2, HACL*/portable C 20% slower than libb2 (this is surprising; I'm seeing equal performance using another API, so this may simply be some missing `static inline`s on hot paths, see the sketch below)
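To make the `static inline` hypothesis concrete, here is a toy illustration (not HACL* source, and `rotr32`/`toy_rounds` are made-up names): a small helper defined in its own translation unit cannot be inlined into callers elsewhere without LTO, so a hot compression loop pays a function call per use; declaring the helper `static inline` in a shared header lets the compiler fold it into the caller's loop.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* If this lived in its own .c file, every use from another translation unit
 * would be an out-of-line call (absent LTO); as a `static inline` in a shared
 * header, it disappears into the caller's loop. */
static inline uint32_t rotr32(uint32_t x, uint32_t n)
{
    return (x >> n) | (x << (32u - n));
}

/* Toy hot loop standing in for a compression function: the benefit comes from
 * rotr32 being inlined rather than called once per word per round. */
static void toy_rounds(uint32_t state[16], size_t nrounds)
{
    for (size_t r = 0; r < nrounds; r++) {
        for (int i = 0; i < 16; i++) {
            state[i] = rotr32(state[i] ^ state[(i + 1) & 15], 7);
        }
    }
}

int main(void)
{
    uint32_t state[16] = {0};
    state[0] = 0x6a09e667u;        /* arbitrary starting value */
    toy_rounds(state, 12);
    printf("%08x\n", state[0]);    /* keep the call observable */
    return 0;
}
```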
**AVX2 desktop machine, Haswell**

```
HACL_blake2b_32_oneshot/0 211 ns 211 ns 3096750
HACL_blake2b_32_oneshot/16 212 ns 212 ns 3300382
HACL_blake2b_32_oneshot/256 388 ns 388 ns 1803603
HACL_blake2b_32_oneshot/4096 5638 ns 5638 ns 124313
HACL_blake2b_32_oneshot/65536 89905 ns 89903 ns 7774
HACL_blake2b_32_oneshot/1048576 1438161 ns 1438056 ns 488
HACL_blake2b_32_oneshot/16777216 22998077 ns 22997469 ns 30
HACL_blake2b_vec256_oneshot/0 168 ns 168 ns 4145151
HACL_blake2b_vec256_oneshot/16 168 ns 168 ns 4172085
HACL_blake2b_vec256_oneshot/256 309 ns 309 ns 2263628
HACL_blake2b_vec256_oneshot/4096 4507 ns 4507 ns 155137
HACL_blake2b_vec256_oneshot/65536 72069 ns 72068 ns 9748
HACL_blake2b_vec256_oneshot/1048576 1149298 ns 1149303 ns 610
HACL_blake2b_vec256_oneshot/16777216 18438313 ns 18436309 ns 38
libb2_blake2b_oneshot/0 190 ns 190 ns 3670823
libb2_blake2b_oneshot/16 194 ns 194 ns 3618244
libb2_blake2b_oneshot/256 353 ns 353 ns 1984245
libb2_blake2b_oneshot/4096 5225 ns 5225 ns 134636
libb2_blake2b_oneshot/65536 82645 ns 82642 ns 8502
libb2_blake2b_oneshot/1048576 1328695 ns 1328701 ns 529
libb2_blake2b_oneshot/16777216 21131069 ns 21130208 ns 33
HACL_blake2s_32_oneshot/0 172 ns 172 ns 4080561
HACL_blake2s_32_oneshot/16 174 ns 174 ns 4024023
HACL_blake2s_32_oneshot/256 605 ns 605 ns 1154699
HACL_blake2s_32_oneshot/4096 9210 ns 9210 ns 76047
HACL_blake2s_32_oneshot/65536 147054 ns 147054 ns 4744
HACL_blake2s_32_oneshot/1048576 2354038 ns 2354016 ns 296
HACL_blake2s_32_oneshot/16777216 37686817 ns 37686992 ns 19
HACL_blake2s_vec128_oneshot/0 150 ns 150 ns 4686347
HACL_blake2s_vec128_oneshot/16 151 ns 151 ns 4628570
HACL_blake2s_vec128_oneshot/256 513 ns 513 ns 1367690
HACL_blake2s_vec128_oneshot/4096 7748 ns 7748 ns 90207
HACL_blake2s_vec128_oneshot/65536 123277 ns 123276 ns 5676
HACL_blake2s_vec128_oneshot/1048576 1971694 ns 1971703 ns 355
HACL_blake2s_vec128_oneshot/16777216 31514541 ns 31512743 ns 22
libb2_blake2s_oneshot/0 151 ns 151 ns 4651126
libb2_blake2s_oneshot/16 154 ns 154 ns 4542519
libb2_blake2s_oneshot/256 542 ns 542 ns 1293258
libb2_blake2s_oneshot/4096 8381 ns 8381 ns 83415
libb2_blake2s_oneshot/65536 133872 ns 133869 ns 5217
libb2_blake2s_oneshot/1048576 2140551 ns 2140560 ns 327
libb2_blake2s_oneshot/16777216 34272522 ns 34271712 ns 20
```
interpretation:
- HACL*/AVX2 13% faster than libb2, HACL*/portable C 8% slower than libb2 (Blake2b)
- HACL*/AVX2 8% faster than libb2, HACL*/portable C 19% slower than libb2 (Blake2s)
This leaves us with several options. I would love to have your opinion, @gpshead.
- CPython loses the ability to build against libb2 and simply packages HACL's portable versions, which offer a significant performance boost on ARM but are slightly slower on Intel.
  - Pros: the CPython build is simplified, an external dependency is dropped, and there is a compelling story with simple C code everywhere.
  - Cons: slight performance impact on Intel.
- CPython loses the ability to build against libb2 and packages both HACL's portable versions and its vectorized (128-bit and 256-bit) versions, to be enabled on Intel (AVX and AVX2) but not on NEON (see the issue with high-latency vector shifts above).
  - Pros: CPython increases Blake2 performance across the board (Intel and ARM), plus the build simplification and the loss of an extra dependency.
  - Cons: I need to author a CPU detection layer and fiddle with the CPython build (not a big deal, happy to do so; a sketch of such a layer follows this list).
- CPython maintains HACL* and libb2 side by side.
  - Pros: none.
  - Cons: duplication of code, a build nightmare, and the need to deal with two different APIs for the bindings from C to the Python module.
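For option 2, the CPU detection layer could be quite small. Here is a hedged sketch, assuming GCC/Clang on x86-64; the two hash functions are stubs with made-up names standing in for the real HACL* entry points, not the actual API.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef void (*blake2b_fn)(uint8_t *out, size_t outlen,
                           const uint8_t *in, size_t inlen);

/* Stub standing in for the portable ("32") implementation. */
static void blake2b_portable_stub(uint8_t *out, size_t outlen,
                                  const uint8_t *in, size_t inlen)
{
    (void)in; (void)inlen;
    for (size_t i = 0; i < outlen; i++) out[i] = 0x00;
}

/* Stub standing in for the vec256 (AVX2) implementation. */
static void blake2b_vec256_stub(uint8_t *out, size_t outlen,
                                const uint8_t *in, size_t inlen)
{
    (void)in; (void)inlen;
    for (size_t i = 0; i < outlen; i++) out[i] = 0xff;
}

/* Run once (e.g. at module init) and cache the result. */
static blake2b_fn select_blake2b(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* GCC/Clang expose the CPUID check directly. */
    if (__builtin_cpu_supports("avx2")) {
        return blake2b_vec256_stub;
    }
#endif
    /* ARM stays on the portable code, per the NEON latency issue above. */
    return blake2b_portable_stub;
}

int main(void)
{
    blake2b_fn blake2b = select_blake2b();
    uint8_t digest[64];
    blake2b(digest, sizeof digest, NULL, 0);
    printf("dispatched to %s\n",
           blake2b == blake2b_vec256_stub ? "vec256 (AVX2)" : "portable C");
    return 0;
}
```

In CPython this selection would presumably run once at module init, with the resulting function pointer cached, so dispatch adds no per-call cost.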
Please share thoughts. Thanks!
CC @R1kM, who provided considerable help with the libb2 benchmarking.