gh-99108: Add HACL* Blake2 implementation to hashlib by msprotz · Pull Request #119316 · python/cpython
So I computed some benchmarks on a variety of machines, comparing HACL's performance with libb2. This is using the benchmarking infrastructure from HACL-packages: https://github.com/cryspen/hacl-packages/blob/main/benchmarks/blake.cc
To read these benchmarks, bear in mind that:
- libb2 automatically picks the "best" implementation (via CPUID)
- libb2 has portable C code (which is what gets picked on ARM), and AVX2 code (which is what gets picked for the benchmarks below on Intel machines)
- HACL* never does runtime CPU detection
- HACL's Blake2s comes in a regular (a.k.a. "32") version and a 128-bit vectorized version (which runs on either NEON or AVX)
- HACL's Blake2b comes in a regular (a.k.a. "32") version and a 256-bit vectorized version (which runs on AVX2)
- we benchmark various input sizes, from 0 bytes up to 16 MiB (a minimal sketch of what each measurement does is shown right below)
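The harness linked above appears to be a Google Benchmark driver; each row below gives the benchmark name with the input size in bytes, wall-clock time, CPU time, and the iteration count the harness settled on. For readers who want the shape of the measurement without the C++ harness, here is a minimal stand-alone C sketch. `blake2b_oneshot` is a placeholder for whichever one-shot implementation (HACL* or libb2) is under test, and the timing loop is deliberately simpler than what the real harness does.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Placeholder: swap in the HACL* or libb2 one-shot call being measured. */
static void blake2b_oneshot(uint8_t out[64], const uint8_t *in, size_t inlen)
{
    (void)in;
    (void)inlen;
    memset(out, 0, 64);
}

/* Time `iterations` one-shot hashes of an `inlen`-byte input, in ns/op. */
static double bench_ns_per_op(size_t inlen, int iterations)
{
    uint8_t *in = calloc(inlen ? inlen : 1, 1);
    uint8_t out[64];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        blake2b_oneshot(out, in, inlen);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(in);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / iterations;
}

int main(void)
{
    /* Same input sizes as the tables below: 0 bytes up to 16 MiB. */
    const size_t sizes[] = {0, 16, 256, 4096, 65536, 1048576, 16777216};
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        printf("oneshot/%-8zu %10.0f ns/op\n",
               sizes[i], bench_ns_per_op(sizes[i], 1000));
    }
    return 0;
}
```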
**M3 Max (ARM)**

```
HACL_blake2b_32_oneshot/0 121 ns 120 ns 5839173
HACL_blake2b_32_oneshot/16 121 ns 120 ns 5758236
HACL_blake2b_32_oneshot/256 226 ns 225 ns 3098757
HACL_blake2b_32_oneshot/4096 3413 ns 3396 ns 206182
HACL_blake2b_32_oneshot/65536 54123 ns 53818 ns 12987
HACL_blake2b_32_oneshot/1048576 867868 ns 863094 ns 815
HACL_blake2b_32_oneshot/16777216 13993602 ns 13929400 ns 50
libb2_blake2b_oneshot/0 148 ns 147 ns 4757374
libb2_blake2b_oneshot/16 146 ns 145 ns 4812849
libb2_blake2b_oneshot/256 276 ns 275 ns 2551783
libb2_blake2b_oneshot/4096 4224 ns 4202 ns 167284
libb2_blake2b_oneshot/65536 67478 ns 67179 ns 10367
libb2_blake2b_oneshot/1048576 1079958 ns 1073158 ns 644
libb2_blake2b_oneshot/16777216 17322373 ns 17220756 ns 41
HACL_blake2s_32_oneshot/0 101 ns 100 ns 6901311
HACL_blake2s_32_oneshot/16 101 ns 101 ns 6980107
HACL_blake2s_32_oneshot/256 370 ns 368 ns 1894196
HACL_blake2s_32_oneshot/4096 5771 ns 5740 ns 121353
HACL_blake2s_32_oneshot/65536 91417 ns 90835 ns 7762
HACL_blake2s_32_oneshot/1048576 1478896 ns 1469956 ns 475
HACL_blake2s_32_oneshot/16777216 23471740 ns 23337633 ns 30
HACL_blake2s_vec128_oneshot/0 177 ns 176 ns 4077567
HACL_blake2s_vec128_oneshot/16 186 ns 185 ns 3864926
HACL_blake2s_vec128_oneshot/256 685 ns 682 ns 1044433
HACL_blake2s_vec128_oneshot/4096 10868 ns 10819 ns 65872
HACL_blake2s_vec128_oneshot/65536 174311 ns 173302 ns 4041
HACL_blake2s_vec128_oneshot/1048576 2795034 ns 2779949 ns 255
HACL_blake2s_vec128_oneshot/16777216 44968086 ns 44750562 ns 16
libb2_blake2s_oneshot/0 177 ns 175 ns 3965174
libb2_blake2s_oneshot/16 176 ns 175 ns 3977815
libb2_blake2s_oneshot/256 689 ns 684 ns 1034218
libb2_blake2s_oneshot/4096 10729 ns 10686 ns 65111
libb2_blake2s_oneshot/65536 171900 ns 170899 ns 4119
libb2_blake2s_oneshot/1048576 2762123 ns 2746686 ns 255
libb2_blake2s_oneshot/16777216 44150450 ns 43912125 ns 16
```
interpretation:
- HACL* 20% faster than libb2 (Blake2b)
- HACL* 47% faster than libb2 (Blake2s), as long as you don't use HACL's NEON version, which suffers from bad latency for vector shift instructions (see weidai11/cryptopp#367, "BLAKE2b NEON suffers poor performance on ARMv8/Aarch64 with Cortex-A57")
**Intel AVX2 (8-Core Intel Core i9, MacBook Pro, 2019)**

```
HACL_blake2b_32_oneshot/0 189 ns 188 ns 3681808
HACL_blake2b_32_oneshot/16 192 ns 192 ns 3721267
HACL_blake2b_32_oneshot/256 334 ns 334 ns 2107862
HACL_blake2b_32_oneshot/4096 4774 ns 4772 ns 151013
HACL_blake2b_32_oneshot/65536 76518 ns 76485 ns 9405
HACL_blake2b_32_oneshot/1048576 1180256 ns 1180031 ns 609
HACL_blake2b_32_oneshot/16777216 18999771 ns 18996081 ns 37
HACL_blake2b_vec256_oneshot/0 144 ns 144 ns 5012424
HACL_blake2b_vec256_oneshot/16 142 ns 142 ns 4864185
HACL_blake2b_vec256_oneshot/256 247 ns 246 ns 2801502
HACL_blake2b_vec256_oneshot/4096 3457 ns 3456 ns 202513
HACL_blake2b_vec256_oneshot/65536 54684 ns 54660 ns 12402
HACL_blake2b_vec256_oneshot/1048576 878910 ns 878574 ns 793
HACL_blake2b_vec256_oneshot/16777216 14755193 ns 14751612 ns 49
libb2_blake2b_oneshot/0 162 ns 162 ns 4324084
libb2_blake2b_oneshot/16 175 ns 175 ns 4303296
libb2_blake2b_oneshot/256 282 ns 282 ns 2425427
libb2_blake2b_oneshot/4096 4318 ns 4313 ns 176491
libb2_blake2b_oneshot/65536 63438 ns 63411 ns 10497
libb2_blake2b_oneshot/1048576 987715 ns 987456 ns 708
libb2_blake2b_oneshot/16777216 16262424 ns 16262024 ns 41
HACL_blake2s_32_oneshot/0 137 ns 136 ns 5224504
HACL_blake2s_32_oneshot/16 133 ns 133 ns 5199706
HACL_blake2s_32_oneshot/256 447 ns 446 ns 1590439
HACL_blake2s_32_oneshot/4096 6792 ns 6789 ns 104370
HACL_blake2s_32_oneshot/65536 103446 ns 103393 ns 6687
HACL_blake2s_32_oneshot/1048576 1770925 ns 1769228 ns 403
HACL_blake2s_32_oneshot/16777216 27090139 ns 27082875 ns 24
HACL_blake2s_vec128_oneshot/0 121 ns 121 ns 5760132
HACL_blake2s_vec128_oneshot/16 127 ns 127 ns 5784359
HACL_blake2s_vec128_oneshot/256 378 ns 378 ns 1834978
HACL_blake2s_vec128_oneshot/4096 5727 ns 5725 ns 119742
HACL_blake2s_vec128_oneshot/65536 88395 ns 88384 ns 7822
HACL_blake2s_vec128_oneshot/1048576 1462898 ns 1462569 ns 490
HACL_blake2s_vec128_oneshot/16777216 24048063 ns 24040862 ns 29
libb2_blake2s_oneshot/0 102 ns 102 ns 6673913
libb2_blake2s_oneshot/16 105 ns 105 ns 6419663
libb2_blake2s_oneshot/256 367 ns 367 ns 1891554
libb2_blake2s_oneshot/4096 5540 ns 5537 ns 121671
libb2_blake2s_oneshot/65536 88250 ns 88211 ns 7868
libb2_blake2s_oneshot/1048576 1326051 ns 1325750 ns 519
libb2_blake2s_oneshot/16777216 22597565 ns 22584281 ns 32
```
interpretation:
- for Blake2b, HACL*/AVX2 10% faster than libb2, HACL*/portable C 16% slower than libb2
- for Blake2s, HACL*/AVX2 10% slower than libb2, HACL*/portable C 20% slower than libb2 (this is surprising; I'm seeing equal performance using another API, so this may simply be some missing `static inline`s on hot paths, see the sketch below)
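To make the `static inline` hypothesis concrete, here is a toy illustration (not HACL* source, and `rotr32`/`toy_rounds` are made-up names): a small helper defined in its own translation unit cannot be inlined into callers elsewhere without LTO, so a hot compression loop pays a function call per use; declaring the helper `static inline` in a shared header lets the compiler fold it into the caller's loop.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* If this lived in its own .c file, every use from another translation unit
 * would be an out-of-line call (absent LTO); as a `static inline` in a shared
 * header, it disappears into the caller's loop. */
static inline uint32_t rotr32(uint32_t x, uint32_t n)
{
    return (x >> n) | (x << (32u - n));
}

/* Toy hot loop standing in for a compression function: the benefit comes from
 * rotr32 being inlined rather than called once per word per round. */
static void toy_rounds(uint32_t state[16], size_t nrounds)
{
    for (size_t r = 0; r < nrounds; r++) {
        for (int i = 0; i < 16; i++) {
            state[i] = rotr32(state[i] ^ state[(i + 1) & 15], 7);
        }
    }
}

int main(void)
{
    uint32_t state[16] = {0};
    state[0] = 0x6a09e667u;        /* arbitrary starting value */
    toy_rounds(state, 12);
    printf("%08x\n", state[0]);    /* keep the call observable */
    return 0;
}
```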
**AVX2 desktop machine, Haswell**

```
HACL_blake2b_32_oneshot/0 211 ns 211 ns 3096750
HACL_blake2b_32_oneshot/16 212 ns 212 ns 3300382
HACL_blake2b_32_oneshot/256 388 ns 388 ns 1803603
HACL_blake2b_32_oneshot/4096 5638 ns 5638 ns 124313
HACL_blake2b_32_oneshot/65536 89905 ns 89903 ns 7774
HACL_blake2b_32_oneshot/1048576 1438161 ns 1438056 ns 488
HACL_blake2b_32_oneshot/16777216 22998077 ns 22997469 ns 30
HACL_blake2b_vec256_oneshot/0 168 ns 168 ns 4145151
HACL_blake2b_vec256_oneshot/16 168 ns 168 ns 4172085
HACL_blake2b_vec256_oneshot/256 309 ns 309 ns 2263628
HACL_blake2b_vec256_oneshot/4096 4507 ns 4507 ns 155137
HACL_blake2b_vec256_oneshot/65536 72069 ns 72068 ns 9748
HACL_blake2b_vec256_oneshot/1048576 1149298 ns 1149303 ns 610
HACL_blake2b_vec256_oneshot/16777216 18438313 ns 18436309 ns 38
libb2_blake2b_oneshot/0 190 ns 190 ns 3670823
libb2_blake2b_oneshot/16 194 ns 194 ns 3618244
libb2_blake2b_oneshot/256 353 ns 353 ns 1984245
libb2_blake2b_oneshot/4096 5225 ns 5225 ns 134636
libb2_blake2b_oneshot/65536 82645 ns 82642 ns 8502
libb2_blake2b_oneshot/1048576 1328695 ns 1328701 ns 529
libb2_blake2b_oneshot/16777216 21131069 ns 21130208 ns 33
HACL_blake2s_32_oneshot/0 172 ns 172 ns 4080561
HACL_blake2s_32_oneshot/16 174 ns 174 ns 4024023
HACL_blake2s_32_oneshot/256 605 ns 605 ns 1154699
HACL_blake2s_32_oneshot/4096 9210 ns 9210 ns 76047
HACL_blake2s_32_oneshot/65536 147054 ns 147054 ns 4744
HACL_blake2s_32_oneshot/1048576 2354038 ns 2354016 ns 296
HACL_blake2s_32_oneshot/16777216 37686817 ns 37686992 ns 19
HACL_blake2s_vec128_oneshot/0 150 ns 150 ns 4686347
HACL_blake2s_vec128_oneshot/16 151 ns 151 ns 4628570
HACL_blake2s_vec128_oneshot/256 513 ns 513 ns 1367690
HACL_blake2s_vec128_oneshot/4096 7748 ns 7748 ns 90207
HACL_blake2s_vec128_oneshot/65536 123277 ns 123276 ns 5676
HACL_blake2s_vec128_oneshot/1048576 1971694 ns 1971703 ns 355
HACL_blake2s_vec128_oneshot/16777216 31514541 ns 31512743 ns 22
libb2_blake2s_oneshot/0 151 ns 151 ns 4651126
libb2_blake2s_oneshot/16 154 ns 154 ns 4542519
libb2_blake2s_oneshot/256 542 ns 542 ns 1293258
libb2_blake2s_oneshot/4096 8381 ns 8381 ns 83415
libb2_blake2s_oneshot/65536 133872 ns 133869 ns 5217
libb2_blake2s_oneshot/1048576 2140551 ns 2140560 ns 327
libb2_blake2s_oneshot/16777216 34272522 ns 34271712 ns 20
```
interpretation:
- HACL*/AVX2 13% faster than libb2, HACL*/portable C 8% slower than libb2 (Blake2b)
- HACL*/AVX2 8% faster than libb2, HACL*/portable C 19% slower than libb2 (Blake2s)
This leaves us with several options. I would love to have your opinion, @gpshead.
- CPython loses the ability to build against libb2 and simply packages HACL's portable versions, which offer a significant performance boost on ARM but are slightly slower on Intel.
  - Pros: the CPython build is simplified, an external dependency is dropped, and there is a compelling story with simple C code everywhere.
  - Cons: slight performance impact on Intel.
- CPython loses the ability to build against libb2 and packages both HACL's portable versions and its vectorized (128-bit and 256-bit) versions, to be enabled on Intel (AVX and AVX2) but not on NEON (see the issue with high-latency vector shifts above).
  - Pros: CPython increases Blake2 performance across the board (Intel and ARM), plus the build simplification and the loss of an extra dependency.
  - Cons: I need to author a CPU detection layer and fiddle with the CPython build (not a big deal, happy to do so; a sketch of such a layer follows this list).
- CPython maintains HACL* and libb2 side by side.
  - Pros: none.
  - Cons: duplication of code, a build nightmare, and the need to deal with two different APIs for the bindings from C to the Python module.
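For option 2, the CPU detection layer could be quite small. Here is a hedged sketch, assuming GCC/Clang on x86-64; the two hash functions are stubs with made-up names standing in for the real HACL* entry points, not the actual API.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef void (*blake2b_fn)(uint8_t *out, size_t outlen,
                           const uint8_t *in, size_t inlen);

/* Stub standing in for the portable ("32") implementation. */
static void blake2b_portable_stub(uint8_t *out, size_t outlen,
                                  const uint8_t *in, size_t inlen)
{
    (void)in; (void)inlen;
    for (size_t i = 0; i < outlen; i++) out[i] = 0x00;
}

/* Stub standing in for the vec256 (AVX2) implementation. */
static void blake2b_vec256_stub(uint8_t *out, size_t outlen,
                                const uint8_t *in, size_t inlen)
{
    (void)in; (void)inlen;
    for (size_t i = 0; i < outlen; i++) out[i] = 0xff;
}

/* Run once (e.g. at module init) and cache the result. */
static blake2b_fn select_blake2b(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* GCC/Clang expose the CPUID check directly. */
    if (__builtin_cpu_supports("avx2")) {
        return blake2b_vec256_stub;
    }
#endif
    /* ARM stays on the portable code, per the NEON latency issue above. */
    return blake2b_portable_stub;
}

int main(void)
{
    blake2b_fn blake2b = select_blake2b();
    uint8_t digest[64];
    blake2b(digest, sizeof digest, NULL, 0);
    printf("dispatched to %s\n",
           blake2b == blake2b_vec256_stub ? "vec256 (AVX2)" : "portable C");
    return 0;
}
```

In CPython this selection would presumably run once at module init, with the resulting function pointer cached, so dispatch adds no per-call cost.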
Please share thoughts. Thanks!
CC @R1kM, who provided considerable help with the libb2 benchmarking.