zstd: x86 assembler implementation of sequenceDecs.decode by WojciechMula · Pull Request #528 · klauspost/compress

This PR adds two assembler implementations of sequenceDecs.decode: one for plain x86 and one for x86 with BMI2. Part of #515.

Since the benchmarks use decodeSync, I temporarily replaced its implementation with one that calls decode and then execute, at the cost of allocating the seqVals array every time.
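
To illustrate the split described above, here is a minimal sketch of a two-phase decoder: one function produces a buffer of sequences, another applies them to the output. It is not the code from this PR. The names seqVals, decode and execute are borrowed from the description, but the struct fields (ll, ml, mo), the signatures, and the helper decodeThenExecute are hypothetical, and the "decode" phase here only copies already-decoded values to mimic the per-call allocation.

```go
package main

import "fmt"

// seqVals is a hypothetical stand-in for one decoded sequence:
// copy ll literal bytes, then copy ml bytes starting mo bytes back in the output.
type seqVals struct {
	ll, ml, mo int
}

// decode stands in for the entropy-decoding phase. In this sketch the input is
// already decoded, so it only copies it into a freshly allocated slice,
// mirroring the "allocate seqVals every time" cost mentioned in the text.
func decode(encoded []seqVals) []seqVals {
	seqs := make([]seqVals, len(encoded))
	copy(seqs, encoded)
	return seqs
}

// execute applies the sequences to the literals and returns the output.
// No validation is done; a bad offset would panic in this sketch.
func execute(seqs []seqVals, literals []byte) []byte {
	var out []byte
	for _, s := range seqs {
		out = append(out, literals[:s.ll]...)
		literals = literals[s.ll:]
		start := len(out) - s.mo
		// Copy byte by byte so that overlapping matches (ml > mo) repeat correctly.
		for i := 0; i < s.ml; i++ {
			out = append(out, out[start+i])
		}
	}
	// Any literals left after the last sequence are appended verbatim.
	return append(out, literals...)
}

// decodeThenExecute is the two-phase replacement for a fused decodeSync-style loop.
func decodeThenExecute(encoded []seqVals, literals []byte) []byte {
	return execute(decode(encoded), literals)
}

func main() {
	seqs := []seqVals{{ll: 3, ml: 6, mo: 3}}
	fmt.Printf("%s\n", decodeThenExecute(seqs, []byte("abcd"))) // prints "abcabcabcd"
}
```

A fused version would decode one sequence and apply it immediately, which avoids the intermediate slice entirely.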

There are some nice improvements, IMHO, and small regressions in a few cases. From my previous experience I can tell that we'll get quite a big speedup once execute is rewritten too. And of course we'll get the biggest speedup when we fuse decode and execute into a single procedure.

Marking the PR as a draft because one test, TestNewDecoderBad/Reader-4/6f88497edbc9059998f9e6d0ea0d0eed8d8af38d.zst, fails. I have to investigate why. [fixed]

Below are benchmarks.

Comparison of old.txt with new.txt

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            237.60       238.72       1.00x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        764.22       868.49       1.14x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         192.39       197.70       1.03x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           226.37       233.74       1.03x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         211.46       216.24       1.02x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          189.03       190.01       1.01x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1743.29      1951.14      1.12x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3079.05      3309.04      1.07x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       6696.56      7926.76      1.18x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             337.14       365.98       1.09x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 613.59       687.53       1.12x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        345.78       374.54       1.08x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               244.27       241.55       0.99x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           785.09       912.82       1.16x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            200.18       203.32       1.02x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              237.47       239.27       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            211.75       214.06       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             190.52       188.88       0.99x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1397.55      1488.07      1.06x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          3428.09      3716.01      1.08x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          9548.90      10887.83     1.14x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                356.48       386.03       1.08x
BenchmarkDecoder_DecodeAll/html.zst-16                                    598.01       666.57       1.11x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           333.55       364.23       1.09x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      208.57       222.10       1.06x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      206.44       206.26       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       219.64       224.09       1.02x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         215.01       212.89       0.99x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9564.96      10855.81     1.13x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          230.88       242.77       1.05x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           300.04       357.40       1.19x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             481.01       629.48       1.31x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1153.28      1167.83      1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1178.62      1197.84      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1061.04      1085.91      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 431.41       434.83       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 260.26       288.78       1.11x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 178.69       182.34       1.02x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  178.37       183.01       1.03x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    163.05       166.78       1.02x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       379.84       413.38       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       364.05       398.18       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        396.48       440.21       1.11x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          346.17       375.39       1.08x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9561.29      10873.23     1.14x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         224.96       240.51       1.07x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          303.94       364.00       1.20x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            476.76       635.21       1.33x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1602.64      1693.12      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1470.20      1556.52      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1781.10      1894.80      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1542.00      1661.43      1.08x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9556.32      10879.80     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9571.19      10882.04     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9572.19      10887.11     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9566.54      10886.24     1.14x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     1343.71      1607.61      1.20x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     1311.12      1407.07      1.07x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      1401.88      1587.61      1.13x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        1402.53      1484.78      1.06x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         66679.48     94849.27     1.42x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         1306.06      1502.06      1.15x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          1927.42      2336.68      1.21x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            3472.36      4863.07      1.40x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             6276.41      6383.51      1.02x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5490.46      5771.14      1.05x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              6008.30      6052.66      1.01x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                3799.77      3895.76      1.03x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1412.04      1441.62      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                962.89       949.33       0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 984.52       969.09       0.98x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   801.71       795.11       0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      1738.74      1974.58      1.14x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      1633.98      1841.22      1.13x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       1817.99      2012.59      1.11x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         1717.94      1874.86      1.09x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        66526.60     96359.49     1.45x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        1268.55      1490.20      1.17x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         1947.34      2373.92      1.22x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           3458.55      4850.24      1.40x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   8243.76      8724.07      1.06x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   8197.25      8948.34      1.09x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    9020.42      9939.28      1.10x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      10641.45     11529.73     1.08x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    66560.21     95518.08     1.44x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    66587.20     94626.59     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     66651.43     94356.64     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       66512.25     95444.30     1.43x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       1466.81      1604.80      1.09x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   4831.35      5497.25      1.14x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1215.11      1374.36      1.13x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      1464.34      1623.73      1.11x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1240.22      1406.06      1.13x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1094.77      1206.81      1.10x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        10029.54     11377.73     1.13x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  19167.95     22324.31     1.16x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  66383.12     95910.86     1.44x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        2687.11      3090.49      1.15x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            3748.35      4307.61      1.15x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1959.71      2145.97      1.10x

Comparison of old.txt with new-bmi2.txt

benchmark                                                                 old MB/s     new MB/s     speedup
BenchmarkDecoder_DecoderSmall/kppkn.gtb.zst-16                            237.60       238.72       1.00x
BenchmarkDecoder_DecoderSmall/geo.protodata.zst-16                        764.22       868.49       1.14x
BenchmarkDecoder_DecoderSmall/plrabn12.txt.zst-16                         192.39       197.70       1.03x
BenchmarkDecoder_DecoderSmall/lcet10.txt.zst-16                           226.37       233.74       1.03x
BenchmarkDecoder_DecoderSmall/asyoulik.txt.zst-16                         211.46       216.24       1.02x
BenchmarkDecoder_DecoderSmall/alice29.txt.zst-16                          189.03       190.01       1.01x
BenchmarkDecoder_DecoderSmall/html_x_4.zst-16                             1743.29      1951.14      1.12x
BenchmarkDecoder_DecoderSmall/paper-100k.pdf.zst-16                       3079.05      3309.04      1.07x
BenchmarkDecoder_DecoderSmall/fireworks.jpeg.zst-16                       6696.56      7926.76      1.18x
BenchmarkDecoder_DecoderSmall/urls.10K.zst-16                             337.14       365.98       1.09x
BenchmarkDecoder_DecoderSmall/html.zst-16                                 613.59       687.53       1.12x
BenchmarkDecoder_DecoderSmall/comp-data.bin.zst-16                        345.78       374.54       1.08x
BenchmarkDecoder_DecodeAll/kppkn.gtb.zst-16                               244.27       241.55       0.99x
BenchmarkDecoder_DecodeAll/geo.protodata.zst-16                           785.09       912.82       1.16x
BenchmarkDecoder_DecodeAll/plrabn12.txt.zst-16                            200.18       203.32       1.02x
BenchmarkDecoder_DecodeAll/lcet10.txt.zst-16                              237.47       239.27       1.01x
BenchmarkDecoder_DecodeAll/asyoulik.txt.zst-16                            211.75       214.06       1.01x
BenchmarkDecoder_DecodeAll/alice29.txt.zst-16                             190.52       188.88       0.99x
BenchmarkDecoder_DecodeAll/html_x_4.zst-16                                1397.55      1488.07      1.06x
BenchmarkDecoder_DecodeAll/paper-100k.pdf.zst-16                          3428.09      3716.01      1.08x
BenchmarkDecoder_DecodeAll/fireworks.jpeg.zst-16                          9548.90      10887.83     1.14x
BenchmarkDecoder_DecodeAll/urls.10K.zst-16                                356.48       386.03       1.08x
BenchmarkDecoder_DecodeAll/html.zst-16                                    598.01       666.57       1.11x
BenchmarkDecoder_DecodeAll/comp-data.bin.zst-16                           333.55       364.23       1.09x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/fastest-16      208.57       222.10       1.06x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/default-16      206.44       206.26       1.00x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/better-16       219.64       224.09       1.02x
BenchmarkDecoder_DecodeAllFiles/Mark.Twain-Tom.Sawyer.txt/best-16         215.01       212.89       0.99x
BenchmarkDecoder_DecodeAllFiles/e.txt/fastest-16                          9564.96      10855.81     1.13x
BenchmarkDecoder_DecodeAllFiles/e.txt/default-16                          230.88       242.77       1.05x
BenchmarkDecoder_DecodeAllFiles/e.txt/better-16                           300.04       357.40       1.19x
BenchmarkDecoder_DecodeAllFiles/e.txt/best-16                             481.01       629.48       1.31x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/fastest-16              1153.28      1167.83      1.01x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/default-16              1178.62      1197.84      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/better-16               1061.04      1085.91      1.02x
BenchmarkDecoder_DecodeAllFiles/fse-artifact3.bin/best-16                 431.41       434.83       1.01x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/fastest-16                 260.26       288.78       1.11x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/default-16                 178.69       182.34       1.02x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/better-16                  178.37       183.01       1.03x
BenchmarkDecoder_DecodeAllFiles/gettysburg.txt/best-16                    163.05       166.78       1.02x
BenchmarkDecoder_DecodeAllFiles/html.txt/fastest-16                       379.84       413.38       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/default-16                       364.05       398.18       1.09x
BenchmarkDecoder_DecodeAllFiles/html.txt/better-16                        396.48       440.21       1.11x
BenchmarkDecoder_DecodeAllFiles/html.txt/best-16                          346.17       375.39       1.08x
BenchmarkDecoder_DecodeAllFiles/pi.txt/fastest-16                         9561.29      10873.23     1.14x
BenchmarkDecoder_DecodeAllFiles/pi.txt/default-16                         224.96       240.51       1.07x
BenchmarkDecoder_DecodeAllFiles/pi.txt/better-16                          303.94       364.00       1.20x
BenchmarkDecoder_DecodeAllFiles/pi.txt/best-16                            476.76       635.21       1.33x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/fastest-16                    1602.64      1693.12      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/default-16                    1470.20      1556.52      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/better-16                     1781.10      1894.80      1.06x
BenchmarkDecoder_DecodeAllFiles/pngdata.bin/best-16                       1542.00      1661.43      1.08x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/fastest-16                     9556.32      10879.80     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/default-16                     9571.19      10882.04     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/better-16                      9572.19      10887.11     1.14x
BenchmarkDecoder_DecodeAllFiles/sharnd.out/best-16                        9566.54      10886.24     1.14x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/fastest-16     1343.71      1607.61      1.20x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/default-16     1311.12      1407.07      1.07x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/better-16      1401.88      1587.61      1.13x
BenchmarkDecoder_DecodeAllFilesP/Mark.Twain-Tom.Sawyer.txt/best-16        1402.53      1484.78      1.06x
BenchmarkDecoder_DecodeAllFilesP/e.txt/fastest-16                         66679.48     94849.27     1.42x
BenchmarkDecoder_DecodeAllFilesP/e.txt/default-16                         1306.06      1502.06      1.15x
BenchmarkDecoder_DecodeAllFilesP/e.txt/better-16                          1927.42      2336.68      1.21x
BenchmarkDecoder_DecodeAllFilesP/e.txt/best-16                            3472.36      4863.07      1.40x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/fastest-16             6276.41      6383.51      1.02x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/default-16             5490.46      5771.14      1.05x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/better-16              6008.30      6052.66      1.01x
BenchmarkDecoder_DecodeAllFilesP/fse-artifact3.bin/best-16                3799.77      3895.76      1.03x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/fastest-16                1412.04      1441.62      1.02x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/default-16                962.89       949.33       0.99x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/better-16                 984.52       969.09       0.98x
BenchmarkDecoder_DecodeAllFilesP/gettysburg.txt/best-16                   801.71       795.11       0.99x
BenchmarkDecoder_DecodeAllFilesP/html.txt/fastest-16                      1738.74      1974.58      1.14x
BenchmarkDecoder_DecodeAllFilesP/html.txt/default-16                      1633.98      1841.22      1.13x
BenchmarkDecoder_DecodeAllFilesP/html.txt/better-16                       1817.99      2012.59      1.11x
BenchmarkDecoder_DecodeAllFilesP/html.txt/best-16                         1717.94      1874.86      1.09x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/fastest-16                        66526.60     96359.49     1.45x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/default-16                        1268.55      1490.20      1.17x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/better-16                         1947.34      2373.92      1.22x
BenchmarkDecoder_DecodeAllFilesP/pi.txt/best-16                           3458.55      4850.24      1.40x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/fastest-16                   8243.76      8724.07      1.06x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/default-16                   8197.25      8948.34      1.09x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/better-16                    9020.42      9939.28      1.10x
BenchmarkDecoder_DecodeAllFilesP/pngdata.bin/best-16                      10641.45     11529.73     1.08x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/fastest-16                    66560.21     95518.08     1.44x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/default-16                    66587.20     94626.59     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/better-16                     66651.43     94356.64     1.42x
BenchmarkDecoder_DecodeAllFilesP/sharnd.out/best-16                       66512.25     95444.30     1.43x
BenchmarkDecoder_DecodeAllParallel/kppkn.gtb.zst-16                       1466.81      1604.80      1.09x
BenchmarkDecoder_DecodeAllParallel/geo.protodata.zst-16                   4831.35      5497.25      1.14x
BenchmarkDecoder_DecodeAllParallel/plrabn12.txt.zst-16                    1215.11      1374.36      1.13x
BenchmarkDecoder_DecodeAllParallel/lcet10.txt.zst-16                      1464.34      1623.73      1.11x
BenchmarkDecoder_DecodeAllParallel/asyoulik.txt.zst-16                    1240.22      1406.06      1.13x
BenchmarkDecoder_DecodeAllParallel/alice29.txt.zst-16                     1094.77      1206.81      1.10x
BenchmarkDecoder_DecodeAllParallel/html_x_4.zst-16                        10029.54     11377.73     1.13x
BenchmarkDecoder_DecodeAllParallel/paper-100k.pdf.zst-16                  19167.95     22324.31     1.16x
BenchmarkDecoder_DecodeAllParallel/fireworks.jpeg.zst-16                  66383.12     95910.86     1.44x
BenchmarkDecoder_DecodeAllParallel/urls.10K.zst-16                        2687.11      3090.49      1.15x
BenchmarkDecoder_DecodeAllParallel/html.zst-16                            3748.35      4307.61      1.15x
BenchmarkDecoder_DecodeAllParallel/comp-data.bin.zst-16                   1959.71      2145.97      1.10x