Gain up to 2% speed on Intel Silvermont & Haswell processors. by npaglieri · Pull Request #91 · google/gemmlowp

This speedup is achieved by reordering SSE kernel instructions to lower contention on CPU execution units.

All instruction dependencies are preserved, so this change should not introduce any difference in behavior.
The overall code structure may, however, look a bit less straightforward because instruction sequences are now interleaved.
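
As a rough illustration of why instruction order can matter even when dependencies are unchanged, here is a toy issue model in Python. It is not the gemmlowp kernel and not an accurate model of Silvermont or Haswell: the in-order, dual-issue pipeline, the unit names (`shuf`, `mul`, `add`), the latencies and the register names are all assumptions made up for this sketch.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    unit: str      # hypothetical execution unit the instruction needs
    dst: str       # register written (each one is written only once here)
    srcs: tuple    # registers read
    latency: int   # made-up cycles until dst becomes available


def simulate(program, width=2):
    """Toy in-order model: issue up to `width` instructions per cycle,
    at most one per unit, and only once all inputs are ready."""
    ready_at = {}          # register -> cycle at which its value is available
    cycle = 0
    next_instr = 0
    while next_instr < len(program):
        busy_units = set()
        issued = 0
        while next_instr < len(program) and issued < width:
            ins = program[next_instr]
            inputs_ready = all(ready_at.get(s, 0) <= cycle for s in ins.srcs)
            if ins.unit in busy_units or not inputs_ready:
                break      # in-order: a stalled instruction blocks later ones
            ready_at[ins.dst] = cycle + ins.latency
            busy_units.add(ins.unit)
            issued += 1
            next_instr += 1
        cycle += 1
    return max(ready_at.values(), default=0)


def respects_dependencies(program):
    """True if every value produced inside the program is written
    before any instruction reads it."""
    produced = {ins.dst for ins in program}
    seen = set()
    for ins in program:
        if any(s in produced and s not in seen for s in ins.srcs):
            return False
        seen.add(ins.dst)
    return True


def lane(tag):
    """One independent shuffle -> multiply -> accumulate chain."""
    return [
        Instr("shuf", f"{tag}0", (f"rhs{tag}",), 1),
        Instr("mul", f"{tag}1", (f"{tag}0", f"lhs{tag}"), 3),
        Instr("add", f"sum{tag}", (f"acc{tag}", f"{tag}1"), 1),
    ]


a, b = lane("A"), lane("B")
back_to_back = a + b                                        # A's chain, then B's
interleaved = [ins for pair in zip(a, b) for ins in pair]   # A, B, A, B, ...

print(respects_dependencies(back_to_back), respects_dependencies(interleaved))
print(simulate(back_to_back), simulate(interleaved))        # -> 9 6
```

In this model both orderings pass the dependency check, but the interleaved one completes in 6 cycles instead of 9 because fewer cycles are spent stalled. How much of that carries over to real hardware depends on the microarchitecture.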

The metrics below result from averaging 100 single-threaded benchmark executions.

Silvermont

| Benchmark size | Original (GFlop/s) | Optimized (GFlop/s) | Performance ratio | Performance gain |
|---|---|---|---|---|
| 10x10x10 | 1.172 | 1.183 | 1.009 | 0.94% |
| 20x20x20 | 3.193 | 3.205 | 1.004 | 0.38% |
| 30x30x30 | 4.297 | 4.308 | 1.003 | 0.26% |
| 40x40x40 | 5.633 | 5.658 | 1.004 | 0.44% |
| 50x50x50 | 5.909 | 5.952 | 1.007 | 0.73% |
| 60x60x60 | 7.853 | 7.897 | 1.006 | 0.56% |
| 64x256x147 | 10.170 | 10.330 | 1.016 | 1.57% |
| 100x100x1 | 1.236 | 1.237 | 1.001 | 0.08% |
| 100x100x100 | 9.017 | 9.117 | 1.011 | 1.11% |
| 100x1000x100 | 11.350 | 11.530 | 1.016 | 1.59% |
| 1000x1000x1 | 1.488 | 1.510 | 1.015 | 1.48% |
| 1000x1000x10 | 7.818 | 7.917 | 1.013 | 1.27% |
| 1000x1000x100 | 12.580 | 12.770 | 1.015 | 1.51% |
| 1000x1000x1000 | 13.100 | 13.390 | 1.022 | 2.21% |
| Average gain | | | | 1.01% |

Haswell

| Benchmark size | Original (GFlop/s) | Optimized (GFlop/s) | Performance ratio | Performance gain |
|---|---|---|---|---|
| 10x10x10 | 3.423 | 3.442 | 1.006 | 0.56% |
| 20x20x20 | 9.404 | 9.518 | 1.012 | 1.21% |
| 30x30x30 | 12.760 | 12.950 | 1.015 | 1.49% |
| 40x40x40 | 17.350 | 17.520 | 1.010 | 0.98% |
| 50x50x50 | 18.280 | 18.540 | 1.014 | 1.42% |
| 60x60x60 | 23.860 | 24.250 | 1.016 | 1.63% |
| 64x256x147 | 31.260 | 31.830 | 1.018 | 1.82% |
| 100x100x1 | 3.822 | 3.825 | 1.001 | 0.08% |
| 100x100x100 | 27.550 | 27.920 | 1.013 | 1.34% |
| 100x1000x100 | 34.910 | 35.530 | 1.018 | 1.78% |
| 1000x1000x1 | 4.958 | 5.016 | 1.012 | 1.17% |
| 1000x1000x10 | 24.950 | 25.310 | 1.014 | 1.44% |
| 1000x1000x100 | 38.900 | 39.590 | 1.018 | 1.77% |
| 1000x1000x1000 | 40.310 | 41.120 | 1.020 | 2.01% |
| Average gain | | | | 1.34% |
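
For readers who want to check the derived columns, the sketch below recomputes the performance ratio and gain from the two throughput columns, using the Silvermont rows. It assumes the ratio is optimized divided by original and that the average gain is the arithmetic mean of the per-size gains. That reading is consistent with the published figures but is not stated in the PR, and the function and variable names are made up for illustration.

```python
# (original, optimized) throughput per benchmark size, copied from the Silvermont table
silvermont = {
    "10x10x10": (1.172, 1.183),
    "20x20x20": (3.193, 3.205),
    "30x30x30": (4.297, 4.308),
    "40x40x40": (5.633, 5.658),
    "50x50x50": (5.909, 5.952),
    "60x60x60": (7.853, 7.897),
    "64x256x147": (10.170, 10.330),
    "100x100x1": (1.236, 1.237),
    "100x100x100": (9.017, 9.117),
    "100x1000x100": (11.350, 11.530),
    "1000x1000x1": (1.488, 1.510),
    "1000x1000x10": (7.818, 7.917),
    "1000x1000x100": (12.58, 12.770),
    "1000x1000x1000": (13.100, 13.390),
}


def ratio_and_gain(original, optimized):
    """Return (performance ratio, gain in percent) for one benchmark size."""
    ratio = optimized / original
    return ratio, (ratio - 1.0) * 100.0


def average_gain(rows):
    """Arithmetic mean of the per-size gains, in percent (assumed definition)."""
    gains = [ratio_and_gain(orig, opt)[1] for orig, opt in rows.values()]
    return sum(gains) / len(gains)


ratio, gain = ratio_and_gain(*silvermont["64x256x147"])
print(f"64x256x147: ratio {ratio:.3f}, gain {gain:.2f}%")
print(f"average gain: {average_gain(silvermont):.2f}%")
```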