Gain up to 2% speed on Intel Silvermont & Haswell processors. by npaglieri · Pull Request #91 · google/gemmlowp

This speedup is achieved by reordering SSE kernel instructions to lower contention on CPU execution units.

All instruction dependencies are preserved, so this change should not introduce any difference in behavior.
The code may, however, read as less straightforward because instruction sequences are now interleaved.
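As a rough illustration of the kind of reordering described above, here is a minimal sketch using SSE2 intrinsics. The function names and the two-chain workload are hypothetical and not taken from the gemmlowp kernels. With intrinsics the compiler is free to reschedule instructions itself, so this only shows the idea: both versions issue the same operations with the same dependencies, in a different order.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; not taken from gemmlowp.
// Each runs two independent chains: load, multiply-add, accumulate, store.
// a and b point to 16 int16 values each; acc points to 8 int32 accumulators.

// Straightforward order: chain 0 runs to completion before chain 1 starts.
inline void MaddTwoChainsSequential(const std::int16_t* a,
                                    const std::int16_t* b,
                                    std::int32_t* acc) {
  assert(a && b && acc);
  const __m128i* va = reinterpret_cast<const __m128i*>(a);
  const __m128i* vb = reinterpret_cast<const __m128i*>(b);
  __m128i* vacc = reinterpret_cast<__m128i*>(acc);

  __m128i a0 = _mm_loadu_si128(va);
  __m128i b0 = _mm_loadu_si128(vb);
  __m128i p0 = _mm_madd_epi16(a0, b0);
  __m128i s0 = _mm_add_epi32(_mm_loadu_si128(vacc), p0);
  _mm_storeu_si128(vacc, s0);

  __m128i a1 = _mm_loadu_si128(va + 1);
  __m128i b1 = _mm_loadu_si128(vb + 1);
  __m128i p1 = _mm_madd_epi16(a1, b1);
  __m128i s1 = _mm_add_epi32(_mm_loadu_si128(vacc + 1), p1);
  _mm_storeu_si128(vacc + 1, s1);
}

// Interleaved order: the same operations, but each stage of chain 1 is issued
// right after the matching stage of chain 0. Every instruction still consumes
// exactly the same inputs as in the sequential version.
inline void MaddTwoChainsInterleaved(const std::int16_t* a,
                                     const std::int16_t* b,
                                     std::int32_t* acc) {
  assert(a && b && acc);
  const __m128i* va = reinterpret_cast<const __m128i*>(a);
  const __m128i* vb = reinterpret_cast<const __m128i*>(b);
  __m128i* vacc = reinterpret_cast<__m128i*>(acc);

  __m128i a0 = _mm_loadu_si128(va);
  __m128i a1 = _mm_loadu_si128(va + 1);
  __m128i b0 = _mm_loadu_si128(vb);
  __m128i b1 = _mm_loadu_si128(vb + 1);
  __m128i p0 = _mm_madd_epi16(a0, b0);
  __m128i p1 = _mm_madd_epi16(a1, b1);
  __m128i s0 = _mm_add_epi32(_mm_loadu_si128(vacc), p0);
  __m128i s1 = _mm_add_epi32(_mm_loadu_si128(vacc + 1), p1);
  _mm_storeu_si128(vacc, s0);
  _mm_storeu_si128(vacc + 1, s1);
}
```

Both functions write identical values to `acc` for the same inputs; only the issue order differs, which is the property this change relies on.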

The metrics below result from averaging 100 single-threaded benchmark executions.

Silvermont

benchmark size	original GFLOP/s	optimized GFLOP/s	performance ratio	performance gain
10x10x10	1.172	1.183	1.009	0.94%
20x20x20	3.193	3.205	1.004	0.38%
30x30x30	4.297	4.308	1.003	0.26%
40x40x40	5.633	5.658	1.004	0.44%
50x50x50	5.909	5.952	1.007	0.73%
60x60x60	7.853	7.897	1.006	0.56%
64x256x147	10.170	10.330	1.016	1.57%
100x100x1	1.236	1.237	1.001	0.08%
100x100x100	9.017	9.117	1.011	1.11%
100x1000x100	11.350	11.530	1.016	1.59%
1000x1000x1	1.488	1.510	1.015	1.48%
1000x1000x10	7.818	7.917	1.013	1.27%
1000x1000x100	12.580	12.770	1.015	1.51%
1000x1000x1000	13.100	13.390	1.022	2.21%
average gain	1.01%

Haswell

benchmark size	original GFLOP/s	optimized GFLOP/s	performance ratio	performance gain
10x10x10	3.423	3.442	1.006	0.56%
20x20x20	9.404	9.518	1.012	1.21%
30x30x30	12.760	12.950	1.015	1.49%
40x40x40	17.350	17.520	1.010	0.98%
50x50x50	18.280	18.540	1.014	1.42%
60x60x60	23.860	24.250	1.016	1.63%
64x256x147	31.260	31.830	1.018	1.82%
100x100x1	3.822	3.825	1.001	0.08%
100x100x100	27.550	27.920	1.013	1.34%
100x1000x100	34.910	35.530	1.018	1.78%
1000x1000x1	4.958	5.016	1.012	1.17%
1000x1000x10	24.950	25.310	1.014	1.44%
1000x1000x100	38.900	39.590	1.018	1.77%
1000x1000x1000	40.310	41.120	1.020	2.01%
average gain	1.34%
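
For reference, the ratio and gain columns above can be derived from the two throughput columns as sketched below. The helper names are hypothetical and this is not the script used for the PR. Taking the "average gain" row to be the arithmetic mean of the per-size gains is an assumption, although it is consistent with the figures shown.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helpers for illustration; not the tooling used for this PR.

// Arithmetic mean of repeated measurements, e.g. the 100 runs per benchmark
// size, or (by assumption) the per-size gains behind the "average gain" row.
double Mean(const std::vector<double>& samples) {
  assert(!samples.empty());
  return std::accumulate(samples.begin(), samples.end(), 0.0) /
         static_cast<double>(samples.size());
}

// Ratio of optimized to original throughput (the "performance ratio" column).
double PerformanceRatio(double original, double optimized) {
  assert(original > 0.0);
  return optimized / original;
}

// Relative improvement in percent (the "performance gain" column).
double PerformanceGainPercent(double original, double optimized) {
  return (PerformanceRatio(original, optimized) - 1.0) * 100.0;
}
```

For example, `PerformanceGainPercent(13.100, 13.390)` evaluates to roughly 2.21, matching the Silvermont 1000x1000x1000 row.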