Linformer: Self-Attention with Linear Complexity

Abstract: Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
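The abstract does not spell out how the linear-complexity mechanism is built. The following is a minimal sketch, assuming the construction described in the body of the paper: keys and values are projected along the sequence axis from length n down to a fixed size k before attention is computed, so the score matrix is n x k instead of n x n. The function names, the variable `k_dim`, and the random projection matrices `e` and `f` are illustrative assumptions (in the paper the projections are learned), and the code is single-head, pure Python, and not optimized.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def full_attention(q, k, v):
    """Standard scaled dot-product attention: the score matrix is n x n."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v)


def projected_attention(q, k, v, e, f):
    """Attention with keys/values projected along the sequence axis.

    e and f have shape (k_dim x n); they are assumed, hypothetical
    stand-ins for the learned projections. The score matrix is n x k_dim.
    """
    d = len(q[0])
    k_proj = matmul(e, k)  # (k_dim x n)(n x d) -> k_dim x d
    v_proj = matmul(f, v)  # (k_dim x n)(n x d) -> k_dim x d
    scores = matmul(q, transpose(k_proj))  # n x k_dim
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scores), v_proj)  # n x d


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d, k_dim = 64, 8, 16  # sequence length, head dimension, projected length
q = random_matrix(n, d, rng)
k = random_matrix(n, d, rng)
v = random_matrix(n, d, rng)
e = random_matrix(k_dim, n, rng)  # random here; learned in practice
f = random_matrix(k_dim, n, rng)

out = projected_attention(q, k, v, e, f)
score_entries_full = n * n            # grows quadratically with n
score_entries_projected = n * k_dim   # grows linearly with n for fixed k_dim
```

With these sizes the projected score matrix holds 1024 entries instead of 4096, and the gap widens linearly as n grows while `k_dim` stays fixed. The output keeps the same n x d shape as standard attention, which is what lets the projected variant be dropped in as a replacement.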

Submission history

From: Sinong Wang
[v1] Mon, 8 Jun 2020 17:37:52 UTC (945 KB)
[v2] Tue, 9 Jun 2020 03:03:56 UTC (945 KB)
[v3] Sun, 14 Jun 2020 08:15:54 UTC (945 KB)