Attention Layers
Training with attention
By default DALLE will use full attention for all layers, but you can specify the attention type per layer as follows.
full: full attention
axial_row: axial attention, along the rows of the image feature map
axial_col: axial attention, along the columns of the image feature map
conv_like: convolution-like attention, for the image feature map
    dalle = DALLE(
        # ...
        attn_types = ('full', 'axial_row', 'axial_col', 'conv_like')  # cycles between these four types of attention
    )
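The "cycles between" comment in the example above can be made concrete with a small sketch. It assumes the tuple is simply repeated in order until every layer has a type; the helper name `assign_attn_types` and the `depth` argument (number of layers) are hypothetical and used only for illustration, not taken from the library.

```python
from itertools import cycle, islice

def assign_attn_types(attn_types, depth):
    """Return one attention type per layer by repeating attn_types in order.

    Assumption: layer i receives attn_types[i % len(attn_types)].
    """
    return list(islice(cycle(attn_types), depth))

# Six layers cycling through the four types from the example above.
layer_types = assign_attn_types(('full', 'axial_row', 'axial_col', 'conv_like'), depth=6)
for index, attn_type in enumerate(layer_types):
    print(f"layer {index}: {attn_type}")
```

Under this assumption, a model with more layers than listed types wraps around to the first type again, so the order of the tuple decides which layers get full attention.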
Each type is an attempt to replicate the corresponding attention pattern from the scant details OpenAI has provided.
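To illustrate what the axial_row and axial_col types restrict, here is a simplified sketch that builds the attention pattern for a small square image feature map. It is not the library's implementation: it covers image positions only and ignores the text tokens and any causal masking a real model would apply. The function name `axial_mask` is hypothetical.

```python
def axial_mask(size, axis):
    """Boolean mask over a size x size feature map flattened row by row.

    mask[q][k] is True when query position q may attend to key position k.
    axis='row' keeps keys in the same row; axis='col' keeps keys in the same column.
    """
    if axis not in ('row', 'col'):
        raise ValueError("axis must be 'row' or 'col'")
    positions = [(r, c) for r in range(size) for c in range(size)]
    pick = 0 if axis == 'row' else 1  # compare row index or column index
    return [[q[pick] == k[pick] for k in positions] for q in positions]

row_mask = axial_mask(3, 'row')
col_mask = axial_mask(3, 'col')

# Position 4 is the centre of a 3x3 map (row 1, column 1).
print([k for k, ok in enumerate(row_mask[4]) if ok])  # keys in the same row
print([k for k, ok in enumerate(col_mask[4]) if ok])  # keys in the same column
```

In this sketch each query attends to `size` positions instead of `size * size`, which is the kind of saving that motivates using axial attention in place of full attention.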
What to use:
When in doubt - and if you don't need the VRAM/runtime savings, train with:
Sparse Attention - Requires CUDA 10.1 and a V100 GPU (for now):
If you can meet these requirements, it is worth the install.
    dalle = DALLE(
        # ...
        attn_types = ('full', 'sparse')  # cycles between full and sparse attention
    )