Attention Layers

Training with attention

By default DALLE will use full attention for all layers, but you can specify the attention type per layer as follows.

dalle = DALLE(
    # ...
    attn_types = ('full', 'axial_row', 'axial_col', 'conv_like')  # cycles between these four types of attention
)
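To make "cycles between these types" concrete, here is a minimal sketch of how a tuple of attention types could be spread over a stack of layers. It assumes that cycling means repeating the tuple in order until every layer has a type. The helper name `assign_attn_types` and the `depth` argument are hypothetical and only illustrate the idea; they are not taken from the library.

```python
from itertools import cycle, islice

def assign_attn_types(attn_types, depth):
    """Return one attention type per layer by repeating attn_types in order.

    Hypothetical helper for illustration; not part of the DALLE API.
    """
    if not attn_types:
        raise ValueError("attn_types must contain at least one entry")
    # cycle() repeats the tuple forever; islice() cuts it off at `depth` layers.
    return list(islice(cycle(attn_types), depth))

# Example: six layers sharing four attention types.
layers = assign_attn_types(('full', 'axial_row', 'axial_col', 'conv_like'), depth=6)
for index, attn_type in enumerate(layers):
    print(index, attn_type)
```

Under this assumption, layers 0 and 4 would both use full attention, layers 1 and 5 would both use `axial_row`, and so on.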

Each type is an attempt to replicate the scant details OpenAI has published on the matter.
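As a rough illustration of what the axial types might restrict, the sketch below builds a boolean mask over a square image grid. It assumes `axial_row` lets each image position attend only to positions in the same row and `axial_col` only to positions in the same column. It ignores causal masking and text tokens for simplicity, and the function `axial_mask` is hypothetical, not the library's implementation.

```python
def axial_mask(size, axis):
    """Build a (size*size) x (size*size) boolean mask for a square grid.

    mask[q][k] is True when query position q may attend to key position k.
    Positions are numbered row by row. Hypothetical helper for illustration.
    """
    if axis not in ('row', 'col'):
        raise ValueError("axis must be 'row' or 'col'")
    total = size * size
    mask = []
    for q in range(total):
        q_row, q_col = divmod(q, size)
        allowed = []
        for k in range(total):
            k_row, k_col = divmod(k, size)
            # Keep only keys that share the chosen axis with the query.
            same_axis = (q_row == k_row) if axis == 'row' else (q_col == k_col)
            allowed.append(same_axis)
        mask.append(allowed)
    return mask

# Example: on a 4 x 4 grid each position attends to 4 of the 16 positions.
row_mask = axial_mask(4, 'row')
print(sum(row_mask[0]))
```

The saving comes from the sparsity: each position attends to `size` positions instead of `size * size`, which is where the VRAM and runtime reductions relative to full attention would come from under this assumption.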

What to use:

When in doubt, and if you don't need the VRAM/runtime savings, train with full attention, which is the default.

Sparse Attention - Requires CUDA 10.1 and a V100 GPU (for now):

If you can meet these requirements, this is worth the install.

[Install Deepspeed]

dalle = DALLE(
    # ...
    attn_types = ('full', 'sparse')  # cycles between full and sparse attention
)