Host-side OOM error while checkpointing of large models with torch-xla and glibc malloc · Issue #728 · aws-neuron/aws-neuron-sdk
When checkpointing large models, torch-xla can hit a host-side out-of-memory (OOM) error (pytorch/xla#3545) when the standard glibc malloc is used.
To work around this issue, use a different malloc implementation such as jemalloc or tcmalloc (jemalloc is recommended).
For example, on Ubuntu, install jemalloc and set LD_PRELOAD as follows:
apt install libjemalloc2
export LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
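The library path above is specific to x86_64 Ubuntu, and a preload entry that points at a missing file is skipped by the dynamic loader with only a warning, so the workaround can silently fail to apply. As a minimal sketch, the export can be guarded with a file check. The helper name `jemalloc_preload_value` is hypothetical and not part of jemalloc or the Neuron SDK; the sketch also assumes any existing LD_PRELOAD entries should be kept.

```shell
# Hypothetical helper (not part of any package): prints the LD_PRELOAD value
# to use, or returns non-zero if the library file does not exist.
#   $1 = path to the jemalloc shared library
#   $2 = current LD_PRELOAD value (may be empty)
jemalloc_preload_value() {
    lib="$1"
    current="$2"
    [ -f "$lib" ] || return 1
    # Put jemalloc first and keep any existing entries (colon-separated).
    printf '%s\n' "${lib}${current:+:$current}"
}

# Path taken from the instructions above; other architectures or
# distributions may install the library elsewhere.
if value=$(jemalloc_preload_value "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2" "${LD_PRELOAD:-}"); then
    export LD_PRELOAD="$value"
else
    echo "libjemalloc.so.2 not found; LD_PRELOAD left unchanged" >&2
fi
```

Once the training process is running, `grep jemalloc /proc/<pid>/maps` (with `<pid>` replaced by its process ID) should list the library if the preload took effect.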
For more information, see http://jemalloc.net/