cuda.empty_cache in trainer.py slows down training · Issue #31372 · huggingface/transformers
System Info
- transformers version: 4.41.0.dev0
- Platform: Linux-5.15.153.1-2.cm2-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0.dev20240516+cu118 (True)
- Tensorflow version (GPU?): 2.15.1 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```
torchrun --nproc_per_node 8 run_clm.py \
  --model_name_or_path microsoft/Phi-3-mini-4k-instruct --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --output_dir output_dir --overwrite_output_dir \
  --block_size 2048 --deepspeed zero_stage_2.json \
  --fp16 --per_device_train_batch_size 1 --num_train_epochs 2 --max_steps 12 --evaluation_strategy no --save_strategy no --remove_unused_columns False
```
Here's my branch for reference: https://github.com/jingyanwangms/transformers/tree/jingywa/phi-3
The branch contains just a few lines of code changes that should not have performance implications. The profile below was captured without any explicit torch.cuda.synchronize() calls.
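For context, here is a minimal sketch (hypothetical, not the exact trainer.py code from #28769) of the pattern being compared; the only difference between the two runs below is whether torch.cuda.empty_cache() is called inside the training step.

```python
import torch

def training_step(model, inputs, optimizer):
    # Standard forward/backward/optimizer step; model, inputs and optimizer
    # are placeholders for whatever the Trainer actually runs.
    loss = model(**inputs).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # The call under discussion: commenting this out gives the
    # "without empty cache" measurement below.
    torch.cuda.empty_cache()
    return loss.detach()
```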
Expected behavior
With empty cache
11 steps: 10.283s, 10 steps (excluding first step): 8.99s (10.283-1.284)
When zoomed in, there are cudaMalloc and cudaFree calls visible in the profile.
Without empty cache
Obtained by simply commenting out the torch.cuda.empty_cache() call.
11 steps: 9.796s, 10 steps (excluding first step): 8.49s (9.796-1.298)
In both measurements, the run with empty_cache is roughly 5%-6% slower.
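For reference, a sketch of how such a zoomed-in view can be captured with torch.profiler (an assumed setup, not necessarily the one used for the numbers above); cudaMalloc and cudaFree appear as CUDA runtime events in the exported trace. run_step is a hypothetical callable that executes one training step.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_steps(run_step, num_steps=11):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(num_steps):
            run_step()
        torch.cuda.synchronize()
    # Open trace.json in chrome://tracing or Perfetto to zoom in on
    # cudaMalloc / cudaFree runtime calls.
    prof.export_chrome_trace("trace.json")
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```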
The change was introduced in #28769. cudaMalloc and cudaFree are very expensive (slow) operations, so they need to be introduced carefully.
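A small illustration of the caching-allocator behavior assumed here: empty_cache() returns cached blocks to the driver via cudaFree, so the next allocation has to go back through cudaMalloc instead of reusing a cached block.

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
del x
# The freed tensor's memory stays in PyTorch's cache, ready for reuse.
print(torch.cuda.memory_reserved())   # > 0

torch.cuda.empty_cache()               # releases cached blocks via cudaFree
print(torch.cuda.memory_reserved())    # 0 in this isolated example

# The next allocation cannot reuse a cached block, so it triggers cudaMalloc.
y = torch.randn(1024, 1024, device="cuda")
```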