cuda.empty_cache in trainer.py slows down training · Issue #31372 · huggingface/transformers


Who can help?

@muellerzr @SunMarc

Reproduction

```
torchrun --nproc_per_node 8 run_clm.py \
  --model_name_or_path microsoft/Phi-3-mini-4k-instruct \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --do_train --output_dir output_dir --overwrite_output_dir \
  --block_size 2048 --deepspeed zero_stage_2.json \
  --fp16 --per_device_train_batch_size 1 --num_train_epochs 2 --max_steps 12 \
  --evaluation_strategy no --save_strategy no --remove_unused_columns False
```
Here's my branch for reference: https://github.com/jingyanwangms/transformers/tree/jingywa/phi-3
It contains just a few lines of code changes that should not have performance implications. The profiles below were captured without any torch.cuda.synchronize() calls.
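For anyone reproducing this without the full setup, the same cudaMalloc/cudaFree activity shows up when profiling a toy step loop with torch.profiler (a minimal self-contained sketch, not the exact profiler setup used for the screenshots below):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Toy step loop: flushing the cache every step forces the caching allocator
# to re-request device memory, which shows up as cudaMalloc/cudaFree in traces.
model = torch.nn.Linear(4096, 4096).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(12):
        x = torch.randn(64, 4096, device="cuda")
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        torch.cuda.empty_cache()  # comment out for the "without" comparison

prof.export_chrome_trace("trace.json")  # inspect in chrome://tracing or Perfetto
```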

Expected behavior

With empty cache
[screenshot: profiler timeline of 11 training steps with empty_cache]
11 steps: 10.283s, 10 steps (excluding first step): 8.99s (10.283-1.284)
When zoomed in, cudaMalloc and cudaFree calls are visible:
[screenshot: zoomed-in timeline showing cudaMalloc/cudaFree]
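Those driver-level allocations can also be counted directly (a sketch assuming a recent PyTorch where torch.cuda.memory_stats() exposes the num_device_alloc / num_device_free counters):

```python
import torch

def device_alloc_counts():
    # num_device_alloc / num_device_free count actual cudaMalloc / cudaFree
    # calls issued by the caching allocator (available in recent PyTorch).
    stats = torch.cuda.memory_stats()
    return stats.get("num_device_alloc", 0), stats.get("num_device_free", 0)

x = torch.randn(1024, 1024, device="cuda")
del x                                 # block goes back to the allocator cache
before = device_alloc_counts()
torch.cuda.empty_cache()              # cached block returned to the driver (cudaFree)
y = torch.randn(1024, 1024, device="cuda")  # needs a fresh cudaMalloc
after = device_alloc_counts()
print("extra device allocs:", after[0] - before[0])
print("extra device frees: ", after[1] - before[1])
```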

Without empty cache
Just comment out the torch.cuda.empty_cache() call.
[screenshot: profiler timeline of 11 training steps without empty_cache]
11 steps: 9.796s, 10 steps (excluding first step): 8.49s (9.796-1.298)
With empty_cache, both measurements (all 11 steps and the 10 steps excluding the first) are roughly 5%-6% slower than without it.

These changes were introduced in #28769. cudaMalloc and cudaFree are very expensive (slow) operations, so calls that trigger them, such as torch.cuda.empty_cache(), need to be introduced carefully.
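To illustrate the cost, here is a rough standalone microbenchmark (not the Trainer code; absolute numbers will vary by GPU, but flushing the cache every iteration should be measurably slower):

```python
import time
import torch

def loop(steps, flush_cache):
    x = torch.randn(4096, 4096, device="cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        y = x @ x          # allocate a large temporary, as a training step would
        del y
        if flush_cache:
            # Returns cached blocks to the driver, so the next iteration
            # has to go through cudaMalloc again instead of reusing them.
            torch.cuda.empty_cache()
    torch.cuda.synchronize()
    return time.perf_counter() - start

loop(10, False)  # warmup
print("with empty_cache:    %.3fs" % loop(100, True))
print("without empty_cache: %.3fs" % loop(100, False))
```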