cuda.empty_cache in trainer.py slows down training · Issue #31372 · huggingface/transformers
System Info
- transformers version: 4.41.0.dev0
- Platform: Linux-5.15.153.1-2.cm2-x86_64-with-glibc2.31
- Python version: 3.10.14
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.3
- Accelerate version: 0.30.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.4.0.dev20240516+cu118 (True)
- Tensorflow version (GPU?): 2.15.1 (False)
- Flax version (CPU?/GPU?/TPU?): 0.7.0 (cpu)
- Jax version: 0.4.13
- JaxLib version: 0.4.13
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
```
torchrun --nproc_per_node 8 run_clm.py \
  --model_name_or_path microsoft/Phi-3-mini-4k-instruct --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --output_dir output_dir --overwrite_output_dir \
  --block_size 2048 --deepspeed zero_stage_2.json \
  --fp16 --per_device_train_batch_size 1 --num_train_epochs 2 --max_steps 12 --evaluation_strategy no --save_strategy no --remove_unused_columns False
```
Here's my branch for reference: https://github.com/jingyanwangms/transformers/tree/jingywa/phi-3
The branch contains just a few lines of code changes that should not have performance implications. The profile below was captured without any explicit torch.cuda.synchronize() calls.
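For context, here is a minimal sketch (hypothetical, not the exact trainer.py code from #28769) of the pattern being compared; the only difference between the two runs below is whether torch.cuda.empty_cache() is called inside the training step.

```python
import torch

def training_step(model, inputs, optimizer):
    # Standard forward/backward/optimizer step; model, inputs and optimizer
    # are placeholders for whatever the Trainer actually runs.
    loss = model(**inputs).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    # The call under discussion: commenting this out gives the
    # "without empty cache" measurement below.
    torch.cuda.empty_cache()
    return loss.detach()
```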
Expected behavior
With empty cache
11 steps: 10.283s, 10 steps (excluding first step): 8.99s (10.283-1.284)
When zoomed in, there are cudaMalloc and cudaFree calls visible in the profile.
Without empty cache
Obtained by simply commenting out the torch.cuda.empty_cache() call.
11 steps: 9.796s, 10 steps (excluding first step): 8.49s (9.796-1.298)
In both measurements, the run with empty_cache is roughly 5%-6% slower.
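For reference, a sketch of how such a zoomed-in view can be captured with torch.profiler (an assumed setup, not necessarily the one used for the numbers above); cudaMalloc and cudaFree appear as CUDA runtime events in the exported trace. run_step is a hypothetical callable that executes one training step.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_steps(run_step, num_steps=11):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(num_steps):
            run_step()
        torch.cuda.synchronize()
    # Open trace.json in chrome://tracing or Perfetto to zoom in on
    # cudaMalloc / cudaFree runtime calls.
    prof.export_chrome_trace("trace.json")
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```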
The change was introduced in #28769. cudaMalloc and cudaFree are very expensive (slow) operations, so they need to be introduced carefully.
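A small illustration of the caching-allocator behavior assumed here: empty_cache() returns cached blocks to the driver via cudaFree, so the next allocation has to go back through cudaMalloc instead of reusing a cached block.

```python
import torch

x = torch.randn(1024, 1024, device="cuda")
del x
# The freed tensor's memory stays in PyTorch's cache, ready for reuse.
print(torch.cuda.memory_reserved())   # > 0

torch.cuda.empty_cache()               # releases cached blocks via cudaFree
print(torch.cuda.memory_reserved())    # 0 in this isolated example

# The next allocation cannot reuse a cached block, so it triggers cudaMalloc.
y = torch.randn(1024, 1024, device="cuda")
```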