kyleliang919/C-Optim: When it comes to optimizers, it's always better to be safe than sorry
Cautious Optimizers (C-Optim): Improving Training with One Line of Code
AdamW has long been the go-to optimizer for transformer pretraining. For years, the research community has searched for faster and more stable optimizers, with only constrained positive outcomes. In this work, we introduce a simple, single-line modification in PyTorch to any momentum-based optimizer. The resulting Cautious Optimizers (e.g., C-AdamW and C-Lion) open the door to improved training performance.
Our theoretical findings reveal that this modification preserves Adam’s Hamiltonian function and retains its convergence guarantees under Lyapunov analysis. Additionally, a new family of optimizers emerges from this insight. Among these, we select the simplest for empirical experiments, achieving up to 1.47x speed-up on Llama and MAE pretraining.
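Concretely, the cautious rule masks the base optimizer's proposed update so that it only acts on coordinates where the update agrees in sign with the current gradient (a sketch in paper-style notation; symbols are illustrative):

$$\tilde{u}_t = u_t \odot \mathbb{1}[u_t \odot g_t > 0], \qquad \theta_{t+1} = \theta_t - \eta\,\tilde{u}_t$$

where $u_t$ is the update proposed by the base optimizer (e.g., Adam's $\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)$) and $g_t$ is the current gradient; the masked update is additionally rescaled so its overall magnitude is roughly preserved.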
🌟 News
- [2025-01-23] PPO (Reinforcement Learning) experiments added.
- [2025-01-14] Post-training experiment on Qwen2.5 1.5B Instruct added.
- [2024-12-03] 🤗🤗🤗 More validation runs on ViTs: timm-optim-caution.
- [2024-12-03] 🤗🤗🤗 Caution implemented in huggingface/pytorch-image-models.
- [2024-11-24] Pre-release paper available on arXiv: Cautious Optimizers: Improving Training with One Line of Code.
- [2024-11-24] Official implementation of C-Optim released! Experiment with C-AdamW and C-Lion today.
🚀 Implementation
Generic Single-Line Implementation for C-Optim
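A minimal sketch of the cautious masking applied inside an AdamW-style step (variable names are illustrative, not the exact code in this repo; see the optimizer files here for the reference implementation):

```python
import torch

@torch.no_grad()
def cautious_adamw_step(param, state, lr=1e-3, betas=(0.9, 0.999),
                        eps=1e-8, weight_decay=1e-2):
    """One AdamW step with the 'cautious' mask (illustrative sketch)."""
    grad = param.grad
    # lazily initialize Adam state
    if not state:
        state["step"] = 0
        state["exp_avg"] = torch.zeros_like(param)
        state["exp_avg_sq"] = torch.zeros_like(param)
    state["step"] += 1
    beta1, beta2 = betas
    exp_avg, exp_avg_sq = state["exp_avg"], state["exp_avg_sq"]

    # standard AdamW moment updates with bias correction
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_c1 = 1 - beta1 ** state["step"]
    bias_c2 = 1 - beta2 ** state["step"]
    denom = (exp_avg_sq / bias_c2).sqrt().add_(eps)
    update = (exp_avg / bias_c1) / denom

    # the cautious modification: zero the coordinates where the momentum-based
    # update disagrees in sign with the current gradient, then rescale so the
    # overall update magnitude is roughly preserved
    mask = (exp_avg * grad > 0).to(grad.dtype)
    mask.mul_(mask.numel() / (mask.sum() + 1))
    update = update * mask

    # decoupled weight decay and parameter update
    param.mul_(1 - lr * weight_decay)
    param.add_(update, alpha=-lr)
```

In the training commands below, the cautious variants are selected with flags such as `--optimizer c-adamw`, `--custom_optim c-adamw`, or `--cautious`.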
Pretraining Results
Post Training Results
PPO
📦 Installation
Install Experiment Dependencies
pip install -r requirements.txt
🛠️ Usage
Pretraining Llama on C4
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
--model_config configs/llama_60m.json \
--lr 0.001 \
--batch_size 16 \
--total_batch_size 512 \
--activation_checkpointing \
--num_training_steps 10000 \
--warmup_steps 1000 \
--weight_decay 0 \
--grad_clipping 1.0 \
--dtype bfloat16 \
--eval_every 1000 \
--single_gpu \
--optimizer c-adamw \
--max_length 1024
Pretraining MAE on ImageNet 1K (50 Epochs)
torchrun --standalone --nproc_per_node 4 run_mae.py \
--dataset_name ILSVRC/imagenet-1k \
--output_dir ./vit-mae-c \
--remove_unused_columns False \
--label_names pixel_values \
--mask_ratio 0.75 \
--norm_pix_loss \
--do_train \
--do_eval \
--base_learning_rate 1.5e-4 \
--lr_scheduler_type cosine \
--weight_decay 0.05 \
--num_train_epochs 50 \
--warmup_ratio 0.05 \
--per_device_train_batch_size 256 \
--per_device_eval_batch_size 8 \
--logging_strategy steps \
--logging_steps 10 \
--eval_strategy epoch \
--save_strategy epoch \
--load_best_model_at_end True \
--save_total_limit 3 \
--seed 1337 \
--custom_optim c-adamw \
--trust_remote_code \
--gradient_accumulation_steps 4
Post Training Qwen2.5
torchrun \
--rdzv_id=$JOB_ID \
--rdzv-backend=c10d \
--nnodes=1:8 \
--nproc-per-node=1 \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
post_training.py --model "Qwen/Qwen2.5-1.5B-Instruct" \
--output_dir cautious_1.5b \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--max_length 8192 \
--cautious
PPO
accelerate launch ppo_tldr.py \
--dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
--dataset_test_split validation \
--output_dir models/minimal/ppo_tldr \
--learning_rate 3e-6 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--total_episodes 1000000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--local_rollout_forward_batch_size 16 \
--missing_eos_penalty 1.0 \
--stop_token eos \
--eval_strategy steps \
--eval_steps 100 \
--custom_optim c_adamw \
--num_gpus 8
📖 Citation
@misc{liang2024online,
      title={Cautious Optimizers: Improving Training with One Line of Code},
      author={Kaizhao Liang and Lizhang Chen and Bo Liu and Qiang Liu},
      year={2024},
      eprint={2411.16085},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}