
PyTorch Image Classification

The models and training techniques from the papers listed under References are implemented in PyTorch.

Requirements

```
pip install -r requirements.txt
```

Usage

```
python train.py --config configs/cifar/resnet_preact.yaml
```
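The `key value` pairs that follow the config path in the commands below override individual entries of the YAML config, with dotted keys addressing nested options. For example, an illustrative combination of overrides that all appear in later sections (the output directory here is made up):

```
python train.py --config configs/cifar/resnet_preact.yaml \
    model.resnet_preact.depth 56 \
    train.base_lr 0.2 \
    train.output_dir experiments/resnet_preact_56/exp00
```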

Results on CIFAR-10

Results using almost the same settings as in the papers

| Model | Test Error (median of 3 runs) | Test Error (in paper) | Training Time |
|:---|:---:|:---:|:---:|
| VGG-like (depth 15, w/ BN, channel 64) | 7.29 | N/A | 1h20m |
| ResNet-110 | 6.52 | 6.43 (best), 6.61 +/- 0.16 | 3h06m |
| ResNet-preact-110 | 6.47 | 6.37 (median of 5 runs) | 3h05m |
| ResNet-preact-164 bottleneck | 5.90 | 5.46 (median of 5 runs) | 4h01m |
| ResNet-preact-1001 bottleneck | | 4.62 (median of 5 runs), 4.69 +/- 0.20 | |
| WRN-28-10 | 4.03 | 4.00 (median of 5 runs) | 16h10m |
| WRN-28-10 w/ dropout | | 3.89 (median of 5 runs) | |
| DenseNet-100 (k=12) | 3.87 (1 run) | 4.10 (1 run) | 24h28m* |
| DenseNet-100 (k=24) | | 3.74 (1 run) | |
| DenseNet-BC-100 (k=12) | 4.69 | 4.51 (1 run) | 15h20m |
| DenseNet-BC-250 (k=24) | | 3.62 (1 run) | |
| DenseNet-BC-190 (k=40) | | 3.46 (1 run) | |
| PyramidNet-110 (alpha=84) | 4.40 | 4.26 +/- 0.23 | 11h40m |
| PyramidNet-110 (alpha=270) | 3.92 (1 run) | 3.73 +/- 0.04 | 24h12m* |
| PyramidNet-164 bottleneck (alpha=270) | 3.44 (1 run) | 3.48 +/- 0.20 | 32h37m* |
| PyramidNet-272 bottleneck (alpha=200) | | 3.31 +/- 0.08 | |
| ResNeXt-29 4x64d | 3.89 | ~3.75 (from Figure 7) | 31h17m |
| ResNeXt-29 8x64d | 3.97 (1 run) | 3.65 (average of 10 runs) | 42h50m* |
| ResNeXt-29 16x64d | | 3.58 (average of 10 runs) | |
| shake-shake-26 2x32d (S-S-I) | 3.68 | 3.55 (average of 3 runs) | 33h49m |
| shake-shake-26 2x64d (S-S-I) | 2.88 (1 run) | 2.98 (average of 3 runs) | 78h48m |
| shake-shake-26 2x96d (S-S-I) | 2.90 (1 run) | 2.86 (average of 5 runs) | 101h32m* |

Notes

VGG-like

```
python train.py --config configs/cifar/vgg.yaml
```

ResNet

```
python train.py --config configs/cifar/resnet.yaml
```

ResNet-preact

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.output_dir experiments/resnet_preact_basic_110/exp00
```

```
python train.py --config configs/cifar/resnet_preact.yaml \
    model.resnet_preact.depth 164 \
    model.resnet_preact.block_type bottleneck \
    train.output_dir experiments/resnet_preact_bottleneck_164/exp00
```

WRN

```
python train.py --config configs/cifar/wrn.yaml
```

DenseNet

```
python train.py --config configs/cifar/densenet.yaml
```

PyramidNet

```
python train.py --config configs/cifar/pyramidnet.yaml \
    model.pyramidnet.depth 110 \
    model.pyramidnet.block_type basic \
    model.pyramidnet.alpha 84 \
    train.output_dir experiments/pyramidnet_basic_110_84/exp00
```

```
python train.py --config configs/cifar/pyramidnet.yaml \
    model.pyramidnet.depth 110 \
    model.pyramidnet.block_type basic \
    model.pyramidnet.alpha 270 \
    train.output_dir experiments/pyramidnet_basic_110_270/exp00
```

ResNeXt

```
python train.py --config configs/cifar/resnext.yaml \
    model.resnext.cardinality 4 \
    train.batch_size 32 \
    train.base_lr 0.025 \
    train.output_dir experiments/resnext_29_4x64d/exp00
```

```
python train.py --config configs/cifar/resnext.yaml \
    train.batch_size 64 \
    train.base_lr 0.05 \
    train.output_dir experiments/resnext_29_8x64d/exp00
```

shake-shake

```
python train.py --config configs/cifar/shake_shake.yaml \
    model.shake_shake.initial_channels 32 \
    train.output_dir experiments/shake_shake_26_2x32d_SSI/exp00
```

```
python train.py --config configs/cifar/shake_shake.yaml \
    model.shake_shake.initial_channels 64 \
    train.batch_size 64 \
    train.base_lr 0.1 \
    train.output_dir experiments/shake_shake_26_2x64d_SSI/exp00
```

```
python train.py --config configs/cifar/shake_shake.yaml \
    model.shake_shake.initial_channels 96 \
    train.batch_size 64 \
    train.base_lr 0.1 \
    train.output_dir experiments/shake_shake_26_2x96d_SSI/exp00
```

Results

| Model | Test Error (1 run) | # of Epochs | Training Time |
|:---|:---:|:---:|:---:|
| ResNet-preact-20, widening factor 4 | 4.91 | 200 | 1h26m |
| ResNet-preact-20, widening factor 4 | 4.01 | 400 | 2h53m |
| ResNet-preact-20, widening factor 4 | 3.99 | 1800 | 12h53m |
| ResNet-preact-20, widening factor 4, Cutout 16 | 3.71 | 200 | 1h26m |
| ResNet-preact-20, widening factor 4, Cutout 16 | 3.46 | 400 | 2h53m |
| ResNet-preact-20, widening factor 4, Cutout 16 | 3.76 | 1800 | 12h53m |
| ResNet-preact-20, widening factor 4, RICAP (beta=0.3) | 3.45 | 200 | 1h26m |
| ResNet-preact-20, widening factor 4, RICAP (beta=0.3) | 3.11 | 400 | 2h53m |
| ResNet-preact-20, widening factor 4, RICAP (beta=0.3) | 3.15 | 1800 | 12h53m |

| Model | Test Error (1 run) | # of Epochs | Training Time |
|:---|:---:|:---:|:---:|
| WRN-28-10, Cutout 16 | 3.19 | 200 | 6h35m |
| WRN-28-10, mixup (alpha=1) | 3.32 | 200 | 6h35m |
| WRN-28-10, RICAP (beta=0.3) | 2.83 | 200 | 6h35m |
| WRN-28-10, Dual-Cutout (alpha=0.1) | 2.87 | 200 | 12h42m |
| WRN-28-10, Cutout 16 | 3.07 | 400 | 13h10m |
| WRN-28-10, mixup (alpha=1) | 3.04 | 400 | 13h08m |
| WRN-28-10, RICAP (beta=0.3) | 2.71 | 400 | 13h08m |
| WRN-28-10, Dual-Cutout (alpha=0.1) | 2.76 | 400 | 25h20m |
| shake-shake-26 2x64d, Cutout 16 | 2.64 | 1800 | 78h55m* |
| shake-shake-26 2x64d, mixup (alpha=1) | 2.63 | 1800 | 35h56m |
| shake-shake-26 2x64d, RICAP (beta=0.3) | 2.29 | 1800 | 35h10m |
| shake-shake-26 2x64d, Dual-Cutout (alpha=0.1) | 2.64 | 1800 | 68h34m |
| shake-shake-26 2x96d, Cutout 16 | 2.50 | 1800 | 60h20m |
| shake-shake-26 2x96d, mixup (alpha=1) | 2.36 | 1800 | 60h20m |
| shake-shake-26 2x96d, RICAP (beta=0.3) | 2.10 | 1800 | 60h20m |
| shake-shake-26 2x96d, Dual-Cutout (alpha=0.1) | 2.41 | 1800 | 113h09m |
| shake-shake-26 2x128d, Cutout 16 | 2.58 | 1800 | 85h04m |
| shake-shake-26 2x128d, RICAP (beta=0.3) | 1.97 | 1800 | 85h06m |
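For reference, a minimal sketch of RICAP, assuming it follows the original formulation: crops from four shuffled copies of the batch are patched into one image, and the labels are mixed in proportion to patch areas. This is illustrative only; the repository's own implementation is enabled via `augmentation.use_ricap`.

```python
import numpy as np
import torch

def ricap(x: torch.Tensor, y: torch.Tensor, beta: float = 0.3):
    """RICAP sketch: patch random crops of four shuffled batches into one image.

    Returns the patched batch plus the four label sets and their area
    weights; the loss is the weight-averaged cross entropy.
    """
    h, w = x.shape[2], x.shape[3]
    # Sample the boundary point that splits the canvas into four regions.
    cut_h = int(np.round(h * np.random.beta(beta, beta)))
    cut_w = int(np.round(w * np.random.beta(beta, beta)))
    region_sizes = [(cut_h, cut_w), (cut_h, w - cut_w),
                    (h - cut_h, cut_w), (h - cut_h, w - cut_w)]
    patched = x.new_empty(x.shape)
    labels, weights = [], []
    for k, (ph, pw) in enumerate(region_sizes):
        index = torch.randperm(x.size(0))
        y0 = np.random.randint(0, h - ph + 1)
        x0 = np.random.randint(0, w - pw + 1)
        patch = x[index, :, y0:y0 + ph, x0:x0 + pw]
        if k == 0:
            patched[:, :, :cut_h, :cut_w] = patch
        elif k == 1:
            patched[:, :, :cut_h, cut_w:] = patch
        elif k == 2:
            patched[:, :, cut_h:, :cut_w] = patch
        else:
            patched[:, :, cut_h:, cut_w:] = patch
        labels.append(y[index])
        weights.append(ph * pw / (h * w))
    return patched, labels, weights
```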

Note

```
python train.py --config configs/cifar/wrn.yaml \
    train.batch_size 64 \
    train.output_dir experiments/wrn_28_10_cutout16 \
    scheduler.type cosine \
    augmentation.use_cutout True
```

```
python train.py --config configs/cifar/shake_shake.yaml \
    model.shake_shake.initial_channels 64 \
    train.batch_size 64 \
    train.base_lr 0.1 \
    scheduler.epochs 300 \
    train.output_dir experiments/shake_shake_26_2x64d_SSI_cutout16/exp00 \
    augmentation.use_cutout True
```
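The number after "Cutout" in the tables is the mask size. A minimal sketch of the idea, assuming an HWC image array; the repository's implementation, enabled with `augmentation.use_cutout`, may differ in details such as how the mask position is sampled:

```python
import numpy as np

def cutout(image: np.ndarray, mask_size: int = 16) -> np.ndarray:
    """Zero out one randomly placed mask_size x mask_size square (sketch)."""
    h, w = image.shape[:2]
    # Sample the mask center; the square may be clipped at the border.
    cy, cx = np.random.randint(h), np.random.randint(w)
    y0, y1 = max(0, cy - mask_size // 2), min(h, cy + mask_size // 2)
    x0, x1 = max(0, cx - mask_size // 2), min(w, cx + mask_size // 2)
    out = image.copy()
    out[y0:y1, x0:x1] = 0
    return out
```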

Results using multiple GPUs

| Model | batch size | #GPUs | Test Error (1 run) | # of Epochs | Training Time* |
|:---|:---:|:---:|:---:|:---:|:---:|
| WRN-28-10, RICAP (beta=0.3) | 512 | 1 | 2.63 | 200 | 3h41m |
| WRN-28-10, RICAP (beta=0.3) | 256 | 2 | 2.71 | 200 | 2h14m |
| WRN-28-10, RICAP (beta=0.3) | 128 | 4 | 2.89 | 200 | 1h01m |
| WRN-28-10, RICAP (beta=0.3) | 64 | 8 | 2.75 | 200 | 34m |

Note

The batch size in these commands is per process, so the effective batch size is 512 (batch size × #GPUs) in every row of the table above, which is why train.base_lr is kept at 0.2 throughout.

Using 1 GPU

```
python train.py --config configs/cifar/wrn.yaml \
    train.base_lr 0.2 \
    train.batch_size 512 \
    scheduler.epochs 200 \
    scheduler.type cosine \
    train.output_dir experiments/wrn_28_10_ricap_1gpu/exp00 \
    augmentation.use_ricap True \
    augmentation.use_random_crop False
```

Using 2 GPUs

```
python -m torch.distributed.launch --nproc_per_node 2 \
    train.py --config configs/cifar/wrn.yaml \
    train.distributed True \
    train.base_lr 0.2 \
    train.batch_size 256 \
    scheduler.epochs 200 \
    scheduler.type cosine \
    train.output_dir experiments/wrn_28_10_ricap_2gpus/exp00 \
    augmentation.use_ricap True \
    augmentation.use_random_crop False
```

Using 4 GPUs

```
python -m torch.distributed.launch --nproc_per_node 4 \
    train.py --config configs/cifar/wrn.yaml \
    train.distributed True \
    train.base_lr 0.2 \
    train.batch_size 128 \
    scheduler.epochs 200 \
    scheduler.type cosine \
    train.output_dir experiments/wrn_28_10_ricap_4gpus/exp00 \
    augmentation.use_ricap True \
    augmentation.use_random_crop False
```

Using 8 GPUs

```
python -m torch.distributed.launch --nproc_per_node 8 \
    train.py --config configs/cifar/wrn.yaml \
    train.distributed True \
    train.base_lr 0.2 \
    train.batch_size 64 \
    scheduler.epochs 200 \
    scheduler.type cosine \
    train.output_dir experiments/wrn_28_10_ricap_8gpus/exp00 \
    augmentation.use_ricap True \
    augmentation.use_random_crop False
```
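A generic sketch of what each process does under torch.distributed.launch, which spawns one process per GPU and provides the rendezvous environment variables. This is an assumption about the setup, not the repository's exact code:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE.
dist.init_process_group(backend='nccl')
local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)

model = nn.Linear(32, 10).cuda()  # stand-in for the real model
model = DDP(model, device_ids=[local_rank])
# Each process then trains on its own data shard (train.batch_size per GPU),
# typically served through a DistributedSampler.
```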

Results on FashionMNIST

| Model | Test Error (1 run) | # of Epochs | Training Time |
|:---|:---:|:---:|:---:|
| ResNet-preact-20, widening factor 4, Cutout 12 | 4.17 | 200 | 1h32m |
| ResNet-preact-20, widening factor 4, Cutout 14 | 4.11 | 200 | 1h32m |
| ResNet-preact-50, Cutout 12 | 4.45 | 200 | 57m |
| ResNet-preact-50, Cutout 14 | 4.38 | 200 | 57m |
| ResNet-preact-50, widening factor 4, Cutout 12 | 4.07 | 200 | 3h37m |
| ResNet-preact-50, widening factor 4, Cutout 14 | 4.13 | 200 | 3h39m |
| shake-shake-26 2x32d (S-S-I), Cutout 12 | 4.08 | 400 | 3h41m |
| shake-shake-26 2x32d (S-S-I), Cutout 14 | 4.05 | 400 | 3h39m |
| shake-shake-26 2x96d (S-S-I), Cutout 12 | 3.72 | 400 | 13h46m |
| shake-shake-26 2x96d (S-S-I), Cutout 14 | 3.85 | 400 | 13h39m |
| shake-shake-26 2x96d (S-S-I), Cutout 12 | 3.65 | 800 | 26h42m |
| shake-shake-26 2x96d (S-S-I), Cutout 14 | 3.60 | 800 | 26h42m |

| Model | Test Error (median of 3 runs) | # of Epochs | Training Time |
|:---|:---:|:---:|:---:|
| ResNet-preact-20 | 5.04 | 200 | 26m |
| ResNet-preact-20, Cutout 6 | 4.84 | 200 | 26m |
| ResNet-preact-20, Cutout 8 | 4.64 | 200 | 26m |
| ResNet-preact-20, Cutout 10 | 4.74 | 200 | 26m |
| ResNet-preact-20, Cutout 12 | 4.68 | 200 | 26m |
| ResNet-preact-20, Cutout 14 | 4.64 | 200 | 26m |
| ResNet-preact-20, Cutout 16 | 4.49 | 200 | 26m |
| ResNet-preact-20, RandomErasing | 4.61 | 200 | 26m |
| ResNet-preact-20, Mixup | 4.92 | 200 | 26m |
| ResNet-preact-20, Mixup | 4.64 | 400 | 52m |

Note

Results on MNIST

| Model | Test Error (median of 3 runs) | # of Epochs | Training Time |
|:---|:---:|:---:|:---:|
| ResNet-preact-20 | 0.40 | 100 | 12m |
| ResNet-preact-20, Cutout 6 | 0.32 | 100 | 12m |
| ResNet-preact-20, Cutout 8 | 0.25 | 100 | 12m |
| ResNet-preact-20, Cutout 10 | 0.27 | 100 | 12m |
| ResNet-preact-20, Cutout 12 | 0.26 | 100 | 12m |
| ResNet-preact-20, Cutout 14 | 0.26 | 100 | 12m |
| ResNet-preact-20, Cutout 16 | 0.25 | 100 | 12m |
| ResNet-preact-20, Mixup (alpha=1) | 0.40 | 100 | 12m |
| ResNet-preact-20, Mixup (alpha=0.5) | 0.38 | 100 | 12m |
| ResNet-preact-20, widening factor 4, Cutout 14 | 0.26 | 100 | 45m |
| ResNet-preact-50, Cutout 14 | 0.29 | 100 | 28m |
| ResNet-preact-50, widening factor 4, Cutout 14 | 0.25 | 100 | 1h50m |
| shake-shake-26 2x96d (S-S-I), Cutout 14 | 0.24 | 100 | 3h22m |

Note

Results on Kuzushiji-MNIST

| Model | Test Error (median of 3 runs) | # of Epochs | Training Time |
|:---|:---:|:---:|:---:|
| ResNet-preact-20, Cutout 14 | 0.82 (best 0.67) | 200 | 24m |
| ResNet-preact-20, widening factor 4, Cutout 14 | 0.72 (best 0.67) | 200 | 1h30m |
| PyramidNet-110-270, Cutout 14 | 0.72 (best 0.70) | 200 | 10h05m |
| shake-shake-26 2x96d (S-S-I), Cutout 14 | 0.66 (best 0.63) | 200 | 6h46m |

Note

Experiments

Experiment on residual units, learning rate scheduling, and data augmentation

This experiment investigates how the following affect classification accuracy:

* the structure of the residual unit (keeping or removing the first ReLU, adding a last BN, and preactivating the shortcut after downsampling),
* learning rate scheduling (multistep vs. cosine annealing), and
* data augmentation (Cutout, RandomErasing, Mixup).

ResNet-preact-56 is trained on CIFAR-10 with an initial learning rate of 0.2.
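To make the ablated options concrete, here is a hypothetical simplification of a pre-activation basic block: `remove_first_relu` drops the ReLU before the first convolution, and `add_last_bn` appends a BN after the second one. Stride and shortcut handling are omitted, so this is a sketch rather than the repository's block.

```python
import torch.nn as nn

class PreactBasicBlock(nn.Module):
    """Pre-activation basic block sketch (BN-ReLU-Conv twice, plus shortcut)."""

    def __init__(self, channels: int,
                 remove_first_relu: bool = False,
                 add_last_bn: bool = False):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        # "w/o 1st ReLU" rows drop this activation.
        self.relu1 = nn.Identity() if remove_first_relu else nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu2 = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        # "w/ last BN" rows append a BN after the last convolution.
        self.last_bn = nn.BatchNorm2d(channels) if add_last_bn else nn.Identity()

    def forward(self, x):
        y = self.conv1(self.relu1(self.bn1(x)))
        y = self.conv2(self.relu2(self.bn2(y)))
        return x + self.last_bn(y)
```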

Note

Results

| Model | Test Error (median of 5 runs) | Training Time |
|:---|:---:|:---:|
| w/ 1st ReLU, w/o last BN, preactivate shortcut after downsampling | 6.45 | 95 min |
| w/ 1st ReLU, w/o last BN | 6.47 | 95 min |
| w/o 1st ReLU, w/o last BN | 6.14 | 89 min |
| w/ 1st ReLU, w/ last BN | 6.43 | 104 min |
| w/o 1st ReLU, w/ last BN | 5.85 | 98 min |
| w/o 1st ReLU, w/ last BN, preactivate shortcut after downsampling | 6.27 | 98 min |
| w/o 1st ReLU, w/ last BN, Cosine annealing | 5.72 | 98 min |
| w/o 1st ReLU, w/ last BN, Cutout | 4.96 | 98 min |
| w/o 1st ReLU, w/ last BN, RandomErasing | 5.22 | 98 min |
| w/o 1st ReLU, w/ last BN, Mixup (300 epochs) | 5.11 | 191 min |
preactivate shortcut after downsampling

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, True, True]' \
    model.resnet_preact.remove_first_relu False \
    model.resnet_preact.add_last_bn False \
    train.output_dir experiments/resnet_preact_after_downsampling/exp00
```

w/ 1st ReLU, w/o last BN

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, False, False]' \
    model.resnet_preact.remove_first_relu False \
    model.resnet_preact.add_last_bn False \
    train.output_dir experiments/resnet_preact_w_relu_wo_bn/exp00
```

w/o 1st ReLU, w/o last BN

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, False, False]' \
    model.resnet_preact.remove_first_relu True \
    model.resnet_preact.add_last_bn False \
    train.output_dir experiments/resnet_preact_wo_relu_wo_bn/exp00
```

w/ 1st ReLU, w/ last BN

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, False, False]' \
    model.resnet_preact.remove_first_relu False \
    model.resnet_preact.add_last_bn True \
    train.output_dir experiments/resnet_preact_w_relu_w_bn/exp00
```

w/o 1st ReLU, w/ last BN

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, False, False]' \
    model.resnet_preact.remove_first_relu True \
    model.resnet_preact.add_last_bn True \
    train.output_dir experiments/resnet_preact_wo_relu_w_bn/exp00
```

w/o 1st ReLU, w/ last BN, preactivate shortcut after downsampling

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, True, True]' \
    model.resnet_preact.remove_first_relu True \
    model.resnet_preact.add_last_bn True \
    train.output_dir experiments/resnet_preact_after_downsampling_wo_relu_w_bn/exp00
```

w/o 1st ReLU, w/ last BN, cosine annealing

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, False, False]' \
    model.resnet_preact.remove_first_relu True \
    model.resnet_preact.add_last_bn True \
    scheduler.type cosine \
    train.output_dir experiments/resnet_preact_wo_relu_w_bn_cosine/exp00
```

w/o 1st ReLU, w/ last BN, Cutout

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, False, False]' \
    model.resnet_preact.remove_first_relu True \
    model.resnet_preact.add_last_bn True \
    augmentation.use_cutout True \
    train.output_dir experiments/resnet_preact_wo_relu_w_bn_cutout/exp00
```

w/o 1st ReLU, w/ last BN, RandomErasing

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, False, False]' \
    model.resnet_preact.remove_first_relu True \
    model.resnet_preact.add_last_bn True \
    augmentation.use_random_erasing True \
    train.output_dir experiments/resnet_preact_wo_relu_w_bn_random_erasing/exp00
```

w/o 1st ReLU, w/ last BN, Mixup

```
python train.py --config configs/cifar/resnet_preact.yaml \
    train.base_lr 0.2 \
    model.resnet_preact.depth 56 \
    model.resnet_preact.preact_stage '[True, False, False]' \
    model.resnet_preact.remove_first_relu True \
    model.resnet_preact.add_last_bn True \
    augmentation.use_mixup True \
    train.output_dir experiments/resnet_preact_wo_relu_w_bn_mixup/exp00
```

Experiments on label smoothing, Mixup, RICAP, and Dual-Cutout

Results on CIFAR-10

| Model | Test Error (median of 3 runs) | # of Epochs | Training Time |
|:---|:---:|:---:|:---:|
| ResNet-preact-20 | 7.60 | 200 | 24m |
| ResNet-preact-20, label smoothing (epsilon=0.001) | 7.51 | 200 | 25m |
| ResNet-preact-20, label smoothing (epsilon=0.01) | 7.21 | 200 | 25m |
| ResNet-preact-20, label smoothing (epsilon=0.1) | 7.57 | 200 | 25m |
| ResNet-preact-20, mixup (alpha=1) | 7.24 | 200 | 26m |
| ResNet-preact-20, RICAP (beta=0.3), w/ random crop | 6.88 | 200 | 28m |
| ResNet-preact-20, RICAP (beta=0.3) | 6.77 | 200 | 28m |
| ResNet-preact-20, Dual-Cutout 16 (alpha=0.1) | 6.24 | 200 | 45m |
| ResNet-preact-20 | 7.05 | 400 | 49m |
| ResNet-preact-20, label smoothing (epsilon=0.001) | 7.20 | 400 | 49m |
| ResNet-preact-20, label smoothing (epsilon=0.01) | 6.97 | 400 | 49m |
| ResNet-preact-20, label smoothing (epsilon=0.1) | 7.16 | 400 | 49m |
| ResNet-preact-20, mixup (alpha=1) | 6.66 | 400 | 51m |
| ResNet-preact-20, RICAP (beta=0.3), w/ random crop | 6.30 | 400 | 56m |
| ResNet-preact-20, RICAP (beta=0.3) | 6.19 | 400 | 56m |
| ResNet-preact-20, Dual-Cutout 16 (alpha=0.1) | 5.55 | 400 | 1h36m |
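As a reference for the rows above: label smoothing replaces the one-hot target with (1 - epsilon) on the true class and spreads the remaining mass epsilon uniformly over the classes, while mixup blends pairs of examples. A minimal mixup sketch, assuming the usual formulation (the repository enables it via `augmentation.use_mixup`):

```python
import numpy as np
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 1.0):
    """Blend each example with a randomly paired one (mixup sketch).

    The training loss becomes lam * loss(pred, y_a) + (1 - lam) * loss(pred, y_b).
    """
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[index]
    return mixed_x, y, y[index], lam
```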

Note

Experiments on batch size and learning rate

Linear scaling rule for learning rate

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 3.2 | cosine | 200 | 10.57 | 22m |
| ResNet-preact-20 | 2048 | 1.6 | cosine | 200 | 8.87 | 21m |
| ResNet-preact-20 | 1024 | 0.8 | cosine | 200 | 8.40 | 21m |
| ResNet-preact-20 | 512 | 0.4 | cosine | 200 | 8.22 | 20m |
| ResNet-preact-20 | 256 | 0.2 | cosine | 200 | 8.61 | 22m |
| ResNet-preact-20 | 128 | 0.1 | cosine | 200 | 8.09 | 24m |
| ResNet-preact-20 | 64 | 0.05 | cosine | 200 | 8.22 | 28m |
| ResNet-preact-20 | 32 | 0.025 | cosine | 200 | 8.00 | 43m |
| ResNet-preact-20 | 16 | 0.0125 | cosine | 200 | 7.75 | 1h17m |
| ResNet-preact-20 | 8 | 0.006125 | cosine | 200 | 7.70 | 2h32m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 3.2 | multistep | 200 | 28.97 | 22m |
| ResNet-preact-20 | 2048 | 1.6 | multistep | 200 | 9.07 | 21m |
| ResNet-preact-20 | 1024 | 0.8 | multistep | 200 | 8.62 | 21m |
| ResNet-preact-20 | 512 | 0.4 | multistep | 200 | 8.23 | 20m |
| ResNet-preact-20 | 256 | 0.2 | multistep | 200 | 8.40 | 21m |
| ResNet-preact-20 | 128 | 0.1 | multistep | 200 | 8.28 | 24m |
| ResNet-preact-20 | 64 | 0.05 | multistep | 200 | 8.13 | 28m |
| ResNet-preact-20 | 32 | 0.025 | multistep | 200 | 7.58 | 43m |
| ResNet-preact-20 | 16 | 0.0125 | multistep | 200 | 7.93 | 1h18m |
| ResNet-preact-20 | 8 | 0.006125 | multistep | 200 | 8.31 | 2h34m |
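The learning rates in both tables follow the linear scaling rule with a reference rate of 0.1 at batch size 128, i.e. roughly:

```python
def scaled_lr(batch_size: int, base_lr: float = 0.1, base_batch_size: int = 128) -> float:
    """Linear scaling rule: learning rate grows proportionally with batch size."""
    return base_lr * batch_size / base_batch_size

# e.g. scaled_lr(512) == 0.4 and scaled_lr(4096) == 3.2, matching the tables above.
```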

Linear scaling + longer training

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 3.2 | cosine | 400 | 8.97 | 44m |
| ResNet-preact-20 | 2048 | 1.6 | cosine | 400 | 7.85 | 43m |
| ResNet-preact-20 | 1024 | 0.8 | cosine | 400 | 7.20 | 42m |
| ResNet-preact-20 | 512 | 0.4 | cosine | 400 | 7.83 | 40m |
| ResNet-preact-20 | 256 | 0.2 | cosine | 400 | 7.65 | 42m |
| ResNet-preact-20 | 128 | 0.1 | cosine | 400 | 7.09 | 47m |
| ResNet-preact-20 | 64 | 0.05 | cosine | 400 | 7.17 | 44m |
| ResNet-preact-20 | 32 | 0.025 | cosine | 400 | 7.24 | 2h11m |
| ResNet-preact-20 | 16 | 0.0125 | cosine | 400 | 7.26 | 4h10m |
| ResNet-preact-20 | 8 | 0.006125 | cosine | 400 | 7.02 | 7h53m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 3.2 | cosine | 800 | 8.14 | 1h29m |
| ResNet-preact-20 | 2048 | 1.6 | cosine | 800 | 7.74 | 1h23m |
| ResNet-preact-20 | 1024 | 0.8 | cosine | 800 | 7.15 | 1h31m |
| ResNet-preact-20 | 512 | 0.4 | cosine | 800 | 7.27 | 1h25m |
| ResNet-preact-20 | 256 | 0.2 | cosine | 800 | 7.22 | 1h26m |
| ResNet-preact-20 | 128 | 0.1 | cosine | 800 | 6.68 | 1h35m |
| ResNet-preact-20 | 64 | 0.05 | cosine | 800 | 7.18 | 2h20m |
| ResNet-preact-20 | 32 | 0.025 | cosine | 800 | 7.03 | 4h16m |
| ResNet-preact-20 | 16 | 0.0125 | cosine | 800 | 6.78 | 8h37m |
| ResNet-preact-20 | 8 | 0.006125 | cosine | 800 | 6.89 | 16h47m |

Effect of initial learning rate

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 3.2 | cosine | 200 | 10.57 | 22m |
| ResNet-preact-20 | 4096 | 1.6 | cosine | 200 | 10.32 | 22m |
| ResNet-preact-20 | 4096 | 0.8 | cosine | 200 | 10.71 | 22m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 2048 | 3.2 | cosine | 200 | 11.34 | 21m |
| ResNet-preact-20 | 2048 | 2.4 | cosine | 200 | 8.69 | 21m |
| ResNet-preact-20 | 2048 | 2.0 | cosine | 200 | 8.81 | 21m |
| ResNet-preact-20 | 2048 | 1.6 | cosine | 200 | 8.73 | 22m |
| ResNet-preact-20 | 2048 | 0.8 | cosine | 200 | 9.62 | 21m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 1024 | 3.2 | cosine | 200 | 9.12 | 21m |
| ResNet-preact-20 | 1024 | 2.4 | cosine | 200 | 8.42 | 22m |
| ResNet-preact-20 | 1024 | 2.0 | cosine | 200 | 8.38 | 22m |
| ResNet-preact-20 | 1024 | 1.6 | cosine | 200 | 8.07 | 22m |
| ResNet-preact-20 | 1024 | 1.2 | cosine | 200 | 8.25 | 21m |
| ResNet-preact-20 | 1024 | 0.8 | cosine | 200 | 8.08 | 22m |
| ResNet-preact-20 | 1024 | 0.4 | cosine | 200 | 8.49 | 22m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 512 | 3.2 | cosine | 200 | 8.51 | 21m |
| ResNet-preact-20 | 512 | 1.6 | cosine | 200 | 7.73 | 20m |
| ResNet-preact-20 | 512 | 0.8 | cosine | 200 | 7.73 | 21m |
| ResNet-preact-20 | 512 | 0.4 | cosine | 200 | 8.22 | 20m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 256 | 3.2 | cosine | 200 | 9.64 | 22m |
| ResNet-preact-20 | 256 | 1.6 | cosine | 200 | 8.32 | 22m |
| ResNet-preact-20 | 256 | 0.8 | cosine | 200 | 7.45 | 21m |
| ResNet-preact-20 | 256 | 0.4 | cosine | 200 | 7.68 | 22m |
| ResNet-preact-20 | 256 | 0.2 | cosine | 200 | 8.61 | 22m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 128 | 1.6 | cosine | 200 | 9.03 | 24m |
| ResNet-preact-20 | 128 | 0.8 | cosine | 200 | 7.54 | 24m |
| ResNet-preact-20 | 128 | 0.4 | cosine | 200 | 7.28 | 24m |
| ResNet-preact-20 | 128 | 0.2 | cosine | 200 | 7.96 | 24m |
| ResNet-preact-20 | 128 | 0.1 | cosine | 200 | 8.09 | 24m |
| ResNet-preact-20 | 128 | 0.05 | cosine | 200 | 8.81 | 24m |
| ResNet-preact-20 | 128 | 0.025 | cosine | 200 | 10.07 | 24m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 64 | 0.4 | cosine | 200 | 7.42 | 35m |
| ResNet-preact-20 | 64 | 0.2 | cosine | 200 | 7.52 | 36m |
| ResNet-preact-20 | 64 | 0.1 | cosine | 200 | 7.78 | 37m |
| ResNet-preact-20 | 64 | 0.05 | cosine | 200 | 8.22 | 28m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 32 | 0.2 | cosine | 200 | 7.64 | 1h05m |
| ResNet-preact-20 | 32 | 0.1 | cosine | 200 | 7.25 | 1h08m |
| ResNet-preact-20 | 32 | 0.05 | cosine | 200 | 7.45 | 1h07m |
| ResNet-preact-20 | 32 | 0.025 | cosine | 200 | 8.00 | 43m |

Good learning rate + longer training

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 1.6 | cosine | 200 | 10.32 | 22m |
| ResNet-preact-20 | 2048 | 1.6 | cosine | 200 | 8.73 | 22m |
| ResNet-preact-20 | 1024 | 1.6 | cosine | 200 | 8.07 | 22m |
| ResNet-preact-20 | 1024 | 0.8 | cosine | 200 | 8.08 | 22m |
| ResNet-preact-20 | 512 | 1.6 | cosine | 200 | 7.73 | 20m |
| ResNet-preact-20 | 512 | 0.8 | cosine | 200 | 7.73 | 21m |
| ResNet-preact-20 | 256 | 0.8 | cosine | 200 | 7.45 | 21m |
| ResNet-preact-20 | 128 | 0.4 | cosine | 200 | 7.28 | 24m |
| ResNet-preact-20 | 128 | 0.2 | cosine | 200 | 7.96 | 24m |
| ResNet-preact-20 | 128 | 0.1 | cosine | 200 | 8.09 | 24m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 1.6 | cosine | 800 | 8.36 | 1h33m |
| ResNet-preact-20 | 2048 | 1.6 | cosine | 800 | 7.53 | 1h27m |
| ResNet-preact-20 | 1024 | 1.6 | cosine | 800 | 7.30 | 1h30m |
| ResNet-preact-20 | 1024 | 0.8 | cosine | 800 | 7.42 | 1h30m |
| ResNet-preact-20 | 512 | 1.6 | cosine | 800 | 6.69 | 1h26m |
| ResNet-preact-20 | 512 | 0.8 | cosine | 800 | 6.77 | 1h26m |
| ResNet-preact-20 | 256 | 0.8 | cosine | 800 | 6.84 | 1h28m |
| ResNet-preact-20 | 128 | 0.4 | cosine | 800 | 6.86 | 1h35m |
| ResNet-preact-20 | 128 | 0.2 | cosine | 800 | 7.05 | 1h38m |
| ResNet-preact-20 | 128 | 0.1 | cosine | 800 | 6.68 | 1h35m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 1.6 | cosine | 1600 | 8.25 | 3h10m |
| ResNet-preact-20 | 2048 | 1.6 | cosine | 1600 | 7.34 | 2h50m |
| ResNet-preact-20 | 1024 | 1.6 | cosine | 1600 | 6.94 | 2h52m |
| ResNet-preact-20 | 512 | 1.6 | cosine | 1600 | 6.99 | 2h44m |
| ResNet-preact-20 | 256 | 0.8 | cosine | 1600 | 6.95 | 2h50m |
| ResNet-preact-20 | 128 | 0.4 | cosine | 1600 | 6.64 | 3h09m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 4096 | 1.6 | cosine | 3200 | 9.52 | 6h15m |
| ResNet-preact-20 | 2048 | 1.6 | cosine | 3200 | 6.92 | 5h42m |
| ResNet-preact-20 | 1024 | 1.6 | cosine | 3200 | 6.96 | 5h43m |

| Model | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 2048 | 1.6 | cosine | 6400 | 7.45 | 11h44m |

LARS

```
python train.py --config configs/cifar/resnet_preact.yaml \
    model.resnet_preact.depth 20 \
    train.optimizer lars \
    train.base_lr 0.02 \
    train.batch_size 4096 \
    scheduler.type cosine \
    train.output_dir experiments/resnet_preact_lars/exp00
```
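LARS gives each parameter tensor its own trust ratio, which is why the base learning rates above are so much smaller than the SGD ones (0.02 vs. 1.6 at batch size 4096). A minimal sketch of one momentum-free LARS step; the hyperparameter names and the trust coefficient value are assumptions, not the repository's settings:

```python
import torch

@torch.no_grad()
def lars_step(params, lr: float = 0.02, weight_decay: float = 1e-4,
              trust_coef: float = 0.001):
    """One LARS update (sketch): layer-wise lr scaled by ||w|| / ||g||."""
    for p in params:
        if p.grad is None:
            continue
        g = p.grad.add(p, alpha=weight_decay)  # L2-regularized gradient
        w_norm, g_norm = p.norm(), g.norm()
        if w_norm > 0 and g_norm > 0:
            # Trust ratio: small relative gradients get a larger local lr.
            local_lr = trust_coef * (w_norm / g_norm).item()
        else:
            local_lr = 1.0
        p.add_(g, alpha=-lr * local_lr)
```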

| Model | optimizer | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | SGD | 4096 | 3.2 | cosine | 200 | 10.57 (1 run) | 22m |
| ResNet-preact-20 | SGD | 4096 | 1.6 | cosine | 200 | 10.20 | 22m |
| ResNet-preact-20 | SGD | 4096 | 0.8 | cosine | 200 | 10.71 (1 run) | 22m |
| ResNet-preact-20 | LARS | 4096 | 0.04 | cosine | 200 | 9.58 | 22m |
| ResNet-preact-20 | LARS | 4096 | 0.03 | cosine | 200 | 8.46 | 22m |
| ResNet-preact-20 | LARS | 4096 | 0.02 | cosine | 200 | 8.21 | 22m |
| ResNet-preact-20 | LARS | 4096 | 0.015 | cosine | 200 | 8.47 | 22m |
| ResNet-preact-20 | LARS | 4096 | 0.01 | cosine | 200 | 9.33 | 22m |
| ResNet-preact-20 | LARS | 4096 | 0.005 | cosine | 200 | 14.31 | 22m |

| Model | optimizer | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | SGD | 2048 | 3.2 | cosine | 200 | 11.34 (1 run) | 21m |
| ResNet-preact-20 | SGD | 2048 | 2.4 | cosine | 200 | 8.69 (1 run) | 21m |
| ResNet-preact-20 | SGD | 2048 | 2.0 | cosine | 200 | 8.81 (1 run) | 21m |
| ResNet-preact-20 | SGD | 2048 | 1.6 | cosine | 200 | 8.73 (1 run) | 22m |
| ResNet-preact-20 | SGD | 2048 | 0.8 | cosine | 200 | 9.62 (1 run) | 21m |
| ResNet-preact-20 | LARS | 2048 | 0.04 | cosine | 200 | 11.58 | 21m |
| ResNet-preact-20 | LARS | 2048 | 0.02 | cosine | 200 | 8.05 | 22m |
| ResNet-preact-20 | LARS | 2048 | 0.01 | cosine | 200 | 8.07 | 22m |
| ResNet-preact-20 | LARS | 2048 | 0.005 | cosine | 200 | 9.65 | 22m |

| Model | optimizer | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | SGD | 1024 | 3.2 | cosine | 200 | 9.12 (1 run) | 21m |
| ResNet-preact-20 | SGD | 1024 | 2.4 | cosine | 200 | 8.42 (1 run) | 22m |
| ResNet-preact-20 | SGD | 1024 | 2.0 | cosine | 200 | 8.38 (1 run) | 22m |
| ResNet-preact-20 | SGD | 1024 | 1.6 | cosine | 200 | 8.07 (1 run) | 22m |
| ResNet-preact-20 | SGD | 1024 | 1.2 | cosine | 200 | 8.25 (1 run) | 21m |
| ResNet-preact-20 | SGD | 1024 | 0.8 | cosine | 200 | 8.08 (1 run) | 22m |
| ResNet-preact-20 | SGD | 1024 | 0.4 | cosine | 200 | 8.49 (1 run) | 22m |
| ResNet-preact-20 | LARS | 1024 | 0.02 | cosine | 200 | 9.30 | 22m |
| ResNet-preact-20 | LARS | 1024 | 0.01 | cosine | 200 | 7.68 | 22m |
| ResNet-preact-20 | LARS | 1024 | 0.005 | cosine | 200 | 8.88 | 23m |

| Model | optimizer | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | SGD | 512 | 3.2 | cosine | 200 | 8.51 (1 run) | 21m |
| ResNet-preact-20 | SGD | 512 | 1.6 | cosine | 200 | 7.73 (1 run) | 20m |
| ResNet-preact-20 | SGD | 512 | 0.8 | cosine | 200 | 7.73 (1 run) | 21m |
| ResNet-preact-20 | SGD | 512 | 0.4 | cosine | 200 | 8.22 (1 run) | 20m |
| ResNet-preact-20 | LARS | 512 | 0.015 | cosine | 200 | 9.84 | 23m |
| ResNet-preact-20 | LARS | 512 | 0.01 | cosine | 200 | 8.05 | 23m |
| ResNet-preact-20 | LARS | 512 | 0.0075 | cosine | 200 | 7.58 | 23m |
| ResNet-preact-20 | LARS | 512 | 0.005 | cosine | 200 | 7.96 | 23m |
| ResNet-preact-20 | LARS | 512 | 0.0025 | cosine | 200 | 8.83 | 23m |

| Model | optimizer | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | SGD | 256 | 3.2 | cosine | 200 | 9.64 (1 run) | 22m |
| ResNet-preact-20 | SGD | 256 | 1.6 | cosine | 200 | 8.32 (1 run) | 22m |
| ResNet-preact-20 | SGD | 256 | 0.8 | cosine | 200 | 7.45 (1 run) | 21m |
| ResNet-preact-20 | SGD | 256 | 0.4 | cosine | 200 | 7.68 (1 run) | 22m |
| ResNet-preact-20 | SGD | 256 | 0.2 | cosine | 200 | 8.61 (1 run) | 22m |
| ResNet-preact-20 | LARS | 256 | 0.01 | cosine | 200 | 8.95 | 27m |
| ResNet-preact-20 | LARS | 256 | 0.005 | cosine | 200 | 7.75 | 28m |
| ResNet-preact-20 | LARS | 256 | 0.0025 | cosine | 200 | 8.21 | 28m |

| Model | optimizer | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | SGD | 128 | 1.6 | cosine | 200 | 9.03 (1 run) | 24m |
| ResNet-preact-20 | SGD | 128 | 0.8 | cosine | 200 | 7.54 (1 run) | 24m |
| ResNet-preact-20 | SGD | 128 | 0.4 | cosine | 200 | 7.28 (1 run) | 24m |
| ResNet-preact-20 | SGD | 128 | 0.2 | cosine | 200 | 7.96 (1 run) | 24m |
| ResNet-preact-20 | LARS | 128 | 0.005 | cosine | 200 | 7.96 | 37m |
| ResNet-preact-20 | LARS | 128 | 0.0025 | cosine | 200 | 7.98 | 37m |
| ResNet-preact-20 | LARS | 128 | 0.00125 | cosine | 200 | 9.21 | 37m |

| Model | optimizer | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | SGD | 4096 | 1.6 | cosine | 200 | 10.20 | 22m |
| ResNet-preact-20 | SGD | 4096 | 1.6 | cosine | 800 | 8.36 (1 run) | 1h33m |
| ResNet-preact-20 | SGD | 4096 | 1.6 | cosine | 1600 | 8.25 (1 run) | 3h10m |
| ResNet-preact-20 | LARS | 4096 | 0.02 | cosine | 200 | 8.21 | 22m |
| ResNet-preact-20 | LARS | 4096 | 0.02 | cosine | 400 | 7.53 | 44m |
| ResNet-preact-20 | LARS | 4096 | 0.02 | cosine | 800 | 7.48 | 1h29m |
| ResNet-preact-20 | LARS | 4096 | 0.02 | cosine | 1600 | 7.37 (1 run) | 2h58m |

Ghost BN

```
python train.py --config configs/cifar/resnet_preact.yaml \
    model.resnet_preact.depth 20 \
    train.base_lr 1.5 \
    train.batch_size 4096 \
    train.subdivision 32 \
    scheduler.type cosine \
    train.output_dir experiments/resnet_preact_ghost_batch/exp00
```
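With `train.batch_size 4096` and `train.subdivision 32`, batch statistics are computed over ghost batches of 4096 / 32 = 128 samples. A hypothetical module-level sketch of the same idea; the repository realizes it through the subdivision option rather than a custom layer:

```python
import torch
import torch.nn as nn

class GhostBatchNorm2d(nn.BatchNorm2d):
    """BatchNorm2d whose statistics come from fixed-size ghost sub-batches."""

    def __init__(self, num_features: int, ghost_batch_size: int = 128, **kwargs):
        super().__init__(num_features, **kwargs)
        self.ghost_batch_size = ghost_batch_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and x.size(0) > self.ghost_batch_size:
            outputs = []
            # Normalize each ghost batch with its own mean and variance.
            for chunk in x.split(self.ghost_batch_size, dim=0):
                outputs.append(super().forward(chunk))
            return torch.cat(outputs, dim=0)
        return super().forward(x)
```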

| Model | batch size | ghost batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 8192 | N/A | 1.6 | cosine | 200 | 12.35 | 25m* |
| ResNet-preact-20 | 4096 | N/A | 1.6 | cosine | 200 | 10.32 | 22m |
| ResNet-preact-20 | 2048 | N/A | 1.6 | cosine | 200 | 8.73 | 22m |
| ResNet-preact-20 | 1024 | N/A | 1.6 | cosine | 200 | 8.07 | 22m |
| ResNet-preact-20 | 128 | N/A | 0.4 | cosine | 200 | 7.28 | 24m |

| Model | batch size | ghost batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 8192 | 128 | 1.6 | cosine | 200 | 11.51 | 27m |
| ResNet-preact-20 | 4096 | 128 | 1.6 | cosine | 200 | 9.73 | 25m |
| ResNet-preact-20 | 2048 | 128 | 1.6 | cosine | 200 | 8.77 | 24m |
| ResNet-preact-20 | 1024 | 128 | 1.6 | cosine | 200 | 7.82 | 22m |

| Model | batch size | ghost batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 8192 | N/A | 1.6 | cosine | 1600 | | |
| ResNet-preact-20 | 4096 | N/A | 1.6 | cosine | 1600 | 8.25 | 3h10m |
| ResNet-preact-20 | 2048 | N/A | 1.6 | cosine | 1600 | 7.34 | 2h50m |
| ResNet-preact-20 | 1024 | N/A | 1.6 | cosine | 1600 | 6.94 | 2h52m |

| Model | batch size | ghost batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | 8192 | 128 | 1.6 | cosine | 1600 | 11.83 | 3h37m |
| ResNet-preact-20 | 4096 | 128 | 1.6 | cosine | 1600 | 8.95 | 3h15m |
| ResNet-preact-20 | 2048 | 128 | 1.6 | cosine | 1600 | 7.23 | 3h05m |
| ResNet-preact-20 | 1024 | 128 | 1.6 | cosine | 1600 | 7.08 | 2h59m |

No weight decay on BN

```
python train.py --config configs/cifar/resnet_preact.yaml \
    model.resnet_preact.depth 20 \
    train.base_lr 1.6 \
    train.batch_size 4096 \
    train.no_weight_decay_on_bn True \
    train.weight_decay 5e-4 \
    scheduler.type cosine \
    train.output_dir experiments/resnet_preact_no_weight_decay_on_bn/exp00
```
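A sketch of how such an optimizer is typically assembled: BN parameters go into a parameter group with `weight_decay` set to 0. This is a hypothetical helper, not the repository's code, and the momentum value is an assumption:

```python
import torch.nn as nn
import torch.optim as optim

def sgd_without_bn_weight_decay(model: nn.Module, lr: float, weight_decay: float):
    """Build SGD with weight decay disabled on BatchNorm parameters (sketch)."""
    bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
    decay, no_decay = [], []
    for module in model.modules():
        # recurse=False visits each parameter exactly once.
        for param in module.parameters(recurse=False):
            if not param.requires_grad:
                continue
            (no_decay if isinstance(module, bn_types) else decay).append(param)
    return optim.SGD(
        [{'params': decay, 'weight_decay': weight_decay},
         {'params': no_decay, 'weight_decay': 0.0}],
        lr=lr, momentum=0.9)  # momentum value is an assumption
```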

| Model | weight decay on BN | weight decay | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | yes | 5e-4 | 4096 | 1.6 | cosine | 200 | 10.81 | 22m |
| ResNet-preact-20 | yes | 4e-4 | 4096 | 1.6 | cosine | 200 | 10.88 | 22m |
| ResNet-preact-20 | yes | 3e-4 | 4096 | 1.6 | cosine | 200 | 10.96 | 22m |
| ResNet-preact-20 | yes | 2e-4 | 4096 | 1.6 | cosine | 200 | 9.30 | 22m |
| ResNet-preact-20 | yes | 1e-4 | 4096 | 1.6 | cosine | 200 | 10.20 | 22m |
| ResNet-preact-20 | no | 5e-4 | 4096 | 1.6 | cosine | 200 | 8.78 | 22m |
| ResNet-preact-20 | no | 4e-4 | 4096 | 1.6 | cosine | 200 | 9.83 | 22m |
| ResNet-preact-20 | no | 3e-4 | 4096 | 1.6 | cosine | 200 | 9.90 | 22m |
| ResNet-preact-20 | no | 2e-4 | 4096 | 1.6 | cosine | 200 | 9.64 | 22m |
| ResNet-preact-20 | no | 1e-4 | 4096 | 1.6 | cosine | 200 | 10.38 | 22m |

| Model | weight decay on BN | weight decay | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | yes | 5e-4 | 2048 | 1.6 | cosine | 200 | 8.46 | 20m |
| ResNet-preact-20 | yes | 4e-4 | 2048 | 1.6 | cosine | 200 | 8.35 | 20m |
| ResNet-preact-20 | yes | 3e-4 | 2048 | 1.6 | cosine | 200 | 7.76 | 20m |
| ResNet-preact-20 | yes | 2e-4 | 2048 | 1.6 | cosine | 200 | 8.09 | 20m |
| ResNet-preact-20 | yes | 1e-4 | 2048 | 1.6 | cosine | 200 | 8.83 | 20m |
| ResNet-preact-20 | no | 5e-4 | 2048 | 1.6 | cosine | 200 | 8.49 | 20m |
| ResNet-preact-20 | no | 4e-4 | 2048 | 1.6 | cosine | 200 | 7.98 | 20m |
| ResNet-preact-20 | no | 3e-4 | 2048 | 1.6 | cosine | 200 | 8.26 | 20m |
| ResNet-preact-20 | no | 2e-4 | 2048 | 1.6 | cosine | 200 | 8.47 | 20m |
| ResNet-preact-20 | no | 1e-4 | 2048 | 1.6 | cosine | 200 | 9.27 | 20m |

| Model | weight decay on BN | weight decay | batch size | initial lr | lr schedule | # of Epochs | Test Error (median of 3 runs) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | yes | 5e-4 | 1024 | 1.6 | cosine | 200 | 8.45 | 21m |
| ResNet-preact-20 | yes | 4e-4 | 1024 | 1.6 | cosine | 200 | 7.91 | 21m |
| ResNet-preact-20 | yes | 3e-4 | 1024 | 1.6 | cosine | 200 | 7.81 | 21m |
| ResNet-preact-20 | yes | 2e-4 | 1024 | 1.6 | cosine | 200 | 7.69 | 21m |
| ResNet-preact-20 | yes | 1e-4 | 1024 | 1.6 | cosine | 200 | 8.26 | 21m |
| ResNet-preact-20 | no | 5e-4 | 1024 | 1.6 | cosine | 200 | 8.08 | 21m |
| ResNet-preact-20 | no | 4e-4 | 1024 | 1.6 | cosine | 200 | 7.73 | 21m |
| ResNet-preact-20 | no | 3e-4 | 1024 | 1.6 | cosine | 200 | 7.92 | 21m |
| ResNet-preact-20 | no | 2e-4 | 1024 | 1.6 | cosine | 200 | 7.93 | 21m |
| ResNet-preact-20 | no | 1e-4 | 1024 | 1.6 | cosine | 200 | 8.53 | 21m |

Experiments on half-precision and mixed-precision training

FP16 training

```
python train.py --config configs/cifar/resnet_preact.yaml \
    model.resnet_preact.depth 20 \
    train.base_lr 1.6 \
    train.batch_size 4096 \
    train.precision O3 \
    scheduler.type cosine \
    train.output_dir experiments/resnet_preact_fp16/exp00
```

Mixed-precision training

```
python train.py --config configs/cifar/resnet_preact.yaml \
    model.resnet_preact.depth 20 \
    train.base_lr 1.6 \
    train.batch_size 4096 \
    train.precision O1 \
    scheduler.type cosine \
    train.output_dir experiments/resnet_preact_mixed_precision/exp00
```
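The O1/O3 names match NVIDIA apex opt levels, so `train.precision` presumably maps to apex.amp. A sketch under that assumption: O1 is mixed precision with FP32 master weights, while O3 casts everything to FP16, which is consistent with the high error rates in the FP16 rows below.

```python
import torch
import torch.nn as nn
from apex import amp  # assumes NVIDIA apex is installed and a GPU is available

model = nn.Linear(32, 10).cuda()  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1.6, momentum=0.9)
# 'O1' patches selected ops to FP16 and keeps FP32 master weights;
# 'O3' casts the whole model to FP16, fast but numerically fragile.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

x = torch.randn(4096, 32, device='cuda')
loss = model(x).mean()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # loss scaling guards against FP16 gradient underflow
optimizer.step()
```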

Results

| Model | precision | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | FP32 | 8192 | 1.6 | cosine | 200 | | |
| ResNet-preact-20 | FP32 | 4096 | 1.6 | cosine | 200 | 10.32 | 22m |
| ResNet-preact-20 | FP32 | 2048 | 1.6 | cosine | 200 | 8.73 | 22m |
| ResNet-preact-20 | FP32 | 1024 | 1.6 | cosine | 200 | 8.07 | 22m |
| ResNet-preact-20 | FP32 | 512 | 0.8 | cosine | 200 | 7.73 | 21m |
| ResNet-preact-20 | FP32 | 256 | 0.8 | cosine | 200 | 7.45 | 21m |
| ResNet-preact-20 | FP32 | 128 | 0.4 | cosine | 200 | 7.28 | 24m |

| Model | precision | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | FP16 | 8192 | 1.6 | cosine | 200 | 48.52 | 33m |
| ResNet-preact-20 | FP16 | 4096 | 1.6 | cosine | 200 | 49.84 | 28m |
| ResNet-preact-20 | FP16 | 2048 | 1.6 | cosine | 200 | 75.63 | 27m |
| ResNet-preact-20 | FP16 | 1024 | 1.6 | cosine | 200 | 19.09 | 27m |
| ResNet-preact-20 | FP16 | 512 | 0.8 | cosine | 200 | 7.89 | 26m |
| ResNet-preact-20 | FP16 | 256 | 0.8 | cosine | 200 | 7.40 | 28m |
| ResNet-preact-20 | FP16 | 128 | 0.4 | cosine | 200 | 7.59 | 32m |

| Model | precision | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | mixed | 8192 | 1.6 | cosine | 200 | 11.78 | 28m |
| ResNet-preact-20 | mixed | 4096 | 1.6 | cosine | 200 | 10.48 | 27m |
| ResNet-preact-20 | mixed | 2048 | 1.6 | cosine | 200 | 8.98 | 26m |
| ResNet-preact-20 | mixed | 1024 | 1.6 | cosine | 200 | 8.05 | 26m |
| ResNet-preact-20 | mixed | 512 | 0.8 | cosine | 200 | 7.81 | 28m |
| ResNet-preact-20 | mixed | 256 | 0.8 | cosine | 200 | 7.58 | 32m |
| ResNet-preact-20 | mixed | 128 | 0.4 | cosine | 200 | 7.37 | 41m |

Results using Tesla V100

| Model | precision | batch size | initial lr | lr schedule | # of Epochs | Test Error (1 run) | Training Time |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| ResNet-preact-20 | FP32 | 8192 | 1.6 | cosine | 200 | 12.35 | 25m |
| ResNet-preact-20 | FP32 | 4096 | 1.6 | cosine | 200 | 9.88 | 19m |
| ResNet-preact-20 | FP32 | 2048 | 1.6 | cosine | 200 | 8.87 | 17m |
| ResNet-preact-20 | FP32 | 1024 | 1.6 | cosine | 200 | 8.45 | 18m |
| ResNet-preact-20 | mixed | 8192 | 1.6 | cosine | 200 | 11.92 | 25m |
| ResNet-preact-20 | mixed | 4096 | 1.6 | cosine | 200 | 10.16 | 19m |
| ResNet-preact-20 | mixed | 2048 | 1.6 | cosine | 200 | 9.10 | 17m |
| ResNet-preact-20 | mixed | 1024 | 1.6 | cosine | 200 | 7.84 | 16m |

References

Model architecture

Regularization, data augmentation

Large batch

Others