Origin of the means and stds used for preprocessing? · Issue #1439 · pytorch/vision

Does anyone remember how exactly we came about the channel means and stds we use for the preprocessing?

```python
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
```
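For context, these constants are typically applied at the very end of the preprocessing pipeline, after the image has been converted to a float tensor in [0, 1]. A minimal sketch of the common ImageNet inference pipeline (the exact resize / crop sizes vary between models):

```python
import torch
from torchvision import transforms

# Common ImageNet inference preprocessing: resize, center-crop, convert
# to a float tensor in [0, 1], then normalize each channel with the
# constants in question.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```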

I think the first mention of the preprocessing in this repo is in #39. In that issue @soumith points to https://github.com/pytorch/examples/tree/master/imagenet for reference. If you look at the history of main.py, the commit pytorch/examples@27e2a46 is the first to introduce the values. Unfortunately, it contains no explanation, hence my question.

Specifically, I'm seeking answers to the following questions:

- Which images were used to calculate the values, i.e. the complete train set or only a subset of it?
- Were the images resized and / or cropped before the calculation, and if so, how?

I've tested some combinations and will post my results here.

| Parameters | mean | std |
| --- | --- | --- |
| train set only, no resizing / cropping | [0.4803, 0.4569, 0.4083] | [0.2806, 0.2736, 0.2877] |
| train set only, resize to 256 and center crop to 224 | [0.4845, 0.4541, 0.4025] | [0.2724, 0.2637, 0.2761] |
| train set only, center crop to 224 | [0.4701, 0.4340, 0.3832] | [0.2845, 0.2733, 0.2805] |

While the means match fairly well, the stds differ significantly. One possible explanation is that the std, unlike the mean, is sensitive to whether it is computed per image and then averaged or over all pixels pooled together.
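A toy demonstration of this effect (my own sketch, unrelated to ImageNet): the mean of per-image stds and the std over all pixels pooled together are two different estimators and generally disagree.

```python
import torch

# Ten random "images" with different brightness offsets.
torch.manual_seed(0)
images = [torch.rand(3, 224, 224) + 0.1 * i for i in range(10)]

# Estimator 1: std per image, then averaged.
per_image = torch.stack([img.std() for img in images]).mean()

# Estimator 2: std over all pixels pooled together.
pooled = torch.cat([img.flatten() for img in images]).std()

print(per_image.item(), pooled.item())  # ~0.29 vs. ~0.41 -- they disagree
```

Note that with equal-sized images the two mean estimators coincide, which would be consistent with the means agreeing while the stds do not.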


Update:

The process for obtaining the values of mean and std was roughly equivalent to the following, but the concrete subset that was used is lost:

```python
import torch
from torchvision import datasets, transforms as T

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.PILToTensor(),
    T.ConvertImageDtype(torch.float),
])
dataset = datasets.ImageNet(".", split="train", transform=transform)

# `subset` is a placeholder: which subset of the train set was actually
# used is exactly the information that is lost.
means = []
stds = []
for img, _ in subset(dataset):
    # Per-channel statistics of a single image (shape 3 x H x W).
    means.append(img.mean(dim=(1, 2)))
    stds.append(img.std(dim=(1, 2)))

# Average the per-image statistics over the subset.
mean = torch.stack(means).mean(dim=0)
std = torch.stack(stds).mean(dim=0)
```
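For comparison, here is a sketch (my addition, not part of the original reconstruction) of the pooled alternative: accumulating per-channel sums and squared sums over every pixel and deriving the global statistics from them. The batch size and worker count are arbitrary choices:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms as T

transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.PILToTensor(),
    T.ConvertImageDtype(torch.float),
])
dataset = datasets.ImageNet(".", split="train", transform=transform)
loader = DataLoader(dataset, batch_size=64, num_workers=8)

# Accumulate per-channel sums and squared sums over all pixels.
num_pixels = 0
channel_sum = torch.zeros(3)
channel_sq_sum = torch.zeros(3)
for imgs, _ in loader:  # imgs has shape B x 3 x H x W
    num_pixels += imgs.numel() // 3
    channel_sum += imgs.sum(dim=(0, 2, 3))
    channel_sq_sum += (imgs ** 2).sum(dim=(0, 2, 3))

mean = channel_sum / num_pixels
# Population std via E[x^2] - E[x]^2.
std = (channel_sq_sum / num_pixels - mean ** 2).sqrt()
```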

See #1965 for the reproduction experiments.