When to set geoMeans in estimateSizeFactors · Issue #445 · joey711/phyloseq (original) (raw)

Hi there,
I am currently dealing with a dataset of 16S OTUs from sediment of two Antarctic cruises. I notice that in some of the phyloseq tutorials, the following is used:

dds1 = phyloseq_to_deseq2(phyloseq_object, ~some_factor)
gm_mean = function(x, na.rm=TRUE){
    exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}
geoMeans = apply(counts(dds1), 1, gm_mean)
dds1 = estimateSizeFactors(dds1, geoMeans=geoMeans)

Other times, a standard call to estimateSizeFactors is used:

dds1 = phyloseq_to_deseq2(phyloseq_object, ~some_factor)
dds2= estimateSizeFactors(dds2)

According to the help associated with estimateSizeFactors, geoMeans is used to provide a frozen, pre-calculated set of geometric means. If this is not provided, then the function calculates geometric means from the data. I ran estimateSizeFactors with and without specifying geoMeans and it makes a major difference to the results:

> head(dds1$sizeFactor)
ANT_C1_1 ANT_C1_10 ANT_C1_11 ANT_C1_12 ANT_C1_13  ANT_C1_2 
1.837425  1.943064  1.943064  1.037106  1.943064  1.856124 
>head(dds2$sizeFactor)
 ANT_C1_1 ANT_C1_10 ANT_C1_11 ANT_C1_12 ANT_C1_13  ANT_C1_2 
3.6321766 4.1681911 2.0902822 0.6115556 2.3290326 2.5426859

I'm not quite sure why they are giving different values. From what I can see, the method gm_mean is just a standard calculation of geometric means. However, it has a major effect on normalization by rlog, as can be seen in the two attached meanSdPlot figures, generated with the following:

rld2<-rlogTransformation(dds2, blind=FALSE)
rlogMat2<-assay(rld2)
vst2<-varianceStabilizingTransformation(dds2, blind=FALSE)
vstMat2<-assay(vst2)
library('vsn')
op<-par(mfrow=c(1,3))
meanSdPlot(log2(counts(dds2, normalized=TRUE)+1), main='shifted log')
meanSdPlot(rlogMat2, main='rlog transformed')
meanSdPlot(vstMat2, main='VST transformed')
par(op)

My question is two-fold:

  1. Why does pre-calculating the geometric mean give different values?
  2. When is it appropriate to pre-calculate geometric means as in the first example?

Many thanks,

Ben

geomean set not blind
geomean auto not blind