%\VignetteIndexEntry{OOMPA Mahalanobis Distance}
%\VignetteKeywords{OOMPA, PCA, Mahalanobis Distance, Outliers}
%\VignetteDepends{oompaBase,ClassDiscovery}
%\VignettePackage{ClassDiscovery}
\documentclass{article}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{cite}
\pagestyle{myheadings}
\markright{maha-test}
\setlength{\topmargin}{0in}
\setlength{\textheight}{8in}
\setlength{\textwidth}{6.5in}
\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}
\def\rcode#1{\texttt{#1}}
\def\fref#1{\textbf{Figure~\ref{#1}}}
\def\tref#1{\textbf{Table~\ref{#1}}}
\def\sref#1{\textbf{Section~\ref{#1}}}

\title{PCA, Mahalanobis Distance, and Outliers}
\author{Kevin R. Coombes}
\date{4 November 2011}

\begin{document}

<<echo=false>>=
options(width=88)
options(SweaveHooks = list(fig = function() par(bg='white')))
#if (!file.exists("Figures")) dir.create("Figures")
@ 
%\SweaveOpts{prefix.string=Figures/02-AML-27plex, eps=FALSE}

\maketitle
\tableofcontents

\section{Simulated Data}

We simulate a dataset.
<<>>=
set.seed(564684)
nSamples <- 30
nGenes <- 3000
dataset <- matrix(rnorm(nSamples*nGenes), ncol=nSamples, nrow=nGenes)
dimnames(dataset) <- list(paste("G", 1:nGenes, sep=''),
                          paste("S", 1:nSamples, sep=''))
@ 

Now we turn two of the samples into distinct outliers by shifting a
random subset of the genes.
<<>>=
nShift <- 300
affected <- sample(nGenes, nShift)
dataset[affected,1] <- dataset[affected,1] + rnorm(nShift, 1, 1)
dataset[affected,2] <- dataset[affected,2] + rnorm(nShift, 1, 1)
@ 

\section{PCA}

We start with a principal components analysis (PCA) of this dataset.
A plot of the samples against the first two principal components (PCs)
shows two very clear outliers (\fref{spca1}).
<<>>=
library(ClassDiscovery)
spca <- SamplePCA(dataset)
@ 

\begin{figure}
<<fig=true,echo=false>>=
plot(spca)
@ 
\caption{Principal components plot of the samples.}
\label{spca1}
\end{figure}

We want to explore the possibility of an outlier more formally.
First, we look at the cumulative amount of variance explained by the
PCs:
<<>>=
round(cumsum(spca@variances)/sum(spca@variances), digits=2)
@ 
We see that we need $20$ components in order to explain $70\%$ of the
variation in the data.

Next, we compute the Mahalanobis distance of each sample from the
center of an $N$-dimensional principal component space.  We apply the
\texttt{mahalanobisQC} function using different numbers of components,
between $2$ and $20$.
<<>>=
maha2 <- mahalanobisQC(spca, 2)
maha5 <- mahalanobisQC(spca, 5)
maha10 <- mahalanobisQC(spca, 10)
maha20 <- mahalanobisQC(spca, 20)
myd <- data.frame(maha2, maha5, maha10, maha20)
colnames(myd) <- paste("N", rep(c(2, 5, 10, 20), each=2),
                       rep(c(".statistic", ".p.value"), 4), sep='')
@ 

The theory says that, under the null hypothesis that all samples arise
from the same multivariate normal distribution, the distance from the
center of a $d$-dimensional PC space should follow a chi-squared
distribution with $d$ degrees of freedom.  This theory lets us compute
$p$-values associated with the Mahalanobis distances for each sample
(\tref{maha}).
<<results=tex,echo=false>>=
library(xtable)
xtable(myd, digits=c(0, rep(c(1, 4), 4)),
       align=paste("|l|", paste(rep("r", 8), collapse=''), "|", sep=''),
       label="maha",
       caption=paste("Mahalanobis distance (with unadjusted p-values)",
                     "of each sample from the center of",
                     "N-dimensional principal component space."))
@ 

We see that the samples S1 and S2 are outliers, at least when we look
at the first $2$, $5$, or $10$ components.
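As an aside, the chi-squared theory above can be checked numerically
against \tref{maha}.  The following is a minimal sketch; it assumes
that \texttt{mahalanobisQC} returns columns named \rcode{statistic}
(holding the squared Mahalanobis distance to which the chi-squared
theory applies) and \rcode{p.value}, as the column-renaming step above
suggests.
<<>>=
# Sanity check (assumed column names): under the null hypothesis, the
# squared distance in 20-dimensional PC space is chi-squared with 20
# degrees of freedom, so its upper tail probability is the p-value.
all.equal(pchisq(maha20$statistic, df=20, lower.tail=FALSE),
          maha20$p.value)
@ 
If those assumptions hold, \rcode{all.equal} should return
\rcode{TRUE}.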
However, sample S2 is not quite significant (at the $5\%$ level) when
we get out to $20$ components.  This can occur when there are multiple
outliers because of the ``inflated'' variance estimates coming from
the outliers themselves.

\clearpage
\section{A Second Round}

Now we repeat the PCA after removing the one definite outlier.  Sample
S2 still stands out as ``not like the others'' (\fref{spca2}).
<<>>=
reduced <- dataset[,-1]
dim(reduced)
spca <- SamplePCA(reduced)
round(cumsum(spca@variances)/sum(spca@variances), digits=2)
@ 

\begin{figure}
<<fig=true,echo=false>>=
plot(spca)
@ 
\caption{Principal components plot of the samples, after omitting one
  extreme outlier.}
\label{spca2}
\end{figure}

And we can recompute the Mahalanobis distances (\tref{maha2}).  Here
we see that, even out at the level of $20$ components, this sample
remains an outlier.
<<>>=
maha20 <- mahalanobisQC(spca, 20)
@ 
<<echo=false,results=tex>>=
xtable(maha20, digits=c(0, 1, 4),
       align="|l|rr|",
       label="maha2",
       caption=paste("Mahalanobis distance (with unadjusted p-values)",
                     "of each sample from the center of",
                     "20-dimensional principal component space."))
@ 

\clearpage
\section{A Final Round}

We repeat the analysis after removing one more outlier.
<<>>=
red2 <- reduced[,-1]
dim(red2)
spca <- SamplePCA(red2)
round(cumsum(spca@variances)/sum(spca@variances), digits=2)
@ 

\begin{figure}
<<fig=true,echo=false>>=
plot(spca)
@ 
\caption{Principal components plot of the samples, after omitting two
  extreme outliers.}
\label{spca3}
\end{figure}

And we can recompute the Mahalanobis distances (\tref{maha3}).  At
this point, there are no outliers.
<<>>=
maha20 <- mahalanobisQC(spca, 20)
@ 
<<echo=false,results=tex>>=
xtable(maha20, digits=c(0, 1, 4),
       align="|l|rr|",
       label="maha3",
       caption=paste("Mahalanobis distance (with unadjusted p-values)",
                     "of each sample from the center of",
                     "20-dimensional principal component space."))
@ 

\section{Appendix}

This analysis was performed in the following directory:
<<>>=
getwd()
@ 
This analysis was performed in the following software environment:
<<>>=
sessionInfo()
@ 

\end{document}