Help for package PAC

Type: Package
Title: Partition-Assisted Clustering and Multiple Alignments of Networks
Version: 1.1.4
Date: 2021-02-17
Author: Ye Henry Li, Dangna Li
Maintainer: Ye Henry Li <hlowl2@gmail.com>
Description: Implements partition-assisted clustering and multiple alignments of networks. It 1) utilizes partition-assisted clustering to find robust and accurate clusters and 2) discovers coherent relationships of clusters across multiple samples. It is particularly useful for analyzing single-cell data sets. Please see Li et al. (2017) <doi:10.1371/journal.pcbi.1005875> for a detailed description of the method.
URL: https://doi.org/10.1371/journal.pcbi.1005875
License: GPL-3
Imports: Rcpp (≥ 0.12.2), igraph, parmigene, infotheo, dplyr, Rtsne, ggplot2, ggrepel
Suggests: knitr, rmarkdown
VignetteBuilder: knitr
LinkingTo: Rcpp
RoxygenNote: 5.0.1
NeedsCompilation: yes
SystemRequirements: C++11
Packaged: 2021-02-18 06:20:59 UTC; henryli
Repository: CRAN
Date/Publication: 2021-02-18 07:00:02 UTC

Finds N Leaf centers in the data

Description

Finds N Leaf centers in the data

Usage

BSPLeaveCenter(data, N = 40, method = "dsp")

Arguments

data an n x p data matrix
N number of leaf centers
method partition method, either "dsp" (discrepancy-based partition) or "ll" (Bayesian sequential partition with limited look-ahead)

Value

leafctr the N leaf centers


Calculates the Jaccard similarity matrix.

Description

Calculates the Jaccard similarity matrix.

Usage

JaccardSM(network1, network2)

Arguments

network1 first network matrix input
network2 second network matrix input

Value

the alignment/co-occurrence score
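
As an illustration of the underlying idea only, and not the package's implementation, the Jaccard similarity of two networks can be taken as the number of shared edges divided by the number of edges present in either network. The helper `jaccard_edges` below is hypothetical; it assumes both inputs are square adjacency matrices of undirected networks over the same nodes.

```r
# Hypothetical helper, not part of PAC: Jaccard index of two adjacency
# matrices defined over the same set of nodes.
jaccard_edges <- function(a, b) {
  stopifnot(all(dim(a) == dim(b)))
  e1 <- a[upper.tri(a)] != 0   # edges of network 1 (undirected, no self-loops)
  e2 <- b[upper.tri(b)] != 0   # edges of network 2
  union_size <- sum(e1 | e2)
  if (union_size == 0) return(0)   # convention chosen here for two empty networks
  sum(e1 & e2) / union_size
}

net1 <- matrix(0, 4, 4); net1[1, 2] <- net1[2, 1] <- 1; net1[2, 3] <- net1[3, 2] <- 1
net2 <- matrix(0, 4, 4); net2[1, 2] <- net2[2, 1] <- 1; net2[3, 4] <- net2[4, 3] <- 1
jaccard_edges(net1, net2)   # 1 shared edge out of 3 distinct edges
```

Only the upper triangle is compared so that each undirected edge is counted once.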


Creates network alignments using networks constructed from subpopulations after PAC

Description

Creates network alignments using networks constructed from subpopulations after PAC

Usage

MAN(sampleIDs, num_PACSupop, smallSubpopCutoff, k_clades)

Arguments

sampleIDs sampleID vector
num_PACSupop number of subpopulations learned in PAC step for each sample
smallSubpopCutoff Population size cutoff for subpopulations in clade calculation. The small subpopulations will be considered in the refinement step.
k_clades number of clades to output before refinement

Value

clades_network_only the clades constructed without small subpopulations (by cutoff) using mutual information network alignments


Plots the mutual information network (mrnet algorithm) connections using the parmigene package. Mutual information is calculated with the infotheo package.

Description

Plots the mutual information network (mrnet algorithm) connections using the parmigene package. Mutual information is calculated with the infotheo package.

Usage

MINetworkPlot_topEdges(dataMatrix, threshold)

Arguments

dataMatrix data matrix
threshold the maximum number of edges to draw for each subpopulation mutual information network

Generates the mutual information network connection matrix (mrnet algorithm) using the parmigene package. Mutual information is calculated with the infotheo package.

Description

Generates the mutual information network connection matrix (mrnet algorithm) using the parmigene package. Mutual information is calculated with the infotheo package.

Usage

MINetwork_matrix_topEdges(dataMatrix, threshold)

Arguments

dataMatrix data matrix
threshold the number of edges to draw for each subpopulation mutual information network

Value

the mutual information network connection matrix with top edges
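
The "top edges" step can be pictured with a small base R sketch. It covers only the thresholding, assumes a symmetric edge-weight matrix is already available from a network inference step, and uses a hypothetical helper name `keep_top_edges`; it is not the package's code.

```r
# Hypothetical helper, not part of PAC: keep the `threshold` strongest edges
# of a symmetric weight matrix and return a 0/1 connection matrix.
keep_top_edges <- function(w, threshold) {
  idx <- which(upper.tri(w) & w > 0)            # candidate edges, each counted once
  idx <- idx[order(w[idx], decreasing = TRUE)]  # strongest first
  idx <- head(idx, threshold)                   # at most `threshold` edges survive
  adj <- matrix(0, nrow(w), ncol(w), dimnames = dimnames(w))
  adj[idx] <- 1
  adj + t(adj)                                  # restore symmetry
}

w <- matrix(0, 3, 3, dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
w["A", "B"] <- w["B", "A"] <- 0.9
w["A", "C"] <- w["C", "A"] <- 0.2
w["B", "C"] <- w["C", "B"] <- 0.5
top <- keep_top_edges(w, threshold = 2)         # keeps A-B and B-C, drops A-C
```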


Outputs the vectorized summary of a network based on the number of edges connected to a node

Description

Outputs the vectorized summary of a network based on the number of edges connected to a node

Usage

MINetwork_simplified_topEdges(dataMatrix, threshold)

Arguments

dataMatrix data matrix
threshold the number of edges to draw for each subpopulation mutual information network
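
One way to read "the number of edges connected to a node" is as the node degree of the connection matrix. The lines below are a minimal base R illustration under that assumption, with made-up node names; they are not the package's code.

```r
# Illustration only, not PAC code: summarize a network by node degree.
adj <- matrix(c(0, 1, 1,
                1, 0, 0,
                1, 0, 0), nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
degree <- rowSums(adj != 0)   # number of edges touching each node
degree
```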

Partition-Assisted Clustering (PAC) 1) uses dsp or bsp-ll to recursively partition the data space and 2) applies a short round of k-means-style postprocessing to efficiently output cluster labels for the data points.

Description

Partition-Assisted Clustering (PAC) 1) uses dsp or bsp-ll to recursively partition the data space and 2) applies a short round of k-means-style postprocessing to efficiently output cluster labels for the data points.

Usage

PAC(data, K, maxlevel = 40, method = "dsp", max.iter = 50)

Arguments

data an n x p data matrix
K number of final clusters in the output
maxlevel the maximum level of the partition
method partition method, either "dsp" (discrepancy-based partition) or "bsp" (Bayesian sequential partition)
max.iter maximum number of iterations for the k-means step

Value

y cluster labels for the input

Examples

n = 5e3                       # number of observations
p = 1                         # number of dimensions
K = 3                         # number of clusters
w = rep(1,K)/K                # component weights
mu <- c(0,2,4)                # component means
sd <- rep(1,K)/K              # component standard deviations
g <- sample(1:K,prob=w,size=n,replace=TRUE)   # ground truth for clustering
X <- as.matrix(rnorm(n=n,mean=mu[g],sd=sd[g]))
y <- PAC(X, K)
print(fmeasure(g,y))
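
The two-step idea in the description (partition first, then a short k-means-style refinement) can be mimicked in base R. This is a loose sketch and not the PAC algorithm: equal-width bins stand in for the recursive dsp/bsp partition, hierarchical clustering of the bin centers is an assumed way to obtain K seeds, and `stats::kmeans` plays the postprocessing role.

```r
# Illustration only, not the PAC algorithm.
set.seed(1)
x <- c(rnorm(200, mean = 0, sd = 0.3), rnorm(200, mean = 4, sd = 0.3))
K <- 2

# Step 1 (stand-in partition): split the range into leaves, one center per leaf.
leaf    <- cut(x, breaks = 20)
centers <- tapply(x, leaf, mean)
centers <- centers[!is.na(centers)]              # drop empty leaves

# Step 2: group leaf centers into K seeds, then run a few k-means iterations.
groups <- cutree(hclust(dist(centers)), k = K)
seeds  <- tapply(centers, groups, mean)
fit    <- kmeans(matrix(x), centers = matrix(seeds), iter.max = 5)
tab    <- table(fit$cluster, rep(1:2, each = 200))
tab
```

Seeding k-means from leaf centers means only a handful of iterations are needed, which is the efficiency argument made in the description.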

Aggregates results from the clustering and merging step.

Description

Aggregates results from the clustering and merging step.

Usage

aggregateData(dataInput, labelsInput)

Arguments

dataInput Data matrix, with first column being SampleID.
labelsInput cluster labels from PAC.

Value

The aggregated data of dataInput, with average signal levels for all clusters and sample combinations.

Examples

n = 5e3                       # number of observations
p = 1                         # number of dimensions
K = 3                         # number of clusters
w = rep(1,K)/K                # component weights
mu <- c(0,2,4)                # component means
sd <- rep(1,K)/K              # component standard deviations
g <- sample(1:K,prob=w,size=n,replace=TRUE)   # ground truth for clustering
X <- as.matrix(rnorm(n=n,mean=mu[g],sd=sd[g]))
y <- PAC(X, K)
X2<-as.matrix(rnorm(n=n,mean=mu[g],sd=sd[g]))
y2<-PAC(X2,K)
X<-cbind("Sample1", as.data.frame(X)); colnames(X)<-c("SampleID", "Value")
X2<-cbind("Sample2", as.data.frame(X2)); colnames(X2)<-c("SampleID", "Value")
aggregateData(rbind(X,X2),c(y,y2))
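
For intuition about the returned value, the same kind of per-sample, per-cluster averaging can be written with base R's `aggregate`. This is a sketch on made-up data, not the package's code; the column name `ClusterID` is an assumption introduced here.

```r
# Illustration in base R, not PAC code: mean signal per (sample, cluster) pair.
dat <- data.frame(
  SampleID = c("Sample1", "Sample1", "Sample1", "Sample2", "Sample2"),
  Value    = c(1, 3, 10, 2, 4)
)
labels <- c(1, 1, 2, 1, 1)    # hypothetical cluster labels, one per row

agg <- aggregate(dat["Value"],
                 by = list(SampleID = dat$SampleID, ClusterID = labels),
                 FUN = mean)
agg
```

Each row of `agg` corresponds to one sample and cluster combination that actually occurs in the data.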

Creates annotation matrix for the clades in aggregated format. The matrix contains average signals of each dimension for each clade in each sample

Description

Creates annotation matrix for the clades in aggregated format. The matrix contains average signals of each dimension for each clade in each sample

Usage

annotateClades(sampleIDs, topHubs)

Arguments

sampleIDs sampleID vector
topHubs number of top ranked genes to output for annotation; annotation is a concatenated list of top ranked genes.

Value

Annotated clade matrix


Adds subpopulation proportion for the annotation matrix for the clades

Description

Adds subpopulation proportion for the annotation matrix for the clades

Usage

annotationMatrix_withSubpopProp(aggregateMatrix_withAnnotation)

Arguments

aggregateMatrix_withAnnotation the annotated clade matrix

Value

Annotated clade matrix with subpopulation proportions


Description

Makes a constellation plot, in which the centroids of clusters are embedded in the t-SNE 2D plane and the cross-sample relationships are plotted as lines connecting related sample clusters (clades).

Usage

constellationPlot(pacman_results, perplexity, max_iter, seed,
  plotTitle = "Constellations of Clades", nudge_x = 0.3, nudge_y = 0.3)

Arguments

pacman_results PAC-MAN analysis result matrix that contains network annotation, clade IDs and mean (centroid) clade expression levels.
perplexity perplexity setting for running t-SNE
max_iter max_iter setting for running t-SNE
seed seed that makes the t-SNE embedding and constellation plot reproducible
plotTitle title of the plot
nudge_x nudge on x coordinate of centroid labels
nudge_y nudge on y coordinate of centroid labels

F-measure Calculation

Description

Compute the F measure between the ground truth and the estimated label

Usage

fmeasure(g, t)

Arguments

g the ground truth
t estimated labels

Value

f the F measure
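
The documentation does not spell out which F-measure variant is computed. A common clustering definition matches each true class to its best-scoring cluster and weights the result by class size; the hypothetical helper `clustering_f` below sketches that definition and may differ from what `fmeasure` does.

```r
# Hypothetical helper, not the package's fmeasure(): a common clustering
# F-measure that matches each true class to its best-scoring cluster.
clustering_f <- function(truth, est) {
  tab <- table(truth, est)                       # contingency table
  precision <- sweep(tab, 2, colSums(tab), "/")  # fraction of cluster j in class i
  recall    <- sweep(tab, 1, rowSums(tab), "/")  # fraction of class i in cluster j
  f <- 2 * precision * recall / (precision + recall)
  f[is.nan(f)] <- 0                              # empty cells contribute nothing
  sum(rowSums(tab) / sum(tab) * apply(f, 1, max))
}

truth <- c(1, 1, 1, 2, 2, 2)
f_perfect <- clustering_f(truth, c(2, 2, 2, 1, 1, 1))   # 1: label names do not matter
f_partial <- clustering_f(truth, c(1, 1, 2, 2, 2, 2))   # below 1: one point misassigned
```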


Calculate the (global) average spread of subpopulations in clades with 2 subpopulations on the constellation plot.

Description

Calculate the (global) average spread of subpopulations in clades with 2 subpopulations on the constellation plot.

Usage

getAverageSpreadOf2SubpopClades(tsneResults, pacman_results)

Arguments

tsneResults t-SNE output of clade centroids' embedding.
pacman_results PAC-MAN analysis result matrix that contains network annotation, clade IDs and mean (centroid) clade expression levels.

Value

Returns global average of 2-subpopulation clade spread on the constellation plot.
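
A rough picture of this quantity, under the assumption that the spread of a two-subpopulation clade is the Euclidean distance between its two embedded centroids (the documentation does not define it precisely). The coordinates and clade IDs are made up, and this is not the package's code.

```r
# Illustration only, not PAC code.
embedding <- rbind(c(0, 0), c(3, 4),    # clade "c1": two subpopulations
                   c(1, 1), c(1, 3),    # clade "c2": two subpopulations
                   c(9, 9))             # clade "c3": one subpopulation, ignored
clade <- c("c1", "c1", "c2", "c2", "c3")

pair_clades <- names(which(table(clade) == 2))   # clades with exactly two members
spreads <- sapply(pair_clades, function(id) {
  pts <- embedding[clade == id, , drop = FALSE]
  sqrt(sum((pts[1, ] - pts[2, ])^2))             # distance between the two centroids
})
mean(spreads)   # average of 5 and 2
```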


Description

Calculates subpopulations in clades (with two or more subpopulations) that are too far away from other subpopulations (within the same clade) on the constellation plot; these far away subpopulations should be pruned away from the original clades.

Usage

getExtraneousCladeSubpopulations(tsneResults, pacman_results,
  threshold_multiplier, max_threshold)

Arguments

tsneResults t-SNE output of clade centroids' embedding.
pacman_results PAC-MAN analysis result matrix that contains network annotation, clade IDs and mean (centroid) clade expression levels.
threshold_multiplier multiplier applied to the threshold, where the threshold is (a) the spread from the center of the clade for clades with three or more sample subpopulations and (b) the distance from each subpopulation centroid for clades with exactly two subpopulations.
max_threshold the maximum distance (on t-SNE plane) allowed for sample subpopulations to be categorized into the same clade.

Value

Returns clade subpopulations to be pruned.


Representative Networks

Description

Outputs representative networks for clades/subpopulations larger than a size filter (very small subpopulations are not considered in downstream analyses)

Usage

getRepresentativeNetworks(sampleIDs, dim_subset, SubpopSizeFilter,
  num_networkEdge)

Arguments

sampleIDs sampleID vector
dim_subset a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns
SubpopSizeFilter the cutoff for small subpopulations. Smaller subpopulations have unstable covariance structure, so no network structure is calculated
num_networkEdge the number of edges to draw for each subpopulation mutual information network

Creates the matrix that can be easily plotted with a heatmap function available in an R package

Description

Creates the matrix that can be easily plotted with a heatmap function available in an R package

Usage

heatmapInput(aggregateMatrix_withAnnotation)

Arguments

aggregateMatrix_withAnnotation the annotated clade matrix

Value

the heatmap input matrix


Wrapper to output the mutual information networks for subpopulations with size larger than a desired threshold.

Description

Wrapper to output the mutual information networks for subpopulations with size larger than a desired threshold.

Usage

outputNetworks_topEdges_matrix(dataMatrix, subpopulationLabels, threshold)

Arguments

dataMatrix data matrix with first column being the sample ID
subpopulationLabels the subpopulation labels
threshold the number of edges to draw for each subpopulation mutual information network

Outputs the representative/clade networks (plots and summary vectors) for subpopulations with size larger than a desired threshold. Saves the networks and the data matrices without the smaller subpopulations.

Description

Outputs the representative/clade networks (plots and summary vectors) for subpopulations with size larger than a desired threshold. Saves the networks and the data matrices without the smaller subpopulations.

Usage

outputRepresentativeNetworks_topEdges(dataMatrix, subpopulationLabels,
  threshold)

Arguments

dataMatrix data matrix with first column being the sample ID
subpopulationLabels the subpopulation labels
threshold the number of edges to draw for each subpopulation mutual information network

Calculates the within cluster spread

Description

Calculates the within cluster spread

Usage

recordWithinClusterSpread(sampleIDs, dim_subset = NULL, SubpopSizeFilter)

Arguments

sampleIDs A vector of sample names.
dim_subset a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns.
SubpopSizeFilter threshold to filter out very small clusters with too few points; these very small subpopulations may be outliers and not biologically relevant.

Value

Returns the sample within cluster spread


Refines the subpopulation labels from PAC using network alignment and small subpopulation information. Outputs a new set of files containing the representative labels.

Description

Refines the subpopulation labels from PAC using network alignment and small subpopulation information. Outputs a new set of files containing the representative labels.

Usage

refineSubpopulationLabels(sampleIDs, dim_subset, clades_network_only,
  expressionGroupClamp)

Arguments

sampleIDs sampleID vector
dim_subset a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns
clades_network_only the alignment results from MAN; used to translate the original sample-specific labels into clade labels
expressionGroupClamp clamps the subpopulations into desired number of expression groups for assigning small subpopulations into larger groups or their own groups.

Prune away specified subpopulations in clades that are far away.

Description

Prune away specified subpopulations in clades that are far away.

Usage

renamePrunedSubpopulations(pacman_results, subpopulationsToPrune)

Arguments

pacman_results PAC-MAN analysis result matrix that contains network annotation, clade IDs and mean (centroid) clade expression levels.
subpopulationsToPrune A vector of clade IDs; these clades will be pruned.

Value

Returns PAC-MAN analysis result matrix with pruned clades. The pruning process creates new clades to replace the original clade ID of the specified subpopulations.


Runs elbow point analysis to find the practical optimal number of clades to output. Outputs the average within sample cluster spread for all samples and the elbow point analysis plot with loess line fitted through the results.

Description

Runs elbow point analysis to find the practical optimal number of clades to output. Outputs the average within sample cluster spread for all samples and the elbow point analysis plot with loess line fitted through the results.

Usage

runElbowPointAnalysis(ks, sampleIDs, dim_subset, num_PACSupop,
  smallSubpopCutoff, expressionGroupClamp, SubpopSizeFilter)

Arguments

ks Vector that is a sequence of clade sizes.
sampleIDs A vector of sample names.
dim_subset a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns.
num_PACSupop Number of PAC subpopulations explored in each sample.
smallSubpopCutoff Size cutoff below which minor subpopulations are not used in the multiple alignments of networks.
expressionGroupClamp clamps the subpopulations into desired number of expression groups for assigning small subpopulations into larger groups or their own groups.
SubpopSizeFilter threshold to filter out very small clusters with too few points in the calculation of cluster spreads; these very small subpopulations may be outliers and not biologically relevant.
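
To make the elbow idea concrete, the sketch below fits a loess curve through made-up average spreads and picks an elbow with one simple heuristic: the point with the largest vertical gap below the chord joining the first and last fitted values. The heuristic and all numbers are assumptions for illustration; the documentation only states that a loess line is fitted through the results.

```r
# Illustration only, not PAC code.
ks     <- seq(10, 100, by = 10)                  # candidate numbers of clades
spread <- c(9.0, 6.1, 4.2, 3.1, 2.6, 2.3, 2.15, 2.05, 2.0, 1.95)  # made-up values

fit      <- loess(spread ~ ks, span = 0.75)
smoothed <- predict(fit, data.frame(ks = ks))

# Straight line from the first to the last fitted value.
chord <- smoothed[1] + (smoothed[length(ks)] - smoothed[1]) *
  (ks - ks[1]) / (ks[length(ks)] - ks[1])
elbow_k <- ks[which.max(chord - smoothed)]       # largest vertical gap below the chord
```

Beyond the elbow, adding more clades reduces the within-cluster spread only marginally, which is why it serves as a practical choice.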

Run PAC for Specified Samples

Description

A wrapper to run PAC and output subpopulation mutual information networks. Please use the PAC function itself for individual samples or if the MAN step is not needed.

Usage

samplePass(sampleIDs, dim_subset, hyperrectangles, num_PACSupop, max.iter,
  num_networkEdge)

Arguments

sampleIDs sampleID vector
dim_subset a character vector of column names used to subset the data columns for PAC; set to NULL to use all columns
hyperrectangles number of hyperrectangles to learn for each sample
num_PACSupop number of subpopulations to output for each sample using PAC
max.iter maximum number of k-means postprocessing iterations
num_networkEdge a threshold on the number of edges to output for each subpopulation mutual information network