L-shaped selection from expression and methylation data (original) (raw)

The LHeuristic package implements the heuristic algorithm by Sanchez et al. (2019) to detect L-shaped scatterplots. However, since many researchers rely on negative correlation to identify genes “regulated by methylation,” the package includes both methods for completeness.

Each method has various parameters optimized for selecting L-shaped patterns and detecting negative correlations in scatterplot data.

The functions also support parameter tuning, allowing for other forms of correlation selection.

Correlation method

The correlation method is a simple approach that identifies genes with a significant negative correlation between their expression and methylation values.

cl <- correlationSelection(mae1, type = "Spearman",
    pValCutoff = 0.25, rCutoff = -0.5, adj = TRUE)

correlationL <- cl[!is.na(cl$SigNegCorr) & 
    cl$SigNegCorr, ]
correlationNoL <- cl[!is.na(cl$SigNegCorr)& 
    !cl$SigNegCorr, ]

message(
    "The number of genes selected 
    with the correlation method is: ", sum(correlationL$SigNegCorr),
    "\n"
)
#> The number of genes selected 
#>     with the correlation method is: 27

We can observe the distribution of the correlation coefficients:

d <- density(correlationL[, 1])
x2 <- data.frame(x = d$x, y = d$y)

library(ggplot2)
ggplot(x2, aes(x,y)) + geom_line() +
    labs(
    title = "Significant correlations in the TCGA dataset",
    x = "Correlation",
    y = "Density"
    )+
    theme_minimal()

The number of genes retained depends on the p-value and r cutoff.

The result is a table listing “selected” and “unselected” genes, with columns for the r coefficient, p-value, adjusted p-value, distance correlation, and a Boolean indicating whether the gene was selected based on a significantly negative correlation.

head(correlationL)
#>            r (Sp)       p (Sp) adj.Spear.Pval   distCor SigNegCorr
#> ABCC2  -0.8754171 2.441980e-10   2.441980e-07 0.8541382       TRUE
#> ACOX2  -0.6222469 2.412488e-04   1.855760e-02 0.6833996       TRUE
#> ACSL5  -0.5466073 1.776232e-03   8.073781e-02 0.5711910       TRUE
#> ADCY7  -0.5808676 7.635690e-04   4.242050e-02 0.6024344       TRUE
#> ADGRE3 -0.5675195 1.072547e-03   5.362737e-02 0.5547401       TRUE
#> ADH6   -0.6298109 1.919776e-04   1.599813e-02 0.6159664       TRUE

The selected genes can be plotted and saved as a PDF (leaving the file name empty displays the plot on screen). The figure below shows the first four selected genes.

genes2plot <- rownames(correlationL)[1:3]

plotGenesMat(mae1, geneNames = genes2plot,
    text4Title = correlationL[rownames(correlationL), ""],
    saveToPDF = FALSE
)

To plot scatterplots showing the relationship between methylation and expression for a list of genes, use the function plotGenesMat.

Heuristic method

The heuristic method identifies L-shaped scatterplots by overlaying a grid on the graph and defining specific cells that must contain a minimum (or maximum) percentage of points for the scatterplot to qualify as L-shaped.

It computes a score where points in the designated L region receive positive values, while points outside this region receive negative values. Properly adjusting scores and weights results in positive scores for L-shaped scatterplots and negative scores for non-L-shaped ones.

This concept can be clarified with a “three-band rule”:

Overlay a \(3 \times 3\) grid on the scatterplot.
Classify the scatterplot as “L” or “non-L” based on a few conditions
2.1. A minimum number of points must be in the upper-left (cell (1,1)) and lower-right (cell (3,3)) corners. 2.2. A maximum number of points must be in the upper-right (cell (1,3)), as points here indicate hypermethylation and hyperexpression, which are contrary to our goal. 2.3. A minimum of points in cell (3,1) is usually not required unless we strictly want an L-shape; we can also accept diagonals that reflect a negative correlation.
Score points in each sub-grid as follows: 3.1. Points in permitted regions (left margin: cells (1,1), (2,2), (3,1), (3,2), (3,3)) score positively if classified as L, or zero if classified as non-L. 3.2. Points in non-desired regions (outer band: cells (1,2), (1,3), (2,3)) always score negatively. 3.3. Some regions, like cell (2,2), may be neutral and not score.
Tune scoring parameters either manually (based on experience and dataset characteristics) or through cross-validation (if a set of positive and negative L-shaped genes is available).

The scheme can be summarized using the following equation:

\[ S(X) = W_L \circ X \times \mathbf{1}_L(X) + W_{L^C} \circ X \times \mathbf{1}_{L^c}(X) \] where:

\(X\) is the matrix of counts, representing the number of counts in each cell of the grid.
\(W_L\) is the matrix of scores per cell and point if the scatterplot is classified as \(L\).
\(W_{L^c}\) is the matrix of scores per cell and point if the scatterplot is classified as non-\(L\) (\(L^c\)). The symbol \(\circ\) denotes the Hadamard product of the two matrices \(W_{L/L^c}\) (i.e., element-wise multiplication), and \(\mathbf{1}_{L/L^c}()\) is the indicator function for \(L\) or \(L^c\).

The classification of the scatterplot as \(L\) or \(L^c\) can also be represented by the Hadamard product of three matrices:

\[ \mathbf{1}_L(X) = \bigwedge_{i,j} X \circ C \circ \left( mMP \times \sum_{i,j} x_{ij} \right), \]

where: - \(X\) is the matrix of counts, indicating the number of counts in each grid cell.

\(C\) is the matrix of conditions to verify if the scatterplot should be classified as \(L\).
\(mMP\) is the matrix of minimum and maximum percentages of points required in each cell for classification as \(L\).
\(\circ\) represents point-wise logical operations, allowing the product of the three matrices to act as a logical operation.
\(\bigwedge_{i,j}\) denotes a logical “AND” operation across all cells; if all cells are TRUE, the result is assigned to \(L\); if any cell fails, it is assigned to \(L^c\).

sampleSize <- dim(mae1[[2]])[2]
numGenes <- dim(mae1[[2]])[1]

reqPercentages <- matrix(c(2, 20, 5, 5, 40, 20, 3, 3, 2), 
    nrow = 3, byrow = TRUE)
sum(reqPercentages)
#> [1] 100
(maxminCounts <- toReqMat(sampleSize, reqPercentages))
#>      [,1] [,2] [,3]
#> [1,]    1    6    2
#> [2,]    2   12    6
#> [3,]    1    1    1

(theWeightMifL <- matrix(c(2, -2, -sampleSize / 5, 1, 0, -2, 1, 1, 2), 
    nrow = 3, byrow = TRUE))
#>      [,1] [,2] [,3]
#> [1,]    2   -2   -6
#> [2,]    1    0   -2
#> [3,]    1    1    2
(theWeightMifNonL <- matrix(c(0, -2, -sampleSize / 5, 0, 0, -2, 0, 0, 0),
    nrow = 3,
    byrow = TRUE
))
#>      [,1] [,2] [,3]
#> [1,]    0   -2   -6
#> [2,]    0    0   -2
#> [3,]    0    0    0

heur <- scoreGenesMat(mae1,
    aReqPercentsMat = reqPercentages, aWeightMifL = theWeightMifL,
    aWeightMifNonL = theWeightMifNonL
)

message("Number of scatterplots scored  : ", dim(heur)[1], "\n")
#> Number of scatterplots scored  : 1000
message("Number of L-shape scatterplots : ", sum(heur[, 1]), "\n")
#> Number of L-shape scatterplots : 67

heurL <- heur[heur$logicSc, ]
heurNoL <- heur[!heur$logicSc, ]

We can check the results in the following table, which includes a logical value indicating whether the gene exhibits an L-shape based on our criteria, along with the numerSc score.

logicSc	numericSc
A2ML1	TRUE	18
A4GNT	TRUE	-4
ABCB5	TRUE	10
ACADL	TRUE	32
ACOT2	TRUE	32
ACOT8	TRUE	36
ACOX2	TRUE	39
ACRBP	TRUE	-6
ADAM11	TRUE	30
ADAM19	TRUE	33
ADAM8	TRUE	14
ADAMTS1	TRUE	17
ADAMTS14	TRUE	1
ADAMTS3	TRUE	31
ADGRL2	TRUE	28
ADH6	TRUE	31
AEBP1	TRUE	8
AGPAT4	TRUE	32
AGT	TRUE	27
AGTR1	TRUE	26
ALDH1A2	TRUE	27
ALDH1A3	TRUE	9
ALDH1L1	TRUE	34
ALDOB	TRUE	32
ALOX5AP	TRUE	16
ALX1	TRUE	26
AMBN	TRUE	-4
AMIGO2	TRUE	34
AMMECR1L	TRUE	10
ANGPTL5	TRUE	34
ANK2	TRUE	12
ANKRD53	TRUE	12
ANTXR1	TRUE	18
ANXA9	TRUE	31
AOX1	TRUE	-2
AP3B2	TRUE	36
APH1B	TRUE	31
APOC2	TRUE	24
APOLD1	TRUE	35
AQP9	TRUE	9
ARHGDIB	TRUE	14
ARHGDIG	TRUE	28
ARHGEF10	TRUE	36
ARHGEF25	TRUE	12
ARHGEF7	TRUE	6
ARL10	TRUE	14
ARMC4	TRUE	26
ARPP21	TRUE	10
ASB4	TRUE	29
ASCL1	TRUE	29
ASIC4	TRUE	33
ATP6V0D2	TRUE	9
ATP6V1C2	TRUE	40
ATP6V1G3	TRUE	18
AVIL	TRUE	5
B3GALNT1	TRUE	36
B4GALNT1	TRUE	29
B4GALNT4	TRUE	31
BACE2	TRUE	33
BACH2	TRUE	34
BAMBI	TRUE	24
BBOX1	TRUE	31
BCAN	TRUE	24
BCAR1	TRUE	9
BCHE	TRUE	34
BDKRB1	TRUE	20
BEST4	TRUE	11

Next, we can visualize the scatterplots of the selected genes or save them on a PDF file.


genes2plot2 <-   rownames(heurL)[1:4] #  

plotGenesMat(mae1, geneNames = genes2plot2,
    fileName = NULL, text4Title = heurL[genes2plot2, "numeriSc"],
    saveToPDF = FALSE
)

In summary, the Heuristic method allows us to select genes with an L-shaped scatterplot as follows:

Select datasets: A pair of row-column matched matrices for expression and methylation.
Set parameters: 2.1. Grid definition 2.2. Binary scoring 2.3. Numerical scoring
Score the selected data: Return scores, the classification (L-Shape = TRUE / non-L-shape = FALSE), and plots for each gene.