Theories behind cubar (original) (raw)

The Codon Adaptation Index (CAI)

CAI measures the similarity of codon usage between a coding sequence (CDS) and a set of highly expressed genes (Sharp and Li 1987). To quantify the relative synonymous codon usage (RSCU) values among the set of highly expressed genes, the observed frequency of each codon is divided by the frequency expected under the assumption of equal usage of the synonymous codons for an amino acid (referred to as codon family hereafter):

RSCUij=Xij1ni∑k=1niXik=Xij1niXi\text{RSCU}_{ij} = \frac{X_{ij}}{\frac{1}{n_i}\sum_{k=1}^{n_i}{X_{ik}}}=\frac{X_{ij}}{\frac{1}{n_i}X_i}

whereXijX_{ij}is the number of occurrences of thejjth codon for theiith codon family in CDSs of highly expressed genes, andnin_iis the total number of alternative codons for theiith codon family.XiX_iis the number of occurrence of codons in theiith codon family. The relative adaptiveness of a codon,ww, is defined as the RSCU of that codon divided by the maximum RSCU of that codon family:

wij=RSCUijmaxk=1,…,niRSCUikw_{ij}=\frac{\text{RSCU}_{ij}}{\max_{k=1,\dots,n_i}{\text{RSCU}_{ik}}}

The CAI for a CDS is then calculated as the geometric mean of the relative adaptiveness of codons used in the CDS of that gene:

CAI=(∏k=1Lwk)1L=exp(1L∑k=1Llnwk)=exp(∑i∑jXijlnwij∑i∑jXij)\text{CAI} = \bigg(\prod_{k=1}^L w_k\bigg)^{\frac{1}{L}} =\text{exp}\bigg(\frac{1}{L}\sum_{k=1}^L \ln w_k\bigg) = \text{exp}\bigg( \frac{\sum_i\sum_j {X_{ij}\ln w_{ij}}}{\sum_i\sum_j X_{ij}} \bigg)

whereLLis the total number of codons in CDS.

The effective number of codons (ENC)

ENC reflects the unequal usage of codons in the CDS of a gene (Wright 1990). The lower the ENC, the larger the overall bias in codon usage. The original implementation of ENC calculates the homozygosity of codon usage in codon familyiias follows:

Fiori=Xi∑j=1ni(XijXi)2−1Xi−1F_i^{\text{ori}} = \frac{X_i \sum_{j=1}^{n_i}{(\frac{X_{ij}}{X_i})^2} - 1}{X_i - 1}

whereXiX_iis the number of occurrence of codons in theiith codon family as used above. In cubar, codon family homozygosity is calculated with an improved implementation that is more robust to bias due to smallnin_i(Sun, et al. 2013):

Fi=∑j=1ni(Xij+1Xi+ni)2F_i = \sum_{j=1}^{n_i}(\frac{X_{ij} + 1}{X_i + n_i})^2

Then we can calculate the final ENC of the CDS of a gene as follows:

ENC=K1+K2∑i=1K2Xi∑i=1K2XiFi+K3∑i=1K3Xi∑i=1K3XiFi+K4∑i=1K4Xi∑i=1K4XiFi\text{ENC}=K_1 + K_2\frac{\sum_{i=1}^{K_2}{X_i}}{\sum_{i=1}^{K_2}{X_iF_i}} + K_3\frac{\sum_{i=1}^{K_3}{X_i}}{\sum_{i=1}^{K_3}{X_iF_i}} + K_4\frac{\sum_{i=1}^{K_4}{X_i}}{\sum_{i=1}^{K_4}{X_iF_i}}

whereKmK_mdenotes the number of codon families withmmsynonymous codons:

Km=∑iδ(ni−m)K_m=\sum_{i}{\delta(n_i-m)}

It should be noted that throughout cubar, codon families with more than four synomymous codons are divided into different subfamilies based on the first two nucleotides of codons.

The fraction of optimal codons (Fop)

FopF_{op}measures the fraction of optimal codons in the CDS of a gene given a list of optimal codons (Ikemura 1981). It is calculated as follows:

Fop=∑k=1LI(k-th codon is optimal)LF_{\text{op}}=\frac{\sum_{k=1}^L{I(k\text{-th codon is optimal})}}{L}

whereIIis the indicator function. In case the optimal codons are unknown, cubar automatically determines the optimal codons in each codon family based on the rationale that optimal codons tend to be used more frequently in genes showing stronger codon usage bias. Specifically, as the number of occurrence of codonjjamong codon familyiiin genekkfollows a Binomial distribution:

Xijk∼Binomial(Xik,pk)X_{ij}^k \sim \text{Binomial}(X_i^k, p_k)

the influence of codon usage bias on the tendency to use codonjjcan be estimated with binomial regression using the following link function:

lnp1−p∼β0+β1⋅ENC+ϵ\ln{\frac{p}{1-p}} \sim \beta_0 + \beta_1 \cdot \text{ENC} + \epsilon

If codonjjis more likely to be used in genes with higher overall codon bias (i.e., smaller ENC), then the regression coefficient should be negative and significantly differs from zero. cubar implements the binomial regression using the glm function withbinomial family in R. Optimal codons are determined with a false discovery rate of 0.001 after multiple testing correction with the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995) by default.

tRNA Adaptation Index (tAI)

tAI quantifies how much the usage of codons in CDS of a gene resembles the abundance of tRNAs (dos Reis, et al. 2004), which is often approximated by tRNA gene copy numbers. To determine tAI, the absolute tRNA adaptiveness valueWiW_ifor each codoniiis defined as

Wi=∑j=1ti(1−sij)TijW_i = \sum_{j=1}^{t_i}{(1 - s_{ij}) T_{ij}}

wheretit_iis the number of tRNA isoacceptors recognizing theiith codon andTiT_iis the abundance or gene copy number of thejjth tRNA recognizing this codon.sijs_{ij}is a panelty on non-canonical codon–anticodon pairings and differs among different species (Sabi and Tuller 2014). Cubar uses averagesijs_{ij}values for eukaryotes (Sabi and Tuller 2014) by default. Absolute adaptiveness values are then normalized by maximum as follows:

wi={WimaxjWj,if Wi>0W‾|Wj≠0maxjWj,if Wi=0w_i = \begin{cases} \frac{W_i}{\max_{j}W_j},& \text{if } W_i > 0 \\ \frac{\bar W|_{W_j \neq 0}}{\max_j{W_j}}, & \text{if } W_i=0\end{cases}

whereW‾|Wj≠0\bar W|_{W_j \neq 0}is the geometric mean of non-zero absolute adaptiveness values. Then tAI of codons in a gene can be calculated as follows, similar to CAI:

TAI=(∏k=1Lwk)1L=exp(1L∑k=1Llnwk)=exp(∑i∑jXijlnwij∑i∑jXij)\text{TAI} = \bigg(\prod_{k=1}^L w_k\bigg)^{\frac{1}{L}} =\text{exp}\bigg(\frac{1}{L}\sum_{k=1}^L \ln w_k\bigg) = \text{exp}\bigg( \frac{\sum_i\sum_j {X_{ij}\ln w_{ij}}}{\sum_i\sum_j X_{ij}} \bigg)

The mean codon stabilization coefficients (CSCg)

CSC of a codon is the Pearson correlation coefficient between the frequency of that codon and the mRNA half-lives across different genes (Presnyak, et al. 2015). CSCg is the average codon stabilization coefficient (CSC) of codons in the CDS of a gene (Carneiro, et al. 2019):

CSCg=1L∑k=1LCSCk=∑i∑jXijCSCij∑i∑jXij\text{CSCg} = \frac{1}{L}\sum_{k=1}^L{\text{CSC}_k}= \frac{\sum_i\sum_j {X_{ij} \text{CSC}_{ij}}}{\sum_i{\sum_j{X_{ij}}}}

The deviation from proportionality (Dp)

DpD_pmeasures the departure of the codon usage of an exogenous CDS from the tRNA pool of the host organism (Chen, et al. 2020; Chen and Yang 2022). For a codon familyiiwithnin_isynonymous codons (ni>1n_i>1), the fraction of codonjjamong all occurrences of this family in the exogenous CDS is:

Yij=XijXiY_{ij} = \frac{X_{ij}}{X_i}

The relative tRNA availability of this codon in the codon family in the host organism is calculated as:

Rij=wij∑i=1niwijR_{ij} = \frac{w_{ij}}{\sum_{i=1}^{n_i}{w_{ij}}}

wherewijw_{ij}can be relative tRNA adaptiveness values as in the calculation of tAI or the RSCU of host protein-coding genes as in the calculation of CAI. The discrepancy between codon proportions of exogenous CDS and host tRNA availability is then calculated as the Euclidean distance:

Di=∑j(Yij−Rij)2D_i = \sqrt{\sum_j(Y_{ij} - R_{ij})^2}

Finally,DpD_pis calculated as the geometric mean of distances for all codon families with at least two synonymous codons:

Dp=(∏i=1KDi)1K=exp(1K∑i=1KlnDi)D_p = (\prod_{i=1}^K{D_i})^{\frac{1}{K}} = \exp (\frac{1}{K}\sum_{i=1}^K{\ln D_i})

whereK=∑iI(ni>1)K=\sum_{i}{I(n_i>1)}andIIis the indicator function.

References