Estimating IBD tracts from low coverage NGS data (original) (raw)

Abstract

Motivation: The amount of IBD in an individual depends on the relatedness of the individual’s parents. However, it can also provide information regarding mating system, past history and effective size of the population from which the individual has been sampled.

Results: Here, we present a new method for estimating inbreeding IBD tracts from low coverage NGS data. Contrary to other methods that use genotype data, the one presented here uses genotype likelihoods to take the uncertainty of the data into account. We benchmark it under a wide range of biologically relevant conditions and show that the new method provides a marked increase in accuracy even at low coverage.

Availability and implementation: The methods presented in this work were implemented in C/C ++ and are freely available for non-commercial use from https://github.com/fgvieira/ngsF-HMM.

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

The inference of inbreeding levels is of central importance in many studies of ecology, evolution and conservation biology. Inbred individuals often have lower fitness than offspring of unrelated parents (Ebert et al., 2002) and a cumulative effect can reduce the population growth rate and probability of persistence (O’Grady et al., 2006). Furthermore, the study of inbreeding levels on natural populations can shed light into the species’ mating system and past history (Gibson et al., 2006; Stevens et al., 2012; Vieira et al., 2013) and is important for understanding the distribution of genetic variation within and among populations and, consequently, the effect of natural selection (Charlesworth, 2003).

In addition to the genome-wide inbreeding coefficient of an individual, further information can be gained by examining the distribution of inbred (or Identity By Descent; IBD) regions throughout the genome. These regions are usually organized into tracts of homozygous genotypes that recombination breaks down over time. Their number and lengths are tightly coupled to population genetics processes, from fine-scale population structure, effective population size, selfing and recombination rate, to age and type of inbreeding and even relatedness of the parents (Gibson et al., 2006). Briefly, short tracts reflect old inbreeding in the population (possibly due to large effective population sizes), while long tracts may reflect recent inbreeding either due to small effective population sizes or familial matings. Under certain conditions and carefully designed experiments, the identification of IBD tracts can even be used for other types of analyses, like mapping of recombination breakpoints from back-crossed individuals.

As common descent is always guaranteed for any pair of homologous loci, IBD is often defined relative to an expectation under a certain model. An example is IBD due to recent familial relationships relative to the expectation of genetic identity for two individuals sampled at random from the population. A common operational definition of inbreeding is the excess of homozygosity compared to the Hardy–Weinberg Equilibrium expectation. This corresponds to classical population genetic definitions dating back to Wright (1922) and is in effect the definition used here and in many previous papers (Leutenegger et al., 2003; Vieira et al., 2013), although inferences are done in a single individual by combining information from multiple sites.

Previous studies have addressed the problem of inferring IBD tracts using Hidden Markov Models (HMMs) (Leutenegger et al., 2003) based on individual genotype data. Current Next-Generation Sequencing (NGS) technologies can sometimes produce data with high error rates due to random sampling of homologous base pairs, sequencing, or alignment errors (Nielsen et al., 2011; Ross et al., 2013). Furthermore, due to budget constraints, many NGS studies rely on low or medium depth of coverage sequence data (⁠<5× per individual), causing genotype calling to be made with a considerable amount of uncertainty. The uncertainty, especially from low depth data, can greatly bias inferences of IBD tracts, as sites with sequencing errors can be mistaken for heterozygote sites thereby breaking up the tracts into smaller segments.

Many recent methods rely on probabilistic frameworks to account for these errors and accurately call SNPs and genotypes, even at low coverage (Li, 2011; Martin et al., 2010; Nielsen et al., 2012). These methods integrate the base quality score together with other error sources (e.g. mapping or sequencing errors) to calculate an overall genotype likelihood (DePristo et al., 2011; Li et al., 2009a,b). These likelihoods can be used directly or combined with priors to take into account the uncertainty associated with the data (for reviews see Nielsen et al., 2011; O’Rawe et al., 2015).

There are several methods available to infer inbreeding coefficients, but only ngsF (Vieira et al., 2013) is suitable for low coverage NGS data; however, this program only estimates genome-wide inbreeding levels and is thus incapable of inferring IBD tracts. In this paper, we present a new method to estimate IBD tracts from low-coverage NGS data. We evaluate its accuracy using extensive biologically relevant simulations and apply it to two real datasets, one consisting of populations from the HAPMAP and 1000 Genomes projects, and another consisting of varieties of wild and domesticated rice. When comparing against genotype-based methods, the method presented here performs considerably better when estimating IBD tracts and per-individual inbreeding coefficients for coverages <3×⁠.

2 Methods

2.1 Optimization and decoding

To optimize the parameters from our model (Fig. 1), we adopted a Maximum Likelihood iterative approach. Using F for the per-individual inbreeding coefficient, α for the transition rate, f for the allele frequencies, π for the most probable path (i.e. the inferred IBD tracts) and Li for the likelihood at iteration i:

HMM model. Illustration of the HMM model used. We used a two-state model to represent IBD and IBD¯ (not IBD) genomic regions. The model switches between states with probability P(ys−1|ys) and each state can emit the three different genotypes with probability P(g|ys) (where g represents a genotype and ys the most probable state at site s)

Fig. 1.

HMM model. Illustration of the HMM model used. We used a two-state model to represent IBD and IBD¯ (not IBD) genomic regions. The model switches between states with probability P(ys−1|ys) and each state can emit the three different genotypes with probability P(g|ys) (where g represents a genotype and ys the most probable state at site s)

(0) Initialization:

Set i = 0.

Initialize F, α and f from a uniform distribution.

Calculate initial likelihood _L_0.

(1) Given f, jointly estimate F and α through the L-BFGS-B algorithm.

(2) Given F and α, estimate f using an EM algorithm.

(4) Check convergence:

Set i=i+1⁠.

Calculate Likelihood Li.

If Li−Li−1>epsilon go to (1).

(5) Infer the most probable path π (decoding) with the Viterbi algorithm.

In more detail, on step (1), the L-BFGS-B algorithm (Zhu et al., 1997) uses numerical derivatives and a new backward pass for each evaluation of the likelihood, that is in turn calculated from the forward algorithm. As for step (2), the EM algorithm was adapted from ANGSD (Korneliussen et al., 2014) and estimates f directly from the genotype likelihoods using, as prior, the expected genotype frequencies under a certain inbreeding level; this is calculated from the probability of a given site being IBD (calculated from the forward and backward algorithms).

2.2 NGS data simulation

We performed extensive simulation to assess the performance of our method when estimating all parameters. Due to computational constraints, we simulated mapped sequencing data directly (rather than raw sequencing reads) similarly to previous studies (Fumagalli et al., 2013; Kim et al., 2011; Vieira et al., 2013). When simulating genotypes, we have to account for the IBD state in each site. Therefore, we first sampled the true IBD states directly by implementing a Markov model (Fig. 1), taking into account the distance between SNPs, per-individual inbreeding coefficient (F) and a transition rate (α). Allele frequencies were sampled uniformly between 0 and 0.5. We then sampled genotypes given the allele frequencies and the previously simulated IBD states. Third, and similarly to previous studies, we sampled the number of reads from a Poisson distribution with mean equal to the specified individual sequencing coverage and simulated sequencing errors by changing each read base to any of the other three nucleotides with probability ϵ/3⁠, where ϵ represents the error rate. We then calculated the genotype likelihoods assuming the GATK model (DePristo et al., 2011). We note that, since we use the same model to simulate the data and make inferences, results from simulated data represent best case scenarios.

We simulated 10 000 variable sites with distances sampled from a normal distribution with a mean of 100 000 base pairs and standard deviation equal to 1/3 of the mean. We assumed sample sizes of 10, 30 and 50 individuals, and average sequencing coverages of 0.5, 1, 3, 5 and 10×⁠, with error rates of 0.5, 1 and 2%, and varied the inbreeding coefficients from 0.0 to 1.0 in steps of 0.1, for a total of 495 combinations. With these parameter choices we focused on datasets with small sample sizes and low coverage, for which inference is harder, used realistic error rates (Glenn, 2011), and covered biologically relevant scenarios of inbreeding from < 0.07 in humans (Carothers et al., 2006) and ∼0.3 dogs (Gray et al., 2009; Kirkness et al., 2003) to 0.4–0.98 in rice (Kovach et al., 2007) and 0.757 in wasps (Chapman and Stewart, 1996).

2.3 Genotype calling

Genotypes were called following a Maximum-A-Posteriori (MAP) approach by internally setting the genotype with the highest posterior probability to 1 and the remaining genotypes to 0. Genotype’s posterior probability was calculated from the genotype likelihood together with a prior (Li, 2011; Nielsen et al., 2011, 2012), where the latter was either (i) a uniform distribution (GL_CG), the expected genotype frequencies under (ii) HWE (HWE_CG) or (iii) HWE assuming an inbreeding coefficient (HWE + F_CG). In this work, the inbreeding priors were calculated with the program ngsFVieira et al. (2013), that estimated genome-wide inbreeding coefficients from low coverage NGS data. It is worth noting that under this genotype calling approach, we will only have undetermined (i.e. missing) genotypes when using a uniform prior since, in face of equally likely genotypes, the prior alone will determine the outcome.

2.4 Metric of accuracy

We calculated error rates associated with all estimated parameters: F, α, IBD tract (π), allele frequencies (f) and genotype calls (g). For genotype calls, the associated error was calculated as the proportion of miscalled genotypes, while for the other parameters we used the RMSD defined as:

where _X_true and _X_est are the true and estimated values of the parameters, and S the total number of estimates (number of individuals for F and α, and all sites across all individuals for π; missing values were ignored. All plots were made using R package ggplot2 (Wickham, 2009).

2.5 Analysis of real data

2.5.1 Human dataset

We selected individuals from the LWK (Luhya in Webuye, Kenya) and GIH (Gujarati Indian from Houston, Texas) populations present in both the HAPMAP (The International HapMap 3 Consortium, 2010) and 1000 Genomes (The 1000 Genomes Project Consortium, 2012) datasets, on a total of 86 and 94 samples from the LWK and GIH, respectively, with sequencing coverages ranging from 3.13× up to 13.98×⁠, with an average of 6.2× (Supplementary Tables S1 and S2). For computational reasons, we restricted our analyses to chromosomes 8 and 11 on the LWK and GIH populations, respectively (Supplementary Fig. S6). We selected all sites with a Minor Allele Frequencies (MAF) of at least 0.05 and, since our method does not explicitly account for linkage disequilibrium (LD) among sites, devised two datasets to assess the impact of LD on the IBD inferences: one composed of all sites in the HAPMAP dataset (HAPMAP) and another (HAPMAP-LD) where sites in LD were pruned with PLINK (Purcell et al., 2007) based on pairwise LD (–indep-pairwise 50 5 0.2). To assess the impact of sample size and sequencing coverage on the estimates, apart from the original 1000 Genomes dataset, we also analyzed downsampled datasets. For the coverage we downsampled the original dataset to 50 and 25% of the original total number of reads; for the sample size, we downsampled it to 15, 30 and 50 individuals. We extracted the sites for each of these datasets and, using the originally mapped reads, used ‘ANGSD’ (Korneliussen et al., 2014) to calculate genotype likelihoods (-baq 1 -C 50 -minMapQ 15 -minQ 10 -minInd (N_IND/2) -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6). Briefly, we used the SAMtools formula (Li et al., 2009a) to calculate the genotype likelihoods using only reads with a root mean square (RMS) mapping quality > 15, and sites with a base quality >10, where data was present in at least half the individuals and with a high probability of being SNPs. To assess the accuracy of our method, we inferred IBD tracts on all of the above mentioned 36 datasets (2 populations, 3 sample sizes, 3 coverages and 2 sets of sites) and calculated the proportion of sites with correctly assigned IBD states. Since we don’t know the true IBD state of these samples, benchmarked the performance of our method dealing with low-coverage data (using genotype likelihoods) relative to the performance of full genotyping data, using the our HMM algorithm in both cases.

2.5.2 Rice dataset

For the rice example, we used the 113 wild, 72 indica and 79 japonica (both tropical and temperate) accessions from the MiniCore collection (Wang et al., 2016). We used the original data comprising 52 838 SNPs evenly distributed across the rice genome. Briefly, genotype likelihoods were calculated with ANGSD (options ‘-baq 1 -C 50 -minMapQ 20 -minQ 20 -GL 1 -doGlf 2’) and called SNPs with a significance level of approx. 0.0001 for rejecting the hypothesis of the site being non-polymorphic. We restricted our analyses to sites with a MAF of at least 0.05 and, to account for LD, we randomly selected a representative SNP for every 5 kb region. In this case we simply inferred the IBD tracts, since we don’t have a high quality reference to compare to. Since the level of population structure of these species is unclear, we analyzed the wild, indica and japonica accessions separately (Supplementary Figs S7–S9).

3 Results

3.1 Model for IBD tracts

To estimate Identity By Descent (IBD) tracts along a genome, we extended the approach of Leutenegger et al. (2003) which uses an HMM model with two hidden states (not inbred, y = 0 or IBD¯⁠; inbred, y = 1 or IBD) for each polymorphic site (s) of the genome (Fig. 1). However, instead of the observed data being genotypes, we will use genotype likelihoods directly to infer IBD regions (or the most probable hidden state path πˆ⁠) across the genome.

The transition probabilities between site s and s – 1 represent the probability of switching between IBD states (⁠y∈{0,1}⁠) and can be obtained from an instantaneous rate matrix Q. This rate matrix is naturally parametrized in terms of an inbreeding coefficient (F) and a transition rate per Mb (α) which depends on the recombination rate and the time to common ancestor(s) in the underlying pedigree:

Q=State(y)IBD¯IBDIBD¯1−αFαFIBDα(1−F)1−α(1−F)

(2)

However, we note that the α parameter can only be estimated when 0<F<1⁠. Based on the instantaneous rate matrix, the transition probabilities of the continuous time Markov chain are given by Leutenegger et al. (2003):

P(ys=0|ys−1=0)=(1−e−αt)(1−F)+e−αtP(ys=0|ys−1=1)=(1−e−αt)(1−F)P(ys=1|ys−1=0)=(1−e−αt)FP(ys=1|ys−1=1)=(1−e−αt)F+e−αt

(3)

The emission probabilities at site s, P(Xs|ys)⁠, represent the probability of the observed data (i.e. overlapping reads and corresponding base qualities) at that site (Xs) given the current state (ys). If we were dealing with called genotypes, this would be straightforward since the observed data would be the genotype but here, since we are dealing with genotype likelihoods, we have to integrate over all possible genotypes:

P(Xs|ys)=∑gP(Xs|g)P(g|ys)

(4)

where for site s, P(Xs|g) is the genotype likelihood for genotype g and P(gs|ys) the probability of observing genotype g on state ys. Assuming that all sites are bi-allelic with two alleles represented by A and a and, there are only three possible genotypes, so:

P(AA|ys)=(1−fs)2+fs(1−fs)ysP(Aa|ys)=2fs(1−fs)−2fs(1−fs)ysP(aa|ys)=fs2+fs(1−fs)ys

(5)

where ys is the IBD state and fs the minor allele (a) frequency at site s, which is also estimated from the data (see below). From a set of individuals, we jointly estimate the allele frequencies f, and the per-individual inbreeding coefficients F and transitions rates α.

3.2 Estimating inbreeding from simulated data

To assess the accuracy of our method, we applied it to a simulated dataset covering a wide range of biologically relevant scenarios. For each simulated scenario, we used our HMM method and assessed the estimates’ accuracy when based both on genotype likelihoods (GL) and called genotypes (GL_CG, HWE_CG and HWE + F_CG). On every case, we estimated the per-individual inbreeding coefficient (F), transition rate (α), most probable path (⁠πˆ⁠) and allele frequencies (f), together with their associated RMSD.

3.2.1 Estimating IBD tracts and transition probabilities

We benchmarked the accuracy of the new method on the above mentioned simulated dataset for inferring IBD tracts (⁠πˆ⁠). In all cases we used the Viterbi algorithm to infer IBD tracts, i.e. the most probable path between the model’s two states. Overall, the analyses based on genotype likelihoods directly vastly outperform analyses based on called genotypes (Fig. 2 and Supplementary Fig. S1). In fact, we get broadly the same accuracy level at 0.5× from genotype likelihoods than at 3× when calling genotypes accounting for inbreeding. Calling genotypes with priors that do not take inbreeding into account (uniform or HWE priors) reduces the accuracy of the estimates further, specially at low depth sequencing.

Fig. 2.

Estimation of IBD tracts. Performance of the HMM method to infer IBD tracts for a sample size of 30 individuals, a transition rate of 0.01 and 10, 000 variable sites simulated with a 0.5% error rate. Columns represent the different analytical approaches, from genotype likelihoods (1st column), called genotypes assuming no prior (2nd column) and called genotypes assuming a prior based on genotype frequencies under HWE (3rd column) or a prior assuming inbreeding (last column). The different lines represent RMSD under different simulated sequencing depths

When inferring the transition rate (α) the overall trend is the same as for IBD tracts, except that all methods based on called genotypes perform quite poorly for coverages <3× and, with higher error rates, even <5× (Fig. 3 and Supplementary Fig. S2). When calling genotypes assuming HWE in particular, it is impossible to make any inference at coverages <3× (and why data points are missing for these coverages). Due to the extremely low coverage, the HWE prior will lead to high genotype calling errors (30%; Supplementary Fig. S5) that disrupt IBD tracts and preclude their identification. For any given depth, inferences directly from genotype likelihoods always outperform all other methods, even looking across different depths.

Fig. 3.

Estimation of transition rate. Performance of the HMM method when estimating α for a sample size of 30 individuals, an inbreeding coefficient of 0.5 and 10 000 variable sites simulated with a 0.5% error rate. Columns represent the different analytical approaches: genotype likelihoods (1st column), called genotypes assuming no prior (2nd column) and called genotypes assuming a prior based on genotype frequencies under HWE (3rd column) or a prior assuming inbreeding (last column). The different lines represent RMSD under different simulated sequencing depths

3.2.2 Estimating individual inbreeding coefficients

The individual inbreeding coefficients is given from the model and, as previously shown, can be used as a prior in Bayesian analyses (Vieira et al., 2013) and can be particularly important when dealing with low depth-of-coverage NGS datasets. Overall we see that all approaches perform quite well at a sequencing coverage depth of 10×⁠, except when assuming HWE on highly inbred samples. This is true even at >5× (Fig. 4 and Supplementary Fig. S3). At <3× the two methods based on called genotypes perform quite poorly, illustrating the biases that this type of approach entails when dealing with low coverage data. At this coverage range, only genotype likelihood-based methods have acceptable accuracies, with the new method presented here slightly outperforming ngsF (Vieira et al., 2013) at low sequencing depth (⁠<3×⁠) and small sample sizes (⁠n≤30⁠).

Fig. 4.

Estimation of per-individual inbreeding coefficients. Performance of the HMM method to infer genome-wide per-individual inbreeding coefficients for a sample size of 30 individuals, a transition rate of 0.01 and 10 000 variable sites simulated with a 0.5% error rate. Columns represent the different analytical approaches, from genotype likelihoods (1st column), called genotypes assuming no prior (2nd column) and called genotypes assuming a prior based on genotype frequencies under HWE (3rd column). The last column represents the inference of per individual F from genotype likelihoods (as in 1st column) but using the previously developed method ngsF. The different lines represent RMSD under different simulated sequencing depths

3.2.3 Estimating allele frequencies

Allele frequencies form the backbone of most population genetics methods and, as such, their accurate estimation is of high importance. The method presented here was adapted from ANGSD and, as such, performs similarly. Overall, genotype likelihood-based methods seem to perform better than those based on called genotypes and be somewhat robust to various levels of inbreeding and priors (Supplementary Fig. S4).

3.2.4 Effect of inbreeding on genotype calling

A common downstream analysis of NGS data is the identification of genotypes at each position for all individuals (genotype calling). The methods developed here can provide improved Bayesian genotype-calling in low depth-of-coverage data, by providing a more appropriate prior that takes inbreeding into account. However, several factors in addition to inbreeding, can affect genotype calling, including high error rates, sequencing coverage and small sample sizes. To assess their impact, we calculated genotype posterior probabilities with our HMM method, called genotypes, and compared these with the previously mentioned called genotypes datasets: GL, HWE_CG and HWE + F_CG (see Section 2). All methods show overall comparable error rates for calling genotypes (Supplementary Fig. S5), but assuming a uniform prior gives a high proportion of undetermined genotypes (e.g. 60 and 37% at 0.5× and 1×⁠, respectively). If we use an informative prior, the undetermined genotypes’ levels are drastically reduced. Assuming HWE yields a relatively constant error rate across inbreeding levels but, in highly inbred samples, being able to incorporate inbreeding into the prior (either local or global) can drastically reduce genotype calling errors by as much as 79% (from 0.215 to 0.046) when analysing 10 fully inbred individuals at 5× (Supplementary Fig. S5).

Dividing genotypes into homozygous and heterozygous calls, it is clear they are differently affected. Homozygous genotypes tend to have constant error rates independently of inbreeding levels, except when assuming HWE, where higher inbred samples have higher error since the prior makes it difficult to call low frequency homozygotes. Using a uniform prior gives very low error rates but at the expenses of high levels of undetermined genotypes, while both informative inbreeding priors have similar performances.

Heterozygous genotypes are typically the most difficult genotypes to call and, as such, have considerably higher error rates. In this case, a uniform prior is the worst method with both high error rates and undetermined genotypes’ levels. Assuming an inbreeding prior performs similarly to HWE for low inbred samples but performs worse for highly inbred ones (i.e. 0.5<F<0.9⁠), since the prior penalizes heterozygote genotypes. In this case, the method presented here presents another advantage as it actually allows the use of different priors for regions that have been inferred to be IBD or IBD¯⁠. To sum up, the method largely has the advantages of an inbreeding prior for homozygote genotypes and a HWE prior for heterozygotes genotypes in non-IBD regions.

3.3 Application to real data

In addition to simulated data, we also analyzed two publicly available datasets of low coverage NGS data: a human dataset composed of populations from HAPMAP and the 1000 Genomes projects, as well as a recently published rice dataset.

3.3.1 HAPMAP and 1000 genomes human datasets

In the analyses of the human dataset, we took advantage of the fact that some individuals genotyped for the HAPMAP project (The International HapMap 3 Consortium, 2010) have also been sequenced at low coverage on the 1000 Genomes (The 1000 Genomes Project Consortium, 2012). This gives us a perfect test dataset, as we have both low-coverage sequencing data and high quality genotypes to use as reference. We selected the LWK and GIH populations, since they are included in both datasets, both were expected to be composed of unrelated individuals, but where recent studies have shown some history of inbreeding (e.g. Gazal et al., 2015; Stevens et al., 2012).

We inferred IBD tracts on the 1000 Genomes populations comparing results across several combinations of sample size and coverage, with and without LD pruning (see Section 2). We assume HAPMAP genotypes are called correctly and, as such, that IBD tracts inferred from them represent the gold standard. Comparing to it, we show that the HMM method can accurately infer IBD tracts even under cases of extremely low coverage and small sample sizes (1/4 of 1000 Genomes coverage and 30 individuals) (Supplementary Fig. S6). In fact, both sample size and coverage seem to have a relatively small effect on IBD inference (the latter when sites in LD have been pruned), in comparison to the presence of sites in LD. That said, pruning of sites in LD can greatly reduce error rates by as much as an order of magnitude, from (e.g.) an average proportion of miss-identified sites of 0.182–0.020, when analysing 30 individuals from the GIH population at 1/4 of their original coverage.

3.3.2 MiniCore rice dataset

As a second example, we used the recently published MiniCore dataset, composed of both domesticated and wild accessions sequenced at low coverage (Wang et al., 2016). There are several species of wild rice but the O. rufipogon species complex is thought to be the closest to domesticated rice (Oryza sativa) (e.g. Grillo et al., 2009; Wei et al., 2012), which is further divided into two subspecies (O. s. japonica and O. s. indica). Wild and domesticated accessions have markedly different selfing rates, ranging in the wild between 50 and 95% (Gao et al., 2002; Morishima et al., 1984; Oka, 1988; Phan et al., 2012), and in cultivated between 95 and 100%.

Our estimates show wild rice with a wide range of inbreeding values, from totally outbred to almost fully inbred (Supplementary Fig. S7), while cultivated rice accessions were all almost totally inbred. These estimates are also reflected in the IBD tract inferences, in which cultivated accessions have IBD tracts spanning whole chromosomes (Supplementary Figs S9 and S8).

4 Discussion

The levels of inbreeding in an individual is an important parameter in population genomic studies, since it can reflect mating system, selfing rates, population size and past population history. Theoretically, the best way to infer it is through the pedigree, but pedigrees are not available in much studies. However, even in these cases, inferences based on pedigrees can be biased due to incomplete knowledge of the pedigree. In addition, pedigrees provide expected levels of inbreeding, but these may differ from true genetic levels of inbreeding due to the stochasticity of allelic segregation and recombination. This has been demonstrated by recent analyses identifying higher than expected ROH prevalence in unrelated individuals from outbred populations (Gibson et al., 2006; The International HapMap Consortium, 2007), supporting a recent claim that estimates based on markers are more accurate than expected values inferred from pedigrees (Kardos et al., 2015).

Here, we have developed a method that can reliably estimate individual Identical By Descent (IBD) tracts and inbreeding coefficient, directly from genomic data, without requiring any knowledge of the underlying genealogy. Other methods exist for this (Hall et al., 2012; Leutenegger et al., 2003) but all were developed for SNP chip data, which has much lower error rates than low coverage NGS data. NGS technologies have revolutionized genetics by providing fast, cheap and reliable large-scale DNA sequencing data. However, the per base pair error rate in NGS data is still considerably higher than in Sanger sequencing or chip-based genotyping technologies (Glenn, 2011). As a consequence, researchers usually sequence at high sequencing depths but this comes at an increased financial, computational and storage cost. Furthermore, due to the ever-growing demand for larger sample sizes, many NGS studies rely on low coverage NGS sequence data (⁠<5×⁠). As such, the availability of methods that can properly handle this data will help researchers make more cost-effective choices in the trade-off between sample size and sequencing depth.

The method presented here facilitates the estimation of IBD tracts from low coverage NGS data. We evaluate its performance through both simulated and real data analyses. When compared to genotype-calling-based methods, the improvement in accuracy when estimating IBD tracts and individual inbreeding coefficients is considerable for sequencing depths <3×⁠. Apart from the previously mentioned use of this methods in population genomic studies, we note that there is another possible application of the method: mapping of recombination breakpoints in backcrosses between inbred lines. In backcross data, each individual in the backcross generation effectively has an inbreeding coefficient of F = 0.5. The method presented here allows the estimation of recombination rates for such data by estimation of the parameter α. More importantly, the posterior decoding algorithm provides estimates of the genomic location of recombination breakpoints. This provides, in combination with low-coverage NGS sequencing of a backcross generation, an efficient design for mapping recombination breakpoints from model species such as yeast.

Acknowledgements

We would like to thank Thorfinn Korneliussen for helpful discussions and assistance on the use of ANGSD, and Shyam Gopalakrishnan for helpful discussions.

Funding

Funding for this work was supported by a DFF-MOBILEX grant to FGV (DFF-1325-00136), and a Villum Foundation fellowship to AA.

Conflict of Interest: none declared.

References

et al. . (

2006

)

Estimating human inbreeding coefficients: comparison of genealogical and marker heterozygosity approaches

Ann. Hum. Genet

666

–

676

(

1996

)

Extremely high levels of inbreeding in a natural population of the free-living wasp Ancistrocerus antilope (Hymenoptera: Vespidae: Eumeninae)

Heredity (Edinb)

–

(

2003

)

Effects of inbreeding on the genetic diversity of populations

Philos. Trans. R. Soc. London B: Biol. Sci

358

1051

–

1070

et al. . (

2011

)

A framework for variation discovery and genotyping using next-generation DNA sequencing data

Nat. Genet

491

–

498

et al. . (

2002

)

A selective advantage to immigrant genes in a Daphnia metapopulation

Science

295

485

–

488

et al. . (

2013

)

Quantifying population genetic differentiation from next-generation sequencing data

Genetics

195

979

–

992

. et al. . (

2002

)

Assessment of population genetic structure in common wild rice Oryza rufipogon Griff. using microsatellite and allozyme markers

Theor. Appl. Genet

106

173

–

180

et al. . (

2015

)

High level of inbreeding in final phase of 1000 Genomes Project

Sci. Rep

et al. . (

2006

)

Extended tracts of homozygosity in outbred human populations

Hum. Mol. Genet

789

–

795

(

2011

)

Field guide to next-generation DNA sequencers

Mol. Ecol. Resour

759

–

769

et al. . (

2009

)

Linkage disequilibrium and demographic history of wild and domestic canids

Genetics

181

1493

–

1505

et al. . (

2009

)

Genetic architecture for the adaptive origin of annual wild rice, Oryza nivara

Evolution (N. Y)

870

–

883

et al. . (

2012

)

Maximum likelihood estimation of individual inbreeding coefficients and null allele frequencies

Genet. Res. (Camb)

151

–

161

et al. . (

2015

)

Measuring individual inbreeding in the age of genomics: marker-based measures are better than pedigrees

Heredity (Edinb)

115

–

et al. . (

2011

)

Estimation of allele frequency and association mapping using next-generation sequencing data

BMC Bioinformatics

et al. . (

2003

)

The dog genome: survey sequencing and comparative analysis

Science

301

1898

–

1903

et al. . (

2014

)

ANGSD: analysis of next generation sequencing data

BMC Bioinformatics

et al. . (

2007

)

New insights into the history of rice domestication

Trends Genet

578

–

587

et al. . (

2003

)

Estimation of the inbreeding coefficient through use of genomic data

Am. J. Hum. Genet

516

–

523

(

2011

)

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

Bioinformatics

2987

–

2993

et al. . (

2009a

)

The sequence alignment/map format and SAMtools

Bioinformatics

2078

–

2079

et al. . (

2009b

)

SNP detection for massively parallel whole-genome resequencing

Genome Res

1124

–

1132

et al. . (

2010

)

SeqEM: an adaptive genotype-calling approach for next-generation sequencing studies

Bioinformatics

2803

–

2810

et al. . (

1984

)

Differentiation of perennial and annual types due to habitat conditions in the wild rice Oryza perennis

Plant Syst. Evol

144

119

–

135

et al. . (

2011

)

Genotype and SNP calling from next-generation sequencing data

Nat. Rev. Genet

443

–

451

et al. . (

2012

)

SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data

PLoS One

e37558.

et al. . (

2006

)

Realistic levels of inbreeding depression strongly affect extinction risk in wild populations

Biol. Conserv

133

–

(

1988

Origin of Cultivated Rice

Elsevier Science/Japan Scientific Societies Press

Tokyo

et al. . (

2015

)

Accounting for uncertainty in DNA sequencing data

Trends Genet

–

et al. . (

2012

)

Estimation of the outcrossing rate for annual Asian wild rice under field conditions

Breed. Sci

256

–

262

et al. . (

2007

)

PLINK: a tool set for whole-genome association and population-based linkage analyses

Am. J. Hum. Genet

559

–

575

et al. . (

2013

)

Characterizing and measuring bias in sequence data

Genome Biol

R51.

et al. . (

2012

)

Unexpected relationships and inbreeding in HapMap Phase III populations

PLoS One

e49575

The 1000 Genomes Project Consortium

. (

2012

)

An integrated map of genetic variation from 1,092 human genomes

Nature

491

–

The International HapMap 3 Consortium

. (

2010

)

Integrating common and rare genetic variation in diverse human populations

Nature

467

–

The International HapMap Consortium

. (

2007

)

A second generation human haplotype map of over 3.1 million SNPs

Nature

449

851

–

861

et al. . (

2013

)

Estimating inbreeding coefficients from NGS data: Impact on genotype calling and allele frequency estimation

Genome Res

1852

–

1861

et al. . (

2016

)

The power of inbreeding: NGS based GWAS of rice reveals convergent evolution during rice domestication

Molecular Plant

, (in press).

et al. . (

2012

)

Domestication and geographic origin of Oryza sativa in China: insights from multilocus analysis of nucleotide variation of O. sativa and O. rufipogon

Mol. Ecol

5073

–

5087

(

2009

)

Ggplot2: Elegant Graphics for Data Analysis

Springer-Verlag, New York

(

1922

)

Coefficients of Inbreeding and Relationship

The American Naturalist

330

–

338

et al. . (

1997

)

Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization

ACM Trans. Math. Softw

550

–

560

Author notes

Associate Editor: Inanc Birol