Phenotype sequencing: identifying the genes that cause a phenotype directly from pooled sequencing of independent mutants
Marc A Harper et al. PLoS One. 2011.
Abstract
Random mutagenesis and phenotype screening provide a powerful method for dissecting microbial functions, but their results can be laborious to analyze experimentally. Each mutant strain may contain 50-100 random mutations, necessitating extensive functional experiments to determine which one causes the selected phenotype. To solve this problem, we propose a "Phenotype Sequencing" approach in which genes causing the phenotype can be identified directly from sequencing of multiple independent mutants. We developed a new computational analysis method showing that (1) causal genes can be identified with high probability from even a modest number of mutant genomes; and (2) costs can be cut many-fold compared with a conventional genome sequencing approach via an optimized strategy of library-pooling (multiple strains per library) and tag-pooling (multiple tagged libraries per sequencing lane). We have performed extensive validation experiments on a set of E. coli mutants with increased isobutanol biofuel tolerance. We generated a range of sequencing experiments varying from 3 to 32 mutant strains, with pooling on 1 to 3 sequencing lanes. Our statistical analysis of these data (4099 mutations from 32 mutant genomes) successfully identified 3 genes (acrB, marC, acrA) that have been independently validated as causing this experimental phenotype. It must be emphasized that our approach reduces mutant sequencing costs enormously. Whereas a conventional genome sequencing experiment would have cost $7,200 in reagents alone, our Phenotype Sequencing design yielded the same information value for only $1,200. In fact, our smallest experiments reliably identified acrB and marC at a cost of only $110-$340.
Conflict of interest statement
Competing Interests: The authors have declared that no competing interests exist.
Figures
Figure 1. Schematic diagram of phenotype sequencing and key parameters.
Overview of phenotype sequencing stages: mutagenesis, screening, and sequencing. Conventional unpooled sequencing of individual strains (left) is contrasted with pooled sequencing of multiple strains per library (right), comparing the expected frequency of observation of a real mutation in each case.
Figure 2. Target discovery yield as a function of mutations per strain and number of strains sequenced.
A. For five target genes. Gray color (upper-left corner) represents discovery of all 5 targets; red = zero targets. B. For ten target genes. Gray represents discovery of all 10 targets. C. For twenty target genes. Gray represents discovery of all 20 targets.
Figure 3. Effects of sequencing error and pooling on average target gene discovery yields.
A. The probability of reporting a SNP at a single site as a function of the mutation call threshold (read counts) assuming a coverage of c = 75, due either to sequencing error (red), or a real mutation (green), assuming a 1% sequencing error rate and a 25% true mutation fraction (i.e. library-pooling factor of P = 4). Circles indicate the expected mean read counts on each plot. B. The expected number of total mutation calls per genome as a function of the mutation call threshold, due either to sequencing error (red), or a real mutation (green), assuming a 4 Mb genome size. The dashed red line indicates the lowest mutation call threshold at which the number of false positive mutation calls falls below one. The dashed green line indicates the maximum mutation call threshold at which the number of false negatives remains less than one. C. The average number of true target genes discovered (at an FDR <0.67) as a function of the mutation call threshold, for different library-pooling levels P = 2 to P = 9, assuming sequencing of 80 mutant strains with a mutation density of 50 mutations per genome, and 20 true target genes.
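The trade-off in panels A and B can be illustrated with a simple binomial read-count model. The sketch below is not the authors' implementation. It assumes that the number of reads supporting a variant at a site is binomially distributed, with per-read probability equal to the sequencing error rate (for false calls) or to the pooled mutation fraction 1/P (for real mutations), and it uses the caption's values of c = 75, 1% error, P = 4, and a 4 Mb genome. All function and variable names are hypothetical.

```python
from math import comb

def binom_tail(n: int, p: float, t: int) -> float:
    """Return P(X >= t) for X ~ Binomial(n, p)."""
    if t <= 0:
        return 1.0
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t, n + 1))

# Values taken from the Figure 3 caption.
COVERAGE = 75            # reads per site (c)
ERROR_RATE = 0.01        # per-read sequencing error rate
POOL = 4                 # library-pooling factor (P)
GENOME_SIZE = 4_000_000  # sites per genome

# Assumption: a real mutation in one of P pooled strains appears in 1/P of reads.
MUT_FRACTION = 1 / POOL

def expected_false_calls(threshold: int) -> float:
    """Expected number of error-only sites reaching the read-count threshold."""
    return GENOME_SIZE * binom_tail(COVERAGE, ERROR_RATE, threshold)

# Lowest threshold at which fewer than one false positive is expected per genome.
min_threshold = next(
    t for t in range(1, COVERAGE + 1) if expected_false_calls(t) < 1
)

# Probability that a real mutation still passes that threshold.
detection_prob = binom_tail(COVERAGE, MUT_FRACTION, min_threshold)

print(f"lowest threshold with < 1 expected false call: {min_threshold}")
print(f"P(real mutation called at that threshold): {detection_prob:.4f}")
```

Raising P lowers the mutation fraction 1/P and moves the real-mutation read-count distribution toward the error distribution, which is the effect panel C explores across pooling levels.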
Figure 4. Modeled vs. experimental target gene yield as a function of increasing number of strains sequenced.
A. Bioinformatic model of expected yield for discovery of 3 target genes, as a function of increasing number of strains sequenced, plotted vs. experiment cost, assuming one lane of sequencing at a cost of $37.50 per sequenced strain. B. Experimentally measured target gene discovery yields as a function of number of strains sequenced, plotted vs. experiment cost. Each data point is the average of all sub-experiments containing that number of strains; the error bar gives the standard error for this average from that set of sub-experiments. Red line (inverted triangles): one lane of sequencing (32x coverage per library); blue line (+ signs): three lanes of sequencing (96x coverage per library, resulting in a total cost of $81.25 per strain).
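The per-strain costs in this caption can be checked against the totals quoted in the abstract with a few lines of arithmetic. The sketch below assumes that both per-strain figures refer to the full 32-strain experiment and that the only difference between the one-lane and three-lane designs is two extra sequencing lanes. Neither assumption is stated explicitly in the captions, and the variable names are illustrative.

```python
N_STRAINS = 32                       # largest experiment in the abstract
COST_PER_STRAIN_ONE_LANE = 37.50     # USD, Figure 4 caption
COST_PER_STRAIN_THREE_LANES = 81.25  # USD, Figure 4 caption
CONVENTIONAL_TOTAL = 7200.0          # USD, reagent cost quoted in the abstract

one_lane_total = N_STRAINS * COST_PER_STRAIN_ONE_LANE
three_lane_total = N_STRAINS * COST_PER_STRAIN_THREE_LANES

# Assumption: the two designs differ only by two additional lanes.
implied_cost_per_lane = (three_lane_total - one_lane_total) / 2

# Cost reduction of the one-lane pooled design vs. conventional sequencing.
fold_reduction = CONVENTIONAL_TOTAL / one_lane_total

print(f"one-lane total:        ${one_lane_total:,.2f}")
print(f"three-lane total:      ${three_lane_total:,.2f}")
print(f"implied cost per lane: ${implied_cost_per_lane:,.2f}")
print(f"fold reduction:        {fold_reduction:.1f}x")
```

Under these assumptions the one-lane total matches the $1,200 figure in the abstract, and the implied lane cost matches the $700 per lane used in Figure 5.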
Figure 5. Effects of mutagenesis density, sequencing error, and sequencing cost on target yield and experiment cost.
A. Average target discovery yield (y-axis) as a function of experiment cost (x-axis), at different mutagenesis densities: 20 mutations per genome (green circles); 50 mutations/genome (blue +); 100 mutations/genome (red triangles). B. Total experiment cost for analyzing 32 mutant strains (y-axis), as a function of the number of tagged libraries pooled per sequencing lane (x-axis), for different levels of sequencing error (1% vs. 0.1%) and different sequencing costs ($700 per lane vs. $350 per lane): 1% error, $700 per lane (blue circles); 0.1% error, $700 per lane (red squares); 1% error, $350 per lane (green +); 0.1% error, $350 per lane (cyan triangles).
Figure 6. Effect of uniform vs. non-uniform gene size distributions on p-value scoring.
Uniform gene-size model (blue circles, dashed line); variable gene-size model based on subdividing the E. coli gene size distribution into ten size classes, each containing 424 genes represented by the average size within that class (green + markers); variable gene-size model based on the exact sizes of all 4244 E. coli genes (red line).
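To see why gene size matters for scoring, consider a minimal sketch that scores one gene under a uniform model and under a size-aware model. This is an illustration, not the paper's scoring function. It assumes that mutations fall on genes as a Poisson process with a rate proportional to gene length, takes the 4099 mutations and 4244 genes from the text, and uses a hypothetical mean gene size, gene length, and observed mutation count.

```python
from math import exp, factorial

def poisson_tail(lam: float, k: int) -> float:
    """Return P(X >= k) for X ~ Poisson(lam), summing upper-tail terms directly."""
    term = exp(-lam) * lam**k / factorial(k)
    total = 0.0
    i = k
    while term > 1e-300 and i < k + 1000:
        total += term
        i += 1
        term *= lam / i  # ratio of consecutive Poisson probabilities
    return total

TOTAL_MUTATIONS = 4099  # from the abstract
N_GENES = 4244          # from the Figure 6 caption

# Hypothetical values chosen only for illustration.
MEAN_GENE_SIZE = 1000   # bp
GENE_SIZE = 3000        # bp, a gene three times the mean length
OBSERVED = 6            # mutations observed in that gene

# Uniform model: every gene has the same expected mutation count.
lam_uniform = TOTAL_MUTATIONS / N_GENES

# Size-aware model: expected count scales with gene length.
lam_sized = TOTAL_MUTATIONS * GENE_SIZE / (N_GENES * MEAN_GENE_SIZE)

p_uniform = poisson_tail(lam_uniform, OBSERVED)
p_sized = poisson_tail(lam_sized, OBSERVED)

print(f"uniform model p-value:    {p_uniform:.3e}")
print(f"size-aware model p-value: {p_sized:.3e}")
```

For a gene longer than average, the uniform model understates the expected number of random hits and so yields a smaller p-value than the size-aware model. The reverse holds for short genes.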