Addressing challenges in the production and analysis of Illumina sequencing data

Martin Kircher et al. BMC Genomics. 2011.

Abstract

Advances in DNA sequencing technologies have made it possible to generate large amounts of sequence data very rapidly and at substantially lower cost than capillary sequencing. These new technologies have specific characteristics and limitations that either require consideration during project design or must be addressed during data analysis. Specialist skills, both at the laboratory and the computational stages of project design and analysis, are crucial to the generation of high quality data from these new platforms. The Illumina sequencers (including the Genome Analyzers I/II/IIe/IIx and the new HiScan and HiSeq) represent a widely used platform providing parallel readout of several hundred million immobilized sequences using fluorescent-dye reversible-terminator chemistry. Sequencing library quality, sample handling, instrument settings and sequencing chemistry have a strong impact on sequencing run quality. The presence of adapter chimeras and adapter sequences at the end of short-insert molecules, as well as increased error rates and short read lengths, complicates many computational analyses. We discuss here some of the factors that influence the frequency and severity of these problems and provide solutions for circumventing them. Further, we present a set of general principles for good analysis practice that enable problems with sequencing runs to be identified and dealt with.

Figures

Figure 1

Illumina sample preparation and sequencing. Illumina sequencing requires that a DNA sample (a) is converted into special sequencing libraries. This can be achieved by shearing DNA to a designated size and adding specific adapter sequences on both ends of the DNA molecules (b). These adapters allow molecules to be amplified and immobilized in one or more channels of an 8-channel flow cell (c). Immobilization and solid-phase amplification create randomly scattered clusters, each consisting of a few thousand copies of the original molecule in very close proximity to each other. One of the DNA strands is removed to obtain single-stranded, identically oriented copies, the 3' ends of the DNA are blocked, and a sequencing primer is hybridized to the adapter sequences. Afterwards, the reversible terminator chemistry is performed (d). Here, four differently labeled nucleotides are provided and used for extension of the primers by DNA polymerases. The polymerase reaction terminates after the first base incorporation since the nucleotides used are not only labeled, but also 3'-blocked. After free nucleotides are washed away, the incorporated nucleotides are read out by piece-wise imaging of the flow cell. Then, the terminator and fluorophore are removed and another incorporation cycle is started. The four images are overlaid (registered) and light intensities are extracted for each cluster and cycle using a cluster position template obtained from the first instrument cycles (e). The resulting intensity files serve as input for base calling, the conversion of intensity values into bases and quality scores (f).

Figure 2

Adapters and adapter chimeras are a common source of sequence artifacts. Specific outer adapter sequences, complementary to the grafting sequences on the flow cell, are essentially the only requirement for sequencing a DNA library on the Genome Analyzer platform. As different sequencing primers can be used, library design is very flexible and various protocols with partially distinct adapter sequences have been established. The Illumina NlaIII DGE tag protocol illustrated here (a protocol for digital gene expression tag profiling) uses short adapters which are not compatible with paired end sequencing and are added by overhang ligation (A). For this protocol the majority of adapter dimers are removed by a gel excision step after library preparation. However, the protocol may also create adapter chimeras with a length comparable to the targeted library molecules. The resulting chimera sequences also carry the sequences required for cluster generation as well as the necessary priming site, causing them to be sequenced together with the real DGE tags. A program like TagDust [29] can be used with the original adapter and primer oligonucleotide sequences to identify such artifacts (B). Shown are the twenty most frequently identified artifacts from one lane with human DGE tags, as well as the oligonucleotide sequences they might be based on. One of the 20 sequences seems to be a real DGE tag that was incorrectly identified as an artifact.
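The idea of screening reads against the known adapter and primer oligonucleotides can be sketched roughly as follows. This is a minimal illustration only, not TagDust's implementation: the function names, the k-mer size and the example oligo are assumptions made for this sketch, and a real tool would also need a principled cutoff.

```python
def kmers(seq, k):
    """Return the set of all k-mers contained in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def adapter_fraction(read, oligos, k=8):
    """Fraction of the read's k-mers that also occur in any library oligo.

    Values near 1.0 suggest the read is built mostly from adapter/primer
    sequence (for example an adapter chimera); values near 0.0 suggest a
    genuine library insert. The k-mer size is an arbitrary choice here.
    """
    adapter_kmers = set().union(*(kmers(o, k) for o in oligos))
    read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not read_kmers:
        return 0.0  # read shorter than k: nothing to compare
    hits = sum(km in adapter_kmers for km in read_kmers)
    return hits / len(read_kmers)


# Hypothetical placeholder oligo, not a real Illumina sequence.
example_oligos = ["ACGTACGTGGCCTTAGCA"]
```

Reads scoring above some chosen threshold would then be set aside for inspection rather than discarded blindly, since, as the caption notes, a real tag can occasionally be flagged as an artifact.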

Figure 3

Effects of adapter sequence inclusion on mapping. Untrimmed adapter sequence at the read ends can interfere with alignment/mapping. We simulated 101-cycle human genomic shotgun reads for an Illumina Paired End library, with 10,000 reads for every adapter starting point between 1 and 350nt and the error profile observed for an actual run of this length. On this data set, we tested how ELAND and BWA are affected by inclusion of adapter sequence: (A) ELAND requires only a fixed seed (here 32nt) at the beginning of the read. Adapters beginning after this seed region may therefore have no effect on the output. ELAND reports 98% successful mappings for all simulated reads of at least 30nt insert size (2nt of adapter sequence being compensated by the 2 mismatches allowed in the seed), whereas BWA only reports 98% successful mappings for reads with an insert size of at least 97nt. (B) Frequently only uniquely placed molecules are considered in data analysis. ELAND reports the first uniquely placed fragment for 20nt insert size. BWA reports the first three uniquely placed fragments (mapping quality above 20) for an insert size of 83nt. (C) All uniquely placed reads reported by ELAND up to an insert length of 67nt are placed incorrectly (when compared to the coordinates the sequence was extracted from), as is one of the 3 reported by BWA for an insert size of 83nt. When requiring 98% correct placements, ELAND handles up to 14nt of adapter (83nt insert size), while BWA can only compensate with mismatches for 4nt of adapter sequence (97nt insert size). (D) For analysis purposes, BWA shows the better performance due to the lower number of false positive placements. Moreover, for an insert size of at least the read length (i.e. no adapters interfering with the alignment), BWA reports 99.999% of uniquely placed reads (94.2% of all reported alignments) at the designated genomic positions, while ELAND only reports 98.757% of the uniquely placed reads (83.8% of all reported alignments) at the correct position.
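Since untrimmed adapter sequence degrades mapping, one common remedy is to clip the adapter from the 3' end of each read before alignment. The following is a minimal sketch of such a trimmer, assuming a simple mismatch-tolerant prefix comparison; the function name, the thresholds and the example adapter prefix are illustrative choices for this sketch and are not taken from the article.

```python
def trim_adapter(read, adapter, min_overlap=3, max_mismatch_rate=0.1):
    """Clip adapter sequence from the 3' end of a read.

    Scans start positions from left to right and cuts at the first one
    where the rest of the read matches the start of the adapter with at
    most max_mismatch_rate * overlap mismatches. Partial adapters at the
    very end of the read are detected down to min_overlap bases.
    """
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(a != b for a, b in zip(window, adapter))
        if mismatches <= max_mismatch_rate * overlap:
            return read[:start]  # keep only the insert
    return read  # no adapter found


# Assumed example adapter prefix, used for illustration only.
ADAPTER = "AGATCGGAAGAGC"
insert = "TTGACCATGCA"
trimmed = trim_adapter(insert + ADAPTER, ADAPTER)
```

A small `min_overlap` removes more residual adapter bases but also clips genuine insert sequence by chance more often, so the value should be tuned to the downstream analysis.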

Figure 4

Origin of image artifacts. Correct instrument adjustment is an important prerequisite for producing high quality sequencing data. Preparation and start of a sequencing run have to be done with careful attention to avoid or identify the following instrumentation artifacts: (A) Air bubbles, caused by leaks, insufficient priming of reagent pumps and long waiting times. Bubbles can obscure parts of the images or reduce chemistry efficiency. (B) Particles in the sequencing chemistry (e.g. crystals from an unfiltered incorporation mix) frequently result in image artifacts. (C) Incorrect adjustment of stage flatness and stage tilt can cause distortions, i.e. parts of the image are sharp while the rest is out of focus. A similar effect, limited to tiles at the flow cell edges, can originate from liquids covering the flow cell surface. (D) Reflections in the GA instrument can cause variation in cluster brightness, like the commonly observed band of bright clusters in column 2 of lane 8. (E) If the position of laser excitation is not in sync with imaging (footprint) on the GA instrument, a straight black band can be observed at the edges of multiple tiles (partially with comb-like slots). (F) If this effect is limited to tiles at the flow cell edges, oil coverage is insufficient.

Figure 5

Image artifacts can generate false sequences. Cluster identification can identify crystals, dust and lint particles as well as other flow cell features as sequence clusters (A). Indicated are 103 non-library sequences originating from a lint particle that was observed in a library sequenced with a three base pair tag ('GAC') at the beginning of each read. In this case, non-library sequences could therefore be distinguished based on these first three bases. The fraction of such artifact clusters is increased for low loading density and low intensity runs. A sequence entropy filter is efficient for removing the majority of these sequences (82.52% for a cutoff of 0.85), but also removes non-artifact sequences (B): as indicated in the figure, 0.01% of the human reference genome (GRCh37/hg19). For 3'/5' tagged libraries or indexed sequencing libraries, filtering for the index/tag is therefore superior to base composition/sequence entropy filters for removing such sequencing artifacts.
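The two filtering strategies compared above can be sketched as follows. The entropy formula used in the article is not reproduced in this text, so the sketch assumes plain Shannon entropy of the base composition in bits (0 for a homopolymer, 2 for equal use of four bases); the article's cutoff of 0.85 may refer to a differently scaled measure and is used here only as a default value. The function names are hypothetical.

```python
import math
from collections import Counter


def base_entropy(seq):
    """Shannon entropy (bits) of the base composition of seq.

    Assumption: this is a generic definition, not necessarily the
    formula used in the article.
    """
    total = len(seq)
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def passes_entropy_filter(seq, cutoff=0.85):
    """Keep reads whose base composition entropy exceeds the cutoff."""
    return base_entropy(seq) > cutoff


def passes_tag_filter(read, tag="GAC"):
    """Keep reads that start with the expected library tag."""
    return read.startswith(tag)
```

The tag filter relies on prior knowledge of the library design, whereas the entropy filter judges sequence composition alone, so the entropy filter can also discard genuine low-complexity genomic sequence.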

References

    1. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456(7218):53–9. doi: 10.1038/nature07517.
    2. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008;9:387–402. doi: 10.1146/annurev.genom.9.081307.164359.
    3. Schuster SC. Next-generation sequencing transforms today's biology. Nat Methods. 2008;5(1):16–8. doi: 10.1038/nmeth1156.
    4. Ansorge WJ. Next-generation DNA sequencing techniques. N Biotechnol. 2009;25(4):195–203. doi: 10.1016/j.nbt.2008.12.009.
    5. Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010;11(1):31–46. doi: 10.1038/nrg2626.