A Drosophila full-length cDNA resource

Sequencing strategy

The Drosophila Gene Collection (DGC) consists of two releases, DGCr1 and DGCr2. A process flow diagram of our sequencing strategies is available online (http://www.fruitfly.org/DGC/FLSWorkflow.html) [32] and is summarized below. The clones in DGCr1 were arrayed by insert size [1] and sequenced accordingly; clones in DGCr2 were not arrayed by size. DGCr1 clones smaller than 1.4 kb were assembled using phrap [33] and analyzed with custom scripts to determine whether they were complete. Autofinish (part of the consed software package) was used to automatically design custom primers [10] for clones that needed quality improvement. Clones that did not finish in the first two rounds of Autofinish were sent to a manual finishing queue for more sophisticated finishing. cDNA clones larger than 1.4 kb were divided into three groups: 1.4 to 3 kb, 3 to 4.5 kb, and greater than 4.5 kb. All clones in these groups were sequenced using the _in vitro_ Template Generation System (TGS™; Finnzymes). Clones of 3 to 4.5 kb were sequenced using a minimal path of transposon-bearing clones. Clones of 1.4 to 3 kb and clones greater than 4.5 kb were sequenced with 24 and 48 unmapped transposon-bearing clones, respectively. After the initial cycle of transposon sequencing, the clones were analyzed using in-house scripts and Autofinish to determine their state of completeness and quality.

DGCr2 clones were sequenced using 24 unmapped transposon-bearing clones. After an initial cycle of transposon sequencing, the clones were analyzed for completeness and quality as described above for DGCr1 clones, using in-house scripts and Autofinish. DGCr2 clone sequences were screened for transposable element sequences, cases of co-ligation, and presence of a poly(A) tail before any finishing work was ordered.
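The size-based routing of DGCr1 clones described above can be summarized in a short sketch. This is only an illustration, not the authors' in-house scripts: the function and field names are hypothetical, and the handling of clones that fall exactly on a size boundary (1.4, 3, or 4.5 kb) is an assumption, since the text does not specify it.

```python
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    method: str
    # Number of unmapped transposon-bearing clones to sequence;
    # None when the text gives no fixed number for this size class.
    unmapped_clones: Optional[int]


def dgcr1_strategy(insert_kb: float) -> Strategy:
    """Map a DGCr1 insert size (kb) to the strategy described in the text.

    Boundary handling is an assumption made for this sketch.
    """
    if insert_kb < 1.4:
        return Strategy("phrap assembly, Autofinish primers if needed", None)
    if insert_kb < 3.0:
        return Strategy("unmapped transposon insertions", 24)
    if insert_kb <= 4.5:
        return Strategy("mapped minimal path of transposon insertions", None)
    return Strategy("unmapped transposon insertions", 48)


if __name__ == "__main__":
    for size in (1.0, 2.2, 3.8, 6.0):
        print(size, dgcr1_strategy(size))
```

All DGCr2 clones would take the 24-clone unmapped route regardless of size, so no routing function is needed for that release.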

_In vitro_ transposition and mapping of insertion sites

Transposon insertion reactions were carried out in 96-well format using the Template Generation System (TGS™) according to the manufacturer's recommendations (Finnzymes). Each reaction consisted of 1 μl (50-150 ng) plasmid DNA isolated on Qiagen or Revprep DNA isolation robots, 1.6 μl 5× reaction buffer, 8 ng Entranceposon (KanR), 0.4 μl MuA transposase, and deionized water to bring the final volume to 8 μl. Reactions were carried out in PCR plates and incubated in an ABI thermocycler according to the manufacturer's instructions. After heat inactivation of the MuA transposase, 2 μl of the reaction was used to transform 17 μl of DH5α chemically competent cells (Invitrogen) in 96-well format. Following incubation at 37°C for 1 h in 183 μl SOC medium, cells were plated onto medium selecting for both vector and Entranceposon antibiotic resistance. Plates were incubated at 37°C overnight. Colonies were picked into 1.2-ml polypropylene titer tubes (E&K Scientific) containing 0.5 ml LB medium supplemented with 7.5% glycerol and the appropriate antibiotics and incubated at 37°C overnight. These stocks were then used to inoculate 1.2 ml 2XYT medium in 96-well square deep-well plates (E&K Scientific) for culture and plasmid DNA preparation.

Transposon insertion sites were mapped relative to the vector ends by PCR essentially as described [34]. Forty-eight transposon-bearing clones were picked for PCR mapping using the Mu-End primer (present at both ends of the transposon) in combination with vector-specific primers, resulting in 96 PCR products. Agarose gels were imaged using custom software developed in-house (Earl Cornell, LBNL) and analyzed using an algorithm, Supertramp [35,36], to identify a minimal path of transposon-bearing clones to be re-arrayed and sequenced.
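To illustrate what selecting a minimal path of mapped insertions involves, the sketch below uses a generic greedy interval-cover approach. It is not the Supertramp algorithm, whose details are not given here. It assumes that each insertion yields one read in each direction, that the two vector-end reads cover the ends of the insert, and that a fixed usable read length applies; `read_len=500` is a hypothetical value, and the function name is illustrative.

```python
def minimal_path(insert_len, sites, read_len=500):
    """Pick a small set of insertion sites whose reads tile the insert.

    Assumptions of this sketch (not taken from the source):
      * an insertion at position p yields reads covering [p - read_len, p + read_len]
      * vector-end reads cover [0, read_len] and [insert_len - read_len, insert_len]
    Returns the chosen insertion positions in 5'-to-3' order.
    """
    chosen = []
    covered_to = min(read_len, insert_len)   # right edge covered by the 5' end read
    goal = max(insert_len - read_len, 0)     # left edge of the 3' end read
    candidates = sorted(set(sites))
    while covered_to < goal:
        # Insertions whose leftward read reaches the covered region
        # and whose rightward read extends beyond it.
        reachable = [p for p in candidates
                     if p - read_len <= covered_to < p + read_len]
        if not reachable:
            raise ValueError(f"no mapped insertion extends coverage past {covered_to}")
        best = max(reachable)                # extends coverage the furthest
        chosen.append(best)
        covered_to = min(best + read_len, insert_len)
    return chosen


if __name__ == "__main__":
    mapped = [200, 450, 900, 1400, 1600, 2300, 2500, 3100]
    print(minimal_path(3500, mapped))
```

A production version would presumably also require some overlap between adjacent reads and tolerate the position uncertainty of gel-based mapping; both are omitted here for brevity.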

DNA sequencing

Purified plasmid DNA from transposon-bearing clones was sequenced using 2 μl ABI BigDye II Dye terminator mix (Applied Biosystems) in a 10-μl reaction. Sequencing reactions were processed through 96-well Sephadex G-50 SF plates (Multiscreen filter plates; Millipore) and loaded onto an ABI Prism 3700 DNA Analyzer. Sequencing primers specific for each end of the Entranceposon were used in the reactions (5'-ATCAGCGGCCGCGATCC-3' and 5'-TTATTCGGTCGAAAAGGATCC-3'). Sequencing of 5' and 3' cDNA ends was carried out as previously described [2]. The sequencing reported here was carried out over a 2-year period during which we made several major modifications to the strategy; for example, switching from sequencing mapped transposon insertions to sequencing random (unmapped) insertions. These changes improved throughput and cycle time, but increased the number of sequencing reads required per clone. Because of these changes, it is not possible to give a meaningful single efficiency estimate; however, our overall efficiency is comparable to that of other efforts using a similar strategy [8,9].

Data processing and assembly

cDNA clone data management relied on custom scripts and an Informix database. Sequences were processed using phred [37,38] and assembled using phrap [33]. 5' and 3' EST end-reads were combined with the transposon-based reads to generate cDNA clone assemblies. We adopted the sequence quality-control standards defined for the Mammalian Gene Collection project [39]. Custom scripts evaluated assemblies for: 5' and 3' EST reads in a single contig in the proper orientation; at least 10 bases of 3' poly(A) tail; a phrap-estimated error rate of less than one in 50,000 bases; and individual base quality of at least q25. Double-stranded coverage was not a criterion for a clone to be considered finished; however, we determined that 96.2% of all submitted bases are double-stranded and 48% of clones had complete double-stranded coverage. Autofinish [10] was used to design primers to improve quality or to extend sequence from multiple sequence contigs. cDNA clones with an estimated error rate greater than one in 50,000 bp were automatically identified and processed with additional rounds of Autofinish-designed finishing work. If Autofinish could not design primers, custom primers were designed manually using consed. Custom scripts were used to manually order primers for a further round of sequencing.
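The four finishing criteria listed above lend themselves to a simple automated check. The sketch below is a hypothetical reconstruction, not the authors' scripts. It assumes per-base quality values on the standard phred scale, where a quality q corresponds to an error probability of 10^(-q/10), and it approximates the estimated error rate as the mean per-base error probability. All function and constant names are illustrative.

```python
MAX_ERROR_RATE = 1 / 50_000   # less than one error in 50,000 bases
MIN_BASE_QUALITY = 25         # every base at least q25
MIN_POLYA = 10                # at least 10 bases of 3' poly(A) tail


def phred_to_error(q):
    """Convert a phred-scale quality value to an error probability."""
    return 10 ** (-q / 10)


def estimated_error_rate(quals):
    """Mean per-base error probability (an approximation used in this sketch)."""
    return sum(phred_to_error(q) for q in quals) / len(quals)


def finishing_failures(quals, polya_len, ends_in_one_contig):
    """Return the list of unmet criteria; an empty list means 'finished'."""
    failures = []
    if not ends_in_one_contig:
        failures.append("EST end reads not in a single contig in proper orientation")
    if polya_len < MIN_POLYA:
        failures.append("poly(A) tail shorter than 10 bases")
    if estimated_error_rate(quals) >= MAX_ERROR_RATE:
        failures.append("estimated error rate not below 1 in 50,000")
    if min(quals) < MIN_BASE_QUALITY:
        failures.append("base below q25")
    return failures


if __name__ == "__main__":
    print(finishing_failures([50] * 1000, polya_len=15, ends_in_one_contig=True))
    print(finishing_failures([45] * 1000, polya_len=15, ends_in_one_contig=True))
```

Note that the two quality criteria are independent: a clone whose bases all sit just above q25 would still fail the error-rate threshold, which requires an average quality of roughly q47.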

The sequence data described in this paper have been submitted to the GenBank data library under accession numbers:

AF132140-AF132196, AF160900,

AF132551-AF132560, AF160903-AF160904,

AF132562-AF132563, AF160906, AF160909,

AF132565-AF132567, AF160911-AF160913,

AF145594-AF145621, AF160916-AF160917,

AF145623-AF145684, AF160921, AF160923,

AF145686-AF145696, AF160929,

AF160879, AF160882, AF160933-AF160934,

AF160889-AF160891, AF160938-AF160944,

AF160893-AF160897, AF160947,

AF172635-AF172637, AY071209-AY071211,

AF181622-AF181650, AY071213-AY071216,

AF181652-AF181657, AY071218-AY071250,

AF184224-AF184230, AY071252-AY071266,

AY047496-AY047580, AY071268-AY071288,

AY050225-AY050241, AY071290-AY071313,

AY051411-AY052150, AY071315-AY071320,

AY058243-AY058797, AY071322-AY071331,

AY059433-AY059459, AY071333-AY071342,

AY060222-AY060487, AY071345,

AY060595-AY061633, AY071347-AY071381,

AY061821-AY061834, AY071383-AY071385,

AY069026-AY069757, AY071387,

AY069759-AY069867, AY071389-AY071406,

AY070491-AY070597, AY071408-AY071436,

AY070599-AY070602, AY071438-AY071445,

AY070604-AY070608, AY071447-AY071450,

AY070610-AY070623, AY071452-AY071454,

AY070625-AY070628, AY071456-AY071461,

AY070632-AY070634, AY071463-AY071476,

AY070636, AY071478-AY071489,

AY070638-AY070642, AY071491,

AY070644, AY071494-AY071543,

AY070646-AY070651, AY071545-AY071557,

AY070653-AY070656, AY071559-AY071564,

AY070658-AY070662, AY071566-AY071577,

AY070664-AY070667, AY071579-AY071581,

AY070671-AY070692, AY071583-AY071606,

AY070694-AY070716, AY071608-AY071632,

AY070777-AY070805, AY071634-AY071661,

AY070807-AY070830, AY071663-AY071664,

AY070832-AY070909, AY071666-AY071672,

AY070911-AY070913, AY071674,

AY070915-AY070920, AY071681-AY071683,

AY070922-AY070951, AY071685-AY071692,

AY070953-AY070954, AY071694-AY071703,

AY070957-AY070964, AY071705-AY071711,

AY070966, AY071713-AY071721,

AY070969-AY070973, AY071724,

AY070975-AY070985, AY071726-AY071727,

AY070987-AY071000, AY071729-AY071731,

AY071002, AY071733-AY071741,

AY071004-AY071006, AY071743-AY071745,

AY071008-AY071056, AY071747-AY071764,

AY071058-AY071064, AY071767-AY071768,

AY071066-AY071072, AY075158-AY075228,

AY071074-AY071084, AY075230-AY075262,

AY071086-AY071090, AY075264-AY075441,

AY071092, AY075443-AY075451,

AY071094-AY071136, AY075453-AY075473,

AY071138-AY071140, AY075475-AY075524,

AY071142-AY071154, AY075526-AY075588,

AY071156-AY071157, AY084089-AY084152,

AY071159-AY071197, AY084154-AY084214,

AY071199-AY071203, AY089215-AY089229,

AY071205-AY071207, AY089231-AY089329,

AY089331-AY089461, AY118273-AY118672,

AY089463-AY089564, AY118674-AY118713,

AY089566-AY089601, AY118715-AY119132,

AY089603-AY089615, AY119134-AY119287,

AY089617-AY089700, AY119441-AY119665,

AY094627-AY094871, AY121612-AY121684,

AY094873-AY094970, AY121686-AY121700,

AY094996-AY095100, AY121702-AY121717,

AY095172-AY095206, AY122061-AY122270,

AY095508-AY095533, AY128413-AY128506,

AY102649-AY102700, AY129431-AY129464,

AY113190-AY113653, BT001253-BT001904.

Analysis of finished cDNA sequences

cDNA sequences were submitted to GenBank with a preliminary annotation of the longest ORF and a gene assignment based on a high BLASTN similarity score to the Release 2 genome annotations. Subsequent processing provided a more detailed analysis of clone quality. Using BLASTN, the sequence of each cDNA clone was compared to genomic sequence, predicted genes, predicted coding sequences (CDSs), known Drosophila transposable elements, and Escherichia coli transposable elements. Using BLASTP, the translation of the longest ORF was compared to the predicted Release 3 translations [15]. Custom scripts were used to parse the BLAST output and record similarity results. We also compared the nucleotide sequence of each clone to the Release 3 genome sequence [14] using Sim4 and to the Release 3 predicted CDS with the highest BLAST score.
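As an illustration of the parsing step, the sketch below records the highest-scoring hit for each query from BLAST tabular output. The BLAST output format the authors actually parsed is not stated. This example assumes the 12-column tab-delimited layout produced by current NCBI BLAST+ (`-outfmt 6`: query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The clone and gene identifiers in the example are made up.

```python
import csv
import io


def best_hits(tabular_text):
    """Return {query_id: (subject_id, bit_score)} for the top hit per query.

    Assumes 12-column BLAST tabular output; this is an assumption of the
    sketch, not the format documented in the source.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        bit_score = float(row[11])
        if query not in best or bit_score > best[query][1]:
            best[query] = (subject, bit_score)
    return best


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    example = (
        "cloneA\tgene1\t99.5\t1000\t5\t0\t1\t1000\t1\t1000\t0.0\t1800\n"
        "cloneA\tgene2\t90.0\t500\t50\t0\t1\t500\t1\t500\t1e-100\t600\n"
        "cloneB\tgene3\t98.0\t800\t16\t0\t1\t800\t1\t800\t0.0\t1400\n"
    )
    for clone, (gene, score) in best_hits(example).items():
        print(clone, gene, score)
```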

mRNA editing

We confirmed the sequence quality of the genomic region encompassing CG018314 (12,731 bp) by independently assembling an 18,284 bp contig consisting solely of whole-genome shotgun (WGS) traces. The assembled contig has an average of 8.6× sequence coverage. The phrap quality value for each genomic base corresponding to an mRNA-edited base is q90 (an estimated error probability of 10⁻⁹). Similarly, we determined the phrap quality value for each edited base in the mRNA sequence to be q90. We manually inspected chromatograms for high-quality discrepancies in the genomic sequence and found none, indicating that the edited bases are not due to population heterozygosity. To validate the editing sites, total RNA was isolated from the heads of a mixed population of male and female adult flies of the isogenic strain _y_¹; _cn_¹ _bw_¹ _sp_¹ using the Concert™ Cytoplasmic RNA isolation reagent according to the manufacturer's guidelines (Invitrogen). Nine independent gene-specific RT-PCR reactions were performed using the Superscript™ one-step RT-PCR kit according to the manufacturer's instructions (Invitrogen), and PCR products were cloned into the pCR2.1 vector. Twenty-four independent subclones from each of four independent RT-PCR products and twelve independent subclones from each of the remaining five RT-PCR products were sequenced; we considered amplicons to represent independent transcripts if they arose from different RT-PCR reactions or if they differed in sequence. The gene-specific primers used in the RT-PCR experiments were 5'-GTGCAGACGAAAACGAGATGCCAATG-3' and 5'-TGTAGTTCTTCTCAAAGGGATTACG-3'.