Complete sequencing and characterization of 21,243 full-length human cDNAs (original) (raw)

Main

Now that most of the human genomic sequence has been determined1,2,3,4,5 efforts are focused on its annotation. Full-length cDNAs, which are complete copies of mRNAs, are a particularly important resource for identifying genes and determining their structural features, forming a basis for transcriptome analysis. Physical cDNA clones are also indispensable reagents in the experimental analysis of gene functions, particularly in higher eukaryotes, such as humans.

In 1999, we started the FLJ project, which aimed to collect and determine the complete sequences of putatively full-length human cDNAs. At that time, there were only about 6,000 cDNAs in RefSeq, the curated informational resource6, with full-length cDNA sequence information. One million cDNAs clones were available from the IMAGE collection, the physical resource for cDNA clones. Unfortunately, most IMAGE clones were only partly sequenced (as expressed-sequence tags; ESTs7) and there were no good clues to indicate which ones were full-length. Therefore, we began the large-scale collection and full-length sequencing of cDNAs not only to obtain cDNA sequence information but also to provide a physical source of cDNA clones. Here, we report the first characterization of 21,243 clones.

Results

Collecting and sequencing full-length cDNAs

To facilitate the large-scale collection and sequencing of full-length cDNAs, we constructed 107 human cDNA libraries enriched for full-length cDNA clones representing 61 tissues, 21 primary cell cultures and 16 cell lines. We used a cap-targeted selection method called oligo-capping8,9 for all but one spleen library, which was constructed using a highly selective size-fractionation method to clone cDNAs of long mRNAs10. Supplementary Table 1 online shows the cDNA libraries we used. The average frequency of full-length cDNA clones in the libraries was 85%.

We randomly picked cDNA clones from these libraries and subjected them to one-pass sequencing. In total, we obtained the 5′-end one-pass sequences from 1,154,510 cDNA clones. In some cases, we compared these sequences to GenBank using BLAST searching11. We found that 40–60% of our cDNA sequences matched RefSeq entries at the time of the search. The rest matched only human ESTs or had no matches. We selected 21,243 cDNA clones from the two latter categories and determined their complete sequences. For each of the cDNAs, we completely sequenced both strands, mainly using the primer walking method12. Judging from the sequence quality scores calculated using Phred13, the accuracy of the sequence data was more than 99.99%. The average length of the cDNAs was 2,314 bp (Supplementary Fig. 1 online). All the sequence data have been registered in public databases through the DNA database of Japan (DDBJ). Searches for the cDNAs by accession numbers, chromosomal positions, various keywords or sequence similarities are enabled at our website (http://fldb.hri.co.jp/ cgi-bin/cDNA3/public/publication/index.cgi). All physical cDNA clones are freely available for research use on request. Requests for physical cDNA clones should be sent to flcdna@ims.u-tokyo.ac.jp or isogai-t@reprori.jp.

Comparison of the FLJ collection with predicted genes

We used BLAST searches to compare our full-length cDNA sequences in series with RefSeq (and other relevant data sets, as outlined in Fig. 1). For 14,490 cDNAs, there were no matches in RefSeq (excluding the 2,313 RefSeq entries that were derived from the FLJ collection), indicating that the full sequences of these cDNAs were unique to the FLJ collection ('FLJ-unique'). The remainder (6,753 cDNAs in 4,263 clusters) at least partially matched sequences in RefSeq. This overlap was because many cDNA sequences that did not have matches at the time of the selection were later sequenced by other researchers during the course of this project. The matches also included alternatively spliced isoforms in RefSeq. There were also a number of cDNAs that were identical to sequences in RefSeq but were selected because of inaccurate one-pass sequence data or human error when picking the clones. We clustered the 14,490 FLJ-unique cDNAs pairwise to remove the redundancy, resulting in 10,897 nonredundant cDNA clusters ('nonredundant FLJ-unique').

Figure 1: Flow chart of cDNA categorization.

Each cDNA was categorized as shown here. For further details on the categorization (steps *1–*3), refer to Supplementary Note and Supplementary Figure 6 online. Detailed descriptions of the cut-offs and the supporting evidence are also presented there.

Full size image

To determine what proportion of these 10,897 nonredundant FLJ-unique clusters had been previously predicted from the genome sequence (for chromosome assignments of the FLJ cDNAs, see Supplementary Table 2 online), we used BLAST searches to compare these cDNAs with Ensembl genes (using version 4.28.1; 29,076 genes), which are representative predicted genes based on comprehensive analyses of a wide range of evidence and are expected to cover most human genes14. Ensembl genes supported by RefSeq were removed from the search. Among the 10,897 clusters, 2,774 at least partly matched Ensembl genes ('Ensembl-predicted') and 8,123 did not ('Ensembl-nonpredicted').

We then examined whether the 8,123 Ensembl-nonpredicted clusters were predicted by ab initio gene prediction programs, such as DIGIT, FGENESH, GENSCAN and HMMGENES, because the criteria for identifying Ensembl genes are based on a somewhat conservative method of evaluation. As shown in Table 1, 643 Ensembl-nonpredicted clusters were predicted by at least one of these programs ('_ab initio_–predicted'). In total, 3,417 (31%) of the 10,897 nonredundant FLJ-unique clusters corresponded to genes predicted by one or more of the computational methods. These ab initio methods did not predict the existence of the remaining 7,480 clusters ('nonpredicted').

Table 1 Sequence comparison of the 10,897 non-redundant FLJ unique cDNAs and computationally predicted cDNAs

Full size table

One possible reason why these sequences were not predicted by the computational methods is that most of them were non-protein-coding, and many computational methods predict only protein-coding regions as genes. To check this possibility, we determined how many clusters met the criteria of ORF length >100 codons and coding potential >30 (calculated according to the standard method15). We used these criteria because most of the ORFs registered in RefSeq satisfied them, whereas ORFs randomly occurring in the human genome seldom do (Supplementary Figs. 2 and 3 online). Of 7,480 nonpredicted clusters, 1,999 ('new protein-coding') satisfied the criteria. Altogether, there were 5,416 protein-coding clusters (2,774 Ensembl-predicted, 643 _ab initio_–predicted and 1,999 new protein-coding clusters) among the 10,897 nonredundant FLJ-unique clusters ('protein-coding'). We found no ORFs meeting the criteria in the remaining 5,481 nonredundant FLJ-unique clusters (see also Supplementary Note and Supplementary Figs. 7 and 8 online).

Non-protein-coding FLJ-unique cDNA clusters

The length distribution of the cDNAs in the 5,481 clusters that were categorized as non-protein-coding was similar to that of the other clusters (see also Supplementary Fig. 4 online). Although some of these cDNAs might be cloning artifacts, recent evidence indicates that unexpectedly large populations of non-protein-coding transcripts exist in mammalian cells16,17. Genomic alignments of these clusters clearly showed that splicing had occurred for 1,378 of the 5,481 clusters. This splicing can be taken as explicit evidence that these cDNAs were derived from transcripts. We further examined whether these cDNAs contained any protein-coding-like sequences using GENSCAN, because it was possible that ORFs might be disrupted by retained introns, and because GENSCAN can identify interrupted ORFs in cDNA sequences. GENSCAN detected no coding exon–like regions in at least 850 clusters, ruling out the possibility that the retained introns disrupted the ORFs in these clusters.

To characterize the 850 non-coding-transcript clusters, we carried out BLAST searches using our one-pass sequence database (Supplementary Table 1 online), which contains many cDNA sequences derived from various tissues. On average, the BLAST search resulted in 3.7 one-pass-sequence matches per cDNA. Many cDNAs matched a few one-pass sequences, indicating their low expression levels, and 44 clusters matched >10 one-pass sequences. By determining from which cDNA library the matching one-pass sequences were derived, we were able to identify the tissue distribution patterns of the expression of these genes. Some cDNAs had ubiquitous expression patterns, whereas others had very tissue-specific patterns. For 22 cDNAs, >20% of the matching one-pass sequences were derived from a particular cDNA library (examples are shown in Supplementary Fig. 4 online).

GC content of new protein-coding cDNAs

To determine why 1,999 new protein-coding clusters in our FLJ-unique cDNAs were not predicted in Ensembl or by other computational methods, we compared the GC content of cDNAs between new protein-coding and 'predicted' (Ensembl-predicted and _ab initio_–predicted) clusters. The GC content of RefSeq mRNAs ranged broadly between 30% and 70% with two peaks, one at ∼42% and the other at ∼58% (Fig. 2a). The GC content of predicted and new protein-coding cDNAs showed similar broad distributions, but with only one peak each: at 42% for predicted and 58% for nonpredicted cDNAs. This suggests that there is some prediction bias against GC-rich transcripts in current gene prediction procedures. This can be seen also in the distribution of GC content of Ensembl genes. In contrast to RefSeq mRNAs, the peak at ∼58% for all Ensembl genes is less pronounced.

Figure 2: GC contents of the FLJ cDNAs and the corresponding genomic regions to which they were mapped.

(a) GC contents of the new protein-coding cDNAs, non-protein-coding cDNAs, RefSeqs and Ensembl transcripts are shown. (b) GC contents of the genomic regions to which the corresponding transcripts were mapped (from the 5′ ends of first exons to the 3′ ends of the last exons) are shown for each category of the transcripts. For the detailed protocol for the chromosomal assignments of the FLJ cDNAs, please refer to Supplementary Note online (section on length distribution and chromosomal assignments of the 10,897 nonredundant FLJ-unique clusters section). Chromosomal positions of RefSeqs and Ensembl genes are as presented at University of California Santa Cruz genome browser.

Full size image

As introns are generally more AT-rich than exons4, the distributions of GC content of the corresponding genomic regions for both new protein-coding and predicted cDNAs shifted towards being more AT-rich. But overall patterns showed similar tendencies as cDNAs (Fig. 2b). Thus, current gene prediction procedures may have slight bias against predicting genes in GC-rich regions. This contradicts previous observations that the accuracy of these gene prediction methods is insensitive to the GC content (or is better in GC-rich regions)18,19. The discrepancy is probably caused by the fact that previous analyses were based on a smaller data set and may not be extrapolated accurately to the full-genome scale.

We also calculated distribution of the GC content of noncoding and putative noncoding cDNAs (Fig. 2). To our surprise, both types of cDNA were relatively AT-rich (Fig. 2a). The new protein-coding and putative noncoding cDNAs were originally grouped together as 'nonpredicted' and later separated according to the criteria of ORF length >100 codons and coding potential >30 (Fig. 1). Thus, we anticipated that these two categories might have similar GC content distributions. Instead, the GC content distributions of both putative noncoding and noncoding cDNAs had a peak at ∼42%, similar to that of the predicted clones, but the range was much narrower. The GC content of the genomic regions where those cDNAs were mapped showed a similar AT-rich tendency (Fig. 2b). This raised the possibility that the noncoding cDNAs and the new protein-coding cDNAs are mainly transcribed from different regions of the human genome.

Annotation of protein-coding FLJ-unique cDNAs

For the 5,416 protein-coding clusters, we determined their amino acid sequences from the corresponding cDNA sequences. From the 1,999 new protein-coding cDNAs, we removed cDNAs that seemed to be derived from possibly truncated or immature forms of mRNA (T. Nishikawa et al., unpublished data; see also Supplementary Note online), leaving 1,872 cDNAs ('full new protein-coding'). Thus, a total of 5,289 clusters were subjected to further analysis. The average ORF length was 335 codons (see Supplementary Fig. 1 online for the distribution of the ORF length).

Using the amino acid sequences, we searched the protein motif databases PROSITE (version 16) and Pfam (version 5.5). In the PROSITE search, we found that 1,318 (25%) of these cDNAs had some protein signature (Table 2). In the Pfam search, we found that 3,112 (59%) of the cDNAs had some Pfam motif(s). In total, 1,529 kinds of Pfam motif, corresponding to 63% of the total 2,478 Pfam motifs, were represented. We also searched for putative membrane proteins and secretory proteins using SOSui and PSORT II, respectively (Table 3), which predicted 244 cDNAs for secretory proteins and 848 cDNAs for membrane proteins. Detailed functional annotations for each of these cDNAs are available at our websites20,21.

Table 2 Functional categorization and distribution of the protein motifs of the 5,289 protein-coding clusters

Full size table

Table 3 Predicted secretory proteins and membrane proteins

Full size table

Confirmation of transcripts from chromosomes 20–22

We mapped 45, 16 and 39 clusters of protein-coding transcripts to chromosomes 20, 21 and 22, respectively (Supplementary Table 2 online). These chromosomes are 'finished' chromosomes, whose initial annotations are considered complete. As shown in Table 4, 491 of 10,897 nonredundant FLJ-unique clusters mapped to these chromosomes. Of these 491, 268 were protein-coding cDNAs, including 100 full new protein-coding cDNAs. There were 2,188 previously predicted genes (Ensembl genes and _ab initio_–predicted genes). Of these, 1,004 were experimentally identified cDNAs registered in RefSeq; thus, our analyses of FLJ cDNAs confirmed the presence of 168 (14%) of the 1,184 genes that had been predicted without full-length cDNA support and also identified an additional 100 protein-coding genes. This result emphasizes that full-length cDNA data should contribute to the precise annotation of the human genome.

Table 4 FLJ cDNAs mapped to chromosomes 20, 21 and 22

Full size table

Of these 100 full new protein-coding clusters, we experimentally confirmed the expression of 84 (84%) by RT-PCR using eight kinds of human tissue (Supplementary Fig. 5 online). For the other cDNAs, we observed no clear band under our experimental conditions, perhaps because their expression levels were too low or their expression was highly specific for the tissues or operation materials from which we isolated the RNAs to construct the cDNA libraries.

We also attempted RT-PCR analysis of the noncoding transcripts that mapped to chromosomes 20–22. We detected PCR bands for 32 of these clusters (74%) and observed both ubiquitous and tissue-specific expression patterns (Supplementary Fig. 5 online). There seem to be at least hundreds of potential non-protein-coding transcripts in the human genome, some of which are expressed ubiquitously and others in a tissue-specific manner.

Discussion

Here we describe the complete sequencing and characterization of 21,243 cDNA clones of our FLJ cDNA collection (the results of statistical analyses are summarized in Supplementary Table 3 online). Of these, the full-length sequences of 14,490 cDNA clones or 10,876 cDNA clusters were unique to the FLJ project. Within the FLJ-unique clusters, 5,416 seemed to be protein-coding, and these included 2,774 Ensembl-predicted, 643 _ab initio_–predicted and 1,999 non-predicted protein-coding genes.

About two-thirds of the protein-coding genes (3,417 of 5,416) were predicted by computational methods. The total number of Ensembl-predicted genes is currently about 29,000, of which about 15,000 have been confirmed by fully sequenced cDNAs (according to RefSeq as of 1 August 2002). Here we confirmed an additional 2,774 such genes by identifying fully sequenced cDNA clusters (Ensembl-predicted). This corresponds to ∼20% of the 14,000 Ensembl genes that lacked full-length cDNA support. The overlap was not as large as we expected, considering the scale of our project and the fact that Ensembl uses comprehensive evidence of protein homology from various organisms. One reason for this may be that cDNAs derived from long mRNAs were under-represented in the cDNA libraries used in our project. There is a cDNA sequencing project, KIAA, in which considerable effort is aimed at collecting cDNAs of long transcripts10. Such projects, as well as technical development, will be needed to collect cDNAs of long mRNAs. Another reason for the limited overlap may be the limited repertory of our cDNA libraries. Although we analyzed more than 100 different cDNA libraries, we might have missed transcripts whose expression is limited to small organs or rare cell types.

About one-third of protein-coding cDNAs (1,999 of 5,416) were not predicted by Ensembl or by ab initio gene predictions. We found that these new protein-coding cDNAs are relatively GC-rich. Thus, gene prediction methods that are currently in use may have some bias against GC-rich transcripts. Consistent with this finding, we found that the GC content distribution of RefSeq entries has two peaks at ∼42% and ∼58%, and that the peak at ∼58% becomes insignificant when analyzing all Ensembl genes. This supports the suggestion that Ensembl genes that lack RefSeq support tend to be AT-rich. Bernardi proposed that the human genome could be divided into five different GC compositional categories (L1, L2 (AT-rich regions), H1, H2 and H3 (GC-rich regions)) and that L1 and L2 comprise ∼70% of the human genome22. He predicted that GC-rich isochores, especially H3 (GC >48%), are gene-rich and that ∼70% of genes are in the GC-rich region (GC >42%). Recent analysis using human genome draft sequence confirmed that known genes were rich in GC-rich regions (∼70%), although the length and distribution of GC-rich regions and AT-rich regions vary widely in the human genome4,5. The estimate based on the gene prediction suggests that ∼50% rather than ∼70% of genes are present in GC-rich regions5. Our results suggest that there may be more genes to discover in GC-rich regions of the human genome. The new protein-coding cDNAs may be a good training set for improving the gene prediction methods.

In addition to protein-coding genes, we observed a number of putative noncoding clusters (5,481 clusters) among our nonredundant FLJ-unique clusters. About 1,300 of them were derived from spliced transcripts and 850 of them ('noncoding') contained no predicted exons. RT-PCR showed that at least some of them are transcribed in vivo (Supplementary Fig. 5 online). Thus, we consider that most noncoding cDNAs come from real transcripts. The GC content of these cDNAs and the genomic regions in which they were mapped were in the range of low-GC regions. About 65% of those noncoding transcripts are transcribed from genomic regions that were low-GC regions and more than 5 kb upstream or downstream of any RefSeq- or Ensembl-predicted genes (data not shown). We categorized the remaining putative noncoding clusters as 'unclassified'. At present, we think that most of these unclassified cDNAs came from transcripts or their fragments rather than genomic DNA contaminants. For more detailed discussion of this issue, see Supplementary Note online.

Several large-scale projects for the systematic collection and complete sequencing of human and mouse cDNAs are underway23,24,25. The cDNAs identified in the present study, together with those from other projects, should produce a nearly complete physical collection of full-length cDNAs for human genes and those of important model organisms. For analysis of the proteome, the cDNAs are being transferred to various types of expression vectors. Recombinant proteins are being expressed with fusion tags and used for the systematic purification and identification of protein complexes26,27. Projects aimed at the large-scale determination of the three-dimensional structures of proteins have also been initiated based on the full-length cDNA resources28. Comprehensive analysis of the genome, transcriptome and proteome will lead to a better understanding of the architecture of life.

Methods

Construction and characterization of the libraries.

We constructed the cDNA libraries used in this project as previously described9,10. To evaluate the frequency of the full-length cDNAs in each of the libraries, we carried out BLAST searches with the cut-off value of e-100 and examined the relative positions of the 5′ end of the oligo-capped cDNAs compared with the RefSeq data, using all the one-pass sequences produced from each of the libraries. When the one-pass sequences covered the annotated coding-sequence start sites, we categorized them as 'full or near-full'. We applied this criterion because, in many cases, transcriptional start sites could not be sharply defined owing to possible slippery transcriptional start events29, which made it difficult to determine exactly which cDNAs should be more specifically categorized as either 'full' or 'near-full'. The proportion of full-length sequences was calculated as the frequency of the full or near-full sequences among the cDNAs that had matches in RefSeq. The cDNAs that did not match around the 5′ ends, possibly due to either alternative splicing or sequencing errors, were excluded from the calculation. We also excluded cDNAs in RefSeq that were derived from the FLJ project.

BLAST searches.

As those that matched RefSeq sequences, the cDNAs containing at least the annotated coding sequence start sites were categorized as 'full at the 5′ end'. The RefSeq records containing 'FLJ' in the description field were excluded, because 'FLJ' indicated that the FLJ cDNA data were essential for experimental identification of the complete cDNAs of these genes, whose presence was otherwise only predicted based on the homology or partial cDNAs. For re-BLAST searching of the total 21,243 cDNA sequences against RefSeq, we used the BLAST score 1,000 for the cut-off. To further remove the redundancy from 14,490 FLJ-unique cDNAs, we carried out pairwise BLAST searches with a cut-off score of 1.0e-100. When we observed redundancy, we selected the cDNA with the longer 5′ end as representative (10,897 nonredundant FLJ-unique cDNAs). For further details on bioinformatics procedures and supporting evidence, see Supplementary Note online.

Ensembl prediction and ab initio gene predictions.

We obtained Ensembl genes (using version 4.28.1; 29,076 genes) from the Ensembl website. We searched the sequences of 'Homo_sapiens.cdna.fa', which correspond to the predicted cDNA sequences, using our full-length cDNA sequences by BLAST with a cut-off score of 1.0e-100. Ensembl genes contained in RefSeq were excluded. Gene prediction programs were run against the human genomic sequence data (University of California Santa Cruz genome browser) with the default cut-off values. We used predicted exons to generate virtual cDNA sequences and did BLAST searches against them using FLJ cDNA sequences with a cut-off score of 1.0e-100.

URLs.

GenBank, http://www.ncbi.nlm.nih.gov /Sitemap/index.html#GenBank; FLJ-DB, http://fldb.hri.co.jp/ cgi-bin/cDNA3/public/publication/index.cgi; Ensembl, http://www.ensembl.org/Homo_sapiens/; PROSITE, http://www.expasy.ch/prosite/; Pfam, http://www.sanger.ac.uk/Pfam/; SOSui, http://sosui.proteome.bio.tuat.ac.jp/about-sosui.html; PSORT II, http://psort.ims.u-tokyo.ac.jp/; DIGIT, http://digit.ims.u-tokyo.ac.jp; FGENESH, http://www.softberry.com/berry.phtml; GENSCAN, http://genes.mit.edu/GENSCAN.html; HMMGENES, http://www.cbs.dtu.dk/services/HMMgene/; HUNT database, http://www.hri.co.jp/HUNT/; HUGE database, http://www.kazusa.or.jp/huge/; University of California Santa Cruz genome browser, http://genome.ucsc.edu/; NEDO, http://www.nedo.go.jp/bio/index.html.

Note: Supplementary information is available on the Nature Genetics website.

References

Hattori, M. et al. The DNA sequence of human chromosome 21. Nature 405, 311–319 (2000).
Article CAS Google Scholar
Dunham, I. et al. The DNA sequence of human chromosome 22. Nature 402, 489–495 (1999).
Article CAS Google Scholar
Deloukas, P. et al. The DNA sequence and comparative analysis of human chromosome 20. Nature 414, 865–871 (2001).
Article CAS Google Scholar
Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Article CAS Google Scholar
Venter, J.C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
Article CAS Google Scholar
Pruitt, K.D. & Maglott, D.R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137–140 (2001).
Article CAS Google Scholar
Boguski, M.S. The turning point in genome research. Trends Biochem. Sci. 20, 295–296 (1995).
Article CAS Google Scholar
Maruyama, K. & Sugano, S. Oligo-capping: a simple method to replace the cap structure of eukaryotic mRNAs with oligoribonucleotides. Gene 138, 171–174 (1994).
Article CAS Google Scholar
Suzuki, Y., Yoshitomo, K., Maruyama, K., Suyama, A. & Sugano, S. Construction and characterization of a full length-enriched and a 5′-end-enriched cDNA library. Gene 200, 149–156 (1997).
Article CAS Google Scholar
Nomura, N. et al. Prediction of the coding sequences of unidentified human genes. I. The coding sequences of 40 new genes (KIAA0001-KIAA0040) deduced by analysis of randomly sampled cDNA clones from human immature myeloid cell line KG-1. DNA Res. 1, 27–35 (1994).
Article CAS Google Scholar
Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Article CAS Google Scholar
Giesecke, H., Obermaier, B., Domdey, H. & Neubert, W.J. Rapid sequencing of the Sendai virus 6.8 kb large (L) gene through primer walking with an automated DNA sequencer. J. Virol. Methods. 38, 47–60 (1992).
Article CAS Google Scholar
Ewing, B., Hillier, L., Wendl, M.C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175–185 (1998).
Article CAS Google Scholar
Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Res. 30, 38–41 (2002).
Article CAS Google Scholar
Fickett, J.W. Predictive methods using nucleotide sequences. Methods Biochem. Anal. 39, 231–245 (1998).
CAS PubMed Google Scholar
Huttenhofer, A. et al. RNomics: an experimental approach that identifies 201 candidates for novel, small, non-messenger RNAs in mouse. EMBO J. 20, 2943–2953 (2001).
Article CAS Google Scholar
Kapranov, P. et al. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296, 916–919 (2002).
Article CAS Google Scholar
Burset, M. & Guigo, R. Evaluation of gene structure prediction programs. Genomics 34, 353–367 (1996).
Article CAS Google Scholar
Rogic, S., Mackworth, A.K. & Ouellette, F.B. Evaluation of gene-finding programs on mammalian sequences. Genome Res. 11, 817–832 (2001).
Article CAS Google Scholar
Yudate, H.T. et al. HUNT: launch of a full-length cDNA database from the helix research institute. Nucleic Acids Res. 29, 185–188 (2001).
Article CAS Google Scholar
Hattori, A. et al. Characterization of long cDNA clones from human adult spleen. DNA Res. 7, 1–11 (2001).
Google Scholar
Bernardi, G. The isochore organization of the human genome and its evolutionary history—a review. Gene. 135, 57–66 (1993).
Article CAS Google Scholar
The FANTOM consortium and The RIKEN Genome Exploration Research Group Phase I & II team. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563–573 (2002).
Wiemann, S. et al. Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res. 11, 422–435 (2001).
Article CAS Google Scholar
Strausberg, R.L., Feingold, E.A., Klausner, R.D. & Collins, F.S. The mammalian gene collection. Science 286, 455–457 (1999).
Article CAS Google Scholar
Gavin, A.C. et al. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415, 141–147 (2002).
Article CAS Google Scholar
Ho, Y. et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415, 180–183 (2002).
Article CAS Google Scholar
Chance, M.R. et al. Structural genomics: a pipeline for providing structures for the biologist. Protein Sci. 11, 723–738 (2002).
Article CAS Google Scholar
Suzuki, Y, et al. Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO Rep. 2, 388–393 (2001).
Article CAS Google Scholar

Download references

Acknowledgements

We thank A. Kishimoto, H. Ezoe and T. Matsuo for supporting the project and E. Nakajima for critically reading the manuscript. This project was supported by the Ministry of Economy Trade and Industry of Japan and also in part by Special Coordination Funds for Promoting Science and Technology from the Ministry of Education, Culture, Sports, Science and Technology of Japan. Requests for materials should be addressed to S. Sugano. Requests for physical cDNA clones should be addressed to S.Sugano (flcdna@ims.u-tokyo.ac.jp) or T. Isogai (isogai-t@reprori.jp). For more information on each cDNA clone, visit FLJ-DB. For general information on the FLJ project, please refer to NEDO website.

Author information

Authors and Affiliations

Helix Research Institute, 1532-3 Yana, Kisarazu, 292-0812, Chiba, Japan
Toshio Ota, Tetsuo Nishikawa, Tetsuji Otsuki, Tomoyasu Sugiyama, Ryotaro Irie, Ai Wakamatsu, Koji Hayashi, Hiroyuki Sato, Keiichi Nagai, Shizuko Ishii, Jun-ichi Yamamoto, Kaoru Saito, Yuri Kawai, Yuko Isono, Yoshitaka Nakamura, Kenji Nagahari, Yasuhiko Masuho & Takao Isogai
Kyowa Hakko Kogyo, Tokyo Research Laboratory, 3-6-6 Asahi-machi, Machida, Tokyo, 194-8533, Japan
Toshio Ota, Masaya Obayashi, Tatsunari Nishi, Satoshi Nakagawa, Akihiro Senoh, Hiroshi Mizoguchi, Ayako Kawabata, Takeshi Hikiji, Naoko Kobatake, Hiromi Inagaki, Yasuko Ikema, Sachiko Okamoto & Rie Okitani
The Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokane-dai, Minato-ku, Tokyo, 108-8639, Japan
Yutaka Suzuki, Toshikazu Shibahara, Toshihiro Tanaka, Yoshihiro Ohmori, Takuma Kawakami, Saori Noguchi, Tomoko Itoh, Keiko Shigeta, Tadashi Senba, Kyoka Matsumura, Yoshie Nakajima, Takae Mizuno, Misato Morinaga, Masahide Sasaki, Takushi Togashi, Masaaki Oyama, Hiroko Hata, Manabu Watanabe, Takami Komatsu, Junko Mizushima-Sugano, Tadashi Satoh, Yuko Shirai, Yukiko Takahashi, Kiyomi Nakagawa, Riu Yamashita, Kenta Nakai, Tetsushi Yada, Yusuke Nakamura & Sumio Sugano
Hitachi, Central Research Laboratory, 1-280 Higashi-koigakubo, Kokubunj, Tokyo, 185-8601, Japan
Tetsuo Nishikawa, Kouichi Kimura, Hiroshi Makita, Katsuhiko Murakami, Tomohiro Yasuda, Takao Iwayanagi & Tadashi Satoh
National Institute of Technology and Evaluation, 2-49-10 Nishihara, Shibuya-ku, Tokyo, 151-0066, Japan
Mitsuo Sekine & Hisashi Kikuchi
Otsuka Pharmaceutical, 463-10 Kagasuno Kawauchi-cho, Tokushima, 771-0192, Japan
Toshikazu Shibahara, Yoshihiro Goto, Fumio Shimizu, Hirokazu Wakebe, Haretsugu Hishigaki, Takeshi Watanabe, Akira Tanigami, Tsutomu Fujiwara, Toshihide Ono, Katsue Yamada, Yuka Fujii, Kouichi Ozaki, Maasa Hirao, Yoshihiro Ohmori & Takuma Kawakami
Hitachi, Life Science Group, 1-3-1 Minamidai, Kawagoe, 350-1165, Saitama, Japan
Masako Wagatsuma, Akiko Shiratori, Hiroaki Sudo, Takehiko Hosoiri, Yoshiko Kaku, Hiroyo Kodaira, Hiroshi Kondo, Masanori Sugawara, Makiko Takahashi, Katsuhiro Kanda, Takahide Yokoi, Takako Furuya, Emiko Kikkawa, Yuhi Omura, Kumi Abe, Kumiko Kamihara, Naoko Katsuta, Kazuomi Sato, Machiko Tanikawa, Makoto Yamazaki & Ken Ninomiya
Hitachi Science Systems, 1-280 Higashi-koigakubo, Kokubunji, Tokyo, 185-8601, Japan
Tadashi Ishibashi, Hiromichi Yamashita, Katsuji Murakawa, Kiyoshi Fujimori, Hiroyuki Tanai, Manabu Kimata, Motoji Watanabe, Susumu Hiraoka, Yoshiyuki Chiba, Shinichi Ishida, Yukio Ono, Sumiyo Takiguchi, Susumu Watanabe, Makoto Yosida, Tomoko Hotuta, Junko Kusano & Keiichi Kanehori
Takara Shuzo, 2257 Sunaike, Noji, Kusatu, Shiga, 525-0055, Japan
Asako Takahashi-Fujii, Hiroto Hara, Tomo-o Tanase, Yoshiko Nomura, Sakae Togiya, Fukuyo Komai, Reiko Hara, Kazuha Takeuchi, Miho Arita, Nobuyuki Imose, Kaoru Musashino, Hisatsugu Yuuki & Atsushi Oshima
Nisshinbo Industries, 1-2-3 Onodai, Midori-ku, Chiba, 267-0056, Japan
Naokazu Sasaki, Satoshi Aotsuka, Yoko Yoshikawa, Hiroshi Matsunawa, Tatsuo Ichihara, Namiko Shiohata, Sanae Sano, Shogo Moriya, Hiroko Momiyama, Noriko Satoh, Sachiko Takami, Yuko Terashima & Osamu Suzuki
Toyobo, 10-24 Toyo-cho, Tsuruga, 914-0047, Fukui, Japan
Akio Sugiyama, Makoto Takemoto & Bunsei Kawakami
Fujiya, 228 Soya, Hadano, Kanagawa, 257-0031, Japan
Masaaki Yamazaki, Koji Watanabe, Ayako Kumagai, Shoko Itakura, Yasuhito Fukuzumi, Yoshifumi Fujimori, Megumi Komiyama & Hiroyuki Tashiro
Aisin Cosmos R&D, 1698 Yana, Kisarazu, 292-0812, Chiba, Japan
Koji Okumura
Kazusa DNA Research Institute, 1532-3 Yana, Kisarazu, 292-0812, Chiba, Japan
Takahiro Nagase, Nobuo Nomura & Osamu Ohara
BIRC, AIST, 2-41-6 Aomi, Koto-ku, Tokyo, 135-0064, Japan
Nobuo Nomura & Sumio Sugano

Authors

Toshio Ota
You can also search for this author inPubMed Google Scholar
Yutaka Suzuki
You can also search for this author inPubMed Google Scholar
Tetsuo Nishikawa
You can also search for this author inPubMed Google Scholar
Tetsuji Otsuki
You can also search for this author inPubMed Google Scholar
Tomoyasu Sugiyama
You can also search for this author inPubMed Google Scholar
Ryotaro Irie
You can also search for this author inPubMed Google Scholar
Ai Wakamatsu
You can also search for this author inPubMed Google Scholar
Koji Hayashi
You can also search for this author inPubMed Google Scholar
Hiroyuki Sato
You can also search for this author inPubMed Google Scholar
Keiichi Nagai
You can also search for this author inPubMed Google Scholar
Kouichi Kimura
You can also search for this author inPubMed Google Scholar
Hiroshi Makita
You can also search for this author inPubMed Google Scholar
Mitsuo Sekine
You can also search for this author inPubMed Google Scholar
Masaya Obayashi
You can also search for this author inPubMed Google Scholar
Tatsunari Nishi
You can also search for this author inPubMed Google Scholar
Toshikazu Shibahara
You can also search for this author inPubMed Google Scholar
Toshihiro Tanaka
You can also search for this author inPubMed Google Scholar
Shizuko Ishii
You can also search for this author inPubMed Google Scholar
Jun-ichi Yamamoto
You can also search for this author inPubMed Google Scholar
Kaoru Saito
You can also search for this author inPubMed Google Scholar
Yuri Kawai
You can also search for this author inPubMed Google Scholar
Yuko Isono
You can also search for this author inPubMed Google Scholar
Yoshitaka Nakamura
You can also search for this author inPubMed Google Scholar
Kenji Nagahari
You can also search for this author inPubMed Google Scholar
Katsuhiko Murakami
You can also search for this author inPubMed Google Scholar
Tomohiro Yasuda
You can also search for this author inPubMed Google Scholar
Takao Iwayanagi
You can also search for this author inPubMed Google Scholar
Masako Wagatsuma
You can also search for this author inPubMed Google Scholar
Akiko Shiratori
You can also search for this author inPubMed Google Scholar
Hiroaki Sudo
You can also search for this author inPubMed Google Scholar
Takehiko Hosoiri
You can also search for this author inPubMed Google Scholar
Yoshiko Kaku
You can also search for this author inPubMed Google Scholar
Hiroyo Kodaira
You can also search for this author inPubMed Google Scholar
Hiroshi Kondo
You can also search for this author inPubMed Google Scholar
Masanori Sugawara
You can also search for this author inPubMed Google Scholar
Makiko Takahashi
You can also search for this author inPubMed Google Scholar
Katsuhiro Kanda
You can also search for this author inPubMed Google Scholar
Takahide Yokoi
You can also search for this author inPubMed Google Scholar
Takako Furuya
You can also search for this author inPubMed Google Scholar
Emiko Kikkawa
You can also search for this author inPubMed Google Scholar
Yuhi Omura
You can also search for this author inPubMed Google Scholar
Kumi Abe
You can also search for this author inPubMed Google Scholar
Kumiko Kamihara
You can also search for this author inPubMed Google Scholar
Naoko Katsuta
You can also search for this author inPubMed Google Scholar
Kazuomi Sato
You can also search for this author inPubMed Google Scholar
Machiko Tanikawa
You can also search for this author inPubMed Google Scholar
Makoto Yamazaki
You can also search for this author inPubMed Google Scholar
Ken Ninomiya
You can also search for this author inPubMed Google Scholar
Tadashi Ishibashi
You can also search for this author inPubMed Google Scholar
Hiromichi Yamashita
You can also search for this author inPubMed Google Scholar
Katsuji Murakawa
You can also search for this author inPubMed Google Scholar
Kiyoshi Fujimori
You can also search for this author inPubMed Google Scholar
Hiroyuki Tanai
You can also search for this author inPubMed Google Scholar
Manabu Kimata
You can also search for this author inPubMed Google Scholar
Motoji Watanabe
You can also search for this author inPubMed Google Scholar
Susumu Hiraoka
You can also search for this author inPubMed Google Scholar
Yoshiyuki Chiba
You can also search for this author inPubMed Google Scholar
Shinichi Ishida
You can also search for this author inPubMed Google Scholar
Yukio Ono
You can also search for this author inPubMed Google Scholar
Sumiyo Takiguchi
You can also search for this author inPubMed Google Scholar
Susumu Watanabe
You can also search for this author inPubMed Google Scholar
Makoto Yosida
You can also search for this author inPubMed Google Scholar
Tomoko Hotuta
You can also search for this author inPubMed Google Scholar
Junko Kusano
You can also search for this author inPubMed Google Scholar
Keiichi Kanehori
You can also search for this author inPubMed Google Scholar
Asako Takahashi-Fujii
You can also search for this author inPubMed Google Scholar
Hiroto Hara
You can also search for this author inPubMed Google Scholar
Tomo-o Tanase
You can also search for this author inPubMed Google Scholar
Yoshiko Nomura
You can also search for this author inPubMed Google Scholar
Sakae Togiya
You can also search for this author inPubMed Google Scholar
Fukuyo Komai
You can also search for this author inPubMed Google Scholar
Reiko Hara
You can also search for this author inPubMed Google Scholar
Kazuha Takeuchi
You can also search for this author inPubMed Google Scholar
Miho Arita
You can also search for this author inPubMed Google Scholar
Nobuyuki Imose
You can also search for this author inPubMed Google Scholar
Kaoru Musashino
You can also search for this author inPubMed Google Scholar
Hisatsugu Yuuki
You can also search for this author inPubMed Google Scholar
Atsushi Oshima
You can also search for this author inPubMed Google Scholar
Naokazu Sasaki
You can also search for this author inPubMed Google Scholar
Satoshi Aotsuka
You can also search for this author inPubMed Google Scholar
Yoko Yoshikawa
You can also search for this author inPubMed Google Scholar
Hiroshi Matsunawa
You can also search for this author inPubMed Google Scholar
Tatsuo Ichihara
You can also search for this author inPubMed Google Scholar
Namiko Shiohata
You can also search for this author inPubMed Google Scholar
Sanae Sano
You can also search for this author inPubMed Google Scholar
Shogo Moriya
You can also search for this author inPubMed Google Scholar
Hiroko Momiyama
You can also search for this author inPubMed Google Scholar
Noriko Satoh
You can also search for this author inPubMed Google Scholar
Sachiko Takami
You can also search for this author inPubMed Google Scholar
Yuko Terashima
You can also search for this author inPubMed Google Scholar
Osamu Suzuki
You can also search for this author inPubMed Google Scholar
Satoshi Nakagawa
You can also search for this author inPubMed Google Scholar
Akihiro Senoh
You can also search for this author inPubMed Google Scholar
Hiroshi Mizoguchi
You can also search for this author inPubMed Google Scholar
Yoshihiro Goto
You can also search for this author inPubMed Google Scholar
Fumio Shimizu
You can also search for this author inPubMed Google Scholar
Hirokazu Wakebe
You can also search for this author inPubMed Google Scholar
Haretsugu Hishigaki
You can also search for this author inPubMed Google Scholar
Takeshi Watanabe
You can also search for this author inPubMed Google Scholar
Akio Sugiyama
You can also search for this author inPubMed Google Scholar
Makoto Takemoto
You can also search for this author inPubMed Google Scholar
Bunsei Kawakami
You can also search for this author inPubMed Google Scholar
Masaaki Yamazaki
You can also search for this author inPubMed Google Scholar
Koji Watanabe
You can also search for this author inPubMed Google Scholar
Ayako Kumagai
You can also search for this author inPubMed Google Scholar
Shoko Itakura
You can also search for this author inPubMed Google Scholar
Yasuhito Fukuzumi
You can also search for this author inPubMed Google Scholar
Yoshifumi Fujimori
You can also search for this author inPubMed Google Scholar
Megumi Komiyama
You can also search for this author inPubMed Google Scholar
Hiroyuki Tashiro
You can also search for this author inPubMed Google Scholar
Akira Tanigami
You can also search for this author inPubMed Google Scholar
Tsutomu Fujiwara
You can also search for this author inPubMed Google Scholar
Toshihide Ono
You can also search for this author inPubMed Google Scholar
Katsue Yamada
You can also search for this author inPubMed Google Scholar
Yuka Fujii
You can also search for this author inPubMed Google Scholar
Kouichi Ozaki
You can also search for this author inPubMed Google Scholar
Maasa Hirao
You can also search for this author inPubMed Google Scholar
Yoshihiro Ohmori
You can also search for this author inPubMed Google Scholar
Ayako Kawabata
You can also search for this author inPubMed Google Scholar
Takeshi Hikiji
You can also search for this author inPubMed Google Scholar
Naoko Kobatake
You can also search for this author inPubMed Google Scholar
Hiromi Inagaki
You can also search for this author inPubMed Google Scholar
Yasuko Ikema
You can also search for this author inPubMed Google Scholar
Sachiko Okamoto
You can also search for this author inPubMed Google Scholar
Rie Okitani
You can also search for this author inPubMed Google Scholar
Takuma Kawakami
You can also search for this author inPubMed Google Scholar
Saori Noguchi
You can also search for this author inPubMed Google Scholar
Tomoko Itoh
You can also search for this author inPubMed Google Scholar
Keiko Shigeta
You can also search for this author inPubMed Google Scholar
Tadashi Senba
You can also search for this author inPubMed Google Scholar
Kyoka Matsumura
You can also search for this author inPubMed Google Scholar
Yoshie Nakajima
You can also search for this author inPubMed Google Scholar
Takae Mizuno
You can also search for this author inPubMed Google Scholar
Misato Morinaga
You can also search for this author inPubMed Google Scholar
Masahide Sasaki
You can also search for this author inPubMed Google Scholar
Takushi Togashi
You can also search for this author inPubMed Google Scholar
Masaaki Oyama
You can also search for this author inPubMed Google Scholar
Hiroko Hata
You can also search for this author inPubMed Google Scholar
Manabu Watanabe
You can also search for this author inPubMed Google Scholar
Takami Komatsu
You can also search for this author inPubMed Google Scholar
Junko Mizushima-Sugano
You can also search for this author inPubMed Google Scholar
Tadashi Satoh
You can also search for this author inPubMed Google Scholar
Yuko Shirai
You can also search for this author inPubMed Google Scholar
Yukiko Takahashi
You can also search for this author inPubMed Google Scholar
Kiyomi Nakagawa
You can also search for this author inPubMed Google Scholar
Koji Okumura
You can also search for this author inPubMed Google Scholar
Takahiro Nagase
You can also search for this author inPubMed Google Scholar
Nobuo Nomura
You can also search for this author inPubMed Google Scholar
Hisashi Kikuchi
You can also search for this author inPubMed Google Scholar
Yasuhiko Masuho
You can also search for this author inPubMed Google Scholar
Riu Yamashita
You can also search for this author inPubMed Google Scholar
Kenta Nakai
You can also search for this author inPubMed Google Scholar
Tetsushi Yada
You can also search for this author inPubMed Google Scholar
Yusuke Nakamura
You can also search for this author inPubMed Google Scholar
Osamu Ohara
You can also search for this author inPubMed Google Scholar
Takao Isogai
You can also search for this author inPubMed Google Scholar
Sumio Sugano
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence toSumio Sugano.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Rights and permissions

About this article

Cite this article

Ota, T., Suzuki, Y., Nishikawa, T. et al. Complete sequencing and characterization of 21,243 full-length human cDNAs.Nat Genet 36, 40–45 (2004). https://doi.org/10.1038/ng1285

Download citation

Received: 07 October 2003
Accepted: 01 December 2003
Published: 21 December 2003
Issue Date: 01 January 2004
DOI: https://doi.org/10.1038/ng1285