Letter to the editor: SeqXML and OrthoXML: standards for sequence and orthology information
Thomas Schmitt et al. Brief Bioinform. 2011 Sep.
Abstract
There is a great need for standards in the orthology field. Users must contend with different ortholog data representations from each provider, and the providers themselves must independently gather and parse the input sequence data. These burdensome and redundant procedures make data comparison and integration difficult. We have designed two XML-based formats, SeqXML and OrthoXML, to solve these problems. SeqXML is a lightweight format for sequence records, the input for orthology prediction. It stores the same sequence and metadata as typical FASTA format records, but overcomes common problems such as unstructured metadata in the header and erroneous sequence content. XML provides validation to prevent data integrity problems that are frequent in FASTA files. The range of applications for SeqXML is broad and not limited to ortholog prediction. We provide read/write functions for BioJava, BioPerl, and Biopython. OrthoXML was designed to represent ortholog assignments from any source in a consistent and structured way, yet cater to specific needs such as scoring schemes or meta-information. A unified format is particularly valuable for ortholog consumers that want to integrate data from numerous resources, e.g. for gene annotation projects. Reference proteomes for 61 organisms are already available in SeqXML, and 10 orthology databases have signed on to OrthoXML. Adoption by the entire field would substantially facilitate exchange and quality control of sequence and orthology information.
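To illustrate the contrast the abstract draws with FASTA, the following is a minimal sketch of how structured sequence records might be read and checked using only the Python standard library. The element and attribute names (`entry`, `species`, `ncbiTaxID`, `AAseq`), the example records, and the allowed residue alphabet are assumptions made for illustration; they are not taken from the abstract and should be checked against the published SeqXML schema. The helper `read_entries` is hypothetical and is not one of the BioJava, BioPerl, or Biopython functions the authors mention.

```python
import xml.etree.ElementTree as ET

# Hypothetical SeqXML-style document. Element and attribute names are
# illustrative assumptions, not copied from the published schema.
SEQXML_EXAMPLE = """\
<seqXML source="ExampleDB" sourceVersion="1">
  <entry id="P12345" source="ExampleDB">
    <species name="Homo sapiens" ncbiTaxID="9606"/>
    <description>Example protein</description>
    <AAseq>MKTAYIAKQR</AAseq>
  </entry>
  <entry id="BAD001" source="ExampleDB">
    <species name="Homo sapiens" ncbiTaxID="9606"/>
    <AAseq>MKT1YIAK</AAseq>
  </entry>
</seqXML>
"""

# Assumed alphabet: the 20 standard amino acids plus X for "unknown".
ALLOWED_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}


def read_entries(xml_text):
    """Parse entries and flag sequences containing unexpected characters."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("entry"):
        species = entry.find("species")
        seq = (entry.findtext("AAseq") or "").strip()
        entries.append({
            "id": entry.get("id"),
            # Metadata lives in named attributes, not a free-text header.
            "species": species.get("name") if species is not None else None,
            "tax_id": species.get("ncbiTaxID") if species is not None else None,
            "sequence": seq,
            # A sequence is valid if non-empty and within the assumed alphabet.
            "valid": bool(seq) and set(seq) <= ALLOWED_RESIDUES,
        })
    return entries


entries = read_entries(SEQXML_EXAMPLE)
for e in entries:
    print(e["id"], e["species"], e["tax_id"], "valid" if e["valid"] else "INVALID")
```

In this sketch the species name and taxonomy identifier are read from dedicated attributes rather than parsed out of a header line, and the second record is flagged because its sequence contains a digit. In practice, schema validation would perform checks of this kind before the data reaches application code.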