Wikidata as a semantic framework for the Gene Wiki initiative
Sebastian Burgstaller-Muehlbacher et al. Database (Oxford). 2016.
Abstract
Open biological data are distributed over many resources making them challenging to integrate, to update and to disseminate quickly. Wikidata is a growing, open community database which can serve this purpose and also provides tight integration with Wikipedia. In order to improve the state of biological data, facilitate data management and dissemination, we imported all human and mouse genes, and all human and mouse proteins into Wikidata. In total, 59,721 human genes and 73,355 mouse genes have been imported from NCBI and 27,306 human proteins and 16,728 mouse proteins have been imported from the Swissprot subset of UniProt. As Wikidata is open and can be edited by anybody, our corpus of imported data serves as the starting point for integration of further data by scientists, the Wikidata community and citizen scientists alike. The first use case for these data is to populate Wikipedia Gene Wiki infoboxes directly from Wikidata with the data integrated above. This enables immediate updates of the Gene Wiki infoboxes as soon as the data in Wikidata are modified. Although Gene Wiki pages are currently only on the English language version of Wikipedia, the multilingual nature of Wikidata allows for usage of the data we imported in all 280 different language Wikipedias. Apart from the Gene Wiki infobox use case, a SPARQL endpoint and exporting functionality to several standard formats (e.g. JSON, XML) enable use of the data by scientists. In summary, we created a fully open and extensible data resource for human and mouse molecular biology and biochemistry data. This resource enriches all the Wikipedias with structured information and serves as a new linking hub for the biological semantic web. Database URL: https://www.wikidata.org/.
© The Author(s) 2016. Published by Oxford University Press.
Figures
Figure 1
Wikidata item and data organization. Wikidata items can be added or edited by anyone manually. A Wikidata item consists of: (1) a language-specific label, (2) its unique identifier, (3) language-specific aliases, (4) interwiki links to the different language Wikipedia articles or other Wikimedia projects and (5) a list of statements. For this specific example, the human protein Reelin was used (https://www.wikidata.org/wiki/Q13569356).
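The five parts of an item listed in the Figure 1 caption correspond to top-level keys in the JSON that Wikidata exports for each item. The following is a minimal Python sketch of that mapping, not code from the paper. The helper names `entity_data_url` and `summarize_item` are hypothetical, and the `sample` dictionary is an illustrative, heavily trimmed stand-in whose alias, sitelink and statement values are assumptions rather than the real content of the Reelin item.

```python
from typing import Any, Dict


def entity_data_url(qid: str) -> str:
    """Build the URL of Wikidata's JSON export for one item."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"


def summarize_item(entity: Dict[str, Any], lang: str = "en") -> Dict[str, Any]:
    """Pull out the five parts of an item named in the Figure 1 caption."""
    labels = entity.get("labels", {})
    aliases = entity.get("aliases", {})
    return {
        # (2) unique identifier
        "id": entity.get("id"),
        # (1) language-specific label
        "label": labels.get(lang, {}).get("value"),
        # (3) language-specific aliases
        "aliases": [a["value"] for a in aliases.get(lang, [])],
        # (4) interwiki links, keyed by site (e.g. "enwiki")
        "sitelinks": sorted(entity.get("sitelinks", {})),
        # (5) statements, grouped by property ID under "claims"
        "statement_properties": sorted(entity.get("claims", {})),
    }


# Illustrative entity only: values other than the ID and label are assumed.
sample = {
    "id": "Q13569356",
    "labels": {"en": {"language": "en", "value": "Reelin"}},
    "aliases": {"en": [{"language": "en", "value": "RELN"}]},
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Reelin"}},
    "claims": {"P31": []},
}

summary = summarize_item(sample)
print(entity_data_url(sample["id"]))
print(summary)
```

In the real export, the item sits under an `entities` key indexed by its identifier, so a downloaded document would be passed in as `data["entities"][qid]`.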
Figure 2
Gene Wiki data model in Wikidata. Each entity (human gene, human protein, mouse gene, mouse protein) is represented as a separate Wikidata item. Arrows represent direct links between Wikidata statements. The English language interwiki link on the human gene item points to the corresponding Gene Wiki article on the English Wikipedia.
Figure 3
Gene Wiki infobox populated with data from Wikidata, using Wikidata items Q414043 for the human gene, Q13561329 for the human protein, Q14331135 for the mouse gene and Q14331165 for the mouse protein. Three dots indicate that there is more information in the real Gene Wiki infobox for Reelin (https://en.wikipedia.org/wiki/Reelin).
Figure 4
An example SPARQL query, using the Wikidata SPARQL endpoint (query.wikidata.org). It retrieves all Wikidata (WD) items which are of subclass protein-coding gene (Q840604), which have a chromosomal start position (P644) according to human genome build GRCh38 and reside on human chromosome (P659) 9 (Q20966585) and a chromosomal end position (P645) also on chromosome 9. Furthermore, the region of interest is restricted to a chromosomal start position between 21 and 30 megabase pairs. Colors: Red indicates SPARQL commands, blue represents variable names, green represents URIs and brown are strings. Arrows point to the source code the description applies to.
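Since the figure image is not reproduced here, the query described in the Figure 4 caption can be approximated as follows. This is a hedged Python sketch that only builds the query text and the request URL; it is not the authors' query. The identifiers (Q840604, P644, P645, P659, Q20966585) are copied from the caption and should be checked against Wikidata before use. Attaching P659 as a qualifier on the start and end statements, and casting positions with `xsd:integer`, are assumptions about the data layout. The function names are hypothetical.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def build_region_query(gene_class: str = "Q840604",
                       qualifier_value: str = "Q20966585",
                       start_min: int = 21_000_000,
                       start_max: int = 30_000_000) -> str:
    """Return SPARQL text for genes whose start lies in [start_min, start_max].

    Identifiers come from the Figure 4 caption; the statement/qualifier
    structure is an assumption, not taken from the paper.
    """
    return f"""
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?gene ?start ?end WHERE {{
  ?gene wdt:P279 wd:{gene_class} ;
        p:P644 ?startStmt ;
        p:P645 ?endStmt .
  ?startStmt ps:P644 ?start ; pq:P659 wd:{qualifier_value} .
  ?endStmt   ps:P645 ?end ;   pq:P659 wd:{qualifier_value} .
  FILTER(xsd:integer(?start) >= {start_min} &&
         xsd:integer(?start) <= {start_max})
}}
""".strip()


def query_url(query: str) -> str:
    """Encode the query as a GET URL that asks the endpoint for JSON results."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_region_query()
url = query_url(query)
print(query)
```

The resulting URL can be fetched with any HTTP client. No request is sent in the sketch, so it runs offline.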
Similar articles
- Utilizing the Wikidata system to improve the quality of medical content in Wikipedia in diverse languages: a pilot study.
Pfundner A, Schönberg T, Horn J, Boyce RD, Samwald M. J Med Internet Res. 2015 May 5;17(5):e110. doi: 10.2196/jmir.4163. PMID: 25944105. Free PMC article.
- Wikidata: A large-scale collaborative ontological medical database.
Turki H, Shafee T, Hadj Taieb MA, Ben Aouicha M, Vrandečić D, Das D, Hamdi H. J Biomed Inform. 2019 Nov;99:103292. doi: 10.1016/j.jbi.2019.103292. Epub 2019 Sep 23. PMID: 31557529.
- Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes.
Putman TE, Burgstaller-Muehlbacher S, Waagmeester A, Wu C, Su AI, Good BM. Database (Oxford). 2016 Mar 28;2016:baw028. doi: 10.1093/database/baw028. Print 2016. PMID: 27022157. Free PMC article.
- LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics.
Smith AK, Cheung KH, Yip KY, Schultz M, Gerstein MK. BMC Bioinformatics. 2007 May 9;8 Suppl 3(Suppl 3):S5. doi: 10.1186/1471-2105-8-S3-S5. PMID: 17493288. Free PMC article. Review.
- XML for data representation and model specification in neuroscience.
Crook SM, Howell FW. Methods Mol Biol. 2007;401:53-66. doi: 10.1007/978-1-59745-520-6_4. PMID: 18368360. Review.
Cited by
- Collective intelligence defines biological functions in Wikipedia as communities in the hidden protein connection network.
Zinovyev A, Czerwinska U, Cantini L, Barillot E, Frahm KM, Shepelyansky DL. PLoS Comput Biol. 2020 Feb 18;16(2):e1007652. doi: 10.1371/journal.pcbi.1007652. eCollection 2020 Feb. PMID: 32069277. Free PMC article.
- Intestinal microbiota alterations by dietary exposure to chemicals from food cooking and processing. Application of data science for risk prediction.
Ruiz-Saavedra S, García-González H, Arboleya S, Salazar N, Emilio Labra-Gayo J, Díaz I, Gueimonde M, González S, de Los Reyes-Gavilán CG. Comput Struct Biotechnol J. 2021 Jan 29;19:1081-1091. doi: 10.1016/j.csbj.2021.01.037. eCollection 2021. PMID: 33680352. Free PMC article. Review.
- Human Disease Ontology 2018 update: classification, content and workflow expansion.
Schriml LM, Mitraka E, Munro J, Tauber B, Schor M, Nickle L, Felix V, Jeng L, Bearer C, Lichenstein R, Bisordi K, Campion N, Hyman B, Kurland D, Oates CP, Kibbey S, Sreekumar P, Le C, Giglio M, Greene C. Nucleic Acids Res. 2019 Jan 8;47(D1):D955-D962. doi: 10.1093/nar/gky1032. PMID: 30407550. Free PMC article.
- Ten quick tips for editing Wikidata.
Shafee T, Mietchen D, Lubiana T, Jemielniak D, Waagmeester A. PLoS Comput Biol. 2023 Jul 20;19(7):e1011235. doi: 10.1371/journal.pcbi.1011235. eCollection 2023 Jul. PMID: 37471307. Free PMC article. No abstract available.
- ChlamBase: a curated model organism database for the Chlamydia research community.
Putman T, Hybiske K, Jow D, Afrasiabi C, Lelong S, Cano MA, Wu C, Su AI. Database (Oxford). 2019 Jan 1;2019:baz041. doi: 10.1093/database/baz041. PMID: 30985891. Free PMC article.
Grants and funding
- DA036134/DA/NIDA NIH HHS/United States
- GM114833/GM/NIGMS NIH HHS/United States
- R01 GM089820/GM/NIGMS NIH HHS/United States
- U54 DA036134/DA/NIDA NIH HHS/United States
- R01 MH111099/MH/NIMH NIH HHS/United States
- GM089820/GM/NIGMS NIH HHS/United States
- GM083924/GM/NIGMS NIH HHS/United States