An effective biomedical document classification scheme in support of biocuration: addressing class imbalance
Xiangying Jiang et al. Database (Oxford). 2019.
Abstract
Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in biomedical research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying papers among a large pool of biomedical publications that contain information relevant to a specific topic, which the curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the others are not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 f-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
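The under-sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the number of clusters, or the features, so the plain k-means routine, the value `k=6`, the function names `kmeans_labels` and `cluster_undersample`, and the toy 2-D data below are all assumptions made for illustration. The sketch splits the majority (irrelevant) class into clusters and pairs each cluster with the full minority (relevant) class, giving several less skewed training subsets.

```python
import random
from math import dist


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k distinct points as initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels


def cluster_undersample(majority, minority, k, seed=0):
    """Build one training subset per non-empty majority cluster.

    Each subset pairs all minority items with the majority items of a single
    cluster, so every base classifier sees a less skewed class ratio.
    """
    labels = kmeans_labels(majority, k, seed=seed)
    subsets = []
    for c in range(k):
        negatives = [p for p, lab in zip(majority, labels) if lab == c]
        if negatives:
            subsets.append((list(minority), negatives))
    return subsets


# Toy data (hypothetical): 77 "irrelevant" vs 13 "relevant" 2-D feature vectors,
# loosely mirroring the roughly 6:1 ratio implied by the abstract
# (about 13 000 relevant documents out of more than 90 000).
rng = random.Random(1)
majority = [(rng.gauss(0, 3), rng.gauss(0, 3)) for _ in range(77)]
minority = [(rng.gauss(4, 1), rng.gauss(4, 1)) for _ in range(13)]

subsets = cluster_undersample(majority, minority, k=6)
for pos, neg in subsets:
    print(f"{len(pos)} relevant vs {len(neg)} irrelevant")
```

In a meta-classification setup of the kind the abstract describes, each subset would presumably train one base classifier, and a further classifier would combine their outputs. Cluster sizes are not guaranteed to be equal, so individual subsets may still be somewhat imbalanced.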
© The Author(s) 2019. Published by Oxford University Press.
Similar articles
- Integrating image caption information into biomedical document classification in support of biocuration.
Jiang X, Li P, Kadin J, Blake JA, Ringwald M, Shatkay H. Database (Oxford). 2020 Jan 1;2020:baaa024. doi: 10.1093/database/baaa024. PMID: 32294192. Free PMC article.
- Effective biomedical document classification for identifying publications relevant to the mouse Gene Expression Database (GXD).
Jiang X, Ringwald M, Blake J, Shatkay H. Database (Oxford). 2017 Jan 1;2017(1):bax017. doi: 10.1093/database/bax017. PMID: 28365740. Free PMC article.
- Utilizing image and caption information for biomedical document classification.
Li P, Jiang X, Zhang G, Trabucco JT, Raciti D, Smith C, Ringwald M, Marai GE, Arighi C, Shatkay H. Bioinformatics. 2021 Jul 12;37(Suppl_1):i468-i476. doi: 10.1093/bioinformatics/btab331. PMID: 34252939. Free PMC article.
- Automatic categorization of diverse experimental information in the bioscience literature.
Fang R, Schindelman G, Van Auken K, Fernandes J, Chen W, Wang X, Davis P, Tuli MA, Marygold SJ, Millburn G, Matthews B, Zhang H, Brown N, Gelbart WM, Sternberg PW. BMC Bioinformatics. 2012 Jan 26;13:16. doi: 10.1186/1471-2105-13-16. PMID: 22280404. Free PMC article.
- HEALTH GeoJunction: place-time-concept browsing of health publications.
MacEachren AM, Stryker MS, Turton IJ, Pezanowski S. Int J Health Geogr. 2010 May 18;9:23. doi: 10.1186/1476-072X-9-23. PMID: 20482806. Free PMC article. Review.
Cited by
- MetaTron: advancing biomedical annotation empowering relation annotation and collaboration.
Irrera O, Marchesin S, Silvello G. BMC Bioinformatics. 2024 Mar 14;25(1):112. doi: 10.1186/s12859-024-05730-9. PMID: 38486137. Free PMC article.
- Automatic identification of scientific publications describing digital reconstructions of neural morphology.
Maraver P, Tecuatl C, Ascoli GA. Brain Inform. 2023 Sep 8;10(1):23. doi: 10.1186/s40708-023-00202-x. PMID: 37684527. Free PMC article.
- Automatic identification of scientific publications describing digital reconstructions of neural morphology.
Maraver P, Tecuatl C, Ascoli GA. bioRxiv [Preprint]. 2023 Feb 15:2023.02.14.527522. doi: 10.1101/2023.02.14.527522. PMID: 36824882. Free PMC article. Updated. Preprint.
- Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling.
Thielmann A, Weisser C, Krenz A, Säfken B. J Appl Stat. 2021 Apr 27;50(3):574-591. doi: 10.1080/02664763.2021.1919063. eCollection 2023. PMID: 36819086. Free PMC article.
- Hagit Shatkay-Reshef 1965-2022.
Arighi CN. Bioinform Adv. 2022 Mar 4;2(1):vbac012. doi: 10.1093/bioadv/vbac012. eCollection 2022. PMID: 36699359. Free PMC article. No abstract available.
Grants and funding
- R56 LM011354/LM/NLM NIH HHS/United States
- U41 HG000330/HG/NHGRI NIH HHS/United States
- R01 LM012527/LM/NLM NIH HHS/United States
- P41 HD062499/HD/NICHD NIH HHS/United States
- R01 LM011945/LM/NLM NIH HHS/United States