An effective biomedical document classification scheme in support of biocuration: addressing class imbalance

Xiangying Jiang et al. Database (Oxford). 2019.

Abstract

Published literature is an important source of knowledge supporting biomedical research. Given the large and increasing number of publications, automated document classification plays an important role in this research. Effective biomedical document classifiers are especially needed for bio-databases, in which the information stems from many thousands of biomedical publications that curators must read in detail and annotate. In addition, biomedical document classification often amounts to identifying a small subset of relevant publications within a much larger collection of available documents. As such, addressing class imbalance is essential to a practical classifier. We present here an effective classification scheme for automatically identifying, among a large pool of biomedical publications, the papers that contain information relevant to a specific topic that curators are interested in annotating. The proposed scheme is based on a meta-classification framework using cluster-based under-sampling combined with named-entity recognition and statistical feature selection strategies. We examined the performance of our method over a large imbalanced data set that was originally manually curated by the Jackson Laboratory's Gene Expression Database (GXD). The set consists of more than 90 000 PubMed abstracts, of which about 13 000 documents are labeled as relevant to GXD while the rest are labeled as not relevant. Our results, 0.72 precision, 0.80 recall and 0.75 F-measure, demonstrate that our proposed classification scheme effectively categorizes such a large data set in the face of data imbalance.
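The cluster-based under-sampling idea named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the clustering algorithm, the number of clusters, or the base and meta classifiers, so everything below is an assumption for illustration: a tiny hand-written k-means, toy two-dimensional feature vectors, and the hypothetical helper names `kmeans` and `cluster_undersample`. The majority (not relevant) class is split into k clusters, and each cluster is paired with all minority (relevant) documents, yielding k less imbalanced training subsets on which base classifiers could be trained before a meta-classifier combines their outputs. With roughly 13 000 relevant documents out of more than 90 000, the classes differ by a factor of about six, so a handful of similarly sized clusters would give roughly balanced subsets.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means on lists of floats; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_undersample(majority, minority, k, seed=0):
    """Pair each non-empty majority-class cluster with all minority examples.

    Returns a list of (X, y) training subsets, with label 0 for the majority
    class and label 1 for the minority class.
    """
    labels = kmeans(majority, k, seed=seed)
    subsets = []
    for c in range(k):
        cluster = [p for p, lab in zip(majority, labels) if lab == c]
        if cluster:
            X = cluster + minority
            y = [0] * len(cluster) + [1] * len(minority)
            subsets.append((X, y))
    return subsets


if __name__ == "__main__":
    # Toy data (hypothetical): six "not relevant" and two "relevant" vectors.
    not_relevant = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
                    [5.1, 4.9], [9.0, 0.0], [9.2, 0.1]]
    relevant = [[2.0, 2.0], [2.1, 1.9]]
    for X, y in cluster_undersample(not_relevant, relevant, k=3):
        print(len(X), "examples,", y.count(1), "relevant")
```

Every majority-class document lands in exactly one subset, so no negative examples are discarded overall, while each individual base classifier sees a far less skewed class ratio than the full collection.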

Cited by

MetaTron: advancing biomedical annotation empowering relation annotation and collaboration.
Irrera O, Marchesin S, Silvello G. BMC Bioinformatics. 2024 Mar 14;25(1):112. doi: 10.1186/s12859-024-05730-9. PMID: 38486137.
Automatic identification of scientific publications describing digital reconstructions of neural morphology.
Maraver P, Tecuatl C, Ascoli GA. Brain Inform. 2023 Sep 8;10(1):23. doi: 10.1186/s40708-023-00202-x. PMID: 37684527.
Automatic identification of scientific publications describing digital reconstructions of neural morphology.
Maraver P, Tecuatl C, Ascoli GA. bioRxiv [Preprint]. 2023 Feb 15:2023.02.14.527522. doi: 10.1101/2023.02.14.527522. PMID: 36824882. Updated.
Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling.
Thielmann A, Weisser C, Krenz A, Säfken B. J Appl Stat. 2021 Apr 27;50(3):574-591. doi: 10.1080/02664763.2021.1919063. eCollection 2023. PMID: 36819086.
Hagit Shatkay-Reshef 1965-2022.
Arighi CN. Bioinform Adv. 2022 Mar 4;2(1):vbac012. doi: 10.1093/bioadv/vbac012. eCollection 2022. PMID: 36699359.
