Funders and funding by kermitt2 · Pull Request #1046 · grobidOrg/grobid
This PR introduces an additional model called funding-acknowledgement, to parse the content of the funding and acknowledgement sections. This includes the identification of the mentioned entities (persons, affiliations/institutions, projects), with a particular effort on funder and funding information: funder name, grant number, funded project, funding program and grant name are recognized.
Results are serialized in the TEI with the list of funders in the TEI header:
<funder ref="#_JVMscTc">
    <orgName type="full">National Institutes of Health</orgName>
    <orgName type="abbreviated">NIH</orgName>
    <idno type="DOI" subtype="crossref">10.13039/100000002</idno>
</funder>
<funder>
    <orgName type="full">Hopkins Sommer Scholarship</orgName>
</funder>
<funder>
    <orgName type="full">Lieber Institute for Brain Development</orgName>
    <orgName type="abbreviated">LIBD</orgName>
    <idno type="DOI" subtype="crossref">10.13039/100015503</idno>
</funder>

The @ref attribute links each funder to one or more funding elements, which describe the funding with (when identified) the grant number, grant name, funded project and name of the funding program; an illustrative sketch is given below.
For example, consider the following acknowledgement statement: "JL and BL are supported by NIH R01 GM105705. AF is supported by a Hopkins Sommer Scholarship. AJ is supported by the Lieber Institute for Brain Development."
In the TEI result, the acknowledgement and funding sections are also enriched with inline mark-up corresponding to the identified entities.
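As a purely illustrative sketch (the element names, @type values and the identifier below are assumptions, not a verbatim copy of the PR's actual output), the annotated statement could look roughly like this, using inline TEI <rs> elements whose types mirror the model labels:

<p>
    <rs type="person">JL</rs> and <rs type="person">BL</rs> are supported by
    <rs type="funder">NIH</rs> <rs type="grantNumber" xml:id="_JVMscTc">R01 GM105705</rs>.
    <rs type="person">AF</rs> is supported by a <rs type="funder">Hopkins Sommer Scholarship</rs>.
    <rs type="person">AJ</rs> is supported by the <rs type="funder">Lieber Institute for Brain Development</rs>.
</p>

Under this reading, the xml:id carried by the grant mention is what the funder's @ref attribute in the header (ref="#_JVMscTc" above) points to.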
In addition, it is possible to consolidate the identified funders by a look-up, currently limited to the CrossRef Funder Registry and accessed through the CrossRef REST API. When a funder name is matched with sufficient confidence to a registered CrossRef funder, we add the DOI of the funder, as well as its normalized name, acronym and country.
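The look-up goes through the public CrossRef REST API funder route (free-text queries of the form https://api.crossref.org/funders?query=...). As a rough sketch of what a consolidated entry could then carry, the NIH funder above might be extended with the country returned by the registry; the <address>/<country> serialization below is an assumption for illustration, not the PR's confirmed output:

<funder ref="#_JVMscTc">
    <orgName type="full">National Institutes of Health</orgName>
    <orgName type="abbreviated">NIH</orgName>
    <idno type="DOI" subtype="crossref">10.13039/100000002</idno>
    <address>
        <country key="US">United States</country>
    </address>
</funder>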
Consolidation will be improved via biblio-glutton in a later phase.
The PR includes a complete revision of the segmentation training data regarding the acknowledgement and funding sections, as well as a set of around 1,500 manually annotated funding and acknowledgement sections.
Standalone model accuracy (strict field matching): the winner is SciBERT+CRF, as usual for NER in scientific texts
CRF funding-acknowledgement model
===== Field-level results =====
label accuracy precision recall f1 support
<affiliation> 98.36 61.84 50 55.29 94
<funderName> 93.6 69.1 68.96 69.03 480
<grantName> 98.64 42.22 33.93 37.62 56
<grantNumber> 98.43 90.7 88.95 89.82 362
<institution> 95.93 36 36.73 36.36 147
<person> 98.34 93.69 93.84 93.77 617
<programName> 98.81 35.14 29.55 32.1 44
<projectName> 99.09 46.67 17.07 25 41
all (micro avg.) 97.65 77.3 74.52 75.88 1841
all (macro avg.) 97.65 59.42 52.38 54.87 1841
===== Instance-level results =====
Total expected instances: 316
Correct instances: 130
Instance-level recall: 41.14
BidLSTM_CRF_FEATURES
f1 (micro): 75.89
precision recall f1-score support
<affiliation> 0.7000 0.8750 0.7778 24
<funderName> 0.7165 0.7333 0.7248 255
<grantName> 0.3636 0.3077 0.3333 26
<grantNumber> 0.8171 0.8938 0.8537 160
<institution> 0.4955 0.5340 0.5140 103
<person> 0.9416 0.9699 0.9556 266
<programName> 0.2800 0.3043 0.2917 23
<projectName> 0.3750 0.4412 0.4054 34
all (micro avg.) 0.7399 0.7789 0.7589 891
BidLSTM_CRF_FEATURES + ELMo
Average over 10 folds
precision recall f1-score support
<affiliation> 0.7120 0.8833 0.7878 24
<funderName> 0.6911 0.8000 0.7411 255
<grantName> 0.4220 0.4423 0.4309 26
<grantNumber> 0.8044 0.8781 0.8396 160
<institution> 0.5717 0.5515 0.5596 103
<person> 0.9511 0.9643 0.9576 266
<programName> 0.2970 0.2913 0.2924 23
<projectName> 0.4887 0.4912 0.4894 34
all (micro avg.) 0.7483 0.8012 0.7739
BERT (allenai/scibert_scivocab_cased)
precision recall f1-score support
<affiliation> 0.7368 0.8000 0.7671 35
<funderName> 0.6900 0.7670 0.7264 206
<grantName> 0.3143 0.4074 0.3548 27
<grantNumber> 0.9185 0.9394 0.9288 132
<institution> 0.4167 0.4348 0.4255 69
<person> 0.9386 0.9701 0.9541 268
<programName> 0.1500 0.3000 0.2000 10
<projectName> 0.0952 0.2857 0.1429 7
all (micro avg.) 0.7449 0.8170 0.7793 754
BERT (allenai/scibert_scivocab_cased) + CRF
precision recall f1-score support
<affiliation> 0.7436 0.8286 0.7838 35
<funderName> 0.6725 0.7476 0.7080 206
<grantName> 0.3000 0.3333 0.3158 27
<grantNumber> 0.8929 0.9470 0.9191 132
<institution> 0.4557 0.5217 0.4865 69
<person> 0.9628 0.9664 0.9646 268
<programName> 0.1875 0.3000 0.2308 10
<projectName> 0.1579 0.4286 0.2308 7
all (micro avg.) 0.7527 0.8196 0.7848 754
BERT (bert-base-cased) + CRF
precision recall f1-score support
<affiliation> 0.7179 0.8000 0.7568 35
<funderName> 0.6754 0.7476 0.7097 206
<grantName> 0.2632 0.3704 0.3077 27
<grantNumber> 0.8936 0.9545 0.9231 132
<institution> 0.4430 0.5072 0.4730 69
<person> 0.9526 0.9739 0.9631 268
<programName> 0.1304 0.3000 0.1818 10
<projectName> 0.1818 0.5714 0.2759 7
all (micro avg.) 0.7358 0.8236 0.7772 754