Funders and funding by kermitt2 · Pull Request #1046 · grobidOrg/grobid
This PR introduces an additional model called funding-acknowledgement, to parse the content of the funding and acknowledgement sections. This includes the identification of the mentioned entities (persons, affiliations/institutions, projects), with a particular effort on funder and funding information: funder name, grant number, funded project, funding program and grant name are recognized.
Results are serialized in the TEI with the list of funders in the TEI header:
<funder ref="#_JVMscTc">
    <orgName type="full">National Institutes of Health</orgName>
    <orgName type="abbreviated">NIH</orgName>
    <idno type="DOI" subtype="crossref">10.13039/100000002</idno>
</funder>
<funder>
    <orgName type="full">Hopkins Sommer Scholarship</orgName>
</funder>
<funder>
    <orgName type="full">Lieber Institute for Brain Development</orgName>
    <orgName type="abbreviated">LIBD</orgName>
    <idno type="DOI" subtype="crossref">10.13039/100015503</idno>
</funder>

The @ref attribute links each funder to one or more funding elements, which describe the funding with (when identified) the grant number, grant name, funded project and name of the funding program; an illustrative sketch is given below.
For example, consider the following acknowledgement statement: "JL and BL are supported by NIH R01 GM105705. AF is supported by a Hopkins Sommer Scholarship. AJ is supported by the Lieber Institute for Brain Development."
In the TEI result, the acknowledgement and funding sections are also enriched with inline mark-up corresponding to the identified entities.
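As a purely illustrative sketch (the element names, @type values and the identifier below are assumptions, not a verbatim copy of the PR's actual output), the annotated statement could look roughly like this, using inline TEI <rs> elements whose types mirror the model labels:

<p>
    <rs type="person">JL</rs> and <rs type="person">BL</rs> are supported by
    <rs type="funder">NIH</rs> <rs type="grantNumber" xml:id="_JVMscTc">R01 GM105705</rs>.
    <rs type="person">AF</rs> is supported by a <rs type="funder">Hopkins Sommer Scholarship</rs>.
    <rs type="person">AJ</rs> is supported by the <rs type="funder">Lieber Institute for Brain Development</rs>.
</p>

Under this reading, the xml:id carried by the grant mention is what the funder's @ref attribute in the header (ref="#_JVMscTc" above) points to.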
In addition, it is possible to consolidate the identified funders by a look-up, currently limited to the CrossRef Funder Registry and accessed through the CrossRef REST API. When a funder name is matched with sufficient confidence to a registered CrossRef funder, we add the DOI of the funder, as well as its normalized name, acronym and country.
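The look-up goes through the public CrossRef REST API funder route (free-text queries of the form https://api.crossref.org/funders?query=...). As a rough sketch of what a consolidated entry could then carry, the NIH funder above might be extended with the country returned by the registry; the <address>/<country> serialization below is an assumption for illustration, not the PR's confirmed output:

<funder ref="#_JVMscTc">
    <orgName type="full">National Institutes of Health</orgName>
    <orgName type="abbreviated">NIH</orgName>
    <idno type="DOI" subtype="crossref">10.13039/100000002</idno>
    <address>
        <country key="US">United States</country>
    </address>
</funder>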
Consolidation will be improved via biblio-glutton in a later phase.
The PR includes a complete revision of the segmentation training data regarding the acknowledgement and funding sections, as well as a set of around 1,500 manually annotated funding and acknowledgement sections.
Standalone model accuracy (strict field matching): the winner is SciBERT+CRF, as usual for NER in scientific texts
CRF funding-acknowledgement model
===== Field-level results =====
label accuracy precision recall f1 support
<affiliation> 98.36 61.84 50 55.29 94
<funderName> 93.6 69.1 68.96 69.03 480
<grantName> 98.64 42.22 33.93 37.62 56
<grantNumber> 98.43 90.7 88.95 89.82 362
<institution> 95.93 36 36.73 36.36 147
<person> 98.34 93.69 93.84 93.77 617
<programName> 98.81 35.14 29.55 32.1 44
<projectName> 99.09 46.67 17.07 25 41
all (micro avg.) 97.65 77.3 74.52 75.88 1841
all (macro avg.) 97.65 59.42 52.38 54.87 1841
===== Instance-level results =====
Total expected instances: 316
Correct instances: 130
Instance-level recall: 41.14
BidLSTM_CRF_FEATURES
f1 (micro): 75.89
precision recall f1-score support
<affiliation> 0.7000 0.8750 0.7778 24
<funderName> 0.7165 0.7333 0.7248 255
<grantName> 0.3636 0.3077 0.3333 26
<grantNumber> 0.8171 0.8938 0.8537 160
<institution> 0.4955 0.5340 0.5140 103
<person> 0.9416 0.9699 0.9556 266
<programName> 0.2800 0.3043 0.2917 23
<projectName> 0.3750 0.4412 0.4054 34
all (micro avg.) 0.7399 0.7789 0.7589 891
BidLSTM_CRF_FEATURES + ELMo
Average over 10 folds
precision recall f1-score support
<affiliation> 0.7120 0.8833 0.7878 24
<funderName> 0.6911 0.8000 0.7411 255
<grantName> 0.4220 0.4423 0.4309 26
<grantNumber> 0.8044 0.8781 0.8396 160
<institution> 0.5717 0.5515 0.5596 103
<person> 0.9511 0.9643 0.9576 266
<programName> 0.2970 0.2913 0.2924 23
<projectName> 0.4887 0.4912 0.4894 34
all (micro avg.) 0.7483 0.8012 0.7739
BERT (allenai/scibert_scivocab_cased)
precision recall f1-score support
<affiliation> 0.7368 0.8000 0.7671 35
<funderName> 0.6900 0.7670 0.7264 206
<grantName> 0.3143 0.4074 0.3548 27
<grantNumber> 0.9185 0.9394 0.9288 132
<institution> 0.4167 0.4348 0.4255 69
<person> 0.9386 0.9701 0.9541 268
<programName> 0.1500 0.3000 0.2000 10
<projectName> 0.0952 0.2857 0.1429 7
all (micro avg.) 0.7449 0.8170 0.7793 754
BERT (allenai/scibert_scivocab_cased) + CRF
precision recall f1-score support
<affiliation> 0.7436 0.8286 0.7838 35
<funderName> 0.6725 0.7476 0.7080 206
<grantName> 0.3000 0.3333 0.3158 27
<grantNumber> 0.8929 0.9470 0.9191 132
<institution> 0.4557 0.5217 0.4865 69
<person> 0.9628 0.9664 0.9646 268
<programName> 0.1875 0.3000 0.2308 10
<projectName> 0.1579 0.4286 0.2308 7
all (micro avg.) 0.7527 0.8196 0.7848 754
BERT (bert-base-cased) + CRF
precision recall f1-score support
<affiliation> 0.7179 0.8000 0.7568 35
<funderName> 0.6754 0.7476 0.7097 206
<grantName> 0.2632 0.3704 0.3077 27
<grantNumber> 0.8936 0.9545 0.9231 132
<institution> 0.4430 0.5072 0.4730 69
<person> 0.9526 0.9739 0.9631 268
<programName> 0.1304 0.3000 0.1818 10
<projectName> 0.1818 0.5714 0.2759 7
all (micro avg.) 0.7358 0.8236 0.7772 754