Copyrights owner and licenses identification models by kermitt2 · Pull Request #1078 · grobidOrg/grobid
This PR integrates two new models: one identifies the copyright owner of a document (publisher, authors or unknown), the other identifies the license, if provided, under which the document file is shared (e.g. CC-BY, CC-BY-NC, etc.). The models currently work only if the "delft" engine is selected. If this engine is not selected, the identification is skipped.
In the TEI, the result is serialized as follows (the example is https://peerj.com/articles/cs-1022/):
<publicationStmt>
    <publisher>PeerJ</publisher>
    <availability resp="authors" status="restricted">
        <!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->
        <licence>CC-BY</licence>
    </availability>
    <date type="published" when="2022-07-25">25 July 2022</date>
</publicationStmt>

To encode the copyright owner, we use the attribute @resp ("responsible party") and add a comment explaining how to interpret it. Note that the standard @resp in TEI should be a pointer; here we customize it to take one of two possible values (publisher, authors) to avoid overcomplicating the encoding. When the copyright owner is undecided by the classifier or unknown, the <availability> element has no @resp attribute.
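For a consumer of the TEI, reading these two values back could look like the sketch below. It is only an illustration: the helper name read_copyright_info is hypothetical, the sample fragment is the example above with the TEI namespace declared on it (an assumption about how the fragment appears inside a full TEI document), and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# TEI namespace, assumed to be declared on the enclosing document
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Sample fragment mirroring the example above
SAMPLE = """<publicationStmt xmlns="http://www.tei-c.org/ns/1.0">
  <publisher>PeerJ</publisher>
  <availability resp="authors" status="restricted">
    <licence>CC-BY</licence>
  </availability>
  <date type="published" when="2022-07-25">25 July 2022</date>
</publicationStmt>"""


def read_copyright_info(publication_stmt_xml):
    """Return (owner, licence) from a <publicationStmt> fragment.

    Hypothetical helper: owner is None when @resp is absent
    (undecided or unknown), licence is None when no <licence> is present.
    """
    root = ET.fromstring(publication_stmt_xml)
    availability = root.find("tei:availability", TEI_NS)
    if availability is None:
        return None, None
    owner = availability.get("resp")
    licence = availability.findtext("tei:licence", default=None, namespaces=TEI_NS)
    return owner, licence


owner, licence = read_copyright_info(SAMPLE)
print(owner, licence)
```

Treating a missing @resp as None keeps the "undecided or unknown" case distinct from the two explicit values.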
In addition, the service now accepts a boolean parameter includeRawCopyrights, which controls whether the full copyright/license section extracted from the document is included in the <availability> part (under an added element <p type="raw">). This section is the input used by the classifier to determine the copyright owner and the license.
<publicationStmt>
    <publisher>PeerJ</publisher>
    <availability resp="authors" status="restricted">
        <!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->
        <licence>CC-BY</licence>
        <p type="raw">Copyright 2022 Du et al. Distributed under Creative Commons CC-BY 4.0</p>
    </availability>
    <date type="published" when="2022-07-25">25 July 2022</date>
</publicationStmt>

To have it working, edit grobid-home/config/grobid.yaml to indicate delft as engine for the two new models:
- name: "copyright"
  # at this time, we only have a DeLFT implementation,
  # use "wapiti" if the deep learning library JNI is not available and model will then be ignored
  engine: "delft"
  #engine: "wapiti"
  delft:
    # deep learning parameters
    architecture: "gru"
    #architecture: "bert"
    #transformer: "allenai/scibert_scivocab_cased"
- name: "license"
  # at this time, must always be DeLFT, not other implementation
  # use "wapiti" if the deep learning library JNI is not available and model will then be ignored
  engine: "delft"
  #engine: "wapiti"
  delft:
    # deep learning parameters
    architecture: "gru"
    #architecture: "bert"
    #transformer: "allenai/scibert_scivocab_cased"

Latest evaluations:
GRU ensemble 10, glove-840B
===========================

* Copyrights owner

Evaluation on 76 instances:
              precision   recall   f-score   support
publisher        0.9310   1.0000    0.9643        27
authors          1.0000   1.0000    1.0000        24
undecided        1.0000   0.9200    0.9583        25

* License identification

Evaluation on 92 instances:
              precision   recall   f-score   support
CC-0             0.0000   0.0000    0.0000         0
CC-BY            1.0000   1.0000    1.0000        26
CC-BY-NC         1.0000   0.8000    0.8889         5
CC-BY-NC-ND      0.8000   1.0000    0.8889         8
CC-BY-SA         1.0000   1.0000    1.0000         6
CC-BY-NC-SA      1.0000   1.0000    1.0000         2
CC-BY-ND         1.0000   0.5000    0.6667         2
copyright        1.0000   0.9091    0.9524        11
other            0.0000   0.0000    0.0000         0
undecided        0.9697   1.0000    0.9846        32

SciBERT, base cased
===================

* Copyrights owner

Evaluation on 76 instances:
              precision   recall   f-score   support
publisher        0.9000   1.0000    0.9474        27
authors          1.0000   1.0000    1.0000        24
undecided        1.0000   0.8800    0.9362        25

* License identification

Evaluation on 83 instances:
              precision   recall   f-score   support
CC-0             0.0000   0.0000    0.0000         0
CC-BY            0.7857   1.0000    0.8800        22
CC-BY-NC         0.6000   0.7500    0.6667         4
CC-BY-NC-ND      0.8182   0.5625    0.6667        16
CC-BY-SA         0.2500   0.5000    0.3333         2
CC-BY-NC-SA      0.0000   0.0000    0.0000         2
CC-BY-ND         0.0000   0.0000    0.0000         1
copyright        1.0000   1.0000    1.0000         8
other            0.0000   0.0000    0.0000         1
undecided        1.0000   1.0000    1.0000        27

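As a reading aid for the tables above, the f-score column is consistent with the usual harmonic mean of precision and recall (F1). The sketch below recomputes a few rows copied from the tables. The function name f_score and the convention of returning 0.0 when precision and recall are both zero are assumptions for illustration, not taken from the evaluation code.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f-score), copied from the tables above
rows = {
    "GRU, owner, publisher": (0.9310, 1.0000, 0.9643),
    "GRU, license, CC-BY-ND": (1.0000, 0.5000, 0.6667),
    "SciBERT, owner, undecided": (1.0000, 0.8800, 0.9362),
    "SciBERT, license, CC-BY-SA": (0.2500, 0.5000, 0.3333),
}

for name, (p, r, reported) in rows.items():
    computed = f_score(p, r)
    # Inputs are already rounded to 4 decimals, so allow a small tolerance
    status = "ok" if abs(computed - reported) < 5e-4 else "mismatch"
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f} ({status})")
```

Rows with a support of 0 (for example CC-0) have no instances of that class in the evaluation set, so their zero scores carry no information about the model.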
TODO:
- update the TEI ODD schema for the customized attribute @resp
- think about a lighter CPU only classifier maybe, not requiring all the Deep Learning libraries JNI and installation
- think about getting the license version, because it is necessary to create a target URL associated with the license element (e.g. @target="https://creativecommons.org/licenses/by/3.0/"). Without the version, there is no URL possible for the CC license :/
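
To illustrate the last point, here is a minimal sketch of how a target URL could be derived once a version is known. Nothing here is part of the PR: the helper names license_target and guess_version are hypothetical, the label set follows the classes in the evaluation above, and the version regex is a naive heuristic applied to the raw copyright text.

```python
import re

# URL path segment for each Creative Commons label (labels as used by the classifier)
CC_PATHS = {
    "CC-BY": "by",
    "CC-BY-NC": "by-nc",
    "CC-BY-NC-ND": "by-nc-nd",
    "CC-BY-SA": "by-sa",
    "CC-BY-NC-SA": "by-nc-sa",
    "CC-BY-ND": "by-nd",
}


def license_target(label, version):
    """Build a Creative Commons URL, or return None when it cannot be built."""
    if version is None:
        return None  # without the version there is no valid URL
    if label == "CC-0":
        return f"https://creativecommons.org/publicdomain/zero/{version}/"
    path = CC_PATHS.get(label)
    if path is None:
        return None  # "copyright", "other", "undecided" have no CC URL
    return f"https://creativecommons.org/licenses/{path}/{version}/"


def guess_version(raw_text):
    """Naive heuristic (assumption): first number shaped like 1.0 to 4.9 in the text."""
    match = re.search(r"\b([1-4]\.\d)\b", raw_text)
    return match.group(1) if match else None


raw = "Copyright 2022 Du et al. Distributed under Creative Commons CC-BY 4.0"
print(license_target("CC-BY", guess_version(raw)))
```

Returning None rather than a versionless URL mirrors the constraint stated in the TODO: the label alone is not enough to point at a specific license text.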
