Copyrights owner and licenses identification models by kermitt2 · Pull Request #1078 · grobidOrg/grobid

This PR integrates two new models: one identifies the copyright owner of a document (publisher, authors or unknown), the other identifies the license, if any, under which the document file is shared (e.g. CC-BY, CC-BY-NC, etc.). The models currently work only when the "delft" engine is selected; with any other engine, the identification is skipped.

In the TEI, the result is serialized as follows (the example is https://peerj.com/articles/cs-1022/):

        <publicationStmt>
            <publisher>PeerJ</publisher>
            <availability resp="authors" status="restricted">
                <!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->
                <licence>CC-BY</licence>
            </availability>
            <date type="published" when="2022-07-25">25 July 2022</date>
        </publicationStmt>

To encode the copyright owner, we use the attribute @resp ("responsible party") and add a comment explaining how to interpret it. Note that the standard TEI @resp should be a pointer; here we restrict it to two possible values (publisher, authors) to avoid overcomplicating the encoding. When the copyright owner is unknown or left undecided by the classifier, the <availability> element has no @resp attribute.
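
As an illustration of how a client might read these fields, here is a minimal Python sketch using only the standard library. The function name `read_availability` and the embedded fragment are hypothetical. Matching on local element names is an assumption made so that the same code works both on a bare fragment and on a full TEI document that declares the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment modeled on the serialization shown in this PR.
FRAGMENT = """
<publicationStmt>
    <publisher>PeerJ</publisher>
    <availability resp="authors" status="restricted">
        <licence>CC-BY</licence>
    </availability>
    <date type="published" when="2022-07-25">25 July 2022</date>
</publicationStmt>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def read_availability(xml_text):
    """Return (copyright_owner, licence); each value is None when absent."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Skip nodes whose tag is not a plain string (e.g. comments).
        if not isinstance(element.tag, str):
            continue
        if local_name(element.tag) != "availability":
            continue
        owner = element.get("resp")  # missing when unknown or undecided
        licence = None
        for child in element:
            if isinstance(child.tag, str) and local_name(child.tag) == "licence":
                licence = (child.text or "").strip() or None
        return owner, licence
    return None, None


if __name__ == "__main__":
    print(read_availability(FRAGMENT))
```

A missing @resp comes back as `None`, which matches the "unknown or undecided" case described above.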

In addition, the service now accepts a boolean parameter includeRawCopyrights, which controls whether the full extracted copyright/license section is included in the <availability> part (under an added element <p type="raw">). This raw section is the text the classifier uses to determine the copyright owner and the license.

        <publicationStmt>
            <publisher>PeerJ</publisher>
            <availability resp="authors" status="restricted">
                <!-- the @resp attribute above gives the document copyrights owner (publisher, authors), if known -->
                <licence>CC-BY</licence>
                <p type="raw">Copyright 2022 Du et al. Distributed under Creative Commons CC-BY 4.0</p>
            </availability>
            <date type="published" when="2022-07-25">25 July 2022</date>
        </publicationStmt>
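
A client call with the new parameter might look like the sketch below. Several details are assumptions not stated in this PR: that the flag is sent as a form field with the value "1" or "0" (as other GROBID boolean parameters are), that the header endpoint `/api/processHeaderDocument` on the default `localhost:8070` accepts it, and that the third-party `requests` package is installed. The helper names are hypothetical.

```python
def build_form_fields(include_raw_copyrights):
    """Form fields for the request; the flag is assumed to be sent as "1" or "0"."""
    return {"includeRawCopyrights": "1" if include_raw_copyrights else "0"}


def process_header(pdf_path, include_raw_copyrights=True,
                   base_url="http://localhost:8070"):
    """Send a PDF to a running GROBID service and return the TEI as text."""
    # Third-party dependency, imported here so build_form_fields() stays
    # usable without it.
    import requests

    with open(pdf_path, "rb") as pdf:
        response = requests.post(
            base_url + "/api/processHeaderDocument",  # assumed endpoint
            files={"input": pdf},
            data=build_form_fields(include_raw_copyrights),
            headers={"Accept": "application/xml"},  # ask for TEI XML
            timeout=120,
        )
    response.raise_for_status()
    return response.text
```

For example, `process_header("paper.pdf")` (with `paper.pdf` standing in for a local file) would be expected to return TEI whose <availability> element contains the raw copyright paragraph, provided the two models are enabled.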

To enable the models, edit grobid-home/config/grobid.yaml and set delft as the engine for the two new models:

- name: "copyright"
  # at this time, only a DeLFT implementation is available;
  # select "wapiti" if the deep learning library JNI is not available, in which case the model is ignored
  engine: "delft"
  #engine: "wapiti"
  delft:
    # deep learning parameters
    architecture: "gru"
    #architecture: "bert"
    #transformer: "allenai/scibert_scivocab_cased"

- name: "license"
  # at this time, the engine must be DeLFT; there is no other implementation
  # select "wapiti" if the deep learning library JNI is not available, in which case the model is ignored
  engine: "delft"
  #engine: "wapiti"
  delft:
    # deep learning parameters
    architecture: "gru"
    #architecture: "bert"
    #transformer: "allenai/scibert_scivocab_cased"

Latest evaluations:

GRU ensemble 10, glove-840B
===========================

* Copyrights owner

Evaluation on 76 instances:
                   precision        recall       f-score       support
     publisher        0.9310        1.0000        0.9643            27
       authors        1.0000        1.0000        1.0000            24
     undecided        1.0000        0.9200        0.9583            25

* License identification

Evaluation on 92 instances:
                   precision        recall       f-score       support
          CC-0        0.0000        0.0000        0.0000             0
         CC-BY        1.0000        1.0000        1.0000            26
      CC-BY-NC        1.0000        0.8000        0.8889             5
   CC-BY-NC-ND        0.8000        1.0000        0.8889             8
      CC-BY-SA        1.0000        1.0000        1.0000             6
   CC-BY-NC-SA        1.0000        1.0000        1.0000             2
      CC-BY-ND        1.0000        0.5000        0.6667             2
     copyright        1.0000        0.9091        0.9524            11
         other        0.0000        0.0000        0.0000             0
     undecided        0.9697        1.0000        0.9846            32

SciBERT, base cased
===================

* Copyrights owner

Evaluation on 76 instances:
                   precision        recall       f-score       support
     publisher        0.9000        1.0000        0.9474            27
       authors        1.0000        1.0000        1.0000            24
     undecided        1.0000        0.8800        0.9362            25

* License identification

Evaluation on 83 instances:
                   precision        recall       f-score       support
          CC-0        0.0000        0.0000        0.0000             0
         CC-BY        0.7857        1.0000        0.8800            22
      CC-BY-NC        0.6000        0.7500        0.6667             4
   CC-BY-NC-ND        0.8182        0.5625        0.6667            16
      CC-BY-SA        0.2500        0.5000        0.3333             2
   CC-BY-NC-SA        0.0000        0.0000        0.0000             2
      CC-BY-ND        0.0000        0.0000        0.0000             1
     copyright        1.0000        1.0000        1.0000             8
         other        0.0000        0.0000        0.0000             1
     undecided        1.0000        1.0000        1.0000            27
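
To make the f-score columns above easier to check, the following sketch recomputes them from precision and recall with the usual harmonic-mean formula, F = 2PR / (P + R). The macro-average helper is an addition for illustration only: the PR reports no averaged score, and leaving out classes with zero support is a choice made here, not something stated in the PR. The rows are copied from the GRU license table.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f_score(rows):
    """Unweighted mean f-score over classes with at least one test instance."""
    scores = [f_score(p, r) for p, r, support in rows if support > 0]
    return sum(scores) / len(scores)


# (precision, recall, support) per class, GRU model, license identification.
GRU_LICENSE = [
    (0.0000, 0.0000, 0),   # CC-0
    (1.0000, 1.0000, 26),  # CC-BY
    (1.0000, 0.8000, 5),   # CC-BY-NC
    (0.8000, 1.0000, 8),   # CC-BY-NC-ND
    (1.0000, 1.0000, 6),   # CC-BY-SA
    (1.0000, 1.0000, 2),   # CC-BY-NC-SA
    (1.0000, 0.5000, 2),   # CC-BY-ND
    (1.0000, 0.9091, 11),  # copyright
    (0.0000, 0.0000, 0),   # other
    (0.9697, 1.0000, 32),  # undecided
]

if __name__ == "__main__":
    for precision, recall, support in GRU_LICENSE:
        print(f"{f_score(precision, recall):.4f}  (support {support})")
    print(f"macro f-score: {macro_f_score(GRU_LICENSE):.4f}")
```

Several classes have only one or two test instances, so their per-class scores move a lot with a single error.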

TODO: