Releases · grobidOrg/grobid

0.9.0 What's Changed

Added

Conflict of interest and author contributions statement extraction in header and segmentation models #1319
Extract figures, tables and equations from back/annex sections #1215
Extract URLs from PDF annotations in fulltext #1315
Mark consolidated bibliographical references and header explicitly in TEI output #1313
Include middle name and format initials in BibTeX output #1356
Fetch ORCID from Crossref when not extracted by Grobid #1406
Timeout configuration for consolidation requests (separate glutton and Crossref timeout) #1340
Lingua as an alternative for language recognition #1239
Blingfire as an alternative sentence segmentation engine #1378
Native support for Linux ARM 64 architecture
Multi-architecture Docker builds with ARM64 support (pdfalto and wapiti binaries for Linux ARM 64)
Support for Python environment managers (virtualenv, conda) for DeLFT integration #1010
Added version and revision information in the web UI #1390
Added health status indicator with periodic updates in the web UI #1403
Added more explanation and links to documentation in the web UI #1391
More informative /api/health endpoint, failing early when models are partially initialised #1373
-modelPath CLI argument for training and eval-mode model loading #1383, #1389
Evaluation script for running end-to-end evaluation from the repository root
Enabled trivy security code scanning #1295
Updated Citation.cff and SWID metadata #1341

Changed

Revised and updated the Crossref integration, with better handling of API limits and errors, in collaboration with Crossref team #1398
Upgraded to JDK 21 and Gradle 9 #1321
Updated TensorFlow to 2.17 with Python 3.10-3.11 support #1188
Updated pdfalto to 0.6.0
Updated wapiti to 1.5.1
Updated JEP to 4.2.2 #1332
Updated DeLFT to > 0.4.1 in documentation and Dockerfiles #1400
Updated JRuby to 9.4.12.1 and pragmatic segmenter #1293
Updated Docker base images from deprecated openjdk to eclipse-temurin (21.0.10_7)
Updated Dropwizard to address Trivy vulnerability in Docker image
Updated grobid-lucene-analyzers #1346
Updated dependency versions in build.gradle #1377
Extensive model retraining: header, segmentation, fulltext, article-light, and article-light-ref models updated across CRF, BidLSTM_CRF_FEATURES, and BidLSTM_ChainCRF_FEATURES architectures
Significant expansion of training data for segmentation, fulltext, header, name, and affiliation-address models
Refactored training framework for clearer extensibility #1393
Updated benchmark results #1392
Removed obsolete and unused models #1367
Enhanced documentation structure and clarity for newcomers #1310, #1382
Return XML by default when no HTTP Accept header is provided #1405
CI speed-up #1374
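
The default-to-XML behaviour listed above (#1405) can be pictured with a small content-negotiation sketch. This is not GROBID's actual implementation: the function name `choose_media_type`, the `SUPPORTED` tuple, and the simplified matching (quality values are ignored) are assumptions made purely for illustration.

```python
from typing import Optional

# Hypothetical set of media types a service can produce; XML comes first as the default.
SUPPORTED = ("application/xml", "application/x-bibtex")


def choose_media_type(accept: Optional[str]) -> str:
    """Pick a response media type, falling back to XML when Accept is absent."""
    if accept is None or not accept.strip():
        # No Accept header was sent: return XML by default.
        return SUPPORTED[0]
    for item in accept.split(","):
        # Drop parameters such as ";q=0.8"; this sketch ignores quality values.
        media_type = item.split(";")[0].strip().lower()
        if media_type in ("*/*", "application/*"):
            return SUPPORTED[0]
        if media_type in SUPPORTED:
            return media_type
    # Nothing matched: this sketch also falls back to XML (an assumption).
    return SUPPORTED[0]
```

A client that omits the header, for example `choose_media_type(None)`, gets `application/xml` in this sketch, while an explicit `Accept: application/x-bibtex` is honoured.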

Fixed

Figures, tables and equations identifier uniqueness and overlapping IDs in body and annex #1342
IndexOutOfBoundsException in ORCID search by annotation #1369
Missing logic to correctly get conflicts and credits in the output TEI
BibTeX index bug #1409
Revision link format in the web UI #1404
German wordforms failing to load in the Lexicon #1362
Honour instance-level Wapiti params in train() #1383
Evaluation script now works from the repository root
Docker build crash caused by dynamic Python environment version fetching #1348
Dockerfile for ARM Linux #1395
Full Docker image build restored #1371
preload_embeddings.py crash when download directory doesn't exist
Security-oriented regex improvements #1366
Coveralls build and Gradle deprecations #1347
Numerous training data corrections across all models

New Contributors

@jgoodall made their first contribution in #1261
@haydn-jones made their first contribution in #1301
@seang096 made their first contribution in #1335
@homfunc made their first contribution in #1321
@thesanogoeffect made their first contribution in #1348

Full Changelog: 0.8.2...0.9.0

0.8.2 What's Changed

Added

New model specialization/variants (flavors) mechanism #1151
Specialization/variant for lightweight processing of scientific article types that do not follow the general segmentation schema (e.g., corrections, editorial letters) #1202
Additional training data covering cases where Data Availability statements span multiple pages #1200
Added a flag that allows the output of the raw copyright information in TEI #1181
New Docker container for running end-to-end evaluation #1255
New Grobid client in Go #1159
Make the start/end page for header processing customizable #282
Return configuration processing parameters in TEI XML response header #1274

Changed

Update PDFalto recognition of non-standard fonts #1216
Restore text that does not belong to graphics as paragraphs instead of dropping it #1266
Updated Grobid lucene analyzers for CJK languages #1228

Fixed

Fix URL identification for certain edge cases #1190, #1191, #1185
Fix fulltext model training data #1107
Fix header model training data #1128
Updated the docker image's packages to reduce the vulnerabilities #1173
Fixed a bug in the handling of badly formatted figures/tables #1207
Correct replacement in the filenames of the fulltext generated files #1204
Fixed full-text block start #1203
Fix affiliation missing when using DL affiliation-address model #1166
Fixed various security vulnerabilities #1125 #1123 #1205
Avoid NPE when iterating over annotations that might have null bounding Boxes #1194

New Contributors

@annelhote made their first contribution in #1179
@miku made their first contribution in #1159
@Schroedi made their first contribution in #1107

Full Changelog: 0.8.1...0.8.2

0.8.1

Added

Identified URLs are now added in the TEI output #1099
Added DL models for patent processing #1082
Copyrights owner and licenses identification models #1078
Add research infrastructure recognition for funding processing #1085
Add paragraphs coordinates in the TEI output #1068
Specify configuration file with DL models enabled for the full docker image #1117
Support for biblio-glutton 0.3 #1086

Changed

Update affiliation process #1069
Improved the recognition of URLs using (when available) PDF annotations, such as clickable links
Updated TEI schema #1084
Review patent process #1082
Add Kotlin language to support development and testing #1096

Fixed

Avoid splitting URLs between sentences #1097
Add missing sentence segmentation in funding and acknowledgement #1106
Docker image was optimized to reduce the needed space #1088
Fixed OOBE when processing large quantities of notes #1075
Corrected <title> coordinate attribute name #1070
Fix missing coordinates in paragraph continuation #1076
Fixed JSON log output
Fixed notes identification #1124
Fixed extraneous semicolon in the training data #1133
Reduced security vulnerabilities in the dependencies #1136 #1137

New Contributors

@tanaynayak made their first contribution in #1133
@vipulg13 made their first contribution in #1137

Version 0.8.0

Added

Extraction of funder and funding information with a specific new model, see #1046 for details
Optional consolidation of funder with CrossRef Funder Registry
Identification of acknowledged entities in the acknowledgement section
Optional coordinates in title elements

Changed

Dropwizard upgrade to 4.0
Minimum JDK/JVM requirement for building/running the project is now 1.11
Logging now with Logback, removal of Log4j2, optional logs in json format
General review of logs
Enable Github actions / Disable circleci #678

Fixed

Set dynamic memory limit in pdfalto_server #1038
Logging to files when training models now works as expected
Various dependency upgrades
Fix #1051 with possible problematic PDF
Fix #1036 for pdfalto memory limit
Fix readthedocs build #1040
Fix for null equation #1030
Other minor fixes

Version 0.7.3

Added

Support for JDK beyond 1.11, tested up to Java 17, thanks to removal of dynamic native library loading after the start of the JVM
Incremental training (all models and ML engines), add this option in training command line and training web service (#971)
Systematic benchmarking on two new sets: PLOS (1000 articles) and eLife (984 articles)
All end-to-end evaluation datasets are now available from the same place: https://zenodo.org/record/7708580
Option to output coordinates in notes and figure/table captions
Support for Mac ARM architecture (#975)
Play With Docker documentation (#962)

Changed

Update to DeLFT version 0.3.3
Demo now hosted as HuggingFace space
Additional training data, in particular for citation, reference-segmenter, segmentation, header, etc.
Update Deep Learning models (and some of the CRF)
The standard analyzer for sub-lexical tokenization is available in grobid-core, and used for the citation model (in particular for improving CJK references) (#990)
Update evaluations

Fixed

Correct wrong content type in doc for processCitation web service
Sentence segmentation applied to notes (#995)
Other minor fixes

Version 0.7.2

Added

Explicit identification of data/code availability statements (#951) and funding statements (#959), including when they are located in the header
Link footnotes and their "callout" markers in full text (#944)
Option to consolidate header only with DOI if a DOI is extracted (#742)
"Window" application of RNN model for reference-segmenter to cover long bibliographical sections
Add dynamic timeout on pdfalto_server (#926)
A modest Python script to help to find "interesting" error cases in a repo of JATS/PDF pairs, grobid-home/scripts/select_error_cases.py
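
The "window" application of an RNN model to long bibliographical sections, listed above, amounts to splitting a long token sequence into bounded chunks that are labelled separately. The sketch below shows only that splitting step under stated assumptions: the function name `sliding_windows`, the `size` and `overlap` parameters, and the overlap strategy are hypothetical and may differ from what GROBID and DeLFT actually do.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(tokens: Sequence[T], size: int, overlap: int = 0) -> List[Sequence[T]]:
    """Split a sequence into windows of at most `size` items, sharing `overlap` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    windows: List[Sequence[T]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + size])
        # Stop once the current window reaches the end of the sequence.
        if start + size >= len(tokens):
            break
        start += step
    return windows
```

Each window can then be passed to a sequence labeller on its own. How the predictions for overlapping positions are merged is left open here, since the release notes do not describe it.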

Changed

Update to DeLFT version 0.3.2
Some more training data (authors in reference, segmentation, citation, reference-segmenter) (including #961, #864)
Update of some models, RNN with feature channels and CRF (segmentation, header, reference-segmenter, citation)
Review guidelines for segmentation model
Better URL matching, in particular by taking PDF URL annotations into account

Fixed

Fix unexpected figure and table labeling in short texts
When matching an ORCID to an author, prioritize Crossref info over extracted ORCID from the PDF (#838)
Annotation errors for acknowledgement and other minor stuff
Fix for Python library loading on Mac
Update docker file to support new CUDA key
Do not dehyphenize text in superscript or subscript
Allow absolute temporary paths
Fix redirected stderr from pdfalto not "gobbled" by the java ProcessBuilder call (#923)
Other minor fixes

Version 0.7.1

Added

Web services for training models (#778)
Some additional training data for bibliographical references from arXiv
Add a web service to process a list of reference strings, see https://grobid.readthedocs.io/en/processcitationlist/Grobid-service/#apiprocesscitationlist
Extended processHeaderDocument to get result in bibTeX

Changed

Update to DeLFT version 0.3.1 and TensorFlow 2.7, with many improvements, see https://github.com/kermitt2/delft/releases/tag/v0.3.0
Update of Deep Learning models
Update of JEP and add install script
Update to new biblio-glutton version 0.2, for improved and faster bibliographical reference matching
circleci to replace Travis
Update of processFulltextAssetDocument service to use the same parameters as processFulltextDocument
Pre-compile regex if not already done
Review features for header model

Fixed

Improved date normalization (#760)
Fix possible issue with coordinates related to reference markers (#908) and sentence (#811)
Fix path to bitmap/vector graphics (#836)
Fix possible catastrophic regex backtracking (#867)
Other minor fixes

Version 0.7.0

Added

New YAML configuration: all the settings are in one single yaml file, each model can be fully configured independently
Improvement of the segmentation and header models (for header, +1 F1-score for PMC evaluation, +4 F1-score for bioRxiv), improvements for body and citations
Add figure and table pop-up visualization on PDF in the console demo
Add PDF MD5 digest in the TEI results (service only)
Language support packages and xpdfrc file for pdfalto (support of CJK and exotic fonts)
Prometheus metrics
BidLSTM-CRF-FEATURES implementation available for more models
Addition of a "How GROBID works" page in the documentation

Changed

JitPack release (RIP jcenter)
Improved DOI cleaning
Speed improvement (around +10%), by factorizing some layout token manipulation
Update CrossRef requests implementation to align to the current usage of CrossRef's X-Rate-Limit-Limit response parameter
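
As a rough illustration of the last item, a client can read the `X-Rate-Limit-Limit` response header to decide how many requests it may issue. The sketch below is an assumption-laden example, not GROBID's code: the function name `requests_allowed` and the conservative fallback of 1 are hypothetical choices.

```python
from typing import Mapping

DEFAULT_LIMIT = 1  # hypothetical conservative fallback when the header is missing or malformed


def requests_allowed(headers: Mapping[str, str]) -> int:
    """Read the X-Rate-Limit-Limit response header as a positive integer."""
    # HTTP header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-rate-limit-limit", "")
    try:
        limit = int(raw.strip())
    except ValueError:
        # Missing or non-numeric value: fall back to the conservative default.
        return DEFAULT_LIMIT
    return limit if limit > 0 else DEFAULT_LIMIT
```

The returned value could, for instance, size a thread pool or a semaphore that bounds concurrent consolidation calls.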

Fixed

Fix base url in demo console
Add missing pdfalto Graphics information when -noImage is used, fix graphics data path in TEI
Fix the tendency to merge tables when they are in close proximity

Version 0.6.2

Added

Docker image covering both Deep Learning and CRF models, with GPU detection and preloading of embeddings
For Deep Learning models, labeling is now done by batch: application of the citation DL model is 4 times faster for BidLSTM-CRF (with or without features) and 6 times faster for SciBERT
More tests for sentence segmentation
Add the ORCID of persons when available from the PDF or via consolidation (i.e. if present in CrossRef metadata)
Add BidLSTM-CRF-FEATURES header model (with feature channel)
Add bioRxiv end-to-end evaluation
Bounding boxes for optional section titles coordinates

Changed

Reduce the size of docker images
Improve end-to-end evaluation: multithreaded processing of PDF, progress bar, output the evaluation report in markdown format
Update of several models covering CRF, BidLSTM-CRF and BidLSTM-CRF-FEATURES, mainly improving citation and author recognitions
OpenNLP is the default optional sentence segmenter (similar results to Pragmatic Segmenter for scholarly documents after benchmarking, but 30 times faster)
Refine sentence segmentation to exploit layout information and predicted reference callouts
Update jep version to 3.9.1

Fixed

Ignore invalid utf-8 sequences
Update CrossRef multithreaded calls to avoid using the unreliable time interval returned by the CrossRef REST API service, update usage of Crossref-Plus-API-Token and update the deprecated crossref field query.title
Missing last table or figure when generating training data for the fulltext model
Fix an error related to the feature value for the reference callout for the fulltext model
Review/correct DeLFT configuration documentation, with a step-by-step configuration documentation
Other minor fixes

Version 0.6.1

Added

Support for line numbers (typically in preprints)
End-to-end evaluation and benchmark for preprints using the bioRxiv 10k dataset
Check whether a PDF annotation is an ORCID and add the ORCID to the author in the TEI result
Configuration for making sequence labeling engine (CRF Wapiti or Deep Learning) specific to models
Add a developers guide and a FAQ section in the documentation
Visualization of formulas on PDF layout in the demo console
Feature for subscript/superscript style in fulltext model

Changed

New significantly improved header model: with new features, new training data (600 new annotated examples, old training data is entirely removed), new labels and updated data structures in line with the other models
Update of the segmentation models with more training data
Removal of heuristics related to the header
Update to gradle 6.5.1 to support JDK 13 and 14
TEI schemas
Windows is not supported in this release

Fixed

Preserve affiliations after consolidation of the authors
Environment variable config override for all properties
Infrequent duplication of the abstract in the TEI result
Incorrect merging of affiliations
Noisy parentheses in the bibliographical reference markers
In the console demo, fix the output filename wrongly taken from the input form when the text form is used
Synchronisation of the language detection singleton initialisation in case of multithread environment
Other minor fixes