GROBID 0.7.0 100% CPU loop/hang/timeout with some PDFs (original) (raw)

In our production systems at archive.org, I upgraded from GROBID ~0.5.5 to ~0.7.0 (git commit beebd9a6b to be exact) a few weeks ago. Since then, when processing hundreds of thousands of PDFs, I have regularly experienced GROBID "looping" with a couple threads consuming 100% CPU on individual cores. The symptoms are consistent with a small fraction of PDFs causing this behavior repeatably, while most PDFs continue to processed successfully. If we push a large number of PDFs through in parallel, eventually GROBID ends up spinning all the cores. The mitigation has been to restart GROBID every few hours.

It has been a struggle to get this issue to reproduce reliably in a minimum test case, even after a few hours debugging, and I apologize for any mistakes I have in debugging, or for lack of clarity in this issue write-up.

My best hypothesis right now is that there is an issue with a specific regex, when running GROBID with openjdk-8-jre (JVM 8). With openjdk-11-jre (JVM 11), the issue does not reproduce.

I have used a couple different clients and PDFs. A simple combination that has been reproducing the problem for me is the file sha1:5fa184d7eee96a504bae646e6c045699530bc023, which you can find an exact copy of at https://web.archive.org/web/20180719070056_/https://www.dovepress.com/getfile.php?fileID=14985. The API endpoint is processFulltextDocument, with consolidation disabled, raw citations and affiliations included, a number of TEI coordinates enabled, and sentance segmentation on. I haven't experimented with removing most of these options. Here is an example command using the httpie CLI client:

http --form --timeout 120 post http://localhost:8070/api/processFulltextDocument consolidateHeader=0 consolidateCitations=0 includeRawCitations=1 includeRawAffiliations=1 segmentSentences=1 teiCoordinates=ref teiCoordinates=figure teiCoordinates=biblStruct input@5fa184d7eee96a504bae646e6c045699530bc023.pdf

This command times out after two minutes, and the threads continue to spin for at least 10 minutes on the server. In other cases the CPU seems to continue to spin for hours, though I haven't isolated and reproduced exactly that behavior.

The specific regex is PARENTHESIS_NUMBER_PATTERN, which is called on line 66 of CalloutAnalyzer.java: https://github.com/kermitt2/grobid/blob/beebd9a6b7847565fac86c680919aaacfbc9ad14/grobid-core/src/main/java/org/grobid/core/engines/citations/CalloutAnalyzer.java#L65-L68

To debug, I submit a couple parallel requests of the same PDF file, then send signal 3 to a high-CPU process, which causes java to dump a thread trace. These are long, and include many threads, but a common pattern is a trace like:

Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: "dw-35 - POST /api/processFulltextDocument" #35 prio=5 os_prio=0 tid=0x00007f534c001000 nid=0xc239e runnable [0x00007f53b828e000]
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]:    java.lang.Thread.State: RUNNABLE
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match(Pattern.java:4241)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Branch.match(Pattern.java:4618)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Loop.match(Pattern.java:4799)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupTail.match(Pattern.java:4731)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BranchConn.match(Pattern.java:4582)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupTail.match(Pattern.java:4731)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Ques.match(Pattern.java:4196)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match0(Pattern.java:4286)

[... many java.util.regex lines ...]

Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match0(Pattern.java:4286)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Curly.match(Pattern.java:4248)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Branch.match(Pattern.java:4618)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$GroupHead.match(Pattern.java:4672)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Loop.matchInit(Pattern.java:4815)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Prolog.match(Pattern.java:4755)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3812)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Pattern$Start.match(Pattern.java:3475)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Matcher.search(Matcher.java:1248)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.util.regex.Matcher.find(Matcher.java:637)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.citations.CalloutAnalyzer.getCalloutType(CalloutAnalyzer.java:66)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.postProcessCallout(FullTextParser.java:2359)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:298)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.FullTextParser.processing(FullTextParser.java:110)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.Engine.fullTextToTEIDoc(Engine.java:507)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.core.engines.Engine.fullTextToTEI(Engine.java:497)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.process.GrobidRestProcessFiles.processFulltextDocument(GrobidRestProcessFiles.java:207)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.GrobidRestService.processFulltext(GrobidRestService.java:268)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.grobid.service.GrobidRestService.processFulltextDocument_post(GrobidRestService.java:220)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

[... many service lines ...]

Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.io.ManagedSelector$$Lambda$101/448084064.run(Unknown Source)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
Dec  4 02๐Ÿ”ž11 wbgrp-svc096.us.archive.org GROBID[795471]: #011at java.lang.Thread.run(Thread.java:748)

The org.grobid.core.engines.citations.CalloutAnalyzer.getCalloutType(CalloutAnalyzer.java:66) is common across a few repeats with different files, different requests, and multiple threads stuck in the same regex at the same point in time. This is what makes me suspect the regex linked above.

When I run GROBID on my laptop, using the same artifacts and configuration, except lower concurrency, I can not reproduce the issue. I noted that my laptop (running Debian Linux, on an intel processor) has JDK 11, while the server (Ubuntu focal on intel processor) has JDK 8 (which is what GROBID supports, as per documentation). I tried installing JDK 11 on the server instead, and could no longer reproduce the problem.

I haven't really worked through what the specific root cause is... difficult regex for some body text on some platforms in a specific version of the JVM? But figured I would share what I have found so far here.

Thanks, as always, for maintaining GROBID, and apologies this bug report runs so long.