
Question - what is a reasonable box configuration for a GROBID REST endpoint #349

Open
curtkohler opened this issue Sep 24, 2018 · 2 comments
question There's no such thing as a stupid question


@curtkohler

I've been trying to write a quick PDF->XML Spark conversion job leveraging GROBID. I read the PDFs remotely, and since the GROBID Java library doesn't support passing in byte arrays, I decided to spin up a separate GROBID server and send it REST transform requests instead of writing and reading temp files on the Spark nodes. While the code appears to work fine in simple testing, when I scale things up the GROBID REST server quickly bogs down: I see many Jetty timeouts, very long processing times, etc. The documentation doesn't really address the details of configuring the server, and there are a number of moving parts in play (Jetty, GROBID, pdf2xml, etc.), so I was hoping you might be able to provide some recommendations based on your experience.

For instance, assuming I have a virtual box with 8 virtual CPUs and 16GB of memory:
How many concurrent connections would you set in grobid.properties?
Would you lengthen the pool max wait?
Would you modify any Java settings for memory allocation?
Modify any of the Jetty settings?
Etc.

Thanks in advance for any insight you can provide.

@kermitt2
Owner

kermitt2 commented Sep 29, 2018

Hello @curtkohler

Thanks for the issue!

Your issue made me realize that a change introduced when moving to Dropwizard led to incorrect status codes in the service responses for PDF processing in versions 0.5.0 and 0.5.1 (I overlooked this change when merging; the services for text processing were not changed and are OK).

Normally GROBID sends a status 503 (service unavailable) when all its threads are in use, so that the client can wait a bit for a thread to become available before re-sending the query. This is how the service scales while avoiding mega-queues of PDF requests at the server. The change turned this case into a runtime error, and the resulting accumulation of queries is very likely the cause of these problems.
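A client can honor that 503 contract with a simple retry loop. Below is a minimal sketch in Python using `requests`, assuming the standard `/api/processFulltextDocument` endpoint on GROBID's default port 8070; the injectable `post` parameter is only there to make the function easy to exercise without a live server:

```python
import time
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def process_pdf(pdf_bytes, url=GROBID_URL, post=requests.post,
                max_retries=10, backoff=5.0):
    """POST a PDF to GROBID; on 503 (all threads busy), wait and retry."""
    for _ in range(max_retries):
        resp = post(url, files={"input": pdf_bytes}, timeout=300)
        if resp.status_code == 503:
            time.sleep(backoff)  # server is saturated: back off, don't queue
            continue
        resp.raise_for_status()
        return resp.text  # TEI XML
    raise RuntimeError("GROBID still busy after %d retries" % max_retries)
```

The key point is that the client, not the server, holds the backlog: a 503 means "try again later", not "error".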

I've corrected this in the current master version of GROBID. In addition, I completed the documentation:
https://grobid.readthedocs.io/en/latest/Grobid-service/
It now describes the response status codes for each service. I also added some more explanation here:
https://grobid.readthedocs.io/en/latest/Grobid-service/#parallel-mode

In addition, I've written 3 clients that use the service in the intended, scalable way:

Normally, with an 8-CPU machine you should get good performance with the default settings of GROBID and of these clients; you don't need to increase the size of the thread pool or the max wait. 16GB of memory is enough to exploit all your available threads.

In the coming weeks I will perform larger-scale tests with these clients (with millions of PDFs) and I will report whether everything is fine.

@kermitt2 kermitt2 self-assigned this Sep 29, 2018
@lfoppiano lfoppiano added the question There's no such thing as a stupid question label Oct 17, 2018
@bnewbold
Contributor

It's not "official", but I can describe a box setup that has been working OK for some time:

Single large worker host, running in a virtual machine: 30 cores at 2.1 GHz, 50 GB RAM, slow spinning disk (not SSD).
Python workers GET (HTTP) PDFs from remote storage and POST them to the GROBID worker, then POST the XML response elsewhere, so that as little as possible hits local disk. We run 50 Python worker processes (synchronous, single-threaded; there is plenty of RAM to do so). We are not using the (new) Python library, just requests.
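As a sketch, one iteration of such a worker might look like the following; the URLs and the sink endpoint are placeholders, and the injectable `get`/`post` parameters exist only to keep the function testable without network access:

```python
import requests

def convert_one(pdf_url, grobid_url, sink_url,
                get=requests.get, post=requests.post):
    """One worker step: fetch a PDF, convert it via GROBID, ship the TEI
    XML onward. Everything stays in memory; nothing hits local disk."""
    pdf = get(pdf_url, timeout=60)
    pdf.raise_for_status()
    tei = post(grobid_url, files={"input": pdf.content}, timeout=300)
    tei.raise_for_status()
    out = post(sink_url, data=tei.text.encode("utf-8"), timeout=60)
    out.raise_for_status()
    return len(tei.text)
```

Running many such synchronous processes in parallel is what lets the GROBID server's own thread pool, rather than the clients, set the effective concurrency.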

org.grobid.max.connections=40
org.grobid.pool.max.wait=1
grobid.temp.path=/run/grobid/tmp
org.grobid.service.is.parallel.execution=true (default)

As an environment variable: TMPDIR=/run/grobid/tmp/

Process called as: ./gradlew run --gradle-user-home .
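Putting those pieces together, the launch might look roughly like this (a sketch, assuming /run is tmpfs as on most modern Linux distributions, so pdf2xml scratch files never touch the slow disk):

```shell
# Keep GROBID/pdf2xml scratch space off the spinning disk
mkdir -p /run/grobid/tmp
export TMPDIR=/run/grobid/tmp/

# Launch the service from the GROBID checkout
./gradlew run --gradle-user-home .
```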

File logging is disabled; console logging (at WARN level) goes through syslog and does end up on disk. With a slow spinning disk, pdftoxml could be impacting performance, but it doesn't seem too bad.

In aggregate, it takes about 3 core-seconds per PDF to do fulltext extraction, and we can do a million PDFs in about 28 hours.
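Those numbers are internally consistent, assuming the cost is about 3 core-seconds per PDF on the 30-core box; a quick back-of-the-envelope check:

```python
cores = 30
core_seconds_per_pdf = 3
pdfs = 1_000_000

# total core-seconds of work, spread over all cores, converted to hours
wall_hours = pdfs * core_seconds_per_pdf / cores / 3600
print(round(wall_hours, 1))  # ~27.8 hours of wall time
```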
