
Question - what is a reasonable box configuration for a GROBID REST endpoint #349

Open
curtkohler opened this issue Sep 24, 2018 · 2 comments
question There's no such thing as a stupid question


@curtkohler

I've been trying to write a quick PDF->XML Spark conversion job leveraging GROBID. I read the PDFs remotely, and since the GROBID Java library doesn't support passing in byte arrays, I decided to spin up a separate GROBID server and send it REST transform requests instead of writing and reading temp files on the Spark nodes. While the code appears to work fine in simple testing, when I scale things up the GROBID REST server quickly bogs down: I see many Jetty timeouts, very long processing times, etc. The documentation doesn't really address the details of configuring the server, and there are a number of moving parts in play (Jetty, GROBID, pdf2xml, etc.), so I was hoping you might be able to provide some recommendations based on your experience.

For instance, assuming I have a virtual box with 8 virtual CPUs and 16GB of memory:
How many concurrent connections would you set in grobid.properties?
Would you lengthen the pool max wait?
Would you modify any Java settings for memory allocation?
Modify any of the Jetty settings?
Etc.

Thanks in advance for any insight you can provide.

@kermitt2
Owner

kermitt2 commented Sep 29, 2018

Hello @curtkohler

Thanks for the issue!

Your issue made me realize that a change introduced when moving to Dropwizard led to incorrect status codes in the service responses for PDF processing in versions 0.5.0 and 0.5.1 (I overlooked this change when merging; the services for text processing were not changed and are OK).

Normally GROBID sends a status 503 (service unavailable) when all its threads are in use, so that the client can wait a bit for a thread to become available before re-sending the query. This is how the service scales while avoiding mega-queues of PDF requests at the server. The change turned this case into a runtime error, and the resulting accumulation of queries is very likely the cause of these problems.
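A client can honor that 503 contract with a simple retry loop. Below is a minimal sketch in Python using `requests`, assuming the standard `/api/processFulltextDocument` endpoint on GROBID's default port 8070; the injectable `post` parameter is only there to make the function easy to exercise without a live server:

```python
import time
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def process_pdf(pdf_bytes, url=GROBID_URL, post=requests.post,
                max_retries=10, backoff=5.0):
    """POST a PDF to GROBID; on 503 (all threads busy), wait and retry."""
    for _ in range(max_retries):
        resp = post(url, files={"input": pdf_bytes}, timeout=300)
        if resp.status_code == 503:
            time.sleep(backoff)  # server is saturated: back off, don't queue
            continue
        resp.raise_for_status()
        return resp.text  # TEI XML
    raise RuntimeError("GROBID still busy after %d retries" % max_retries)
```

The key point is that the client, not the server, holds the backlog: a 503 means "try again later", not "error".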

I've corrected this in the current master version of GROBID. In addition, I completed the documentation:
https://grobid.readthedocs.io/en/latest/Grobid-service/
It now describes the response status codes for each service. I also added some more explanation here:
https://grobid.readthedocs.io/en/latest/Grobid-service/#parallel-mode

In addition, I've written 3 clients that use the service in the intended, scalable way:

Normally, with an 8-CPU machine you should get good performance with the default settings of GROBID and of these clients; you don't need to increase the size of the thread pool or the max wait. 16GB of memory is enough to exploit all your available threads.

In the coming weeks I will perform larger-scale tests with these clients (with millions of PDFs) and I will report whether everything is fine.

@kermitt2 kermitt2 self-assigned this Sep 29, 2018
@lfoppiano lfoppiano added the question There's no such thing as a stupid question label Oct 17, 2018
@bnewbold
Contributor

It's not "official", but I can describe a box setup that has been working OK for some time:

Single large worker host, running in a virtual machine: 30 cores at 2.1 GHz, 50 GB RAM, slow spinning disk (not SSD).
Python workers GET (HTTP) PDFs from remote storage and POST them to the GROBID worker, then POST the XML response elsewhere, so that as little as possible hits local disk. We run 50 Python worker processes (synchronous, single-threaded; there is plenty of RAM to do so). We are not using the (new) Python library, just requests.
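As a sketch, one iteration of such a worker might look like the following; the URLs and the sink endpoint are placeholders, and the injectable `get`/`post` parameters exist only to keep the function testable without network access:

```python
import requests

def convert_one(pdf_url, grobid_url, sink_url,
                get=requests.get, post=requests.post):
    """One worker step: fetch a PDF, convert it via GROBID, ship the TEI
    XML onward. Everything stays in memory; nothing hits local disk."""
    pdf = get(pdf_url, timeout=60)
    pdf.raise_for_status()
    tei = post(grobid_url, files={"input": pdf.content}, timeout=300)
    tei.raise_for_status()
    out = post(sink_url, data=tei.text.encode("utf-8"), timeout=60)
    out.raise_for_status()
    return len(tei.text)
```

Running many such synchronous processes in parallel is what lets the GROBID server's own thread pool, rather than the clients, set the effective concurrency.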

org.grobid.max.connections=40
org.grobid.pool.max.wait=1
grobid.temp.path=/run/grobid/tmp
org.grobid.service.is.parallel.execution=true (default)

As an environment variable: TMPDIR=/run/grobid/tmp/

Process called as: ./gradlew run --gradle-user-home .
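Putting those pieces together, the launch might look roughly like this (a sketch, assuming /run is tmpfs as on most modern Linux distributions, so pdf2xml scratch files never touch the slow disk):

```shell
# Keep GROBID/pdf2xml scratch space off the spinning disk
mkdir -p /run/grobid/tmp
export TMPDIR=/run/grobid/tmp/

# Launch the service from the GROBID checkout
./gradlew run --gradle-user-home .
```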

File logging is disabled; console logging (at WARN level) goes through syslog and does end up on disk. With a slow spinning disk, pdftoxml could be impacting performance, but it doesn't seem too bad.

In aggregate, it takes about 3 core-seconds per PDF to do fulltext extraction, and we can do a million PDFs in about 28 hours.
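Those numbers are internally consistent, assuming the cost is about 3 core-seconds per PDF on the 30-core box; a quick back-of-the-envelope check:

```python
cores = 30
core_seconds_per_pdf = 3
pdfs = 1_000_000

# total core-seconds of work, spread over all cores, converted to hours
wall_hours = pdfs * core_seconds_per_pdf / cores / 3600
print(round(wall_hours, 1))  # ~27.8 hours of wall time
```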
