Question - what is a reasonable box configuration for a GROBID REST endpoint #349
Hello @curtkohler, thanks for the issue! It made me realize that a change made when moving to Dropwizard led to incorrect status codes in the service responses for PDF processing in versions 0.5.0 and 0.5.1 (I overlooked this change when merging; the services for text processing were not changed and are OK). Normally GROBID sends a status [...] I've corrected this in the current master version of GROBID. I have also completed the documentation: [...] In addition, I've written 3 clients that use the service in the intended scalable way:
Normally, on a machine with 8 CPUs, you should get good performance with the default settings of GROBID and of these clients; you don't need to increase the size of the thread pool or the max wait. 16 GB of memory is enough to exploit all your available threads. In the coming weeks, I will run larger-scale tests with these clients (with millions of PDFs) and report whether everything is fine.
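To make the "scalable way" of calling the service more concrete, here is a minimal Python sketch of a client that keeps a bounded number of requests in flight and backs off when the server reports that it is busy. It is not one of the clients mentioned above. It assumes the service answers HTTP 503 when all of its threads are in use (the behaviour described in the GROBID service documentation), that the server listens on `localhost:8070`, and that the PDF is sent in a multipart field named `input`. The helper names (`post_pdf`, `process_with_retry`, `process_all`), the worker count and the wait time are all illustrative choices.

```python
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed host and port; adjust to wherever the GROBID service runs.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def post_pdf(pdf_bytes, url=GROBID_URL, timeout=120):
    """POST one PDF as multipart/form-data and return (status, body)."""
    boundary = "----grobid-sketch-boundary"  # assumed not to occur in the PDF
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="input"; filename="doc.pdf"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    req = urllib.request.Request(
        url,
        data=head + pdf_bytes + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # Non-2xx answers (including 503) arrive as exceptions in urllib.
        return err.code, err.read()


def process_with_retry(pdf_bytes, send=post_pdf, max_tries=10,
                       wait_s=5.0, sleep=time.sleep):
    """Resend a PDF while the server answers 503 (assumed to mean 'busy')."""
    for attempt in range(max_tries):
        status, body = send(pdf_bytes)
        if status != 503:
            return status, body
        if attempt < max_tries - 1:
            sleep(wait_s)  # give the server time to free a thread
    return status, body


def process_all(pdfs, workers=8, send=post_pdf):
    """Process PDFs with at most `workers` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: process_with_retry(p, send=send), pdfs))
```

Keeping the client-side concurrency close to the number of server threads, and waiting rather than piling on when the server is busy, is the idea behind this sketch; the right values depend on the machine.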
It's not "official", but I can describe a box setup that has been working OK for some time: a single large worker host, running in a virtual machine: 30 cores, 2.1 GHz, 50 GByte RAM, slow spinning disk (not SSD).
As an environment variable: [...] Process called as: [...] File logging is disabled; console logging (at WARN level) goes through syslog and does end up on disk. With a slow spinning disk, the fact that [...] In aggregate, it takes about 3 core-seconds per PDF to do fulltext extraction, and we can do a million PDFs in about 28 hours.
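The aggregate figures in this comment can be cross-checked with a little arithmetic: 30 cores kept busy for 28 hours over a million PDFs comes to roughly 3 core-seconds per PDF. The sketch below only restates the numbers given in the comment and assumes all 30 cores are fully used for the whole run.

```python
cores = 30           # cores on the worker host
hours = 28           # wall-clock time for the whole batch
pdfs = 1_000_000     # number of PDFs processed

wall_seconds = hours * 3600
core_seconds_per_pdf = cores * wall_seconds / pdfs
pdfs_per_second = pdfs / wall_seconds

print(f"{core_seconds_per_pdf:.2f} core-seconds per PDF")
print(f"{pdfs_per_second:.2f} PDFs per second overall")
```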
I've been trying to write a quick PDF->XML Spark conversion job leveraging GROBID. I am reading the PDFs remotely, and since the GROBID Java library doesn't support passing in byte arrays, I decided to spin up a separate server with GROBID that I can send REST transform requests to, instead of writing and reading tmp files on the Spark nodes. While the code appears to work fine during simple testing, when scaling things up the GROBID REST server quickly bogs down, and I see many Jetty timeouts, very long processing times, etc. The documentation doesn't address many details about configuring the server, and there are a number of moving parts in play (Jetty, GROBID, pdf2xml, etc.), so I was hoping you might be able to provide some recommendations based on your experience.
For instance, assuming I have a virtual box with 8 virtual CPUs and 16GB of memory:
How many concurrent connections would you set in grobid.properties?
Would you lengthen pool max wait?
Would you modify any Java settings for memory allocation?
Modify any of the Jetty settings?
Etc.
Thanks in advance for any insight you can provide.