WIP: Update python wrapper to use gunicorn #684
Conversation
It would be worth adding the flag to the s2i wrapper as well. The reason why it is important to be able to change the number of workers is that there are situations where you would only want to have 1 worker per pod, for example tasks that have heavy resource usage, such as memory. If you have an ML graph that takes 50GB of RAM, you may want your scaling to be horizontal, since if you request 50GB for the pods, users may be confused that their pods keep going over their requests.
@axsaucedo You can set the env variable GUNICORN_WORKERS to limit the number of workers, so I don't think an s2i change is needed.
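For reference, a minimal sketch of how a wrapper can honour that variable using gunicorn's custom-application hook (the trivial Flask app here just stands in for the model microservice; this is illustrative, not the actual Seldon wrapper code):

```python
import os

from flask import Flask
from gunicorn.app.base import BaseApplication

app = Flask(__name__)


@app.route("/health")
def health():
    return "ok"


class WrapperApplication(BaseApplication):
    """Run a WSGI app under gunicorn with programmatic options."""

    def __init__(self, wsgi_app, options=None):
        self.options = options or {}
        self.wsgi_app = wsgi_app
        super().__init__()

    def load_config(self):
        # Copy recognised options into gunicorn's configuration.
        for key, value in self.options.items():
            if key in self.cfg.settings and value is not None:
                self.cfg.set(key, value)

    def load(self):
        return self.wsgi_app


if __name__ == "__main__":
    # Fall back to 4 workers when GUNICORN_WORKERS is not set.
    workers = int(os.environ.get("GUNICORN_WORKERS", "4"))
    WrapperApplication(app, {"bind": "0.0.0.0:5000", "workers": workers}).run()
```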
Ohh my bad, I had not seen the GUNICORN_WORKERS environment variable.
I've been evaluating Seldon this week and this was on my list, nice job, I'm guessing this will land well within our time frame! Not sure if this is the place to request this, but it would be great to include an option to choose the worker type, as some of our models go out to a feature store at inference time to enrich input requests, and we'd want to handle that I/O asynchronously. PS: Apologies if there is already a different pattern for this!
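For context, choosing the worker type in gunicorn comes down to the `worker_class` setting; a rough sketch of what an overridable option could look like (the GUNICORN_WORKER_CLASS variable here is hypothetical and not part of this PR):

```python
import os

# Hypothetical environment variable, shown only to illustrate the request.
options = {
    "workers": int(os.environ.get("GUNICORN_WORKERS", "4")),
    # Async worker classes such as "gevent" let one worker overlap many
    # outbound feature-store calls instead of blocking per request
    # (requires the gevent package to be installed in the image).
    "worker_class": os.environ.get("GUNICORN_WORKER_CLASS", "sync"),
}
```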
Digging in some more and looking at the server implementations: they load the models into memory on the first prediction. This happens post process fork, resulting in the model being loaded into memory as many times as there are workers. If you also expose the preload_app option, the application (and model) can be loaded once in the master before forking. This does have some downsides, as it's incompatible with gunicorn's code-reloading option.
Very interesting @alexlatchford, that's one of the challenges I outlined above; the preload_app functionality would certainly allow the workers to load only one copy of the model, reducing the memory footprint. Of course there is still a challenge if the extracted features are large, but that is a separate problem. I am not sure how much work would be involved to extend the wrapper to support this, definitely worth checking out, but depending on that we may have to open a separate issue once the initial/basic move to gunicorn has landed.
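To make the memory trade-off concrete, a sketch of where the two loading strategies differ (the helper `load_weights_from_disk` is a placeholder, not a real API):

```python
import os


class MyModel:
    def __init__(self):
        # Anything loaded here runs before gunicorn forks when preload_app is
        # enabled, so the workers share it copy-on-write.
        self.model = None

    def load(self):
        # load() runs after the fork, once per worker: an N-worker pod ends up
        # holding N copies of whatever is loaded here.
        print("Loading model in worker", os.getpid())
        self.model = load_weights_from_disk()  # hypothetical helper

    def predict(self, X, feature_names):
        if self.model is None:
            self.load()
        return self.model.predict(X)
```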
```python
self.class_names = ["class:{}".format(str(i)) for i in range(10)]

def load(self):
    print("Loading model", os.getpid())
```
We could use log.debug as convention, not critical tho
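A minimal sketch of what that convention would look like (class name and logger setup are illustrative):

```python
import logging
import os

log = logging.getLogger(__name__)


class MyModel:
    def load(self):
        # Same hook as above, routed through the logger instead of print.
        log.debug("Loading model, pid=%s", os.getpid())
```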
### Gunicorn workers
By default, 4 gunicorn workers will be created for your model. If you wish to change this default, add the environment variable GUNICORN_WORKERS to the container for your model. An example is shown below:
If you are using gunicorn workers and make this variable overridable, you probably also need to consider (1) the `preload` option to reduce memory usage, and (2) `max_requests` (probably along with `max_requests_jitter`) to restart the workers and avoid memory leaks. These are options on the gunicorn command line, but I guess there should be corresponding APIs that provide them too.
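For illustration, all three are standard gunicorn settings and could be passed programmatically alongside `workers` (the values below are placeholders; none of these knobs are wired up by this PR):

```python
options = {
    "workers": 4,
    # Load the application (and model) once in the master so forked workers
    # share it copy-on-write.
    "preload_app": True,
    # Recycle each worker after ~1000 requests to bound slow memory leaks;
    # the jitter staggers restarts so workers don't all recycle at once.
    "max_requests": 1000,
    "max_requests_jitter": 100,
}
```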
I think we would need to make preload optional, as I'm not sure it will work for TensorFlow graphs that need a separate session in each fork.
- `--workers` defaults to 4 (gunicorn is only used if workers > 1)
- `load()` method added to the user_model prototype; this is called after init in each worker

Fixes #453
Fixes #383
Fixes #674
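A rough sketch of the flag handling described above, under the assumption that the wrapper builds a Flask app and falls back to Flask's own server for a single worker (the helper names are made up for illustration):

```python
import argparse
import os


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--workers",
        type=int,
        default=int(os.environ.get("GUNICORN_WORKERS", "4")),
    )
    args = parser.parse_args()

    app = build_app()  # hypothetical factory for the wrapper's Flask app

    if args.workers > 1:
        # Hand the app to gunicorn, e.g. via a BaseApplication subclass
        # like the sketch earlier in this thread.
        run_under_gunicorn(app, workers=args.workers)  # hypothetical helper
    else:
        # Single worker: plain Flask server, same behaviour as before this PR.
        app.run(host="0.0.0.0", port=5000)


if __name__ == "__main__":
    main()
```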