Reuse Knative Eventing containers in order to keep GPU/CPU/Memory state #8310

Open
milo157 opened this issue Nov 7, 2024 · 2 comments

milo157 commented Nov 7, 2024

Problem
For long-running tasks, containers can't be reused. This seems to be because Knative Eventing creates Jobs rather than Pods. It would be great to reuse event containers (whether Jobs or Pods) instead of creating a new Job for each task.

The use case is jobs with a high initialisation time, e.g. loading LLMs that take minutes to load into GPU memory and then take a long time to process data.

Persona:
Which persona is this feature for?
Event consumer

Exit Criteria
A measurable (binary) test that would indicate that the problem has been resolved.

Time Estimate (optional):
How many developer-days do you think this may take to resolve?
Unclear

Additional context (optional)
Add any other context about the feature request here.

Contributor

skonto commented Nov 7, 2024

Hi @milo157, could you elaborate on your use case? For example, what do you use Eventing for, e.g. feeding events for inference? Could you describe the architecture a bit? Your request is basically to process more than one event per job/pod (right now each event creates one job) and to reuse existing jobs/pods meant for the same group of events, is that right?

Author

milo157 commented Nov 8, 2024

It is for long-running data processing/inference tasks.

We have an application that takes 2-3 minutes to load various ML models into GPU memory. Once they are loaded, we send an event to be processed. Processing could take a few seconds, a few minutes, or a few hours, and we would like to know the status of the task at various points and get logs.

Once an event finishes processing, we would like to reuse that container, since it has already spent the 2-3 minutes loading the models; we want to skip that step for efficiency and cost reasons. Currently, once an event finishes and we send a new event, a new job is created on a new container and we have to wait another 2-3 minutes for the models to load.

Of course, if we send a task while all existing containers/jobs are busy, a new one should be started.

To answer your question: yes, re-using existing jobs/pods would be for the same group of events.
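
To make the requested behaviour concrete, here is a minimal sketch of the dispatch rule described above: reuse an idle worker that has already loaded its models, and start a new one only when every existing worker is busy. This is not Knative's API or how Eventing is implemented; `WarmWorker`, `WorkerPool`, and the `load_seconds` parameter are hypothetical names used only for illustration, and the model load is simulated with a sleep.

```python
import threading
import time


class WarmWorker:
    """Hypothetical stand-in for a container that has already loaded its models."""

    def __init__(self, load_seconds=0.0):
        # Simulates the expensive one-time initialisation (2-3 minutes in the issue).
        time.sleep(load_seconds)
        self.busy = False
        self.handled = []

    def process(self, event):
        # Placeholder for real inference; the models are already in memory here.
        self.handled.append(event["id"])
        return {"id": event["id"], "status": "done"}


class WorkerPool:
    """Reuses an idle warm worker; starts a new one only when all are busy."""

    def __init__(self, load_seconds=0.0):
        self._lock = threading.Lock()
        self._load_seconds = load_seconds
        self.workers = []

    def acquire(self):
        with self._lock:
            # Prefer a worker that has already paid the load cost.
            for worker in self.workers:
                if not worker.busy:
                    worker.busy = True
                    return worker
            # Cold start: no idle worker, so pay the load cost once more.
            # (Loading under the lock keeps the sketch short; a real system
            # would start the new worker without blocking other requests.)
            worker = WarmWorker(self._load_seconds)
            worker.busy = True
            self.workers.append(worker)
            return worker

    def release(self, worker):
        with self._lock:
            worker.busy = False

    def dispatch(self, event):
        # Run one event on a warm worker and return it to the pool afterwards.
        worker = self.acquire()
        try:
            return worker.process(event)
        finally:
            self.release(worker)
```

With this rule, two events sent one after the other land on the same worker and the models are loaded once; a second worker is created only if an event arrives while the first is still busy.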
