Agent QOS #1378

psFried · 2024-02-14T12:58:52Z

psFried
Feb 14, 2024
Maintainer

We've observed significant backlogs in the agent due to the high volume of auto-discover jobs. This severely hurts the user experience: because the agent processes jobs sequentially, users sometimes have to wait for it to work through 30+ auto-discovers before it even starts on their interactive discover. So let's talk about what we can do to fix this.

The most obvious and straightforward solution seems to be introducing a prioritization mechanism within the agent, so that it processes interactive jobs before any automated background jobs.

The main issue with that approach stems from how the agent processes jobs. Each time a job row is inserted or updated, the agent sends a HandlerInvocation onto an unbounded channel. There's one channel for all job types, and we process each HandlerInvocation in the order it was received. The HandlerInvocation doesn't contain any information about the job itself. It only has the name of the job table (e.g. publications). So there's no direct way to prioritize one HandlerInvocation over another.

Put another way: It would be easy to update the dequeue function for each handler to have it prioritize interactive jobs over automated ones. But it wouldn't help us to prioritize an interactive publications job over an automated discovers job. If n background discovers jobs are created, followed by 1 interactive publications job, then all n background discovers jobs would still be processed before the interactive publications job.

The behavior we want is for all interactive jobs, of all types, to be processed before any background jobs.

One possible path forward would be to update the event payload that's sent with NOTIFY to include a boolean indicating whether it's a background job. Currently these notifications only contain the table name. Adding a background boolean would allow us to put HandlerInvocations in a sorted data structure and handle the interactive ones before the background ones.
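To make the "sorted data structure" idea concrete, here is a minimal sketch, not the agent's actual code. `Invocation`, `InvocationQueue`, and the `seq` field are hypothetical names; the only thing taken from the proposal is that each notification carries the table name plus a `background` boolean.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-in for HandlerInvocation, extended with the proposed
/// `background` flag. Field order matters: the derived `Ord` compares
/// `background` first, then `seq`, then `table`.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct Invocation {
    /// `false` sorts before `true`, so interactive work comes first.
    pub background: bool,
    /// Arrival order (an assumption), keeps FIFO order within a class.
    pub seq: u64,
    /// Name of the job table, e.g. "publications".
    pub table: String,
}

/// Min-heap of invocations: interactive first, then by arrival order.
#[derive(Default)]
pub struct InvocationQueue {
    heap: BinaryHeap<Reverse<Invocation>>,
    next_seq: u64,
}

impl InvocationQueue {
    /// Record a notification for `table`.
    pub fn push(&mut self, table: &str, background: bool) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Reverse(Invocation {
            background,
            seq,
            table: table.to_string(),
        }));
    }

    /// Take the highest-priority invocation, if any.
    pub fn pop(&mut self) -> Option<Invocation> {
        self.heap.pop().map(|Reverse(inv)| inv)
    }
}
```

With this ordering, an interactive publications notification pushed after several background discovers notifications is still popped first, and notifications within the same class keep their arrival order.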

That's not quite as straightforward as it sounds. For one thing, we still use polling, so we'd need to somehow account for the background flag there too. Perhaps each poll of a handler could actually be two separate polls, one that allows background jobs, and another that does not. The other issue is that we currently execute pg_notify for every update of a job table, even those that update it to a terminal status. I expect we'll want to update that logic to only notify for jobs where job_status->>'type' = 'queued'.

At this point, I'm reasonably confident that we can significantly improve the latency of handling interactive jobs with this approach. But overall, our job handling still feels pretty gross and unwieldy. I don't like how we can get notified about one job and then dequeue a different job. And there's still the issue of how we hold locks for a long time when processing publications, which seems like it will require a slightly different approach for how we handle jobs. So I'm going to spend a little time thinking about job handling more holistically, and see if there's a way to solve both of those issues at once. If I don't think of anything better soon, then we can just do the relatively quick and dirty thing described above.

jgraettinger · 2024-02-14T16:47:46Z

jgraettinger
Feb 14, 2024
Maintainer

What happens if we reinterpret channel/notify to simply be something like a Rust Waker?

Its job isn't to carry job details -- it just awakens agents to poll again. Ideally it awakens a single idle one. Maybe it awakens all idle ones and they race, which would probably still be okay.
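A rough sketch of that "notify is only a wake-up" framing, under loud assumptions: `WakeFlag` and `run_if_woken` are hypothetical names, the queue is an in-memory stand-in for the job tables, and a real agent would presumably use an async notification primitive rather than a bare atomic flag. The point is only that the signal carries no job details and that repeated notifications coalesce.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical wake-up flag: it says "poll again" and nothing else.
#[derive(Default)]
pub struct WakeFlag(AtomicBool);

impl WakeFlag {
    /// Called for every notification; repeated calls coalesce into one wake-up.
    pub fn notify(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    /// Consume a pending wake-up, returning whether one was set.
    pub fn take(&self) -> bool {
        self.0.swap(false, Ordering::SeqCst)
    }
}

/// If a wake-up is pending, poll the queue until it is empty.
/// Returns the ids of the jobs that were handled.
pub fn run_if_woken(flag: &WakeFlag, queue: &mut VecDeque<u64>) -> Vec<u64> {
    let mut handled = Vec::new();
    if flag.take() {
        // What to process is decided by the dequeue, not by the notification.
        while let Some(job) = queue.pop_front() {
            handled.push(job);
        }
    }
    handled
}
```

Because the flag holds no payload, any prioritization would have to live entirely in the dequeue step.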

5 replies

psFried Feb 14, 2024
Maintainer Author

🤔 At least with the current handlers::serve code, we'd still run into this issue:

> ... But it wouldn't help us to prioritize an interactive publications job over an automated discovers job. If n background discovers jobs are created, followed by 1 interactive publications job, then all n background discovers jobs would still be processed before the interactive publications job.
>
> The behavior we want is for all interactive jobs, of all types, to be processed before any background jobs.

I'm fairly confident that there's a re-framing of the handlers::serve implementation that could make this work. I think it would involve differentiating between background and interactive jobs in the Handler::handle function, which would need to be threaded through to the dequeue functions. This seems acceptable to me, but I'm still working through a POC.

jgraettinger Feb 14, 2024
Maintainer

Agreed; an unstated assumption, which should have been stated, is that we'd need to update the dequeue queries to understand "interactive" vs "background" jobs so that they could order them accordingly. Where today they order only on the job id (which encodes wall time) and take the first, tomorrow they might order on (interactive-before-background, id).

I was just getting at "do we need to have the channel/notify stuff also be in the loop?"
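For illustration, the (interactive-before-background, id) ordering can be sketched in memory like this. `QueuedJob` and `next_job` are hypothetical; the real change would be in the ordering of the SQL dequeue queries, and the `background` field is an assumed flag on the job row.

```rust
/// Hypothetical in-memory view of a queued job row.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct QueuedJob {
    /// Job ids encode wall time, so a smaller id means an older job.
    pub id: u64,
    /// Assumed flag marking automated background work.
    pub background: bool,
}

/// Pick the next job: interactive before background, then oldest id first.
/// The tuple key works because `false < true` for `bool`.
pub fn next_job(queued: &[QueuedJob]) -> Option<&QueuedJob> {
    queued.iter().min_by_key(|job| (job.background, job.id))
}
```

Note that this only orders jobs within a single table; it does nothing by itself about ordering across job types.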

jgraettinger Feb 14, 2024
Maintainer

Right, and I suppose there's a further question of ordering across job types. If there are a bunch of queued discovers and publications, how do we ensure QoS for interactive jobs of either type?

Which -- as I recall, which isn't super well -- probably still breaks the proposal, because we may still starve out an interactive publication while we're busy with a bunch of background discovers.

psFried Feb 14, 2024
Maintainer Author

> Right, and I suppose there's a further question of ordering across job types. If there are a bunch of queued discovers and publications, how do we ensure QoS for interactive jobs of either type?

Yeah, this is what I was trying to say. I'm currently working on an approach that seems promising:

The channel/notify just adds a Status::PollInteractive value to a hashmap, keyed by job type, to indicate that the job type potentially has interactive jobs waiting to be processed.
Loop through all the handlers with Status::PollInteractive, and handle an interactive job for each.
If a handler returns HandlerStatus::Idle, then update its status to Status::PollBackground.
Once there are no more handlers with Status::PollInteractive, loop through and handle a background job for each one with Status::PollBackground.
This time, if a handler returns HandlerStatus::Idle, we can set its status to Status::Idle, to indicate that it no longer needs to be polled at all.
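The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the draft PR: `FakeHandler` and `serve` are hypothetical, the handlers are in-memory queues, every table starts out as `Status::PollInteractive` as if a notification had just arrived for it, and no new notifications arrive while the loop runs.

```rust
use std::collections::{BTreeMap, VecDeque};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status { PollInteractive, PollBackground, Idle }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HandlerStatus { Active, Idle }

/// Hypothetical in-memory handler with one queue per priority class.
#[derive(Default)]
pub struct FakeHandler {
    pub interactive: VecDeque<u64>,
    pub background: VecDeque<u64>,
}

impl FakeHandler {
    /// Handle at most one job of the requested class, recording it in `log`.
    fn handle(&mut self, table: &str, background: bool, log: &mut Vec<String>) -> HandlerStatus {
        let queue = if background { &mut self.background } else { &mut self.interactive };
        match queue.pop_front() {
            Some(id) => {
                log.push(format!("{table}:{id}"));
                HandlerStatus::Active
            }
            None => HandlerStatus::Idle,
        }
    }
}

/// Drain all handlers, never running background work while any table may
/// still have interactive work. Returns the order in which jobs were handled.
pub fn serve(handlers: &mut BTreeMap<String, FakeHandler>) -> Vec<String> {
    let mut log = Vec::new();
    let mut status: BTreeMap<String, Status> = handlers
        .keys()
        .map(|table| (table.clone(), Status::PollInteractive))
        .collect();

    loop {
        let interactive_pending = status.values().any(|s| *s == Status::PollInteractive);
        // Choose which class to poll this round, and what an idle handler becomes.
        let (want, background, when_idle) = if interactive_pending {
            (Status::PollInteractive, false, Status::PollBackground)
        } else {
            (Status::PollBackground, true, Status::Idle)
        };
        let tables: Vec<String> = status
            .iter()
            .filter(|(_, s)| **s == want)
            .map(|(table, _)| table.clone())
            .collect();
        if tables.is_empty() {
            return log; // every handler has reached Status::Idle
        }
        for table in tables {
            let handler = handlers.get_mut(&table).expect("status keys mirror handler keys");
            if handler.handle(&table, background, &mut log) == HandlerStatus::Idle {
                status.insert(table, when_idle);
            }
        }
    }
}
```

In a long-running agent, a notification arriving mid-loop would presumably reset that table to `Status::PollInteractive`, so the next round would return to interactive work before continuing with background jobs.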

The code for this seems to be going well so far, so I'm hopeful that I'll have it working soon.

psFried Feb 14, 2024
Maintainer Author

Ok this seems like it ought to pretty much work. I opened #1379 as a draft, in case you want to get a better idea of what the code is looking like. I still need to actually update all the dequeue functions there, but I think the logic in handlers::new_serve should more or less do what we want.
