
Investigate using a priority queue for the general queue server actions #423

Closed
redshiftzero opened this issue Jun 13, 2019 · 8 comments · Fixed by #486

redshiftzero (Contributor) commented Jun 13, 2019

In #400 we discussed the advantages of having the ApiJobQueue be the single object that contains auth state (i.e. the SDK API object) and mediates all network requests. Using the queue this way also means we wouldn't duplicate logic such as retries of API requests. However, we want to make sure some actions are speedy and don't get queued behind less important ones. For example, deletion should happen quickly because the UI only updates once the action completes server-side. To do so, we'd want to either:

  1. Add additional RunnableQueues (we get concurrency for free).
  2. Make the general queue a priority queue.

Note that either way we need some method (probably on ApiJobQueue) for pausing all queues. For example, after #421 is merged, a raised RequestTimeoutError means the network is down and we should pause all network activity. In either scenario we'd have multiple queues (unless we subsume the file download queue into this priority queue).

The ticket here is to do an investigatory spike on making RunnableQueue use a PriorityQueue, to uncover any issues not discussed here. One factor simplifying the ask: Python's queue.PriorityQueue has the same methods as queue.Queue (what we use now).

We could run jobs in this priority (this is a proposal, feel free to disagree):

  1. (top priority) Logout job that is prioritized over all other actions
  2. Metadata sync job that is prioritized high so that it happens on a predictable timescale (we enqueue one every 5 minutes and we know it executes first)
  3. New message and reply downloads (prioritized next since we want these to be snappy so the sync appears complete to the user)
  4. Deletion jobs
  5. Reply send jobs
  6. All other push actions
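
The proposed ordering could be sketched with Python's queue.PriorityQueue roughly as follows. The JobPriority enum and enqueue helper are hypothetical illustrations, not the client's actual classes; lower numbers sort first, matching PriorityQueue semantics:

```python
import itertools
from enum import IntEnum
from queue import PriorityQueue

# Hypothetical priority levels mirroring the proposed ordering
# (lower number = runs sooner).
class JobPriority(IntEnum):
    LOGOUT = 1
    METADATA_SYNC = 2
    MESSAGE_REPLY_DOWNLOAD = 3
    DELETION = 4
    REPLY_SEND = 5
    OTHER = 6

# A monotonic counter breaks ties so equal-priority jobs run FIFO.
_counter = itertools.count()

def enqueue(q: PriorityQueue, priority: JobPriority, job) -> None:
    q.put((priority, next(_counter), job))

q = PriorityQueue()
enqueue(q, JobPriority.DELETION, "delete source")
enqueue(q, JobPriority.LOGOUT, "logout")
enqueue(q, JobPriority.DELETION, "delete reply")

# Jobs come out highest-priority first, FIFO within a priority.
order = [q.get()[2] for _ in range(3)]
# order == ["logout", "delete source", "delete reply"]
```

The counter matters because PriorityQueue compares tuples element by element; without it, two jobs with equal priority would be compared by the job objects themselves.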
eloquence (Member) commented:

(Given the security implications, we've agreed discussion & investigation of this issue will continue during the 6/12-6/26 sprint, with potential implementation work kicking off next sprint.)

@eloquence eloquence added this to the 0.2.0alpha milestone Jul 2, 2019
@redshiftzero redshiftzero self-assigned this Jul 18, 2019
redshiftzero (Author) commented:

wip in spike-priority-queue

redshiftzero (Author) commented:

A minimal implementation is now in that branch (tests need updating), except for how to handle RunnableQueue.last_job. In this comment I'll write out my expectation of the last_job behavior in the priority-queue case, in case any interested observers disagree. To recap: with the current queue architecture in master, keeping track of the last job is important for the following case:

  1. Job A is added to queue. It starts running.
  2. Job A times out.
  3. Once Pausing the queue (#443) is implemented, we'll pause the queues at this point. The queue keeps track of Job A in last_job.
  4. Network resumes, user clicks retry.
  5. last_job (Job A) will get executed first.

If we did not have this last job, one of two things would happen:

  • We lose track of the job (i.e. it is silently dropped).
  • To prevent losing track of the job, we could resubmit it to the queue, which would result in out-of-order execution. This is also unacceptable, for example, when the user has submitted multiple reply jobs to the same source.

After we make the move to priority queues, we now have the following situation:

  1. Job A with priority 2 is added to queue. It starts running.
  2. Job B with priority 1 is added to the queue. (priority(B) > priority(A))
  3. Job A times out.
  4. Once Pausing the queue (#443) is implemented, we'll pause the queues at this point. The queue keeps track of Job A in last_job.
  5. Network resumes, user clicks retry.
  6. Job B should be executed before last_job (Job A) since its priority is higher.
  7. Then Job A should resume.

sssoleileraaa (Contributor) commented:

This summarizes our queue conversation really well, and I don't think you missed anything. As far as implementation goes, I just have one question about how we're going to modify last_job in order to avoid silently dropping Job A (in your example above) if Job B ends up timing out in step 6. If Job B times out, won't last_job be set to Job B, which would result in replacing Job A?

Instead of using last_job as a way to prioritize the job that failed last because of a timeout error, we could instead bump it back to the queue with a priority of 2. So the situation you described above would become (my modification is in step 4):

  1. Job A with priority 2 is added to queue. It starts running.
  2. Job B with priority 1 is added to the queue. (priority(B) > priority(A))
  3. Job A times out.
  4. Once Pausing the queue (#443) is implemented, we'll pause the queues at this point. The queue pushes Job A back into the queue with a high priority of 2.
  5. Network resumes, user clicks retry.
  6. Job B should be executed before last_job (Job A) since its priority is higher.
  7. Then Job A should resume.

redshiftzero (Author) commented:

The issue is that we can also have this situation:

  1. Job A with priority 2 is added to the queue. It starts running.
  2. Job C with priority 2 is added to the queue.
  3. Job D with priority 2 is added to the queue. At this point the user is expecting the order of execution to be A -> C -> D.
  4. Job B with priority 1 is added to the queue. (priority(B) > priority(A))
  5. Job A times out.
  6. Once Pausing the queue (#443) is implemented, we'll pause the queues at this point. We have the queue push Job A back into the queue with a high priority of 2.
  7. Network resumes, user clicks retry. The order of execution will be C -> D -> A when it should be A -> C -> D (important for a scenario where the job type of A, C, D is replies for example).
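
The reordering problem above can be reproduced with plain (priority, counter, job) tuples: if the timed-out job is resubmitted with a fresh counter, it sorts behind the equal-priority jobs that were enqueued after it. This is an illustrative sketch, not the client's code:

```python
import itertools
from queue import PriorityQueue

counter = itertools.count()
q = PriorityQueue()

# A is dequeued and running (so not in the queue) when the timeout
# hits; C and D arrive at the same priority, then B at a higher one.
a_counter = next(counter)          # A's original counter value
q.put((2, next(counter), "C"))
q.put((2, next(counter), "D"))
q.put((1, next(counter), "B"))

# Naive retry: push A back with its priority but a *fresh* counter.
q.put((2, next(counter), "A"))

order = [q.get()[2] for _ in range(4)]
# order == ["B", "C", "D", "A"] -- A runs last among priority 2,
# but the user expected A -> C -> D
```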

What we could do to handle these situations is keep last_job and then, picking up at step 7 when the job is resumed:

  1. Network resumes, user clicks retry.
  2. We retrieve the highest priority job in the queue.
  3. If the retrieved job is higher priority than last_job, we run the retrieved job. Otherwise, we run last_job and store the retrieved job as the new last_job.

This results in the following job execution order for the resumed queue: B (priority(B) > priority(A) so we want this) -> A -> C -> D ✅
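
A minimal sketch of that resume logic, assuming (priority, counter, job) tuples where a lower priority number runs sooner; the function name and shape are illustrative, not the actual RunnableQueue code:

```python
from queue import PriorityQueue

def next_job_on_resume(q: PriorityQueue, last_job):
    """Pick the job to run first after a retry.

    last_job is the (priority, counter, job) tuple that timed out.
    Returns (job_to_run, new_last_job).
    """
    if q.empty():
        return last_job, None
    head = q.get()
    if last_job is None or head[0] < last_job[0]:
        # A strictly higher-priority job (e.g. B) jumps ahead of the
        # timed-out job; the timed-out job stays parked as last_job.
        return head, last_job
    # Otherwise rerun the timed-out job and park the job we pulled.
    return last_job, head
```

For the scenario above: with B (priority 1) at the head and A (priority 2) as last_job, B runs first and A stays parked; with C (priority 2) at the head, A runs first and C becomes the new last_job.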

redshiftzero (Author) commented:

Thinking about this a bit more: for unrelated reasons I already implemented a counter on jobs to ensure that jobs with equal priorities are processed in the order they are submitted. If we resubmit the timed-out job without modifying its counter, the execution order will be A -> C -> D, which is far simpler than using last_job 🎉
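
The counter trick can be sketched as follows (illustrative names, not the client's actual code): resubmitting the timed-out job with its original counter sorts it back into its original FIFO position among equal-priority jobs, so no last_job slot is needed:

```python
import itertools
from queue import PriorityQueue

counter = itertools.count()
q = PriorityQueue()

a_entry = (2, next(counter), "A")  # A is dequeued and running
q.put((2, next(counter), "C"))
q.put((2, next(counter), "D"))
q.put((1, next(counter), "B"))

# A times out: resubmit it with its *original* counter, so it sorts
# ahead of C and D (same priority, smaller counter).
q.put(a_entry)

order = [q.get()[2] for _ in range(4)]
# order == ["B", "A", "C", "D"] -- the order the issue wants
```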

sssoleileraaa (Contributor) commented:

Excellent, so the problem where we could potentially replace the previously failed job stored in last_job is solved because it would instead be sorted back into the correct place in the queue?

redshiftzero (Author) commented:

exactly - which means we don't need last_job anymore at all
