
Investigate using a priority queue for the general queue server actions #423

Closed
redshiftzero opened this issue Jun 13, 2019 · 8 comments · Fixed by #486

redshiftzero (Contributor) commented Jun 13, 2019

In #400 we discussed the advantages of having the ApiJobQueue be the single object that contains auth state (i.e. the SDK API object) and mediates all network requests. Using the queue this way also means we wouldn't duplicate logic such as retries of API requests. However, we want to make sure some actions are speedy and don't get queued behind less important ones. For example, deletion should happen quickly because the UI only updates once the action completes server-side. To do so, we'd want to either:

  1. Add additional RunnableQueues (we get concurrency for free).
  2. Make the general queue a priority queue.

Note that either way we need some method (probably on ApiJobQueue) for pausing all queues. For example, after #421 is merged, a raised RequestTimeoutError means the network is down and we should pause all network activity. In either scenario we'd have multiple queues (unless we subsume the file download queue into this priority queue).

The ticket here is to do an investigatory spike on making RunnableQueue use a PriorityQueue, to uncover any issues not discussed here. One factor simplifying the ask: Python's queue.PriorityQueue has the same methods as queue.Queue (what we use now).

We could run jobs in this priority (this is a proposal, feel free to disagree):

  1. (top priority) Logout job that is prioritized over all other actions
  2. Metadata sync job that is prioritized high so that it happens on a predictable timescale (we enqueue one every 5 minutes and we know it executes first)
  3. New message and reply downloads (prioritized next since we want these to be snappy so the sync appears complete to the user)
  4. Deletion jobs
  5. Reply send jobs
  6. All other push actions
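
The proposed ordering could be sketched with Python's queue.PriorityQueue roughly as follows. The JobPriority enum and enqueue helper are hypothetical illustrations, not the client's actual classes; lower numbers sort first, matching PriorityQueue semantics:

```python
import itertools
from enum import IntEnum
from queue import PriorityQueue

# Hypothetical priority levels mirroring the proposed ordering
# (lower number = runs sooner).
class JobPriority(IntEnum):
    LOGOUT = 1
    METADATA_SYNC = 2
    MESSAGE_REPLY_DOWNLOAD = 3
    DELETION = 4
    REPLY_SEND = 5
    OTHER = 6

# A monotonic counter breaks ties so equal-priority jobs run FIFO.
_counter = itertools.count()

def enqueue(q: PriorityQueue, priority: JobPriority, job) -> None:
    q.put((priority, next(_counter), job))

q = PriorityQueue()
enqueue(q, JobPriority.DELETION, "delete source")
enqueue(q, JobPriority.LOGOUT, "logout")
enqueue(q, JobPriority.DELETION, "delete reply")

# Jobs come out highest-priority first, FIFO within a priority.
order = [q.get()[2] for _ in range(3)]
# order == ["logout", "delete source", "delete reply"]
```

The counter matters because PriorityQueue compares tuples element by element; without it, two jobs with equal priority would be compared by the job objects themselves.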
eloquence (Member) commented:

(Given the security implications, we've agreed discussion & investigation of this issue will continue during the 6/12-6/26 sprint, with potential implementation work kicking off next sprint.)

@eloquence eloquence added this to the 0.2.0alpha milestone Jul 2, 2019
@redshiftzero redshiftzero self-assigned this Jul 18, 2019
redshiftzero (Author) commented:

wip in spike-priority-queue

redshiftzero (Author) commented:

A minimal implementation is now in that branch (tests need updating), except for how to handle RunnableQueue.last_job. In this comment I'll write out my expectation of the last_job behavior in the priority-queue case, in case any interested observers disagree. To recap: with the current queue architecture in master, keeping track of the last job is important for the following case:

  1. Job A is added to queue. It starts running.
  2. Job A times out.
  3. Once Pausing the queue (#443) is implemented, we'll pause the queues at this point. The queue keeps track of Job A in last_job.
  4. Network resumes, user clicks retry.
  5. last_job (Job A) will get executed first.

If we did not have this last job, one of two things would happen:

  • We lose track of the job (i.e. it is silently dropped).
  • To prevent losing track of the job, we could resubmit it to the queue, which would result in out-of-order execution. This is also unacceptable, for example, when the user has submitted multiple reply jobs to the same source.

After we make the move to priority queues, we now have the following situation:

  1. Job A with priority 2 is added to queue. It starts running.
  2. Job B with priority 1 is added to the queue. (priority(B) > priority(A))
  3. Job A times out.
  4. Once Pausing the queue (#443) is implemented, we'll pause the queues at this point. The queue keeps track of Job A in last_job.
  5. Network resumes, user clicks retry.
  6. Job B should be executed before last_job (Job A) since its priority is higher.
  7. Then Job A should resume.

sssoleileraaa (Contributor) commented:

This summarizes our queue conversation really well, and I don't think you missed anything. As far as implementation goes, I just have one question about how we're going to modify last_job in order to avoid silently dropping Job A (in your example above) if Job B ends up timing out in step 6. If Job B times out, won't last_job be set to Job B, which would result in replacing Job A?

Instead of using last_job as a way to prioritize the job that failed last because of a timeout error, we could instead bump it back to the queue with a priority of 2. So the situation you described above would become (my modification is in step 4):

  1. Job A with priority 2 is added to queue. It starts running.
  2. Job B with priority 1 is added to the queue. (priority(B) > priority(A))
  3. Job A times out.
  4. Once Pausing the queue (#443) is implemented, we'll pause the queues at this point. The queue pushes Job A back into the queue with a high priority of 2.
  5. Network resumes, user clicks retry.
  6. Job B should be executed before last_job (Job A) since its priority is higher.
  7. Then Job A should resume.

redshiftzero (Author) commented:

The issue is that we can also have this situation:

  1. Job A with priority 2 is added to the queue. It starts running.
  2. Job C with priority 2 is added to the queue.
  3. Job D with priority 2 is added to the queue. At this point the user is expecting the order of execution to be A -> C -> D.
  4. Job B with priority 1 is added to the queue. (priority(B) > priority(A))
  5. Job A times out.
  6. Once Pausing the queue (#443) is implemented, we'll pause the queues at this point. We have the queue push Job A back into the queue with a high priority of 2.
  7. Network resumes, user clicks retry. The order of execution will be C -> D -> A when it should be A -> C -> D (important for a scenario where the job type of A, C, D is replies for example).
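
The reordering problem above can be reproduced with plain (priority, counter, job) tuples: if the timed-out job is resubmitted with a fresh counter, it sorts behind the equal-priority jobs that were enqueued after it. This is an illustrative sketch, not the client's code:

```python
import itertools
from queue import PriorityQueue

counter = itertools.count()
q = PriorityQueue()

# A is dequeued and running (so not in the queue) when the timeout
# hits; C and D arrive at the same priority, then B at a higher one.
a_counter = next(counter)          # A's original counter value
q.put((2, next(counter), "C"))
q.put((2, next(counter), "D"))
q.put((1, next(counter), "B"))

# Naive retry: push A back with its priority but a *fresh* counter.
q.put((2, next(counter), "A"))

order = [q.get()[2] for _ in range(4)]
# order == ["B", "C", "D", "A"] -- A runs last among priority 2,
# but the user expected A -> C -> D
```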

What we could do to handle these situations is keep last_job and then, picking up at step 7 when the job is resumed:

  1. Network resumes, user clicks retry.
  2. We retrieve the highest priority job in the queue.
  3. If the retrieved job is higher priority than last_job, we run the retrieved job. Otherwise, we run last_job and store the retrieved job as the new last_job.

This results in the following job execution order for the resumed queue: B (priority(B) > priority(A) so we want this) -> A -> C -> D ✅
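
A minimal sketch of that resume logic, assuming (priority, counter, job) tuples where a lower priority number runs sooner; the function name and shape are illustrative, not the actual RunnableQueue code:

```python
from queue import PriorityQueue

def next_job_on_resume(q: PriorityQueue, last_job):
    """Pick the job to run first after a retry.

    last_job is the (priority, counter, job) tuple that timed out.
    Returns (job_to_run, new_last_job).
    """
    if q.empty():
        return last_job, None
    head = q.get()
    if last_job is None or head[0] < last_job[0]:
        # A strictly higher-priority job (e.g. B) jumps ahead of the
        # timed-out job; the timed-out job stays parked as last_job.
        return head, last_job
    # Otherwise rerun the timed-out job and park the job we pulled.
    return last_job, head
```

For the scenario above: with B (priority 1) at the head and A (priority 2) as last_job, B runs first and A stays parked; with C (priority 2) at the head, A runs first and C becomes the new last_job.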

redshiftzero (Author) commented:

Thinking about this a bit more: for unrelated reasons I already implemented a counter on jobs to ensure that jobs with equal priorities are processed in the order they are submitted. If we resubmit the timed-out job without modifying its counter, the execution order will be A -> C -> D, which is far simpler than using last_job 🎉
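
The counter trick can be sketched as follows (illustrative names, not the client's actual code): resubmitting the timed-out job with its original counter sorts it back into its original FIFO position among equal-priority jobs, so no last_job slot is needed:

```python
import itertools
from queue import PriorityQueue

counter = itertools.count()
q = PriorityQueue()

a_entry = (2, next(counter), "A")  # A is dequeued and running
q.put((2, next(counter), "C"))
q.put((2, next(counter), "D"))
q.put((1, next(counter), "B"))

# A times out: resubmit it with its *original* counter, so it sorts
# ahead of C and D (same priority, smaller counter).
q.put(a_entry)

order = [q.get()[2] for _ in range(4)]
# order == ["B", "A", "C", "D"] -- the order the issue wants
```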

sssoleileraaa (Contributor) commented:

Excellent, so the problem where we could potentially replace the previously failed job stored in last_job is solved because it would instead be sorted back into the correct place in the queue?

redshiftzero (Author) commented:

exactly - which means we don't need last_job anymore at all
