
The timeout problem in watchbot 4 #203

Open
jakepruitt opened this issue May 24, 2018 · 12 comments
Background

With the distributed nature of watchbot 4’s polling, we resurface an old problem: receiving messages that are still being processed but whose visibility timeouts have expired, sending them back into the queue.

From @rclark:

> I just thought of another watchbot regression at v4:
>
> Right now watchbot is "smart" about detecting the situation where your SQS messages time out before they finish. The watcher cross-checks an incoming message with receive > 1 against SQS ids that it is currently processing, and if they match, that means it's a "duplicate receive". It used to be that I told everyone to solve this by changing SQS message timeouts, which are configurable via watchbot, but confusing as hell to everyone.
>
> At v4, you can't check for duplication because watchers are distributed.
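For illustration only (this is a rough sketch, not watchbot source), the duplicate-receive check described above amounts to something like the following, assuming the watcher requests the ApproximateReceiveCount attribute when receiving and keeps an in-memory set of the SQS message ids it is currently working on:

```js
// Sketch of a single-watcher duplicate-receive check. Requires that
// receiveMessage was called with AttributeNames: ['ApproximateReceiveCount'].
const inFlight = new Set();

function isDuplicateReceive(message) {
  const receiveCount = Number(message.Attributes.ApproximateReceiveCount);
  // A receive count > 1 for a message we are already processing means SQS
  // re-delivered it because its visibility timeout expired mid-job.
  return receiveCount > 1 && inFlight.has(message.MessageId);
}

function startProcessing(message) {
  inFlight.add(message.MessageId);
}

function finishProcessing(message) {
  inFlight.delete(message.MessageId);
}
```

This only works when one process sees every receive, which is exactly what breaks once the watchers are distributed.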

Important reading on duplicate receives: #175

Further reading: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html

Next actions

We should draft a list of potential options to mitigate this situation. @rclark’s solution in watchbot 3 is outlined here: #176. The SQS docs suggest a “heartbeat” approach in https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html:

> If you don't know how long it takes to process a message, create a heartbeat for your consumer process: Specify the initial visibility timeout (for example, 2 minutes) and then—as long as your consumer still works on the message—keep extending the visibility timeout by 2 minutes every minute.
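A minimal sketch of that heartbeat pattern with the AWS SDK for JavaScript (the queue URL and receipt handle are whatever came back from the receive call; the 2-minute/1-minute values just follow the quote above):

```js
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Extend the message's visibility timeout by 2 minutes every minute for as
// long as the worker is still running. The caller stops it when work is done.
function startHeartbeat(queueUrl, receiptHandle) {
  const interval = setInterval(() => {
    sqs.changeMessageVisibility({
      QueueUrl: queueUrl,
      ReceiptHandle: receiptHandle,
      VisibilityTimeout: 120 // seconds, measured from the time of this call
    }).promise().catch((err) => console.error('[heartbeat]', err.message));
  }, 60 * 1000);

  return () => clearInterval(interval);
}
```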

We also have some interesting methods of implementing hard timeouts on containers by killing sub-processes if we wanted to go in a more brutal direction.

The floor is open for other suggestions from @mapbox/platform-engine-room. Ideally we can get rolling on some implementations and tests soon.

cc/ @emilymcafee @rclark

@rclark commented May 24, 2018

oooh that heartbeat solution is really a good one. The watcher can do that for sure.

We would want to change the watchbot options a little, though. You'd give the caller the option to specify a max job duration. This is how long the watcher would let the worker run. It wouldn't be tied to the SQS message timeout, except that the watcher would keep extending the timeout until either (a) the worker was done, or (b) it hit that optional configured timeout.

I think this'd be killer -- a maxJobDuration parameter is far more tangible to a developer than the existing messageTimeout parameter.
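To make the proposal concrete, here's a hypothetical sketch (reusing the startHeartbeat sketch above; runWorker and its options are made-up names) of a watcher that keeps the message invisible until the worker finishes, with maxJobDuration as an optional hard cutoff:

```js
// Keep the heartbeat going until the worker finishes, or cut the job off at
// an optional maxJobDuration (in seconds). No hard stop unless configured.
async function runWorker(worker, options) {
  const stopHeartbeat = startHeartbeat(options.queueUrl, options.receiptHandle);
  let timer;

  const cutoff = options.maxJobDuration
    ? new Promise((resolve) => {
        timer = setTimeout(() => resolve('timed out'), options.maxJobDuration * 1000);
      })
    : new Promise(() => {}); // never resolves: no hard stop by default

  try {
    // worker() is assumed to resolve when the child process exits.
    return await Promise.race([worker(), cutoff]);
  } finally {
    clearTimeout(timer);
    stopHeartbeat();
  }
}
```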

@rclark commented May 24, 2018

Note: I think that there is a limit to how long you can defer visibility. 12 hours as I recall.

@arunasank

The maxJobDuration implementation would be really cool. I am wondering if we can work with estimates (so a jobDuration) instead of hard stops (because that seems a little harsh to me), where:

  • First use the jobDuration value as the visibility timeout.
  • As we near the end of the jobDuration, if the worker is still running, start extending by Y additional minutes every Y minutes, up to the MaxVisibilityTimeout. When the worker stops, it effectively deletes the message from the queue, so we don't run the risk of the message lingering after that, even if the visibility timeout was increased just before the worker stopped.

The other advantage of implementing it this way over the heartbeat solution would be fewer calls to the change-visibility-timeout API.

@rclark commented May 24, 2018

If you have an option called jobDuration, but that does not represent any cap on a worker's duration, what is the purpose of the option?

With the heartbeat solution I was outlining, I imagined that by default someone would not set maxJobDuration, and the heartbeat would allow a worker to run up to the max visibility limit by extending by 2 min every heartbeat.

Then, the maxJobDuration property is a way to give the user the option to cut off a job sooner. "I know my worker shouldn't take more than 2 hours unless something went wrong".

In other words:

  • there are no hard stops by default
  • the user can specify a hard stop if they so desire
  • message visibility is extended by the watcher automatically until the worker ends

The last situation to consider would be what to do if a worker takes more than the maximum visibility timeout: do we cut it off, or do we let the work get duplicated?

> the number of API calls we make to the change-visibility-timeout API

p.s. SQS has no rate limits O_O

@arunasank commented May 24, 2018

> If you have an option called jobDuration, but that does not represent any cap on a worker's duration, what is the purpose of the option?

The purpose I was imagining was an estimate for the VisibilityTimeout, without requiring folks to measure how long their jobs take, and without expecting them to know the time taken to process their largest job. For example, in stacks where the processing time varies based on the amount of data being processed in each job, I imagine this can be hard, but maybe users already have hard stops on the maximum amount of data they're processing in each job?

> Then, the maxJobDuration property is a way to give the user the option to cut off a job sooner. "I know my worker shouldn't take more than 2 hours unless something went wrong".

This clarifies a lot. I see what you mean now, and I'm more convinced by what you've outlined. With this additional information, the maxJobDuration property also seems clearer to use: the hard stop avoids the vagueness I'd associate with estimates, and it makes users think harder about the time taken by their largest jobs.

Given that we have this maxJobDuration though, we can also implement our own version of the heartbeat solution where we:

  • Set the visibility timeout to be this maxJobDuration value.
  • Start the 2 minute heartbeat at (maxJobDuration - 120 seconds), up to the MaxVisibilityTimeout, if the worker is still running.

For smaller jobs, the end of the job will delete the message from the queue and shut down the container, removing the need for a heartbeat and making the VisibilityTimeout that was set irrelevant.

> The last situation to consider would be what to do if a worker takes more than the maximum visibility timeout: do we cut it off, or do we let the work get duplicated?

For this situation, could we delete the message from the queue when the MaxVisibilityTimeout is reached and, if the worker hasn't finished, send a fresh message to the queue (built from the received SQS message), which would effectively reset the VisibilityTimeout? We would be messing with the MaxRetries behaviour configured on the stack if we do this, though. 😬
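A hypothetical sketch of that delete-and-resend idea with the AWS SDK (it assumes the original receipt handle is still valid at the moment of the swap, and it carries the caveat above: the copy starts over with a fresh receive count, so retry accounting built on that is lost):

```js
// Replace an in-flight message with a fresh copy of itself as the visibility
// ceiling approaches. Caveat: ApproximateReceiveCount resets on the copy.
async function replaceInFlightMessage(sqs, queueUrl, message) {
  await sqs.deleteMessage({
    QueueUrl: queueUrl,
    ReceiptHandle: message.ReceiptHandle
  }).promise();

  const result = await sqs.sendMessage({
    QueueUrl: queueUrl,
    MessageBody: message.Body,
    MessageAttributes: message.MessageAttributes
  }).promise();

  return result.MessageId;
}
```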

> p.s. SQS has no rate limits O_O

Here's a hand-crafted expression of being mind-blown: #til :party_duck:

[screenshot: screen shot 2018-05-24 at 4 39 58 pm]

@rclark commented May 25, 2018

> Set the visibility timeout to be this maxJobDuration value.

I would suggest we pull from the queue with a 3 min visibility timeout, always. Then, every 2 minutes the watcher increases the timeout in 3 minute increments until you reach maxJobDuration or the SQS max visibility timeout.

What this does is prevent you from having to wait a long time to retry a message if, by some stroke of bad luck, the watcher gets killed. In that case, you have to wait for whatever visibility timeout the watcher asked for. If you originally asked for a long time, then you have to wait that long to retry. Bumping the visibility timeout in relatively small increments makes sure that, in the case of a message pulled from the queue that never gets a response, you can retry soon.
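A sketch of that receive-and-bump scheme (the queue URL is a placeholder and the increments just mirror the numbers above; this is not watchbot's actual code):

```js
// Always receive with a short 3-minute visibility timeout, then bump it by
// 3 minutes every 2 minutes while the worker runs. If the watcher dies, the
// message becomes visible again within roughly 3 minutes.
async function receiveWithShortTimeout(sqs, queueUrl) {
  const data = await sqs.receiveMessage({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 1,
    VisibilityTimeout: 180, // 3 minutes
    WaitTimeSeconds: 20
  }).promise();
  return (data.Messages || [])[0];
}

function startBumping(sqs, queueUrl, receiptHandle) {
  const interval = setInterval(() => {
    sqs.changeMessageVisibility({
      QueueUrl: queueUrl,
      ReceiptHandle: receiptHandle,
      VisibilityTimeout: 180 // another 3 minutes from now
    }).promise().catch((err) => console.error('[bump]', err.message));
  }, 2 * 60 * 1000);
  return () => clearInterval(interval);
}
```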

@brendanmcfarland

The timeout problem should be addressed with #230:

message.heartbeat() - function to refresh the minimum visibility timeout on a message every two minutes.

maxJobDuration - when a worker child_process's duration is greater than the maxJobDuration, the worker will log that the duration has been exceeded and attempt to kill the child_process and every process in the child_process tree. This is only evaluated if maxJobDuration is greater than zero.
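For illustration, one way a hard stop like that can be implemented in Node.js, by spawning the worker detached so its whole process group can be signalled (a sketch under those assumptions, not the actual #230 implementation):

```js
const { spawn } = require('child_process');

// Spawning detached puts the child in its own process group, so sending a
// signal to the negative pid reaches every process in the tree on POSIX.
function runWithHardStop(command, args, maxJobDurationSeconds) {
  const child = spawn(command, args, { detached: true, stdio: 'inherit' });

  let timer;
  if (maxJobDurationSeconds > 0) {
    timer = setTimeout(() => {
      console.error(`[worker] maxJobDuration of ${maxJobDurationSeconds}s exceeded, killing process tree`);
      try {
        process.kill(-child.pid, 'SIGTERM');
      } catch (err) {
        console.error('[worker] failed to kill process tree', err.message);
      }
    }, maxJobDurationSeconds * 1000);
  }

  child.on('exit', () => clearTimeout(timer));
  return child;
}
```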

@brendanmcfarland

> message.heartbeat() - function to refresh the minimum visibility timeout on a message every two minutes.

We missed a key fact in the docs when pushing this out:

> The default (minimum) visibility timeout for a message is 30 seconds. The maximum is 12 hours.

This max is not per extension of the timeout; it's the total. If the heartbeat tries to extend the timeout any further than this, you receive:

```
[error] [sqs] Value 180 for parameter VisibilityTimeout is invalid.
Reason: Total VisibilityTimeout for the message is beyond the limit [43200 seconds]
```

Right now watchbot 4 is not compatible with messages that take longer than 12 hours. They will return to visibility and get picked up by a new worker while still being processed.

"At v4, you can't check for duplication because watchers are distributed" is looking like a tougher problem for long-running work. Thinking on some next actions here. A quick mitigation would be to set a default maxJobDuration of 43200 and error on any duration over that, but that would be a regression from watchbot 3.
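A sketch of that quick mitigation, plus the bookkeeping a heartbeat would need to respect the 43200-second total (names like nextExtension are hypothetical):

```js
// The 12-hour ceiling applies to the message's total visibility, not to each
// individual extension, so the heartbeat has to track how much it has used.
const SQS_MAX_VISIBILITY = 43200; // 12 hours, total per received message

function validateMaxJobDuration(maxJobDuration) {
  if (maxJobDuration > SQS_MAX_VISIBILITY) {
    throw new Error(`maxJobDuration ${maxJobDuration}s exceeds the SQS visibility ceiling of ${SQS_MAX_VISIBILITY}s`);
  }
}

function nextExtension(totalExtendedSoFar, increment) {
  // Only extend as far as the remaining headroom allows; return 0 when the
  // 12-hour total is used up and the heartbeat should stop.
  const remaining = SQS_MAX_VISIBILITY - totalExtendedSoFar;
  return Math.max(0, Math.min(increment, remaining));
}
```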

@rclark commented Jul 18, 2018

The most straightforward option to consider is to make 12 hours a hard job duration limit. If it gets to 12 hours, the child process gets killed and the job will be retried. That sucks, but it's better than a 5 min Lambda hard limit.

Anything else is quite challenging without a source of truth about what is being processed.

  • after 12 hours, the message will become visible in the queue again. There is no way to stop this.
  • another watcher will pick it up, get a new receipt handle
  • that watcher can observe that receive count is > 1
  • when the original worker completes, it cannot delete the message from the queue, because the receipt handle it had is no longer valid after another watcher receives the message

Your single source of truth about what is in flight has to (a) track SQS message ids that are being processed, and (b) be able to be updated to the most recent receipt handle that's valid for the message. It's really quite the mess.

@KeithYJohnson

@rclark I'm just curious how you guys verified that a message returns to the queue when the 12-hour visibility timeout is exceeded. Could you please let me know?

@rclark commented Nov 27, 2018

Hi @KeithYJohnson -- that is something we entirely rely on SQS to manage for us. Check out their docs here.

@KeithYJohnson

Thanks! I did notice that, because I'm dealing with a similar issue and just wanted to verify that someone has observed the message returning to the queue when exceeding that hard limit. I suppose I should just stop being paranoid and trust the docs :)
