Replies: 1 comment
Performing the batch based on event count is consistent with many other batched processing systems I've seen, and can be made more flexible by also offering a method to trigger execution on-demand (or, in more standard batching terms, to "flush" the queue). However, that does give the task a variable response time: instead of a guaranteed fire within 60 seconds of the event, it now depends on how frequently a given system schedules poll events. This is generally acceptable in other systems I've seen, so long as a configurable maximum wait is available.
One consideration may be to expand on the "backoff" strategy configuration already in use. This kind of logic could alternatively be implemented by accepting a function for the backoff delay. If the number of attempted runs for a poll is exposed at all to the job/run, it also offers the opportunity to set another configuration value, such as a maximum attempt count. Apologies if this overlooks any of the existing options or limitations in place for Tasks/Runs.
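As a sketch of what that might look like (all names below are hypothetical, not existing Trigger.dev APIs):

```ts
// Hypothetical sketch only: none of these names are real Trigger.dev APIs.
type BackoffFn = (attempt: number) => number; // returns the delay in seconds before the next poll

interface PollConfig {
  // declarative form, mirroring common retry/backoff options
  backoff?: { type: "fixed" | "exponential"; baseSeconds: number; maxSeconds?: number };
  // function form, for full control over the schedule
  backoffFn?: BackoffFn;
  // a further option this enables: give up after this many attempts
  maxAttempts?: number;
}

// Example: exponential backoff starting at 60s, capped at 15 minutes
const pollConfig: PollConfig = {
  backoffFn: (attempt) => Math.min(60 * 2 ** attempt, 900),
  maxAttempts: 100,
};
```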
This idea comes from thinking through the scaling implications of the `io.waitUntil` task, which would "poll" the task on an interval to determine if some condition inside a callback was true, and then only move to the next task once that happens. This would be extremely useful, but unfortunately it could also cause a massive amount of wasted function execution time for Trigger.dev clients. Imagine the condition runs every 60 seconds (the minimum interval) but never returns true, and the timeout is 14 days: that's an additional 20,160 executions. And that's only for a single task in a single run.
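For concreteness, here is a usage sketch (the `io.waitUntil` signature isn't finalized; the option names, `getOrder`, and `orderId` below are illustrative placeholders):

```ts
// Hypothetical usage of the proposed io.waitUntil task.
await io.waitUntil("order-settled", {
  pollInterval: 60,           // seconds between condition checks (60 is the minimum)
  timeout: 60 * 60 * 24 * 14, // give up after 14 days
  condition: async () => {
    const order = await getOrder(orderId); // placeholder user code
    return order.status === "settled";
  },
});
// execution only continues past this point once the condition returns true
```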
Clearly, this does not scale well, and the situation would get even worse when we implement polling triggers (e.g. for Notion, which doesn't have webhooks).
I think we'll need to develop a "Batch Polling" system before these features can be released, which would aggregate all the polling work over the last interval and perform it with a single request.
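As a rough illustration, the aggregated request body might be shaped something like this (all type and field names are invented for the sketch):

```ts
// Invented shapes for illustration; not an actual Trigger.dev payload.
interface PollItem {
  runId: string;    // the run that is waiting on this poll
  taskId: string;   // the waitUntil task (or polling trigger) to evaluate
  payload: unknown; // whatever the condition callback needs
}

interface BatchPollRequest {
  polls: PollItem[]; // every poll scheduled since the last interval
}
```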
One consideration with this system is that we'd have to make sure the request body payload isn't too large, and if it is, we should be able to split the polling work across multiple requests.
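A minimal sketch of that splitting, reusing the `PollItem` shape above and assuming an illustrative 1 MB body limit:

```ts
// Split a batch into chunks whose serialized size stays under the limit.
const MAX_BODY_BYTES = 1_000_000; // assumed example cap, not a known platform limit

function chunkBySize(polls: PollItem[]): PollItem[][] {
  const chunks: PollItem[][] = [];
  let current: PollItem[] = [];
  let currentBytes = 0;
  for (const poll of polls) {
    const bytes = Buffer.byteLength(JSON.stringify(poll), "utf8");
    // start a new chunk when adding this item would exceed the cap
    if (current.length > 0 && currentBytes + bytes > MAX_BODY_BYTES) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    current.push(poll);
    currentBytes += bytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```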
Another issue this system could run into (and I'm open to ideas on how to solve this) is not being able to finish all the required work within the function execution time limit.
Maybe instead of performing the Batch Polling request every X seconds, it would fire after X number of polling tasks had been scheduled. This could drastically reduce the work required while getting around the issues listed above; even flushing after every 10 polling tasks would cut the number of requests substantially. A sketch of this count-based approach follows below.
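A sketch of that count-based batching, reusing the `PollItem` shape above, with a fallback timer so a poll never waits indefinitely when traffic is low (again, all names are invented):

```ts
// Flush when maxCount polls accumulate, or after maxWaitMs, whichever comes first.
class PollBatcher {
  private queue: PollItem[] = [];
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private flushFn: (polls: PollItem[]) => Promise<void>,
    private maxCount = 10,      // flush every 10 scheduled polling tasks...
    private maxWaitMs = 60_000, // ...or after 60s, whichever comes first
  ) {}

  add(poll: PollItem): void {
    this.queue.push(poll);
    if (this.queue.length >= this.maxCount) {
      void this.flush();
    } else if (!this.timer) {
      this.timer = setTimeout(() => void this.flush(), this.maxWaitMs);
    }
  }

  async flush(): Promise<void> {
    if (this.timer) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    await this.flushFn(batch); // e.g. POST one BatchPollRequest (split by size if needed)
  }
}
```

The timer fallback keeps an upper bound on how long any single poll can wait, which also addresses the configurable maximum wait raised in the reply above.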