
implement basic consumer push model #78098

Draft
wants to merge 10 commits into base: hackweek-kafkatasks

Conversation

@john-z-yang (Member) commented on Sep 25, 2024

Overview

  • Adds a push mode where the gRPC server runs on the worker side and the gRPC client runs on the consumer side.
  • Handles gRPC errors when the consumer fails to communicate with a worker.

Verified that this works with very long running tasks (30 minutes or more); the gRPC server and client seem to maintain an active socket connection the whole time, so if the worker dies, the client sees it immediately.
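For orientation, a stripped-down sketch of the consumer-side client is below. The Dispatch RPC and DispatchRequest message appear in this PR's diff; the stub class name and helper functions are assumptions, not the actual implementation.

import grpc

# Sketch only: one channel/stub per worker address. WorkerServiceStub and
# DispatchRequest would come from the generated *_pb2 / *_pb2_grpc modules
# (the stub class name is assumed).
def build_stubs(worker_addresses: list[str]) -> list:
    stubs = []
    for address in worker_addresses:  # e.g. ["127.0.0.1:50051", "127.0.0.1:50052"]
        channel = grpc.insecure_channel(address)
        stubs.append(WorkerServiceStub(channel))  # generated stub class (assumed name)
    return stubs

def push(stub, inflight_activation):
    # Push one activation to a worker. gRPC failures surface as grpc.RpcError on
    # this call and are handled by resetting the task back to pending (see below).
    return stub.Dispatch(DispatchRequest(task_activation=inflight_activation.activation))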

Start the Arroyo consumer as before.
Then start the consumer gRPC client with:

sentry run kafka-task-grpc-push -W 127.0.0.1:50051,127.0.0.1:50052

Start two workers with:

sentry run taskworker-push --namespace demos -P 50052
sentry run taskworker-push --namespace demos -P 50051

In a Django shell, run:

from sentry.taskdemo import say_hello  # import path assumed from the "sentry.taskdemo" logger name
for i in range(16):
    say_hello.delay(str(i))

The work should be distributed fairly evenly across the two workers.

Worker 1:

23:28:44 [INFO] sentry.taskdemo: hello 1
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.542855, 1.5428550243377686
23:28:44 [INFO] sentry.taskdemo: hello 7
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.557736, 1.5577359199523926
23:28:44 [INFO] sentry.taskdemo: hello 4
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.5677671, 1.5677671432495117
23:28:44 [INFO] sentry.taskdemo: hello 11
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.576597, 1.576596975326538
23:28:44 [INFO] sentry.taskdemo: hello 2
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.585857, 1.5858569145202637
23:28:44 [INFO] sentry.taskdemo: hello 5
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.5937028, 1.593702793121338
23:28:47 [INFO] sentry.taskdemo: hello 15
23:28:47 [INFO] taskworker.results: task.complete, 1727306923, 1727306927.6729941, 4.672994136810303

Worker 2:

23:28:44 [INFO] sentry.taskdemo: hello 10
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.53108, 1.5310800075531006
23:28:44 [INFO] sentry.taskdemo: hello 9
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.551745, 1.5517449378967285
23:28:44 [INFO] sentry.taskdemo: hello 8
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.562934, 1.5629339218139648
23:28:44 [INFO] sentry.taskdemo: hello 0
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.572237, 1.5722370147705078
23:28:44 [INFO] sentry.taskdemo: hello 3
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.580666, 1.5806660652160645
23:28:44 [INFO] sentry.taskdemo: hello 6
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.590097, 1.5900969505310059
23:28:44 [INFO] sentry.taskdemo: hello 12
23:28:44 [INFO] taskworker.results: task.complete, 1727306923, 1727306924.597898, 1.597898006439209
23:28:46 [INFO] sentry.taskdemo: hello 13
23:28:46 [INFO] taskworker.results: task.complete, 1727306923, 1727306926.637703, 3.6377029418945312
23:28:47 [INFO] sentry.taskdemo: hello 14
23:28:47 [INFO] taskworker.results: task.complete, 1727306923, 1727306927.6825259, 4.682525873184204

github-actions bot added the Scope: Backend label on Sep 25, 2024
Comment on lines 43 to 51
logger.exception(
    "Connection lost with worker, code: %s, details: %s",
    rpc_error.code(),
    rpc_error.details(),
)
self.pending_task_store.set_task_status(
    task_id=in_flight_activation.activation.id,
    task_status=TASK_ACTIVATION_STATUS_PENDING,
)
@john-z-yang (Member, Author) commented on Sep 25, 2024

I'm questioning whether this is the correct behaviour. My intuition is not to reuse the same code path for handling a worker connection failure, because this looks like a platform error and so should not consume the retry quota set on the Task the user defined.
The counterargument is that executing the task can itself cause the worker to OOM, which is a user issue.

Regardless, not having a check here could be dangerous in a production environment.

Member reviewer:

If we fail to connect to a worker, the task wouldn't have been processed and restoring the state to pending makes sense to me. We'd also need to reset the processing_deadline for the activation so that it doesn't get flagged as a worker timeout and retried later.
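A minimal sketch of that reset, assuming the store grows a way to clear the deadline (the clear_processing_deadline method below is hypothetical, not an existing store API):

# Illustrative only: on a connection failure, restore the task to pending and
# clear its processing_deadline so it isn't later treated as a worker timeout.
self.pending_task_store.set_task_status(
    task_id=in_flight_activation.activation.id,
    task_status=TASK_ACTIVATION_STATUS_PENDING,
)
self.pending_task_store.clear_processing_deadline(  # hypothetical store method
    task_id=in_flight_activation.activation.id,
)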

Comment on lines 25 to 26
while True:
    self.dispatch_task()
@john-z-yang (Member, Author) commented on Sep 25, 2024

Maybe we should use a thread pool or async here, since consumers and workers are unlikely to be 1:1. Right now we only query the DB when an ongoing connection with a worker closes. Thoughts?

Member reviewer:

I think multithreading is going to be necessary. We'll want the thread pool size to be at most the number of workers per partition consumer; having more threads than workers would likely lead to contention on the workers.

from sentry.taskworker.config import TaskNamespace, taskregistry

logger = logging.getLogger("sentry.taskworker")


class Worker:
class WorkerServicer(BaseWorkerServiceServicer):
Member reviewer:

It would be good to have the push and pull models co-exist in the repository at the same time. That would help with throughput comparisons, since we could run the two options side by side.

Perhaps we can have two Worker implementations that are toggled with CLI flags?

@john-z-yang (Member, Author) replied on Sep 25, 2024

Agreed. We've split the command into push and pull variants for both the worker and the gRPC consumer; a rough sketch of what that split could look like is below.
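For reference only (Sentry's CLI is click-based; the command bodies and the PushWorker/PullWorker classes here are illustrative, not the actual implementation):

import click

@click.command("taskworker-push")
@click.option("--namespace", required=True)
@click.option("-P", "--port", type=int, required=True)
def taskworker_push(namespace: str, port: int) -> None:
    # Push variant: the worker hosts a gRPC server and waits for dispatched tasks.
    PushWorker(namespace=namespace, port=port).start()  # hypothetical class

@click.command("taskworker")
@click.option("--namespace", required=True)
def taskworker(namespace: str) -> None:
    # Pull variant: the worker polls the store/consumer for pending tasks.
    PullWorker(namespace=namespace).start()  # hypothetical class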

Comment on lines 34 to 42
with ThreadPoolExecutor(max_workers=len(self.available_stubs)) as executor:
    logger.info("Starting consumer grpc with %s threads", len(self.available_stubs))
    while True:
        inflight_activation = self._poll_pending_task()

        if len(self.available_stubs) == 0:
            done, not_done = wait(self.current_connections, return_when=FIRST_COMPLETED)
            self.available_stubs.extend([future.result() for future in done])
            self.current_connections = not_done
@john-z-yang (Member, Author) commented on Sep 25, 2024

I think ideally this should be done via async gRPC instead of regular gRPC with multithreading, because with async we only preempt during I/O, whereas with multithreading preemption is at the discretion of CPython.

But I am not sure how to replicate this behaviour (only polling the DB when a stub is free) with async Python, or whether the difference is large enough to yield measurable performance benefits, so I implemented it this way instead.
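For comparison, one way to keep the "only poll the DB when a stub is free" behaviour with asyncio would be a queue of free stubs. This assumes grpc.aio stubs with an awaitable Dispatch and is a sketch, not the implementation in this PR:

import asyncio

async def dispatch_loop(self) -> None:
    # Free stubs live in a queue; the DB is only polled once a stub is available.
    free_stubs: asyncio.Queue = asyncio.Queue()
    for stub in self.available_stubs:
        free_stubs.put_nowait(stub)

    async def dispatch_one(stub) -> None:
        # Run the blocking DB poll off the event loop.
        inflight = await asyncio.to_thread(self._poll_pending_task)
        try:
            # Assumes an async stub (grpc.aio); error handling omitted for brevity.
            await stub.Dispatch(DispatchRequest(task_activation=inflight.activation))
        finally:
            free_stubs.put_nowait(stub)  # stub is free again, unblocking the next poll

    while True:
        stub = await free_stubs.get()  # waits until some worker stub is free
        asyncio.create_task(dispatch_one(stub))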

@@ -33,7 +33,7 @@ def get_pending_task(self) -> InflightActivation | None:
            return None

        # TODO this duration should be a tasknamespace setting, or with an option
-       deadline = task.added_at + timedelta(minutes=3)
+       deadline = datetime.now() + timedelta(minutes=3)
Member reviewer:

The processing deadline should be the timestamp at which the task was pulled out of the datastore to be sent to the worker, plus some adjustable duration. The added_at timestamp is captured when the message is first read from Kafka and inserted into the datastore. If the worker does not pick up the task before task.added_at + <adjustable duration>, the task can never be completed; put simply, the current time will always be ahead of the processing_deadline.
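In other words, something like the sketch below once the duration becomes a namespace setting (the processing_deadline_duration attribute is hypothetical):

from datetime import datetime, timedelta

# The clock starts when the task is handed to a worker, not when it was first
# read from Kafka; the duration itself would come from the task namespace.
duration = timedelta(seconds=namespace.processing_deadline_duration)  # hypothetical setting
deadline = datetime.now() + duration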

Comment on lines 65 to 69
timeout_in_sec = inflight_activation.processing_deadline.seconds - time.time()
dispatch_task_response = stub.Dispatch(
    DispatchRequest(task_activation=inflight_activation.activation),
    timeout=timeout_in_sec,
)
@john-z-yang (Member, Author) commented on Sep 26, 2024

Thinking about this a bit more, a timeout might be better handled on the server (worker) side instead of the client side.

Here's my reasoning: when a task times out, it should be treated as if the task had thrown an exception, because it is not a platform problem (like failing to connect to a worker) but an issue with the execution of the task itself. So it should go through the same flow in the worker that determines the next state of the activation (here), instead of requeuing the task into the store as we do right now.

@enochtangg what do you think?
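As a sketch of the server-side option (the helper methods below are hypothetical, and a hard timeout would still need cooperative cancellation or a subprocess):

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def _dispatch_with_deadline(self, activation, deadline_seconds: float):
    # Enforce the deadline where the task runs, and treat a timeout like a raised
    # exception so it flows through the same retry-state logic as other task errors.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(self._run_task, activation)  # hypothetical task runner
        try:
            result = future.result(timeout=deadline_seconds)
        except FutureTimeout:
            # Note: the worker thread itself is not killed here.
            return self._next_state_for_failure(activation)  # hypothetical, same flow as exceptions
        return self._next_state_for_success(activation, result)  # hypothetical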
