_PushStatus stays stuck as 'running' #4315

Closed
flovilmart opened this issue Nov 3, 2017 · 7 comments

Comments

@flovilmart
Contributor

flovilmart commented Nov 3, 2017

When sending a large number of pushes, the push pipeline originally evaluates how many pushes to send by running a count on the _Installation collection. This count is then stored in the _PushStatus object and decremented each time the PushWorker completes a push send.
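Roughly, the bookkeeping looks like this (a minimal sketch, not the actual parse-server code; the collection names follow the Parse schema, everything else is illustrative):

```js
// Minimal sketch of the count-then-decrement bookkeeping (illustrative, not parse-server code).
// 'db' is a connected mongodb Db instance (e.g. from MongoClient.connect).

async function startPush(db, pushStatusId, installationQuery) {
  // Estimate how many installations the push targets.
  const count = await db.collection('_Installation').countDocuments(installationQuery);
  // Store the expected total on the _PushStatus document.
  await db.collection('_PushStatus').updateOne(
    { _id: pushStatusId },
    { $set: { status: 'running', count } }
  );
}

async function onBatchComplete(db, pushStatusId, numSent) {
  // Each completed batch decrements the counter.
  await db.collection('_PushStatus').updateOne(
    { _id: pushStatusId },
    { $inc: { count: -numSent } }
  );
  // The status only flips once the counter reaches 0; if a work item is lost
  // or the original count was off, it never gets there and stays 'running'.
  const status = await db.collection('_PushStatus').findOne({ _id: pushStatusId });
  if (status && status.count <= 0) {
    await db.collection('_PushStatus').updateOne(
      { _id: pushStatusId },
      { $set: { status: 'succeeded' } }
    );
  }
}
```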

Now, this creates multiple issues:

  • If the PushWorker fails to dequeue the PushWorkItem, it is never retried, so the count is never decremented.
  • If the count is off, or installations are added or removed while the push is being sent, the bookkeeping also drifts.

Counts are known to be unreliable with MongoDB, so we should not rely on them.

One fix (roughly sketched after the list below) would be to:

  • remove the 'scheduled' status and replace it with 'pending'
  • move all direct push notifications to the 'succeeded' state if there is no error scheduling the runs
  • mark scheduled pushes as 'running' when the first time zone is being sent (in the localized case)
  • keep marking pushes as failed the way we do right now.
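Something like this, very roughly (the status strings match the proposal above; the event names and the isScheduled flag are illustrative assumptions, not actual parse-server code):

```js
// Illustrative sketch of the proposed _PushStatus lifecycle; only the status strings
// are real, the events and 'isScheduled' are hypothetical.
function nextStatus(current, event, isScheduled) {
  switch (event) {
    case 'created':
      // No more 'scheduled': everything starts out as 'pending'.
      return 'pending';
    case 'runsEnqueued':
      // Direct pushes go straight to 'succeeded' once the runs are enqueued without error.
      return isScheduled ? 'pending' : 'succeeded';
    case 'firstTimeZoneSending':
      // Scheduled/localized pushes become 'running' when the first TZ is sent.
      return 'running';
    case 'error':
      // Failures keep being marked exactly as they are today.
      return 'failed';
    default:
      return current;
  }
}
```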

Thoughts? @montymxb @acinader

@montymxb
Contributor

montymxb commented Nov 3, 2017

Have there been issues dequeuing? For handling a failure it makes sense that we'd just mark it as failed, like it is now.

As for relying on the count, we could add a relation of devices or installations to _PushStatus. Instead of just storing the count we could actually add the installations to the relation, and as the pushes are sent we would drop them from it. That would be pretty stable regardless of which installations are added or removed (something like the sketch below).
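Purely illustrative, assuming server-side code with the master key and a made-up 'installations' relation field:

```js
// Illustrative sketch: track targeted installations on _PushStatus via a relation.
const Parse = require('parse/node');

async function registerTargets(pushStatus, installations) {
  const relation = pushStatus.relation('installations');
  installations.forEach(installation => relation.add(installation));
  await pushStatus.save(null, { useMasterKey: true });
}

async function markSent(pushStatus, installation) {
  // Drop the installation from the relation once its push has been delivered;
  // whatever is left in the relation is still outstanding.
  pushStatus.relation('installations').remove(installation);
  await pushStatus.save(null, { useMasterKey: true });
}
```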

@flovilmart
Contributor Author

That would be utterly inefficient: n additional writes on the server before even sending the pushes.

The count is only used to know how many batches we need to send in a distributed manner (keep in mind we have 20+ parse-server instances with as many push workers).

The count is also decremented after a worker completes its job; the main issue is that the count is unlikely to reach 0, so the PushStatus stays 'running'.
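Put differently, the count does little more than size the batches (illustrative only; batchSize and the publish function are stand-ins, not actual parse-server code):

```js
// Illustrative: the count only determines how many PushWorkItems get enqueued
// across the fleet; 'publish' stands in for whatever pushes onto the work queue.
function enqueueBatches(count, batchSize, publish) {
  const numBatches = Math.ceil(count / batchSize);
  for (let i = 0; i < numBatches; i++) {
    publish({ batchIndex: i, batchSize });
  }
  return numBatches;
}
```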

@flovilmart
Contributor Author

We haven’t seen any issues dequeuing, and it doesn’t matter: we don’t guarantee delivery yet, and that should be solved at the queue level, i.e. the message should be put back in the work queue if processing didn’t start.

@montymxb
Contributor

montymxb commented Nov 4, 2017

It would be super slow, but it wouldn't miss! Still, it's most likely too slow. If I'm understanding this right, the query used to get the original count is later reused to provide the _Installation objects, right? As for preventing new objects from skewing it, you could just add a time constraint to the query to limit results to installations created before the _PushStatus was created (see the sketch below). That would prevent new objects from popping in, but it wouldn't help with old ones being removed.
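For example (illustrative only; it assumes the installation query is a Parse.Query and that the _PushStatus object is at hand):

```js
// Illustrative: constrain the installation query to objects that already existed
// when the _PushStatus was created, so late arrivals can't skew the count.
const Parse = require('parse/node');

async function countTargets(pushStatus) {
  const query = new Parse.Query(Parse.Installation);
  query.lessThan('createdAt', pushStatus.createdAt);
  return query.count({ useMasterKey: true });
}
```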

I think I need to take a look at this code to see if something more obvious comes to mind; I haven't really looked it over. The one thing I was thinking of, for preventing hanging pushes when installations are removed, would be to detect that, although the count is not 0, the number of installations left to send to is 0. Basically pick one or the other, but this is just speculation.

@flovilmart
Contributor Author

In any case, counts are unreliable, and documented as such for MongoDB, so for all intents and purposes they are useful as an estimate and nothing more. Also, for the sake of consistency, if an object enters the scope of a push it should be included, even if there's some latency. Otherwise that's a constraint on updatedAt we would have to add, which is also unacceptable, as it may conflict with user-defined constraints.
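For reference, newer MongoDB Node drivers make the estimate-vs-exact distinction explicit (a minimal sketch, not tied to our code):

```js
// Minimal sketch: MongoDB separates cheap-but-approximate counts from
// exact-but-slower ones, which is why the stored count is only good as an estimate.
async function installationCounts(db, query) {
  const coll = db.collection('_Installation');
  // Metadata-based: fast, but can be off (e.g. on sharded clusters or after unclean shutdowns).
  const estimate = await coll.estimatedDocumentCount();
  // Aggregation-based: accurate for the filter, but more expensive.
  const exact = await coll.countDocuments(query);
  return { estimate, exact };
}
```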

@montymxb
Contributor

montymxb commented Nov 4, 2017

Yeah :/, modifying the query could mess with a user-defined constraint. I'm assuming you're talking about sharded-cluster counts; in that case it makes sense that we can't really trust them.

Barring anything better from my end, could you elaborate on some of those fixes you proposed initially?

@flovilmart
Contributor Author

Closing as done!
