_PushStatus stays stuck as 'running' #4315

Closed
flovilmart opened this issue Nov 3, 2017 · 7 comments

Comments

@flovilmart
Contributor

flovilmart commented Nov 3, 2017

When sending a large number of pushes, the push pipeline originally evaluates how many pushes to send by running a count on the _Installation collection. This count is then stored in the _PushStatus object and decremented each time the PushWorker completes a push send.
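Roughly, the bookkeeping looks like this (a minimal sketch, not the actual parse-server code; the collection names follow the Parse schema, everything else is illustrative):

```js
// Minimal sketch of the count-then-decrement bookkeeping (illustrative, not parse-server code).
// 'db' is a connected mongodb Db instance (e.g. from MongoClient.connect).

async function startPush(db, pushStatusId, installationQuery) {
  // Estimate how many installations the push targets.
  const count = await db.collection('_Installation').countDocuments(installationQuery);
  // Store the expected total on the _PushStatus document.
  await db.collection('_PushStatus').updateOne(
    { _id: pushStatusId },
    { $set: { status: 'running', count } }
  );
}

async function onBatchComplete(db, pushStatusId, numSent) {
  // Each completed batch decrements the counter.
  await db.collection('_PushStatus').updateOne(
    { _id: pushStatusId },
    { $inc: { count: -numSent } }
  );
  // The status only flips once the counter reaches 0; if a work item is lost
  // or the original count was off, it never gets there and stays 'running'.
  const status = await db.collection('_PushStatus').findOne({ _id: pushStatusId });
  if (status && status.count <= 0) {
    await db.collection('_PushStatus').updateOne(
      { _id: pushStatusId },
      { $set: { status: 'succeeded' } }
    );
  }
}
```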

Now, this creates multiple issues:

  • If the PushWorker fails to dequeue the PushWorkItem, it is never retried, so the count is never decremented.
  • If the count is off, or installations are added or removed while the push is being sent, the bookkeeping also drifts.

Counts are known to be unreliable with MongoDB, so we should not rely on them.

One fix (roughly sketched after the list below) would be to:

  • remove the 'scheduled' status and replace it with 'pending'
  • move all direct push notifications to the 'succeeded' state if there is no error scheduling the runs
  • mark scheduled pushes as 'running' when the first time zone is being sent (in the localized case)
  • keep marking pushes as failed the way we do right now.
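Something like this, very roughly (the status strings match the proposal above; the event names and the isScheduled flag are illustrative assumptions, not actual parse-server code):

```js
// Illustrative sketch of the proposed _PushStatus lifecycle; only the status strings
// are real, the events and 'isScheduled' are hypothetical.
function nextStatus(current, event, isScheduled) {
  switch (event) {
    case 'created':
      // No more 'scheduled': everything starts out as 'pending'.
      return 'pending';
    case 'runsEnqueued':
      // Direct pushes go straight to 'succeeded' once the runs are enqueued without error.
      return isScheduled ? 'pending' : 'succeeded';
    case 'firstTimeZoneSending':
      // Scheduled/localized pushes become 'running' when the first TZ is sent.
      return 'running';
    case 'error':
      // Failures keep being marked exactly as they are today.
      return 'failed';
    default:
      return current;
  }
}
```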

Thoughts? @montymxb @acinader

@montymxb
Contributor

montymxb commented Nov 3, 2017

Have there been issues dequeuing? For handling a failure it makes sense that we'd just mark it as failed, like it is now.

As for relying on the count, we could add a relation of devices or installations to _PushStatus. Instead of just storing the count we could actually add the installations to the relation, and as the pushes are sent we would drop them from it. That would be pretty stable regardless of which installations are added or removed (something like the sketch below).
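Purely illustrative, assuming server-side code with the master key and a made-up 'installations' relation field:

```js
// Illustrative sketch: track targeted installations on _PushStatus via a relation.
const Parse = require('parse/node');

async function registerTargets(pushStatus, installations) {
  const relation = pushStatus.relation('installations');
  installations.forEach(installation => relation.add(installation));
  await pushStatus.save(null, { useMasterKey: true });
}

async function markSent(pushStatus, installation) {
  // Drop the installation from the relation once its push has been delivered;
  // whatever is left in the relation is still outstanding.
  pushStatus.relation('installations').remove(installation);
  await pushStatus.save(null, { useMasterKey: true });
}
```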

@flovilmart
Contributor Author

That would be utterly inefficient: n additional writes on the server before even sending the pushes.

The count is only used to know how many batches we need to send in a distributed manner (keep in mind we have 20+ parse-server instances with as many push workers).

The count is also decremented after a worker completes its job; the main issue is that the count is unlikely to reach 0, so the PushStatus stays 'running'.
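Put differently, the count does little more than size the batches (illustrative only; batchSize and the publish function are stand-ins, not actual parse-server code):

```js
// Illustrative: the count only determines how many PushWorkItems get enqueued
// across the fleet; 'publish' stands in for whatever pushes onto the work queue.
function enqueueBatches(count, batchSize, publish) {
  const numBatches = Math.ceil(count / batchSize);
  for (let i = 0; i < numBatches; i++) {
    publish({ batchIndex: i, batchSize });
  }
  return numBatches;
}
```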

@flovilmart
Contributor Author

We haven’t seen any issues dequeuing, and it doesn’t matter: we don’t guarantee delivery yet, and that should be solved at the queue level, i.e. the message should be put back in the work queue if processing didn’t start.

@montymxb
Contributor

montymxb commented Nov 4, 2017

It would be super slow, but it wouldn't miss! Still, it's most likely too slow. If I'm understanding this right, the query used to get the original count is later reused to provide the _Installation objects, right? As for preventing new objects from skewing it, you could just add a time constraint to the query to limit results to installations created before the _PushStatus was created (see the sketch below). That would prevent new objects from popping in, but it wouldn't help with old ones being removed.
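For example (illustrative only; it assumes the installation query is a Parse.Query and that the _PushStatus object is at hand):

```js
// Illustrative: constrain the installation query to objects that already existed
// when the _PushStatus was created, so late arrivals can't skew the count.
const Parse = require('parse/node');

async function countTargets(pushStatus) {
  const query = new Parse.Query(Parse.Installation);
  query.lessThan('createdAt', pushStatus.createdAt);
  return query.count({ useMasterKey: true });
}
```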

I think I need to take a look at this code to see if something more obvious comes to mind; I haven't really looked it over. The one thing I was thinking of, for preventing hanging pushes when installations are removed, would be to detect that, although the count is not 0, the number of installations left to send to is 0. Basically pick one or the other, but this is just speculation.

@flovilmart
Contributor Author

In any case, counts are unreliable, and documented as such for MongoDB, so for all intents and purposes they are useful as an estimate and nothing more. Also, for the sake of consistency, if an object enters the scope of a push it should be included, even if there's some latency. Otherwise that's a constraint on updatedAt we would have to add, which is also unacceptable, as it may conflict with user-defined constraints.
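For reference, newer MongoDB Node drivers make the estimate-vs-exact distinction explicit (a minimal sketch, not tied to our code):

```js
// Minimal sketch: MongoDB separates cheap-but-approximate counts from
// exact-but-slower ones, which is why the stored count is only good as an estimate.
async function installationCounts(db, query) {
  const coll = db.collection('_Installation');
  // Metadata-based: fast, but can be off (e.g. on sharded clusters or after unclean shutdowns).
  const estimate = await coll.estimatedDocumentCount();
  // Aggregation-based: accurate for the filter, but more expensive.
  const exact = await coll.countDocuments(query);
  return { estimate, exact };
}
```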

@montymxb
Contributor

montymxb commented Nov 4, 2017

Yeah :/, modifying the query could mess with a user-defined constraint. I'm assuming you're talking about sharded-cluster counts; in that case it makes sense that we can't really trust them.

Barring anything better from my end, could you elaborate on some of those fixes you proposed initially?

@flovilmart
Contributor Author

Closing as done!
