Providing custom Tracker.list timeout #124
Comments
The work inside the call itself is minuscule - the main part of obtaining the list is performed by the caller directly. So the only reason for timeouts there would be that the shard server is overloaded and has a very long message queue; in that case, crashing callers is a very primitive back-pressure mechanism.
In my use case I'm willing to accept the caller timing out and then just saying "there are listeners" as a safe default. However, the 5-second timeout is longer than I'd want, since the timeout length directly affects how long message delivery takes when the system is backed up, and the tracker timing out on track calls will cause the client to reconnect. I'm still looking into why my tracker process is backing up. I've ensured there are no single large topics, and I have 8 shards with a fairly evenly distributed reduction count. Even so, being able to customize the timeout would let me reduce the maximum message-delivery time in the worst-case scenario of a backed-up tracker shard.
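For reference, a minimal sketch of that pattern, assuming a tracker registered as MyApp.Tracker and a hypothetical listeners?/1 helper. Since Phoenix.Tracker.list/2 does not currently accept a timeout, the short wait is enforced on the caller side via a task, and a timeout is treated as "assume there are listeners":

```elixir
defmodule MyApp.Presence do
  # Hypothetical module/helper names; 500 ms is an arbitrary low timeout.
  @list_timeout 500

  # Returns true when anyone is tracked on the topic, and also true (the
  # safe default) if the tracker shard is too backed up to answer in time.
  def listeners?(topic) do
    task = Task.async(fn -> Phoenix.Tracker.list(MyApp.Tracker, topic) end)

    case Task.yield(task, @list_timeout) || Task.shutdown(task, :brutal_kill) do
      {:ok, presences} -> presences != []
      _timeout_or_exit -> true
    end
  end
end
```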
Previously, list and get_by_key had to go through the GenServer to acquire the values ETS table and replica information. If the GenServer was processing an update (e.g. heartbeat, track, untrack), then list and get_by_key were blocked until it completed. We saw this behaviour in our cluster, where simple list/get_by_key calls were sometimes taking over a few hundred milliseconds. Storing the down-replica information in an ETS table lets us avoid going through the GenServer and handle list/get_by_key immediately. I removed the dirty_list function, which was not public/exposed and which was trying to resolve the same issue; dirty_list was called "dirty" because it didn't check for down_replicas. This solution checks down_replicas and doesn't change the API interface. This should also resolve phoenixframework#124
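As a rough illustration (not the actual phoenix_pubsub code), the general shape of that change is to keep the values and down-replica data in public ETS tables owned by the shard, so reads never queue behind GenServer updates. Table names and message shapes below are hypothetical:

```elixir
defmodule ShardSketch do
  use GenServer

  # Hypothetical table names; the real shard derives its own.
  @values :shard_values                 # :bag of {topic, key, replica, meta}
  @down_replicas :shard_down_replicas   # :set of {replica}

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(_opts) do
    :ets.new(@values, [:bag, :named_table, :public, read_concurrency: true])
    :ets.new(@down_replicas, [:set, :named_table, :public, read_concurrency: true])
    {:ok, %{}}
  end

  # Read path: no GenServer.call, so a busy shard cannot block or time out readers.
  def list(topic) do
    down = MapSet.new(:ets.tab2list(@down_replicas), fn {replica} -> replica end)

    for {^topic, key, replica, meta} <- :ets.lookup(@values, topic),
        not MapSet.member?(down, replica) do
      {key, meta}
    end
  end

  # Writes still serialize through the GenServer, which owns the tables.
  def handle_call({:track, topic, key, replica, meta}, _from, state) do
    :ets.insert(@values, {topic, key, replica, meta})
    {:reply, :ok, state}
  end

  def handle_info({:replica_down, replica}, state) do
    :ets.insert(@down_replicas, {replica})
    {:noreply, state}
  end
end
```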
Tracker.Shard uses a GenServer.call which doesn't accept any options from the caller. This call can be expensive in a large shard or one under heavy load.

Is it worthwhile to expose a timeout option all the way through the call chain? My thought is to set a low timeout value when writing a listeners? function and then just assume true if the call times out.

I'm happy to add this, but I wanted to run it by the maintainers first. I could see a desire to expose this through all of the Tracker public APIs.
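For illustration only, one possible shape of threading such an option through a public list call, with hypothetical function and shard-naming helpers (the current Phoenix.Tracker API does not take this option):

```elixir
defmodule TrackerWithTimeoutSketch do
  # Hypothetical defaults; 5_000 mirrors the usual GenServer.call default.
  @default_timeout 5_000
  @num_shards 8

  # Public API: accept an optional :timeout and pass it down to the shard call.
  def list(tracker_name, topic, opts \\ []) do
    timeout = Keyword.get(opts, :timeout, @default_timeout)

    tracker_name
    |> shard_for_topic(topic)
    |> GenServer.call({:list, topic}, timeout)
  end

  # Hypothetical helper: pick the shard process name for a topic.
  defp shard_for_topic(tracker_name, topic) do
    Module.concat(tracker_name, "Shard#{:erlang.phash2(topic, @num_shards)}")
  end
end

# Caller side, e.g. a listeners? helper using a deliberately low timeout:
# TrackerWithTimeoutSketch.list(MyApp.Tracker, "room:1", timeout: 500)
```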