Refactor stats to use atomics #375

magec · 2023-03-20T13:28:46Z

When we are dealing with a high number of connections, generated stats cannot be consumed fast enough by the stats collector loop. This makes the stats subsystem inconsistent, and a lot of warning messages are logged due to unregistered servers/clients.

This change refactors the stats subsystem so it uses atomics:

- Counters are now handled using U64 atomics.
- The event system is dropped, and averages are calculated by a loop that runs every 15 seconds.
- Instead of generating snapshots every second, we keep track of the servers/clients that have registered. Each pool/server/client has its own instance of the counter and makes changes directly, instead of adding an event that gets processed later.
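To illustrate the list above, here is a minimal sketch of the idea, not the code in this PR. `ClientStats`, `Averager`, and their fields are hypothetical names; the only things taken from the description are the use of 64-bit atomics and an average recomputed once per period.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-client counters; the field names are illustrative only.
#[derive(Default)]
pub struct ClientStats {
    pub query_count: AtomicU64,
    pub bytes_sent: AtomicU64,
}

impl ClientStats {
    /// Update the shared counters directly; there is no event queue in between.
    pub fn record_query(&self, bytes: u64) {
        self.query_count.fetch_add(1, Ordering::Relaxed);
        self.bytes_sent.fetch_add(bytes, Ordering::Relaxed);
    }
}

/// Remembers the previous total so a periodic task can derive an average.
pub struct Averager {
    last_total: u64,
}

impl Averager {
    pub fn new() -> Self {
        Averager { last_total: 0 }
    }

    /// Meant to be called once per period (15 seconds in the description).
    /// Returns the per-second average of queries since the previous call.
    pub fn tick(&mut self, stats: &ClientStats, period_secs: u64) -> u64 {
        let total = stats.query_count.load(Ordering::Relaxed);
        // saturating_sub guards against a counter that was reset in between.
        let delta = total.saturating_sub(self.last_total);
        self.last_total = total;
        // max(1) avoids a division by zero if a zero period is passed.
        delta / period_secs.max(1)
    }
}
```

In a sketch like this, each connection would hold its `ClientStats` behind an `Arc` and call `record_query` from its own task, while a single background task calls `tick` on a timer. `Ordering::Relaxed` is assumed to be enough because the counters are only read for reporting and do not guard other data.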

Note that the system now has some contention when registering/deregistering clients, as we need to write to a hash behind a lock. Given that this happens only when registering/deregistering, I assumed performance won't be compromised, especially since we no longer copy snapshots every second.
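The trade-off described above can be sketched as follows. This is an assumption-laden illustration, not the PR's implementation: `Registry`, `ClientStats`, the `i32` client id, and the choice of `std::sync::RwLock` are all hypothetical.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Hypothetical per-client counter.
#[derive(Default)]
pub struct ClientStats {
    pub query_count: AtomicU64,
}

/// Hypothetical registry: a map behind a lock, written only on connect/disconnect.
#[derive(Default)]
pub struct Registry {
    clients: RwLock<HashMap<i32, Arc<ClientStats>>>,
}

impl Registry {
    /// Takes the write lock once per connection and hands back a shared handle.
    pub fn register(&self, client_id: i32) -> Arc<ClientStats> {
        let stats = Arc::new(ClientStats::default());
        self.clients
            .write()
            .unwrap()
            .insert(client_id, Arc::clone(&stats));
        stats
    }

    /// Takes the write lock once per disconnection.
    pub fn deregister(&self, client_id: i32) {
        self.clients.write().unwrap().remove(&client_id);
    }

    /// Reporting only needs the read lock plus atomic loads.
    pub fn total_queries(&self) -> u64 {
        self.clients
            .read()
            .unwrap()
            .values()
            .map(|s| s.query_count.load(Ordering::Relaxed))
            .sum()
    }

    /// Number of currently registered clients.
    pub fn len(&self) -> usize {
        self.clients.read().unwrap().len()
    }
}
```

The point of the design is that the hot path (a client bumping its own counter through the `Arc` it got from `register`) never touches the lock, so contention is limited to connection setup and teardown.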

Test coverage was good enough, so I didn't add new Ruby tests here.

I had to modify one test because maxwait is no longer deleted every second. Do we need this?

levkk · 2023-03-23T15:52:40Z

Hey @magec, the PR looks amazing, thank you so much for writing it. I've tagged @drdrsh to review it because he's running PgCat in production at Instacart, and I wanted to make sure that the changes look good to him as well.

On another note, I think it would be great for us 3 (the main production users of PgCat that I know of) to open up a more direct communication channel (e.g. Slack, Discord, whatever), so we can:

- share knowledge and battle stories
- talk through bugs, fixes, features
- share test plans, results

What do you think?

levkk

Good to merge! Let me know what the production metrics look like! Our main concerns are the speed of atomics vs. tokio channels and the accuracy of the new stats implementation (it took us a few tries to get the stats right the first time, just because our understanding of them was different from PgBouncer's, for which we were writing an in-place replacement).

magec · 2023-03-24T07:50:59Z

Hey! Yes, another communication channel would be great; Slack works for me. As for the PR, I have noticed that we are leaking connections in cl_idle. I think this was already happening, though I don't really know, but yesterday I disconnected everything abruptly and the stats kept saying that I had several clients connected (several as in 20k). I will try to reproduce and fix it. The rest works like a charm and I have no issues with load anymore.

levkk · 2023-03-27T18:08:05Z

@magec So we ended up settling on Discord for our public chat platform. You can join here: https://discord.com/invite/DmyJP3qJ7U (you can also invite anyone you want, it's public and free for anyone). There is a channel called #pgcat that's been pretty idle, but hopefully we can get it going!

magec · 2023-03-28T15:01:29Z

So, in the end I was testing badly 🤦. In the test I spawn a psql that connects, then kill it and check the cl_idle counter. I do this because if I use Ruby PG, I cannot close the socket without closing the session (which is what I want to test).

Funny thing (I didn't know): psql also forks and the socket is opened in the child, so I was killing the parent, and the socket didn't disconnect because the child was left running. Now I execute a pkill and kill both (I create a random string to ensure I don't kill more than I want).

After that, I could see that I do receive the event of the client disconnecting, and I could detect where the issue was, which was only stats related in the end.

That said, this is ready to be merged.

magec requested a review from levkk March 20, 2023 13:28

levkk requested a review from drdrsh March 20, 2023 15:27

drdrsh reviewed Mar 21, 2023

magec requested a review from drdrsh March 23, 2023 09:53

levkk approved these changes Mar 23, 2023

levkk reviewed Mar 24, 2023

magec added 2 commits March 27, 2023 10:11

Manually implement Hash/Eq in config::Address ignoring stats

761a2fa

magec force-pushed the atomic-stats branch from 9ab522f to fa94bd0 Compare March 27, 2023 14:38

Add tests for client connection counters

e279723

magec force-pushed the atomic-stats branch from fa94bd0 to e279723 Compare March 28, 2023 13:18

Allow connecting to dockerized dev pgcat from the host

b63274b

stats: Decrease cl_idle when idle socket disconnects

ba032e8

magec force-pushed the atomic-stats branch from 6bbcd46 to ba032e8 Compare March 28, 2023 15:02

magec merged commit 58ce76d into postgresml:main Mar 28, 2023

magec deleted the atomic-stats branch March 28, 2023 15:19
