chore(postgres): Adding healthcheck and reconnection mechanism to the postgres archive driver #1997
…driver It starts an asynchronous infinite task that checks the connectivity with the database. In case of error, the postgres_healthcheck task tries to reconnect for a while, and if it determines that the connection cannot be resumed, then it invokes a callback indicating that situation. For the case of the `wakunode2` app, this callback quits the application itself and adds a log trace indicating the connectivity issue with the database.
This PR may contain changes to database schema of one of the drivers. If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues. Please make sure the label |
Thanks! Curious to know - it seems that on failure, we will only continue trying to reconnect for 20 seconds and then quit the application. Is that enough to allow e.g. for a postgresql restart without crashing the node?
Thanks for the comment! I just set an arbitrary retry time. We can make it longer. Would 60'' look better?
Well, it is somewhere between 20s and 50s (healthcheck every 30s + 20s of retries, so if PG fails right before the healthcheck, you have 20s; otherwise it is more :D). I do agree that this might be highly dependent on the platform you run, so we should eventually turn this into an option (locally 20s+ should be enough; in Kubernetes under high load it could take a bit longer). I am a fan of failing fast if part of the infra does not work, though: assuming your infra is properly configured, your node will restart on failure and then you get another chance to retry with Postgres, so the 20s should be fine IMO.
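The 20s to 50s range quoted in this thread follows from simple arithmetic, sketched below. The helper name `quit_window` is hypothetical; the inputs are the figures mentioned in the discussion (a health check every 30s and a 20s retry window, or the proposed 60s one).

```python
def quit_window(check_interval_s: float, retry_window_s: float) -> tuple:
    """Return (shortest, longest) time between the database failing and the node giving up."""
    # Failure right before a health check: only the retry window remains.
    shortest = retry_window_s
    # Failure right after a health check: a full interval passes before retries start.
    longest = check_interval_s + retry_window_s
    return shortest, longest


print(quit_window(30, 20))  # (20, 50)
print(quit_window(30, 60))  # (60, 90)
```

With the retry window raised to 60s, a Postgres restart would have between 60s and 90s to complete before the node quits, under the same assumptions.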
Description
It starts an asynchronous infinite task that checks the connectivity with the database.
In case of error, the `postgres_healthcheck` task tries to reconnect for a while, and if it determines that the connection cannot be resumed, then it invokes a callback indicating that situation.
For the case of the `wakunode2` app, that callback quits the application itself and adds a log trace indicating the connectivity issue with the database.
Changes
- `proc resetConnPool*(pool: PgAsyncPool)` to force close the pool if either the database went down or the connection with it was lost.
- `wakunode2` quits if it has Store mounted and the Postgres database goes down for more than 30''.
- `waku/waku_archive/driver/postgres_driver/postgres_healthcheck.nim` that contains the infinite health check proc.
How to test
Test case: `wakunode2` stops if the database gets stopped.
- Run node A - `./build/wakunode2 --config-file=cfg_node_a.txt` (cfg_node_a.txt)
- Run node B - `./build/wakunode2 --config-file=cfg_node_b.txt` (cfg_node_b.txt)
- Run a Postgres instance: `docker compose -f postgres-docker-compose.yml up`
- Content of postgres-docker-compose.yml:
- `wakunode2` should stop.
Test case: `wakunode2` resumes connection with the Postgres database.
- Run node A - `./build/wakunode2 --config-file=cfg_node_a.txt` (cfg_node_a.txt)
- Run node B - `./build/wakunode2 --config-file=cfg_node_b.txt` (cfg_node_b.txt)
- Run a Postgres instance: `docker compose -f postgres-docker-compose.yml up`
- Content of postgres-docker-compose.yml:
- `curl -d '{"jsonrpc":"2.0","id":"id","method":"get_waku_v2_store_v1_messages"}' --header "Content-Type: application/json" http://localhost:8546`
In this case, a valid response should be returned.
Now, the following error should appear:
`{"jsonrpc":"2.0","id":"id","error":{"code":-32000,"message":"get_waku_v2_store_v1_messages raised an exception","data":"BAD_REQUEST: invalid cursor"}}`
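When repeating these checks by hand, it can help to tell a valid reply from the error reply programmatically. The sketch below only relies on the JSON-RPC 2.0 convention that a failed call carries an `error` member; the helper name `classify_store_response` is hypothetical, and the sample string is the error reply quoted above.

```python
import json


def classify_store_response(body: str) -> str:
    """Return 'ok' when the JSON-RPC reply has no error, else a short error summary."""
    reply = json.loads(body)
    if "error" in reply:
        err = reply["error"]
        # Prefer the detailed "data" field, fall back to the generic message.
        return f'error {err.get("code")}: {err.get("data") or err.get("message")}'
    return "ok"


sample = (
    '{"jsonrpc":"2.0","id":"id","error":{"code":-32000,'
    '"message":"get_waku_v2_store_v1_messages raised an exception",'
    '"data":"BAD_REQUEST: invalid cursor"}}'
)
print(classify_store_response(sample))  # error -32000: BAD_REQUEST: invalid cursor
```

The body passed in would be the output of the `curl` command from the test steps.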
Issue
closes #1893