chore(postgres): Adding healtcheck and reconnection mechanism to the postgres archive driver #1997

Ivansete-status · 2023-09-06T13:02:28Z

Description

It starts an asynchronous infinite task that checks the connectivity with the database.

In case of error, the postgres_healthcheck task tries to reconnect for a while, and if it determines that the connection cannot be resumed, then it invokes a callback indicating that situation.

For the case of the wakunode2 app, that callback quits the application itself and adds a log trace indicating the connectivity issue with the database.

Changes

New proc resetConnPool*(pool: PgAsyncPool) to force close the pool if either the database got down or lost the connection with it.
wakunode2 quits if it has Store mounted and the Postgres database goes down for more than 30''.
New waku/waku_archive/driver/postgres_driver/postgres_healthcheck.nim that contains the infinite health check proc.

How to test

Test case: `wakunode2` stops if the database gets stopped.

Run node A - ./build/wakunode2 --config-file=cfg_node_a.txt
cfg_node_a.txt
Run node B - ./build/wakunode2 --config-file=cfg_node_b.txt
cfg_node_b.txt
Run a Postgres instance: docker compose -f postgres-docker-compose.yml up
Content of postgres-docker-compose.yml:

version: "3.8"

services:
  db:
    image: postgres:9.6-alpine
    restart: always
    environment:
      POSTGRES_PASSWORD: test123
    ports:
      - "5432:5432"

Close the Postgres instance started in 3. After less than 30'' the wakunode2 should stop.

Test case: `wakunode2` resumes connection with the Postgres database.

Run node A - ./build/wakunode2 --config-file=cfg_node_a.txt
cfg_node_a.txt
Run node B - ./build/wakunode2 --config-file=cfg_node_b.txt
cfg_node_b.txt
Run a Postgres instance: docker compose -f postgres-docker-compose.yml up
Content of postgres-docker-compose.yml:

version: "3.8"

services:
  db:
    image: postgres:9.6-alpine
    restart: always
    environment:
      POSTGRES_PASSWORD: test123
    ports:
      - "5432:5432"

Perform a Store request: curl -d '{"jsonrpc":"2.0","id":"id","method":"get_waku_v2_store_v1_messages"}' --header "Content-Type: application/json" http://localhost:8546
In this case, a valid response should be returned.
Close the database started in 3.
Send another Store request.
Now, the next error should appear: {"jsonrpc":"2.0","id":"id","error":{"code":-32000,"message":"get_waku_v2_store_v1_messages raised an exception","data":"BAD_REQUEST: invalid cursor"}}
Start the database again (point 3.) and wait for 30''.
Send another Store request. A valid response should come.

Issue

closes #1893

…driver It starts an asynchronous infinite task that checks the connectivity with the database. In case of error, the postgres_healthcheck task tries to reconnect for a while, and if it determines that the connection cannot be resumed, then it invokes a callback indicating that situation. For the case of the `wakunode2` app, this callback quits the application itself and adds a log trace indicating the connectivity issue with the database.

github-actions · 2023-09-06T13:02:41Z

This PR may contain changes to database schema of one of the drivers.

If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues.

Please make sure the label release-notes is added to make sure upgrade instructions properly highlight this change.

github-actions · 2023-09-06T13:08:10Z

You can find the experimental image built from this PR at

quay.io/wakuorg/nwaku-pr:1997-experimental

github-actions · 2023-09-06T13:08:53Z

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:1997

jm-clius

Thanks! Curious to know - it seems that on failure, we will only continue trying to reconnect for 20 seconds and then quit the application. Is that enough to allow e.g. for a postgresql restart without crashing the node?

Ivansete-status · 2023-09-06T14:02:31Z

Thanks! Curious to know - it seems that on failure, we will only continue trying to reconnect for 20 seconds and then quit the application. Is that enough to allow e.g. for a postgresql restart without crashing the node?

Thanks for the comment! I just set an arbitrary time of retrial. We can make it longer. May 60'' look better?

vpavlin · 2023-09-06T14:43:46Z

Well, it is something between 20s and 50s (healthcheck every 30s + 20s of retries, so if PG fails right before the healthcheck, then you have 20s, otherwise it is more:D), but I do agree that this might be highly dependent on the platform you run, so we should turn this into an option eventually (locally 20s+ should be enouch, in Kubernetes under high load, it could take a bit longer)

I am a fan of failing fast if part of the infra does not work though - assuming your infra is properly configured, your node will restart on failure and then you get another chance to retry with postgres, so the 20s should be fine IMO

Ivansete-status marked this pull request as ready for review September 6, 2023 13:34

Ivansete-status requested review from alrevuelta, jm-clius, vpavlin and SionoiS September 6, 2023 13:34

Ivansete-status self-assigned this Sep 6, 2023

jm-clius approved these changes Sep 6, 2023

View reviewed changes

vpavlin approved these changes Sep 6, 2023

View reviewed changes

SionoiS approved these changes Sep 6, 2023

View reviewed changes

Ivansete-status merged commit 1fb13b0 into master Sep 6, 2023
14 checks passed

Ivansete-status deleted the fix-reconnect-to-postgres branch September 6, 2023 17:16

Ivansete-status mentioned this pull request Sep 8, 2023

PostgreSQL #1888

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(postgres): Adding healtcheck and reconnection mechanism to the postgres archive driver #1997

chore(postgres): Adding healtcheck and reconnection mechanism to the postgres archive driver #1997

Ivansete-status commented Sep 6, 2023

github-actions bot commented Sep 6, 2023

github-actions bot commented Sep 6, 2023

github-actions bot commented Sep 6, 2023

jm-clius left a comment

Ivansete-status commented Sep 6, 2023

vpavlin commented Sep 6, 2023 •

edited

Loading

chore(postgres): Adding healtcheck and reconnection mechanism to the postgres archive driver #1997

chore(postgres): Adding healtcheck and reconnection mechanism to the postgres archive driver #1997

Conversation

Ivansete-status commented Sep 6, 2023

Description

Changes

How to test

Test case: wakunode2 stops if the database gets stopped.

Test case: wakunode2 resumes connection with the Postgres database.

Issue

github-actions bot commented Sep 6, 2023

github-actions bot commented Sep 6, 2023

github-actions bot commented Sep 6, 2023

jm-clius left a comment

Choose a reason for hiding this comment

Ivansete-status commented Sep 6, 2023

vpavlin commented Sep 6, 2023 • edited Loading

Test case: `wakunode2` stops if the database gets stopped.

Test case: `wakunode2` resumes connection with the Postgres database.

vpavlin commented Sep 6, 2023 •

edited

Loading