Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(postgres): Adding healtcheck and reconnection mechanism to the postgres archive driver #1997

Merged
merged 1 commit into from
Sep 6, 2023

Conversation

Ivansete-status
Copy link
Collaborator

Description

It starts an asynchronous infinite task that checks the connectivity with the database.

In case of error, the postgres_healthcheck task tries to reconnect for a while, and if it determines that the connection cannot be resumed, then it invokes a callback indicating that situation.

For the case of the wakunode2 app, that callback quits the application itself and adds a log trace indicating the connectivity issue with the database.

Changes

  • New proc resetConnPool*(pool: PgAsyncPool) to force close the pool if either the database got down or lost the connection with it.
  • wakunode2 quits if it has Store mounted and the Postgres database goes down for more than 30''.
  • New waku/waku_archive/driver/postgres_driver/postgres_healthcheck.nim that contains the infinite health check proc.

How to test

Test case: wakunode2 stops if the database gets stopped.

  1. Run node A - ./build/wakunode2 --config-file=cfg_node_a.txt
    cfg_node_a.txt

  2. Run node B - ./build/wakunode2 --config-file=cfg_node_b.txt
    cfg_node_b.txt

  3. Run a Postgres instance: docker compose -f postgres-docker-compose.yml up
    Content of postgres-docker-compose.yml:

version: "3.8"

services:
  db:
    image: postgres:9.6-alpine
    restart: always
    environment:
      POSTGRES_PASSWORD: test123
    ports:
      - "5432:5432"
  1. Close the Postgres instance started in 3. After less than 30'' the wakunode2 should stop.

Test case: wakunode2 resumes connection with the Postgres database.

  1. Run node A - ./build/wakunode2 --config-file=cfg_node_a.txt
    cfg_node_a.txt

  2. Run node B - ./build/wakunode2 --config-file=cfg_node_b.txt
    cfg_node_b.txt

  3. Run a Postgres instance: docker compose -f postgres-docker-compose.yml up
    Content of postgres-docker-compose.yml:

version: "3.8"

services:
  db:
    image: postgres:9.6-alpine
    restart: always
    environment:
      POSTGRES_PASSWORD: test123
    ports:
      - "5432:5432"
  • Perform a Store request: curl -d '{"jsonrpc":"2.0","id":"id","method":"get_waku_v2_store_v1_messages"}' --header "Content-Type: application/json" http://localhost:8546
    In this case, a valid response should be returned.
  • Close the database started in 3.
  • Send another Store request.
    Now, the next error should appear: {"jsonrpc":"2.0","id":"id","error":{"code":-32000,"message":"get_waku_v2_store_v1_messages raised an exception","data":"BAD_REQUEST: invalid cursor"}}
  • Start the database again (point 3.) and wait for 30''.
  • Send another Store request. A valid response should come.

Issue

closes #1893

…driver

It starts an asynchronous infinite task that checks the connectivity
with the database. In case of error, the postgres_healthcheck task
tries to reconnect for a while, and if it determines that the connection
cannot be resumed, then it invokes a callback indicating that
situation. For the case of the `wakunode2` app, this callback
quits the application itself and adds a log trace indicating
the connectivity issue with the database.
@github-actions
Copy link

github-actions bot commented Sep 6, 2023

This PR may contain changes to database schema of one of the drivers.

If you are introducing any changes to the schema, make sure the upgrade from the latest release to this change passes without any errors/issues.

Please make sure the label release-notes is added to make sure upgrade instructions properly highlight this change.

@github-actions
Copy link

github-actions bot commented Sep 6, 2023

You can find the experimental image built from this PR at

quay.io/wakuorg/nwaku-pr:1997-experimental

@github-actions
Copy link

github-actions bot commented Sep 6, 2023

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:1997

@Ivansete-status Ivansete-status marked this pull request as ready for review September 6, 2023 13:34
@Ivansete-status Ivansete-status self-assigned this Sep 6, 2023
Copy link
Contributor

@jm-clius jm-clius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Curious to know - it seems that on failure, we will only continue trying to reconnect for 20 seconds and then quit the application. Is that enough to allow e.g. for a postgresql restart without crashing the node?

@Ivansete-status
Copy link
Collaborator Author

Thanks! Curious to know - it seems that on failure, we will only continue trying to reconnect for 20 seconds and then quit the application. Is that enough to allow e.g. for a postgresql restart without crashing the node?

Thanks for the comment! I just set an arbitrary time of retrial. We can make it longer. May 60'' look better?

@vpavlin
Copy link
Member

vpavlin commented Sep 6, 2023

Well, it is something between 20s and 50s (healthcheck every 30s + 20s of retries, so if PG fails right before the healthcheck, then you have 20s, otherwise it is more:D), but I do agree that this might be highly dependent on the platform you run, so we should turn this into an option eventually (locally 20s+ should be enouch, in Kubernetes under high load, it could take a bit longer)

I am a fan of failing fast if part of the infra does not work though - assuming your infra is properly configured, your node will restart on failure and then you get another chance to retry with postgres, so the 20s should be fine IMO

@Ivansete-status Ivansete-status merged commit 1fb13b0 into master Sep 6, 2023
14 checks passed
@Ivansete-status Ivansete-status deleted the fix-reconnect-to-postgres branch September 6, 2023 17:16
@Ivansete-status Ivansete-status mentioned this pull request Sep 8, 2023
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

chore (postgres): Make nwaku to reconnect to the Postgres database if it restarts
4 participants