fix: multi nat initialization causing dead lock in waku tests + serialize test runs to avoid timing and port occupied issues #2799

Merged: 5 commits into master from bug-nat-tear-down, Jun 12, 2024

@NagyZoltanPeter NagyZoltanPeter commented Jun 11, 2024

Description

Multiple NAT module initialization causes a deadlock in tests

When NAT setup finds a suitable device for the port mapping (which does not happen in Jenkins CI), the nim-eth/nat module can be initialized multiple times.
That module is not designed for this, and changing it would require a larger rework of the module.
The root cause is that every initialization within one application run creates another remapping thread. That thread is responsible for refreshing the port mapping on the router when needed.
As a result, multiple threads are created but only the last one is tracked. This deadlocks the shutdown mechanism, because the locking is built on a single module-level Channel[bool] variable that cannot handle this situation and remains blocked.
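
The failure mode described above can be illustrated with a small Python analogy. This is only a sketch, not the nim-eth/nat code: `NatModule`, `init`, `shutdown` and `demo` are hypothetical names, a `queue.Queue` stands in for the `Channel[bool]`, and the join timeouts exist only so the demonstration terminates where an unbounded join could hang.

```python
import queue
import threading


class NatModule:
    """Stands in for module-level state: one stop channel, one tracked worker."""

    def __init__(self):
        self.stop_channel = queue.Queue()
        self.tracked_worker = None

    def _refresh_loop(self):
        # Stand-in for the mapping refresh thread: blocks until a stop message arrives.
        self.stop_channel.get()

    def init(self):
        # A repeated call overwrites the previously tracked worker.
        self.tracked_worker = threading.Thread(target=self._refresh_loop, daemon=True)
        self.tracked_worker.start()
        return self.tracked_worker

    def shutdown(self, timeout=0.5):
        # Sends ONE stop message and waits for the tracked worker only.
        self.stop_channel.put(True)
        self.tracked_worker.join(timeout)


def demo():
    nat = NatModule()
    first = nat.init()
    second = nat.init()  # second initialization: `first` is no longer tracked
    nat.shutdown()
    first.join(0.5)
    second.join(0.5)
    # Only one stop message was sent, so one of the two workers is still blocked.
    return [first.is_alive(), second.is_alive()]


if __name__ == "__main__":
    print(demo())
```

Whichever worker receives the single stop message exits, and the other stays blocked. If the tracked worker is the one left waiting, a join without a timeout would never return.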

A simple workaround is applied: the waku nat module prevents multiple initialization of the nim-eth/nat module and, on any repeated attempt, behaves as if no suitable device had been found (which is still an acceptable case for testing).
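
The shape of such a guard can be sketched in Python as follows. This is an illustrative assumption, not the actual change in waku/common/utils/nat.nim: `setup_nat`, `_map_port`, `_nat_initialized` and the returned address format are all hypothetical.

```python
import threading
from typing import Optional

_init_lock = threading.Lock()
_nat_initialized = False  # hypothetical module-level guard


def _map_port(port: int) -> Optional[str]:
    """Placeholder for the real device discovery and port mapping (assumption)."""
    return f"192.0.2.1:{port}"  # documentation-only address range


def setup_nat(port: int) -> Optional[str]:
    """Run the NAT setup at most once per process."""
    global _nat_initialized
    with _init_lock:
        if _nat_initialized:
            # Repeated call: report "no device found" instead of re-initializing.
            return None
        _nat_initialized = True
    return _map_port(port)
```

Only the first caller gets a mapping, so only one refresh thread would ever be started. Every later caller sees the same result it would get on a machine without a suitable NAT device.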

Reduce test flakiness

To reduce the probability of timing issues during CI test runs, and of tests failing because ports are already in use, we made test execution sequential.
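
The port conflict that sequential execution avoids can be reproduced with a minimal Python sketch. `try_listen` and `demo` are hypothetical helpers and not part of the test suite; the sketch only shows that a second listener on a port fails while the first is still open and succeeds once the first has finished.

```python
import socket
from typing import Optional, Tuple


def try_listen(port: int, host: str = "127.0.0.1") -> Optional[socket.socket]:
    """Return a listening socket on host:port, or None if the bind fails."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError:
        sock.close()
        return None


def demo() -> Tuple[bool, bool]:
    first = try_listen(0)            # port 0: let the OS pick a free port
    assert first is not None
    port = first.getsockname()[1]
    overlapping = try_listen(port)   # a "parallel" test reusing the same port
    first.close()
    sequential = try_listen(port)    # the next test, started after the first finished
    results = (overlapping is None, sequential is not None)
    if sequential is not None:
        sequential.close()
    return results
```

The first element of the result reports that the overlapping bind was rejected, the second that the same port was usable again once the earlier listener had closed.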

How to test

  1. make testwakunode2

Issue

#2628

@NagyZoltanPeter NagyZoltanPeter changed the title bug: Fix for multiple nat module initialization causes dead lock in nat refresh thread fix: for multiple nat module initialization causes dead lock in nat refresh thread Jun 11, 2024

github-actions bot commented Jun 11, 2024

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2799-rln-v1

Built from 7ade6c2

github-actions bot commented Jun 11, 2024

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2799-rln-v2

Built from 7ade6c2

Contributor

@gabrielmer gabrielmer left a comment

Amazingg, thanks so much!

waku/common/utils/nat.nim (outdated, resolved)
@NagyZoltanPeter
Contributor Author

@Ivansete-status, @gabrielmer
cc: @SionoiS, @DarshanBPatel

In addition to the original PR, I still encountered flaky tests, mostly on macOS:
REST ports already in use, and some timing-sensitive tests.
Together with @Ivansete-status we made our bets and tried to serialize the tests by not allowing parallel runs for the Nim compiler.
This seems to eliminate most of the issues, although some variance in the tests may remain.
Finally, test run times do not seem to be much affected: still around 27-31 minutes.

Contributor

@SionoiS SionoiS left a comment

Zoltan the 🕵️.

Very nice!

Collaborator

@Ivansete-status Ivansete-status left a comment

LGTM! Thanks! Just added a comment about splitting it into separate PRs.

Comment on lines +116 to +122
sudo docker run --rm -d -e POSTGRES_PASSWORD=test123 -p 5432:5432 postgres:15.4-alpine3.18
postgres_enabled=1
fi

export MAKEFLAGS="-j1"
export NIMFLAGS="--colors:off -d:chronicles_colors:none"

Collaborator

That LGTM, but shall we add it in a separate PR so that it is clear which commit applied it?

Collaborator

> That LGTM, but shall we add it in a separate PR so that it is clear which commit applied it?

@NagyZoltanPeter - maybe easier to just update the PR title and description to also reflect the CI test change :)

Contributor Author

:-) Agreed! I'll do it.

@NagyZoltanPeter NagyZoltanPeter changed the title fix: for multiple nat module initialization causes dead lock in nat refresh thread fix: multi nat initialization causing dead lock in waku tests + serialize test runs to avoid timing and port occupied issues Jun 12, 2024
@NagyZoltanPeter NagyZoltanPeter merged commit 5989de8 into master Jun 12, 2024
18 of 19 checks passed
@NagyZoltanPeter NagyZoltanPeter deleted the bug-nat-tear-down branch June 12, 2024 05:49
rymnc pushed a commit that referenced this pull request Jun 20, 2024
…lize test runs to avoid timing and port occupied issues (#2799)

* Prevent multiple nat module initialization that cause dead lock in nat refresh thread tear down during tests.
* NPROC to 1 to avoid parallel test runs can lead to timing and port allocation issues

Co-authored-by: gabrielmer <[email protected]>
4 participants