CI: fix more flakes, move itests to GitHub (except ARM itest) #5811
Conversation
Force-pushed from 93274ef to 4f4d48d.
Wow, all GitHub itests green on the first run 😮
Is this good or bad?
I would say this is very good. Not sure why you would think it wasn't?
Looks good, just one q about a comment I'm unsure of.
@@ -31,7 +31,7 @@ var (
 		"lndexec", itestLndBinary, "full path to lnd binary",
 	)

-	slowMineDelay = 50 * time.Millisecond
+	slowMineDelay = 20 * time.Millisecond
re commit message: why does decreasing this value slow things down?
We already had the mineBlocksSlow function that used the 50ms delay. By replacing all instances of mineBlocks with mineBlocksSlow, we slow everything down. To reduce the amount of overall slowdown, we decrease the delay from 50ms to 20ms.
Going to update the commit message to make this more clear.
// Did the event chan close in the meantime? We want to
// avoid a "close of closed channel" panic since we're
// re-using the same event chan for multiple requests.
Not really understanding this comment? Would this channel get closed when chanWatchRequests is finished with it? Also, are we always sure this is a close and not another channel policy update?
I think it's that if the channel has already been closed here, and we send in another request, it'll end up double closing.
Yes, exactly.
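The guard being discussed can be illustrated with a small sketch (illustrative only, not lnd's code; safeClose is a hypothetical name). A receive on a closed channel returns immediately, so for a signal-only channel that never carries values, a select with a default branch can tell "already closed" apart from "still open":

```go
package main

import "fmt"

// safeClose closes ch only if it has not been closed yet. Note: this
// idiom is only race-free when a single goroutine is responsible for
// closing, as in the watcher loop discussed above.
func safeClose(ch chan struct{}) {
	select {
	case <-ch:
		// Already closed; calling close again would panic with
		// "close of closed channel".
	default:
		close(ch)
	}
}

func main() {
	ch := make(chan struct{})
	safeClose(ch)
	safeClose(ch) // second call is a no-op instead of a panic
	fmt.Println("no panic")
}
```

That is the failure mode the quoted comment guards against: re-using the same event chan across requests means a later request may find the chan already closed.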
If the tests were previously flaky because they were running on slow test machines, it could be that they uncovered issues that only show up on slow production machines. So perhaps those are missed now with GitHub Actions. I have to admit that I don't even know for sure that the tests being green is caused by faster test machines. Sorry about the lack of threading, I should have put my initial comment on some line.
I think it's mainly the lack of consistent timing w/ the series of timeouts we have. When we run w/ Travis (and their potato cluster), we end up with several processes (replicated db, 2x full node, up to 6 lnd nodes in some tests), so it's understandable that we run into some CPU scheduling weirdness that causes these flakes at times. At the same time, we've also eliminated a ton of flakes over the past 2 months due to flake hunting szn. Travis as a service has consistently degraded over the past year or so, and then they had that massive security failure on top of that. We've given them enough chances to get their services together after being acquired by that PE firm IMO.
I think the bigger gain here is also the restoration of all the lost developer time (sitting there babysitting the test to restart it, odd failures w/ the machine (?) itself) due to Travis.
Also worth noting this brings in the
LGTM 🥻
@@ -47,71 +47,16 @@ jobs:
 	- GOGC=30 make lint

-	- stage: Integration Test
-	  name: Btcd Integration
cy@ Travis 🤡
Force-pushed from 4f4d48d to c987f0a.
Rebased. But still blocked by btcsuite/btcd#1752.
Force-pushed from c987f0a to ef07c27.
Interceptor tests need a
The latest version of btcd allows its stall handler to be disabled. We use that new config option to make sure the mining btcd node and the lnd chain backend btcd node aren't disconnected if some test takes too long and no new p2p messages are exchanged.
We now redirect the mineBlocks function to the mineBlocksSlow function, which waits after each mined block. To reduce the overall time impact of using that function everywhere, we only wait 20 milliseconds instead of 50ms after each mined block to give all nodes some time to process the block. This will still slow everything down a bit but reduces flakes that are caused by different subsystems not being up to date.
Fixes the docker build failure that was caused by docker-library/postgres#884. Using the alpine image of version 13 avoids the problem introduced with postgres 14 and debian bullseye.
Force-pushed from ef07c27 to 134be24.
Race condition flake is new, notified the OP of that new test about it, needs a
Depends on btcsuite/btcd#1752.
Fixes two problems in the itest:
- The mining btcd node and the chain backend btcd node losing their connection because of the peer stall detection in btcd --> we fix this by disabling stall detection
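For illustration, disabling stall detection amounts to passing an extra flag when spinning up the harness btcd nodes. A hedged sketch (the flag name --nostalldetect is an assumption based on btcsuite/btcd#1752, and harnessNodeArgs is a hypothetical helper, not lnd's actual itest code):

```go
package main

import "fmt"

// harnessNodeArgs builds extra command-line arguments for a harness
// btcd node. disableStall keeps the miner and the chain backend from
// dropping their p2p connection when a long-running test exchanges no
// new messages.
func harnessNodeArgs(disableStall bool) []string {
	args := []string{"--rejectnonstd", "--debuglevel=debug"}
	if disableStall {
		// Flag name assumed from btcsuite/btcd#1752.
		args = append(args, "--nostalldetect")
	}
	return args
}

func main() {
	fmt.Println(harnessNodeArgs(true))
	// Prints: [--rejectnonstd --debuglevel=debug --nostalldetect]
}
```

The design point is that the stall handler is only a liveness heuristic; in a regtest harness where both peers are local and fully trusted, turning it off trades nothing away while removing a source of spurious disconnects.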