-
-
Notifications
You must be signed in to change notification settings - Fork 273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ping repeatedly from ping thread #383
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice improvement in principal!
I wonder if it is better to put the retry logic in the ping function; it already have a loop that checks on the future, maybe we should make the timeout 1 minute and retry 4 times to mimic current behavior?
Try a rebuild after infrastructure issues. |
I've wanted to look into making this ping thread more resilient. Changing it can be dangerous because there are some odd behaviors associated with it. Thanks for the submission. I'll try to find some time to take a look at it. |
Re-implemented repeated ping, addressing the mentioned issues and trying to be less invasive:
|
Looks like there's a compilation failure. |
Let's try a CI rebuild to see where we currently are with this. |
Yes, there is a remaining compilation failure. @cpt1gl0 , are you interested in getting back to finish this up? |
@cpt1gl0 I hope you don't mind, I took the liberty of resolving the simple issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good, but the overall proposal is incomplete. Changes to the ping thread need a Jira ticket, even if they seem to maintain the existing behavior. It would be good to have more information on what this solves and how it will be used.
It may still make sense to accept this change, but it would help to have better background information.
The changes are intended to solve issues where a node is disconnected because of a single failing ping. |
@cpt1gl0, thanks for getting back to this. Builds are passing now so we're in better shape. Could you clarify a few things about your experience with this change, please? Are you observing a problem that is fixed by this PR? In other words, are you experiencing an issue "where a node is disconnected because of a single failing ping"? If so, is this PR correcting the problem? (It will help in accepting this PR if we have a confirmed fix and more testing, though we can still consider it as an improvement without that.) Could you submit a Jira about the problem this addresses and link it to this PR? |
@jeffret-b I just can tell you that at my company, where we use Jenkins as an integral part of our build environment, we experience problems with disconnecting nodes when the build system is under heavy load. Usually those disconnections are caused by failing pings from ping thread. From that perspective, I think it would make sense to make the ping thread more fault tolerant, and allow for single failing pings without disconnect. |
@cpt1gl0, have you been able to try out your improvement, perhaps in a test environment, to see that it improves the situation? If we can get some real-world success with this change I'm less concerned about the risk of introducing it. |
@jeffret-b I can not give you any hard facts. Personally, I think it's a good idea to introduce that change, but it is for you to decide. |
I'll see about getting this in some future release. If anyone else would like to review or even better try it out that be very appreciated. |
Normally the ping thread is the symptom not the cause. |
I agree it is desirable to have it confirmed that a thread reporting a ping failure will actually recover and later operates fine. |
A totally agree and I think this is not for discussion. In our case usually a system under heavy load is the original cause, and yes, a failing ping thread is the symptom. For me it's not a question of cause and effect, but more about reasonable measures to take. Does it make sense to disconnect a node after a single ping failure? What's the gain? Why not be a bit more fault tolerant and leave the possibility for the system to recover without any failing builds? If pings keep failing the node would still be disconnected eventually and possible issues will not go unnoticed. Personally I would prefer a system which does not immediately chop off one of its legs because of a numb toe. |
Not the topic, but still: Consider using Pipeline. |
@cpt1gl0 , if you have any experience using this change that results in better stability or results, that would help to move it forward. I agree that it would be better to be more resilient. |
I don't know, but perhaps this could help https://issues.jenkins.io/browse/JENKINS-50730 |
@felipecrs In what way does that help? Or how is that issue related to the ping thread? |
@jeffret-b, I'm sorry, thinking about it now, I think it was a mistake from my side. Please disregard. |
The PR seems to be solving a problem we see as well. |
@cpt1gl0 Did you want me to open a jenkins ticket for this? Or could you? |
To expand on #383 (comment), a Pipeline build should recover automatically despite a node disconnection. There are three cases:
Given these resilience modes, and the fact that you can already configure a longer ping timeout, I am not sure what is gained by retrying a failed ping. |
Ping repeatedly from ping thread before assuming error to avoid unnecessary node disconnects.