Fix websocket connection retries logic #771

Merged

Conversation

Vlatombe
Member

We have observed cases where the websocket connection would fail temporarily with a handshake error that was not retried, causing the agent to fail.

Also, in CloudBees CI HA, there are cases where an agent is disconnected on purpose and we expect a reconnect attempt as soon as possible. The default 10-second delay appeared too long in that case.

To reconcile this use case with the typical failure scenario, I have implemented exponential backoff for retries (immediate, then 1, 3, 7, and 10 seconds).
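
For readers following the schedule, here is a minimal sketch of one recurrence that reproduces the listed delays. It assumes each delay is `min(maxDelay, previous * factor + increment)`, with the values taken from the `ExponentialRetry` constructor call quoted in the review thread (initial delay 0, factor 2, increment 1 second, maximum 10 seconds). The class and method names are hypothetical, and this is not necessarily how `Engine.java` computes the delay.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not the code from this PR.
class BackoffSchedule {
    /**
     * Returns the first {@code attempts} delays, assuming the recurrence
     * next = min(maxDelay, current * factor + increment).
     */
    static List<Duration> delays(
            Duration initial, int factor, Duration increment, Duration maxDelay, int attempts) {
        List<Duration> result = new ArrayList<>();
        Duration current = initial;
        for (int i = 0; i < attempts; i++) {
            result.add(current);
            Duration next = current.multipliedBy(factor).plus(increment);
            // Cap the delay so retries never wait longer than maxDelay.
            current = next.compareTo(maxDelay) > 0 ? maxDelay : next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Duration> schedule =
                delays(Duration.ZERO, 2, Duration.ofSeconds(1), Duration.ofSeconds(10), 6);
        for (Duration d : schedule) {
            System.out.println(d.getSeconds() + "s");
        }
    }
}
```

Under that assumption the first attempt is immediate, which covers the intentional-disconnect case, while repeated failures settle at the 10-second cap.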

Testing done

  • WS: Initial connection
  • WS: Applying a rolling restart to the controller. The previously fatal handshake error is now retriable and prints a single log line

Submitter checklist

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests that demonstrate the feature works or the issue is fixed

@Vlatombe Vlatombe requested a review from jglick October 24, 2024 12:01
@Vlatombe
Member Author

I'm also considering a bigger refactoring (to be filed in a separate PR) to reduce duplication across the inbound TCP and websocket connection flows and make them more similar.

        return true;
    }
} catch (Exception x) {
    events.status("Failed to connect: " + x.getMessage());
Member

We are losing the stack trace here; is that potentially important for diagnosis?

Member Author

In my experience these stack traces are noisy. I could also log them at FINE with the standard JUL logger in parallel, but I don't really see the point of keeping them in the status.
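
The option mentioned here (a short status line plus the full stack trace at FINE) could look roughly like the sketch below. `StatusSink` and `reportFailure` are hypothetical stand-ins for the listener behind `events.status(...)`; only `java.util.logging` calls are real API, and this is not the code that was merged.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical illustration; names other than the JUL classes are assumptions.
class ConnectFailureReporting {
    private static final Logger LOGGER =
            Logger.getLogger(ConnectFailureReporting.class.getName());

    /** Stand-in for whatever receives events.status(...) in the real code. */
    interface StatusSink {
        void status(String message);
    }

    static void reportFailure(StatusSink events, Exception x) {
        // A single concise line for the status channel, without the stack trace.
        events.status("Failed to connect: " + x.getMessage());
        // The full stack trace is still available when FINE logging is enabled.
        LOGGER.log(Level.FINE, "Failed to connect", x);
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        reportFailure(seen::add, new IOException("handshake error"));
        System.out.println(seen);
    }
}
```

With the default JUL configuration FINE records are not printed, so the stack trace only shows up when someone raises the log level for diagnosis.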

    final Duration maxDelay;

    ExponentialRetry(Duration timeout) {
        this(Duration.ofSeconds(0), timeout, 2, Duration.ofSeconds(1), Duration.ofSeconds(10));
Member

Compare the user-configurable version in #676.

Member Author

Jitter could be useful, but exposing these as user-level settings seems overkill to me.

Member

Agreed, just noting for reference.
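
For reference on the jitter idea raised in this thread: a minimal sketch of what randomizing a retry delay could look like. Nothing in this conversation indicates that this PR adds jitter; `JitterSketch`, `withJitter`, and the 20% fraction are hypothetical and only illustrate the general technique of spreading reconnects so that many agents do not retry at the same instant.

```java
import java.time.Duration;
import java.util.Random;

// Hypothetical illustration of delay jitter; not part of this PR.
class JitterSketch {
    /**
     * Scales {@code base} by a uniformly random factor in
     * [1 - fraction, 1 + fraction). A zero base stays zero.
     */
    static Duration withJitter(Duration base, double fraction, Random random) {
        if (fraction < 0 || fraction > 1) {
            throw new IllegalArgumentException("fraction must be between 0 and 1");
        }
        double factor = 1.0 + (random.nextDouble() * 2.0 - 1.0) * fraction;
        return Duration.ofMillis(Math.round(base.toMillis() * factor));
    }

    public static void main(String[] args) {
        Random random = new Random();
        // A 10-second delay with 20% jitter lands between 8 and 12 seconds.
        System.out.println(withJitter(Duration.ofSeconds(10), 0.2, random).toMillis() + " ms");
    }
}
```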

src/main/java/hudson/remoting/Engine.java Outdated (resolved)
src/main/java/hudson/remoting/Engine.java Outdated (resolved)
@jglick jglick added the bug For changelog: Fixes a bug. label Oct 24, 2024
src/main/java/hudson/remoting/Engine.java Outdated (resolved)
Co-authored-by: Jesse Glick <[email protected]>
@Vlatombe Vlatombe changed the title from "Refactor websocket connection logic" to "Improve websocket connection retries logic" Nov 5, 2024
@Vlatombe Vlatombe changed the title from "Improve websocket connection retries logic" to "Fix websocket connection retries logic" Nov 5, 2024
@Vlatombe Vlatombe merged commit 92c105e into jenkinsci:master Nov 5, 2024
14 checks passed
@Vlatombe Vlatombe deleted the websocket-connection-logic-rework branch November 5, 2024 10:14