Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve logs when there is a timeout error #1946

Merged
merged 2 commits into from
Mar 18, 2021
Merged

Conversation

hty690
Copy link
Contributor

@hty690 hty690 commented Mar 11, 2021

The agent logs will print the failure message when the connection to
the controller is timeout. Also adding HealthCheck to remind
the users of the connectivity issue.

Fixes #822

@vmwclabot
Copy link

@hty690, you must sign our contributor license agreement before your changes are merged. Click here to sign the agreement. If you are a VMware employee, read this for further instruction.

@vmwclabot
Copy link

@hty690, your company's legal contact has approved your signed contributor license agreement. It will also be reviewed by VMware, but the merge can proceed.

@@ -576,6 +576,11 @@ func (w *watcher) watch() {
klog.Warningf("Failed to start watch for %s: %v", w.objectType, err)
return
}
_, ok := watcher.(*watch.StreamWatcher)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment for this handling for others to understand.

The unit test failure failed because of this. It should be because the fake client used in test is not a StreamWatcher. Maybe your original approach that compares the type of watch with watch.NewEmptyWatch() can resolve this but it looks a little implementation intrusive (maybe current approach is same).

@antoninbas what's your opinion here? The context is, the Watch method does not return error if it's a timeout error or "ProbableEOF" and return watch.NewEmptyWatch() instead, that's why you saw no log about connection problem in the issue. See https://github.com/kubernetes/kubernetes/blob/823fa75643bdc352a51cbbc83b55420734832eda/staging/src/k8s.io/client-go/rest/request.go#L702-L709

}
return fmt.Errorf("Some watchers may not be connected")
})
apiServer.GenericAPIServer.AddHealthChecks(check)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HealthCheck will be added to livezChecks too, what the issue asked was watcher's connection state should be reflected in readiness probe.

@tnqn
Copy link
Member

tnqn commented Mar 11, 2021

@hty690 please add description to the commit message and PR description about what else the PR does, e.g. improving logs when there is a timeout error, and don't add trailing period to the title of commit message and PR. See https://github.com/golang/go/wiki/CommitMessage.

@hty690 hty690 changed the title Add readiness probe for the agent. Improve logs when there is a timeout error Mar 11, 2021
@codecov-io
Copy link

codecov-io commented Mar 11, 2021

Codecov Report

Merging #1946 (a3081a6) into main (e80ab3b) will increase coverage by 3.13%.
The diff coverage is 62.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1946      +/-   ##
==========================================
+ Coverage   64.35%   67.48%   +3.13%     
==========================================
  Files         193      195       +2     
  Lines       16967    16882      -85     
==========================================
+ Hits        10919    11393     +474     
+ Misses       4899     4305     -594     
- Partials     1149     1184      +35     
Flag Coverage Δ
e2e-tests 48.40% <62.50%> (?)
kind-e2e-tests 56.01% <62.50%> (+0.80%) ⬆️
unit-tests 41.89% <0.00%> (+0.24%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ntroller/networkpolicy/networkpolicy_controller.go 70.03% <0.00%> (-0.70%) ⬇️
pkg/agent/apiserver/apiserver.go 69.11% <100.00%> (+1.93%) ⬆️
pkg/apiserver/certificate/cacert_controller.go 55.83% <0.00%> (-10.01%) ⬇️
pkg/apiserver/apiserver.go 87.25% <0.00%> (-1.97%) ⬇️
pkg/agent/agent_linux.go 100.00% <0.00%> (ø)
pkg/controller/traceflow/controller.go 69.48% <0.00%> (ø)
pkg/agent/openflow/pipeline_other.go 66.66% <0.00%> (ø)
pkg/agent/openflow/client_other.go 0.00% <0.00%> (ø)
pkg/agent/openflow/network_policy.go 76.31% <0.00%> (+0.59%) ⬆️
pkg/agent/cniserver/pod_configuration.go 54.68% <0.00%> (+0.78%) ⬆️
... and 26 more

@vmwclabot
Copy link

@hty690, VMware has approved your signed contributor license agreement.

@@ -576,7 +576,12 @@ func (w *watcher) watch() {
klog.Warningf("Failed to start watch for %s: %v", w.objectType, err)
return
}

//Make sure that watcher is not the type of watch.NewEmptyWatch()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//Make sure -> // Makesure

@hty690 hty690 force-pushed the log_issue branch 2 times, most recently from 4f74539 to e8b73e2 Compare March 16, 2021 07:38
@@ -584,6 +585,11 @@ func (w *watcher) watch() {
klog.Warningf("Failed to start watch for %s: %v", w.objectType, err)
return
}
// Makesure that watcher is not the type of watch.NewEmptyWatch()
if reflect.TypeOf(watcher) == reflect.TypeOf(watch.NewEmptyWatch()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declare the latter as a variable to avoid unnecessary construction of emptyWatch everytime we need to compare.

if npq.GetControllerConnectionStatus() {
return nil
}
return fmt.Errorf("Some watchers may not be connected")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returned error shouldn't be capitalized

@@ -584,6 +585,11 @@ func (w *watcher) watch() {
klog.Warningf("Failed to start watch for %s: %v", w.objectType, err)
return
}
// Makesure that watcher is not the type of watch.NewEmptyWatch()
if reflect.TypeOf(watcher) == reflect.TypeOf(watch.NewEmptyWatch()) {
klog.Warningf("Failed to start watch for %s. Something wrong with the connection?", w.objectType)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
klog.Warningf("Failed to start watch for %s. Something wrong with the connection?", w.objectType)
klog.Warningf("Failed to start watch for %s. Please ensure antrea service is reachable for the agent", w.objectType)

@@ -97,6 +99,14 @@ func New(aq agentquerier.AgentQuerier, npq querier.AgentNetworkPolicyInfoQuerier
if err != nil {
return nil, err
}
// Add readiness probe to check the status of watchers.
check := healthz.NamedCheck("watcher", func(_ *http.Request) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move this to newConfig given that function initializes the whole config?

The agent logs will print the failure message when the connection to
the controller is timeout. Also adding readiness probe to remind
the users of the connectivity issue.

Fixes antrea-io#822
tnqn
tnqn previously approved these changes Mar 17, 2021
Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks.
@antoninbas could you check if this PR addesses the original issue?

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to test this and it meets expectations, except for the readiness probe failing with reason withheld which I am unsure about.

Comment on lines +160 to +166
// Add readiness probe to check the status of watchers.
check := healthz.NamedCheck("watcher", func(_ *http.Request) error {
if npq.GetControllerConnectionStatus() {
return nil
}
return fmt.Errorf("some watchers may not be connected")
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when I test the probe, I get [-]watcher failed: reason withheld

reason withheld is not what I'd expect here, shouldn't it say some watchers may not be connected?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Mar 18, 2021

/test-all

@tnqn tnqn merged commit c086437 into antrea-io:main Mar 18, 2021
@hty690 hty690 deleted the log_issue branch March 30, 2021 02:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better error logs for Agent - Controller connectivity issues
6 participants