Implement service health check in Antrea-agent #4120
Conversation
CC: @jianjuns @hongliangl for any early comments on the general direction/approach taken here.
Codecov Report
@@            Coverage Diff             @@
##             main    #4120      +/-   ##
==========================================
- Coverage   65.70%   56.33%    -9.38%
==========================================
  Files         304      371       +67
  Lines       46604    53558     +6954
==========================================
- Hits        30621    30170      -451
- Misses      13557    21037     +7480
+ Partials     2426     2351       -75

This pull request uses carry forward flags.
pkg/agent/proxy/proxier.go (outdated)

    p.removeStaleServices()
    p.installServices()
    p.removeStaleEndpoints()

    if p.serviceHealthServer != nil {
        if err := p.serviceHealthServer.SyncServices(serviceUpdateResult.HCServiceNodePorts); err != nil {
            klog.ErrorS(err, "Error syncing healthcheck services")
FYI - we use CamelCase for resource/CRD kinds, so it should be "Services", "Endpoints".
Done
Thanks for working on this. I have not done a detailed review yet, but the approach looks good to me after a quick look.
Force-pushed from 8752281 to 587ca05
@shettyg thanks for the PR. I have two comments, otherwise LGTM.
"sync" | ||
|
||
"github.com/lithammer/dedent" | ||
"k8s.io/klog" |
Antrea uses klog/v2; could you change it to "k8s.io/klog/v2" so the patch won't add "k8s.io/klog v1.0.0" back in go.mod? The upstream code has also switched to v2: https://github.com/kubernetes/kubernetes/blob/58c10aa6eb5adfb1f3aa4d6cb898b8c347ba9e72/pkg/proxy/healthcheck/service_health.go#L28
I decided to import healthcheck from the 1.24.4 tree, as it uses klog/v2 and supports dual-stack.
    addr := fmt.Sprintf(":%d", port)
    svc.server = hcs.httpFactory.New(addr, hcHandler{name: nsn, hcs: hcs})
I think this doesn't work in dual-stack clusters, as the same address and port will be opened twice; the second attempt would fail and cause error logs. Should we pass an ipv6 boolean or a bindAddress string when calling newServiceHealthServer, so that each stack of the Proxier only listens on its own address?
Thanks for catching this. I decided to import healthcheck from the 1.24.4 tree, as it supports dual-stack. Please take a look again.
Overall LGTM, maybe you could add an e2e test to verify the feature.
@tnqn @hongliangl PTAL again. (I am trying to see what is the best way to add a unit test here, making myself familiar with the types of tests that currently exist.)
pkg/agent/proxy/proxier.go (outdated)

    nodePortAddressesString := make([]string, len(nodePortAddresses))
    for i, address := range nodePortAddresses {
        if isIPv6 {
            nodePortAddressesString[i] = fmt.Sprintf("%s/128", address.String())
Not sure whether passing /128 is fine. I will test this.
pkg/agent/proxy/proxier.go (outdated)

    nodePortAddressesString := make([]string, len(nodePortAddresses))
    for i, address := range nodePortAddresses {
        if isIPv6 {
            nodePortAddressesString[i] = fmt.Sprintf("%s/128", address.String())
        } else {
            nodePortAddressesString[i] = fmt.Sprintf("%s/32", address.String())
        }
There is some difference between the nodePortAddresses argument of Antrea's proxy and kube-proxy's: the former is already the set of concrete Node IPs that this Node is configured with:

antrea/cmd/antrea-agent/agent.go, lines 215 to 221 in 52db699:

    var nodePortAddressesIPv4, nodePortAddressesIPv6 []net.IP
    if o.config.AntreaProxy.ProxyAll {
        nodePortAddressesIPv4, nodePortAddressesIPv6, err = getAvailableNodePortAddresses(o.config.AntreaProxy.NodePortAddresses, append(excludeNodePortDevices, o.config.HostGateway))
        if err != nil {
            return fmt.Errorf("getting available NodePort IP addresses failed: %v", err)
        }
    }
Since the service health server actually needs the former, I think we could pass the slice of Node IP strings to it directly, instead of passing the slice of CIDRs and calling utilproxy.GetNodeAddresses(nodePortAddresses, utilproxy.RealNetwork{}) to get the concrete Node IPs again, which should just return the same results. In this way, the changes to util/network.go and util/utils.go are not needed.
There will be a small defect: unlike kube-proxy, which listens on "0.0.0.0"/"::" when the user-provided nodeAddresses is empty, Antrea Proxy will always listen on concrete Node IPs, because the nodePortAddresses argument is used for other purposes that need the IPs to be concrete ones (for checking whether a packet is accessing a Node IP). But this shouldn't affect the functionality of the service health check, so it should be OK at the moment.
Thanks for the insight!
Since most of the implementation code of the service health server is from K8s, which should already be verified, maybe you could add a verification of the service health port to the existing Antrea e2e test for a LoadBalancer Service with Local externalTrafficPolicy, so that we can check for the expected output by accessing the service health port of each Node: lines 228 to 230 in 7afa4fb.
These APIs will be used in an upcoming commit to provide a service health check (to support externalTrafficPolicy: Local for a k8s service). Signed-off-by: Guru Shetty <[email protected]>
When services are created with "externalTrafficPolicy: Local", a "healthCheckNodePort" is created in the k8s service object. kube-proxy in turn will listen on this port and answer queries on http://0.0.0.0:healthCheckNodePort/healthz. In kube-proxy replacement mode, antrea does not support this feature. This becomes more important for Windows support, as userspace kube-proxy is being deprecated. This commit implements this feature in Antrea and is inspired by the same feature in upstream kube-proxy, with some differences. The predominant difference is that upstream kube-proxy goes through all endpoints of the cluster in each iteration of endpoints:update() to find the local endpoints. This has been changed here to only look for changed endpoints. Signed-off-by: Guru Shetty <[email protected]>
@tnqn I added an e2e test and updated the service health library to reflect your suggestions. PTAL again.
LGTM, thanks @shettyg
@jianjuns @hongliangl do you have further comments?
/test-all
No further comment from me.
LGTM
antrea-agent: Implement service health check.
When services are created with "externalTrafficPolicy: Local",
a "healthCheckNodePort" is created in the k8s service object.
kube-proxy in turn will listen on this port and answer queries
on http://0.0.0.0:healthCheckNodePort/healthz. In kube-proxy
replacement mode, antrea does not support this feature. This
becomes more important for Windows support as userspace
kube-proxy is being deprecated.
This commit implements this feature in Antrea and is inspired
by the same feature in upstream kube-proxy with some differences.
The predominant difference is that upstream kube-proxy goes
through all endpoints of cluster in each iteration of
endpoints:update() to find the local endpoints. This has been
changed here to only look for changed endpoints.