Implement service health check in Antrea-agent #4120

Merged
merged 2 commits into from
Sep 1, 2022

Contributor

@shettyg shettyg commented Aug 16, 2022

antrea-agent: Implement service health check.

When Services are created with "externalTrafficPolicy: Local",
a "healthCheckNodePort" is created in the K8s Service object.
kube-proxy in turn will listen on this port and answer queries
on "http://0.0.0.0:healthCheckNodePort/healthz". In kube-proxy
replacement mode, Antrea does not support this feature. This
becomes more important for Windows support as userspace
kube-proxy is being deprecated.
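
For readers unfamiliar with the health check endpoint, here is a minimal sketch of a handler that answers on /healthz for one Service. It is not the code from this PR. The JSON field names and the 200/503 convention are assumptions modeled loosely on upstream kube-proxy behavior, and `healthzHandler` and `probe` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthResponse is an assumed response shape, loosely modeled on kube-proxy.
type healthResponse struct {
	Service struct {
		Namespace string `json:"namespace"`
		Name      string `json:"name"`
	} `json:"service"`
	LocalEndpoints int `json:"localEndpoints"`
}

// healthzHandler (hypothetical) reports how many local endpoints a Service has.
// It answers 200 when at least one local endpoint exists and 503 otherwise.
func healthzHandler(namespace, name string, localEndpoints func() int) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		count := localEndpoints()
		var resp healthResponse
		resp.Service.Namespace = namespace
		resp.Service.Name = name
		resp.LocalEndpoints = count
		w.Header().Set("Content-Type", "application/json")
		if count == 0 {
			w.WriteHeader(http.StatusServiceUnavailable)
		} else {
			w.WriteHeader(http.StatusOK)
		}
		_ = json.NewEncoder(w).Encode(resp)
	})
}

// probe sends one in-memory GET /healthz request and returns the status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	return rec.Code
}

func main() {
	fmt.Println(probe(healthzHandler("default", "web", func() int { return 2 }))) // 200
	fmt.Println(probe(healthzHandler("default", "web", func() int { return 0 }))) // 503
}
```

An external load balancer can use the status code alone to decide whether a Node should receive traffic for the Service.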

This commit implements this feature in Antrea and is inspired
by the same feature in upstream kube-proxy, with some differences.
The predominant difference is that upstream kube-proxy goes
through all endpoints of the cluster in each iteration of
endpoints:update() to find the local endpoints. This has been
changed here to only look at changed endpoints.
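
The idea of only looking at changed endpoints can be sketched as follows. This is an illustration under assumptions, not the data structures used in the PR: `endpointDelta` and `applyDeltas` are hypothetical names, and Services are keyed by a plain "namespace/name" string.

```go
package main

import "fmt"

// endpointDelta (hypothetical) records how the number of local endpoints
// of one Service changed in a single update.
type endpointDelta struct {
	service string // "namespace/name"
	local   int    // change in local endpoint count, may be negative
}

// applyDeltas updates the per-Service local endpoint counts in place.
// Only Services named in deltas are touched; all other entries are left alone,
// so the cost is proportional to the number of changes, not to cluster size.
func applyDeltas(counts map[string]int, deltas []endpointDelta) {
	for _, d := range deltas {
		n := counts[d.service] + d.local
		if n <= 0 {
			// No local endpoints remain for this Service.
			delete(counts, d.service)
			continue
		}
		counts[d.service] = n
	}
}

func main() {
	counts := map[string]int{"default/web": 1, "default/db": 2}
	applyDeltas(counts, []endpointDelta{
		{service: "default/web", local: 1},
		{service: "default/db", local: -2},
	})
	fmt.Println(counts)
}
```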

@shettyg
Contributor Author

shettyg commented Aug 16, 2022

CC: @jianjuns @hongliangl for any early comments on the general direction/approach taken here.

@codecov

codecov bot commented Aug 16, 2022

Codecov Report

Merging #4120 (0b95eee) into main (d1c6a43) will decrease coverage by 9.37%.
The diff coverage is 71.03%.

❗ Current head 0b95eee differs from pull request most recent head faa2f6b. Consider uploading reports for the commit faa2f6b to get more accurate results


@@            Coverage Diff             @@
##             main    #4120      +/-   ##
==========================================
- Coverage   65.70%   56.33%   -9.38%     
==========================================
  Files         304      371      +67     
  Lines       46604    53558    +6954     
==========================================
- Hits        30621    30170     -451     
- Misses      13557    21037    +7480     
+ Partials     2426     2351      -75     
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| e2e-tests | 39.40% <28.46%> (?) | |
| integration-tests | 34.97% <49.35%> (+0.02%) ⬆️ | Carriedforward from 3714619 |
| kind-e2e-tests | 43.05% <48.52%> (-6.30%) ⬇️ | Carriedforward from 3714619 |
| unit-tests | 40.27% <51.96%> (-4.11%) ⬇️ | Carriedforward from 3714619 |

*This pull request uses carry forward flags.

| Impacted Files | Coverage Δ |
|---|---|
| cmd/antrea-agent/agent.go | 0.00% <0.00%> (ø) |
| cmd/antrea-agent/options.go | 20.83% <0.00%> (ø) |
| cmd/antrea-controller/controller.go | 0.00% <0.00%> (ø) |
| cmd/antrea-controller/options.go | 0.00% <0.00%> (ø) |
| cmd/flow-aggregator/flow-aggregator.go | 0.00% <0.00%> (ø) |
| ...icluster/controllers/multicluster/common/helper.go | 85.29% <ø> (+27.29%) ⬆️ |
| ...uster/commonarea/acnp_resourceimport_controller.go | 0.00% <0.00%> (-64.71%) ⬇️ |
| ...llers/multicluster/leader_clusterset_controller.go | 0.00% <ø> (-56.04%) ⬇️ |
| pkg/agent/agent.go | 54.43% <ø> (+4.09%) ⬆️ |
| pkg/agent/apiserver/apiserver.go | 67.02% <0.00%> (ø) |
... and 267 more


p.removeStaleServices()
p.installServices()
p.removeStaleEndpoints()

if p.serviceHealthServer != nil {
	if err := p.serviceHealthServer.SyncServices(serviceUpdateResult.HCServiceNodePorts); err != nil {
		klog.ErrorS(err, "Error syncing healthcheck services")
Contributor

FYI - we use CamelCase for resource/CRD kinds, so it should be "Services", "Endpoints".

Contributor Author

Done

@jianjuns jianjuns added the area/proxy Issues or PRs related to proxy functions in Antrea label Aug 17, 2022
@jianjuns
Contributor

Thanks for working on this. I have not done detailed review yet, but the approach looks good to me after a quick look.

@shettyg shettyg force-pushed the main branch 2 times, most recently from 8752281 to 587ca05 Compare August 17, 2022 22:09
Member

@tnqn tnqn left a comment

@shettyg thanks for the PR. I have two comments, otherwise LGTM.

"sync"

"github.com/lithammer/dedent"
"k8s.io/klog"
Member

Antrea uses klog/v2. Could you change it to "k8s.io/klog/v2" so the patch won't add "k8s.io/klog v1.0.0" back to go.mod? The upstream code has also switched to v2: https://github.com/kubernetes/kubernetes/blob/58c10aa6eb5adfb1f3aa4d6cb898b8c347ba9e72/pkg/proxy/healthcheck/service_health.go#L28

Contributor Author

I decided to import healthcheck from the 1.24.4 tree, as it uses klog/v2 and supports dual-stack.

Comment on lines 117 to 118
addr := fmt.Sprintf(":%d", port)
svc.server = hcs.httpFactory.New(addr, hcHandler{name: nsn, hcs: hcs})
Member

I think this doesn't work in dual-stack clusters, as the same address and port will be opened twice: the second attempt would fail and cause error logs.

Should we pass an ipv6 boolean or a bindAddress string when calling newServiceHealthServer, so each stack of Proxier only listens on its own address?
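
A minimal sketch of the per-family binding idea, assuming each proxier instance knows whether it serves IPv4 or IPv6. `listenTarget` is a hypothetical helper, not a function from Antrea or kube-proxy. It only computes the arguments that would be handed to `net.Listen`.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// listenTarget (hypothetical) returns the network and address that a
// health server for one IP family would bind to. Using "tcp4" or "tcp6"
// instead of plain "tcp" keeps the two stacks from opening the same
// wildcard socket twice. An empty host means "all addresses of that family".
func listenTarget(isIPv6 bool, host string, port int) (network, addr string) {
	network = "tcp4"
	if isIPv6 {
		network = "tcp6"
	}
	// JoinHostPort adds the brackets that IPv6 literals need.
	return network, net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(listenTarget(false, "", 30000))
	fmt.Println(listenTarget(true, "", 30000))
	fmt.Println(listenTarget(true, "fd00::1", 30000))
}
```

The returned pair could then be passed as `net.Listen(network, addr)` by each stack's server.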

Contributor Author

Thanks for catching this. I decided to import healthcheck from the 1.24.4 tree, as it supports dual-stack. Please take another look.

@tnqn tnqn added this to the Antrea v1.9 release milestone Aug 18, 2022
@tnqn tnqn added the action/release-note Indicates a PR that should be included in release notes. label Aug 18, 2022
Contributor

@hongliangl hongliangl left a comment

Overall LGTM. Maybe you could add an e2e test to verify the feature?

third_party/proxy/healthcheck/service_health.go Outdated
@shettyg
Contributor Author

shettyg commented Aug 24, 2022

@tnqn @hongliangl PTAL again (I am trying to see what is the best way to add a unit test here, making myself familiar with the type of tests that currently exist)

nodePortAddressesString := make([]string, len(nodePortAddresses))
for i, address := range nodePortAddresses {
	if isIPv6 {
		nodePortAddressesString[i] = fmt.Sprintf("%s/128", address.String())
Contributor Author

Not sure whether passing /128 is fine. I will test this.

Comment on lines 995 to 1001
nodePortAddressesString := make([]string, len(nodePortAddresses))
for i, address := range nodePortAddresses {
	if isIPv6 {
		nodePortAddressesString[i] = fmt.Sprintf("%s/128", address.String())
	} else {
		nodePortAddressesString[i] = fmt.Sprintf("%s/32", address.String())
	}
Member

There is a difference between the nodePortAddresses argument of Antrea's proxy and that of kube-proxy. The former already holds the concrete Node IPs that this Node is configured with:

var nodePortAddressesIPv4, nodePortAddressesIPv6 []net.IP
if o.config.AntreaProxy.ProxyAll {
	nodePortAddressesIPv4, nodePortAddressesIPv6, err = getAvailableNodePortAddresses(o.config.AntreaProxy.NodePortAddresses, append(excludeNodePortDevices, o.config.HostGateway))
	if err != nil {
		return fmt.Errorf("getting available NodePort IP addresses failed: %v", err)
	}
}

The latter holds the user-provided CIDRs that are used to filter concrete Node IPs.

Since the service health server actually needs the former, I think we could pass the slice of Node IP strings to it directly, instead of passing the slice of CIDRs and calling utilproxy.GetNodeAddresses(nodePortAddresses, utilproxy.RealNetwork{}) to get the concrete Node IPs again, which should just return the same results.

This way, the changes to util/network.go and util/utils.go are not needed.
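
To illustrate why the CIDR detour adds nothing when the input is already a concrete IP, here is a small sketch using only the Go standard library. `hostCIDR` and `roundTrip` are hypothetical helpers written for this illustration. They do not call any Antrea or kube-proxy function.

```go
package main

import (
	"fmt"
	"net"
)

// hostCIDR (hypothetical) wraps a concrete IP in a single-host CIDR,
// /32 for IPv4 and /128 for IPv6, as the diff under review does.
func hostCIDR(ipStr string) (string, error) {
	ip := net.ParseIP(ipStr)
	if ip == nil {
		return "", fmt.Errorf("invalid IP %q", ipStr)
	}
	bits := 32
	if ip.To4() == nil {
		bits = 128
	}
	return fmt.Sprintf("%s/%d", ip, bits), nil
}

// roundTrip converts an IP to a host CIDR and parses it back again.
// For a concrete address the result equals the input, which is why the
// IP string can be passed to the health server directly.
func roundTrip(ipStr string) (string, error) {
	cidr, err := hostCIDR(ipStr)
	if err != nil {
		return "", err
	}
	ip, _, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", err
	}
	return ip.String(), nil
}

func main() {
	for _, s := range []string{"10.0.0.1", "fd00::1"} {
		got, err := roundTrip(s)
		fmt.Println(s, got, err)
	}
}
```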

Member

@tnqn tnqn Aug 25, 2022

There will be a small defect: unlike kube-proxy, which listens on "0.0.0.0"/"::0" when the user-provided nodeAddresses is empty, Antrea proxy will always listen on concrete Node IPs, because the nodePortAddresses argument is also used for other purposes that need concrete IPs (for checking whether a packet is accessing a Node IP). But it shouldn't affect the functionality of the service health check, so it should be OK at the moment.

Contributor Author

Thanks for the insight!

@tnqn
Member

tnqn commented Aug 25, 2022

> @tnqn @hongliangl PTAL again (I am trying to see what is the best way to add a unit test here, making myself familiar with the type of tests that currently exist)

Since most of the implementation code of the service health server comes from K8s and should already be verified, maybe you could add a check of the service health port to the existing Antrea e2e test for LB Services with Local ExternalTrafficPolicy, verifying that we get the expected output when accessing the service health port of each Node:

t.Run("ExternalTrafficPolicy:Local/Client:Node", func(t *testing.T) {
	testLoadBalancerLocalFromNode(t, data, nodes, localUrl, hostnames)
})
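
The kind of per-Node check suggested above could look roughly like the sketch below. It is not the e2e test added in this PR. `checkNodeHealth` is a hypothetical helper, and the "localEndpoints" JSON field and the 200/503 convention are assumptions based on upstream kube-proxy behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// healthBody holds the one field this check reads from the health response.
// The field name is an assumption based on upstream kube-proxy.
type healthBody struct {
	LocalEndpoints int `json:"localEndpoints"`
}

// checkNodeHealth (hypothetical) validates the response fetched from one
// Node's health check port against the number of local endpoints the test
// expects on that Node.
func checkNodeHealth(status int, body string, wantLocal int) error {
	var hb healthBody
	if err := json.Unmarshal([]byte(body), &hb); err != nil {
		return fmt.Errorf("decoding health response: %w", err)
	}
	if hb.LocalEndpoints != wantLocal {
		return fmt.Errorf("got %d local endpoints, want %d", hb.LocalEndpoints, wantLocal)
	}
	// Nodes without local endpoints are expected to report "unavailable".
	wantStatus := http.StatusOK
	if wantLocal == 0 {
		wantStatus = http.StatusServiceUnavailable
	}
	if status != wantStatus {
		return fmt.Errorf("got HTTP status %d, want %d", status, wantStatus)
	}
	return nil
}

func main() {
	fmt.Println(checkNodeHealth(200, `{"localEndpoints": 1}`, 1))
	fmt.Println(checkNodeHealth(200, `{"localEndpoints": 0}`, 0))
}
```

In a real e2e test the status and body would come from an HTTP request to each Node's healthCheckNodePort.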

Guru Shetty added 2 commits August 30, 2022 16:44
These APIs will be used in an upcoming commit to provide
service health check (to support externalTrafficPolicy: Local
for a k8s service).

Signed-off-by: Guru Shetty <[email protected]>
When Services are created with "externalTrafficPolicy: Local",
a "healthCheckNodePort" is created in the K8s Service object.
kube-proxy in turn will listen on this port and answer queries
on "http://0.0.0.0:healthCheckNodePort/healthz". In kube-proxy
replacement mode, Antrea does not support this feature. This
becomes more important for Windows support as userspace
kube-proxy is being deprecated.

This commit implements this feature in Antrea and is inspired
by the same feature in upstream kube-proxy, with some differences.
The predominant difference is that upstream kube-proxy goes
through all endpoints of the cluster in each iteration of
endpoints:update() to find the local endpoints. This has been
changed here to only look at changed endpoints.

Signed-off-by: Guru Shetty <[email protected]>
@shettyg
Contributor Author

shettyg commented Aug 30, 2022

@tnqn I added an e2e test and updated the service health library to reflect your suggestions. PTAL again.

Member

@tnqn tnqn left a comment

LGTM, thanks @shettyg

@tnqn
Member

tnqn commented Aug 31, 2022

/skip-all (typed by mistake)

@tnqn
Member

tnqn commented Aug 31, 2022

@jianjuns @hongliangl do you have further comments?

@tnqn
Member

tnqn commented Aug 31, 2022

/test-all

@jianjuns
Contributor

> @jianjuns @hongliangl do you have further comments?

No further comment from me.

Contributor

@hongliangl hongliangl left a comment

LGTM

@tnqn tnqn merged commit 7cf5277 into antrea-io:main Sep 1, 2022
heanlan pushed a commit to heanlan/antrea that referenced this pull request Mar 29, 2023