Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows support for Flow Exporter with Flow Aggregator #2138

Merged
merged 1 commit into from
May 21, 2021

Conversation

dreamtalen
Copy link
Contributor

@dreamtalen dreamtalen commented Apr 29, 2021

In this commit, we fix the error when running Flow Exporter on Windows node with Flow Aggregator.
There is a limitation in DNS resolution on Windows, flow-aggregator.flow-aggregator.svc DNS name couldn't be resolved.
The reason is because on Windows the Antrea Agent runs as a process, it uses the host's default DNS setting and the DNS resolver will not be configured to talk to the CoreDNS Service for cluster local DNS queries likeflow-aggregator.flow-aggregator.svc.
So we require flowCollectorAddr could only be IP for Flow Exporter on Windows node.
For Flow Aggregator, we add IP in certicate since flow exporter on Windows node can't resolve DNS name.
Also change to use dpctl/ct-get-limits intead of dpctl/ct-get-maxconns since it returns operation not supported on Windows node.

@codecov-commenter
Copy link

codecov-commenter commented Apr 29, 2021

Codecov Report

Merging #2138 (baf4ef4) into main (335e6bb) will increase coverage by 4.04%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2138      +/-   ##
==========================================
+ Coverage   61.09%   65.14%   +4.04%     
==========================================
  Files         273      273              
  Lines       20644    20648       +4     
==========================================
+ Hits        12613    13451     +838     
+ Misses       6713     5818     -895     
- Partials     1318     1379      +61     
Flag Coverage Δ
e2e-tests 56.31% <50.00%> (?)
kind-e2e-tests 52.05% <0.00%> (+<0.01%) ⬆️
unit-tests 41.01% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/flowaggregator/certificate.go 77.47% <50.00%> (+77.47%) ⬆️
pkg/controller/networkpolicy/tier.go 47.50% <0.00%> (-5.00%) ⬇️
pkg/ovs/openflow/ofctrl_bridge.go 50.36% <0.00%> (+0.36%) ⬆️
pkg/agent/openflow/network_policy.go 76.41% <0.00%> (+0.59%) ⬆️
pkg/controller/networkpolicy/status_controller.go 87.09% <0.00%> (+0.64%) ⬆️
...ntroller/networkpolicy/networkpolicy_controller.go 70.00% <0.00%> (+0.64%) ⬆️
pkg/agent/cniserver/pod_configuration.go 54.68% <0.00%> (+0.78%) ⬆️
pkg/controller/egress/store/egressgroup.go 1.38% <0.00%> (+1.38%) ⬆️
pkg/ovs/openflow/ofctrl_flow.go 47.36% <0.00%> (+1.75%) ⬆️
pkg/agent/flowexporter/exporter/exporter.go 77.88% <0.00%> (+2.64%) ⬆️
... and 42 more

Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Did you get a chance to run e2e test suite on windows and check if its passing? I see new Jenkins job for this.

name or the Cluster IP of the Service. Please note that the default values for
name or the Cluster IP of the Service.

Please note that for Antrea Agent running on a Windows node, `flowCollectorAddr` can only be IP right now because there is a DNS resolution issue in current Windows support.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge this line with the above paragraph. Something like..

"... with the Name and Namespace set to flow-aggregator. For Antrea Agent running on a Windows node, the user is required to change the default value of HOST in flowCollectorAddr from DNS name to the Cluster IP of Flow Aggregator service. The reason is because of the DNS resolution issue in the current Windows support of Kubernetes. In addition, if you deploy the Flow Aggregator Service with a different Name and Namespace, then either use the appropriate DNS name or the Cluster IP of the Service."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed.

@dreamtalen
Copy link
Contributor Author

/test-windows-e2e
/test-windows-conformance
/test-windows-networkpolicy

@dreamtalen
Copy link
Contributor Author

LGTM.
Did you get a chance to run e2e test suite on windows and check if its passing? I see new Jenkins job for this.

Just triggered these Windows related Jenkins jobs, will pay attention to the result. Thanks.

@srikartati srikartati requested a review from antoninbas May 3, 2021 18:00
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems the lack of UDP support in netsh portproxy is a well-known issue, but what's the explanation for this:

However even we added --tcp-only option in Resolve-DnsName utility, it still can't resolve this DNS name.

connTrackOvsCtl
}

// dpctl/ct-get-maxconns returns operation not supported on Windows node, use dpctl/ct-get-limits intead.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not a good function-level comment. It belongs with the call to ct.ovsctlClient.RunAppctlCmd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed.

return maxConns, nil
}
}
return 0, fmt.Errorf("doesn't find limit field in dpctl/ct-get-limits command output '%s'", cmdOutput)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return 0, fmt.Errorf("doesn't find limit field in dpctl/ct-get-limits command output '%s'", cmdOutput)
return 0, fmt.Errorf("couldn't find limit field in dpctl/ct-get-limits command output '%s'", cmdOutput)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not pushed yet?

pkg/agent/flowexporter/connections/conntrack_windows.go Outdated Show resolved Hide resolved
with the Name and Namespace set to `flow-aggregator`. If you deploy the Flow Aggregator
Service with a different Name and Namespace, then either use the appropriate DNS
name or the Cluster IP of the Service. Please note that the default values for
with the Name and Namespace set to `flow-aggregator`. For Antrea Agent running on a Windows node, the user is required to change the default value of `HOST` in `flowCollectorAddr` from DNS name to the Cluster IP of Flow Aggregator service. The reason is because of the DNS resolution issue in the current Windows support of Kubernetes. In addition, if you deploy the Flow Aggregator Service with a different Name and Namespace, then either use the appropriate DNS name or the Cluster IP of the Service.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please keep line wrapped

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, addressed.

@dreamtalen
Copy link
Contributor Author

It seems the lack of UDP support in netsh portproxy is a well-known issue, but what's the explanation for this:

However even we added --tcp-only option in Resolve-DnsName utility, it still can't resolve this DNS name.

Sorry, we didn't probe further about "--tcp-only option" not working either, not sure if this is an existing issue of Resolve-DnsName utility on k8s clusters.

@antoninbas
Copy link
Contributor

Sorry, we didn't probe further about "--tcp-only option" not working either, not sure if this is an existing issue of Resolve-DnsName utility on k8s clusters.

That's orthogonal to this PR, but I'd feel better knowing that this is not an issue on our side. Any chance the DNS traffic can be captured to see what happens with the DNS request?

name or the Cluster IP of the Service. Please note that the default values for
with the Name and Namespace set to `flow-aggregator`. For Antrea Agent running on
a Windows node, the user is required to change the default value of `HOST` in `flowCollectorAddr`
from DNS name to the Cluster IP of Flow Aggregator service. The reason is because
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of the Flow Aggregator Service

with the Name and Namespace set to `flow-aggregator`. For Antrea Agent running on
a Windows node, the user is required to change the default value of `HOST` in `flowCollectorAddr`
from DNS name to the Cluster IP of Flow Aggregator service. The reason is because
of the DNS resolution issue in the current Windows support of Kubernetes. In addition,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too vague. Please provide more context & point to a Github issue. The issue is specific to the userspace kube-proxy implementation on Windows, which Antrea currently depends on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I didn't find an exisitng issue specific to the DNS resolve problem of kube-proxy on Windows, are these two issue good to cited here: #1935, #1954?

return maxConns, nil
}
}
return 0, fmt.Errorf("doesn't find limit field in dpctl/ct-get-limits command output '%s'", cmdOutput)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not pushed yet?

@@ -97,6 +97,12 @@ func generateCertKey(caCert *x509.Certificate, caKey *rsa.PrivateKey, isServer b
cert.IPAddresses = []net.IP{ip}
} else {
cert.DNSNames = []string{flowAggregatorAddress}
// add IP in certicate since flow exporter on Windows node can't resolve DNS name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/node/Node

@dreamtalen
Copy link
Contributor Author

Sorry, we didn't probe further about "--tcp-only option" not working either, not sure if this is an existing issue of Resolve-DnsName utility on k8s clusters.

That's orthogonal to this PR, but I'd feel better knowing that this is not an issue on our side. Any chance the DNS traffic can be captured to see what happens with the DNS request?

I have used wireshark to capture some DNS traffic during DNS name resolving using ping or Resolve-DnsName -tcponly:
Capture
Not sure if it's relevant, we were able to use nslookup to resolve flow aggregator DNS name, here is the traffic captured:
Capture2

@srikartati srikartati added this to the Antrea v1.1 release milestone May 12, 2021
@srikartati
Copy link
Member

Not sure if it's relevant, we were able to use nslookup to resolve flow aggregator DNS name, here is the traffic captured:

As discussed offline, usingnslookup is one solution to resolve the DNS name instead of adding clusterIP in the TLS certificates. Any thoughts? @antoninbas

@antoninbas
Copy link
Contributor

@dreamtalen @srikartati I spent some more time thinking about it. On Windows the Antrea Agent cannot be run as a Pod. It's run as a process (with possibly management Pods when the Docker container runtime is used). See https://github.com/antrea-io/antrea/blob/main/docs/windows.md. Unlike on Linux, the DNS resolver will not be configured to talk to the CoreDNS Service for cluster local DNS queries (e.g. flow-aggregator.flow-aggregator.svc). Instead the host's default DNS settings will be used. I believe that 10.72.40.1 is not the CoreDNS ClusterIP but the address of the DNS server configured for the host.

The nslookup query is not "working". I don't know what 10.195.47.241 is but it's not the IP address for the Flow Aggregator service in your cluster. It's a VMware webserver.

I think the only AI is to include the information above in the PR description / commit message.

@dreamtalen
Copy link
Contributor Author

@dreamtalen @srikartati I spent some more time thinking about it. On Windows the Antrea Agent cannot be run as a Pod. It's run as a process (with possibly management Pods when the Docker container runtime is used). See https://github.com/antrea-io/antrea/blob/main/docs/windows.md. Unlike on Linux, the DNS resolver will not be configured to talk to the CoreDNS Service for cluster local DNS queries (e.g. flow-aggregator.flow-aggregator.svc). Instead the host's default DNS settings will be used. I believe that 10.72.40.1 is not the CoreDNS ClusterIP but the address of the DNS server configured for the host.

The nslookup query is not "working". I don't know what 10.195.47.241 is but it's not the IP address for the Flow Aggregator service in your cluster. It's a VMware webserver.

I think the only AI is to include the information above in the PR description / commit message.

Thanks a lot Antonin, just double checked the ClusterIP of Flow Aggregator service should be 10.98.233.117 in my cluster. Will update the PR accordingly.

@dreamtalen dreamtalen force-pushed the local-fa-win branch 2 times, most recently from 1d429fc to 5059761 Compare May 13, 2021 00:07
Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misunderstood when Yongming mentioned that nslookup is working. Good that we could figure there is no issue on Antrea side for DNS resolution. Thanks, Antonin.

# stream of packets, a flow record will be exported to the collector once the elapsed
# time since the last export event is equal to the value of this timeout.
# Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
#activeFlowExportTimeout: "60s"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are moving the default to 30s as part of this PR: #1949
Could you update that here to reflect the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updated.

@dreamtalen dreamtalen force-pushed the local-fa-win branch 2 times, most recently from b013830 to baf0aed Compare May 17, 2021 17:08
srikartati
srikartati previously approved these changes May 17, 2021
Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
Please fix the failed CI tests.

@srikartati
Copy link
Member

In addition, windows e2e tests are enabled now.
#2018
Could we add flow aggregator tests there?

@dreamtalen
Copy link
Contributor Author

In addition, windows e2e tests are enabled now.
#2018
Could we add flow aggregator tests there?

LGTM.
Please fix the failed CI tests.

Sure, thanks.

@dreamtalen
Copy link
Contributor Author

In addition, windows e2e tests are enabled now.
#2018
Could we add flow aggregator tests there?

In addition, windows e2e tests are enabled now.
#2018
Could we add flow aggregator tests there?

Yes, we could do. Just got a new Windows cluster to test from zhecheng, will work on it.

srikartati
srikartati previously approved these changes May 19, 2021
Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of windows test infra challenges, e2e support will be added in a follow-up PR.

Please rebase the PR with main for jenkins tests to pass.

In this commit, we fix the error when running Flow Exporter on Windows
node with Flow Aggregator.
There is a limitation in DNS resolution on Windows,
flow-aggregator.flow-aggregator.svc DNS name couldn't be resolved.
The reason is because on Windows the Antrea Agent runs as a process,
it uses the host's default DNS setting and the DNS resolver will not be
configured to talk to the CoreDNS Service for cluster local DNS queries.
So we require flowCollectorAddr could only be IP for Flow Exporter on
Windows node and add IP in certicate for flow aggregator.
Also change to use dpctl/ct-get-limits intead of dpctl/ct-get-maxconns
since it returns operation not supported on Windows node.

Signed-off-by: Yongming Ding <[email protected]>
@dreamtalen
Copy link
Contributor Author

/test-all

Copy link
Member

@srikartati srikartati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dreamtalen
Copy link
Contributor Author

/test-windows-conformance
/test-windows-networkpolicy
/test-e2e

@dreamtalen
Copy link
Contributor Author

/test-windows-conformance
/test-windows-networkpolicy

@srikartati
Copy link
Member

All the tests have passed. Merging it.
Flow Aggregator e2e test will be added to windows e2e test jenkins job.

@srikartati srikartati merged commit fa54190 into antrea-io:main May 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants