Windows support for Flow Exporter with Flow Aggregator #2138
b18503d
to
e8040c3
Compare
Codecov Report
@@            Coverage Diff             @@
##             main    #2138      +/-   ##
==========================================
+ Coverage   61.09%   65.14%   +4.04%
==========================================
  Files         273      273
  Lines       20644    20648       +4
==========================================
+ Hits        12613    13451     +838
+ Misses       6713     5818     -895
- Partials     1318     1379      +61
e8040c3
to
d6a6ebd
Compare
LGTM.
Did you get a chance to run the e2e test suite on Windows and check whether it's passing? I see a new Jenkins job for this.
docs/network-flow-visibility.md
Outdated
name or the Cluster IP of the Service. Please note that the default values for | ||
name or the Cluster IP of the Service. | ||
|
||
Please note that for Antrea Agent running on a Windows node, `flowCollectorAddr` can only be IP right now because there is a DNS resolution issue in current Windows support. |
Merge this line with the above paragraph. Something like:
"... with the Name and Namespace set to `flow-aggregator`. For Antrea Agent running on a Windows node, the user is required to change the default value of `HOST` in `flowCollectorAddr` from DNS name to the Cluster IP of Flow Aggregator service. The reason is because of the DNS resolution issue in the current Windows support of Kubernetes. In addition, if you deploy the Flow Aggregator Service with a different Name and Namespace, then either use the appropriate DNS name or the Cluster IP of the Service."
Thanks, addressed.
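To illustrate the requirement discussed in this thread (an IP rather than a DNS name for `HOST` on a Windows Node), here is a minimal Go sketch. It is not Antrea's code: `hostIsIP` and `validateCollectorHost` are hypothetical helpers, and the caller is assumed to have already extracted `HOST` from `flowCollectorAddr`.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// hostIsIP reports whether host is an IPv4 or IPv6 literal (an IPv6 literal
// may be wrapped in square brackets) rather than a DNS name.
func hostIsIP(host string) bool {
	host = strings.TrimSuffix(strings.TrimPrefix(host, "["), "]")
	return net.ParseIP(host) != nil
}

// validateCollectorHost is a hypothetical check: when onWindows is true, the
// host must be an IP literal, because the DNS name of the Service cannot be
// resolved by the agent there.
func validateCollectorHost(host string, onWindows bool) error {
	if onWindows && !hostIsIP(host) {
		return fmt.Errorf("flowCollectorAddr host %q must be an IP address on a Windows Node", host)
	}
	return nil
}
```

With this sketch, `validateCollectorHost("flow-aggregator.flow-aggregator.svc", true)` returns an error, while the same host is accepted when `onWindows` is false.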
/test-windows-e2e
Just triggered these Windows-related Jenkins jobs; I will keep an eye on the results. Thanks.
It seems the lack of UDP support in `netsh portproxy` is a well-known issue, but what's the explanation for this:
"However even we added --tcp-only option in Resolve-DnsName utility, it still can't resolve this DNS name."
connTrackOvsCtl | ||
} | ||
|
||
// dpctl/ct-get-maxconns returns operation not supported on Windows node, use dpctl/ct-get-limits instead. |
That's not a good function-level comment. It belongs with the call to ct.ovsctlClient.RunAppctlCmd
Thanks, addressed.
return maxConns, nil | ||
} | ||
} | ||
return 0, fmt.Errorf("doesn't find limit field in dpctl/ct-get-limits command output '%s'", cmdOutput) |
Suggested change:
- return 0, fmt.Errorf("doesn't find limit field in dpctl/ct-get-limits command output '%s'", cmdOutput)
+ return 0, fmt.Errorf("couldn't find limit field in dpctl/ct-get-limits command output '%s'", cmdOutput)
Thanks, addressed.
not pushed yet?
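For context on the code under review in this thread, the following is a rough, standalone Go sketch of extracting the connection limit from `dpctl/ct-get-limits` output. It is not the PR's implementation: `parseCTLimit` is a hypothetical name, and the output shape (a `default limit=<n>` line, optionally followed by `zone=<id>,limit=<n>,count=<n>` lines) is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCTLimit returns the value of the first "limit=<n>" field found in
// cmdOutput. Assumed output shape: "default limit=200000", possibly followed
// by per-zone lines such as "zone=1,limit=10000,count=0".
func parseCTLimit(cmdOutput string) (int, error) {
	// Fields are separated by commas or whitespace in the assumed format.
	isSep := func(r rune) bool {
		return r == ',' || r == ' ' || r == '\n' || r == '\r' || r == '\t'
	}
	for _, field := range strings.FieldsFunc(cmdOutput, isSep) {
		if !strings.HasPrefix(field, "limit=") {
			continue
		}
		maxConns, err := strconv.Atoi(strings.TrimPrefix(field, "limit="))
		if err != nil {
			return 0, fmt.Errorf("invalid limit field %q: %w", field, err)
		}
		return maxConns, nil
	}
	return 0, fmt.Errorf("couldn't find limit field in dpctl/ct-get-limits command output '%s'", cmdOutput)
}
```

Returning on the first match means the default limit takes precedence over any per-zone limits that follow it; whether that is the right choice depends on how zones are configured.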
docs/network-flow-visibility.md
Outdated
with the Name and Namespace set to `flow-aggregator`. If you deploy the Flow Aggregator | ||
Service with a different Name and Namespace, then either use the appropriate DNS | ||
name or the Cluster IP of the Service. Please note that the default values for | ||
with the Name and Namespace set to `flow-aggregator`. For Antrea Agent running on a Windows node, the user is required to change the default value of `HOST` in `flowCollectorAddr` from DNS name to the Cluster IP of Flow Aggregator service. The reason is because of the DNS resolution issue in the current Windows support of Kubernetes. In addition, if you deploy the Flow Aggregator Service with a different Name and Namespace, then either use the appropriate DNS name or the Cluster IP of the Service. |
please keep line wrapped
Thanks, addressed.
Sorry, we didn't probe further into why the "--tcp-only option" didn't work either; we're not sure whether this is an existing issue with the Resolve-DnsName utility on k8s clusters.
That's orthogonal to this PR, but I'd feel better knowing that this is not an issue on our side. Any chance the DNS traffic can be captured to see what happens with the DNS request?
docs/network-flow-visibility.md
Outdated
name or the Cluster IP of the Service. Please note that the default values for | ||
with the Name and Namespace set to `flow-aggregator`. For Antrea Agent running on | ||
a Windows node, the user is required to change the default value of `HOST` in `flowCollectorAddr` | ||
from DNS name to the Cluster IP of Flow Aggregator service. The reason is because |
of the Flow Aggregator Service
docs/network-flow-visibility.md
Outdated
with the Name and Namespace set to `flow-aggregator`. For Antrea Agent running on | ||
a Windows node, the user is required to change the default value of `HOST` in `flowCollectorAddr` | ||
from DNS name to the Cluster IP of Flow Aggregator service. The reason is because | ||
of the DNS resolution issue in the current Windows support of Kubernetes. In addition, |
This is too vague. Please provide more context and point to a GitHub issue. The issue is specific to the userspace kube-proxy implementation on Windows, which Antrea currently depends on.
pkg/flowaggregator/certificate.go
Outdated
@@ -97,6 +97,12 @@ func generateCertKey(caCert *x509.Certificate, caKey *rsa.PrivateKey, isServer b | |||
cert.IPAddresses = []net.IP{ip} | |||
} else { | |||
cert.DNSNames = []string{flowAggregatorAddress} | |||
// add IP in certificate since flow exporter on Windows node can't resolve DNS name |
s/node/Node
As discussed offline, using |
@dreamtalen @srikartati I spent some more time thinking about it. On Windows, the Antrea Agent cannot be run as a Pod. It runs as a process (possibly with management Pods when the Docker container runtime is used). See https://github.com/antrea-io/antrea/blob/main/docs/windows.md. Unlike on Linux, the DNS resolver will not be configured to talk to the CoreDNS Service for cluster local DNS queries (e.g. The nslookup query is not "working". I don't know what 10.195.47.241 is, but it's not the IP address of the Flow Aggregator Service in your cluster. It's a VMware webserver. I think the only action item is to include the information above in the PR description / commit message.
Thanks a lot, Antonin. I just double-checked: the ClusterIP of the Flow Aggregator Service should be 10.98.233.117 in my cluster. I will update the PR accordingly.
1d429fc
to
5059761
Compare
I misunderstood when Yongming mentioned that nslookup was working. Good that we could figure out there is no issue on the Antrea side for DNS resolution. Thanks, Antonin.
# stream of packets, a flow record will be exported to the collector once the elapsed | ||
# time since the last export event is equal to the value of this timeout. | ||
# Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h". | ||
#activeFlowExportTimeout: "60s" |
We are moving the default to 30s as part of this PR: #1949
Could you update that here to reflect the same?
Sure, updated.
b013830
to
baf0aed
Compare
LGTM.
Please fix the failed CI tests.
In addition, Windows e2e tests are enabled now.
Sure, thanks. |
Because of Windows test infra challenges, e2e support will be added in a follow-up PR.
Please rebase the PR on main for the Jenkins tests to pass.
In this commit, we fix the error that occurs when running the Flow Exporter on a Windows Node with the Flow Aggregator. There is a limitation in DNS resolution on Windows: the flow-aggregator.flow-aggregator.svc DNS name cannot be resolved. The reason is that on Windows the Antrea Agent runs as a process, so it uses the host's default DNS settings, and the DNS resolver is not configured to talk to the CoreDNS Service for cluster-local DNS queries. We therefore require flowCollectorAddr to be an IP address for the Flow Exporter on a Windows Node, and we add the IP to the certificate for the Flow Aggregator. We also switch to dpctl/ct-get-limits instead of dpctl/ct-get-maxconns, since the latter returns "operation not supported" on a Windows Node.
Signed-off-by: Yongming Ding <[email protected]>
/test-all |
LGTM
/test-windows-conformance |
/test-windows-conformance |
All the tests have passed. Merging it. |
In this commit, we fix the error that occurs when running the Flow Exporter on a Windows Node with the Flow Aggregator.
There is a limitation in DNS resolution on Windows: the `flow-aggregator.flow-aggregator.svc` DNS name cannot be resolved. The reason is that on Windows the Antrea Agent runs as a process, so it uses the host's default DNS settings, and the DNS resolver is not configured to talk to the CoreDNS Service for cluster-local DNS queries like `flow-aggregator.flow-aggregator.svc`. We therefore require `flowCollectorAddr` to be an IP address for the Flow Exporter on a Windows Node. For the Flow Aggregator, we add the IP to the certificate, since the Flow Exporter on a Windows Node can't resolve the DNS name.
We also switch to `dpctl/ct-get-limits` instead of `dpctl/ct-get-maxconns`, since the latter returns "operation not supported" on a Windows Node.