PWX-36917/PWX-32899_pt2: K8s DNS fix #1629

Merged: 2 commits merged on Apr 17, 2024

@zoxpx zoxpx commented Feb 9, 2024

What type of PR is this?
improvement

What this PR does / why we need it:

Stork should not assume that Kubernetes DNS is configured with the default .svc.cluster.local domain.

FIX:

  • the hostnames are changed to the short form <service>.<namespace> (for example portworx-api.portworx), with no .svc.cluster.local suffix
  • DNS resolution still works, because the search <dns-domain> entry in /etc/resolv.conf supplies the correct cluster domain
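
As a rough illustration of the hostname change described above, here is a minimal Go sketch. The helper names `serviceHost` and `restEndpoint` are hypothetical and are not taken from the Stork or openstorage source; the service name, namespace, and port come from the logs in this PR.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// serviceHost returns the namespace-qualified short name of a Kubernetes
// Service. It deliberately omits the ".svc.<cluster-domain>" suffix so that
// the resolver's search list can supply the cluster's actual DNS domain.
// Hypothetical helper, for illustration only.
func serviceHost(service, namespace string) string {
	return service + "." + namespace
}

// restEndpoint builds an HTTP endpoint URL for the given service and port.
// Hypothetical helper, for illustration only.
func restEndpoint(service, namespace string, port int) string {
	host := serviceHost(service, namespace)
	return "http://" + net.JoinHostPort(host, strconv.Itoa(port))
}

func main() {
	fmt.Println(restEndpoint("portworx-api", "portworx", 9001))
	// prints: http://portworx-api.portworx:9001
}
```

The point of the design is that the cluster domain is never hard-coded, so the same binary should work on clusters with the default domain and on clusters with a custom one.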

Does this PR change a user-facing CRD or CLI?:

no

Is a release note needed?:
no

Does this change need to be cherry-picked to a release branch?:

not sure yet

MANUAL TESTING:

This test was performed on a cluster with a modified (non-default) K8s DNS domain.

  • BEFORE the fix, stork reports errors resolving the portworx REST endpoints
    • note that the DNS queries fail on the FQDN host portworx-api.portworx.svc.cluster.local
time="2024-02-08T01:32:08Z" level=debug msg="Monitoring storage nodes"
time="2024-02-08T01:32:08Z" level=error msg="Error getting nodes: Failed to get nodes for the driver: Get \"http://portworx-api.portworx.svc.cluster.local:9001/v1/cluster/enumerate\": dial tcp: lookup portworx-api.portworx.svc.cluster.local on 10.96.0.10:53: no such host"
2024/02/08 01:32:08 DoRetryWithTimeout - Error: {failed to create cluster domains status object for driver pxd: Failed to get clusterID for the driver: Get "http://portworx-api.portworx.svc.cluster.local:9001/v1/cluster/enumerate": dial tcp: lookup portworx-api.portworx.svc.cluster.local on 10.96.0.10:53: no such host}, Next try in [10s], timeout [30m0s]
2024/02/08 01:32:18 DoRetryWithTimeout - Error: {failed to create cluster domains status object for driver pxd: Failed to get clusterID for the driver: Get "http://portworx-api.portworx.svc.cluster.local:9001/v1/cluster/enumerate": dial tcp: lookup portworx-api.portworx.svc.cluster.local on 10.96.0.10:53: no such host}, Next try in [10s], timeout [30m0s]
2024/02/08 01:32:28 DoRetryWithTimeout - Error: {failed to create cluster domains status object for driver pxd: Failed to get clusterID for the driver: Get "http://portworx-api.portworx.svc.cluster.local:9001/v1/cluster/enumerate": dial tcp: lookup portworx-api.portworx.svc.cluster.local on 10.96.0.10:53: no such host}, Next try in [10s], timeout [30m0s]
...
  • AFTER the fix, the "short hostname" (portworx-api.portworx) resolves correctly:
time="2024-02-08T02:06:24Z" level=debug msg="Monitoring storage nodes"
time="2024-02-08T02:06:24Z" level=info msg="Registering CRDs"
time="2024-02-08T02:06:24Z" level=info msg="Using http://portworx-api.portworx:9001 as endpoint for portworx REST API"
time="2024-02-08T02:06:24Z" level=info msg="Using portworx-api.portworx:9020 as endpoint for portworx gRPC API"
I0208 02:06:24.163463       1 snapshot-controller.go:184] Starting snapshot controller
I0208 02:06:24.163654       1 snapshot-controller.go:171] Waiting for caches to sync for snapshot-controller controller
I0208 02:06:24.265908       1 snapshot-controller.go:178] Caches are synced for snapshot-controller controller
I0208 02:06:24.277374       1 controller.go:835] Starting provisioner controller stork-snapshot_stork-5676c48459-t2b79_540b5cd5-1fc2-4383-9072-5e6ee2eea0e9!
I0208 02:06:24.378570       1 controller.go:884] Started provisioner controller stork-snapshot_stork-5676c48459-t2b79_540b5cd5-1fc2-4383-9072-5e6ee2eea0e9!
...
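
To see why the short hostname resolves on a cluster with a custom DNS domain, the following Go sketch models how a stub resolver expands a name using the `search` and `ndots` settings from /etc/resolv.conf. It is a simplified model, not the code path Stork uses. The `candidates` function is hypothetical, and the cluster domain `example.internal` and the pod namespace `kube-system` are assumptions chosen for illustration; the test cluster's actual domain is not stated in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates lists the fully qualified names a stub resolver would try for
// name, given the "search" domains and the "ndots" option from resolv.conf.
// Simplified model; real resolvers differ in details.
func candidates(name string, search []string, ndots int) []string {
	// A trailing dot marks the name as already fully qualified.
	if strings.HasSuffix(name, ".") {
		return []string{name}
	}
	var out []string
	asIs := name + "."
	// Names with at least ndots dots are tried as-is before the search list.
	tryAsIsFirst := strings.Count(name, ".") >= ndots
	if tryAsIsFirst {
		out = append(out, asIs)
	}
	for _, d := range search {
		out = append(out, name+"."+strings.TrimSuffix(d, ".")+".")
	}
	if !tryAsIsFirst {
		out = append(out, asIs)
	}
	return out
}

func main() {
	// Assumed resolv.conf search list for a pod in namespace "kube-system"
	// on a cluster whose DNS domain is the hypothetical "example.internal".
	search := []string{
		"kube-system.svc.example.internal",
		"svc.example.internal",
		"example.internal",
	}
	// Kubernetes typically sets "options ndots:5" for cluster-DNS pods.
	for _, c := range candidates("portworx-api.portworx", search, 5) {
		fmt.Println(c)
	}
}
```

Under these assumptions, the second candidate (`portworx-api.portworx.svc.example.internal.`) is the one that matches the service record. A hard-coded `.svc.cluster.local` suffix would never produce that name.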

TODO: openstorage and operator need the same fix, and those fixes will then need to be vendored in.

@zoxpx zoxpx requested a review from a team February 9, 2024 04:00
@zoxpx zoxpx self-assigned this Feb 9, 2024
@cnbu-jenkins
Collaborator

Can one of the admins verify this patch?

@diptiranjanpx
Contributor

@zoxpx we should merge this PR with the right vendoring from the openstorage branch. For that, can you please get the openstorage PR reviewed and merged?

@zoxpx zoxpx changed the title PWX-32899_pt2: K8s DNS fix PWX-36917/PWX-32899_pt2: K8s DNS fix Apr 16, 2024
Signed-off-by: Zoran Rajic <[email protected]>
@zoxpx zoxpx force-pushed the PWX-32899_pt2_k8s-dns-fix branch from 2e3dd28 to 1c4ea91 Compare April 16, 2024 19:41
Vendor in latest version of openstorage
@zoxpx
Contributor Author

zoxpx commented Apr 16, 2024

Note: updated the PR to openstorage@latest. This brings in 10 commits made to the openstorage project over the past 4 weeks.

@zoxpx
Contributor Author

zoxpx commented Apr 17, 2024

Thanks @diptiranjanpx -- merging the PR.

Which stork-branch should this be picked up into?

@zoxpx zoxpx merged commit acff28b into master Apr 17, 2024
5 checks passed
@zoxpx zoxpx deleted the PWX-32899_pt2_k8s-dns-fix branch April 17, 2024 02:43
@diptiranjanpx
Contributor

> Thanks @diptiranjanpx -- merging the PR.
>
> Which stork-branch should this be picked up into?

This will be part of 24.2.0 once the release branch gets created.
