Bug 1868158: gcp, azure: Handle azure vips similar to GCP #2011
Conversation
@squeed: This pull request references Bugzilla bug 1868158, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/test e2e-azure
/cc @sttts
```
| +---------+ |
| |
+---------------+
```
ascii figures 😍
/test e2e-gcp-op
oops, typo'd the image key. Good thing the tests mostly failed...
/refresh
/test e2e-azure
/retest
h.vip = addrs[0]
glog.Infof("Using VIP %s", h.vip)
if len(addrs) != 1 {
	return nil, fmt.Errorf("hostname %s has %d addresses, expected 1 - aborting", uri.Hostname(), len(addrs))
when will this happen?
(This isn't a behavior change, just removing some dead code and a corresponding re-indentation)
It could happen if we somehow switch to RRDNS.
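The single-address check discussed above can be sketched in shell form. Everything here is illustrative: `resolve` is a hypothetical stand-in for DNS resolution (a real script might use `getent hosts`), and the hostname and address are made up.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical stand-in for DNS resolution; prints one address per line.
resolve() { printf '%s\n' "10.0.0.4"; }

addrs="$(resolve api-int.example.com)"
count="$(printf '%s\n' "$addrs" | wc -l)"

# Mirror the check in the Go snippet: abort unless the hostname
# resolves to exactly one address (e.g. it is not RRDNS).
if [ "$count" -ne 1 ]; then
    echo "hostname has $count addresses, expected 1 - aborting" >&2
    exit 1
fi

vip="$addrs"
echo "Using VIP $vip"
```

With RRDNS the hostname would resolve to several addresses, `count` would exceed 1, and the sketch would abort just like the Go code.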
@@ -1,17 +1,17 @@
      mode: 0644
-     path: "/etc/kubernetes/manifests/gcp-routes-controller.yaml"
+     path: "/etc/kubernetes/manifests/apiserver-watcher.yaml"
you rely on a fresh base image (i.e. reboot) to remove the old static pod?
We always reboot for config changes today, yes.
Though this gets into #1190 and in fact due to the way the MCO works today there will be a window where both are running unfortunately.
We probably need to change the new code to at least detect the case where the old static pod exists and exit.
I could also keep it as the same filename; the filename definitely doesn't matter.
The old static pod doesn't matter; it writes to /run/gcp-routes while the new one is /run/cloud-routes, so they can happily coexist (and should, until the service is swapped).
@@ -0,0 +1,180 @@
      mode: 0755
a separate commit with just the copied file from gcp would help to review the differences.
It's pretty different from GCP, so it needs a review.
azure quota limits /retest
/hold holding so this doesn't merge until it looks like azure does what we want.
I don't really have the background knowledge to validate the functionality, so I think I should defer the lgtm to someone with more networking knowledge.
In terms of the operation here I suppose we're really just extending the existing GCP watcher to also work on Azure, which seems fine to me.
cmd/apiserver-watcher/README.md
Outdated
When /readyz fails, write `/run/cloud-routes/$VIP.down`, which tells the
provider-specific service to update iptables rules.

Separately,
Did you mean to continue here?
Hah. Maybe?
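The downfile protocol the README describes can be sketched roughly as follows. This is not the actual apiserver-watcher code: it uses a temp directory in place of `/run/cloud-routes`, and the helper names and VIP are hypothetical.

```shell
#!/bin/bash
set -euo pipefail

# Use a temp dir in place of /run/cloud-routes so the sketch runs anywhere.
rundir="$(mktemp -d)"
vip="10.0.0.4"  # hypothetical apiserver VIP

# When /readyz fails, drop $VIP.down; when it recovers, drop $VIP.up.
# Only one of the two files should exist at a time.
mark_down() { rm -f "${rundir}/${1}.up";   touch "${rundir}/${1}.down"; }
mark_up()   { rm -f "${rundir}/${1}.down"; touch "${rundir}/${1}.up";   }

mark_down "$vip"
state="down"
mark_up "$vip"
if [ -e "${rundir}/${vip}.up" ]; then
    state="up"
fi
echo "vip is $state"
```

The provider-specific routes service would watch this directory and flip its iptables rules whenever a `.up` or `.down` file appears.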
it's down, or else the node (i.e. kubelet) loses access to the apiserver VIP
and becomes unmanageable.
### Azure |
It'd be nice to also add some platform-specific descriptions of how the service operates on each platform, so it's clearer how differences are handled.
Not sure what you mean exactly; apiserver-watcher is identical on azure and gcp. I did add pointers to the cloud-provider-specific scripts, so maybe that's helpful?
path: "/opt/libexec/openshift-azure-routes.sh"
contents:
  inline: |
    #!/bin/bash
Please always use http://redsymbol.net/articles/unofficial-bash-strict-mode/
Also it's really unfortunate we keep accumulating this nontrivial bash code; like I said in the OVS review it is possible today to have this in the MCD since we pull that binary and execute on the host.
I agree; I don't like adding all this bash. If it helps, I extract it and run it through shellcheck automatically. I could probably add that to `make verify`.
For 4.7, should we add an item to rewrite all this in go?
I would love this to be in go! defer to @cgwalters / @runcom on whether waiting to 4.7 makes sense.
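For reference, the "unofficial strict mode" header linked above looks like the sketch below; the `if` block is just an illustration of what `pipefail` changes, not part of the actual script under review.

```shell
#!/bin/bash
# Unofficial bash strict mode: abort on errors (-e), on unset
# variables (-u), and on any failure within a pipeline (pipefail).
set -euo pipefail
IFS=$'\n\t'

# Without pipefail, 'false | true' exits 0 because only the last
# command in a pipeline counts; with pipefail, any failing stage
# fails the whole pipeline.
if false | true; then
    result="pipefail off"
else
    result="pipefail on"
fi
echo "$result"
```

Running a script under this header turns many silent failures (masked pipeline errors, typo'd variable names) into immediate, visible aborts.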
/approve
/retest
templates/master/00-master/azure/units/openshift-azure-routes.path.yaml
Outdated
/retest Please review the full test history for this PR and help us cut down flakes.
/lgtm cancel I'll look at fixing it.
This PR does the following things:
- Rename gcp-routes-controller to apiserver-watcher, since it is generic
- Remove obsolete service-management mode from gcp-routes-controller
- Change downfile directory to /run/cloud-routes from /run/gcp-routes
- Write $VIP.up as well as $VIP.down
- Add an azure routes script that fixes hairpin

Background: Azure hosts cannot hairpin back to themselves over a load balancer. Thus, we need to redirect traffic to the apiserver VIP to ourselves via iptables. However, we should only do this when our local apiserver is running.

The apiserver-watcher drops a $VIP.up and $VIP.down file, accordingly, depending on the state of the apiserver. Then, we add or remove iptables rules that short-circuit the load balancer.

Unlike GCP, we don't need to do this for external traffic, only local clients.
I'm holding the mutex 🔒 around force pushing updates here.
/test e2e-azure
/retest
Thanks
OK we have a green azure run here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2011/pull-ci-openshift-machine-config-operator-master-e2e-azure/1304091886448807936
Confirmed we fixed the ordering cycle by looking at the journal from the current run versus the previous:
/approve
@squeed: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Eh we had prior approvals on the old code and the new one just fixes systemd ordering issues so |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cgwalters, mfojtik, squeed The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@squeed: All pull requests linked via external trackers have merged: Bugzilla bug 1868158 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
YAY!
The original introduction of this service probably used `gcpRoutesController` which happens to be the same as the MCO image because we didn't have a reference to it, and plumbing the image substitution through all the abstraction layers in the code is certainly not obvious. Prep for openshift#2011 which wants to abstract the GCP work to also handle Azure and it was confusing that `machine-config-daemon-pull.service` was referencing an image with a GCP name.
This PR does the following things:
- Rename gcp-routes-controller to apiserver-watcher, since it is generic
- Remove obsolete service-management mode from gcp-routes-controller
- Change downfile directory to /run/cloud-routes from /run/gcp-routes
- Write $VIP.up as well as $VIP.down
- Add an azure routes script that fixes hairpin
Background: Azure hosts cannot hairpin back to themselves over a load balancer. Thus, we need to redirect traffic to the apiserver VIP to ourselves via iptables. However, we should only do this when our local apiserver is running.
The apiserver-watcher drops a $VIP.up and $VIP.down file, accordingly, depending on the state of the apiserver. Then, we add or remove iptables rules that short-circuit the load balancer.
Unlike GCP, we don't need to do this for external traffic, only local clients.
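The hairpin fix described above amounts to adding or removing a nat REDIRECT rule as the downfiles come and go. A rough, hedged sketch follows; `run` prints the command instead of invoking the real iptables binary so it is safe to execute anywhere, and the VIP, chain, and port are illustrative rather than taken from the actual openshift-azure-routes.sh.

```shell
#!/bin/bash
set -euo pipefail

vip="10.0.0.4"        # hypothetical apiserver VIP
run() { echo "$@"; }  # dry run; swap for the real iptables on a host

# While the local apiserver is up ($VIP.up exists), redirect locally
# originated traffic for the VIP back to this host, short-circuiting
# the load balancer that Azure cannot hairpin through...
enable_hairpin()  { run iptables -t nat -A OUTPUT -d "$vip" -p tcp --dport 6443 -j REDIRECT; }
# ...and when $VIP.down appears, delete the rule so clients fail over
# to the load balancer and reach a healthy apiserver elsewhere.
disable_hairpin() { run iptables -t nat -D OUTPUT -d "$vip" -p tcp --dport 6443 -j REDIRECT; }

add_cmd="$(enable_hairpin)"
echo "$add_cmd"
```

Because the rule sits in the OUTPUT chain, it only affects local clients (like the kubelet), matching the note that Azure, unlike GCP, does not need this for external traffic.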
- How to verify it
Install on azure, ensure connections to the internal API load balancer are reliable - both when the local apiserver process is running and stopped.
- Description for the changelog
Masters on azure can now reliably connect to the apiserver service, without encountering hairpin issues.