Fix SDN registry startup to not depend on master having been started first #6684
Approved, LGTM |
[Test]ing while waiting on the merge queue |
continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4747/) (Image: devenv-rhel7_3248) |
Legitimate bug in the PR
|
@smarterclayton yeah, danw will be back soon but I'm digging into it. |
@smarterclayton @danwinship it looks like a race between master and node. Note that the logs never print "Started Origin Controllers", which indicates that the master is not fully set up. startControllers() runs asynchronously; that's where the SDN gets started and where ClusterNetwork("default") gets created if it doesn't exist yet. But the test code that starts the server only verifies that it can call some API and then lets client startup proceed; it doesn't wait for startControllers() to return, because that runs in a goroutine. So the test code goes on to start the node while the master is still in startControllers(). Node startup has less work to do, so by the time it reaches config.RunSDN() it has outraced the master, and it fails hard on getting ClusterNetwork("default"). Honestly I'd say it's a bug in the test cases, but it clearly needs to get fixed. |
I'd recommend a retry loop on the ClusterNetwork lookup, using the retry library
|
This PR is being un-proposed for 3.1.1. We will continue to work to resolve the issue, but the risk/reward does not appear to be sufficient to race this in at the last moment. |
a8fad7e to e4f6b1d
(Note that the openshift-sdn side hasn't actually been committed there yet, I just pushed it here to get the tests to run before committing it there.) |
e4f6b1d to a3e5e28
Latest push pulls in openshift/openshift-sdn#250. There's no loop/exponential backoff, because with this rearrangement, the endpoints-related |
cn, err := registry.oClient.ClusterNetwork().Get("default")
if err != nil {
	// "can't happen"; StartNode() will already have ensured that there's no error
	panic("Failed to get ClusterNetwork: " + err.Error())
Why not log.Fatalf like other places in this file?
fixed
a3e5e28 to 293bade
293bade to 5288d3e
[test] |
Evaluated for origin test up to 5288d3e |
continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/365/) |
Noticing here that the plugin is coupled to the service proxy - when we move the service proxy and the plugins out of node.go, how is this code going to work? |
I don't know... I hadn't heard about that. Is there a PR/issue for it? (The coupling is to keep people from being able to break isolation by manually creating service endpoints pointing to another tenant's pods. The source network ID gets lost when the packet moves from OVS to iptables, so we have to decide whether the packet is allowed through before that point. Maybe we could just make it impossible to manually create endpoints pointing into ClusterNetworkCIDR or ServicesNetworkCIDR?) |
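The idea floated above (rejecting manually created endpoints that point into ClusterNetworkCIDR or ServicesNetworkCIDR) could be checked along these lines. This is only a sketch under stated assumptions: `parseCIDRs` and `endpointAllowed` are hypothetical helpers, and the two CIDR values are made-up examples, not values taken from this PR.

```go
package main

import (
	"fmt"
	"net"
)

// parseCIDRs parses each CIDR string into a *net.IPNet.
// (Hypothetical helper for this sketch.)
func parseCIDRs(cidrs ...string) ([]*net.IPNet, error) {
	nets := make([]*net.IPNet, 0, len(cidrs))
	for _, c := range cidrs {
		_, n, err := net.ParseCIDR(c)
		if err != nil {
			return nil, err
		}
		nets = append(nets, n)
	}
	return nets, nil
}

// endpointAllowed reports whether ip lies outside every reserved network.
// An unparseable IP is returned as an error rather than silently allowed.
func endpointAllowed(ip string, reserved []*net.IPNet) (bool, error) {
	addr := net.ParseIP(ip)
	if addr == nil {
		return false, fmt.Errorf("invalid IP %q", ip)
	}
	for _, n := range reserved {
		if n.Contains(addr) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	// Assumed example values standing in for ClusterNetworkCIDR and
	// ServicesNetworkCIDR.
	reserved, err := parseCIDRs("10.1.0.0/16", "172.30.0.0/16")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, ip := range []string{"10.1.2.3", "172.30.0.10", "192.168.1.10"} {
		ok, err := endpointAllowed(ip, reserved)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s allowed as a manual endpoint: %v\n", ip, ok)
	}
}
```

A check like this would run where endpoints are admitted, so the decision no longer depends on the source network ID surviving the hop from OVS to iptables.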
Maybe. Can you add a card / TODO in openshift-sdn to track that? We might have to have a way for openshift-sdn to ask the kube-proxy for info about endpoints or similar. |
[merge] |
Evaluated for origin merge up to 5288d3e |
Merged by openshift-bot
Origin side of openshift/openshift-sdn#248
(Replacing #6682 which was wrong.)
@eparis @smarterclayton