
Fix SDN registry startup to not depend on master having been started first #6684

Merged
merged 1 commit into openshift:master from danwinship:endpoint-nil-crash on Jan 26, 2016

Conversation

danwinship
Contributor

Origin side of openshift/openshift-sdn#248
(Replacing #6682, which was wrong.)

@eparis @smarterclayton

@smarterclayton
Contributor

Approved, LGTM

@smarterclayton smarterclayton added the approved label Jan 15, 2016
@openshift-bot
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4747/) (Image: devenv-rhel7_3248)

@smarterclayton
Contributor

smarterclayton commented Jan 15, 2016 via email

@dcbw
Contributor

dcbw commented Jan 15, 2016

@smarterclayton yeah, danw will be back soon but I'm digging into it.

@dcbw
Contributor

dcbw commented Jan 15, 2016

@smarterclayton @danwinship it looks like a race between master and node.

Note that the logs never print "Started Origin Controllers", which means the master is not yet fully set up. startControllers() runs asynchronously; that's where the SDN gets started, and where ClusterNetwork("default") gets created if it doesn't already exist.

But the test code that starts the server only verifies that it can call some API before letting client startup proceed; it doesn't actually wait for startControllers() to return, because that runs in a goroutine.

So the test code proceeds to start the node while the master is still in startControllers(). Node startup has less work to do, so by the time it gets to config.RunSDN() it has raced ahead of the master and fails hard on getting ClusterNetwork("default").

Honestly I'd say it's a bug in the test cases, but it's clearly something that needs to get fixed.
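
For illustration only (these names are not the actual origin test helpers), the race reduces to the following pattern: the master counts as "up" as soon as its API answers, while the goroutine that will eventually create ClusterNetwork("default") is still running, so a fast-starting node can lose the race:

package main

import (
    "fmt"
    "time"
)

func main() {
    clusterNetworkCreated := make(chan struct{})

    // Analogue of startControllers(): runs asynchronously after the API is
    // already answering, and only eventually creates ClusterNetwork("default").
    go func() {
        time.Sleep(100 * time.Millisecond) // controller startup work
        close(clusterNetworkCreated)
    }()

    // Analogue of the test code: it only checks that the API answers, then
    // immediately starts the node, which needs ClusterNetwork("default") right away.
    select {
    case <-clusterNetworkCreated:
        fmt.Println("node: found ClusterNetwork(\"default\")")
    default:
        fmt.Println("node: ClusterNetwork(\"default\") not there yet -> hard failure")
    }
}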

@smarterclayton
Contributor

I'd recommend doing a retry loop on the cluster network using the retry library - there are other reasons it could fail, and a few checks never hurt anyone. pkg/util/wait.ExponentialBackoff with a duration of a second and 3-4 tries (doubling each time, maybe).
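
A minimal sketch of that suggestion; wait.ExponentialBackoff and wait.Backoff are from the vendored Kubernetes pkg/util/wait (fields may differ slightly in the vendored version), and names like osClient and sdnapi.ClusterNetwork are assumptions about the surrounding registry code, not the actual origin sources. Assumed imports: fmt, time, the Origin client package, and the wait package.

// getClusterNetworkWithRetry is a sketch, not the code in this PR: it retries
// fetching ClusterNetwork("default") a few times instead of failing hard on
// the first attempt.
func getClusterNetworkWithRetry(osClient osclient.Interface) (*sdnapi.ClusterNetwork, error) {
    backoff := wait.Backoff{
        Duration: time.Second, // start with a one-second wait
        Factor:   2,           // double each time
        Steps:    4,           // 3-4 tries overall
    }
    var cn *sdnapi.ClusterNetwork
    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        var getErr error
        cn, getErr = osClient.ClusterNetwork().Get("default")
        if getErr != nil {
            // Probably the master is still in startControllers(); retry.
            return false, nil
        }
        return true, nil
    })
    if err != nil {
        return nil, fmt.Errorf("could not get ClusterNetwork %q: %v", "default", err)
    }
    return cn, nil
}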


@eparis eparis removed the approved label Jan 16, 2016
@eparis
Member

eparis commented Jan 16, 2016

This PR is being un-proposed for 3.1.1. We will continue to work to resolve the issue, but the risk/reward does not appear to be sufficient to race this in at the last moment.

@danwinship
Contributor Author

(Note that the openshift-sdn side hasn't actually been committed there yet; I just pushed it here to get the tests to run before committing it there.)

@danwinship danwinship changed the title from "Crash in the right place if the ClusterNetwork record can't be read" to "Fix SDN registry startup to not depend on master having been started first" Jan 19, 2016
@danwinship
Contributor Author

Latest push pulls in openshift/openshift-sdn#250. There's no loop/exponential backoff, because with this rearrangement the endpoints-related ClusterNetwork.Get("default") occurs after another part of the registry code has already verified that it is non-nil.

cn, err := registry.oClient.ClusterNetwork().Get("default")
if err != nil {
    // "can't happen"; StartNode() will already have ensured that there's no error
    panic("Failed to get ClusterNetwork: " + err.Error())
}
Contributor

Why not log.Fatalf like other places in this file?

Contributor Author

fixed
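
For reference, the reviewer's suggestion amounts to something like the following (a sketch; log is assumed to be the glog-style logger the rest of the file uses, per the review comment):

cn, err := registry.oClient.ClusterNetwork().Get("default")
if err != nil {
    // "can't happen"; StartNode() will already have ensured that there's no error
    log.Fatalf("Failed to get ClusterNetwork: %v", err)
}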

@danwinship
Contributor Author

[test]

@danwinship
Contributor Author

[test]

@danwinship
Contributor Author

[test]

@smarterclayton
Contributor

[test]

@openshift-bot
Contributor

Evaluated for origin test up to 5288d3e

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/365/)

@smarterclayton
Contributor

Noticing here that the plugin is coupled to the service proxy - when we move the service proxy and the plugins out of node.go, how is this code going to work?

@danwinship
Contributor Author

when we move the service proxy and the plugins out of node.go, how is this code going to work?

I don't know... I hadn't heard about that. Is there a PR/issue for it?

(The coupling is to keep people from being able to break isolation by manually creating service endpoints pointing to another tenant's pods. The source network ID gets lost when the packet moves from OVS to iptables, so we have to decide whether the packet is allowed through before that point. Maybe we could just make it impossible to manually create endpoints pointing into ClusterNetworkCIDR or ServicesNetworkCIDR?)
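
A hypothetical sketch of that last idea (the helper name and the example CIDRs are made up for illustration, not openshift-sdn code): reject manually created endpoints whose target addresses fall inside the cluster or services network, since those are the ones that could bypass isolation:

package main

import (
    "fmt"
    "net"
)

// endpointAllowed rejects endpoint IPs that fall inside the cluster-network or
// services-network CIDRs, since such endpoints could be used to point directly
// at another tenant's pods and bypass pod-network isolation.
func endpointAllowed(endpointIP string, restrictedCIDRs []string) (bool, error) {
    ip := net.ParseIP(endpointIP)
    if ip == nil {
        return false, fmt.Errorf("invalid IP %q", endpointIP)
    }
    for _, cidr := range restrictedCIDRs {
        _, ipnet, err := net.ParseCIDR(cidr)
        if err != nil {
            return false, err
        }
        if ipnet.Contains(ip) {
            return false, nil
        }
    }
    return true, nil
}

func main() {
    // Example CIDRs only; real values would come from the ClusterNetwork record.
    ok, _ := endpointAllowed("10.1.2.3", []string{"10.1.0.0/16", "172.30.0.0/16"})
    fmt.Println("allowed:", ok) // false: inside the example cluster-network CIDR
}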

@smarterclayton
Contributor

Maybe. Can you add a card / TODO in openshift-sdn to track that? We might have to have a way for openshift-sdn to ask the kube-proxy for info about endpoints or similar.

@danwinship
Contributor Author

@smarterclayton
Contributor

[merge]

@openshift-bot
Contributor

Evaluated for origin merge up to 5288d3e

openshift-bot pushed a commit that referenced this pull request Jan 26, 2016
@openshift-bot openshift-bot merged commit eb7e8bf into openshift:master Jan 26, 2016
@danwinship danwinship deleted the endpoint-nil-crash branch February 17, 2016 14:23