
Fix SDN registry startup to not depend on master having been started first #6684

Merged
merged 1 commit into openshift:master from danwinship:endpoint-nil-crash on Jan 26, 2016

Conversation

danwinship
Contributor

Origin side of openshift/openshift-sdn#248
(Replacing #6682, which was wrong.)

@eparis @smarterclayton

@smarterclayton
Contributor

Approved, LGTM

@smarterclayton smarterclayton added the approved label Jan 15, 2016
@openshift-bot
Contributor

[Test]ing while waiting on the merge queue

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/merge_pull_requests_origin/4747/) (Image: devenv-rhel7_3248)

@smarterclayton
Contributor

smarterclayton commented Jan 15, 2016 via email

@dcbw
Contributor

dcbw commented Jan 15, 2016

@smarterclayton yeah, danw will be back soon but I'm digging into it.

@dcbw
Contributor

dcbw commented Jan 15, 2016

@smarterclayton @danwinship it looks like a race between master and node.

Note that the logs never print "Started Origin Controllers", which means the master is not yet fully set up. startControllers() runs asynchronously; that's where the SDN gets started, and where ClusterNetwork("default") gets created if it doesn't already exist.

But the test code that starts the server only verifies that it can call some API before letting client startup proceed; it doesn't actually wait for startControllers() to return, because that runs in a goroutine.

So the test code proceeds to start the node while the master is still in startControllers(). Node startup has less work to do, so by the time it gets to config.RunSDN() it has raced ahead of the master and fails hard on getting ClusterNetwork("default").

Honestly I'd say it's a bug in the test cases, but it's clearly something that needs to get fixed.
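
For illustration only (these names are not the actual origin test helpers), the race reduces to the following pattern: the master counts as "up" as soon as its API answers, while the goroutine that will eventually create ClusterNetwork("default") is still running, so a fast-starting node can lose the race:

package main

import (
    "fmt"
    "time"
)

func main() {
    clusterNetworkCreated := make(chan struct{})

    // Analogue of startControllers(): runs asynchronously after the API is
    // already answering, and only eventually creates ClusterNetwork("default").
    go func() {
        time.Sleep(100 * time.Millisecond) // controller startup work
        close(clusterNetworkCreated)
    }()

    // Analogue of the test code: it only checks that the API answers, then
    // immediately starts the node, which needs ClusterNetwork("default") right away.
    select {
    case <-clusterNetworkCreated:
        fmt.Println("node: found ClusterNetwork(\"default\")")
    default:
        fmt.Println("node: ClusterNetwork(\"default\") not there yet -> hard failure")
    }
}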

@smarterclayton
Contributor

I'd recommend doing a retry loop on the cluster network using the retry library - there are other reasons it could fail, and a few checks never hurt anyone. pkg/util/wait.ExponentialBackoff with a duration of a second and 3-4 tries (doubling each time, maybe).
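
A minimal sketch of that suggestion; wait.ExponentialBackoff and wait.Backoff are from the vendored Kubernetes pkg/util/wait (fields may differ slightly in the vendored version), and names like osClient and sdnapi.ClusterNetwork are assumptions about the surrounding registry code, not the actual origin sources. Assumed imports: fmt, time, the Origin client package, and the wait package.

// getClusterNetworkWithRetry is a sketch, not the code in this PR: it retries
// fetching ClusterNetwork("default") a few times instead of failing hard on
// the first attempt.
func getClusterNetworkWithRetry(osClient osclient.Interface) (*sdnapi.ClusterNetwork, error) {
    backoff := wait.Backoff{
        Duration: time.Second, // start with a one-second wait
        Factor:   2,           // double each time
        Steps:    4,           // 3-4 tries overall
    }
    var cn *sdnapi.ClusterNetwork
    err := wait.ExponentialBackoff(backoff, func() (bool, error) {
        var getErr error
        cn, getErr = osClient.ClusterNetwork().Get("default")
        if getErr != nil {
            // Probably the master is still in startControllers(); retry.
            return false, nil
        }
        return true, nil
    })
    if err != nil {
        return nil, fmt.Errorf("could not get ClusterNetwork %q: %v", "default", err)
    }
    return cn, nil
}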


@eparis eparis removed the approved label Jan 16, 2016
@eparis
Member

eparis commented Jan 16, 2016

This PR is being un-proposed for 3.1.1. We will continue to work to resolve the issue, but the risk/reward does not appear to be sufficient to race this in at the last moment.

@danwinship
Contributor Author

(Note that the openshift-sdn side hasn't actually been committed there yet; I just pushed it here to get the tests to run before committing it there.)

@danwinship danwinship changed the title from "Crash in the right place if the ClusterNetwork record can't be read" to "Fix SDN registry startup to not depend on master having been started first" Jan 19, 2016
@danwinship
Contributor Author

Latest push pulls in openshift/openshift-sdn#250. There's no loop/exponential backoff, because with this rearrangement the endpoints-related ClusterNetwork.Get("default") occurs after another part of the registry code has already verified that it is non-nil.

cn, err := registry.oClient.ClusterNetwork().Get("default")
if err != nil {
    // "can't happen"; StartNode() will already have ensured that there's no error
    panic("Failed to get ClusterNetwork: " + err.Error())
}
Contributor

Why not log.Fatalf like other places in this file?

Contributor Author

fixed
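
For reference, the reviewer's suggestion amounts to something like the following (a sketch; log is assumed to be the glog-style logger the rest of the file uses, per the review comment):

cn, err := registry.oClient.ClusterNetwork().Get("default")
if err != nil {
    // "can't happen"; StartNode() will already have ensured that there's no error
    log.Fatalf("Failed to get ClusterNetwork: %v", err)
}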

@danwinship
Contributor Author

[test]

@danwinship
Contributor Author

[test]

@danwinship
Contributor Author

[test]

@smarterclayton
Contributor

[test]

@openshift-bot
Contributor

Evaluated for origin test up to 5288d3e

@openshift-bot
Contributor

continuous-integration/openshift-jenkins/test SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/365/)

@smarterclayton
Contributor

Noticing here that the plugin is coupled to the service proxy - when we move the service proxy and the plugins out of node.go, how is this code going to work?

@danwinship
Contributor Author

when we move the service proxy and the plugins out of node.go, how is this code going to work?

I don't know... I hadn't heard about that. Is there a PR/issue for it?

(The coupling is to keep people from being able to break isolation by manually creating service endpoints pointing to another tenant's pods. The source network ID gets lost when the packet moves from OVS to iptables, so we have to decide whether the packet is allowed through before that point. Maybe we could just make it impossible to manually create endpoints pointing into ClusterNetworkCIDR or ServicesNetworkCIDR?)
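
A hypothetical sketch of that last idea (the helper name and the example CIDRs are made up for illustration, not openshift-sdn code): reject manually created endpoints whose target addresses fall inside the cluster or services network, since those are the ones that could bypass isolation:

package main

import (
    "fmt"
    "net"
)

// endpointAllowed rejects endpoint IPs that fall inside the cluster-network or
// services-network CIDRs, since such endpoints could be used to point directly
// at another tenant's pods and bypass pod-network isolation.
func endpointAllowed(endpointIP string, restrictedCIDRs []string) (bool, error) {
    ip := net.ParseIP(endpointIP)
    if ip == nil {
        return false, fmt.Errorf("invalid IP %q", endpointIP)
    }
    for _, cidr := range restrictedCIDRs {
        _, ipnet, err := net.ParseCIDR(cidr)
        if err != nil {
            return false, err
        }
        if ipnet.Contains(ip) {
            return false, nil
        }
    }
    return true, nil
}

func main() {
    // Example CIDRs only; real values would come from the ClusterNetwork record.
    ok, _ := endpointAllowed("10.1.2.3", []string{"10.1.0.0/16", "172.30.0.0/16"})
    fmt.Println("allowed:", ok) // false: inside the example cluster-network CIDR
}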

@smarterclayton
Contributor

Maybe. Can you add a card / TODO in openshift-sdn to track that? We might have to have a way for openshift-sdn to ask the kube-proxy for info about endpoints or similar.

@danwinship
Contributor Author

@smarterclayton
Contributor

[merge]

@openshift-bot
Contributor

Evaluated for origin merge up to 5288d3e

openshift-bot pushed a commit that referenced this pull request Jan 26, 2016
@openshift-bot openshift-bot merged commit eb7e8bf into openshift:master Jan 26, 2016
@danwinship danwinship deleted the endpoint-nil-crash branch February 17, 2016 14:23