cmd: wait for the cluster to be initialized #1132

Merged: 3 commits, Jan 31, 2019

Conversation

staebler (Contributor)

After creating the cluster, wait until the ClusterVersion object indicates that the cluster has been initialized prior to exiting the installer.
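
A minimal sketch of what such a wait could look like (illustrative only: the function name, poll interval, and the older context-free client signatures are assumptions, not the installer's actual code). It polls the ClusterVersion object named "version" until its Available condition reports True:

package example

import (
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForClusterInitialization polls the ClusterVersion object named "version"
// until its Available condition is True or the timeout expires.
// Sketch only; the installer's real implementation may differ.
func waitForClusterInitialization(client configclient.Interface, timeout time.Duration) error {
	return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
		cv, err := client.ConfigV1().ClusterVersions().Get("version", metav1.GetOptions{})
		if err != nil {
			// Transient API errors are expected while the control plane settles; keep polling.
			return false, nil
		}
		for _, cond := range cv.Status.Conditions {
			if cond.Type == configv1.OperatorAvailable && cond.Status == configv1.ConditionTrue {
				return true, nil
			}
		}
		return false, nil
	})
}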

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 26, 2019
staebler (Contributor, Author)

Also, this was made worse because neither the installer nor the e2e run setup waits for cluster version to be valid, so we were starting e2e before the cluster was fully configured.

We should fix that as well (probably in both places).

@smarterclayton Is this what you are looking for with regards to having the installer verify that the cluster version is valid?

staebler (Contributor, Author) commented Jan 26, 2019

This is the error from the installer when run with these changes.

FATAL Cluster failed to initialize: Could not update clusteroperator "cluster-monitoring-operator" (config.openshift.io/v1, 177 of 241)


But the cluster did eventually initialize, which means that the installer should not stop when the cluster version object has a Failing/True condition.

wking (Member) commented Jan 26, 2019

But the cluster did eventually initialize...

Would be interesting to see how it recovered. Maybe it will turn up in CI where we'll have logs.

staebler (Contributor, Author) commented Jan 26, 2019

Output from the installer after changing it to not stop when it sees a Failing/True condition.

time="2019-01-25T23:15:12-05:00" level=info msg="Waiting up to 10m0s for the cluster to be initialized..."
time="2019-01-25T23:15:13-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:15:44-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:15:53-05:00" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"openshift-apiserver\" (config.openshift.io/v1, 116 of 241)"
time="2019-01-25T23:16:24-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:16:54-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:17:25-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:17:56-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:18:34-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:19:12-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:19:43-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:20:03-05:00" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"cluster-monitoring-operator\" (config.openshift.io/v1, 177 of 241)"
time="2019-01-25T23:20:34-05:00" level=debug msg="Still waiting for the cluster to initialize: Could not update clusteroperator \"cluster-monitoring-operator\" (config.openshift.io/v1, 177 of 241)"
time="2019-01-25T23:20:53-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:21:24-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:21:55-05:00" level=debug msg="Still waiting for the cluster to initialize..."
time="2019-01-25T23:22:07-05:00" level=debug msg="Cluster is initialized"
time="2019-01-25T23:22:07-05:00" level=info msg="Waiting up to 10m0s for the openshift-console route to be created..."

staebler (Contributor, Author)

Pushed an update to wait up to 30 minutes, to not stop when ClusterVersion reports Failing/True, and to better downsample the log messages.
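
Roughly, the revised condition check might look like the sketch below, which extends the earlier sketch (same imports, plus github.com/sirupsen/logrus for logging): Failing=True is reported in the debug log but does not abort the wait, and a status message is only emitted when it changes. The names and the exact downsampling strategy are assumptions, not the installer's actual code.

// waitForClusterInitialization, revised: tolerate Failing=True and only log
// the status message when it changes. Sketch only.
func waitForClusterInitialization(client configclient.Interface, timeout time.Duration) error {
	lastMessage := ""
	return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
		cv, err := client.ConfigV1().ClusterVersions().Get("version", metav1.GetOptions{})
		if err != nil {
			return false, nil
		}
		available, failingMsg := false, ""
		for _, cond := range cv.Status.Conditions {
			switch {
			case cond.Type == configv1.OperatorAvailable && cond.Status == configv1.ConditionTrue:
				available = true
			case cond.Type == "Failing" && cond.Status == configv1.ConditionTrue:
				// Failing=True is informational: operators can recover on their
				// own, as the log output above shows.
				failingMsg = cond.Message
			}
		}
		if available {
			return true, nil
		}
		msg := "Still waiting for the cluster to initialize..."
		if failingMsg != "" {
			msg = "Still waiting for the cluster to initialize: " + failingMsg
		}
		// Downsample: only emit the message when it differs from the last one.
		if msg != lastMessage {
			logrus.Debug(msg)
			lastMessage = msg
		}
		return false, nil
	})
}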

staebler (Contributor, Author)

/retest

smarterclayton (Contributor) left a comment

A couple of comments but yes.

Review threads on cmd/openshift-install/create.go (outdated, resolved)
The ClusterVersion client is not available in the version of openshift/client-go
that was pinned. These changes remove the pinnings of openshift/client-go and
openshift/api.

Commands run:
  dep ensure
  dep ensure -update github.com/openshift/api

Vendor openshift/library-go for access to helper functions for evaluating
the status conditions of the ClusterVersion object.

After creating the cluster, wait up to 30 minutes for the
ClusterVersion object to indicate that the cluster has been
initialized prior to exiting the installer.
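
The library-go status-condition helpers mentioned above reduce the per-condition loops in the sketches to a one-line check. Roughly, they behave like this local stand-in (the real helper names and package paths in openshift/library-go may differ; the configv1 import is the same as in the earlier sketch):

// isConditionTrue mirrors the shape of a library-go style condition helper.
// Stand-in for illustration only.
func isConditionTrue(conditions []configv1.ClusterOperatorStatusCondition, condType configv1.ClusterStatusConditionType) bool {
	for _, c := range conditions {
		if c.Type == condType {
			return c.Status == configv1.ConditionTrue
		}
	}
	return false
}

// Usage in the wait loop:
//   initialized := isConditionTrue(cv.Status.Conditions, configv1.OperatorAvailable)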
abhinavdahiya (Contributor)

/approve

This looks good, will wait for some time to allow other people to comment.
@smarterclayton @wking PTAL

smarterclayton (Contributor)

/test e2e-aws

Sorry, using you as a guinea pig

wking (Member) commented Jan 30, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 30, 2019
openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, staebler, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,staebler,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

wking (Member) commented Jan 31, 2019

e2e-aws:

Flaky tests:

[sig-storage] In-tree Volumes [Driver: nfs] [Testpattern: Dynamic PV (default fs)] subPath should support existing directory [Suite:openshift/conformance/parallel] [Suite:k8s]

Failing tests:

[sig-apps] StatefulSet [k8s.io] Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails [Suite:openshift/conformance/parallel] [Suite:k8s]

Notes on the StatefulSet panics here.

/retest

@openshift-merge-robot openshift-merge-robot merged commit 9f73958 into openshift:master Jan 31, 2019
@wking wking mentioned this pull request Jan 31, 2019
wking added a commit to wking/openshift-installer that referenced this pull request Oct 27, 2021
The console may become optional [1], so teach the installer to handle
its absence gracefully.

We've waited on the console since way back in ff53523 (add logs at
end of install for kubeadmin, consoleURL, 2018-12-06, openshift#806).  Back
then, install-complete timing was much less organized, and since
e17ba3c (cmd: wait for the cluster to be initialized, 2019-01-25, openshift#1132)
we've blocked on ClusterVersion going Available=True. So the current
dependency chain is:

1. Console route admission blocks console operator from going
   Available=True in its ClusterOperator.
2. Console ClusterOperator blocks cluster-version operator from
   going Available=True in ClusterVersion.
3. ClusterVersion blocks installer's waitForInitializedCluster.

So we no longer need to wait for the route to show up, and can fail
fast if we get a clear IsNotFound.  I'm keeping a bit of polling so we
don't fail an install on a temporary network hiccup.

We don't want to drop the console check entirely, because when it is
found, we want:

* To continue to log that access pathway on install-complete.
* To continue to append the router CA to the kubeconfig.

That latter point has been done since 4033577 (append router CA to
cluster CA in kubeconfig, 2019-02-12, openshift#1242).  The motivation in that
commit message is not explicit, but the idea is to support folks who
naively run 'oc login' with the kubeadmin kubeconfig [2] (despite that
kubeconfig already having cluster-root access) when the console
route's cert's CA happens to be something that the user's local trust
store doesn't include by default.

[1]: openshift/enhancements#922
[2]: openshift#1541 (comment)
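
A minimal sketch of the fail-fast idea described above (illustrative only: the function name, polling parameters, and older context-free client signatures are assumptions; the route name and namespace match the openshift-console route the installer logs about):

package example

import (
	"time"

	routeclient "github.com/openshift/client-go/route/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForConsoleRoute polls briefly for the openshift-console route. A clear
// IsNotFound means the console is absent, so we stop early instead of burning
// the full timeout; other errors are treated as transient and retried.
func waitForConsoleRoute(client routeclient.Interface) (bool, error) {
	found := false
	err := wait.PollImmediate(2*time.Second, 30*time.Second, func() (bool, error) {
		_, getErr := client.RouteV1().Routes("openshift-console").Get("console", metav1.GetOptions{})
		if getErr == nil {
			found = true
			return true, nil
		}
		if apierrors.IsNotFound(getErr) {
			// Clear "not found": the console is not installed; fail fast.
			return true, nil
		}
		// Temporary network hiccup or API error: keep polling.
		return false, nil
	})
	return found, err
}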
wking added a commit to wking/openshift-installer that referenced this pull request Aug 1, 2022, with the same commit message as above.
TrilokGeer pushed a commit to TrilokGeer/installer that referenced this pull request Sep 14, 2022, with the same commit message as above.