-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Insights Operator pulling and exposing data from the OCM API #683
Insights Operator pulling and exposing data from the OCM API #683
Conversation
4de96ea
to
4dd02df
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
0078ca5
to
c9ca705
Compare
Does insights Operator have access to the org ID and/or cluster Id? |
@sbose78 The Insights Operator does have access to the cluster ID, but from what I know it doesn't have access to the org ID. |
/lgtm |
Thanks Ivan, I think we have all the approvals needed at this stage. |
Hi @smarterclayton Would you be able to review this, and potentially merge this ? |
LGTM from my perspective. |
c9ca705
to
c969d7c
Compare
I'd like to address a couple of questions I see in the above in multiple discussion threads. How does an entity make the SCA cert accessible to service accounts in other namespaces ?
Consumption by Pods
Consumption by BuildsAfter https://issues.redhat.com/browse/BUILD-274 is done, one would be able to do the following:
As of today, the Build API team has made progress with supporting Secret & ConfigMap volume mounts in Builds as part of https://issues.redhat.com/browse/BUILD-87 . https://issues.redhat.com/browse/BUILD-274 is a candidate for 4.10 . |
As Ben mentioned, this EP's sole goal is to ensure the SCA certs are pulled into the cluster and lifecycle'd by the Insights Operator. Consumption of the same is being driven by work in distinct Openshift components:
|
I slightly updated the proposal to address some points and cover the part of SCA certs use. Can you @bparees, @dhellmann please take a look at my last commit? Let's discuss potential blockers tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting one change : Create the Share
resource.
|
||
- `insights-operator-e2e-tests` suite can verify the SCA cert data | ||
is available | ||
- Basic test of the validity of the SCA certs. Mount the `etc-pki-entitlement` secret and run e.g `yum install` in the container |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since most consumers will presumably be mounting the Share, maybe this integration test should use that approach instead of shortcutting to use the Secret directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was discussed in a since-resolved comment thread somewhere, but.... i don't want the insights operator team's testing to be dependent on Share behavior.
they have the ability to test their functionality end to end directly, so they should do that.
The team that owns the Shared-Resource driver+builds should have tests that ensure they can consume this content successfully at their end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the bits involving the Share
resource should ultimately live in the OpenShift build suite. Test plan as follows:
- insights operator obtains SCA cert from OCM
- insights operator creates
Share
resource and ClusterRoleBinding - ocp build suite creates a build that does a yum install of subscription-only content
|
||
### SCA certs in API | ||
|
||
The SCA certificate is available via the `etc-pki-entitlement` secret in the `openshift-config-managed` namespace. The secret will be available for use in other namespaces by creating a cluster-scoped `Share` resource. Cluster admin creates a `clusterrolebinding` to allow a service account access to the `Share` resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not covered in this sentence is the structure of the Secret. Will a particular well-known key be used? If so, can we document it here, or is that a low-enough level of detail that it's not worth including in the enhancement proposal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC the key name can be any .pem
file. On a RHEL system the subscription cert key names have what appear to be a uuid for the file name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should mention it here. I am using kubernetes.io/tls
secret with tls.crt
for the certificate and tls.key
for the private key.
|
||
### Upgrade / Downgrade Strategy | ||
|
||
There is no upgrade/downgrade strategy needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a cluster is downgraded to a version that does not poll for entitlement updates, will that version of the insights operator (4.8.z?) have logic around to remove the etc-pki-entitlement
secret and other cruft to keep in-cluster components from trying to consume stale data? If the insights operator is downgraded before some consumer (builds and/or CSI drivers?), will the higher-version consumers gracefully handle the consumed secret's removal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a cluster is downgraded to a version that does not poll for entitlement updates, will that version of the insights operator (4.8.z?)
for tech preview it's not applicable since upgrade/downgrade isn't allowed, but in general if you downgrade below the level at which the insights operator has this behavior then yes, i'd expect the content to become stale (i'm not sure how long the tokens are good for)
i'm not sure how you'd propose to solve that on downgrade, though. the older version of the operator wouldn't even know about the content to remove it(even if we could agree that was the right thing to do, which i don't think i do)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this sounds like an edge case to me and yes there will be a stale secret in such case.
|
||
Risk: Insights Operator is unable to expose/update the data in the OpenShift API | ||
|
||
Mitigation: The Insights Operator is marked as Degraded (in case the HTTP code is lower than 200 or greater than 399 and is not equal to 404, because HTTP 404 means that the organization didn't allow this feature). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Degraded means the cluster can't upgrade. Are we sure we want this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we want it.
-
this is tech preview. we should absolutely gather data on whether this operator is ending up degraded during the tech preview period before making a final decision for GA
-
degraded doesn't block z-stream upgrades, and can be overridden for y-stream upgrades, the point is it's a conscious choice. I gave this (somewhat dubious) example in another comment thread on this EP: suppose you have pods that need this entitlement to do work at startup. Doing an upgrade is going to trigger all those pods to restart. You want to know, before you upgrade your cluster, that you don't have valid entitlements on your cluster, or all those pods are going to fail to restart after the node reboots triggered by the upgrade.
fundamentally this operator has a job to do, and if it's not doing the job, it should report degraded(after a reasonable period of time)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that said, going degraded should only happen after some number of retries or period of time. not on first failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and needs to take into account disconnected clusters (i.e. it shouldn't be degraded on a cluster w/ no network access)
|
||
### SCA certs in API | ||
|
||
The SCA certificate is available via the `etc-pki-entitlement` secret in the `openshift-config-managed` namespace. The secret will be available for use in other namespaces by creating a cluster-scoped `Share` resource. Cluster admin creates a `clusterrolebinding` to allow a service account access to the `Share` resource. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC the key name can be any .pem
file. On a RHEL system the subscription cert key names have what appear to be a uuid for the file name.
|
||
- `insights-operator-e2e-tests` suite can verify the SCA cert data | ||
is available | ||
- Basic test of the validity of the SCA certs. Mount the `etc-pki-entitlement` secret and run e.g `yum install` in the container |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the bits involving the Share
resource should ultimately live in the OpenShift build suite. Test plan as follows:
- insights operator obtains SCA cert from OCM
- insights operator creates
Share
resource and ClusterRoleBinding - ocp build suite creates a build that does a yum install of subscription-only content
|
||
### Graduation Criteria | ||
|
||
This feature is planned as a technical preview in OCP 4.9 and is planned to go GA in 4.10. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the Share
bits are now planned as tech preview for OCP 4.10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it mean that it will block the graduation criteria from the TP to GA mentioned here? If so then we would need to go GA with your bits...I guess
#### Dev Preview -> Tech Preview | ||
- opt-in feature (called `InsightsOperatorPullingSCA`) enabled with `TechPreviewNoUpgrade` feature set | ||
- Insights Operator is able to download the certificates from OCM API and expose it in a cluster API | ||
- Insights Operator is marked as degraded in case of the number of unsuccessful requests to the OCM API exceeds defined threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto my comments above - degraded means no cluster upgrades. Do we want this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It appears that prior GH comments we reached consensus that Degraded is a desirable condition. I would like to see this justification reflected in the proposal itself.
|
||
- `insights-operator-e2e-tests` suite can verify the SCA cert data | ||
is available | ||
- Basic test of the validity of the SCA certs. Mount the `etc-pki-entitlement` secret and run e.g `yum install` in the container |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was discussed in a since-resolved comment thread somewhere, but.... i don't want the insights operator team's testing to be dependent on Share behavior.
they have the ability to test their functionality end to end directly, so they should do that.
The team that owns the Shared-Resource driver+builds should have tests that ensure they can consume this content successfully at their end.
|
||
### Upgrade / Downgrade Strategy | ||
|
||
There is no upgrade/downgrade strategy needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a cluster is downgraded to a version that does not poll for entitlement updates, will that version of the insights operator (4.8.z?)
for tech preview it's not applicable since upgrade/downgrade isn't allowed, but in general if you downgrade below the level at which the insights operator has this behavior then yes, i'd expect the content to become stale (i'm not sure how long the tokens are good for)
i'm not sure how you'd propose to solve that on downgrade, though. the older version of the operator wouldn't even know about the content to remove it(even if we could agree that was the right thing to do, which i don't think i do)
|
||
Risk: Insights Operator is unable to expose/update the data in the OpenShift API | ||
|
||
Mitigation: The Insights Operator is marked as Degraded (in case the HTTP code is lower than 200 or greater than 399 and is not equal to 404, because HTTP 404 means that the organization didn't allow this feature). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we want it.
-
this is tech preview. we should absolutely gather data on whether this operator is ending up degraded during the tech preview period before making a final decision for GA
-
degraded doesn't block z-stream upgrades, and can be overridden for y-stream upgrades, the point is it's a conscious choice. I gave this (somewhat dubious) example in another comment thread on this EP: suppose you have pods that need this entitlement to do work at startup. Doing an upgrade is going to trigger all those pods to restart. You want to know, before you upgrade your cluster, that you don't have valid entitlements on your cluster, or all those pods are going to fail to restart after the node reboots triggered by the upgrade.
fundamentally this operator has a job to do, and if it's not doing the job, it should report degraded(after a reasonable period of time)
|
||
Risk: Insights Operator is unable to expose/update the data in the OpenShift API | ||
|
||
Mitigation: The Insights Operator is marked as Degraded (in case the HTTP code is lower than 200 or greater than 399 and is not equal to 404, because HTTP 404 means that the organization didn't allow this feature). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that said, going degraded should only happen after some number of retries or period of time. not on first failure.
|
||
Risk: Insights Operator is unable to expose/update the data in the OpenShift API | ||
|
||
Mitigation: The Insights Operator is marked as Degraded (in case the HTTP code is lower than 200 or greater than 399 and is not equal to 404, because HTTP 404 means that the organization didn't allow this feature). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and needs to take into account disconnected clusters (i.e. it shouldn't be degraded on a cluster w/ no network access)
/approve looks like there might be a few final items @adambkaplan would like to see explicitly stated in the EP, but in general i think the critical concerns are addressed and/or roadmapped. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bparees, iNecas The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks @bparees. I added one more commit mentioning the reasons for the decision on the degraded status. |
/lgtm |
FYI, the consumption APIs are undergoing changes |
No description provided.