TEP-0089: Non-falsifiable provenance support #529
Conversation
/assign
I left a bunch of questions that, I think, mostly stem from my lack of knowledge about how SPIFFE/SPIRE works.
/assign @pritidesai
Thanks for adding more detail! I've left some initial comments @priyawadhwa - would really like to dig into some of the requirements in more detail and get more clarity on exactly what assurances this will give us.
Thanks for adding all the additional detail @priyawadhwa, I think the overall picture is becoming a lot clearer for me!
A couple thoughts compared to the initial POC and proposal:
- Sounds like we're shifting from securing specifically the results in the TaskRun to securing the entire TaskRun spec and status, is that right?
- It sounds like we no longer need the individual pods to be connecting to SPIRE - or maybe that's for solving a different problem? (In this proposal, could someone still modify the termination message of the pod? Maybe it's actually not possible to modify the termination message unless you're the kubelet...)
teps/0089-spire-support.md (Outdated)
## Proposal
As mentioned above, the basic design looks like this:
1. Tekton Pipelines receives a TaskRun config, and generates the Pod for it
As an extra detail, related to my question above about modifications during execution: AFAIK the pipelines controller will be updating the TaskRun status as the TaskRun executes - I'm wondering if we want to be signing the TaskRun with each modification (and maybe even verifying that nothing has changed since the last modification?)
I'm not sure if I'm going too far, but it seems like if we don't do that, something else could modify the TaskRun status during execution, and the pipelines controller would just incrementally add to those changes, only signing at the end?
That makes sense, and would probably be good to have. I'll add some thoughts to the proposal!
Instead of signing the TaskRun with each modification, do you think it could make sense to store the current TaskRun in memory and compare it to the actual thing? Maybe something like this (see the sketch after this list):
- store the initial TaskRun in memory
- when we want to modify the running TaskRun, compare the running TaskRun to the one in memory.
- if it doesn't match, then we never request an SVID and skip any SPIRE signing altogether
- if it does match, then we modify the TaskRun and store the new modified version in memory, and repeat steps 2-4 until the TaskRun completes
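A rough sketch of what that in-memory comparison could look like in the controller (hypothetical Go; the `trustCache` type and `checkAndStore` helper are made up for illustration, and only the spec is compared here for brevity - the real check would presumably cover status too):

```go
package taskrunguard

import (
	"sync"

	v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
	"k8s.io/apimachinery/pkg/api/equality"
)

// trustCache remembers the last TaskRun spec the controller itself wrote,
// keyed by namespace/name, so later reconciles can detect outside modification.
// Note: this cache is lost if the controller restarts, as pointed out later
// in this thread.
type trustCache struct {
	mu   sync.Mutex
	seen map[string]*v1beta1.TaskRunSpec
}

func newTrustCache() *trustCache {
	return &trustCache{seen: map[string]*v1beta1.TaskRunSpec{}}
}

// checkAndStore returns false when the running TaskRun no longer matches the
// stored copy; in that case the controller would never request an SVID and
// would skip SPIRE signing altogether. Otherwise it stores the (possibly
// updated) spec and returns true.
func (c *trustCache) checkAndStore(tr *v1beta1.TaskRun) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	key := tr.Namespace + "/" + tr.Name
	if prev, ok := c.seen[key]; ok && !equality.Semantic.DeepEqual(prev, &tr.Spec) {
		return false // something else modified the TaskRun during execution
	}
	c.seen[key] = tr.Spec.DeepCopy()
	return true
}
```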
Yah, I think the entire spec needs to be secured to guarantee non-falsifiable provenance (since we take the results, steps, parameters and other info from the TaskRun which all ends up in provenance).
That's a good point! If the termination message can be modified then we'll also need pods connected to SPIRE so that results can be secured... from the docs it isn't super clear if anyone other than the kubelet has access to the termination message. If we're not sure, I think we can just play it safe and also secure results.
Made some updates! I put the termination message issue as a Risk for now; I think it should be easy enough to build signing of Results on top of the work in this proposal if it's needed. I'll try to find out if the termination message can be changed by anyone other than the kubelet in the meantime!
Looks like they can. 😢 https://gist.github.com/imjasonh/9011395d1f88af02b5bdc5901b739090 This might be another good signal that we should stop relying on termination messages in general.
Thanks @imjasonh! I'll update the TEP.
Some nits from me, and a couple of bigger things I'm wondering about; specifically, I'm wondering about including the requirement to verify the TaskRun itself.
To fulfill non-falsifiability, do we need to verify the TaskRun, or just the results in the TaskRun - and/or if we DO need to verify the TaskRun, do we need to verify pods and Pipelines instead?
If we do need to verify the TaskRun (and the PipelineRun), I'm wondering about the performance implications - keeping all executing TaskRuns and PipelineRuns in memory has a memory implication, but also I think that wouldn't work because the controller can be restarted during execution. So I think we'd need to sign and verify the entire TaskRun (maybe PipelineRun) content on every reconcile that updates them, which I'm thinking could have a significant performance impact?
Lastly, verifying TaskRuns/PipelineRuns (and/or even pods) will introduce limitations around interoperability with other mutating admission controllers that users might want to be using (e.g. Istio sidecar injection, or the SolarWinds approach to inject tasks into a Pipeline).
/assign @jerop
This looks great to me now; the wording on exactly what this guarantees is much clearer than what I originally came up with :)
One detail to consider is that SPIRE certificates are going to be short-lived (intentionally), so if used for signing persistent data, they may be expired by the time you try to verify them. That shouldn't be a blocker, just something to realize you may need to code around (either by ignoring the expiration, having a grace period, or using a TSA). Overall, I like where this is going, the TEP looks good to me.
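One shape the "code around it" part could take (a sketch, not something the TEP prescribes): persist the signing time next to the signature, then verify the SVID's certificate chain as of that recorded time instead of the current time. Whether that recorded time can be trusted is exactly what a TSA would solve.

```go
package svidverify

import (
	"crypto/x509"
	"time"
)

// verifyChainAsOf validates the SVID leaf against the trust bundle at the
// recorded signing time, so an SVID that has expired since the TaskRun ran
// still verifies. signedAt must itself be trustworthy (e.g. a TSA token).
func verifyChainAsOf(leaf *x509.Certificate, roots, intermediates *x509.CertPool, signedAt time.Time) error {
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:         roots,
		Intermediates: intermediates,
		CurrentTime:   signedAt, // "now" is replaced by the time the signature was produced
	})
	return err
}
```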
Mostly nits from me, I think we're generally on the same page though, so nothing blocking on my side!! Thanks for all the back and forth on this.
A final couple of thoughts from me, but not blockers:
- What is involved / required to install SPIRE itself in a cluster? I'm wondering if there is anything a user would need to be aware of in order to use this feature as far as the SPIRE requirements go (I don't know what this would be, maybe something like 'requires an image registry to push to' or something)
- It might be clearer to rename this TEP to something like "Non-falsifiable provenance support" vs "SPIRE support" (so it's more obvious to a reader who isn't familiar with SPIRE how this TEP might be relevant to them)
/approve
teps/0089-spire-support.md (Outdated)
1. Enabling the alpha feature for SPIRE in Tekton
1. Requesting an SVID & signature over Results for a TaskRun
1. Verification of SPIRE with Chains
1. Verify that a TaskRun that isn't created by Tekton isn't signed by Chains
Would you be planning to set these up as something that runs regularly, or something that's done manually to verify the behavior? I think these would be valuable as end-to-end tests we keep going forward; not 100% sure if we'd prefer to always have SPIRE enabled in our end-to-end tests (and deployed and configured in the cluster under test) or only do it for a subset of tests.
We'd need to set up a test cluster that has both Chains + Pipelines in it as well (maybe Chains already has this?)
Ideally it would be something that runs regularly, but I'll probably know more once I start actually writing the code 😅
I think we will need to set up a new test cluster; Chains doesn't have one as of now!
### 2. Tekton Pipelines can't verify that the results it reads weren't modified
The solution to this is Signed Results.
We will modify the entrypointer image to sign results with SPIRE once they're available.
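For a rough idea of what that entrypointer change might involve, here is a sketch against the go-spiffe v2 Workload API. The `signResult` helper, the result bytes, and where the signature gets written back are assumptions for illustration, not something this TEP pins down:

```go
package entrypoint

import (
	"context"
	"crypto"
	"crypto/rand"
	"crypto/sha256"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

// signResult asks the SPIRE agent (over the Workload API socket mounted into
// the pod) for this workload's SVID and signs the serialized result with the
// SVID's private key. It returns the signature and the SPIFFE ID it belongs to.
func signResult(ctx context.Context, result []byte) (sig []byte, id string, err error) {
	svid, err := workloadapi.FetchX509SVID(ctx) // uses the SPIFFE_ENDPOINT_SOCKET env var by default
	if err != nil {
		return nil, "", err
	}
	digest := sha256.Sum256(result)
	sig, err = svid.PrivateKey.Sign(rand.Reader, digest[:], crypto.SHA256)
	if err != nil {
		return nil, "", err
	}
	return sig, svid.ID.String(), nil
}
```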
Do you know how fine-grained authorization is for the signing keys? E.g. does it allow per-container authorization, or is it only for the Pod? I'm wondering if/how we might be able to reduce access to keys from user steps 🤔
There are a few different selectors you can configure, including the container image and container name, so I believe we should be able to reduce access from user steps.
Overall sounds good, just a few questions:
- This could be an opt-in or opt-out feature, right?
- If that's the case, I would like to explore the possible problems that could arise, like the CSI driver not being present, etc.
- On the "release" problem of whether this is enabled or not (because of the need to have a volume mounted in the controller), the operator could help
Thanks @bobcatfish and @vdemeester! To answer some of your questions:
I might need to look into this a little more, but I definitely think we'll need to add this to any documentation around this feature. In addition to the k8s yaml described in the docs, it looks like users will need to install the k8s node attestor, which requires some setup within the cluster.
SGTM
Yup, it'll be opt-in as an experimental feature.
Right now, I'm just planning to clearly document the required tools users will need to install before using the feature. We can definitely consider better ways of handling potential errors in the future.
That's a good idea! I'll add that into the TEP as a potential option for releasing (@bobcatfish had also suggested doing two separate releases, one with the volume and one without the volume).
SGTM! @jerop and @pritidesai, I think you are currently the outstanding approvers on this one.
Thanks @priyawadhwa, LGTM. We have time to figure things out 😇
Thanks @priyawadhwa!
"Marking a TaskRun as failed" has a specific meaning right now - that we mark ConditionSucceeded
as false - so it'd be great to clarify in the proposal that it means adding an annotation to the TaskRun indicating that the verification of the TaskRun failed (understood this is what was meant from reading other discussions here, that it's not really failing the TaskRun)
Also had a few questions below
(Please squash the commits)
As mentioned above, the basic design looks like this (this is meant to be high level and still needs to be fleshed out a bit):
1. Tekton Pipelines receives a TaskRun config, and generates the Pod for it with SPIRE mounted in
1. The Pod executes, and the entrypointer requests an SVID and signature over the Results
1. Tekton Pipelines verifies the Results
Are we referring to task results here? Task results are part of the TaskRun status. So the entrypointer is requesting an SVID and signature for that particular result? Let's say my task has 10 different results, will I see that many SVIDs and signatures? Are these SVIDs and signatures also stored as annotations in addition to the SVID and signature of the TaskRun itself?
The SVID for a TaskRun could be spiffe://tekton.dev/<taskrun>, and for a task result it could be spiffe://tekton.dev/<taskrun>/result/<result> or spiffe://tekton.dev/<taskrun>/<result> 🤔
Yup, we're referring to task results. Each result will have an associated signature, and all signatures can be verified with one SVID. These will be stored as annotations.
Are these SVIDs and signatures also stored as annotations in addition to the SVIDs and signature of the taskRun itself?
Yup, ultimately we will have signatures and an SVID for the Task results, and we'll also have a signature and an SVID over the entire TaskRun.
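To make the "many signatures, one SVID" shape concrete, here is a hedged sketch of the verification side, assuming ECDSA-keyed SVIDs and that the result values and signatures have already been decoded from the TaskRun annotations (the `verifyResults` helper and its inputs are illustrative, not part of the TEP):

```go
package verify

import (
	"crypto/ecdsa"
	"crypto/sha256"
	"crypto/x509"
	"fmt"
)

// verifyResults checks every result's signature against the single SVID leaf
// certificate attached to the TaskRun. results maps result name -> value,
// sigs maps result name -> ASN.1 DER-encoded ECDSA signature.
func verifyResults(svidLeaf *x509.Certificate, results, sigs map[string][]byte) error {
	pub, ok := svidLeaf.PublicKey.(*ecdsa.PublicKey)
	if !ok {
		return fmt.Errorf("expected an ECDSA public key in the SVID, got %T", svidLeaf.PublicKey)
	}
	for name, value := range results {
		digest := sha256.Sum256(value)
		if !ecdsa.VerifyASN1(pub, digest[:], sigs[name]) {
			return fmt.Errorf("signature for result %q did not verify", name)
		}
	}
	return nil
}
```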
1. Tekton Pipelines receives a TaskRun config, and generates the Pod for it with SPIRE mounted in
1. The Pod executes, and the entrypointer requests an SVID and signature over the Results
1. Tekton Pipelines verifies the Results
1. Meanwhile, Tekton Pipelines has been verifying that the TaskRun hasn't been modified during execution
Please elaborate a little on TaskRun, I think TaskRun here means task specifications (including input params) either through taskRef or taskSpec 🤔 Or does it also include status, since status is modified by the pipeline controller until the execution is over?
I basically meant it to include the entire TaskRun YAML that's being executed, and I'm guessing the status should be included as well.
Thanks @priyawadhwa for this proposal, excited to see this moving forward 🎉 Please squash all the commits, add a little description in the PR, and update the TEP. The proposal is very high level at this point, like you mentioned. I would love to see more details when we move this TEP to implementable.
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bobcatfish, jerop, pritidesai, vdemeester.
Thanks @jerop and @pritidesai! I just squashed the commits and added some more details based on your comments.
We've got approvals from all the approvers, so I'm going to go ahead and lgtm this so it will be merged as proposed - @priyawadhwa, the next step would be to open a PR to update the status to "implementable" - @pritidesai mentioned including some more examples and visuals as part of that PR as well (#529 (comment)) - maybe an architecture diagram and/or a sequence diagram?
/lgtm
This change fixes the date on TEP-0089 so that the PipelineRun on open TEPs passes. The `pull-community-teps-lint` PipelineRun did not run when we merged TEP-0089 in #529. It got merged with a linting issue - a wrong date - that's causing the same PipelineRun to fail on open TEPs.