
Improve UX of Step Resource Requests #2986

Closed
danielhelfand opened this issue Jul 22, 2020 · 24 comments
Labels
* help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
* kind/feature: Categorizes issue or PR as related to a new feature.
* lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@danielhelfand
Member

Feature request

Currently, users have the ability to define resource requests (and limits) for each step of a Task:

    steps:
    - name: step1
      image: ...
      command: ...
      resources:
        limits:
          cpu: 500m
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 1Gi
    - name: step2
      image: ...
      command: ...
      resources:
        limits:
          cpu: 500m
          memory: 1Gi
        requests:
          cpu: 500m
          memory: 1Gi

This is misleading in my opinion because it makes it seem like the user has control of setting the requests for each step. However, only the max requests from all steps are ever used as documented here.

So a user should only have to, at most, define resource requests (the max values needed for a container) once for steps. Maybe it makes sense to allow users to define CPU, memory, and ephemeral storage once as part of a TaskRun and allow this to be configurable for PipelineRuns via taskRunSpecs?
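
As a rough sketch of what this could look like at the TaskRun level (the computeResources field below is purely illustrative and not part of the current API; an actual design would need its own proposal):

    apiVersion: tekton.dev/v1beta1
    kind: TaskRun
    metadata:
      name: example-taskrun
    spec:
      taskRef:
        name: example-task
      # hypothetical field: requests declared once for the whole TaskRun
      # instead of per step
      computeResources:
        requests:
          cpu: 500m
          memory: 1Gi
          ephemeral-storage: 1Gi

For PipelineRuns, the same block could then be exposed per task via taskRunSpecs.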

Use case

Discussed in #2984: #2984 (comment)
Discussed in #2931: #2931 (comment)

The idea has come up a couple of times in issues where the experience of resource requests is confusing for users, and it might help to provide this option and clearly document it with better examples.

@danielhelfand added the kind/feature label Jul 22, 2020
@imjasonh
Member

+1, the CLI/UI could also make this clearer

Per-step limits are still valid and enforced, so maybe we should migrate to a world where requests are specified per-Task and each step can only set a limit?
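
As a purely hypothetical sketch of that split (the task-level computeResources field below does not exist in this form), a Task would declare its requests once while each step keeps only a limit:

    spec:
      # hypothetical: requests declared once for the whole Task
      computeResources:
        requests:
          cpu: 500m
          memory: 1Gi
      steps:
      - name: step1
        image: ...
        resources:
          limits:
            memory: 1Gi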

@danielhelfand
Member Author

+1, the CLI/UI could also make this clearer

Yes, I think the CLI could definitely add some kind of option for this.

Per-step limits are still valid and enforced, so maybe we should migrate to a world where requests are specified per-Task and each step can only set a limit?

I think it's reasonable to still support limits for steps.

Another benefit of this approach is that it could eliminate the need to search through all containers in a TaskRun to find the max request, since the values would be known in advance.

@danielhelfand
Member Author

Something that will still have to be worked out, though, is how to handle step limits without repeating what occurred in #1937.

@danielhelfand
Member Author

In rethinking this with regard to limits, why not have one limit for the entire Task? That way, the Task can be set up with a limit to be enforced via a TaskRun, while resource requests can be specified at run time via a request property.

Since a step has access to all resources of a pod, is there any reason for restricting individual steps?

@qu1queee
Contributor

Regarding step limits being configurable, would this also apply to the Tekton containers, like the step-create-dir-* containers found in a TaskRun?

@imjasonh
Member

Regarding step limits being configurable, would this also apply to the Tekton containers, like the step-create-dir-* containers found in a TaskRun?

I don't think so. Tekton is responsible for those steps, and should be responsible for configuring their limits.

Perhaps that step should be an initContainer anyway, so there's a clearer distinction between "step containers" (containers) and "Tekton's containers" (initContainers)

@qu1queee
Contributor

Makes sense. Do you think it would be possible for Tekton to define reasonable limit values for those containers (e.g. step-create-dir-*), instead of relying on the LimitRange default value?

@SaschaSchwarze0
Contributor

Regarding step limits being configurable, would this also apply to the Tekton containers, like the step-create-dir-* containers found in a TaskRun?

I don't think so. Tekton is responsible for those steps, and should be responsible for configuring their limits.

Perhaps that step should be an initContainer anyway, so there's a clearer distinction between "step containers" (containers) and "Tekton's containers" (initContainers)

@imjasonh I +1 on Tekton being responsible for those steps (init containers as well as containers added by Tekton) and therefore providing the resources (imo not just limits but also requests). But I suggest that Tekton provides reasonable values.

What we see (I am on the same team as @qu1queee; we are still on Tekton 0.11, about to move to 0.14, so if things have changed in the meantime, please correct me) is the following: we have a LimitRange in place in the namespace where the TaskRun (with an embedded TaskSpec) gets created. The LimitRange contains default, defaultRequest, min and max for CPU and memory.

For initContainers, Tekton takes default and defaultRequest. This is imo not correct, because in our case it requests 1 Gi of memory, which is way too much. If Tekton is responsible for those init containers, then it should specify reasonable values. Tekton knows what these containers are doing, so it should specify a request and limit that is suitable for that (while still respecting min and max of the LimitRange to create valid containers).

The same applies to Tekton's containers (step-create-dir-image, step-image-digest-exporter), where Tekton imo takes min for the request and default as the limit. Tekton here should also specify reasonable values suitable for what the containers are doing, not arbitrary values from a LimitRange that it can't control (but has to respect). And I agree, step-create-dir-image sounds more like an initContainer.
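
For reference, a LimitRange of the kind described above might look roughly like this (the values are illustrative only, not our actual configuration):

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: example-limitrange
    spec:
      limits:
      - type: Container
        default:           # limit applied to containers that specify none
          cpu: "1"
          memory: 1Gi
        defaultRequest:    # request applied to containers that specify none
          cpu: 500m
          memory: 512Mi
        min:
          cpu: 100m
          memory: 128Mi
        max:
          cpu: "2"
          memory: 4Gi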

In rethinking this with regards to limits, why not have one limit for the entire Task? This way, the Task can be set up with a limit to be enforced via a TaskRun, but then resource requests can be specified at run time via a request property?

Since a step has access to all resources of a pod, is there any reason for restricting individual steps?

@danielhelfand that would be new to me. A step is a container and only has access to the memory and CPU defined for that container.

This is misleading in my opinion because it makes it seem like the user has control of setting the requests for each step. However, only the max requests from all steps are ever used as documented here.

So a user should only have to, at most, define resource requests (the max values needed for a container) once for steps. Maybe it makes sense to allow users to define CPU, memory, and ephemeral storage once as part of a TaskRun and allow this to be configurable for PipelineRuns via taskRunSpecs?

@danielhelfand No, this refers to my previous comment. As my understanding is that CPU and memory are assigned per container, they can't be assigned once per TaskRun.

I would really like to see a mechanism where the user creating the TaskSpec has control over the step resource requests AND limits. I agree with the concern outlined elsewhere that, because the containers are all started at the same time, this can easily eat the resources of a node when there are many steps (or even make the pod unschedulable). But I also think that the current algorithm can cause an issue.

Assume there are two steps, and leave all Tekton-added containers out of the picture for simplification. The first step has request = limit = 1 Gi, the second has request = limit = 2 Gi. When the LimitRange min is 128 Mi, Tekton will change the request of the first step to 128 Mi. Kubernetes can now schedule this pod to a node that currently has only 2176 Mi of memory available, because scheduling happens based on the requests (and not the limits). The first step now starts to run. If that is an operation that really needs memory up to 1 Gi, then this container will be OOMKilled because the node can't give it the necessary memory, even though the container limit is not reached.

I am not asking for the current mechanism to be generally changed, but there should at least be an option somewhere (global as an env property, a flag on the TaskSpec, something like that) that forces the step's request and limit to be set on the Pod's containers without modification.
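
To make the scenario concrete, the step containers of the resulting pod would end up roughly like this under the algorithm described above (illustrative names and values, not actual controller output):

    containers:
    - name: step-one            # declared request = limit = 1Gi
      resources:
        requests:
          memory: 128Mi         # lowered to the LimitRange min
        limits:
          memory: 1Gi
    - name: step-two            # carries the max request across steps
      resources:
        requests:
          memory: 2Gi
        limits:
          memory: 2Gi

The scheduler only sums the requests (128 Mi + 2 Gi = 2176 Mi), so the pod can land on a node that cannot actually provide the full 1 Gi the first step may need.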

@danielhelfand
Member Author

@SaschaSchwarze0 Not sure if you have read how Tekton handles step/container resource requests, but there are some differences from the traditional approach by Kubernetes. Please see the documentation below:

* https://docs.google.com/presentation/d/1-FNMMbRuxckAInO2aJPtzuINqV-9pr-M25OXQoGbV0s/edit#slide=id.p
* https://github.com/tektoncd/pipeline/blob/master/docs/taskruns.md#specifying-limitrange-values
* Also, see issue #598

@SaschaSchwarze0
Contributor

@SaschaSchwarze0 Not sure if you have read how Tekton handles step/container resource requests, but there are some differences from the traditional approach by Kubernetes. Please see the documentation below:

* https://docs.google.com/presentation/d/1-FNMMbRuxckAInO2aJPtzuINqV-9pr-M25OXQoGbV0s/edit#slide=id.p

* https://github.com/tektoncd/pipeline/blob/master/docs/taskruns.md#specifying-limitrange-values

* Also, see issue #598

Yes, I know it.

@danielhelfand
Member Author

I agree that the request is reasonable as far as being able to override step resource requests, if it is indeed the case that each step container doesn't have access to the max resources defined across all containers.

cc @imjasonh Any thoughts on this?

@qu1queee
Contributor

qu1queee commented Aug 8, 2020

folks, any update on this?

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/stale label Nov 6, 2020
@vdemeester
Member

/remove-lifecycle stale

@tekton-robot removed the lifecycle/stale label Nov 9, 2020
@vdemeester mentioned this issue Nov 20, 2020
@erinz

erinz commented Dec 1, 2020

+1, we need this feature too.

@zhangtbj
Contributor

zhangtbj commented Dec 2, 2020

+1, I think it doesn't need to be an explicit setting in the Tekton resource definition.

But in a shared or complex Kubernetes environment, it is better to provide admins with the capability to customize these settings. Although some of the containers are created by Tekton, they also depend on end-user input, and just using the default settings may cause problems.

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/stale label Mar 2, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot added the lifecycle/rotten label and removed the lifecycle/stale label Apr 1, 2021
@vdemeester
Member

/lifecycle frozen
We still need to support this.

@tekton-robot added the lifecycle/frozen label and removed the lifecycle/rotten label Apr 16, 2021
@dibyom added the help wanted label May 4, 2021
@bobcatfish
Collaborator

Quick update: @vdemeester is making some improvements to how Tekton handles LimitRanges (#4176), though I think this particular issue is also requesting support for specifying limits at the Task level.

lbernick added a commit to lbernick/community that referenced this issue Dec 20, 2021
This commit expands the scope of TEP-0094 to cover the user experience of
specifying resource requests and limits in Tasks.
Focusing only on Step and Sidecar resource requirements may be too narrow of a scope for this TEP.

This is largely motivated by tektoncd/pipeline#2986, because the solution
to this problem may involve removing the ability to specify Step resource requests.
It doesn't make sense to override Step resource requests in TaskRuns if users shouldn't be able
to specify Step resource requests in the first place.

The scope is also expanded to include parameterizing resource requests based on discussion in
tektoncd#560, around treating resource requirement parameterization
and runtime overrides as "both/and", rather than "either/or". Fixing Task's resource requirement UX
may allow us to get parameterization for free.
@lbernick
Member

As of v0.28.0, Step resource requests and limits are left unchanged, or adjusted based on limitranges present (see docs for more info)-- thanks @vdemeester!

I'm going to close this issue because the Step level behavior described is no longer true. I've opened #4470 as a FR for Task-level resource requirements.

/close

@tekton-robot
Collaborator

@lbernick: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

As of v0.28.0, Step resource requests and limits are left unchanged, or adjusted based on limitranges present (see docs for more info)-- thanks @vdemeester!

I'm going to close this issue because the Step level behavior described is no longer true. I've opened #4470 as a FR for Task-level resource requirements.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vdemeester
Member

/close
🙏🏼

@tekton-robot
Collaborator

@vdemeester: Closing this issue.

In response to this:

/close
🙏🏼

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
