
[AIRFLOW-5445] Reduce the required resources for the Kubernetes's sidecar #6062

Merged
merged 1 commit into apache:master on Sep 15, 2019

Conversation

mik-laj
Member

@mik-laj mik-laj commented Sep 9, 2019

For GKE, the default value is 100m.

05:25 $ kubectl describe limitrange
Name:       limits
Namespace:  default
Type        Resource  Min  Max  Default Request  Default Limit  Max Limit/Request Ratio
----        --------  ---  ---  ---------------  -------------  -----------------------
Container   cpu       -    -    100m             -              -

This means that only 5 pods can be started on a single n1-standard-1 machine. After this change, we will be able to run 9 pods at the same time. It is a small change, but in special cases it can significantly increase the performance of the cluster.
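(For illustration, a minimal sketch of how an explicit, small CPU request on the sidecar container keeps the LimitRange from injecting its 100m default, using the official kubernetes Python client; the container name, image, and values here are illustrative and may differ from what this PR actually sets.)

from kubernetes import client as k8s

# Explicitly requesting a tiny amount of CPU prevents the namespace
# LimitRange from applying its 100m default request to this container.
sidecar = k8s.V1Container(
    name="airflow-xcom-sidecar",                        # illustrative name
    image="alpine",                                     # illustrative image
    command=["sh", "-c", "while true; do sleep 30; done"],
    resources=k8s.V1ResourceRequirements(
        requests={"cpu": "1m"},                         # illustrative value
    ),
)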

--


Make sure you have checked all steps below.

Jira

  • My PR addresses the following Airflow Jira issues and references them in the PR title. For example, "[AIRFLOW-XXX] My Airflow PR"
    • https://issues.apache.org/jira/browse/AIRFLOW-5445
    • In case you are fixing a typo in the documentation, you can prepend your commit with [AIRFLOW-XXX]; code changes always need a Jira issue.
    • In case you are proposing a fundamental code change, you need to create an Airflow Improvement Proposal (AIP).
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Description

  • Here are some details about my PR, including screenshots of any UI changes:

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what they do
    • If you implement backwards incompatible changes, please leave a note in Updating.md so we can assign it to an appropriate release

@mik-laj mik-laj added the k8s label Sep 9, 2019
@mik-laj mik-laj self-assigned this Sep 9, 2019
@potiuk
Member

potiuk commented Sep 9, 2019

@dimberman - is this OK from the "production" point of view? I am not sure what the design considerations were for pod scheduling via Kubernetes.

@mik-laj
Member Author

mik-laj commented Sep 9, 2019

I thought about it some more, and we could turn off the resource requests altogether. It is common practice for this type of container.
https://github.com/istio/istio/blob/c4db9cb56fa619291c8092c7b8add3b980c3fb47/pkg/kube/inject/testdata/inject/traffic-params-empty-includes.yaml.injected

@codecov-io

Codecov Report

Merging #6062 into master will increase coverage by <.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6062      +/-   ##
==========================================
+ Coverage   80.02%   80.02%   +<.01%     
==========================================
  Files         594      594              
  Lines       34765    34765              
==========================================
+ Hits        27821    27822       +1     
+ Misses       6944     6943       -1
Impacted Files                          Coverage Δ
airflow/kubernetes/pod_generator.py     94.69% <ø> (ø) ⬆️
airflow/utils/dag_processing.py         58.98% <0%> (+0.18%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b21be80...cecab23. Read the comment docs.

@mik-laj mik-laj changed the title [AIRFLOW-5445] Reduce the required resources for the Kubernetes's sideecar [AIRFLOW-5445] Reduce the required resources for the Kubernetes's sidecar Sep 10, 2019
@mik-laj
Member Author

mik-laj commented Sep 10, 2019

@potiuk What do you think about turning this mechanism off completely?

@potiuk
Member

potiuk commented Sep 15, 2019

@mik-laj I assume you refer to:

resources: {}

I could not find out how it works in this case. Will we have requests/limits set to the default values? I believe so; from the documentation, it looks that way. And this is something you tried to prevent.

But maybe you find some proof otherwise?
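(For reference, one way to check this on a live cluster: the LimitRange admission controller mutates the pod spec at creation time, so reading back the stored pod object shows whether defaults were injected. A minimal sketch with the official kubernetes Python client; the pod and namespace names are placeholders.)

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Read the pod as stored by the API server and print the effective
# resources of each container ("my-pod"/"default" are placeholders).
pod = v1.read_namespaced_pod(name="my-pod", namespace="default")
for container in pod.spec.containers:
    print(container.name, container.resources)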

@mik-laj
Member Author

mik-laj commented Sep 15, 2019

I checked carefully and resources: {} does not work properly.

@mik-laj mik-laj merged commit 7b5cf44 into apache:master Sep 15, 2019
ashb pushed a commit to ashb/airflow that referenced this pull request Oct 11, 2019
adityav pushed a commit to adityav/airflow that referenced this pull request Oct 14, 2019
@jkgeyti

jkgeyti commented Dec 3, 2019

This default breaks Airflow on Kubernetes clusters with a limit on the ratio between limits and requests (limiting the allowed amount of oversubscription):

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"<PODNAME>\" is forbidden: cpu max limit to request ratio per Container is <N>, but provided ratio is <M>.","reason":"Forbidden","details":{"name":"<PODNAME>","kind":"pods"},"code":403}
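(For context, this kind of rejection comes from a LimitRange that sets maxLimitRequestRatio; a minimal sketch of such a policy with the official kubernetes Python client follows, with an illustrative ratio of 4.)

from kubernetes import client as k8s

# A namespace policy that rejects any container whose limit is more than
# 4x its request (the ratio value here is illustrative).
ratio_limit = k8s.V1LimitRange(
    metadata=k8s.V1ObjectMeta(name="cpu-ratio"),
    spec=k8s.V1LimitRangeSpec(
        limits=[
            k8s.V1LimitRangeItem(
                type="Container",
                max_limit_request_ratio={"cpu": "4"},
            )
        ]
    ),
)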

@dimberman
Contributor

@potiuk @mik-laj please check @jkgeyti's comment. Do we need to revert this PR?

@mik-laj
Member Author

mik-laj commented Dec 13, 2019

This is a valid value, but the user has configured their cluster in a way that prevents it from working with Airflow. Users can configure many policies and restrictions in Kubernetes, but that is the configuration of their cluster and we have no influence over it. At some point, a user may appear who has the completely opposite configuration and reports an issue that the Pod is not properly configured, because each container should have these fields filled in.

@jkgeyti

jkgeyti commented Dec 13, 2019

That's a valid point, but there are a couple of points I'd raise:

  • I haven't checked in detail, but I believe the merged change is incompatible with default settings on AWS EKS.
  • This is a regression compared to how Airflow has previously worked. This MR breaks support on some clusters, whereas it would work on all clusters before.
  • Limiting over-subscription isn't too esoteric, and it's a shame that configuring it prevents default installations of Airflow from working, since the proposed change isn't configurable.
  • The requests value was presumably chosen because the sidecar is known to work correctly and efficiently with that amount of resources, so it should be reasonable to expect the limits to be set to the same as (or not much more than) the requests value.

@mik-laj
Member Author

mik-laj commented Dec 13, 2019

"the proposed change isn't configurable."
You can change the value of this field using the Pod Mutation Hook (see the sketch below).

"This is a regression compared to how Airflow has previously worked."
This is an improvement compared to how Airflow previously worked, because it allows Airflow to run on clusters that have a different LimitRange configuration. It does not prevent anything, because you can change this value.
https://airflow.readthedocs.io/en/latest/kubernetes.html#pod-mutation-hook
This is also an improvement because it allows you to manage resources more efficiently and run more tasks than with the previous configuration.
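(A minimal sketch of such a hook in airflow_local_settings.py, assuming the hook receives a kubernetes.client V1Pod and that the sidecar container is named "airflow-xcom-sidecar"; older Airflow versions pass Airflow's own Pod wrapper instead, and the name and values here are illustrative.)

from kubernetes.client import models as k8s

def pod_mutation_hook(pod):
    # 'pod' is assumed to be a V1Pod; adjust only the sidecar's resources.
    for container in pod.spec.containers:
        if container.name == "airflow-xcom-sidecar":  # assumed sidecar name
            container.resources = k8s.V1ResourceRequirements(
                requests={"cpu": "10m"},               # illustrative values
                limits={"cpu": "20m"},
            )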

We can set an upper limit as an additional improvement, which may allow Airflow to run on more clusters, but we will still find people who have other restrictions configured on their cluster. There are many options that users can configure.
https://github.com/kubernetes/kubernetes/blob/master/plugin/pkg/admission/limitranger/admission.go

It is also worth adding the default LimitRanger configuration to the documentation, but the documentation is a WIP. How did you install Airflow? It may be worth adding a working configuration to this repository.

@jkgeyti

jkgeyti commented Dec 13, 2019

Clusters can be misconfigured, sure. But I feel there's a difference here, where an option is explicitly decided upon by Airflow. If the user didn't specify the requests and limits and Airflow wouldn't start up, then it's the user's "fault". In this case, Airflow isn't working because Airflow overrides a default value.

That said, I feel overriding the limits to something sensible (like 2x the requests) is a good course of action. It makes for saner defaults: the sidecar is granted limited resources, it will never request an unreasonable amount of memory, and most importantly, you won't have to revert the PR.
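(A minimal sketch of what such defaults could look like, again with the kubernetes Python client; the concrete numbers are illustrative, with the limit set to 2x the request.)

from kubernetes import client as k8s

# Illustrative sidecar defaults: small requests, limits capped at 2x.
sidecar_resources = k8s.V1ResourceRequirements(
    requests={"cpu": "50m", "memory": "64Mi"},
    limits={"cpu": "100m", "memory": "128Mi"},
)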

We run Airflow (webserver etc) on k8s in EKS, and use both the k8s executor and operator to spawn workers on the same cluster. The setup is just a bog-standard collection of k8s deployments, services, ingresses, and service accounts.

@mik-laj
Member Author

mik-laj commented Dec 13, 2019

We already have a consensus: adding an upper limit. Can you prepare a PR with this change? I will gladly review it.

dimberman pushed a commit that referenced this pull request Jun 4, 2020
kaxil pushed a commit that referenced this pull request Jul 1, 2020
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021