Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AIRFLOW-6706] Lazy load operator extra links #7327

Conversation

mik-laj
Copy link
Member

@mik-laj mik-laj commented Feb 1, 2020

When we import the airflow package, many modules are loaded, so I looked at what modules are exactly loaded. I found a lot of classes that should not be loaded and delay the start of the application very much. I suggest that some classes be loaded lazily when needed.

Performance benchmark:

seq 1 10 | xargs -n 1 -I {} time python -c "import airflow; import sys; print(len(sys.modules));"

Before:

1521
        3.45 real         1.83 user         0.59 sys
1521
        1.86 real         1.60 user         0.40 sys
1521
        1.87 real         1.61 user         0.40 sys
1521
        1.83 real         1.60 user         0.39 sys
1521
        1.84 real         1.62 user         0.40 sys
1521
        2.00 real         1.71 user         0.42 sys
1521
        1.84 real         1.60 user         0.41 sys
1521
        1.84 real         1.60 user         0.40 sys
1521
        1.90 real         1.64 user         0.41 sys
1521
        1.87 real         1.61 user         0.42 sys

After

941
        1.79 real         0.95 user         0.27 sys
941
        1.23 real         0.90 user         0.20 sys
941
        1.26 real         0.90 user         0.21 sys
941
        1.25 real         0.89 user         0.20 sys
941
        1.25 real         0.88 user         0.20 sys
941
        1.30 real         0.90 user         0.21 sys
941
        1.23 real         0.88 user         0.20 sys
941
        1.19 real         0.87 user         0.19 sys
941
        1.23 real         0.87 user         0.21 sys
941
        1.20 real         0.87 user         0.20 sys

Result:
Screenshot 2020-02-01 at 18 42 24

and 580 fewer modules - 61%

If anyone is interested, I attach an exact log that shows the import process.
https://gist.github.com/mik-laj/002f5a714c221ba04bc638970094519c

CC: @evgenyshulman


Issue link: AIRFLOW-6706

Make sure to mark the boxes below before creating PR: [x]

  • Description above provides context of the change
  • Commit message/PR title starts with [AIRFLOW-NNNN]. AIRFLOW-NNNN = JIRA ID*
  • Unit tests coverage for changes (not needed for documentation changes)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

* For document-only changes commit message can start with [AIRFLOW-XXXX].


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@potiuk
Copy link
Member

potiuk commented Feb 1, 2020

One comment - i think we should not load plugins_manager at all at init.py. Why do we do it? Could we remove it from init.py altogether and import it when it is really needed?

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "integrate_plugins()" method should only be called in case
a) webserver is run
b) scheduler is run
c) maybe when tests are run (?)

We can explicitly call that method in those places rather than in arirfow.init.py

This way we could avoid this lazy-loading for specific imports and have it also solved for all future cases

@codecov-io
Copy link

codecov-io commented Feb 1, 2020

Codecov Report

Merging #7327 into master will decrease coverage by 0.16%.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #7327      +/-   ##
=========================================
- Coverage   85.87%   85.7%   -0.17%     
=========================================
  Files         862     863       +1     
  Lines       40484   40510      +26     
=========================================
- Hits        34767   34721      -46     
- Misses       5717    5789      +72
Impacted Files Coverage Δ
airflow/plugins_manager.py 89.11% <100%> (+1.53%) ⬆️
airflow/serialization/serialized_objects.py 90.16% <88.88%> (+0.13%) ⬆️
airflow/providers/postgres/operators/postgres.py 0% <0%> (-100%) ⬇️
...irflow/example_dags/example_kubernetes_executor.py 15% <0%> (-65.96%) ⬇️
...example_dags/example_kubernetes_executor_config.py 16.66% <0%> (-51.76%) ⬇️
...roviders/google/cloud/operators/postgres_to_gcs.py 52.94% <0%> (-32.36%) ⬇️
airflow/providers/postgres/hooks/postgres.py 77.46% <0%> (-16.91%) ⬇️
airflow/providers/papermill/operators/papermill.py 96% <0%> (-4%) ⬇️
airflow/hooks/dbapi_hook.py 90.08% <0%> (-1.66%) ⬇️
airflow/jobs/scheduler_job.py 88.9% <0%> (-0.44%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 373c6aa...7511cd8. Read the comment docs.

@mik-laj
Copy link
Member Author

mik-laj commented Feb 1, 2020

I think this is a completely different problem. I also think it is worth solving it, but it will require more time and testing. If we call the integrate plugins method elsewhere, it will still load GCP and Qubole classes, when it is not needed. In this PR I want to focus only on one small problem and not change the way whole plugins mechanism work.

My change is not even related to the plugin loading mechanism at all. This only applies to operators that are in the core. It is a mistake that this code is in the same module.

BTW. I think that it will also be much easier to do when we drop plugin support for operators, sensors and hook.

@mik-laj mik-laj requested a review from potiuk February 1, 2020 20:17
@mik-laj
Copy link
Member Author

mik-laj commented Feb 1, 2020

This way we could avoid this lazy-loading for specific imports and have it also solved for all future cases

We want to load lazily because some users do not want to use certain operators, so their code should be loaded when it is needed. Now all classes are loaded for each user. When we will move this method call, it will still load all classes for all users, not according to the user's needs.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than name of the variable, looks good

@kaxil kaxil merged commit b180e4b into apache:master Feb 2, 2020
@kaxil
Copy link
Member

kaxil commented Feb 2, 2020

Good work @mik-laj

galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020
turbaszek pushed a commit to PolideaInternal/airflow that referenced this pull request Aug 13, 2020
kaxil pushed a commit that referenced this pull request Aug 13, 2020
Co-authored-by: Kamil Breguła <[email protected]>
Backported from #7327
cherry-picked from b180e4b
kaxil pushed a commit that referenced this pull request Aug 15, 2020
Co-authored-by: Kamil Breguła <[email protected]>
Backported from #7327
cherry-picked from b180e4b
cfei18 pushed a commit to cfei18/incubator-airflow that referenced this pull request Mar 5, 2021
Co-authored-by: Kamil Breguła <[email protected]>
Backported from apache#7327
cherry-picked from b180e4b
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jun 4, 2021
Co-authored-by: Kamil Breguła <[email protected]>
Backported from apache/airflow#7327
cherry-picked from b180e4b
GitOrigin-RevId: ea32d0d83ce915798ba9779dbf7c1df9faf7c241
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jun 4, 2021
Co-authored-by: Kamil Breguła <[email protected]>
Backported from apache/airflow#7327
cherry-picked from b180e4b
GitOrigin-RevId: ea32d0d83ce915798ba9779dbf7c1df9faf7c241
leahecole pushed a commit to GoogleCloudPlatform/composer-airflow that referenced this pull request Jun 9, 2021
Co-authored-by: Kamil Breguła <[email protected]>
Backported from apache/airflow#7327
cherry-picked from b180e4b
GitOrigin-RevId: ea32d0d83ce915798ba9779dbf7c1df9faf7c241
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants