get_tree_view can consume extreme amounts of memory. #41505

Closed
1 of 2 tasks
mobuchowski opened this issue Aug 15, 2024 · 3 comments
Labels
affected_version:2.10 Issues Reported for 2.10 area:core area:UI Related to UI/UX. For Frontend Developers. kind:bug This is a clearly a bug

Comments

@mobuchowski
Contributor

Apache Airflow version

2.10.0rc1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

In degenerate cases, get_tree_view can consume a very large amount of memory.

For example, given this DAG:

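    from airflow.models.baseoperator import chain
    from airflow.models.dag import DAG

    # LongEmptyOperator is not defined in this report; it is presumably an
    # operator declared alongside the linked test.

    with DAG("aaa_big_get_tree_view", schedule=None) as dag: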
        first_set = [LongEmptyOperator(task_id=f"hello_{i}_{'a' * 230}") for i in range(900)]
        chain(*first_set)

        last_task_in_first_set = first_set[-1]

        chain(
            last_task_in_first_set, [LongEmptyOperator(task_id=f"world_{i}_{'a' * 230}") for i in range(900)]
        )

        chain(
            last_task_in_first_set, [LongEmptyOperator(task_id=f"this_{i}_{'a' * 230}") for i in range(900)]
        )

        chain(last_task_in_first_set, [LongEmptyOperator(task_id=f"is_{i}_{'a' * 230}") for i in range(900)])

        chain(
            last_task_in_first_set, [LongEmptyOperator(task_id=f"silly_{i}_{'a' * 230}") for i in range(900)]
        )

        chain(
            last_task_in_first_set, [LongEmptyOperator(task_id=f"stuff_{i}_{'a' * 230}") for i in range(900)]
        )

serializing it allocates 2.7 GiB in get_tree_view alone, and 5.4 GiB in total at the high watermark:

root@a24bae3584cb:/opt/airflow# pytest --memray tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag
=========================================================================================================================================================================== test session starts ============================================================================================================================================================================
platform linux -- Python 3.12.5, pytest-8.3.2, pluggy-1.5.0 -- /usr/local/bin/python
cachedir: .pytest_cache
rootdir: /opt/airflow
configfile: pyproject.toml
plugins: memray-1.7.0, timeouts-1.2.1, icdiff-0.9, mock-3.14.0, rerunfailures-14.0, requests-mock-1.12.1, xdist-3.6.1, asyncio-0.23.8, anyio-4.4.0, instafail-0.5.0, cov-5.0.0, time-machine-2.15.0, custom-exit-code-0.3.0
asyncio: mode=Mode.STRICT
setup timeout: 0.0s, execution timeout: 0.0s, teardown timeout: 0.0s
collected 1 item

tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag PASSED                                                                                                                                                                                                                                                                                  [100%]


============================================================================================================================================================================== MEMRAY REPORT ===============================================================================================================================================================================
Allocation results for tests/providers/openlineage/utils/test_utils.py::test_get_dag_tree_large_dag at the high watermark

	 📦 Total memory allocated: 5.4GiB
	 📏 Total allocations: 23
	 📊 Histogram of allocation sizes: |▁▁█  |
	 🥇 Biggest allocating functions:
		- _safe_get_dag_tree_view:/opt/airflow/airflow/providers/openlineage/utils/utils.py:446 -> 2.7GiB
		- get_tree_view:/opt/airflow/airflow/models/dag.py:2445 -> 2.7GiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB
		- __setattr__:/opt/airflow/airflow/models/baseoperator.py:1191 -> 1.3MiB


=================================================================================================================================================================== Warning summary. Total: 3, Unique: 3 ===================================================================================================================================================================
airflow: total 1, unique 1
  collect: total 1, unique 1
other: total 2, unique 2
  collect: total 2, unique 2
Warnings saved into /opt/airflow/tests/warnings.txt file.
============================================================================================================================================================================ 1 passed in 8.60s =============================================================================================================================================================================

#41494

What you think should happen instead?

I think the tree_view format should be changed to one that does not require an extraordinary amount of whitespace in deeply nested cases.

It would be good to know where it is actually being used, though.
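
As a rough illustration of what a flatter format could look like, here is a minimal sketch. The function `flat_view` and the dict-based input (task id mapped to downstream task ids) are hypothetical and not part of Airflow. Each task appears exactly once with its direct downstream task ids, so the output needs no indentation and its size grows with the number of tasks plus edges.

```python
def flat_view(downstream):
    """Return one line per task in the form 'task_id -> child_a, child_b'.

    `downstream` is a hypothetical plain dict, not an Airflow DAG object.
    """
    lines = []
    for task_id in sorted(downstream):
        children = ", ".join(sorted(downstream[task_id]))
        # Leaf tasks are printed on their own, without an arrow.
        lines.append(f"{task_id} -> {children}" if children else task_id)
    return "\n".join(lines)


# A four-task chain: nesting depth does not affect line length here.
CHAIN = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
flat = flat_view(CHAIN)
print(flat)
```

The trade-off is that the hierarchy is no longer visible at a glance; a reader has to follow task ids from line to line.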

How to reproduce

Use the DAG above.

Operating System

Docker/breeze on macOS

Versions of Apache Airflow Providers

No response

Deployment

Other

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

@mobuchowski mobuchowski added kind:bug This is a clearly a bug area:core needs-triage label for new issues that we didn't triage yet labels Aug 15, 2024
@dosubot dosubot bot added the area:UI Related to UI/UX. For Frontend Developers. label Aug 15, 2024
@jedcunningham
Member

This is a better reproduction DAG:

from airflow.models.dag import DAG
from airflow.operators.empty import EmptyOperator

with DAG("aaa_repro", schedule=None):
    start = EmptyOperator(task_id="start")

    a = [
        start
        >> EmptyOperator(task_id=f"a_1_{i}")
        >> EmptyOperator(task_id=f"a_2_{i}")
        >> EmptyOperator(task_id=f"a_3_{i}")
        for i in range(200)
    ]

    middle = EmptyOperator(task_id="middle")

    b = [
        middle
        >> EmptyOperator(task_id=f"b_1_{i}")
        >> EmptyOperator(task_id=f"b_2_{i}")
        >> EmptyOperator(task_id=f"b_3_{i}")
        for i in range(200)
    ]

    middle2 = EmptyOperator(task_id="middle2")

    c = [
        middle2
        >> EmptyOperator(task_id=f"c_1_{i}")
        >> EmptyOperator(task_id=f"c_2_{i}")
        >> EmptyOperator(task_id=f"c_3_{i}")
        for i in range(200)
    ]

    end = EmptyOperator(task_id="end")

    start >> a >> middle >> b >> middle2 >> c >> end

It uses 5+GB and takes just under 8 minutes to generate on my machine.

And I believe it's less about the whitespace and more that we duplicate tasks in the output. For example, this DAG:

with DAG("aaa_runaway", schedule=None):
    start = EmptyOperator(task_id="start")
    x = [EmptyOperator(task_id=f"x_{i}") for i in range(3)]
    middle = EmptyOperator(task_id="middle")
    y = [EmptyOperator(task_id=f"y_{i}") for i in range(3)]
    end = EmptyOperator(task_id="end")

    start >> x >> middle >> y >> end

which results in this output:

<Task(EmptyOperator): start>
    <Task(EmptyOperator): x_0>
        <Task(EmptyOperator): middle>
            <Task(EmptyOperator): y_0>
                <Task(EmptyOperator): end>
            <Task(EmptyOperator): y_1>
                <Task(EmptyOperator): end>
            <Task(EmptyOperator): y_2>
                <Task(EmptyOperator): end>
    <Task(EmptyOperator): x_1>
        <Task(EmptyOperator): middle>
            <Task(EmptyOperator): y_0>
                <Task(EmptyOperator): end>
            <Task(EmptyOperator): y_1>
                <Task(EmptyOperator): end>
            <Task(EmptyOperator): y_2>
                <Task(EmptyOperator): end>
    <Task(EmptyOperator): x_2>
        <Task(EmptyOperator): middle>
            <Task(EmptyOperator): y_0>
                <Task(EmptyOperator): end>
            <Task(EmptyOperator): y_1>
                <Task(EmptyOperator): end>
            <Task(EmptyOperator): y_2>
                <Task(EmptyOperator): end>

Note how we get 9 ends. If you have a sufficiently complex DAG, this format becomes really problematic.
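
The duplication can be reproduced without Airflow. The sketch below is an illustration only: it models the DAG as a plain dict of task id to downstream task ids, and `naive_tree_lines`, `count_tree_lines`, and `fan_dag` are hypothetical helpers, not Airflow functions. It assumes the tree view prints one line per path from the root with 4 spaces of indentation per level, as in the output above.

```python
from functools import lru_cache

# Hypothetical stand-in for the aaa_runaway DAG: task_id -> downstream task_ids.
DOWNSTREAM = {
    "start": ["x_0", "x_1", "x_2"],
    "x_0": ["middle"], "x_1": ["middle"], "x_2": ["middle"],
    "middle": ["y_0", "y_1", "y_2"],
    "y_0": ["end"], "y_1": ["end"], "y_2": ["end"],
    "end": [],
}


def naive_tree_lines(downstream, task_id, level=0):
    """Yield one indented line per path from the root, repeating shared subtrees."""
    yield " " * (4 * level) + task_id
    for child in sorted(downstream[task_id]):
        yield from naive_tree_lines(downstream, child, level + 1)


def count_tree_lines(downstream, root):
    """Count the lines the naive view would emit, without building them."""
    @lru_cache(maxsize=None)
    def lines(task_id):
        return 1 + sum(lines(child) for child in downstream[task_id])
    return lines(root)


def fan_dag(width):
    """Build start >> x[width] >> middle >> y[width] >> end as a dict."""
    xs = [f"x_{i}" for i in range(width)]
    ys = [f"y_{i}" for i in range(width)]
    dag = {"start": xs, "middle": ys, "end": []}
    dag.update({x: ["middle"] for x in xs})
    dag.update({y: ["end"] for y in ys})
    return dag


rendered = list(naive_tree_lines(DOWNSTREAM, "start"))
end_count = sum(1 for line in rendered if line.strip() == "end")
small_total = count_tree_lines(DOWNSTREAM, "start")
wide_total = count_tree_lines(fan_dag(200), "start")
print(len(rendered), end_count, small_total, wide_total)
```

For this shape with fan-out w, the line count is 1 + w * (2 + 2w): 11 tasks become 25 lines at w = 3, and 403 tasks become 80,401 lines at w = 200. Every additional fan-out stage multiplies the count again.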

@jedcunningham
Member

It looks like get_tree_view is only used in a test and is relatively new, from #37162. We should probably deprecate it in Airflow 2 so we can remove it in Airflow 3.

@jedcunningham
Member

I've marked the whole "tree" concept - the --tree flag and the helper functions - as deprecated in 2.10 and removed them in 3.
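
For readers unfamiliar with the deprecate-then-remove pattern, here is a generic sketch using only the Python standard library. It is not the actual Airflow change; `get_tree_view_sketch` and its message are hypothetical.

```python
import warnings


def get_tree_view_sketch() -> str:
    """Hypothetical helper that still works but warns callers before removal."""
    warnings.warn(
        "get_tree_view_sketch is deprecated and will be removed in the next major version.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this function
    )
    return "<tree view>"


# Capture the warning explicitly so the example behaves the same under any filter settings.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = get_tree_view_sketch()

print(result, [str(w.message) for w in caught])
```

The helper keeps returning its value throughout the deprecation period, so existing callers continue to work until the removal release.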


3 participants