Speed up clear_task_instances by doing a single sql delete for TaskReschedule #14048
Conversation
Looks good to me. @kaxil - can you take a look at this one as well? It looks like a nice thing to cherry-pick to 2.0.1: it is very localized and improves a case that is rather useful.
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.
Force-pushed from 60b7f56 to f5088b1
Thanks @potiuk and @mik-laj for the review. I updated the PR according to your comments. Also, I noticed that doing …
Force-pushed from f5088b1 to d01db8c
Minor suggestions added, LGTM otherwise.
This reverts commit c6d4f301e622858053e52bac27e799841f1c68eb.
clear_task_instances took 2.240063190460205
Force-pushed from 99a77f3 to 6b67c43
Followed @ashb's suggestion and sped it up even further. Please take another look.
Hi @kaxil, this is the experiment I did (imports gathered at the top; `DEFAULT_DATE` is defined the way Airflow's test suite does):

```python
import copy
import datetime
import time

import pendulum
import pytest

from airflow.models import DAG
from airflow.models.taskinstance import TaskInstance as TI, clear_task_instances
from airflow.models.taskreschedule import TaskReschedule
from airflow.sensors.python import PythonSensor
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State
from airflow.utils.types import DagRunType

DEFAULT_DATE = timezone.datetime(2016, 1, 1)


def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]


@pytest.fixture(scope="module")
def big_dag():
    # A DAG with 1000 sensors in reschedule mode, so clearing it touches
    # a large number of TaskReschedule rows.
    with DAG(
        'test_create_large_dag_with_task_reschedule',
        start_date=DEFAULT_DATE,
        end_date=DEFAULT_DATE + datetime.timedelta(days=30),
    ) as dag:
        for i in range(1000):
            PythonSensor(task_id=f'task_{i}', python_callable=lambda: False, mode="reschedule")
    yield dag


def test_create_large_dag_with_task_reschedule(big_dag):
    tis = []
    for i in range(10):
        execution_date = DEFAULT_DATE + datetime.timedelta(days=i)
        big_dag.create_dagrun(
            execution_date=execution_date,
            state=State.RUNNING,
            run_type=DagRunType.SCHEDULED,
        )
        for task in big_dag.tasks:
            ti = TI(task=copy.copy(task), execution_date=execution_date)
            tis.append(ti)

    # Five TaskReschedule rows per task instance: 1000 tasks * 10 runs * 5.
    tss = []
    for ti in tis:
        for i in range(5):
            tss.append(TaskReschedule(
                task=ti.task,
                execution_date=ti.execution_date,
                try_number=1,
                start_date=pendulum.now(),
                end_date=pendulum.now(),
                reschedule_date=pendulum.now()))

    # Bulk-insert the TaskReschedule rows in chunks to keep statement sizes manageable.
    with create_session() as session:
        for chunk in chunks([{"task_id": tr.task_id, "dag_id": tr.dag_id, "execution_date": tr.execution_date,
                              "try_number": tr.try_number, "start_date": tr.start_date, "end_date": tr.end_date,
                              "duration": tr.duration, "reschedule_date": tr.reschedule_date} for tr in tss], 10000):
            session.execute(TaskReschedule.__table__.insert(), chunk)
        session.commit()


def test_clear_large_dag_with_task_reschedule(big_dag):
    with create_session() as session:
        def count_task_reschedule():
            return (
                session.query(TaskReschedule)
                .filter(
                    TaskReschedule.dag_id == big_dag.dag_id,
                    TaskReschedule.try_number == 1,
                )
                .count()
            )

        assert count_task_reschedule() == 1000 * 10 * 5
        qry = session.query(TI).filter(TI.dag_id == big_dag.dag_id).all()
        start = time.time()
        clear_task_instances(qry, session, dag=big_dag)
        end = time.time()
        print(f"clear_task_instances took {end - start}")
        assert count_task_reschedule() == 0
        session.rollback()
```
…schedule (#14048)

Clearing a large number of tasks takes a long time. Most of the time is spent at this line in clear_task_instances (more than 95% of the time). This slowness sometimes causes the webserver to time out because the web_server_worker_timeout is hit.

```
# Clear all reschedules related to the ti to clear
session.query(TR).filter(
    TR.dag_id == ti.dag_id,
    TR.task_id == ti.task_id,
    TR.execution_date == ti.execution_date,
    TR.try_number == ti.try_number,
).delete()
```

This line was very slow because it deleted TaskReschedule rows one by one in a for loop. This PR changes that code to delete the TaskReschedule rows in a single SQL query with a set of OR conditions. It is effectively doing the same thing, but now it is much faster. Profiling showed roughly a 40- to 50-fold improvement over the first iteration, so overall the change should be about 300 times faster than the original per-row deletion.

(cherry picked from commit 9036ce2)
Clearing a large number of tasks takes a long time. Most of the time is spent at this line in `clear_task_instances` (more than 95% of the time). This slowness sometimes causes the webserver to time out because the `web_server_worker_timeout` is hit. This line is very slow because it deletes `TaskReschedule` rows one by one in a for loop. This PR simply changes that code to delete `TaskReschedule` rows in a single SQL query with a bunch of `OR` conditions. It is effectively doing the same thing, but now it is much faster. Simple profiling shows that it is at least seven times faster when deleting thousands of `TaskReschedule` rows.
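The shape of the optimization can be sketched with plain SQLAlchemy. This is a minimal, self-contained illustration, not Airflow's actual code: the `TaskReschedule` model below is a simplified stand-in with an assumed column set, and the in-memory SQLite engine replaces Airflow's session handling. The point is only the single `DELETE` whose `WHERE` clause ORs together the per-task-instance filters, instead of issuing one `DELETE` per task instance.

```python
from sqlalchemy import Column, Integer, String, and_, create_engine, or_
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class TaskReschedule(Base):
    # Hypothetical stand-in for Airflow's TaskReschedule model.
    __tablename__ = "task_reschedule"
    id = Column(Integer, primary_key=True)
    dag_id = Column(String)
    task_id = Column(String)
    try_number = Column(Integer)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Seed rows for three task instances, five reschedules each.
    for task_id in ("a", "b", "c"):
        for _ in range(5):
            session.add(TaskReschedule(dag_id="d", task_id=task_id, try_number=1))
    session.commit()

    # The task instances being cleared (dag_id, task_id, try_number).
    to_clear = [("d", "a", 1), ("d", "b", 1)]

    # One DELETE with OR-ed per-instance conditions instead of len(to_clear) DELETEs.
    conditions = [
        and_(
            TaskReschedule.dag_id == dag_id,
            TaskReschedule.task_id == task_id,
            TaskReschedule.try_number == try_number,
        )
        for dag_id, task_id, try_number in to_clear
    ]
    deleted = (
        session.query(TaskReschedule)
        .filter(or_(*conditions))
        .delete(synchronize_session=False)
    )
    session.commit()
    print(deleted)  # 10 rows removed in a single statement
```

The per-row loop and the single OR-ed query produce the same end state; the win comes from one statement and one round trip to the database instead of one per task instance.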