
[AIRFLOW-2516] Fix mysql deadlocks #6988

Merged

Conversation

@potiuk (Member) commented Jan 1, 2020

Deadlocks were occurring in MySQL when task_instance was modified
by two queries at the same time. One query used state as the selection
criterion and updated it in the same statement, while the second query
simply updated the state for the same table. The first query locked the
state index first and the primary index afterwards; the second locked
the primary index first and the state index afterwards, leading to
deadlocks.

This change splits the first query into two independent ones. The first
does a SELECT ... FOR UPDATE over all the task instances to act on
(which locks the primary index only), and the second updates all
affected task instances.

Note that the performance impact of this is negligible, because the
query runs only once per scheduler loop, and its second part (looping
through task instances) only happens when some DagRun states were
modified manually - it exists solely to correct wrong DagRun states,
which should happen very infrequently.
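
To make the locking pattern concrete, below is a minimal, hypothetical SQLAlchemy sketch of the "before" and "after" update strategies described above. It is not the actual scheduler code: the simplified TaskInstance model, the connection string, and the filter criteria are illustrative assumptions only.

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class TaskInstance(Base):
    """Simplified stand-in for Airflow's task_instance table."""

    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)      # primary index
    dag_id = Column(String(250))
    state = Column(String(20), index=True)      # secondary index on state


# Placeholder DSN -- point it at a real MySQL instance to try this out.
engine = create_engine("mysql+mysqldb://user:password@localhost/airflow")


def fail_task_instances_old(session: Session, dag_id: str) -> None:
    # Deadlock-prone pattern: a single UPDATE that both filters on `state`
    # and changes it. MySQL may lock the `state` index before the primary
    # index, while a concurrent plain UPDATE on the same rows takes the
    # locks in the opposite order.
    session.query(TaskInstance).filter(
        TaskInstance.dag_id == dag_id,
        TaskInstance.state.in_(["running", "queued"]),
    ).update({TaskInstance.state: "failed"}, synchronize_session=False)


def fail_task_instances_new(session: Session, dag_id: str) -> None:
    # Pattern used by the fix: SELECT ... FOR UPDATE first, so the rows
    # are locked up front, then update the already-locked rows in Python.
    tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.state.in_(["running", "queued"]),
        )
        .with_for_update()
        .all()
    )
    for ti in tis:
        ti.state = "failed"


if __name__ == "__main__":
    # Usage example: run the fixed variant inside one transaction.
    with Session(engine) as session, session.begin():
        fail_task_instances_new(session, "example_dag")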


Issue link: AIRFLOW-2516

  • Description above provides context of the change
  • Commit message starts with [AIRFLOW-NNNN], where AIRFLOW-NNNN = JIRA ID*
  • Unit tests coverage for changes (not needed for documentation changes)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

(*) For document-only changes, no JIRA issue is needed; the commit message starts with [AIRFLOW-XXXX].


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@codecov-io commented Jan 1, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@f4d3e5e).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master    #6988   +/-   ##
=========================================
  Coverage          ?   85.03%           
=========================================
  Files             ?      707           
  Lines             ?    39361           
  Branches          ?        0           
=========================================
  Hits              ?    33472           
  Misses            ?     5889           
  Partials          ?        0
Impacted Files Coverage Δ
airflow/jobs/scheduler_job.py 89.34% <100%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f4d3e5e...706b8cc. Read the comment docs.

@kaxil (Member) left a comment


Did we confirm with any MySQL users facing this issue that the change solves it for them?

@potiuk (Member, Author) commented Jan 1, 2020

Not yet @kaxil -> that's why it's still a Draft. But I provided the users with patched versions of jobs.py/scheduler_job.py for 1.9, 1.10.6 and 1.10.3 and asked them to test it. See https://issues.apache.org/jira/browse/AIRFLOW-2516?focusedCommentId=17006364&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17006364 and https://issues.apache.org/jira/browse/AIRFLOW-4498?focusedCommentId=17006370&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17006370

They tested it before for 1.9 and 1.10.6, so I expect they will come back after the New Year's holidays.

@potiuk (Member, Author) commented Jan 1, 2020

And I keep my fingers crossed that it's going to help 🤞

@kaxil (Member) commented Jan 1, 2020

🤞

@potiuk force-pushed the AIRFLOW-2516-fix-mysql-deadlocks branch from 8084255 to 706b8cc on January 13, 2020 16:12
boring-cyborg bot added the "area:Scheduler including HA (high availability) scheduler" label on Jan 13, 2020
@potiuk marked this pull request as ready for review on January 13, 2020 16:12
@potiuk requested review from ashb and mik-laj on January 13, 2020 16:12
@potiuk (Member, Author) commented Jan 13, 2020

@kaxil @ashb @mik-laj @nuclearpinguin -> It's been 10 days without the deadlock for our customer (https://issues.apache.org/jira/browse/AIRFLOW-2516), so it looks like the problem is solved. Please approve and I will merge it and cherry-pick to 1.10.8.

@potiuk merged commit 1a52182 into apache:master on Jan 13, 2020
potiuk added a commit that referenced this pull request Jan 21, 2020
kaxil pushed a commit that referenced this pull request Jan 22, 2020
kaxil pushed a commit that referenced this pull request Jan 23, 2020
potiuk added a commit that referenced this pull request Jan 26, 2020
kaxil pushed a commit that referenced this pull request Jan 26, 2020
galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020
Labels
area:Scheduler including HA (high availability) scheduler
Projects
None yet
4 participants