
[AIRFLOW-2516] Fix mysql deadlocks #6988

Merged

Conversation

@potiuk (Member) commented Jan 1, 2020

Deadlocks were occurring in MySQL when task_instance was modified
by two queries at the same time. One query used state as the selection
criterion and updated it in the same statement, while the second query
simply updated the state for the same table. The first query locked the
state index first and the primary index afterwards; the second locked
the primary index first and the state index afterwards, leading to
deadlocks.

This change splits the first query into two independent ones. The first
does a SELECT ... FOR UPDATE over all the task instances to act on
(which locks the primary index only), and the second updates all
affected task instances.

Note that the performance impact of this is negligible, because the
query runs only once per scheduler loop, and its second part (looping
through task instances) only happens when some DagRun states were
modified manually - it exists solely to correct wrong DagRun states,
which should happen very infrequently.
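
To make the locking pattern concrete, below is a minimal, hypothetical SQLAlchemy sketch of the "before" and "after" update strategies described above. It is not the actual scheduler code: the simplified TaskInstance model, the connection string, and the filter criteria are illustrative assumptions only.

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class TaskInstance(Base):
    """Simplified stand-in for Airflow's task_instance table."""

    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)      # primary index
    dag_id = Column(String(250))
    state = Column(String(20), index=True)      # secondary index on state


# Placeholder DSN -- point it at a real MySQL instance to try this out.
engine = create_engine("mysql+mysqldb://user:password@localhost/airflow")


def fail_task_instances_old(session: Session, dag_id: str) -> None:
    # Deadlock-prone pattern: a single UPDATE that both filters on `state`
    # and changes it. MySQL may lock the `state` index before the primary
    # index, while a concurrent plain UPDATE on the same rows takes the
    # locks in the opposite order.
    session.query(TaskInstance).filter(
        TaskInstance.dag_id == dag_id,
        TaskInstance.state.in_(["running", "queued"]),
    ).update({TaskInstance.state: "failed"}, synchronize_session=False)


def fail_task_instances_new(session: Session, dag_id: str) -> None:
    # Pattern used by the fix: SELECT ... FOR UPDATE first, so the rows
    # are locked up front, then update the already-locked rows in Python.
    tis = (
        session.query(TaskInstance)
        .filter(
            TaskInstance.dag_id == dag_id,
            TaskInstance.state.in_(["running", "queued"]),
        )
        .with_for_update()
        .all()
    )
    for ti in tis:
        ti.state = "failed"


if __name__ == "__main__":
    # Usage example: run the fixed variant inside one transaction.
    with Session(engine) as session, session.begin():
        fail_task_instances_new(session, "example_dag")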


Issue link: AIRFLOW-2516

  • Description above provides context of the change
  • Commit message starts with [AIRFLOW-NNNN], where AIRFLOW-NNNN = JIRA ID*
  • Unit tests coverage for changes (not needed for documentation changes)
  • Commits follow "How to write a good git commit message"
  • Relevant documentation is updated including usage instructions.
  • I will engage committers as explained in Contribution Workflow Example.

(*) For document-only changes, no JIRA issue is needed; the commit message starts with [AIRFLOW-XXXX].


In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.
Read the Pull Request Guidelines for more information.

@codecov-io commented Jan 1, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@f4d3e5e).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master    #6988   +/-   ##
=========================================
  Coverage          ?   85.03%           
=========================================
  Files             ?      707           
  Lines             ?    39361           
  Branches          ?        0           
=========================================
  Hits              ?    33472           
  Misses            ?     5889           
  Partials          ?        0
Impacted Files Coverage Δ
airflow/jobs/scheduler_job.py 89.34% <100%> (ø)

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f4d3e5e...706b8cc. Read the comment docs.

@kaxil (Member) left a comment


Did we confirm with any MySQL users facing this issue that the change solves it for them?

@potiuk (Member, Author) commented Jan 1, 2020

Not yet @kaxil -> that's why it's still a Draft. But I provided the users with patched versions of jobs.py/scheduler_job.py for 1.9, 1.10.6 and 1.10.3 and asked them to test it. See https://issues.apache.org/jira/browse/AIRFLOW-2516?focusedCommentId=17006364&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17006364 and https://issues.apache.org/jira/browse/AIRFLOW-4498?focusedCommentId=17006370&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17006370

They tested it before for 1.9 and 1.10.6, so I expect they will come back after the New Year's holidays.

@potiuk (Member, Author) commented Jan 1, 2020

And I keep my fingers crossed that it's going to help 🤞

@kaxil (Member) commented Jan 1, 2020

🤞

@potiuk force-pushed the AIRFLOW-2516-fix-mysql-deadlocks branch from 8084255 to 706b8cc on January 13, 2020 16:12
boring-cyborg bot added the "area:Scheduler including HA (high availability) scheduler" label on Jan 13, 2020
@potiuk marked this pull request as ready for review on January 13, 2020 16:12
@potiuk requested review from ashb and mik-laj on January 13, 2020 16:12
@potiuk (Member, Author) commented Jan 13, 2020

@kaxil @ashb @mik-laj @nuclearpinguin -> It's been 10 days without the deadlock for our customer (https://issues.apache.org/jira/browse/AIRFLOW-2516), so it looks like the problem is solved. Please approve and I will merge it and cherry-pick to 1.10.8.

@potiuk merged commit 1a52182 into apache:master on Jan 13, 2020
potiuk added a commit that referenced this pull request Jan 21, 2020
kaxil pushed a commit that referenced this pull request Jan 22, 2020
kaxil pushed a commit that referenced this pull request Jan 23, 2020
potiuk added a commit that referenced this pull request Jan 26, 2020
kaxil pushed a commit that referenced this pull request Jan 26, 2020
galuszkak pushed a commit to FlyrInc/apache-airflow that referenced this pull request Mar 5, 2020
Labels
area:Scheduler including HA (high availability) scheduler
Projects
None yet
4 participants