Attempt to fix the flaky TestTriggererJob test #34075

potiuk · 2023-09-04T11:29:09Z

As documented in #33323, we have frequent failures of the flaky triggerer job tests.

The flaky failures are about some errors when we close all the sessions in teardown of the test. It turns out that the tests had side-effect - they have not waited for the TriggererJob thread to complete, they merely marked them to be stopped, but they have not waited for those to complete - this is quite plausible explanation of the flaky test failures - since those threads have 1 second sleep, it's more than likely that the session has been created and used by the thread while the teardown has been attempting to close all the sessions.

This side effect could also have an effect for other tests that were run after - because in a busy test run machine, the side effects could propagate further than just to the teardown, so it could also explain why sometimes (very rarely) other job tests failed with similar errors.

The fix is to join the runner after marking it to be stopped.

Fixes: #33323 (Hopefully)

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

As documented in apache#33323, we have frequent failures of the flaky triggerer job tests. The flaky failures are about some errors when we close all the sessions in teardown of the test. It turns out that the tests had side-effect - they have not waited for the TriggererJob thread to complete, they merely marked them to be stopped, but they have not waited for those to complete - this is quite plausible explanation of the flaky test failures - since those threads have 1 second sleep, it's more than likely that the session has been created and used by the thread while the teardown has been attempting to close all the sessions. This side effect could also have an effect for other tests that were run after - because in a busy test run machine, the side effects could propagate further than just to the teardown, so it could also explain why sometimes (very rarely) other job tests failed with similar errors. The fix is to join the runner after marking it to be stopped. Fixes: apache#33323 (Hopefully)

Taragolis · 2023-09-04T11:52:56Z

I'm not sure this would help, because timeout it is only for how long block Main Thread, but thread might still exists.
In the other case if parameter timeout not set or None then it would block Main Thread until required thread completed.

Some example

from threading import Thread
from time import sleep


def infinity():
    while True:
        sleep(1)


if __name__ == "__main__":
    thr = Thread(target=infinity)
    thr.start()
    print("Alive? ", thr.is_alive())
    thr.join(5)
    print("5 second? I'm stile Alive! ", thr.is_alive())
    thr.join(30)
    print("30 second? Nope! I'm stile Alive! ", thr.is_alive())
    print("Let's block main thread, but this would not help either.")
    thr.join()

I assume that we should do something on Triggerer side, however killing thread it's not something straight forward, last time when I tried to do that stuff, with events. queues and etc I lost my mind, and I'm not sure that I recovered after that.

Taragolis · 2023-09-04T12:51:12Z

Some additional "simulation" within pytest and long living threads

If run entire module in Pytest, all tests will failed:

TestMultiThreading::test_thread_is_alive - Thread still alive
TestMultiThreading::test_check_threads - Thread with name "foo-bar" still exists
TestOtherTest::test_check_threads - Thread with name "foo-bar" still exists

import threading
from time import sleep


class TestMultiThreading:
    def test_thread_is_alive(self):

        def infinity():
            while True:
                sleep(1)

        thr = threading.Thread(target=infinity, name="foo-bar")
        try:
            thr.start()
        finally:
            thr.join(2)
            assert not thr.is_alive()

    def test_check_threads(self):
        threads_names = [thread.name for thread in threading.enumerate()]
        assert "foo-bar" not in threads_names, threads_names


class TestOtherTest:
    def test_check_threads(self):
        threads_names = [thread.name for thread in threading.enumerate()]
        assert "foo-bar" not in threads_names, threads_names

That mean if our side effect that we block main thread and as result we ignore all signals, then solution with timeout would work.

But if our problem that thread existed after test completed and this affect other test, then solution with timeout would not work. And better spawn separate process which would run TriggererJob and kill that process in the end of the test.

potiuk · 2023-09-04T13:19:51Z

I think it will help. The way it was before had almost guarantee that those races will happen because we have not waited for the thread at all.

Here we give 30 seconds (way more than the 1 sec. Idle) for the thread to complete. And this is all about to make it far far less probable and 'fast enough' even if the thread hangs for some reason or the machine is very, very busy.

In most cases even if the thread will hang for 30 s and we won't complete waiting, nothing bad will happen. We would have to be really unlucky to get the thread do something with session while we are closing it. Previously it was just the question of how close to the end of the 1s sleep the thread was so If the 'race window' was say 10ms - we woud have 1% chance to hit it. With 30 seconds time for the thread to complete IMHO the probability is wery low.

If we have infinite join and the thread does not complete, then we will fail at 60s anyway with test timeout, and the test would fail in this case - even if it would have side effects would be harmless, so I prefer to just 'let it go' after half the time and continue hoping that the side effects won't kick in.

potiuk · 2023-09-04T13:24:12Z

I think it will. The way it was before had almost guarantee that those races will happen because we have not waited for the thread at all.

Here we give 30 seconds (way more than the 1 sec. Idle) for the thread to complete. And this is all about to make it far far less probable and 'fast enough' even if the thread hangs for some reason or the machine is very, very busy.

In most cases even if the thread will hang for 30 s and we won't complete waiting, nothing bad will happen. We would have to be really unlucky to get the thread do something with session while we are closing it. Previously it was just the question of how close to the end of the 1s sleep the thread was so If the 'race window' was say 10ms - we woud have 1% chance to hit it. With 30 seconds time for the thread to complete IMHO the probability is wery low.

If we have infinite join and the thread does not complete, then we will fail at 60s anyway with test timeout, and the test would fail in this case - even if it would have side effects would be harmless, so I prefer to just 'let it go' after half the time and continue hoping that the side effects won't kick in.

And yeah likely having separate process is 'better' but I wanted to have a quick fix now - to Alo make it easy to cherry-pick to 2.7

Taragolis

Oh, I don't see initially that we do not wait at all. In this case it should dramatically reduce that side effect.

potiuk · 2023-09-04T14:41:02Z

cc: @ephraimbuddy -> should be good to cherry-pick

As documented in #33323, we have frequent failures of the flaky triggerer job tests. The flaky failures are about some errors when we close all the sessions in teardown of the test. It turns out that the tests had side-effect - they have not waited for the TriggererJob thread to complete, they merely marked them to be stopped, but they have not waited for those to complete - this is quite plausible explanation of the flaky test failures - since those threads have 1 second sleep, it's more than likely that the session has been created and used by the thread while the teardown has been attempting to close all the sessions. This side effect could also have an effect for other tests that were run after - because in a busy test run machine, the side effects could propagate further than just to the teardown, so it could also explain why sometimes (very rarely) other job tests failed with similar errors. The fix is to join the runner after marking it to be stopped. Fixes: #33323 (Hopefully) (cherry picked from commit 47f79b9)

boring-cyborg bot added area:Scheduler including HA (high availability) scheduler area:Triggerer labels Sep 4, 2023

potiuk requested review from uranusjr, Taragolis and hussein-awala September 4, 2023 11:29

potiuk added this to the Airflow 2.7.1 milestone Sep 4, 2023

ephraimbuddy approved these changes Sep 4, 2023

View reviewed changes

ephraimbuddy added the changelog:skip Changes that should be skipped from the changelog (CI, tests, etc..) label Sep 4, 2023

Taragolis approved these changes Sep 4, 2023

View reviewed changes

potiuk merged commit 47f79b9 into apache:main Sep 4, 2023
42 checks passed

potiuk deleted the attempt-to-fix-flaky-triggerer-tests branch September 4, 2023 14:40

potiuk mentioned this pull request Sep 4, 2023

Bring back SHUTDOWN-related constants #34063

Merged

ephraimbuddy mentioned this pull request Sep 4, 2023

Status of testing of Apache Airflow 2.7.1rc2 #34065

Closed

97 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attempt to fix the flaky TestTriggererJob test #34075

Attempt to fix the flaky TestTriggererJob test #34075

potiuk commented Sep 4, 2023

Taragolis commented Sep 4, 2023

Taragolis commented Sep 4, 2023

potiuk commented Sep 4, 2023 •

edited

Loading

potiuk commented Sep 4, 2023

Taragolis left a comment

potiuk commented Sep 4, 2023

Attempt to fix the flaky TestTriggererJob test #34075

Attempt to fix the flaky TestTriggererJob test #34075

Conversation

potiuk commented Sep 4, 2023

Taragolis commented Sep 4, 2023

Taragolis commented Sep 4, 2023

potiuk commented Sep 4, 2023 • edited Loading

potiuk commented Sep 4, 2023

Taragolis left a comment

Choose a reason for hiding this comment

potiuk commented Sep 4, 2023

potiuk commented Sep 4, 2023 •

edited

Loading