Attempt to fix the flaky TestTriggererJob test #34075
Conversation
As documented in apache#33323, we have frequent failures of the flaky triggerer job tests. The flaky failures show up as errors when we close all the sessions in the teardown of the test. It turns out the tests had a side effect: they did not wait for the TriggererJob thread to complete. They merely marked it to be stopped, but never waited for it to finish. This is a quite plausible explanation of the flaky failures: since those threads have a 1-second sleep, it is more than likely that a session was created and used by the thread while the teardown was attempting to close all the sessions. The side effect could also affect other tests run afterwards, because on a busy test machine it could propagate beyond the teardown, which would also explain why (very rarely) other job tests failed with similar errors. The fix is to join the runner after marking it to be stopped.

Fixes: apache#33323 (hopefully)
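For illustration, a minimal sketch of the pattern described above (the `Runner` class and its attributes are stand-ins for the real TriggererJob runner thread, not Airflow's actual API):

```python
import threading
import time


class Runner(threading.Thread):
    def __init__(self):
        super().__init__()
        self.stop = False  # flag polled by the loop; setting it alone does not wait

    def run(self):
        while not self.stop:
            time.sleep(1)  # the 1-second idle sleep mentioned above


runner = Runner()
runner.start()
# ... test body ...
runner.stop = True  # before the fix, teardown only did this ...
runner.join(30)     # ... the fix also waits (up to 30 s) for the thread to exit
```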
I'm not sure this would help, because the timeout only bounds how long we block the main thread; the thread itself might still exist afterwards. An example:

```python
from threading import Thread
from time import sleep


def infinity():
    while True:
        sleep(1)


if __name__ == "__main__":
    thr = Thread(target=infinity)
    thr.start()
    print("Alive?", thr.is_alive())
    thr.join(5)
    print("After 5 seconds? I'm still alive!", thr.is_alive())
    thr.join(30)
    print("After 30 more seconds? Nope, still alive!", thr.is_alive())
    print("Let's block the main thread, but this would not help either.")
    thr.join()
```

I assume we should do something on the Triggerer side. However, killing a thread is not straightforward; the last time I tried to do that kind of thing, with events, queues and so on, I lost my mind, and I'm not sure I have recovered since.
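Since a thread cannot be killed, the usual workaround is cooperative cancellation with `threading.Event` — a sketch of that idea (an assumption about a possible approach, not the Triggerer's actual code):

```python
import threading

stop_event = threading.Event()


def loop():
    # Event.wait(timeout) returns True as soon as the event is set,
    # and False after the timeout, so the thread exits promptly on stop
    # instead of always finishing a full sleep(1).
    while not stop_event.wait(timeout=1):
        pass  # periodic work would go here


thr = threading.Thread(target=loop)
thr.start()
stop_event.set()  # request shutdown cooperatively
thr.join(5)       # returns almost immediately now
assert not thr.is_alive()
```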
Some additional "simulation" with pytest and long-lived threads. If you run the entire module in pytest, all of the tests fail:

```python
import threading
from time import sleep


class TestMultiThreading:
    def test_thread_is_alive(self):
        def infinity():
            while True:
                sleep(1)

        thr = threading.Thread(target=infinity, name="foo-bar")
        try:
            thr.start()
        finally:
            thr.join(2)
        assert not thr.is_alive()

    def test_check_threads(self):
        threads_names = [thread.name for thread in threading.enumerate()]
        assert "foo-bar" not in threads_names, threads_names


class TestOtherTest:
    def test_check_threads(self):
        threads_names = [thread.name for thread in threading.enumerate()]
        assert "foo-bar" not in threads_names, threads_names
```

That means: if the side effect is that we block the main thread and as a result ignore all signals, then the solution with a timeout would work. But if the problem is that the thread still exists after the test completes and affects other tests, then the solution with a timeout would not work, and it would be better to spawn a separate process instead.
I think it will. The way it was before almost guaranteed that those races would happen, because we did not wait for the thread at all. Here we give the thread 30 seconds (way more than the 1-second idle) to complete. The point is to make the race far less probable while staying 'fast enough' even if the thread hangs for some reason or the machine is very busy. In most cases, even if the thread hangs for 30 s and we don't finish waiting, nothing bad will happen; we would have to be really unlucky for the thread to touch a session exactly while we are closing it. Previously it was just a question of how close to the end of the 1 s sleep the thread was: if the 'race window' was, say, 10 ms, we had a 1% chance of hitting it. With 30 seconds for the thread to complete, IMHO the probability is very low. If we joined with no timeout and the thread never completed, we would fail at 60 s anyway with the test timeout, and the test would fail in that case even though its side effects would be harmless. So I prefer to just 'let it go' after half that time and continue, hoping the side effects won't kick in. And yeah, a separate process is likely 'better', but I wanted a quick fix now, also to make it easy to cherry-pick to 2.7.
Oh, I didn't initially see that we do not wait at all. In that case it should dramatically reduce the side effect.
cc: @ephraimbuddy -> should be good to cherry-pick