-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Back off between attempts to start the tests #15106
Back off between attempts to start the tests #15106
Conversation
As a note: Run 2 succeeded, with a couple of tests succeeding on try 2&3. It ultimately failed because it couldn't publish the results, because there was already a successful run, whatever.
|
I can't understand how "run 4" succeeded in "first try" 2 times? It's a single run, it only ran once? |
Shh Dustin ignore all that. I'm basically just ctrl+f'ing the log for |
Hmm. After re-doing the tests 6 times, Run 7 actually straight up failed too. I always assumed that just adding a longer delay wouldn't really work. We need to figure out why it's not actually attaching. that being said, it sure is weird that a retry does work sometimes... like, that's weird |
src/host/ft_host/InitTests.cpp
Outdated
@@ -238,10 +238,11 @@ MODULE_SETUP(ModuleSetup) | |||
VERIFY_WIN32_BOOL_SUCCEEDED_RETURN(FreeConsole()); | |||
|
|||
// Wait a moment for the driver to be ready after freeing to attach. | |||
Sleep(100); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really needed?
src/host/ft_host/InitTests.cpp
Outdated
Sleep(1000); | ||
|
||
// Sleep with a backoff, to give us longer to try next time. | ||
Sleep(1000 * (1 + tries)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, and really really don't quote me on this, but as far as I can tell, Sleep
and all other "waitable" functions are all based on KiCheckWaitNext
which uses QueryInterruptTimePrecise
for calls like Sleep
and QueryUnbiasedInterruptTime
for APIs like WaitForSingleObject
.
The difference between the Query*Time
APIs that have unbiased in their name is:
The unbiased interrupt-time count does not include time the system spends in sleep or hibernation.
In other words, as far as I can tell, Sleep
sleeps in "real time" and not in "run time", so if our CIs, which probably run on some very cheap spot instances, get suspended quite often, Sleep
will effectively be skipped every time, without the OS and the apps on it really progressing at all.
I would thus try and fix it by calling this instead:
WaitForSingleObject(GetCurrentThread(), 1000);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using exponential backoff with 5-10 retries would probably be a good idea anyways though. A good initial delay would probably be something like 10ms?, so that it doesn't throttle the tests too much on our fast developer machines. (Could you test on your own PC how many ms are enough so that it never hits retry when running locally?) Something like this:
// This will wait for up to 32s in total (from 10ms to 163840ms)
for (DWORD delay = 10; delay < 30000; delay *= 2)
{
// ...
if (succeeded)
{
return true;
}
WaitForSingleObject(GetCurrentThread(), delay);
}
VERIFY_FAILED(); // what's the right call again?
return false;
BTW should we really reopen the CRT handles inside the loop? Seems like we should only do that once before the loop...
|
0a39269 had 9/10 runs pass. That's annoying, but maybe better than what we're getting now. |
@@ -280,7 +281,7 @@ MODULE_SETUP(ModuleSetup) | |||
} | |||
}; | |||
|
|||
VERIFY_IS_LESS_THAN(tries, 100, L"Make sure we set up the new console in time"); | |||
VERIFY_IS_LESS_THAN(delay, 30000u, L"Make sure we set up the new console in time"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option would be to just return true
above instead of break
ing. That way you can blindly VERIFY_FAIL()
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jeez. if it works.
This is a test PR for kicking the CI.
Looking through this test, I seriously don't understand how this doesn't work. I mean, I don't really get how it does work, but at this point in the tests, we've actually established that both
Nihilist.exe
and openconsole are running. From my read, there's no reason these should be failing at this point.We previously added a "retry 5 times" bit to this test, in #8534. That did work back then. So uh, just do that... again?