Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Back off between attempts to start the tests #15106

Merged
merged 5 commits into from
Apr 13, 2023

Conversation

zadjii-msft
Copy link
Member

@zadjii-msft zadjii-msft commented Apr 4, 2023

This is a test PR for kicking the CI.

Looking through this test, I seriously don't understand how this doesn't work. I mean, I don't really get how it does work, but at this point in the tests, we've actually established that both Nihilist.exe and openconsole are running. From my read, there's no reason these should be failing at this point.

We previously added a "retry 5 times" bit to this test, in #8534. That did work back then. So uh, just do that... again?

@zadjii-msft
Copy link
Member Author

zadjii-msft commented Apr 4, 2023

As a note: Run 2 succeeded, with a couple of tests succeeding on try 2&3. It ultimately failed because it couldn't publish the results, because there was already a successful run, whatever.

Run 1 Retry 2 Retries 3 Retries 4 Retries
1 0 0 0 0
2 1 1 0 0
3 1 0 0 0
4 2 1 0 0
5 0 0 0 0
6 2 0 0 0
7❌ 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0
? 0 0 0 0

@DHowett
Copy link
Member

DHowett commented Apr 4, 2023

I can't understand how "run 4" succeeded in "first try" 2 times? It's a single run, it only ran once?

@zadjii-msft
Copy link
Member Author

zadjii-msft commented Apr 4, 2023

Shh Dustin ignore all that. I'm basically just ctrl+f'ing the log for Succeeded on try #<N> and then logging how many times a test took a retry to get initialized. So far, no test has taken till retry 4. We didn't have the logging before though, so I can't easily tell before this change how deep in the retries tests would usually take.

@zadjii-msft
Copy link
Member Author

Hmm. After re-doing the tests 6 times, Run 7 actually straight up failed too. I always assumed that just adding a longer delay wouldn't really work. We need to figure out why it's not actually attaching.

that being said, it sure is weird that a retry does work sometimes... like, that's weird

@zadjii-msft
Copy link
Member Author

image

well that's ironic

@@ -238,10 +238,11 @@ MODULE_SETUP(ModuleSetup)
VERIFY_WIN32_BOOL_SUCCEEDED_RETURN(FreeConsole());

// Wait a moment for the driver to be ready after freeing to attach.
Sleep(100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed?

Sleep(1000);

// Sleep with a backoff, to give us longer to try next time.
Sleep(1000 * (1 + tries));
Copy link
Member

@lhecker lhecker Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, and really really don't quote me on this, but as far as I can tell, Sleep and all other "waitable" functions are all based on KiCheckWaitNext which uses QueryInterruptTimePrecise for calls like Sleep and QueryUnbiasedInterruptTime for APIs like WaitForSingleObject.

The difference between the Query*Time APIs that have unbiased in their name is:

The unbiased interrupt-time count does not include time the system spends in sleep or hibernation.

In other words, as far as I can tell, Sleep sleeps in "real time" and not in "run time", so if our CIs, which probably run on some very cheap spot instances, get suspended quite often, Sleep will effectively be skipped every time, without the OS and the apps on it really progressing at all.

I would thus try and fix it by calling this instead:

WaitForSingleObject(GetCurrentThread(), 1000);

Copy link
Member

@lhecker lhecker Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using exponential backoff with 5-10 retries would probably be a good idea anyways though. A good initial delay would probably be something like 10ms?, so that it doesn't throttle the tests too much on our fast developer machines. (Could you test on your own PC how many ms are enough so that it never hits retry when running locally?) Something like this:

// This will wait for up to 32s in total (from 10ms to 163840ms)
for (DWORD delay = 10; delay < 30000; delay *= 2)
{
    // ...
    if (succeeded)
    {
        return true;
    }

    WaitForSingleObject(GetCurrentThread(), delay);
}

VERIFY_FAILED(); // what's the right call again?
return false;

BTW should we really reopen the CRT handles inside the loop? Seems like we should only do that once before the loop...

@zadjii-msft
Copy link
Member Author

zadjii-msft commented Apr 12, 2023

Run 1 Retry 2 Retries 3 Retries >=4 Retries
1✅ 2 0 0 0
2✅ 2 1 0 1
3✅ 2 1 2 0
4✅ 1 3 2 1
5❔ 0 0 0 0
6❔ 0 0 0 0
7❔ 0 0 0 0
8❔ 0 0 0 0
9❔ 0 0 0 0
10❔ 0 0 0 0

@zadjii-msft
Copy link
Member Author

0a39269 had 9/10 runs pass. That's annoying, but maybe better than what we're getting now.

@@ -280,7 +281,7 @@ MODULE_SETUP(ModuleSetup)
}
};

VERIFY_IS_LESS_THAN(tries, 100, L"Make sure we set up the new console in time");
VERIFY_IS_LESS_THAN(delay, 30000u, L"Make sure we set up the new console in time");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to just return true above instead of breaking. That way you can blindly VERIFY_FAIL() here.

Copy link
Member

@DHowett DHowett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jeez. if it works.

@DHowett DHowett merged commit 72d0566 into main Apr 13, 2023
@DHowett DHowett deleted the dev/migrie/b/these-mother-uckers-ucking-with-my-tests branch April 13, 2023 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants