-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tests] Add support for running tests on helix #1917
Conversation
With this PR, tests will be run in two steps on CI: 1. On the build machine - which is the same as they are run currently. This will be run in a single step, from the source repo with built `artifacts`. 2. On helix - for this the tests are archived to a `.zip` file, and then sent to helix with any dependencies. Additionally, tests can be marked to always be run on helix or the build machine. Other tests can be run on either. - a new property `$(DefaultToRunningTestsOnHelix)` when true would run all the tests that are not build-machine-only to run on helix. - when the property is !true, all the tests that can run build-machine will run there. This is the default. This adds new properties that affect CI builds: - `$(IsTestSupportProject)` - which is for a project that should be run as a test project itself. - `$(IsHelixOnlyTestProject)` - such a test can only be run on helix. - `$(IsNonHelixOnlyTestProject)` - such a project must always run on the build machine, IOW, with the source repo. - Also, a default `.runsettings` file is used to allow easier setting of loggers, and other configuration for the tests.
All the tests will still continue to use dotnet-coverage. But since tests can run on more than one machine now, there is an additional CI step that merges all the coverage xml files.
Ready for review |
Helix step can be slow to get agents sometimes, so increased the timeout to 45mins. |
@@ -0,0 +1,94 @@ | |||
# Please remember to update the documentation if you make changes to these parameters! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Loved the docs :).
Did we get these from some other repo? Mainly asking to see if we need to try to keep in sync.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The template is from eng/common/templates/steps/send-to-helix.yml
, slightly modified here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm I see, is there a way to instead use that template file and pass in some parameters to accommodate for our small modifications? If possible, I'd love to reduce the amount of yaml that we have to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This eng/common/templates/steps/send-to-helix.yml
is a sample template from arcade, and it's not used directly. But I can make changes in that file itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what you mean when you say a "sample template". Tipically the templates under eng\common\templates are there so that repos can use them if they want/need as part of their build process. I don't want to block this PR on this, but I do want to make sure this is fixed. Can we at least log a tracking bug so that we go and add the necessary hooks in dotnet/arcade repo so that we can later ingest those changes here, and remove this file entirely so that we instead leverage the files from eng\common?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened dotnet/arcade#14470 .
tests/Directory.Build.props
Outdated
@@ -10,6 +10,16 @@ | |||
CS1712: Type parameter 'type_parameter' has no matching typeparam tag in the XML comment on 'type_or_member' (but other type parameters do) | |||
--> | |||
<NoWarn>$(NoWarn);1573;1591;1712</NoWarn> | |||
|
|||
<IsHelixOnlyTestProject Condition="'$(IsHelixOnlyTestProject)' == ''">false</IsHelixOnlyTestProject> | |||
<IsNonHelixOnlyTestProject Condition="'$(IsNonHelixOnlyTestProject)' == ''">false</IsNonHelixOnlyTestProject> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the above is a boolean, do we actually need this property? Can't we just infer based on the above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have 3 possible values - 1. helix only; 2. build only; 3. any . The two properties represent (1), and (2), and both being false means that (3) - no restrictions on where the tests can run.
We could use a string property with some fixed values like ['helix', 'non-helix', 'any']
, but then you have the value strings in multiple places where you want to check, and that could get misspelled and cause tests to get silently skipped.
I'm not attached to any specific way of doing it though.
The name isn't great either, for the second property - IsNonHelixOnlyTestProject
. An alternative would be IsBuildMachineOnlyTestProject
sounds less clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In dotnet/aspnetcore, we have C# attributes to control whether a test runs on Helix or not.
https://github.com/dotnet/aspnetcore/blob/515987ac92f801fb995d88a20282a61571523aab/src/Testing/src/xunit/SkipOnCIAttribute.cs
https://github.com/dotnet/aspnetcore/blob/515987ac92f801fb995d88a20282a61571523aab/src/Testing/src/xunit/SkipOnHelixAttribute.cs
https://github.com/dotnet/aspnetcore/blob/515987ac92f801fb995d88a20282a61571523aab/src/Testing/src/xunit/SkipNonHelix.cs
Would it be possible to do this that way? Then we wouldn't need to worry about misspelling, since it would be C# code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, yeah I think that as @eerhardt pointed out in a separate comment, I guess that the name is a bit confusing. I like his suggestion of dropping Only, or alternatively we can have something like OnlyRunOnHelix
and DoNotRunOnHelix
or something along those lines. Don't feel strongly about my suggestion haha, but I do think that the current version might be misleading.
tests/Aspire.Components.Common.Tests/Aspire.Components.Common.Tests.csproj
Show resolved
Hide resolved
I want to double check something about the test results, so please don't merge yet. |
All good. |
@@ -1,4 +1,7 @@ | |||
<Project> | |||
<PropertyGroup> | |||
<IsNonHelixOnlyTestProject>true</IsNonHelixOnlyTestProject> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name feels weird. It would seem that IsHelixOnlyTestProject=false
should be the same thing as this.
If we want to say "this test should never run on Helix", I would drop "Only" in the name. IsNonHelixTestProject=true
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking for having the only bit was to have a way to allow running tests on the build machine, or helix and be able to easily move those around in case we don't want to run on helix at all, or want to completely switch to helix. But we can skip that whole thing.
Without that behavior:
- we can have a single property -
RunTestsOnHelix
- abool
, and it would control switching between helix, and the build machine. And this can default tofalse
. - And I'll drop
defaultToRunningTestsOnHelix
how does that sound?
Do we want a mixed mode too - one for which the test project is run in both places, and attributes on individual tests determine which ones to skip? If we don't control this at the project level, then we have to send all the tests to helix and let the attributes help with the skips.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(My opinion) - I would prefer to do it as close as possible to how dotnet/aspnetcore does it. Most of the engineers in this repo work in dotnet/aspnetcore as well. So it would be great if the mechanics are the same, so they don't have to relearn things jumping between repos.
tests/Directory.Build.props
Outdated
@@ -10,6 +10,16 @@ | |||
CS1712: Type parameter 'type_parameter' has no matching typeparam tag in the XML comment on 'type_or_member' (but other type parameters do) | |||
--> | |||
<NoWarn>$(NoWarn);1573;1591;1712</NoWarn> | |||
|
|||
<IsHelixOnlyTestProject Condition="'$(IsHelixOnlyTestProject)' == ''">false</IsHelixOnlyTestProject> | |||
<IsNonHelixOnlyTestProject Condition="'$(IsNonHelixOnlyTestProject)' == ''">false</IsNonHelixOnlyTestProject> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In dotnet/aspnetcore, we have C# attributes to control whether a test runs on Helix or not.
https://github.com/dotnet/aspnetcore/blob/515987ac92f801fb995d88a20282a61571523aab/src/Testing/src/xunit/SkipOnCIAttribute.cs
https://github.com/dotnet/aspnetcore/blob/515987ac92f801fb995d88a20282a61571523aab/src/Testing/src/xunit/SkipOnHelixAttribute.cs
https://github.com/dotnet/aspnetcore/blob/515987ac92f801fb995d88a20282a61571523aab/src/Testing/src/xunit/SkipNonHelix.cs
Would it be possible to do this that way? Then we wouldn't need to worry about misspelling, since it would be C# code.
.. and use a single property named `$(RunTestsOnHelix)` that defaults to `false`.
@@ -5,6 +5,8 @@ parameters: | |||
HelixBuild: $(Build.BuildNumber) # required -- the build number Helix will use to identify this -- automatically set to the AzDO build number | |||
HelixTargetQueues: '' # required -- semicolon-delimited list of Helix queues to test on; see https://helix.dot.net/ for a list of queues | |||
HelixAccessToken: '' # required -- access token to make Helix API requests; should be provided by the appropriate variable group | |||
HelixProjectPath: '' # required -- path to the project file to build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in eng/common
, which is maintained in dotnet/arcade. Are we going to send a PR to arcade for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this get overwritten by arcade updates? If so, then I'll move this to a separate file as before, and just reference this original template in a comment. Even if not that maybe not touching this file would be good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joperezr I'm reverting to the earlier way of using a copy of that file. Lemme know if you have an objections to that.
@@ -98,7 +98,7 @@ stages: | |||
jobs: | |||
|
|||
- job: windows | |||
timeoutInMinutes: 30 | |||
timeoutInMinutes: 45 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😞
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, 30 mins was already pretty low and super close to the time it actually takes to build the repo in AzDO, so we were going to have to do this sooner rather than later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- ${{ if eq(parameters.isWindows, 'true') }}: | ||
- pwsh: | | ||
$_cov_filelist=$((Get-ChildItem -Path ${{ parameters.repoTestResultsPath }} -Recurse -Filter *.cobertura.xml -File | % { $_.FullName }) -join ' ') | ||
echo "${{ parameters.dotnetScript }} dotnet-coverage merge $_cov_filelist -f cobertura -o ${{ parameters.repoTestResultsPath }}/$(Agent.Os)_$(Agent.JobName).cobertura.xml" > merge-cov.cmd | ||
cmd /c merge-cov.cmd | ||
rm merge-cov.cmd | ||
displayName: Merge code coverage results | ||
|
||
- ${{ if ne(parameters.isWindows, 'true') }}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to split here? Why can't the same script run on both windows and non-windows?
Might need a comment explaining why we are doing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ran into my lack of pwsh skills :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the yml that generates the report, it handles merging already. So, this can be simplified.
<DotNetCliPackageType>sdk</DotNetCliPackageType> | ||
|
||
<GlobalJsonContent>$([System.IO.File]::ReadAllText('$(RepoRoot)global.json'))</GlobalJsonContent> | ||
<DotNetCliVersion>$([System.Text.RegularExpressions.Regex]::Match($(GlobalJsonContent), '(%3F<="dotnet": ").*(%3F=")'))</DotNetCliVersion> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who reads this value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is read by helix targets in Arcade.
<PropertyGroup> | ||
<Language>msbuild</Language> | ||
|
||
<_workItemTimeout>00:20:00</_workItemTimeout> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between this and the TestSessionTimeout
in the .runsettings
file? Why are they 2 different values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout defined in the yml is for the CI step, so the whole step that sends items to helix and waits on them. It would include the time taken while waiting for agents to be assigned, and the time take to run the tests on helix.
The timeout here is only for the time taken on helix, per helix item.
The timeout for the step is larger mainly to account for helix wait times, which can vary depending on how busy the queue is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the timeout in the .runsettings
file for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the timeout in the
.runsettings
file for?
IIUC, that is a timeout for individual tests.
Also, it should have been 5minutes but I had bumped it up when I was getting timeouts and forgot to revert that.
The second stage processes the *cobertura.xml files to generate a report. And for that it collects these xml files from all the jobs. Earlier commits in this PR did not have unique names across jobs, like for linux, and windows. Which meant that the report generation step only handled one set of the xml files. Fix that by using the original unique filenames for the merged xml generated from the helix, and non-helix tests.
/bl:${{ parameters.repoLogPath }}/build.binlog | ||
$(_OfficialBuildIdArgs) | ||
displayName: Build | ||
|
||
- ${{ if ne(parameters.skipTests, 'true') }}: | ||
- script: ${{ parameters.dotnetScript }} dotnet-coverage collect | ||
--settings $(Build.SourcesDirectory)/eng/CodeCoverage.config | ||
--output ${{ parameters.repoTestResultsPath }}/NonHelixTests.cobertura.xml | ||
"${{ parameters.buildScript }} -testnobuild -test -configuration ${{ parameters.buildConfig }} /bl:${{ parameters.repoLogPath }}/tests.binlog $(_OfficialBuildIdArgs)" | ||
displayName: Run non-helix tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to run the non-helix tests and the helix tests in parallel? There shouldn't be a reason to serialize them, and that could speed up our CI legs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add a wrapper project that builds both in parallel from msbuild. Should that be in a follow up PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we need to build them in parallel? I'm talking about running them in parallel.
Why would we need a wrapper project? Does azdo allow for running 2 pipeline tasks at the same time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By "build" in parallel I meant run the Test, and send-to-helix targets in parallel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does azdo allow for running 2 pipeline tasks at the same time?
AFAIK, this isn't supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the thing that is allowed is to run two jobs or stages at the same time, but not steps within a job AFAIK. While I do think that could speed up things a bit, I also think it might be overkill to gain 9 mins or so that it takes in average to run the non helix tests first. I propose keeping things as is, and then perhaps look in the future here to see if this is something we need to invest on.
@akoeplinger would you mind taking a look at this PR to check that we are doing this in an idiomatic way? |
Should we go ahead with this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM. Thanks @radical!
I have merged |
Why?
Using helix to run tests would mainly be useful for End2End integration tests which would need controlled versions of dependencies, and that would be easier to do on helix than the build machines. This could be used to setup custom docker containers, and networks to simulate various scenarios and run e2e tests.
With this PR, tests will be run in two steps on CI:
On the build machine - which is the same as they are run currently.
This will be run in a single step, from the source repo with built
artifacts
.On helix - for this the tests are archived to
.zip
files, and thensent to helix with any dependencies.
dotnet-coverage
. But sincetests can run on more than one machine now, there is an additional CI
step that merges all the coverage xml files.
Details
This adds new properties that affect CI builds:
$(IsTestSupportProject)
- which is for a project that should be runas a test project itself.
$(RunTestsOnHelix)
- which defaults tofalse
Also, a default
.runsettings
file is used to allow easier setting ofloggers, and other configuration for the tests.
The dashboard tests are explicitly marked for running on helix, just so we have at least one such test suite. This PR explicitly avoids supporting zero helix tests to ensure that the support does not rot, and helix step cannot silently fail because nothing got zipped.
Increase the timeout for CI from 30mins to 45mins.