Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tests] Add support for running tests on helix #1917

Merged
merged 17 commits into from
Feb 9, 2024

Conversation

radical
Copy link
Member

@radical radical commented Jan 28, 2024

Why?

Using helix to run tests would mainly be useful for End2End integration tests which would need controlled versions of dependencies, and that would be easier to do on helix than the build machines. This could be used to setup custom docker containers, and networks to simulate various scenarios and run e2e tests.

With this PR, tests will be run in two steps on CI:

  1. On the build machine - which is the same as they are run currently.
    This will be run in a single step, from the source repo with built
    artifacts.

  2. On helix - for this the tests are archived to .zip files, and then
    sent to helix with any dependencies.

  • all the tests will still continue to use dotnet-coverage. But since
    tests can run on more than one machine now, there is an additional CI
    step that merges all the coverage xml files.

Details

This adds new properties that affect CI builds:

  • $(IsTestSupportProject) - which is for a project that should be run
    as a test project itself.

  • $(RunTestsOnHelix) - which defaults to false

  • Also, a default .runsettings file is used to allow easier setting of
    loggers, and other configuration for the tests.

  • The dashboard tests are explicitly marked for running on helix, just so we have at least one such test suite. This PR explicitly avoids supporting zero helix tests to ensure that the support does not rot, and helix step cannot silently fail because nothing got zipped.

  • Increase the timeout for CI from 30mins to 45mins.

With this PR, tests will be run in two steps on CI:

1. On the build machine - which is the same as they are run currently.
   This will be run in a single step, from the source repo with built
   `artifacts`.

2. On helix - for this the tests are archived to a `.zip` file, and then
   sent to helix with any dependencies.

Additionally, tests can be marked to always be run on helix or the build
machine. Other tests can be run on either.

- a new property `$(DefaultToRunningTestsOnHelix)` when true would run
  all the tests that are not build-machine-only to run on helix.
  - when the property is !true, all the tests that can run build-machine
    will run there. This is the default.

This adds new properties that affect CI builds:

- `$(IsTestSupportProject)` - which is for a project that should be run
  as a test project itself.
- `$(IsHelixOnlyTestProject)` - such a test can only be run on helix.
- `$(IsNonHelixOnlyTestProject)` - such a project must always run on the
  build machine, IOW, with the source repo.

- Also, a default `.runsettings` file is used to allow easier setting of
  loggers, and other configuration for the tests.
All the tests will still continue to use dotnet-coverage. But since
tests can run on more than one machine now, there is an additional CI
step that merges all the coverage xml files.
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-integrations Issues pertaining to Aspire Integrations packages label Jan 28, 2024
@radical radical marked this pull request as ready for review January 29, 2024 22:13
@radical
Copy link
Member Author

radical commented Jan 29, 2024

Ready for review

@radical
Copy link
Member Author

radical commented Jan 30, 2024

Helix step can be slow to get agents sometimes, so increased the timeout to 45mins.

@@ -0,0 +1,94 @@
# Please remember to update the documentation if you make changes to these parameters!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loved the docs :).

Did we get these from some other repo? Mainly asking to see if we need to try to keep in sync.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The template is from eng/common/templates/steps/send-to-helix.yml, slightly modified here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm I see, is there a way to instead use that template file and pass in some parameters to accommodate for our small modifications? If possible, I'd love to reduce the amount of yaml that we have to maintain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This eng/common/templates/steps/send-to-helix.yml is a sample template from arcade, and it's not used directly. But I can make changes in that file itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean when you say a "sample template". Tipically the templates under eng\common\templates are there so that repos can use them if they want/need as part of their build process. I don't want to block this PR on this, but I do want to make sure this is fixed. Can we at least log a tracking bug so that we go and add the necessary hooks in dotnet/arcade repo so that we can later ingest those changes here, and remove this file entirely so that we instead leverage the files from eng\common?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -10,6 +10,16 @@
CS1712: Type parameter 'type_parameter' has no matching typeparam tag in the XML comment on 'type_or_member' (but other type parameters do)
-->
<NoWarn>$(NoWarn);1573;1591;1712</NoWarn>

<IsHelixOnlyTestProject Condition="'$(IsHelixOnlyTestProject)' == ''">false</IsHelixOnlyTestProject>
<IsNonHelixOnlyTestProject Condition="'$(IsNonHelixOnlyTestProject)' == ''">false</IsNonHelixOnlyTestProject>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the above is a boolean, do we actually need this property? Can't we just infer based on the above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have 3 possible values - 1. helix only; 2. build only; 3. any . The two properties represent (1), and (2), and both being false means that (3) - no restrictions on where the tests can run.

We could use a string property with some fixed values like ['helix', 'non-helix', 'any'], but then you have the value strings in multiple places where you want to check, and that could get misspelled and cause tests to get silently skipped.

I'm not attached to any specific way of doing it though.

The name isn't great either, for the second property - IsNonHelixOnlyTestProject. An alternative would be IsBuildMachineOnlyTestProject sounds less clear.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, yeah I think that as @eerhardt pointed out in a separate comment, I guess that the name is a bit confusing. I like his suggestion of dropping Only, or alternatively we can have something like OnlyRunOnHelix and DoNotRunOnHelix or something along those lines. Don't feel strongly about my suggestion haha, but I do think that the current version might be misleading.

tests/send-to-helix.proj Outdated Show resolved Hide resolved
@radical
Copy link
Member Author

radical commented Feb 1, 2024

I want to double check something about the test results, so please don't merge yet.

@radical
Copy link
Member Author

radical commented Feb 1, 2024

I want to double check something about the test results, so please don't merge yet.

All good.

@@ -1,4 +1,7 @@
<Project>
<PropertyGroup>
<IsNonHelixOnlyTestProject>true</IsNonHelixOnlyTestProject>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name feels weird. It would seem that IsHelixOnlyTestProject=false should be the same thing as this.

If we want to say "this test should never run on Helix", I would drop "Only" in the name. IsNonHelixTestProject=true.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking for having the only bit was to have a way to allow running tests on the build machine, or helix and be able to easily move those around in case we don't want to run on helix at all, or want to completely switch to helix. But we can skip that whole thing.

Without that behavior:

  • we can have a single property - RunTestsOnHelix - a bool, and it would control switching between helix, and the build machine. And this can default to false.
  • And I'll drop defaultToRunningTestsOnHelix

how does that sound?

Do we want a mixed mode too - one for which the test project is run in both places, and attributes on individual tests determine which ones to skip? If we don't control this at the project level, then we have to send all the tests to helix and let the attributes help with the skips.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(My opinion) - I would prefer to do it as close as possible to how dotnet/aspnetcore does it. Most of the engineers in this repo work in dotnet/aspnetcore as well. So it would be great if the mechanics are the same, so they don't have to relearn things jumping between repos.

@@ -10,6 +10,16 @@
CS1712: Type parameter 'type_parameter' has no matching typeparam tag in the XML comment on 'type_or_member' (but other type parameters do)
-->
<NoWarn>$(NoWarn);1573;1591;1712</NoWarn>

<IsHelixOnlyTestProject Condition="'$(IsHelixOnlyTestProject)' == ''">false</IsHelixOnlyTestProject>
<IsNonHelixOnlyTestProject Condition="'$(IsNonHelixOnlyTestProject)' == ''">false</IsNonHelixOnlyTestProject>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mitchdenny mitchdenny added area-engineering-systems and removed area-integrations Issues pertaining to Aspire Integrations packages labels Feb 5, 2024
@mitchdenny mitchdenny added this to the preview 4 (Mar) milestone Feb 5, 2024
.. and use a single property named `$(RunTestsOnHelix)` that defaults to
`false`.
@@ -5,6 +5,8 @@ parameters:
HelixBuild: $(Build.BuildNumber) # required -- the build number Helix will use to identify this -- automatically set to the AzDO build number
HelixTargetQueues: '' # required -- semicolon-delimited list of Helix queues to test on; see https://helix.dot.net/ for a list of queues
HelixAccessToken: '' # required -- access token to make Helix API requests; should be provided by the appropriate variable group
HelixProjectPath: '' # required -- path to the project file to build
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in eng/common, which is maintained in dotnet/arcade. Are we going to send a PR to arcade for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this get overwritten by arcade updates? If so, then I'll move this to a separate file as before, and just reference this original template in a comment. Even if not that maybe not touching this file would be good.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joperezr I'm reverting to the earlier way of using a copy of that file. Lemme know if you have an objections to that.

@@ -98,7 +98,7 @@ stages:
jobs:

- job: windows
timeoutInMinutes: 30
timeoutInMinutes: 45
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😞

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, 30 mins was already pretty low and super close to the time it actually takes to build the repo in AzDO, so we were going to have to do this sooner rather than later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

that's from a recent CI run.

Comment on lines 52 to 60
- ${{ if eq(parameters.isWindows, 'true') }}:
- pwsh: |
$_cov_filelist=$((Get-ChildItem -Path ${{ parameters.repoTestResultsPath }} -Recurse -Filter *.cobertura.xml -File | % { $_.FullName }) -join ' ')
echo "${{ parameters.dotnetScript }} dotnet-coverage merge $_cov_filelist -f cobertura -o ${{ parameters.repoTestResultsPath }}/$(Agent.Os)_$(Agent.JobName).cobertura.xml" > merge-cov.cmd
cmd /c merge-cov.cmd
rm merge-cov.cmd
displayName: Merge code coverage results

- ${{ if ne(parameters.isWindows, 'true') }}:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to split here? Why can't the same script run on both windows and non-windows?

Might need a comment explaining why we are doing this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into my lack of pwsh skills :D

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the yml that generates the report, it handles merging already. So, this can be simplified.

<DotNetCliPackageType>sdk</DotNetCliPackageType>

<GlobalJsonContent>$([System.IO.File]::ReadAllText('$(RepoRoot)global.json'))</GlobalJsonContent>
<DotNetCliVersion>$([System.Text.RegularExpressions.Regex]::Match($(GlobalJsonContent), '(%3F&lt;="dotnet": ").*(%3F=")'))</DotNetCliVersion>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who reads this value?

Copy link
Member Author

@radical radical Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is read by helix targets in Arcade.

<PropertyGroup>
<Language>msbuild</Language>

<_workItemTimeout>00:20:00</_workItemTimeout>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this and the TestSessionTimeout in the .runsettings file? Why are they 2 different values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout defined in the yml is for the CI step, so the whole step that sends items to helix and waits on them. It would include the time taken while waiting for agents to be assigned, and the time take to run the tests on helix.

The timeout here is only for the time taken on helix, per helix item.

The timeout for the step is larger mainly to account for helix wait times, which can vary depending on how busy the queue is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the timeout in the .runsettings file for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the timeout in the .runsettings file for?

IIUC, that is a timeout for individual tests.

Also, it should have been 5minutes but I had bumped it up when I was getting timeouts and forgot to revert that.

The second stage processes the *cobertura.xml files to generate a
report. And for that it collects these xml files from all the jobs.
Earlier commits in this PR did not have unique names across jobs, like
for linux, and windows. Which meant that the report generation step only
handled one set of the xml files.

Fix that by using the original unique filenames for the merged xml
generated from the helix, and non-helix tests.
@radical radical requested a review from eerhardt February 6, 2024 18:42
/bl:${{ parameters.repoLogPath }}/build.binlog
$(_OfficialBuildIdArgs)
displayName: Build

- ${{ if ne(parameters.skipTests, 'true') }}:
- script: ${{ parameters.dotnetScript }} dotnet-coverage collect
--settings $(Build.SourcesDirectory)/eng/CodeCoverage.config
--output ${{ parameters.repoTestResultsPath }}/NonHelixTests.cobertura.xml
"${{ parameters.buildScript }} -testnobuild -test -configuration ${{ parameters.buildConfig }} /bl:${{ parameters.repoLogPath }}/tests.binlog $(_OfficialBuildIdArgs)"
displayName: Run non-helix tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to run the non-helix tests and the helix tests in parallel? There shouldn't be a reason to serialize them, and that could speed up our CI legs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a wrapper project that builds both in parallel from msbuild. Should that be in a follow up PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we need to build them in parallel? I'm talking about running them in parallel.

Why would we need a wrapper project? Does azdo allow for running 2 pipeline tasks at the same time?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By "build" in parallel I meant run the Test, and send-to-helix targets in parallel.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does azdo allow for running 2 pipeline tasks at the same time?

AFAIK, this isn't supported.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the thing that is allowed is to run two jobs or stages at the same time, but not steps within a job AFAIK. While I do think that could speed up things a bit, I also think it might be overkill to gain 9 mins or so that it takes in average to run the non helix tests first. I propose keeping things as is, and then perhaps look in the future here to see if this is something we need to invest on.

@radical
Copy link
Member Author

radical commented Feb 6, 2024

@akoeplinger would you mind taking a look at this PR to check that we are doing this in an idiomatic way?

@radical
Copy link
Member Author

radical commented Feb 8, 2024

Should we go ahead with this?

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM. Thanks @radical!

joperezr

This comment was marked as duplicate.

@radical
Copy link
Member Author

radical commented Feb 9, 2024

I have merged main into the branch, and will wait for the CI results to confirm that this didn't break since the last build 2 days ago.

@radical radical merged commit b3e6add into dotnet:main Feb 9, 2024
8 checks passed
@radical radical deleted the helix-tests branch February 9, 2024 04:35
@github-actions github-actions bot locked and limited conversation to collaborators Apr 24, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants