Whither scala/scala CI? #751

Open
SethTisue opened this issue Dec 4, 2020 · 14 comments

@SethTisue
Member

SethTisue commented Dec 4, 2020

Not sure where to put this. If you notice anything missing or anything that could use improvement, feel free to edit this directly, or comment with suggestions or questions. Once we feel it's complete, perhaps we can find a place to put it.

This ticket replaces the similar older #507.

See also other tickets labeled CI/publishing/infra.

Basics for contributors

  1. Jenkins and Travis-CI mostly test the same things

The redundancy is partly intentional (each system serves to check/verify that the other one is functioning as expected) and partly a historical accident (we are still experimenting with both and the experimentation hasn't concluded).

  1. As a rule, every PR should pass both CIs before being merged

In particular, every commit in a PR must pass Jenkins. (Travis-CI only tests a merge commit of the PR's last commit.)

For certain PRs, a maintainer might also choose to manually trigger a Windows run (via GitHub Actions) and/or a community build run before merge.

  1. If they both fail, feel free to investigate using whichever CI system you prefer or find more familiar

The Jenkins build is monolithic, which means you only see "pass/fail", and you have to go digging in the logs to see where the failure was. On the other hand, if the problem is a test failure, the Jenkins UI splits out each test for you, so it's more digging initially but then less digging later.

The Travis-CI build is split into jobs: build and bootstrap, run partests, run junit and other tests, compile on Dotty... and we could plausibly split it up even further. You can see at a glance in the GitHub UI which part failed.

When digging through logs, there are other minor ergonomic differences between the two UIs.

  1. If only one fails, you have some digging to do

See "Differences..." below.

  1. Jenkins publishes your changes to a "validation snapshots" resolver

This happens even if some tests fail. See the scala/scala README for information.

Differences between Jenkins and Travis-CI for PR validation

Jenkins, in combination with Scabot (which we built ourselves and operate ourselves), tests every commit in a PR. Travis-CI only tests the last commit. It is perhaps not strictly necessary that we require every commit in a PR to pass CI, but it is desirable.

Jenkins tests each commit in the PR's branch. Travis-CI tests a temporary merge commit of the PR's branch and the target branch (e.g. 2.13.x). When we hit "merge" the HEAD of the target branch may already have moved on, so that result may be stale.
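
As a rough illustration of the difference, here is a minimal Python sketch of what each system ends up testing. It is only a model: `PullRequest`, `per_commit_targets`, `merge_commit_target`, and `merge_result_is_stale` are hypothetical names invented for this example, and neither Scabot nor Travis-CI is implemented this way.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical model of a PR, for illustration only."""
    commits: Tuple[str, ...]  # commit ids on the PR branch, oldest first
    target_head: str          # head of the target branch when CI ran


def per_commit_targets(pr: PullRequest) -> List[str]:
    """Jenkins+Scabot style: every commit on the PR branch gets its own run."""
    return list(pr.commits)


def merge_commit_target(pr: PullRequest) -> str:
    """Travis-CI style: a single run, on a temporary merge of the PR's last
    commit with the target branch as it was at that moment."""
    return f"merge({pr.commits[-1]}, {pr.target_head})"


def merge_result_is_stale(pr: PullRequest, current_target_head: str) -> bool:
    """The merge-commit result is stale once the target branch has moved on."""
    return current_target_head != pr.target_head


pr = PullRequest(commits=("c1", "c2", "c3"), target_head="t1")
print(per_commit_targets(pr))
print(merge_commit_target(pr))
print(merge_result_is_stale(pr, "t2"))
```

In this model, a broken intermediate commit (say `c2`) would be caught only by the per-commit approach, while an incompatibility with the target branch would be caught only by the merge-commit approach.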

Jenkins uses older, substantially more complicated scripts for bootstrapping (see the scripts directory). Travis-CI uses a newer, simpler method (see .travis.yml). The simpler method also more closely resembles how we advise contributors to bootstrap locally. In the long run, we should standardize on the simpler method, but the work of getting rid of the old stuff remains to be done.

For no special reason, only Travis-CI includes the compileWithDotty test, which verifies that the standard library compiles with the latest Scala 3 release.

Only Travis-CI builds the language spec.

How did we get here?

Originally, Jenkins (https://scala-ci.typesafe.com) was our only CI system. But we have to set up and maintain Jenkins ourselves (https://github.com/scala/scala-jenkins-infra) and pay to operate the EC2 instances, so Jenkins is costly for us in both labor and money.

So when the free Travis-CI service came into existence, we thought, let's try it! But we weren't ready to commit to it, so we kept Jenkins around.

Contributors only need to think about PR validation, but the core Scala team also needs a way to publish releases. Originally Scala releases were published from Jenkins, but circa 2018 we decided to move 2.12.x and 2.13.x publishing to Travis-CI, where it has remained ever since.

Why have we kept Jenkins?

Jenkins is a pain to maintain and a pain to expand our CI matrix on (e.g. to other JDK versions), and it's less familiar to most contributors these days than Travis-CI or GitHub Actions. Why do we still have it?

Reasons related to PR validation

  • Jenkins+Scabot tests every commit, not just the last commit.
  • Jenkins knows the publishing secrets necessary to publish PR validation snapshots. Travis-CI doesn't allow pull request jobs to have access to secrets. (Dotty has this problem too: "Automatically publish artifacts for PRs" (scala3#6145) remains open as of April 2021.) PR validation snapshots are especially valuable for running the community build against before a PR is merged, but they're also helpful for other kinds of PR review.

Other reasons

  • Jenkins also runs the community build, which is too big for a free service like Travis-CI. Thus, the cost of also having it do other things is amortized.
  • Free services may come and go, and their free service tiers may get more restrictive. Running our own Jenkins insulates us from such changes.
  • Sometimes it's helpful to troubleshoot Jenkins by ssh'ing to the workers ourselves. This kind of troubleshooting is less available on other services.
@SethTisue SethTisue changed the title from "Document our CI setup and the pros and cons" to "Document CI setup pros and cons, decide what to do" Dec 4, 2020
@SethTisue SethTisue changed the title from "Document CI setup pros and cons, decide what to do" to "Whither scala/scala CI?" Dec 4, 2020

@harpocrates

It might be worth looking at what GHC has done: they're in a somewhat similar position of having limited resources but aspiring to check as many configurations as possible.

As far as the test suite goes, it would be super helpful to have some way of tagging particular tests so they only run on a specific configuration (JDK version, OS, etc.). Even if we don't have a CI configuration that tests those, the tests provide a good way of bisecting for regressions (or running costly configurations infrequently).
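
To sketch what such tagging might look like (purely illustrative, and written in Python rather than in terms of the actual test suite; `TaggedTest`, `requires`, and `select_tests` are made-up names, not an existing API):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class TaggedTest:
    """Hypothetical test descriptor: `requires` maps a configuration key
    (e.g. "jdk", "os") to the set of values the test is allowed to run on."""
    name: str
    requires: Dict[str, Set[str]] = field(default_factory=dict)


def select_tests(tests: List[TaggedTest], config: Dict[str, str]) -> List[str]:
    """Return the names of tests whose every requirement matches `config`.
    A test with no requirements runs everywhere."""
    return [
        t.name
        for t in tests
        if all(config.get(key) in allowed for key, allowed in t.requires.items())
    ]


tests = [
    TaggedTest("runs-everywhere"),
    TaggedTest("jdk17-only", {"jdk": {"17"}}),
    TaggedTest("windows-only", {"os": {"windows"}}),
]
selected = select_tests(tests, {"jdk": "17", "os": "linux"})
print(selected)
```

With something along these lines, a costly configuration could be run infrequently (or only locally while bisecting) simply by passing a different `config`, without touching the tests themselves.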

@lrytz
Member

lrytz commented Mar 25, 2021

I tried out https://github.com/philips-labs/terraform-aws-github-runner a few weeks (months?) ago and it worked very well. In short it sets up everything in an AWS account to use on-demand EC2 (spot) instances as custom runners for GitHub Actions.

Some challenges

  • we currently build every commit in a PR; we'd need to find a way to keep doing that (if we still want it)
  • manage permissions to allow PR builds to publish to our Artifactory

One thing the project doesn't currently support is starting different instance types based on the runner label in GitHub, because the GitHub API doesn't provide the necessary data (philips-labs/terraform-aws-github-runner#518). This might be useful for the community build, and for Windows testing (philips-labs/terraform-aws-github-runner#347).

@SethTisue
Member Author

SethTisue commented Mar 26, 2021

note that if anyone (perhaps someone whose initials are M.H.) is thinking of adding anything to our CI currently, favor adding it to Travis-CI, not Jenkins. the Jenkins build is monolithic and there are some pretty archaic scripts involved. whereas the way Travis-CI is set up is pretty close to how we would do it on GitHub Actions

@SethTisue
Member Author

SethTisue commented Apr 22, 2021

moving Windows testing off Jenkins and onto GitHub Actions is happening at scala/scala#9496 and scala/scala#9485 (merged on 2.12.x, will be merged forward soon). it is run on merged PRs; it is not part of PR validation

and adding JDK 16 (and perhaps 11) to the 2.13 Travis-CI matrix has landed at scala/scala#9579

I have updated the issue description above to remove out-of-date information.

@SethTisue
Member Author

we replaced JDK 16 with 17

@SethTisue
Member Author

SethTisue commented Dec 6, 2021

Travis-CI status update:

note that Travis-CI will no longer offer "concurrency-based" plans except to existing customers who already have them: https://blog.travis-ci.com/2021-12-01-pricingenhancements

we're grandfathered in so this doesn't change anything for us at the moment, but it does indicate that the "1 job at a time" plan we're on might disappear entirely someday

even if it does, it might not be a bad change: the Travis-CI runs we actually need (namely, release runs) happen only rarely, so usage-based pricing might be okay

anyway, something to keep one eye on

https://app.travis-ci.com/github/scala shows that we're down to just scala, scala-dist, and scala-dist-smoketest

plus a few relatively unimportant stragglers we could easily move to GitHub Actions anytime (scala-rewrites, *.g8) (update: moved)

@SethTisue
Member Author

SethTisue commented Feb 14, 2022

Summary:

  • We can't easily get rid of Jenkins without finding some new solution for (or doing without) PR snapshot publishing.
  • We can't easily get rid of Travis-CI without redoing release publishing, which is complex and fragile.

It wouldn't be super hard to move PR validation entirely to GitHub Actions, and just leave Jenkins in place to publish the PR snapshots (without running tests), and use Travis-CI only for publishing. This would decrease the overall complexity by putting as many eggs as possible in the GitHub Actions basket, but:

  • It would require work
  • We'd still have all three systems

Under the circumstances, doing nothing (until some future occurrence forces our hand) may actually make the most sense, despite having three CI systems being manifestly absurd 🤷

I think if Jenkins were to implode we would probably decide to just do without the PR snapshots, but... it hasn't imploded.

@dwijnand
Member

Any reason we can't publish snapshots to Sonatype, with GitHub Actions?

@SethTisue
Member Author

SethTisue commented Feb 15, 2022

To publish anything anywhere, just about, you need a publishing secret, but PR runs on GitHub Actions understandably are not given secrets access. We got around this on Jenkins by storing the secrets on the worker nodes in such a way that they are inaccessible even to hostile code in an attacker's PR. It isn't obvious whether there's some trick or workaround we could use on GitHub Actions to somehow sufficiently safely give the PR jobs publishing permission.

@SethTisue
Member Author

SethTisue commented Mar 2, 2022

Travis-CI has restored general availability of n-jobs-at-a-time plans like the one we're on: https://blog.travis-ci.com/2022-03-02-concurrentpricing

(We still had 1-job-at-a-time because we were grandfathered in, but this increases confidence that it's unlikely to be taken away.)

@lrytz
Member

lrytz commented Mar 3, 2022

> Under the circumstances, doing nothing (until some future occurrence forces our hand) may actually make the most sense, despite having three CI systems being manifestly absurd 🤷

I agree, whatever we invest should be towards the goal of getting rid of either Travis or (parts of) our AWS infra.

Besides PR builds / validation, our AWS infra

  • runs Artifactory (for PR builds and mergely / integration builds)
  • is used for the community build; there's probably no chance of getting that to run on GitHub

> GitHub Actions to somehow sufficiently safely give the PR jobs publishing permission

Maybe others have had similar situations and found solutions?
