
Evaluate Travis or move CI away #652

Closed · jedbrown opened this issue Nov 2, 2020 · 12 comments · Fixed by #658

Comments

@jedbrown (Member) commented Nov 2, 2020

The new pricing model has limited credit for open source, so we may either need to start paying or move elsewhere (presumably GitHub Actions and/or GitLab-CI).

https://blog.travis-ci.com/2020-11-02-travis-ci-new-billing

@jeremylt (Member) commented Nov 2, 2020

I can try a port over to GitHub Actions; it is supposed to be pretty painless. I think we'll lose the ability to target specific hardware, though.

GitLab CI has a minutes cap, and we use a lot of minutes.
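
A first cut of the workflow might look roughly like this (the matrix and the make/make prove targets are just a guess here, not the final configuration):

```yaml
# .github/workflows/test.yml -- illustrative sketch only
name: C/Fortran
on: [push, pull_request]
jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - name: Build
        run: make -j2
      - name: Test
        run: make prove -j2
```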

@jedbrown (Member, Author) commented Nov 2, 2020

That cap is for GitLab's cloud runners (and there's a process to request more); there's no limit on what we run on our own hardware.

We could run Docker+QEMU to simulate architectures for which native hardware is hard to come by.
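
For illustration, the relevant steps of such a job could look something like this (docker/setup-qemu-action registers the QEMU binfmt handlers; the image and build commands are placeholders):

```yaml
# Illustrative fragment: run an aarch64 container on an x86-64 runner via QEMU user-mode emulation
steps:
  - uses: actions/checkout@v2
  - uses: docker/setup-qemu-action@v1   # registers QEMU binfmt handlers
  - name: Build and test under emulation (placeholder commands)
    run: |
      docker run --rm -v "$PWD:/src" -w /src arm64v8/ubuntu:20.04 \
        bash -c "apt-get update && apt-get install -y build-essential && make && make prove"
```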

We could also consider paying for Travis time, but we should probably migrate whatever is easy to migrate. We could use Azure Pipelines for macOS, Windows, and containerized Linux, but it's all x86-64.

@valeriabarra (Contributor) commented Nov 2, 2020

Not sure whether it costs more, less, or about the same as what Travis is going to charge, but I know that CliMA uses Buildkite (https://buildkite.com/pricing), and it can be configured to run on your own runners.

@jedbrown (Member, Author) commented Nov 2, 2020

Maybe I'm missing something, but how is this functionally different from GitLab-CI? Both have an open source runner that we would install on our (on-premise or cloud) hardware.

@valeriabarra (Contributor)

I'm not sure how much it differs, since I'm not familiar with the functionality that GitLab-CI offers either. I suppose they used it because they could easily set up the runner for GPUs on the university's cluster.

@jeremylt (Member) commented Nov 2, 2020

I played around with GitHub Actions, and it's pretty easy to set up.

Perhaps we do something like this:

  • Basic testing on Linux and OSX via GitHub Actions for C, Fortran

  • Basic testing on Linux via GitHub Actions for Python, Julia, and Rust with doc deployment

  • Hardware testing on aarch64 and ppc64le via GitHub Actions (see the sketch after this list) - https://github.com/marketplace/actions/run-on-architecture

  • Noether testing for HIP and MAGMA via GitLab CI runner (perhaps run only those backends there so CI runs faster?)

  • Azure for libCEED + [OCCA, LIBXSMM, MFEM, Nek5000, PETSc] (one or many containers? Perhaps using the CEED container)
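
For the emulated-hardware bullet, the marketplace listing above appears to be uraimo/run-on-arch-action, which wraps the same Docker+QEMU approach Jed mentioned; a hypothetical job (distro, packages, and test target are assumptions):

```yaml
# Hypothetical job for the aarch64/ppc64le bullet above
emulated:
  runs-on: ubuntu-latest
  strategy:
    matrix:
      arch: [aarch64, ppc64le]
  steps:
    - uses: actions/checkout@v2
    - uses: uraimo/run-on-arch-action@v2
      with:
        arch: ${{ matrix.arch }}
        distro: ubuntu20.04
        install: |
          apt-get update -q -y
          apt-get install -q -y build-essential
        run: |
          make
          make prove
```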

@valeriabarra (Contributor)

> Maybe I'm missing something, but how is this functionally different from GitLab-CI? Both have an open source runner that we would install on our (on-premise or cloud) hardware.

Maybe @simonbyrne can answer how Buildkite is different from GitLab-CI?

@simonbyrne commented Nov 2, 2020

We went with Buildkite as we were able to get it to play nicely with our cluster. Basically, we have a cron job on the cluster that polls the Buildkite API to check whether there are new jobs (the cluster is behind a firewall, so we can't use webhooks). When there are new jobs, we create a corresponding Slurm job for each (with options that let different jobs request a specific number of tasks/GPUs): we launch buildkite-agent start inside the Slurm job with the --acquire-job option to ensure a 1-to-1 correspondence between Buildkite jobs and Slurm jobs, and the agent shuts down and terminates the Slurm job as soon as it is finished. We store the Buildkite job id in the Slurm job comment so that we can see which jobs have already been queued.
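
To make the pipeline side concrete: a step can advertise its resource needs through agent tags, and the polling script translates those into sbatch options. The tag names below are hypothetical placeholders, not what our scripts actually use:

```yaml
# Hypothetical Buildkite pipeline step; the slurm_* tags stand in for whatever
# metadata the polling cron job reads and turns into sbatch options before it
# launches `buildkite-agent start --acquire-job <job-id>` inside the allocation.
steps:
  - label: "GPU tests"
    command: make test
    agents:
      queue: "cluster"
      slurm_ntasks: "4"
      slurm_gpus: "1"
```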

We use Bors to handle our merging, and the Buildkite jobs are only triggered when you request a merge (this prevents random people from opening a PR and getting access to our cluster).

Overall it works pretty well, scales nicely (we regularly have 100 or so agents running without problems), and is free for open source projects. Our scripts to make this work are here: https://github.com/CliMA/slurm-buildkite. They are somewhat specific to our use case, but I'm happy to answer questions if you want to adapt them.

We looked into self-hosted GitHub Actions, but couldn't figure out a way to make sure a specific runner would run a specific job (the relevant issues we were stuck on are actions/runner#510 and actions/runner#620). Additionally, scaling runners looks cumbersome: you have to keep registering and unregistering runners, which we don't have to do with Buildkite, where you can just start a new agent and it is added to the pool.

I only took a quick look at GitLab CI: from what I could tell, it has the same problems as GitHub Actions (but I may be wrong).

@jedbrown (Member, Author) commented Nov 2, 2020

Thanks, Simon. We use GitLab-CI for PETSc and have about 60 configurations that run (across various machines) as part of each pipeline. GitLab has "merge trains", which are somewhat similar to Bors (but a native UI feature). ECP has GitLab-CI running via Slurm at DOE facilities. I could track down the scripts, but it's done using the custom executor (after an attempt to upstream a more HPC-specific executor before the custom executor was developed/deployed). PETSc mostly uses the ssh executor, though we'd like to containerize more of the pipelines to make machines more fungible.
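
As a point of comparison with the runner-pinning issue above: GitLab-CI selects runners by tag, so pinning a job to a particular machine is just a matter of tagging it. A hypothetical job (tag name and targets made up for illustration):

```yaml
# Hypothetical .gitlab-ci.yml job pinned to a specific self-hosted runner by tag
noether-hip:
  stage: test
  tags:
    - noether        # only runners registered with this tag will pick up the job
  script:
    - make
    - make prove
```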

@simonbyrne

@jakebolewski did look into the ECP GitLab CI + Slurm integration, but I think in the end we decided it would require significant effort on the part of the cluster admins, whereas we could run the Buildkite agent under existing user permissions. If it had already been set up on our cluster, I imagine we would have used it.

@jeremylt (Member) commented Nov 6, 2020

@jedbrown, any objection to moving our libCEED-only tests (Linux, OSX, the different hardware, Python, and Julia) to GitHub Actions for now?

We could easily move our OCCA and LIBXSMM integration testing to Noether.

Then we'd only have to decide where to run the MFEM and Nek5000 example tests.

I can fiddle with this on the side as I work tomorrow.

@jedbrown (Member, Author) commented Nov 6, 2020

That sounds good. We can put MFEM and Nek5000 on Noether. It would be best to keep the commits pinned, as you've done with caching in Travis.
