
Evaluate Travis or move CI away #652

Closed · jedbrown opened this issue Nov 2, 2020 · 12 comments · Fixed by #658

Comments

@jedbrown (Member) commented Nov 2, 2020

The new pricing model has limited credit for open source, so we may either need to start paying or move elsewhere (presumably GitHub Actions and/or GitLab-CI).

https://blog.travis-ci.com/2020-11-02-travis-ci-new-billing

@jeremylt (Member) commented Nov 2, 2020

I can try a port over to GitHub Actions; it is supposed to be pretty painless. I think we'll lose the ability to target specific hardware, though.

GitLab CI has a minutes cap, and we use a lot of minutes.
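
A first cut of the workflow might look roughly like this (the matrix and the make/make prove targets are just a guess here, not the final configuration):

```yaml
# .github/workflows/test.yml -- illustrative sketch only
name: C/Fortran
on: [push, pull_request]
jobs:
  test:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v2
      - name: Build
        run: make -j2
      - name: Test
        run: make prove -j2
```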

@jedbrown (Member, Author) commented Nov 2, 2020

That cap is for GitLab's cloud runners (and there's a process to request more); there's no limit on what we run on our own hardware.

We could run Docker+QEMU to simulate architectures for which native hardware is hard to come by.
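
For illustration, the relevant steps of such a job could look something like this (docker/setup-qemu-action registers the QEMU binfmt handlers; the image and build commands are placeholders):

```yaml
# Illustrative fragment: run an aarch64 container on an x86-64 runner via QEMU user-mode emulation
steps:
  - uses: actions/checkout@v2
  - uses: docker/setup-qemu-action@v1   # registers QEMU binfmt handlers
  - name: Build and test under emulation (placeholder commands)
    run: |
      docker run --rm -v "$PWD:/src" -w /src arm64v8/ubuntu:20.04 \
        bash -c "apt-get update && apt-get install -y build-essential && make && make prove"
```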

We could also consider paying for Travis time, but we should probably migrate whatever is easy to migrate. We could use Azure Pipelines for macOS, Windows, and containerized Linux, but it's all x86-64.

@valeriabarra (Contributor) commented Nov 2, 2020

Not sure whether it costs more, less, or about the same as what Travis is going to charge, but I know that CliMA uses Buildkite (https://buildkite.com/pricing), and it can be configured to run on your own runners.

@jedbrown (Member, Author) commented Nov 2, 2020

Maybe I'm missing something, but how is this functionally different from GitLab-CI? Both have an open source runner that we would install on our (on-premise or cloud) hardware.

@valeriabarra (Contributor)

I'm not sure how much it differs, since I'm not familiar with the functionality that GitLab-CI offers either. I suppose they used it because they could easily set up the runner for GPUs on the university's cluster.

@jeremylt (Member) commented Nov 2, 2020

I played around with GitHub Actions, and it's pretty easy to set up.

Perhaps we do something like this:

  • Basic testing on Linux and OSX via GitHub Actions for C, Fortran

  • Basic testing on Linux via GitHub Actions for Python, Julia, and Rust with doc deployment

  • Hardware testing on aarch64 and ppc64le via GitHub Actions (see the sketch after this list) - https://github.com/marketplace/actions/run-on-architecture

  • Noether testing for HIP and MAGMA via GitLab CI runner (perhaps run only those backends there so CI runs faster?)

  • Azure for libCEED + [OCCA, LIBXSMM, MFEM, Nek5000, PETSc] (one or many containers? Perhaps using the CEED container)
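
For the emulated-hardware bullet, the marketplace listing above appears to be uraimo/run-on-arch-action, which wraps the same Docker+QEMU approach Jed mentioned; a hypothetical job (distro, packages, and test target are assumptions):

```yaml
# Hypothetical job for the aarch64/ppc64le bullet above
emulated:
  runs-on: ubuntu-latest
  strategy:
    matrix:
      arch: [aarch64, ppc64le]
  steps:
    - uses: actions/checkout@v2
    - uses: uraimo/run-on-arch-action@v2
      with:
        arch: ${{ matrix.arch }}
        distro: ubuntu20.04
        install: |
          apt-get update -q -y
          apt-get install -q -y build-essential
        run: |
          make
          make prove
```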

@valeriabarra (Contributor)

> Maybe I'm missing something, but how is this functionally different from GitLab-CI? Both have an open source runner that we would install on our (on-premise or cloud) hardware.

Maybe @simonbyrne can answer how Buildkite is different from GitLab-CI?

@simonbyrne commented Nov 2, 2020

We went with Buildkite as we were able to get it to play nicely with our cluster. Basically, we have a cron job on the cluster that polls the Buildkite API to check whether there are new jobs (the cluster is behind a firewall, so we can't use webhooks). When there are new jobs, we create a corresponding Slurm job for each (with options that let different jobs request a specific number of tasks/GPUs): we launch buildkite-agent start inside the Slurm job with the --acquire-job option to ensure a 1-to-1 correspondence between Buildkite jobs and Slurm jobs, and the agent shuts down and terminates the Slurm job as soon as it is finished. We store the Buildkite job id in the Slurm job comment so that we can see which jobs have already been queued.
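
To make the pipeline side concrete: a step can advertise its resource needs through agent tags, and the polling script translates those into sbatch options. The tag names below are hypothetical placeholders, not what our scripts actually use:

```yaml
# Hypothetical Buildkite pipeline step; the slurm_* tags stand in for whatever
# metadata the polling cron job reads and turns into sbatch options before it
# launches `buildkite-agent start --acquire-job <job-id>` inside the allocation.
steps:
  - label: "GPU tests"
    command: make test
    agents:
      queue: "cluster"
      slurm_ntasks: "4"
      slurm_gpus: "1"
```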

We use Bors to handle our merging, and the Buildkite jobs are only triggered when you request a merge (this prevents random people from opening a PR and getting access to our cluster).

Overall it works pretty well, scales nicely (we regularly have 100 or so agents running without problems), and is free for open source projects. Our scripts to make this work are here: https://github.com/CliMA/slurm-buildkite. They are somewhat specific to our use case, but I'm happy to answer questions if you want to adapt them.

We looked into self-hosted GitHub Actions, but couldn't figure out a way to make sure a specific runner would run a specific job (the relevant issues we were stuck on are actions/runner#510 and actions/runner#620). Additionally, scaling runners looks cumbersome: you have to keep registering and unregistering runners, which we don't have to do with Buildkite, where you can just start a new agent and it is added to the pool.

I only took a quick look at GitLab CI: from what I could tell, it has the same problems as GitHub Actions (but I may be wrong).

@jedbrown (Member, Author) commented Nov 2, 2020

Thanks, Simon. We use GitLab-CI for PETSc and have about 60 configurations that run (across various machines) as part of each pipeline. GitLab has "merge trains", which are somewhat similar to Bors (but a native UI feature). ECP has GitLab-CI running via Slurm at DOE facilities. I could track down the scripts, but it's done using the custom executor (after an attempt to upstream a more HPC-specific executor before the custom executor was developed/deployed). PETSc mostly uses the ssh executor, though we'd like to containerize more of the pipelines to make machines more fungible.
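
As a point of comparison with the runner-pinning issue above: GitLab-CI selects runners by tag, so pinning a job to a particular machine is just a matter of tagging it. A hypothetical job (tag name and targets made up for illustration):

```yaml
# Hypothetical .gitlab-ci.yml job pinned to a specific self-hosted runner by tag
noether-hip:
  stage: test
  tags:
    - noether        # only runners registered with this tag will pick up the job
  script:
    - make
    - make prove
```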

@simonbyrne

@jakebolewski did look into the ECP GitLab CI + Slurm integration, but I think in the end we decided it would require significant effort on the part of the cluster admins, whereas we could run the Buildkite agent under existing user permissions. If it had already been set up on our cluster, I imagine we would have used it.

@jeremylt (Member) commented Nov 6, 2020

@jedbrown, any objection to moving our libCEED-only tests (Linux, OSX, the different hardware, Python, and Julia) to GitHub Actions for now?

We could easily move our OCCA and LIBXSMM integration testing to Noether.

Then we'd only have to decide where to run the MFEM and Nek5000 example tests.

I can fiddle with this on the side as I work tomorrow.

@jedbrown (Member, Author) commented Nov 6, 2020

That sounds good. We can put MFEM and Nek5000 on Noether. It would be best to keep the commits pinned, as you've done with caching in Travis.
