This repository has been archived by the owner on Mar 22, 2021. It is now read-only.

Let's rework PAPR's architecture! #62

Open
jlebon opened this issue Sep 22, 2017 · 9 comments

jlebon commented Sep 22, 2017

Rough problem statement

Right now, our architecture is very much a waterfall: events on
GitHub trigger a linear cascade that eventually fires off PAPR
for those specific events.

This has severe limitations:

  • monolithic architecture makes it harder to try out locally
    and thus harder to contribute
  • reliance on multiple linear infrastructures (CI bus,
    Jenkins, OpenStack) results in a higher fault rate
  • no easy way to scale horizontally for HA

Other issues plaguing the current architecture:

  • the queue is not easily visible (and not public), so it's
    hard to tell what's going on without manually checking the
    internal Jenkins queue (the Homu queue is sort of visible,
    but could be much better)
  • no job prioritization, purely FIFO. Ideally, we want to be
    able to apply a set of rules for how jobs should be
    prioritized, e.g. the 'auto' and 'try' branches before PRs,
    PRs with certain labels before others, etc...
  • the combination of GHPRB and Homu is confusing and creates
    an inconsistent user experience

Rough solution

We split up the architecture into multiple small services:

  1. Homu/PAPR Scheduler (PAPRQ)
    Proposed to run in CentOS CI
  2. PAPR Workers
    Need either OpenStack or Docker/Kubernetes
    Bin packing problem - can we e.g. reuse Kubernetes
    https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
  3. PAPR Publishers

The scheduler receives events from GitHub and queues up the
jobs. It understands .papr.yml and splits it into individual
jobs, each representing a testsuite. This allows e.g. workers
that only support containerized workloads to still participate
in the pool. It also allows container work to be weighed
differently from VM/cluster work.
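
To make that concrete, here's a minimal sketch of what the splitting step could look like, assuming the scheduler has the raw .papr.yml text in hand (PyYAML for parsing; the 'workload' job field and the exact job shape are hypothetical, not an existing PAPR schema):

import yaml

def split_into_jobs(papr_yml_text, repo, sha):
    jobs = []
    # .papr.yml can hold multiple YAML documents, one per testsuite
    for suite in yaml.safe_load_all(papr_yml_text):
        if not suite:
            continue
        # decide which workers may claim the job, based on the
        # container:/host:/cluster: keys mentioned in this thread
        if 'container' in suite:
            workload = 'container'
        elif 'cluster' in suite:
            workload = 'cluster'
        else:
            workload = 'host'
        jobs.append({
            'repo': repo,
            'sha': sha,
            'context': suite.get('context', 'default'),
            'workload': workload,
            'suite': suite,
        })
    return jobs

A container-only worker would then only ever ask PAPRQ for jobs whose workload is 'container'.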

Workers periodically (with forced polls on events if
implementable) query PAPRQ for available jobs. PAPRQ
prioritizes jobs by a given set of rules.
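
For the "given set of rules", something as simple as an ordered sort key might be enough to start with. A rough sketch (the job fields mirror the hypothetical ones above; this is not a proposal for the final rule format):

def priority_key(job):
    # 'auto'/'try' branch jobs first, then labeled PRs, then FIFO
    if job.get('branch') in ('auto', 'try'):
        tier = 0
    elif 'priority' in job.get('labels', []):
        tier = 1
    else:
        tier = 2
    return (tier, job['submitted_at'])

def next_job(queue, worker_workloads):
    # workers only see jobs whose workload type they can actually run
    candidates = [j for j in queue if j['workload'] in worker_workloads]
    return min(candidates, key=priority_key) if candidates else None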

What it would entail

  1. The largest piece of work will be to enhance (fork?) Homu
    to also handle PR events and add them to its queue. This
    naturally resolves the confusing UX, and makes
    optimizations like status-based test exemptions
    (servo/homu#54) trivial to implement.

E.g. @rh-atomic-bot retry will actually know whether to
retry testing the PR, or retry testing the merge.

It also allows for more sophisticated syntax, like:
@rh-atomic-bot retry f26-sanity
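
As a sketch of what parsing that could look like on the scheduler side (the bot name is the real one used above; the helper itself is hypothetical):

import re

# '@rh-atomic-bot retry' retries everything; '@rh-atomic-bot retry
# f26-sanity ...' retries only the named testsuites.
RETRY_RE = re.compile(r'^@rh-atomic-bot\s+retry(?P<suites>(\s+\S+)*)\s*$',
                      re.MULTILINE)

def parse_retry(comment_body):
    m = RETRY_RE.search(comment_body)
    if m is None:
        return None
    suites = m.group('suites').split()
    return suites if suites else 'all'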

  2. Teach PAPR to connect to PAPRQ for jobs. This is either a
    long-running service that polls, or is periodically started
    by an external service (e.g. Jenkins, OCP)

  3. This can come later. Rather than the workers publishing
    to e.g. S3 themselves, do similar to what Cockpit does and
    stream logs and updates back to PAPRQ itself. This allows us
    to (1) have publicly visible streaming logs, and (2) keep
    all the secrets in PAPRQ and only require workers to have a
    single token.
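
A sketch of the worker side of that streaming idea, assuming a hypothetical PAPRQ endpoint and a single bearer token (only the requests library is assumed):

import requests

def stream_log(paprq_url, worker_token, job_id, log_path):
    # Workers only hold one token; PAPRQ keeps the S3/GitHub secrets.
    headers = {'Authorization': 'Bearer ' + worker_token}
    with open(log_path, 'rb') as f:
        for chunk in iter(lambda: f.read(64 * 1024), b''):
            r = requests.post('%s/jobs/%s/log' % (paprq_url, job_id),
                              headers=headers, data=chunk, timeout=30)
            r.raise_for_status()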

Let's finalize this work and split it up amongst team
members so that everyone understands how it works, and can
help manage it.

Risks

  • Contributions/blocking on Servo/Homu team - getting review time is hard

Sub-alternative:
Sidecar/wrapper for Homu - PAPR intercepts github
events and forwards them to Homu as well, but also builds
its own state.
(Investigate organization-wide github events)
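
A sketch of what that sidecar could look like, using Flask purely for illustration (the Homu URL, route, and state handling are all assumptions):

from flask import Flask, request
import requests

app = Flask(__name__)

# Hypothetical: where the real Homu instance listens for GitHub webhooks.
HOMU_WEBHOOK_URL = 'http://homu.example.com/github'

def record_event(event, payload):
    # Stub: a real sidecar would update PAPR's own state store here.
    repo = payload.get('repository', {}).get('full_name', '?')
    print('saw %s event for %s' % (event, repo))

@app.route('/github', methods=['POST'])
def github_webhook():
    event = request.headers.get('X-GitHub-Event', '')
    payload = request.get_json(force=True, silent=True) or {}
    record_event(event, payload)
    # forward the untouched payload on to Homu as well
    requests.post(HOMU_WEBHOOK_URL, json=payload,
                  headers={'X-GitHub-Event': event}, timeout=10)
    return '', 204

(A real forwarder would likely also need to preserve or re-sign the webhook signature header so Homu still accepts the payload.)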

Transition plan

Can take down per-PR testing while keeping up testing on auto branch.

Alternatives

Customize Jenkins (Integrate better with CentOS CI)
Relationship with GHPRB there?

Hop on http://prow.k8s.io/ with Origin

Rely on Travis more

Other discussion

Standard test roles vs PAPR
PAPR describes more things, handles tasks like provisioning more declaratively
Could have stdtest in upstream git?

Colin: could PAPR run stdtest? Jonathan: the problem is that the tests live in a separate git repo, unless the upstream repo also holds the stdtest definition?


ashcrow commented Sep 25, 2017

We split up the architecture into multiple small services:

👍

  1. Homu/PAPR Scheduler (PAPRQ)
    Proposed to run in CentOS CI
  2. PAPR Workers
    Need either OpenStack or Docker/Kubernetes
    Bin packing problem - can we e.g. reuse Kubernetes
    https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/
  3. PAPR Publishers

This sounds a lot like a message+worker system (e.g. Celery). Homu has a queue built in, but if the requirement is for PAPR itself to be able to handle its own testing while keeping Homu strictly for merge work, reusing an established message+worker system would be close to trivial.

The largest piece of work will be to enhance (fork?) Homu
to also handle PR events and add them to its queue. This
naturally resolves the confusing UX experience, and makes
optimizations like servo/homu#54
trivial to implement.

I'm not against it, other than the need to hold our own Homu fork, as we'd need to rebase and modify any time we want to pull in new features from upstream. I worry that combining merge gating and distributed testing may be coupling things that should stay separate, but I wouldn't complain if it ends up being easier to code.

@cgwalters

Also, as part of this, consider becoming a GitHub App maybe? https://developer.github.com/apps/building-integrations/setting-up-a-new-integration/about-choosing-an-integration-type/

This way papr could be installed organization-wide more easily.

@jlebon added the jira label on Jan 16, 2018

jlebon commented Feb 2, 2018

So... I think a good strategy here would be to start with the minimal amount of work needed to get into CentOS CI, and then slowly iterate on changing the architecture. Clearly, some changes will require a lot of upfront investment, but keeping things gradual ensures we can timebox improvements.

Basically, I think the path of least resistance towards migrating to our CCI OCP project is something like:

  1. set up a Jenkins service with GHPRB; and make sure webhooks are enabled
  2. change the codebase to start containerized workloads in pods rather than containers, so we would e.g. use oc exec instead of docker exec for provisioning (see the sketch after this list)
  3. change the codebase to start VMs inside containers rather than in an OSP instance.
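
For step 2, the provisioning change is roughly just swapping the exec transport. A sketch (the function boundary is made up; oc exec and docker exec are the real commands):

import subprocess

def run_in_workload(name, cmd, use_openshift=True):
    # pods go through 'oc exec <pod> -- <cmd>'; local containers
    # keep using 'docker exec <container> <cmd>'
    if use_openshift:
        argv = ['oc', 'exec', name, '--'] + cmd
    else:
        argv = ['docker', 'exec', name] + cmd
    return subprocess.run(argv, check=True)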

Once we get that down, we can talk about swapping out GHPRB, doing fancier queueing, web UIs, etc... Does that make sense?

I'll try my hand at (1) and (2) today.

@cgwalters

I see what you're saying, but the thing I struggle with right now is that the current monolithic architecture is hard, at least for me, to understand and hack on. I guess a basic problem is that I just don't have a consistent amount of time to learn it, so it's mostly when things fail that it becomes critical path.

I keep circling back to that queuing approach; it feels orthogonal to the other work. If we stood up a simple service that received the webhooks, did the conversion to Kube jobs, and posted that for anything else to read, it wouldn't conflict with the ocp branch, and would allow iteration on a more decoupled architecture. Maybe? What still feels fuzzy to me is how result posting works wrt credentials and large log files.


jlebon commented Apr 3, 2018

One of my main goals is to make it really easy to get started and contribute. E.g. the instructions in https://github.com/projectatomic/papr/blob/ocp/docs/RUNNING.md should hopefully get anyone up and running within 5 minutes. In the long term, we'll need more people to help maintain this thing if we really want to make it part of our upstream CI story.

I keep circling back to that queuing approach, it feels orthogonal to the other work. If we stood up a simple service that received the webhooks and also did the conversion to Kube jobs, and posted that for anything else to read

Yeah, I'd definitely like to get to something like that (though I pictured using custom resources, much like what Prow does). Between that and getting a container: MVP up and running, though, it feels like the latter takes priority? I'll dedicate some time to this during this sprint and try to carve out pieces for others to contribute.
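
To make the webhooks-to-Kube-jobs idea from the quote above a bit more concrete, a rough sketch of the conversion, reusing the same papr runtest arguments that appear in the Prow config further down (the image name and metadata are placeholders):

def pr_event_to_job(payload):
    # Turn a pull_request webhook payload into a Kubernetes Job that
    # runs 'papr runtest' against that PR.
    repo = payload['repository']['full_name']
    pr = payload['pull_request']
    return {
        'apiVersion': 'batch/v1',
        'kind': 'Job',
        'metadata': {'generateName': 'papr-pr-'},
        'spec': {
            'template': {
                'spec': {
                    'restartPolicy': 'Never',
                    'containers': [{
                        'name': 'papr',
                        'image': 'papr:latest',
                        'args': ['runtest',
                                 '--repo', repo,
                                 '--pull', str(pr['number']),
                                 '--expected-sha1', pr['head']['sha']],
                    }],
                },
            },
        },
    }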

One thing I'm still struggling with is whether to eventually support host: and cluster: as well. Those would clearly still use VM-in-containers, though it depends on whether we want to hide that from users. It seems like in a lot of cases, users don't want the sort of flexibility that e.g. ostree needs. PAPR also does a few nice things like setting up ssh keys and injecting fresh rpmmd data.


jlebon commented Jul 3, 2018

One note I'd like to make here, for lack of a better place: I was recently playing with replacing GHPRB with Prow, and it worked surprisingly well with the PAPR-in-OCP model. Notably, this involved setting skip_report to true so that Prow wasn't involved at all in reporting status back to GitHub.

The config I got running with was this:

presubmits:
  jlebon/papr-sandbox:
    - name: papr-sandbox-pull
      agent: kubernetes
      branches:
        - master
      always_run: true
      skip_report: true
      # XXX: need to teach trigger to pass down full comment so PAPR can learn
      # to handle retesting failed testsuites
      #trigger: "(?m)^/retest(\\s+.*)*$"
      #rerun_command: "/retest [testsuite1] [testsuite2] ..."
      spec:
        restartPolicy: Never
        serviceAccountName: papr
        volumes:
          - name: config-mount
            configMap:
              name: papr-config
        containers:
          - name: papr
            image: 172.30.1.1:5000/myproject/papr:latest
            imagePullPolicy: Always
            args:
              - "--debug"
              - "runtest"
              - "--conf"
              - "/etc/papr/config"
              - "--repo"
              - "$(REPO_OWNER)/$(REPO_NAME)"
              - "--pull"
              - "$(PULL_NUMBER)"
              - "--expected-sha1"
              - "$(PULL_PULL_SHA)"
            securityContext:
              runAsUser: 0
            volumeMounts:
              - name: config-mount
                mountPath: /etc/papr
            env:
              - name: GITHUB_TOKEN
                valueFrom:
                  secretKeyRef:
                    name: github-token
                    key: token
                    optional: false
              - name: AWS_ACCESS_KEY_ID
                valueFrom:
                  secretKeyRef:
                    name: aws-access-key
                    key: id
                    optional: false
              - name: AWS_SECRET_ACCESS_KEY
                valueFrom:
                  secretKeyRef:
                    name: aws-access-key
                    key: secret
                    optional: false

Then e.g. /retest, /hold, etc. just worked (see jlebon/papr-sandbox#1). I'd love to make use of this, though I'm not sure whether we'd have to run our own Prow instance, or whether the one running at http://deck-ci.svc.ci.openshift.org/ could run pods in CentOS CI -- or we could just move away from CentOS CI and run them at ci.openshift.org? IIUC, we should be able to satisfy kvm: true there.

@cgwalters

Hmmm. I definitely see the advantages of taking GHPRB out of the picture, though my primary focus tends to go to the merge queue aspect - I see it as a lot more important than the per-PR testing.

I have a worry that prow-but-not-tide etc. is going to leave us with gaps and things to glue; although that's definitely true today with GHPRB/papr/homu, so not really a change for better/worse.

In this proposal we'd still be using Homu, correct?


jlebon commented Jul 20, 2018

though my primary focus tends to go to the merge queue aspect

Yeah, that's one missing piece. I was looking at this some more today. To summarize: there is no prioritization of prowjobs; plank just blindly creates pods for any prowjob that doesn't have one already. There's a concurrency field to limit how many pending pods are allowed.

This means that any prioritization should happen at the Kube level. Kube does support pod priorities, which means we could e.g. label merge pods as having higher priority than PR pods so that they are scheduled before them. However, by default, it also does preemption, which means that e.g. it could kick out an already running PR job to schedule a merge job, which I don't think we want. You can disable preemption, but not per-pod, only at the kubelet level.
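
To make the labeling idea concrete, a sketch of how job pods could get different priority classes when their specs are built (the class names are made up and would need matching PriorityClass objects in the cluster; as noted, this buys scheduling order but not a per-pod way to turn preemption off):

def with_priority(pod_spec, is_merge_job):
    # 'papr-merge' would have a higher PriorityClass value than
    # 'papr-pr', so merge pods get scheduled first when capacity
    # is tight.
    pod_spec['priorityClassName'] = 'papr-merge' if is_merge_job else 'papr-pr'
    return pod_spec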

One idea is to write a custom scheduler for pods coming from prowjobs which would know how to prioritize them (could even do neat things like prioritizing some repos over others), though not sure how much work it'd be (e.g. we might have to implement node selection ourselves if there's no way to reuse the default scheduler logic).


In this proposal we'd still be using Homu, correct?

I was looking at possibly going all in at this point. Tide seems to offer the same guarantee of never-broken-master as Homu by rerunning tests before merge. IIUC though, the way it does this is very different: instead of pushing a branch like auto, test pods are fed a base SHA and a list of PRs (for batching) to git merge manually before testing. There's a helper utility (clonerefs) in the prow repo that can do this.
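
A sketch of that merge construction with plain git (refs/pull/N/head is GitHub's real ref layout for PR heads; everything else is illustrative):

import subprocess

def git(*args, cwd=None):
    subprocess.run(['git'] + list(args), cwd=cwd, check=True)

def prepare_batch(workdir, repo_url, base_sha, pr_numbers):
    # Check out the base SHA, then vanilla-merge each PR head on top
    # before testing, roughly what tide/clonerefs does.
    git('clone', repo_url, workdir)
    git('checkout', base_sha, cwd=workdir)
    for n in pr_numbers:
        git('fetch', 'origin', 'refs/pull/%d/head' % n, cwd=workdir)
        git('merge', '--no-edit', 'FETCH_HEAD', cwd=workdir)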

But at merge time, tide uses the GitHub merge API rather than pushing the merge itself. This does bring back discussions from servo/homu#57 -- it assumes that whatever merge strategy GitHub uses will be exactly equivalent to what was tested (clonerefs uses vanilla git merges). So it's definitely not as explicit as Homu. Not sure how hard it'd be to teach tide to adopt the Homu strategy here of just directly pushing what was tested.

There's also no autosquashing feature, though that should be trivial to implement if tide learns to do the push itself.

@cgwalters

However, by default, it also does preemption, which means that e.g. it could kick out an already running PR job to schedule a merge job, which I don't think we want.

Hmm, that's assuming the cluster is full. The plank part may need to be a bit more configurable. Alternatively, we could tend to rely on projects only doing "baseline" tests by default in PRs, and have e.g. /test compose or /test all that we run on particular PRs as needed.

But if the cluster is full (which actually gets into a more interesting question in a cloud world about just how much money we want to spend, since there is no "full") - yes, I'd say that merge jobs should preempt PR jobs, right?
