187187231 - Update PaaS tenant facing documentation #574

Merged
merged 1 commit into from
Jun 13, 2024
10 changes: 5 additions & 5 deletions source/incident_management/incident_process.html.md.erb
@@ -38,7 +38,7 @@ If an incident is ongoing outside of office hours (i.e. an in-hours incident con
### Starting an incident

1. Acknowledge the incident on PagerDuty or Slack and decide if the alerts you have received and their impact constitute an incident or not. Incidents generally have a negative impact on the availability of tenant services in some way or constitute a [cyber security incident](#what-qualifies-as-a-cyber-security-incident). Problems such as our billing smoke tests failing may indicate a tenant-impacting problem but do not in themselves constitute an incident.
2. Document briefly which steps you are taking to resolve the incident in the #paas-incident Slack channel. If the situation impacts tenants, [escalate to the person on communication](https://support.pagerduty.com/docs/response-plays#run-a-response-play-on-an-incident) (comms) support using PagerDuty so they can communicate with tenants.
2. Document briefly which steps you are taking to resolve the incident in the #paas-incident Slack channel. If the situation impacts tenants, [escalate to the person on communication](https://support.pagerduty.com/docs/response-plays#run-a-response-play-on-an-incident) (comms) support using PagerDuty or Slack so they can communicate with tenants.
3. The #paas-incident channel has a bookmarked hangout link. Join this video call to communicate with the comms lead and talk through what you’re doing and what’s happening.
4. If you decide it’s not an incident after investigating further, you must resolve the incident in PagerDuty. If you are sure it is an incident, [agree on a priority](https://www.cloud.service.gov.uk/support-and-response-times/#response-times-for-services-in-production) for the incident with the comms lead. You can change this priority level later as more information emerges.

@@ -98,11 +98,11 @@ You should start the process to document an incident if:

1. The comms lead should check the incident report has a summary of the incident and is up-to-date.
2. The comms lead should share the incident report with the PaaS SREs and wider managed service pool for visibility.
3. An incident continuation meeting to talk through:
3. An incident continuation meeting should be scheduled for 9am the following working day to talk through:

- the current incident status
- any useful contextual information
- the status of any communications that need to be updated further
- the current incident status
- any useful contextual information
- the status of any communications that need to be updated further
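
The scheduling rule in the updated step 3 (9am the following working day) can be sketched as below. This is a minimal illustration, not part of the team manual: the function name `next_working_day_9am` is hypothetical, and it assumes that working days are Monday to Friday and that bank holidays are ignored.

```python
from datetime import date, datetime, time, timedelta


def next_working_day_9am(incident_day: date) -> datetime:
    """Return 09:00 on the first weekday after incident_day.

    Assumption: working days are Monday to Friday; bank holidays are
    not taken into account.
    """
    candidate = incident_day + timedelta(days=1)
    # weekday() returns 5 for Saturday and 6 for Sunday, so skip those.
    while candidate.weekday() >= 5:
        candidate += timedelta(days=1)
    return datetime.combine(candidate, time(9, 0))


if __name__ == "__main__":
    # An incident on a Friday rolls over to the following Monday.
    print(next_working_day_9am(date(2024, 6, 14)))
```

A real rota would also need to skip bank holidays, which this sketch deliberately leaves out.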

## Escalation paths

32 changes: 2 additions & 30 deletions source/incident_management/roles_and_responsibilities.html.md
@@ -12,7 +12,7 @@ You can refer to [Incident Process](/incident_management/incident_process/) for

The support rota is on [Pagerduty](https://governmentdigitalservice.pagerduty.com/schedules).

We have four rotas: in hours engineering, in hours comms, out of hours engineering and out of hours comms.
We have two rotas: in hours engineering and in hours comms.

Our escalation rota is [GaaP SCS Escalation](https://governmentdigitalservice.pagerduty.com/schedules#PE6NQ9Z).

@@ -31,23 +31,7 @@ The in hours support lead is responsible for:
* picking up ‘small tasks’ in Pivotal if there is time in between support tasks
* having a support handover meeting at the beginning and end of the support week

When you need to take a break for lunch or essential meetings etc, make sure you tell people ahead of time, and arrange for a colleague to cover for you. All other members of the team are responsible for providing assistance to the support person as needed. If you don’t feel comfortable asking your colleagues, talk to the delivery manager or team lead who can help.


## Out of hours engineering support role and responsibility

The out of hours support engineer is responsible for:

* Responding to system alerts which will be sent to you via pagerduty. These will only be things which seriously impact the availability of live services due to a problem with our platform.
* Responding to notification from tenant teams of P1 issues they are having which are caused by problems with the PaaS. This will also be via Pagerduty.
* Looking at the issue and telling the initiator that you are doing so (if initiated by a human).
* Doing what is needed to ensure the platform is available, not necessarily fixing or diagnosing the root cause.
* Alerting the communication escalation person if they feel they need support with putting out tenant comms.
* Involving other people if needed. NO HEROICS, do not deploy if unsure. Some things are not a one-person-decision.

For out of hours similar guidelines apply, if you need cover for an hour or two or an evening (e.g. for an appointment, or a family dinner), you need to agree this in advance with a colleague who can cover for you, and [update Pagerduty using an override](https://support.pagerduty.com/hc/en-us/articles/202830170-Creating-and-Deleting-Overrides).

On a Bank Holiday, we create a daytime override on the out-of-hours schedule so there is someone on the rota during Bank Holiday daytime. This person is usually the current out-of-hours person. We also create an override on the in-hours schedule to ensure that any alerts go to PaaS team email rather than the in-hours person.
When you need to take a break for lunch or essential meetings etc, make sure you tell people ahead of time, and arrange for a colleague to cover for you. All other members of the team are responsible for providing assistance to the support person as needed. If you don’t feel comfortable asking your colleagues, talk to the delivery manager.

## In hours comms lead role and responsibility

@@ -58,18 +42,6 @@ On a Bank Holiday, we create a daytime override on the out-of-hours schedule so
* Records a timeline of events through the incident.
* Drafts and sends regular updates to tenants via Statuspage.

## Out of hours comms lead role and responsibility

* Checks the on call engineer is ok and finds out if they need any additional support.
* If more support is required, attempt to contact others from team (or GDS). Get help from the GaaP SCS Escalation person with this if needed.
* In the case of a P1 incident, or if any additional support is required, contact the person on duty on the [GaaP SCS Escalation rota](https://governmentdigitalservice.pagerduty.com/schedules#PE6NQ9Z).
* Responsible for any communications (internal or external) required throughout the duration of the incident.
* Protects the support engineer from unnecessary distractions or questions.
* Opens an incident template with view permissions to all GDS staff. Posts it in #paas-incident channel.
* Records a timeline of events through the incident.
* Drafts and sends regular updates to tenants via Statuspage.


## GaaP SCS Escalation role and responsibility
If the out of hours 1st line support has decided that they need help with tenant communications so they can focus on fixing the issue, you should contact the person on the [GaaP SCS Escalation rota](https://governmentdigitalservice.pagerduty.com/schedules#PE6NQ9Z). They will then be responsible for:

11 changes: 5 additions & 6 deletions source/support/so_you_are_on_support.html.md
@@ -69,6 +69,7 @@ You need to join the following Slack channels on gds.com:
* [#paas-internal](https://gds.slack.com/archives/CAEHMHGJ2)
* [#paas-incident](https://gds.slack.com/archives/CAD4W35KK)
* [#paas](https://gds.slack.com/archives/CADHV9267)
* [#paas-escalation](https://gds.slack.com/archives/C06LSLE77LJ)

You also need to join the following Slack channels on ukgovernmentdigital.slack.com:

@@ -79,10 +80,9 @@ You also need to join the following Slack channels on ukgovernmentdigital.slack

When you join the team, the DM or tech lead should arrange the following:

* shadowing support shifts 8 weeks after joining the team
* Knowledge transfer sessions using [this document](https://drive.google.com/drive/folders/1_yCutf5ybmNmwz1jKtqRqeyuzz4pw3I7?role=writer)
* 2 weeks shadowing the person on support

They should also arrange reverse shadowing for your first 2 weeks on support, with a more experienced engineer helping and supporting you.
* 2 weeks reverse shadowing for your first 2 weeks on support

Speak to your line manager or the tech lead if you have any questions or concerns about your shadowing experience.

@@ -101,12 +101,11 @@ You should set the Slack channel topic to “[@your_slack_username] is on suppor
At the end of your support shift, you need to:

* make sure all ongoing support tasks are sufficiently documented in Zendesk and/or hand over work to the person coming onto the support rota
* claim for your on-call shift using the [GaaP and CEVPS on-call pay submission form](https://docs.google.com/forms/d/e/1FAIpQLSfpMK85F2CxBFo_uubO2HHintc3Gx6jbifeUhnAm0g6GfoDEA/viewform?vc=0&c=0&w=1&flr=0)

## Support times

* in-hours: Weekdays 9:00 - 17:00
* out-of-hours: Weekdays 17:00 - 9:00, weekends 24/7
* out-of-hours: Not provided
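
As an illustration of the in-hours window above (weekdays 9:00 - 17:00), the following sketch checks whether a given moment falls inside support hours. The name `is_in_hours` is hypothetical, and the sketch assumes local time, an exclusive 17:00 end and no bank holiday handling, none of which the manual specifies.

```python
from datetime import datetime, time

# In-hours window taken from the support times list.
IN_HOURS_START = time(9, 0)
IN_HOURS_END = time(17, 0)


def is_in_hours(moment: datetime) -> bool:
    """Return True if moment is on a weekday between 09:00 and 17:00.

    Assumptions: moment is in local time, the 17:00 boundary is
    exclusive, and bank holidays are not considered.
    """
    is_weekday = moment.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and IN_HOURS_START <= moment.time() < IN_HOURS_END


if __name__ == "__main__":
    print(is_in_hours(datetime(2024, 6, 13, 10, 30)))
```

Anything for which this returns `False` falls outside support provision under the updated policy.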

You should regularly check the [dynamic calendar showing your in-hours support shifts as defined in PagerDuty](https://calendar.google.com/calendar/ical/8nvffdghj1kfrfgmji0ottc8nnh52t37%40import.calendar.google.com/public/basic.ics).

@@ -153,4 +152,4 @@ We use [Concourse](https://concourse-ci.org/) for our continuous integration and
We receive a number of platform alerts as well as tickets submitted through the Zendesk CRM as emails as well. To keep up-to-date during your in-hours support shift, you should regularly check your inbox for messages from:

* [[email protected]](https://groups.google.com/a/digital.cabinet-office.gov.uk/g/govpaas-alerting-prod)
* [[email protected]](https://groups.google.com/a/digital.cabinet-office.gov.uk/g/gov-uk-paas-support)
* [[email protected]](https://groups.google.com/a/digital.cabinet-office.gov.uk/g/gov-uk-paas-support)
9 changes: 2 additions & 7 deletions source/team/comms_lead_role.html.md
@@ -25,11 +25,7 @@ Learn about [Statuspage](/team/statuspage/)
1. The subscribers on statuspage will get the notifications

## If you need to escalate to SMT on call:
1. If you need to escalate to SMT (for example, if it’s affecting coronavirus services) - go to [rotas app](https://rotas.cloudapps.digital/teams/techops-management-escalations) and select the current on call individual to get their contact info

## Don’t forget:
1. Your aim is to do just enough support out of hours to get through to working hours :)
1. You can update the x-gov slack paas channel if relevant
1. If you need to escalate to SMT (for example, if it’s affecting coronavirus services) - go to [rotas app](https://rotas.cloudapps.digital/teams/techops-management-escalations) and select the current on call individual to get their contact info. This applies in hours only.

## Response times for P1 incidents

@@ -40,5 +36,4 @@ Tenant updated: 1hr

### Outside working hours

Start work and respond: 40 minutes
Tenant updated: 1hr
No out of hours support provision.
69 changes: 28 additions & 41 deletions source/team/orientation.html.md
@@ -6,49 +6,23 @@ title: Orientation

Some key information to help new starters to find their way.

## Team members
We are a multidisciplinary team responsible for the development and maintenance of GOV.UK PaaS.

- EL - Engineering Lead Support responsibilities
- CL - Comms Lead Support responsibilities

<!-- List is alphabetised -->
### Delivery Management
- Emma Pearce - Delivery Manager (CL)
- Kam Nijjar - Associate Delivery Manager

<!-- List is alphabetised -->
### Engineering
- Andy Hunt - Tech Lead (EL)
- Ben Corlett - Site Reliability Engineer (EL, contractor)
- Jack Joy - Site Reliability Engineer (EL, contractor)
- Jamie Van Dyke - Site Reliability Engineer (contractor)
- Jani Kraner - Front End Developer (CL)
- Malcolm Saunders - Site Reliability Engineer (EL, contractor)
- Nimalan Kirubakaran - Developer (EL)
- Panos Xynos - Site Reliability Engineer (EL, contractor)
- Robert Scott - Site Reliability Engineer (EL)
- Tom Whitwell - Site Reliability Engineer (EL)

### Product Management
- Lisa Scott - Senior Product Manager (CL)

### Programme Management
- Chris Wells - Programme Manager

### Technical Architecture
- Paul Dougan - Technical Architect (CL)

## Product

The following blog posts and videos give an overview of why we're here and
what we've been doing so far:
The following blog posts and videos give an overview of why PaaS was built and what it was used for:

- [A PaaS for Government - Anna at Velocity Europe (video)](https://www.youtube.com/watch?v=OLOaq-Xf5zU)
- [Building a platform to host digital services - Anna & Carl on the GDS blog](https://gds.blog.gov.uk/2015/09/08/building-a-platform-to-host-digital-services/)
- [Looking at open source PaaS technologies - Anna on the GDS Technology blog](https://gdstechnology.blog.gov.uk/2015/10/27/looking-at-open-source-paas-technologies/)
- [Choosing Cloud Foundry - Anna on the GaaP blog](https://governmentasaplatform.blog.gov.uk/2015/12/17/choosing-cloudfoundry/)

## Decommission Decision

The Government Digital Service (GDS) has provided GOV.UK PaaS since 2015, supporting and providing a public cloud platform for departments using a shared hosting and responsibility model.
Following an extensive analysis period, GDS has concluded that, while the platform has been successful in its aims, the underlying technology would now require investment before it could meet its goals in the long term.
Faced with this need for re-investment, GDS has decided to decommission the platform in order to focus its budget and energy on other GDS products for common use by Government.
GDS will not be providing a replacement hosting service.


## Repos

These are the key repos that we use. There will be many others, which these
@@ -105,6 +79,25 @@ Missing standup, or being late, means you will miss out on updates on:

The last one is especially important, as the standup is a valuable way to crowdsource ideas on problems that people may be having with a story, and if you are not there, you can't help.

### Absences

It’s important that we have cover within the team to mitigate issues during work hours within the agreed SLAs (20 minutes response time for P1s, which means any new alert must be acknowledged and ideally triaged within 20 minutes). Therefore, there must always be someone covering support.
Due to the size of the team, this needs active coordination. To ensure this, the team have adopted the following norms:

* Only one PaaS SRE can be on annual leave at a time
* Another PaaS SRE can be on learning time, but must still be available for support (so self-learning is fine, but a conference would not be)
* Lunch breaks should be staggered

The managed service pool provides additional capacity for support cover when the PaaS SREs are unavailable, for example due to an unplanned absence coinciding with annual leave. The Managed Service Delivery Manager is responsible for ensuring there is appropriate cover in place at all times, so all leave requests must be approved by them.

This is the level of notice we expect when you take leave:

* For 1 day – you must give at least 3 days’ notice
* For 2 days – you must give at least 1 week’s notice
* For 3 to 4 days – you must give at least 2 weeks’ notice
* For 1 to 2 weeks – you must give at least 4 weeks’ notice
* For more than 2 weeks – you must give 3 months’ notice
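
The notice periods above can be read as a simple lookup. The sketch below is one possible encoding, not an official tool: `required_notice` is a hypothetical name, and it assumes leave is counted in working days with 5 working days to a week, which the manual does not state explicitly.

```python
def required_notice(leave_working_days: int) -> str:
    """Return the notice period expected for a leave request.

    Assumption: leave length is counted in working days, with one week
    equal to 5 working days and two weeks equal to 10.
    """
    if leave_working_days < 1:
        raise ValueError("leave must be at least 1 working day")
    if leave_working_days == 1:
        return "at least 3 days"
    if leave_working_days == 2:
        return "at least 1 week"
    if leave_working_days <= 4:
        return "at least 2 weeks"
    if leave_working_days <= 10:
        return "at least 4 weeks"
    return "3 months"


if __name__ == "__main__":
    for days in (1, 2, 4, 5, 10, 11):
        print(days, "->", required_notice(days))
```

The notice is returned as text rather than a number of days so that the "3 months" rule does not have to be converted into an arbitrary day count.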


## Learning our technologies

We use a number of technologies and you may find it easier to learn about each
@@ -122,12 +115,6 @@ familiar with each one.
# Cloud Foundry, for those managing it | | [Cloud Foundry presentation, written by the team](https://docs.google.com/presentation/d/1LkR4Y3jLBQ8uskKeLIyKtSKDoutnAvty-vSSGfVNXZU/view), an [older presentation from before the move to Diego architecture](https://docs.google.com/presentation/d/1sZH1Nn_GiYfpBtT6br_AnZn_dynLzvYizJ9aQ4Zc1Ww/view)
# Terraform | The terraform [intro](https://www.terraform.io/intro/index.html) | The intro also covers key concepts.

## Communicating with Hand Signals

We use hand signals at our meetings to help make them more productive and
accessible for every person on the team. You can find out more about how this
works in practice by reading [our blog post][].

[our blog post]: https://gds.blog.gov.uk/2016/10/07/platform-as-a-service-team-takes-even-handed-approach-to-meetings/

## Inclusive language