From 5005ec7bdc522e1f589b35259f259290e539da88 Mon Sep 17 00:00:00 2001 From: Harsh291104 Date: Sat, 8 Jun 2024 12:34:47 +0100 Subject: [PATCH] 187187231 - Update PaaS tenant facing documentation --- .../incident_process.html.md.erb | 10 +-- .../roles_and_responsibilities.html.md | 32 +-------- source/support/so_you_are_on_support.html.md | 11 ++- source/team/comms_lead_role.html.md | 9 +-- source/team/orientation.html.md | 69 ++++++++----------- source/team/working_practices.html.md | 41 +++-------- 6 files changed, 52 insertions(+), 120 deletions(-) diff --git a/source/incident_management/incident_process.html.md.erb b/source/incident_management/incident_process.html.md.erb index e57c5de8..d0aa98fd 100644 --- a/source/incident_management/incident_process.html.md.erb +++ b/source/incident_management/incident_process.html.md.erb @@ -38,7 +38,7 @@ If an incident is ongoing outside of office hours (i.e. an in-hours incident con ### Starting an incident 1. Acknowledge the incident on PagerDuty or Slack and decide if the alerts you have received and their impact constitute an incident or not. Incidents generally have a negative impact on the availability of tenant services in some way or constitute a [cyber security incident](#what-qualifies-as-a-cyber-security-incident). Problems such as our billing smoke tests failing may indicate a tenant-impacting problem but do not in themselves constitute an incident. -2. Document briefly which steps you are taking to resolve the incident in the #paas-incident Slack channel. If the situation impacts tenants, [escalate to the person on communication](https://support.pagerduty.com/docs/response-plays#run-a-response-play-on-an-incident) (comms) support using PagerDuty so they can communicate with tenants. +2. Document briefly which steps you are taking to resolve the incident in the #paas-incident Slack channel. 
If the situation impacts tenants, [escalate to the person on communication](https://support.pagerduty.com/docs/response-plays#run-a-response-play-on-an-incident) (comms) support using PagerDuty or Slack so they can communicate with tenants. 3. The #paas-incident channel has a bookmarked hangout link. Join this video call to communicate with the comms lead and talk through what you’re doing and what’s happening. 4. If you decide it’s not an incident after investigating further, you must resolve the incident in PagerDuty. If you are sure it is an incident, [agree on a priority](https://www.cloud.service.gov.uk/support-and-response-times/#response-times-for-services-in-production) for the incident with the comms lead. You can change this priority level later as more information emerges. @@ -98,11 +98,11 @@ You should start the process to document an incident if: 1. The comms lead should check the incident report has a summary of the incident and is up-to-date. 2. The comms lead should share the incident report with the PaaS SREs and wider managed service pool for visibility. -3. An incident continuation meeting to talk through: +3. 
An incident continuation meeting should be scheduled for 9am the following working day to talk through: -- the current incident status -- any useful contextual information -- the status of any communications that need to be updated further + - the current incident status + - any useful contextual information + - the status of any communications that need to be updated further ## Escalation paths diff --git a/source/incident_management/roles_and_responsibilities.html.md b/source/incident_management/roles_and_responsibilities.html.md index 06733293..6ab786c0 100644 --- a/source/incident_management/roles_and_responsibilities.html.md +++ b/source/incident_management/roles_and_responsibilities.html.md @@ -12,7 +12,7 @@ You can refer to [Incident Process](/incident_management/incident_process/) for The support rota is on [Pagerduty](https://governmentdigitalservice.pagerduty.com/schedules). -We have four rotas; in hours engineering, in hours comms, out of hours engineering and out of hours comms. +We have two rotas: in hours engineering and in hours comms. Our escalation rota is [GaaP SCS Escalation] (https://governmentdigitalservice.pagerduty.com/schedules#PE6NQ9Z). @@ -31,23 +31,7 @@ The in hours support lead is responsible for: * picking up ‘small tasks’ in Pivotal if there is time in between support tasks * having a support handover meeting at the beginning and end of the support week -When you need to take a break for lunch or essential meetings etc, make sure you tell people ahead of time, and arrange for a colleague to cover for you. All other members of the team are responsible for providing assistance to the support person as needed. If you don’t feel comfortable asking your colleagues, talk to the delivery manager or team lead who can help. - - -## Out of hours engineering support role and responsibility - -The out of hours support engineer is responsible for: - -* Responding to system alerts which will be sent to you via pagerduty. 
These will only be things which seriously impact the availability of live services due to a problem with our platform. -* Responding to notification from tenant teams of P1 issues they are having which are caused by problems with the PaaS. This will also be via Pagerduty. -* Looking at the issue and telling the initiator that you are doing so (if initiated by a human). -* Doing what is needed to ensure the platform is available, not necessarily fixing or diagnosing the root cause. -* Alerting the communication escalation person if they feel they need support with putting out tenant comms. -* Involving other people if needed. NO HEROICS, do not deploy if unsure. Some things are not a one-person-decision. - -For out of hours similar guidelines apply, if you need cover for an hour or two or an evening (e.g. for an appointment, or a family dinner), you need to agree this in advance with a colleague who can cover for you, and [update Pagerduty using an override](https://support.pagerduty.com/hc/en-us/articles/202830170-Creating-and-Deleting-Overrides). - -On a Bank Holiday, we create a daytime override on the out-of-hours schedule so there is someone on the rota during Bank Holiday daytime. This person is usually the current out-of-hours person. We also create an override on the in-hours schedule to ensure that any alerts go to PaaS team email rather than the in-hours person. +When you need to take a break for lunch or essential meetings etc, make sure you tell people ahead of time, and arrange for a colleague to cover for you. All other members of the team are responsible for providing assistance to the support person as needed. If you don’t feel comfortable asking your colleagues, talk to the delivery manager. ## In hours comms lead role and responsibility @@ -58,18 +42,6 @@ On a Bank Holiday, we create a daytime override on the out-of-hours schedule so * Records a timeline of events through the incident. * Drafts and sends regular updates to tenants via Statuspage. 
-## Out of hours comms lead role and responsibility - -* Checks the on call engineer is ok and finds out if they need any additional support. -* If more support is required, attempt to contact others from team (or GDS). Get help from the GaaP SCS Escalation person with this if needed. -* In the case of a P1 incident, or if any additional support is required, contact the person on duty on the [GaaP SCS Escalation rota] (https://governmentdigitalservice.pagerduty.com/schedules#PE6NQ9Z). -* Responsible for any communications (internal or external) required throughout the duration of the incident. -* Protects the support engineer from unnecessary distractions or questions. -* Opens an incident template with view permissions to all GDS staff. Posts it in #paas-incident channel. -* Records a timeline of events through the incident. -* Drafts and sends regular updates to tenants via Statuspage. - - ## GaaP SCS Escalation role and responsibility If the out of hours 1st line support has decided that they need help with tenant communications so they can focus on fixing the issue, you should contact the person on the [GaaP SCS Escalation rota] (https://governmentdigitalservice.pagerduty.com/schedules#PE6NQ9Z). 
They will then be responsible for: diff --git a/source/support/so_you_are_on_support.html.md b/source/support/so_you_are_on_support.html.md index de6a1ca4..d244ed0f 100644 --- a/source/support/so_you_are_on_support.html.md +++ b/source/support/so_you_are_on_support.html.md @@ -69,6 +69,7 @@ You need to join the following Slack channels on gds.com: * [#paas-internal](https://gds.slack.com/archives/CAEHMHGJ2) * [#paas-incident](https://gds.slack.com/archives/CAD4W35KK) * [#paas](https://gds.slack.com/archives/CADHV9267) +* [#paas-escalation](https://gds.slack.com/archives/C06LSLE77LJ) You also need to join the following Slack channels on ukgovernmentdigital.slack.com: @@ -79,10 +80,9 @@ You also need to join the following Slack channels on ukgovernmentdigital.slack When you join the team, the DM or tech lead should arrange the following: -* shadowing support shifts 8 weeks after joining the team +* Knowledge transfer sessions using [this document](https://drive.google.com/drive/folders/1_yCutf5ybmNmwz1jKtqRqeyuzz4pw3I7?role=writer) * 2 weeks shadowing the person on support - -They should also arrange reverse shadowing for your first 2 weeks on support, with a more experienced engineer helping and supporting you. +* 2 weeks reverse shadowing for your first 2 weeks on support Speak to your line manager or the tech lead if you have any questions or concerns about your shadowing experience. 
@@ -101,12 +101,11 @@ You should set the Slack channel topic to “[@your_slack_username] is on suppor At the end of your support shift, you need to: * make sure all ongoing support tasks are sufficiently documented in Zendesk and/or hand over work to the person coming onto the support rota -* claim for your on-call shift using the [GaaP and CEVPS on-call pay submission form](https://docs.google.com/forms/d/e/1FAIpQLSfpMK85F2CxBFo_uubO2HHintc3Gx6jbifeUhnAm0g6GfoDEA/viewform?vc=0&c=0&w=1&flr=0) ## Support times * in-hours: Weekdays 9:00 - 17:00 -* out-of hours: Weekdays 17:00 - 9:00, weekends 24/7 +* out-of hours: Not provided You should regularly check the [dynamic calendar showing your in-hours support shifts as defined in PagerDuty](https://calendar.google.com/calendar/ical/8nvffdghj1kfrfgmji0ottc8nnh52t37%40import.calendar.google.com/public/basic.ics). @@ -153,4 +152,4 @@ We use [Concourse](https://concourse-ci.org/) for our continuous integration and We receive a number of platform alerts as well as tickets submitted through the Zendesk CRM as emails as well. To keep up-to-date during your in-hours support shift, you should regularly check your inbox for messages from: * [govpaas-alerting-prod@digital.cabinet-office.gov.uk](https://groups.google.com/a/digital.cabinet-office.gov.uk/g/govpaas-alerting-prod) -* [gov-uk-paas-support@digital.cabinet-office.gov.uk](https://groups.google.com/a/digital.cabinet-office.gov.uk/g/gov-uk-paas-support) +* [gov-uk-paas-support@digital.cabinet-office.gov.uk](https://groups.google.com/a/digital.cabinet-office.gov.uk/g/gov-uk-paas-support) \ No newline at end of file diff --git a/source/team/comms_lead_role.html.md b/source/team/comms_lead_role.html.md index 6171b1ee..05b9a5d3 100644 --- a/source/team/comms_lead_role.html.md +++ b/source/team/comms_lead_role.html.md @@ -25,11 +25,7 @@ Learn about [Statuspage](/team/statuspage/) 1. 
The subscribers on statuspage will get the notifications ## If you need to escalate to SMT on call: -1. If you need to escalate to SMT (for example, if its affecting coronavirus services) - go to [rotas app](https://rotas.cloudapps.digital/teams/techops-management-escalations) and select the current on call individual to get their contact info - -## Don’t forget: -1. Your aim is to do just enough support out of hours to get through to working hours :) -1. You can update the x-gov slack paas channel if relevant +1. If you need to escalate to SMT (for example, if it’s affecting coronavirus services) - go to [rotas app](https://rotas.cloudapps.digital/teams/techops-management-escalations) and select the current on call individual to get their contact info. Escalation is available in hours only. ## Response times for P1 incidents @@ -40,5 +36,4 @@ Tenant updated: 1hr ### Outside working hours -Start work and respond: 40 minutes -Tenant updated: 1hr +No out of hours support provision. diff --git a/source/team/orientation.html.md b/source/team/orientation.html.md index ea590e82..fe86ea0b 100644 --- a/source/team/orientation.html.md +++ b/source/team/orientation.html.md @@ -6,49 +6,23 @@ title: Orientation Some key information to help new starters to find their way. -## Team members -We are a multidisciplinary team responsible for the developement and maintenance of GOV.UK PaaS. 
- -- EL - Engineering Lead Support responsiblities -- CL - Comms Lead Support responsibilities - - -### Delivery Management -- Emma Pearce - Delivery Manager (CL) -- Kam Nijjar - Associate Delivery Manager - - -### Engineering -- Andy Hunt - Tech Lead (EL) -- Ben Corlett - Site Reliability Engineer (EL, contractor) -- Jack Joy - Site Reliability Engineer (EL, contractor) -- Jamie Van Dyke - Site Reliability Engineer (contractor) -- Jani Kraner - Front End Developer (CL) -- Malcolm Saunders - Site Reliability Engineer (EL, contractor) -- Nimalan Kirubakaran - Developer (EL) -- Panos Xynos - Site Reliability Engineer (EL, contractor) -- Robert Scott - Site Reliability Engineer (EL) -- Tom Whitwell - Site Reliability Engineer (EL) - -### Product Management -- Lisa Scott - Senior Product Manager (CL) - -### Programme Management -- Chris Wells - Programme Manager - -### Technical Architecture -- Paul Dougan - Technical Architect (CL) - ## Product -The following blog posts and videos give an overview of why we're here and -what we've been doing so far: +The following blog posts and videos give an overview of why PaaS was built and what it was used for: - [A PaaS for Government - Anna at Velocity Europe (video)](https://www.youtube.com/watch?v=OLOaq-Xf5zU) - [Building a platform to host digital services - Anna & Carl on the GDS blog](https://gds.blog.gov.uk/2015/09/08/building-a-platform-to-host-digital-services/) - [Looking at open source PaaS technologies - Anna on the GDS Technology blog](https://gdstechnology.blog.gov.uk/2015/10/27/looking-at-open-source-paas-technologies/) - [Choosing Cloud Foundry - Anna on the GaaP blog](https://governmentasaplatform.blog.gov.uk/2015/12/17/choosing-cloudfoundry/) +## Decommission Decision + +The Government Digital Service (GDS) has provided GOV.UK PaaS since 2015, supporting and providing a public cloud platform for departments using a shared hosting and responsibility model. 
+Following an extensive analysis period, GDS has concluded that, while the platform has been successful in its aims, the underlying technology would now require investment before it could meet its goals in the long term. +Faced with this need for re-investment, GDS has decided to decommission the platform, in order to focus its budget and energy on other GDS products for common use by Government. +GDS will not be providing a replacement hosting service. + + ## Repos These are the key repos that we use. There will be many others, which these @@ -105,6 +79,25 @@ Missing standup, or being late, means you will miss out on updates on: The last one is especially important, as the standup is a valuable way to crowdsource ideas on problems that people may be having with a story, and if you are not there, you can't help. +### Absences + +It’s important that we have cover within the team to mitigate issues during work hours within the agreed SLAs (20 minutes response time for P1s, which means any new alert must be acknowledged and ideally triaged within 20 minutes). Therefore, there must always be someone covering support. +Due to the size of the team, this needs active coordination from the team. To ensure this, the team have adopted the following team norms: + +* Only one PaaS SRE can be on annual leave at a time +* Another PaaS SRE can be on learning time, but must still be available for support (so self-learning is fine, but a conference would not be) +* Lunch breaks should be staggered + +The managed service pool provide additional capacity for support cover when the PaaS SREs are unavailable, for example due to an unplanned absence coinciding with some annual leave. The Managed Service Delivery Manager is responsible for ensuring there is appropriate cover in place at all times, and so all leave requests must be approved by them. 
+ +* This is the level of notice we expect when you take leave: +* For 1 day – you must give at least 3 days notice +* For 2 days – you must give at least 1 week notice +* For between 3 days - 4 days – you must give at least 2 weeks notice +* For between 1 week - 2 weeks – you must give at least 4 weeks notice +* For greater than 2 weeks – you must give 3 months notice + + ## Learning our technologies We use a number of technologies and you may find it easier to learn about each @@ -122,12 +115,6 @@ familiar with each one. # Cloud Foundry, for those managing it | | [Cloud Foundry presentation, written by the team](https://docs.google.com/presentation/d/1LkR4Y3jLBQ8uskKeLIyKtSKDoutnAvty-vSSGfVNXZU/view), an [older presentation from before the move to Diego archecture](https://docs.google.com/presentation/d/1sZH1Nn_GiYfpBtT6br_AnZn_dynLzvYizJ9aQ4Zc1Ww/view) # Terraform | The terraform [intro](https://www.terraform.io/intro/index.html) | The intro also covers key concepts. -## Communicating with Hand Signals - -We use hand signals at our meetings to help make them more productive and -accessible for every person on the team. You can find out more about how this -works in practice by reading [our blog post][]. - [our blog post]: https://gds.blog.gov.uk/2016/10/07/platform-as-a-service-team-takes-even-handed-approach-to-meetings/ ## Inclusive language diff --git a/source/team/working_practices.html.md b/source/team/working_practices.html.md index eca09c74..fe81bb38 100644 --- a/source/team/working_practices.html.md +++ b/source/team/working_practices.html.md @@ -23,7 +23,7 @@ instance it's okay to ask for help, it's okay to have quiet days, and many other The development process consists of the following steps: -1. The Product Manager makes decisions on which work (for example, features or bugs) to prioritise. +1. The Tech Lead makes decisions on which work (for example, features or bugs) to prioritise. 2. 
The team make the necessary changes in their own development environments on feature branches. @@ -48,7 +48,7 @@ The release process consists of the following steps: 2. When the development is complete, a developer raises a pull request against the main branch of the git repository for review. -3. Another team member reviews the pull request and if there are no problems, merges the changes into the main branch. Alternatively, they may provide feedback to the developers and request corrections and/or additional work. +3. Another team member reviews the pull request and if there are no problems, merges the changes into the main branch. Alternatively, they may provide feedback to the developers and request corrections and/or additional work. When one of the PaaS SREs is unavailable (e.g. due to planned or unplanned leave), the Managed Service SRE Pool will be responsible for reviewing pull requests. 4. A new commit to the main branch triggers a Concourse git_resource that performs an automated deployment to the staging environment. @@ -64,14 +64,9 @@ repositories. ## Pairing -We pair on all stories to ensure that people don't get stuck on the same +We aim to pair on all stories to ensure that people don't get stuck on the same types of work and that there is a good distribution of knowledge across the -team. We aim to rotate pairs regularly by: - -- changing pairs when you've been on a story for more than 2 days - -- joining someone on an existing story that doesn't have a pair instead of - picking up new work +team. However, due to the size of the team, this is not always possible. We don't insist on a particular method of pairing. We're keen to have two people making decisions and aware of the story, but there are lots of ways those two @@ -299,7 +294,7 @@ should be noted. Doing this allows the Product Manager prioritise the follow up ## Review Review is the step in our process where the pull requests relating to a story -are code-reviewed, and merged. 
This is typically done by somone who hasn't +are code-reviewed, and merged. This is typically done by someone who hasn't worked on the story, however if a story has been paired on throughout, the pair can merge their own PRs, and push the story straight through to approval. @@ -333,42 +328,26 @@ If it changes behaviour or makes new features available to users: Technical Documentation changes follow the same overall process as code changes, but with several documentation-specific amends. This section summarises the tech docs change process. -### Pre kick-off - -- Required: technical writer - -Before the formal story kick-off, the technical writer reviews the story and drafts changes if possible. - ### Kick-off -- Required: technical writer, technical lead -- Optional: product representative, developer - -At this step, decide on what changes to the tech docs are required to complete this story. You must also agree on who needs to review and approve this story. Make sure that you decide whether the story needs product as well as technical review. If no specific technical reviewers are named, any developer can serve as the technical reviewer. You should also analyse if any other further changes or stories will result from this story. +At this step, decide on what changes to the tech docs are required to complete this story. You must also agree on who needs to review and approve this story. Make sure that you decide whether the story needs Service Owner review as well as technical review. If no specific technical reviewers are named, any developer can serve as the technical reviewer. You should also analyse if any other further changes or stories will result from this story. ### Doing -- Required: technical writer, developer - Draft the content changes in markdown, ensuring that it is technically correct and in line with the GDS style guide. 
You must preview changes in tech doc format so that the new or amended content is smoothly integrated into the exsting documentation structure. Evaluate if story needs to change, and if so, whether this can be included in the scope of the original story or should be part of a new story. Once you have agreed that the content is ready for further review, raise a pull request to the paas-tech-docs repo. Note that if the change requires product review, you must push the tech doc changes to Cloud Foundry for review. Refer to the [Deploy a static site](https://docs.cloud.service.gov.uk/#deploy-a-static-site) for instructions on how to do this. ### Reviewing -- Required: technical writer, developer, technical writer 2i -- Optional: product representative, tech lead +Three reviews should happen: -Three reviews should happen at the same time: - -- The developer reviews the content in both GitHub and in the tech doc preview, checking if it addresses the issues in the story. -- The tech writer conducting the 2i review checks the style and content. -- The product rep or tech lead reviews content if required; the product rep will look at the temporary Cloud Foundry version of the tech docs. +- Review the content in both GitHub and in the tech doc preview, checking if it addresses the issues in the story. +- Conduct the 2i review, checking the style and content. +- Review content if required, using the temporary Cloud Foundry version of the tech docs. Implement any changes as required, and then get sign-off from reviewers. The reviewers then merge the changes once they have been signed off. ### Approving -- Required: tech writer, approver - The approver checks if the change deployed correctly, and whether it addresses the story. If it does so, they approve the story.