Production Readiness Review (Level A) #1

tcnksm · 2020-09-06T04:06:44Z

This template is the Production Readiness Checklist (PRC) for Level A microservices. Please make sure you have read the PRC guidelines.

Production Readiness Review has the following 2 phases:

Design phase
Pre-production phase

Please complete the Design phase review before beginning development of your microservice, and the Pre-production phase review before rolling out the production release.

Design checklist

This checklist contains items that must be considered during the design phase and verified before the start of implementation.

☀️ General

Stateless server - All persistent data is stored outside of the container.
Deploy order - Its deployment does not depend on a strict order.
Exclusive data ownership - It is the only service that can access its data store.

🔒 Security

Authentication - It is protected by an authentication service.
Authorization - Access is restricted to the appropriate level. Consider who should have access to each exposed API and what they are allowed to do.
Transport Security - It uses TLS to communicate with other services over the Internet.
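
As a minimal sketch of the Transport Security item, assuming a Python client (this checklist does not prescribe a language or TLS library), the standard-library `ssl` module can build a client context that verifies certificates and hostnames. The helper name `make_client_tls_context` and the TLS 1.2 floor are illustrative assumptions, not requirements stated in this checklist.

```python
import ssl


def make_client_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context with certificate verification on."""
    # create_default_context() turns on certificate and hostname checks
    # for client connections.
    ctx = ssl.create_default_context()
    # Assumed floor (not from the checklist): refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```

A context built this way can be passed to `ctx.wrap_socket(sock, server_hostname=...)` or to an HTTP client that accepts an `ssl.SSLContext`.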

🍀 Sustainability

No short-term transfer - Its team members are not forced to move to another team in the short term.
OnCall considered team - Its team follows OnCall practices.
Dependency SLA - Its team knows the SLAs of the service's dependencies.
SLOs - Its SLOs and the SLO owner are defined.

Pre-production checklist (Mercari and Merpay common)

This checklist contains points that must be satisfied during implementation and verified prior to release.

We recommend deploying your service in production (without production traffic) before requesting the PRC, because some of the points below (e.g. capacity estimation, dashboards, screenboards, alerting, profiling) can only be validated when the service is deployed in production and can receive some non-production traffic. Do this only if your service will not impact other production services or datasets. Please let us know in the issue if you think this would be a problem for your service.

🔧 Maintainability

📉 Observability

✈️ Reliability

🔒 Security

Security review - It has completed the security design review by the security team.
Non-root user - Its Docker container runs as a non-root user.
Secrets - Its sensitive configuration is stored in Kubernetes secrets.
Non-sensitive log - It does not write sensitive information to app logs (STDOUT/STDERR).
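
The Non-sensitive log item can be supported in code as well as by review. The following is a minimal sketch, assuming a Python service that uses the standard `logging` module. The pattern list (`password`, `token`, `secret`) and the names `redact` and `RedactingFilter` are hypothetical; a real service would list its own sensitive fields.

```python
import logging
import re

# Hypothetical key names; extend this to match the service's own sensitive fields.
SENSITIVE_PATTERN = re.compile(r"(password|token|secret)=\S+", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace the value of each sensitive key=value pair with a marker."""
    return SENSITIVE_PATTERN.sub(r"\1=[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Mask sensitive values before a record reaches STDOUT/STDERR."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so values passed as args are masked too.
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already fully formatted
        return True  # keep the record, now redacted


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(RedactingFilter())
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.warning("login failed password=%s", "hunter2")
```

A filter like this is a safety net rather than a substitute for not logging sensitive values in the first place, since it only catches the patterns it is told about.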

📋 Accessibility

Design Doc - Its design doc is up to date with the implementation.
Description - It has a service description.
Contact - It has contact info for the owners.
Source repo - It has links to the source repo.
Docs - It has links to docs for users.
SLOs - Its dashboard shows SLOs.

📁 Data Storage

Data Replication - Its data is replicated to BigQuery (if required).
Minimal Operator Privileges - Personnel have minimal access privileges, and all access is auditable.
Recovery - It can be recovered from backup; the procedure has been defined and tested.
Fast Recovery - It can be recovered from backup in less than 2 hours; the procedure is described in the OnCall playbook, and it is practiced every 6 months.
PIT Recovery - Point-in-time recovery from backup can be completed in less than 2 hours.
Timeboard - Its GCP databases have a Datadog Timeboard.

GCP Cloud SQL (MySQL)

GCP Cloud Spanner

Regional Configuration - If it is a service deployed in a single region, its databases are in regional configuration and are deployed in the same region.
Global Configuration - If it is a service deployed in multiple regions, its databases are in multi-regional configuration and they are deployed in the same regions.
SLA Exclusions - Its databases are in compliance with the SLA exclusions, so that they do not fall outside of the Cloud Spanner SLA.
Automatic Backups - Its databases have scheduled automatic backups.
CPU - CPU usage of each node is monitored and alerts are sent if it is >65% (or >45% for multi-regional instances).
Disk usage - Disk usage of each node is monitored and alerts are sent if it is >75%.
Sessions - Number of sessions on each database+node is monitored and alerts are sent if it is >7500.
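
The three alert thresholds above can be summarized as a small check. This is only an illustrative sketch in Python: in practice the thresholds would be configured in the monitoring system rather than in application code, and the `NodeMetrics` type and `breached_alerts` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the checklist items above.
CPU_LIMIT_REGIONAL = 65.0        # percent
CPU_LIMIT_MULTI_REGIONAL = 45.0  # percent
DISK_LIMIT = 75.0                # percent
SESSION_LIMIT = 7500             # sessions per database+node


@dataclass
class NodeMetrics:
    """Hypothetical snapshot of one node's metrics."""
    cpu_percent: float
    disk_percent: float
    sessions: int


def breached_alerts(m: NodeMetrics, multi_regional: bool = False) -> List[str]:
    """Return the names of the thresholds that this snapshot exceeds."""
    cpu_limit = CPU_LIMIT_MULTI_REGIONAL if multi_regional else CPU_LIMIT_REGIONAL
    alerts = []
    if m.cpu_percent > cpu_limit:
        alerts.append("cpu")
    if m.disk_percent > DISK_LIMIT:
        alerts.append("disk")
    if m.sessions > SESSION_LIMIT:
        alerts.append("sessions")
    return alerts
```

For example, a node at 50% CPU raises no alert on a regional instance but does on a multi-regional one, because the CPU limit drops from 65% to 45%.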

mercari locked as resolved and limited conversation to collaborators Sep 6, 2020

tcnksm pinned this issue Sep 6, 2020

tcnksm changed the title ~~Production Readiness Review(Level A)~~ Production Readiness Review (Level A) Sep 6, 2020
