Please have the Design phase review done before beginning development of your microservice, and the Pre-production phase review done before rolling out the production release.
Design checklist
This checklist contains items that must be considered during the design phase and verified before implementation starts.
☀️ General
Stateless server - All persistent data is stored outside of the container.
Deploy order - It can be deployed without a strict ordering dependency on other services.
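A minimal Python sketch of the Stateless server item, assuming a hypothetical `KeyValueStore` interface that stands in for storage outside the container (for example Redis or a database). The handler `record_visit` is also hypothetical. The point is that it keeps no state of its own, so adding, removing, or restarting a replica loses nothing.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical interface for storage that lives outside the container
    (for example Redis or a database in a real deployment)."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Test-only fake. Keeping data in process memory is exactly what the
    Stateless server item rules out for production."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def record_visit(store: KeyValueStore, user_id: str) -> int:
    """Handler logic that keeps no state of its own. Every read and write goes
    through the injected store, so any replica (or a restarted container)
    sees the same count."""
    key = f"visits:{user_id}"
    count = int(store.get(key) or "0") + 1
    # Not atomic: a real store should use an atomic increment (e.g. Redis INCR).
    store.set(key, str(count))
    return count
```

In production the store argument would be a client for the external service. The in-memory fake exists only so the handler can be unit tested.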
Pre-production checklist (Mercari and Merpay common)
This checklist contains points that must be satisfied during implementation and verified prior to release.
It is recommended to ensure that your service is deployed in production (but not receiving production traffic) before requesting the PRC, as some of the points in the list below can only be validated (e.g. capacity estimation, dashboards, screenboards, alerting, profiling, ...) if the service is deployed in production and can receive some non-production traffic. This should be done only if your service will not impact other production services or datasets. Please let us know in the issue if you think this would be a problem for your service.
🔧 Maintainability
Unit test - It has unit tests, and they run in a CI system.
Test coverage - Its test coverage is reported to Codecov from the CI system.
High Test coverage - Its test coverage is over 80%.
Config in env-var - Its config can be overridden via environment variable.
dockerignore - It has a .dockerignore file to reduce the Docker image size.
No latest tag - Its Docker image tag is not latest or master.
Dependabot - Its dependencies are automatically updated.
Automated build - Its build process is automated (both the binary build and the Docker image build are in scope).
Automatic build - Its automated build process runs in a CI/CD system.
Automated deploy - Its deploy process is automated.
Automatic deploy - Its automated deploy process runs in a CI/CD system.
Error tracking - Its errors are tracked by Sentry.
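For the Config in env-var item, one common pattern is to define defaults in code and let environment variables override them. This is a minimal sketch. The variable names `APP_PORT`, `APP_LOG_LEVEL` and `APP_DB_TIMEOUT_SECONDS` and the default values are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    port: int = 8080
    log_level: str = "info"
    db_timeout_seconds: float = 2.0


def load_config(env: Optional[Mapping[str, str]] = None) -> Config:
    """Build Config from defaults, letting environment variables override them.

    The variable names are hypothetical; use whatever naming convention your
    service already has. Passing `env` explicitly makes the function testable.
    """
    env = os.environ if env is None else env
    defaults = Config()
    return Config(
        port=int(env.get("APP_PORT", defaults.port)),
        log_level=env.get("APP_LOG_LEVEL", defaults.log_level),
        db_timeout_seconds=float(
            env.get("APP_DB_TIMEOUT_SECONDS", defaults.db_timeout_seconds)
        ),
    )
```

Parsing everything in one place means a bad value (for example a non-numeric `APP_PORT`) fails at startup rather than on the first request.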
✈️ Reliability
Auto Scale - It automatically scales horizontally to handle fluctuating workloads, its HPA is set as described in the Resource Requests and Limits documentation, and can be scaled manually if needed.
Graceful degradation - It keeps working, at least partially, while dependencies (e.g. other services or databases) are partially or completely unavailable.
Readiness Probe - It has a health check endpoint for the readiness probe, and the readiness probe is configured.
Timeout - It sets an appropriate timeout for requests over a network.
Smart retry - It performs smart retries when interacting with dependencies (e.g. other services or databases).
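The Timeout and Smart retry items work together. Retries should back off, be randomized so clients do not retry in lockstep, and stop when an overall time budget runs out. This sketch uses capped exponential backoff with "full jitter". `RetryableError`, the default limits, and the deadline handling are assumptions for illustration, not a prescribed policy.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. HTTP 503, gRPC UNAVAILABLE)."""


def call_with_retry(
    fn: Callable[[float], T],
    *,
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    deadline_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Call fn, retrying RetryableError with capped exponential backoff and full jitter.

    fn receives the remaining time budget in seconds so it can use it as its own
    per-request timeout. Any other exception propagates immediately, and no retry
    is scheduled once it would overrun the overall deadline.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    start = time.monotonic()
    for attempt in range(max_attempts):
        remaining = deadline_seconds - (time.monotonic() - start)
        if remaining <= 0:
            raise TimeoutError("retry deadline exceeded")
        try:
            return fn(remaining)
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time up to the capped exponential delay.
            delay = rng.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            if delay >= deadline_seconds - (time.monotonic() - start):
                raise TimeoutError("retry deadline exceeded")
            sleep(delay)
    raise AssertionError("unreachable")
```

A caller might wrap something like `urllib.request.urlopen(url, timeout=remaining)` so that 503 responses raise `RetryableError`. Only idempotent requests should be retried this way. Otherwise a retry after a lost response can apply a write twice.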
🔒 Security
Security review - It has completed the security design review by the security team.
Non-root user - Its Docker container runs as a non-root user.
Secrets - Its sensitive configuration is stored in Kubernetes secrets.
Non-sensitive log - It does not write sensitive information to app logs (STDOUT/STDERR).
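For the Non-sensitive log item, a defensive layer is to redact known-sensitive patterns before log records reach STDOUT/STDERR. This is a minimal sketch using a standard `logging.Filter`. The key names and the e-mail pattern are assumptions and would need tuning to the data your service actually handles. Redaction is a safety net, not a substitute for not logging secrets in the first place.

```python
import logging
import re

# Hypothetical patterns; tune these to the fields your service actually handles.
_SENSITIVE_KV = re.compile(r"(?i)(password|passwd|token|secret|api_key)(\s*[=:]\s*)(\S+)")
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(message: str) -> str:
    """Mask values of sensitive key=value pairs and e-mail addresses."""
    message = _SENSITIVE_KV.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", message)
    return _EMAIL.sub("[REDACTED_EMAIL]", message)


class RedactingFilter(logging.Filter):
    """Rewrites each record's final message before any handler formats it.

    Note: exception tracebacks (exc_info) are not redacted by this filter.
    """

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already fully formatted
        return True
```

Attaching the filter to the handler (`handler.addFilter(RedactingFilter())`) covers records propagated from child loggers too. A filter attached to a logger only sees records logged directly on that logger.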
📋 Accessibility
Design Doc - Its design doc is up to date with the implementation.
Description - It has a service description.
Contact - It has contact information for the owners.
Source repo - It has links to the source repo.
Docs - It has links to docs for users.
SLOs - Its dashboard shows SLOs.
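For the SLOs item, the numbers a dashboard typically shows for a request-based availability SLO are the observed SLI, the error budget for the window, and how much of it remains. This is a minimal sketch of that arithmetic. The choice of a request-count SLI is an assumption, and other SLI types (for example latency) need different formulas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLOStatus:
    sli: float               # fraction of good events in the window
    error_budget: float      # number of bad events the SLO allows in the window
    budget_remaining: float  # fraction of the budget still unspent (negative once exceeded)


def slo_status(total: int, bad: int, target: float) -> SLOStatus:
    """Compute a request-based availability SLI and error budget for one window."""
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    if total <= 0:
        # No traffic in the window: nothing has been spent.
        return SLOStatus(sli=1.0, error_budget=0.0, budget_remaining=1.0)
    sli = (total - bad) / total
    error_budget = (1 - target) * total
    budget_remaining = 1 - bad / error_budget
    return SLOStatus(sli=sli, error_budget=error_budget, budget_remaining=budget_remaining)
```

For example, 1,000,000 requests against a 99.9% target allow 1,000 bad requests. 500 failures means half the budget is spent.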
📁 Data Storage
Data Replication - Its data is replicated to BigQuery (if required).
Minimal Operator Privileges - Personnel has minimal access privileges and accesses are auditable.
Recovery - It can be recovered from backup; the procedure has been defined and tested.
Fast Recovery - It can be recovered from backup in less than 2 hours; the procedure is described in the OnCall playbook, and it is practiced every 6 months.
PIT Recovery - Point-in-time recovery from backup can be completed in less than 2 hours.
Timeboard - Its GCP databases have a Datadog Timeboard.
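The Fast Recovery and PIT Recovery items set a concrete target: a restore in under 2 hours, practiced every 6 months. This sketch checks a recorded restore drill against those two numbers. Treating 6 months as 183 days is an assumption, and how drill timestamps are recorded is left to the team.

```python
from datetime import datetime, timedelta
from typing import List

# Targets taken from the Fast Recovery / PIT Recovery items above.
RECOVERY_TARGET = timedelta(hours=2)
# "Practiced every 6 months", approximated here as 183 days (an assumption).
DRILL_INTERVAL = timedelta(days=183)


def drill_findings(started: datetime, finished: datetime, now: datetime) -> List[str]:
    """Return problems found in the most recent restore drill (empty list = OK)."""
    findings: List[str] = []
    duration = finished - started
    if duration >= RECOVERY_TARGET:
        findings.append(f"restore took {duration}, target is under {RECOVERY_TARGET}")
    if now - finished > DRILL_INTERVAL:
        findings.append(f"last drill finished {now - finished} ago, more than 6 months")
    return findings
```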
GCP Cloud SQL (MySQL)
Maintenance Window - Its databases have a defined maintenance window (during core hours).
Regional HA - Its databases have regional HA enabled.
Read Replicas - Its databases have one or more read replicas, and it uses them for reads that do not need strict consistency.
Missing Master - It keeps correctly serving idempotent requests with no side-effects when the master is unavailable (e.g. by sending all reads to the read replicas and returning internal errors for all other requests).
Operational Guidelines - Its databases are in compliance with the Cloud SQL Operational guidelines, so that they do not fall outside the Cloud SQL SLA.
Replication Lag - Alerts should be sent if replication lag (Seconds Behind Master in Stackdriver) is >300s.
CPU - CPU usage of each instance (including replicas) should be <50% during peak load, and alerts should be sent if it increases to >80%.
Minimal Data Privileges - It has one or more dedicated MySQL users (not root) that have only the bare minimum set of required privileges (e.g. only SELECT and INSERT, but no UPDATE, DELETE or any other DDL/admin privileges). If the service has both admin and non-admin endpoints, they should use different users with different permissions.
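The Read Replicas and Missing Master items together describe a routing policy. When the master is healthy, reads that tolerate replication lag go to a replica and everything else goes to the master. When the master is down, all reads go to replicas and every other request fails with an internal error. This is a minimal sketch of that decision. `route`, `Target` and `PrimaryUnavailable` are hypothetical names, and `read_only` stands for "idempotent with no side-effects".

```python
from enum import Enum


class Target(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


class PrimaryUnavailable(Exception):
    """Surfaced to the caller as an internal error while the master is down."""


def route(*, read_only: bool, strict_consistency: bool, primary_healthy: bool) -> Target:
    """Pick the Cloud SQL instance for a request (hypothetical policy helper).

    - Primary healthy: reads that tolerate replication lag go to a read replica;
      writes and strictly consistent reads go to the primary.
    - Primary unavailable: every read goes to a replica, everything else fails.
    """
    if not primary_healthy:
        if read_only:
            return Target.REPLICA
        raise PrimaryUnavailable("primary unavailable; rejecting non-read request")
    if read_only and not strict_consistency:
        return Target.REPLICA
    return Target.PRIMARY
```

While the master is down, strictly consistent reads are served from a replica and may be up to the replication lag behind. The Replication Lag alert above bounds how stale that can get before someone is notified.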
GCP Cloud Spanner
Regional Configuration - If it is a service deployed in a single region, its databases are in regional configuration and are deployed in the same region.
Global Configuration - If it is a service deployed in multiple regions, its databases are in multi-regional configuration and they are deployed in the same regions.
SLA Exclusions - Its databases are in compliance with the SLA exclusions, so that they do not fall outside of the Cloud Spanner SLA.
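The Regional Configuration and Global Configuration items can be checked mechanically in a deploy pipeline. This sketch assumes the `regional-<region>` naming used by Cloud Spanner regional instance configurations. Which regions a multi-region configuration spans is passed in by the caller, because this example does not query any API. The function name and the mapping are hypothetical.

```python
from typing import Iterable, Mapping, Set


def spanner_placement_ok(
    instance_config: str,
    service_regions: Iterable[str],
    multi_region_members: Mapping[str, Set[str]],
) -> bool:
    """Check the Regional / Global Configuration items for one Spanner instance.

    Assumes regional configs follow the "regional-<region>" naming; the
    multi_region_members mapping (config name -> regions it spans) is supplied
    by the caller and is not looked up here.
    """
    regions = set(service_regions)
    if not regions:
        raise ValueError("service must run in at least one region")
    if len(regions) == 1:
        (region,) = regions
        return instance_config == f"regional-{region}"
    members = multi_region_members.get(instance_config)
    # Every region the service runs in must be covered by the multi-region config.
    return members is not None and regions <= members
```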
This template is the Production Readiness Checklist (PRC) for Level A microservices. Please make sure you have read the PRC guidelines.
Production Readiness Review has the following 2 phases: the Design phase review, which should be held before development of the microservice begins, and the Pre-production phase review, which should be held before the production release is rolled out.
📉 Observability
preStop - Its preStop hook is configured; see Configure PreStop.
PodDisruptionBudget - It has a PodDisruptionBudget set as described in Configure Pod Disruption Budget.
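The preStop item concerns Kubernetes configuration. The application side is to stop reporting ready, reject new work, and let in-flight requests finish once SIGTERM arrives. This also ties back to the Readiness Probe item. This is a minimal, framework-agnostic sketch. `GracefulShutdown` is a hypothetical helper, and how it is wired into an actual HTTP server is left out.

```python
import signal
import threading


class GracefulShutdown:
    """Tracks readiness and in-flight requests so SIGTERM leads to a clean exit.

    Kubernetes runs the preStop hook first and then sends SIGTERM; both must
    finish within terminationGracePeriodSeconds.
    """

    def __init__(self) -> None:
        # RLock so the SIGTERM handler, which Python runs on the main thread,
        # cannot deadlock against main-thread code that already holds the lock.
        self._lock = threading.RLock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self._shutting_down = False

    def readiness_status(self) -> int:
        """HTTP status for the readiness endpoint: 503 once shutdown has begun."""
        with self._lock:
            return 503 if self._shutting_down else 200

    def begin_request(self) -> bool:
        """Register a new request; returns False if it should be rejected."""
        with self._lock:
            if self._shutting_down:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def on_sigterm(self, signum=None, frame=None) -> None:
        with self._lock:
            self._shutting_down = True

    def wait_for_drain(self, timeout: float) -> bool:
        """Block until in-flight requests finish; False if the timeout expires first."""
        with self._lock:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout=timeout)

    def install(self) -> None:
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGTERM, self.on_sigterm)
```

A PodDisruptionBudget complements this on the cluster side. It limits how many replicas voluntary disruptions (such as node drains) can take down at once, while the code above makes each individual termination clean.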