Customer-cloud day-2 operations checklist
A checklist for operating software after it is deployed into customer cloud environments.
Use this before shipping a Helm chart, Terraform module, Docker image, appliance, or customer-cloud deployment. The question is not "can the customer install it?" The question is whether you can operate it after it leaves your cloud.
Product:
Customer / segment:
Deployment target:
- AWS
- GCP
- Azure
- Kubernetes
- Single VM
- Air-gapped / offline
Install method:
- CloudFormation
- Terraform
- Helm
- CLI
- Manual runbook
- Other:
1. Version Inventory
| Question | Answer |
|---|---|
| Which release is each customer running right now? | |
| Which environments exist per customer: production, staging, dev, POC, region-specific? | |
| Which image digest is running? | |
| Which config schema is installed? | |
| Which dependency / base-image versions are present? | |
| How quickly can support answer this without asking the customer? | |
Pass condition:
- Support can list every customer deployment, environment, and version from one place.
- The inventory includes image digests, config version, and release channel.
- The customer does not have to run a command for you to know what is installed.
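The pass conditions above imply one record per deployment that support can query without contacting the customer. A minimal sketch of such a record, assuming hypothetical field names (`customer`, `environment`, `release`, `image_digest`, `config_schema`, `channel`) and an in-memory list standing in for whatever store you actually use:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deployment:
    # All field names are illustrative assumptions, not a real schema.
    customer: str
    environment: str    # e.g. "production", "staging", "dev"
    release: str        # release the deployment last reported
    image_digest: str   # immutable digest rather than a mutable tag
    config_schema: str  # installed config schema version
    channel: str        # release channel, e.g. "stable"

def inventory_for(deployments, customer):
    """Return one customer's deployments, sorted by environment name."""
    return sorted(
        (d for d in deployments if d.customer == customer),
        key=lambda d: d.environment,
    )

def releases_in_use(deployments):
    """Map each release to the set of customers currently running it."""
    by_release = {}
    for d in deployments:
        by_release.setdefault(d.release, set()).add(d.customer)
    return by_release
```

If `inventory_for` and `releases_in_use` can be answered from data the deployments report themselves, the "one place" condition is met; if either needs a customer to run a command first, it is not.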
2. Security Patch Path
| Question | Answer |
|---|---|
| How do we know which deployments are affected by a CVE? | |
| How do we ship the patched build? | |
| Which customers require approval before the patch rolls? | |
| How do we prove who is patched and who is not? | |
| What is the emergency patch clock we promise internally? | |
Pass condition:
- A patch can target only vulnerable deployments.
- The release path works without SSH, VPN, or a support screen-share.
- Approval-gated customers can wait without blocking everyone else.
- The team can produce a patched/unpatched report.
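Targeting only vulnerable deployments and producing a patched/unpatched report both reduce to matching the running image digest against a known-vulnerable set. A minimal sketch, assuming each deployment is a dict with hypothetical keys `customer`, `environment`, and `image_digest`:

```python
def affected_deployments(deployments, vulnerable_digests):
    """Select only deployments whose running image digest is known-vulnerable."""
    return [d for d in deployments if d["image_digest"] in vulnerable_digests]

def patch_report(deployments, vulnerable_digests):
    """Split the fleet by whether it still runs a vulnerable digest.

    "patched" here means "not on a vulnerable digest", so it also includes
    deployments that were never affected in the first place.
    """
    report = {"patched": [], "unpatched": []}
    for d in deployments:
        status = "unpatched" if d["image_digest"] in vulnerable_digests else "patched"
        report[status].append(f'{d["customer"]}/{d["environment"]}')
    return report
```

This only works if the inventory records digests; a version tag alone cannot tell you whether a rebuilt image with the fix is actually what is running.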
3. Rollback Path
| Question | Answer |
|---|---|
| What makes a release rollbackable? | |
| Who can trigger rollback? | |
| Can one customer roll back without affecting the fleet? | |
| What happens to migrations or state changes? | |
| How is rollback audited? | |
Pass condition:
- Rollback is a supported operation, not an incident improvisation.
- Rollback can be scoped to one customer, one cohort, or the whole fleet.
- Stateful changes have a written recovery plan.
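Scoped rollback can be sketched as a selector that resolves a scope to the deployments it touches, plus a check that a release with state changes has a recovery plan. The keys `customer`, `cohort`, `has_state_change`, and `recovery_plan` are assumptions made for illustration:

```python
def rollback_targets(deployments, scope, value=None):
    """Resolve a rollback scope to the deployments it touches.

    scope is "customer", "cohort", or "fleet"; value names the customer
    or cohort and is ignored for "fleet".
    """
    if scope == "fleet":
        return list(deployments)
    if scope in ("customer", "cohort"):
        if value is None:
            raise ValueError(f"scope {scope!r} needs a value")
        return [d for d in deployments if d[scope] == value]
    raise ValueError(f"unknown scope: {scope!r}")

def can_roll_back(release):
    """Rollbackable if the release changes no state, or has a written recovery plan."""
    return (not release["has_state_change"]) or bool(release.get("recovery_plan"))
```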
4. Telemetry Contract
| Signal | What leaves the customer environment? | Redaction / bounds | Retention |
|---|---|---|---|
| Health | | | |
| Logs | | | |
| Metrics | | | |
| Traces | | | |
| Deployment state | | | |
| Command results | | | |
Pass condition:
- The customer approved the telemetry contract before install.
- Operational telemetry excludes customer payload data by default.
- Every signal is tagged by deployment, release, and environment.
- Support can diagnose common failures without asking for a log dump.
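Excluding payload data by default and tagging every signal can be enforced at the point where an event is built. A minimal sketch, assuming a hypothetical allow-list (`ALLOWED_FIELDS`) and size bound (`MAX_VALUE_CHARS`); the real lists would come from the contract the customer approved:

```python
# Assumed allow-list: anything not named here never leaves the environment.
ALLOWED_FIELDS = {"status", "latency_ms", "error_code"}
MAX_VALUE_CHARS = 256  # assumed bound on any string value

def build_event(signal, fields, deployment, release, environment):
    """Build a telemetry event that is tagged, allow-listed, and size-bounded."""
    safe = {}
    for key, value in fields.items():
        if key not in ALLOWED_FIELDS:
            continue  # dropped by default, e.g. request bodies
        if isinstance(value, str):
            value = value[:MAX_VALUE_CHARS]
        safe[key] = value
    return {
        "signal": signal,
        "deployment": deployment,
        "release": release,
        "environment": environment,
        "fields": safe,
    }
```

An allow-list is used rather than a deny-list so that a newly added field is excluded until someone decides it is safe to send.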
5. Command Path
| Question | Answer |
|---|---|
| What commands can support run? | |
| Where are handlers defined in source? | |
| What permissions does each handler need? | |
| What does each handler return? | |
| Who approves new handlers? | |
| How are command invocations audited? | |
Pass condition:
- There is no generic shell or `exec` endpoint.
- Commands are named handlers the customer can review.
- Command responses are bounded and redacted where needed.
- Every invocation is logged with actor, time, deployment, input, and result.
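A command path built from named handlers can be sketched as a registry plus a single dispatch function that writes the audit record. Everything here is illustrative: the `queue_depth` handler, the in-memory `AUDIT_LOG`, and the `MAX_RESULT_CHARS` bound are assumptions, and the handler body is a placeholder:

```python
from datetime import datetime, timezone

HANDLERS = {}            # name -> function; the reviewable command surface
AUDIT_LOG = []           # stand-in for a durable audit store
MAX_RESULT_CHARS = 1024  # assumed bound on any command response

def handler(name):
    """Register a function as a named command handler."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("queue_depth")
def queue_depth(args):
    # Placeholder: a real handler would inspect the queue named in args.
    return {"queue": args["queue"], "depth": 0}

def invoke(actor, deployment, name, args):
    """Run a registered handler by name and audit the attempt either way."""
    if name in HANDLERS:
        result = repr(HANDLERS[name](args))[:MAX_RESULT_CHARS]
    else:
        result = "error: unknown command"  # no fallback to a shell
    AUDIT_LOG.append({
        "actor": actor,
        "time": datetime.now(timezone.utc).isoformat(),
        "deployment": deployment,
        "command": name,
        "input": args,
        "result": result,
    })
    return result
```

Rejected invocations are logged too, so an attempt to call something outside the registry leaves the same trail as a successful command.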
6. Customer Approval Gates
| Question | Answer |
|---|---|
| Which customers require change approval? | |
| Which changes require approval? | |
| Who approves on the customer side? | |
| How do we ship to everyone else while waiting? | |
| How do we handle emergency security fixes? | |
Pass condition:
- Approval gates are per-customer, not fleet-wide.
- Waiting customers remain visible as waiting, not silently stale.
- Emergency patch handling is agreed before the emergency.
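Per-customer gates mean a rollout plan has two explicit buckets rather than one blocked queue. A minimal sketch, assuming each customer is a dict with hypothetical keys `name` and `requires_approval`, and that approvals are recorded as `(customer, release)` pairs:

```python
def plan_rollout(customers, release, approvals):
    """Split customers into those that can ship now and those waiting on approval.

    approvals is a set of (customer_name, release) pairs already granted.
    """
    plan = {"ship_now": [], "waiting_approval": []}
    for c in customers:
        if c["requires_approval"] and (c["name"], release) not in approvals:
            plan["waiting_approval"].append(c["name"])
        else:
            plan["ship_now"].append(c["name"])
    return plan
```

Keeping `waiting_approval` as a named state in the plan is what makes a gated customer visible as waiting instead of silently stale.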
7. Revocation and Offboarding
| Question | Answer |
|---|---|
| What credential, role, identity, or agent does the customer revoke? | |
| What stops immediately after revocation? | |
| What data remains in the customer's account? | |
| What telemetry or metadata remains in your management service? | |
| How do you prove access is gone? | |
Pass condition:
- The customer has a documented revocation action.
- Revocation does not require deleting customer data.
- Your team can no longer update, command, or observe after revocation.
8. Incident Drill
Pick one realistic failure and walk it through.
Failure:
- Bad release breaks startup.
- CVE in base image.
- Customer rotated an IAM policy.
- Queue trigger stopped delivering.
- Customer has no internet egress.
- Other:
Timeline:
| Minute | What happens | Who acts | Evidence |
|---|---|---|---|
| 0 | Detection | | |
| 15 | Triage | | |
| 30 | Customer impact known | | |
| 60 | Fix / rollback started | | |
| 120 | Proof of recovery | | |
Final decision:
- We can operate this deployment model.
- We need managed customer-cloud deployment before launch.
- We should narrow the customer promise.
- We should not sell this install path yet.