# Customer-cloud day-2 operations checklist

Use this before shipping a Helm chart, Terraform module, Docker image, appliance, or customer-cloud deployment. The question is not "can the customer install it?" The question is whether you can operate it after it leaves your cloud.

Product:

Customer / segment:

Deployment target:

- [ ] AWS
- [ ] GCP
- [ ] Azure
- [ ] Kubernetes
- [ ] Single VM
- [ ] Air-gapped / offline

Install method:

- [ ] CloudFormation
- [ ] Terraform
- [ ] Helm
- [ ] CLI
- [ ] Manual runbook
- [ ] Other:

## 1. Version Inventory

| Question | Answer |
|---|---|
| Which release is each customer running right now? | |
| Which environments exist per customer: production, staging, dev, POC, region-specific? | |
| Which image digest is running? | |
| Which config schema is installed? | |
| Which dependency / base-image versions are present? | |
| How quickly can support answer the questions above without asking the customer? | |

Pass condition:

- [ ] Support can list every customer deployment, environment, and version from one place.
- [ ] The inventory includes image digests, config schema version, and release channel.
- [ ] The customer does not have to run a command for you to know what is installed.
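
As a rough illustration of the pass condition above, here is a minimal sketch of an inventory record, assuming each deployment reports its own state to one central store. The `Deployment` fields and the `list_versions` helper are hypothetical names chosen to mirror the checklist, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """One installed copy of the product (all field names are illustrative)."""
    customer: str
    environment: str    # e.g. "production", "staging", "poc"
    release: str        # human-readable release tag
    image_digest: str   # immutable digest of the running image
    config_schema: str  # version of the installed config schema
    channel: str        # release channel, e.g. "stable"


def list_versions(inventory: list[Deployment]) -> dict[tuple[str, str], str]:
    """Map (customer, environment) to the release it is running."""
    return {(d.customer, d.environment): d.release for d in inventory}


# Sample data, invented for the example.
inventory = [
    Deployment("acme", "production", "1.4.2", "sha256:aaa", "v3", "stable"),
    Deployment("acme", "staging", "1.5.0", "sha256:bbb", "v4", "beta"),
    Deployment("globex", "production", "1.4.2", "sha256:aaa", "v3", "stable"),
]
versions = list_versions(inventory)
```

If support can build `versions` from data the deployments already report, nobody has to ask the customer what is installed.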

## 2. Security Patch Path

| Question | Answer |
|---|---|
| How do we know which deployments are affected by a CVE? | |
| How do we ship the patched build? | |
| Which customers require approval before the patch rolls? | |
| How do we prove who is patched and who is not? | |
| What emergency time-to-patch target do we commit to internally? | |

Pass condition:

- [ ] A patch can target only vulnerable deployments.
- [ ] The release path works without SSH, VPN, or a support screen-share.
- [ ] Approval-gated customers can wait without blocking everyone else.
- [ ] The team can produce a patched/unpatched report.
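
One way to make "target only vulnerable deployments" concrete is to match on image digest. The sketch below assumes the inventory is available as a mapping from a deployment id to its running digest; `affected` and `patch_report` are hypothetical helpers, and the digests are invented.

```python
def affected(fleet: dict[str, str], vulnerable: set[str]) -> list[str]:
    """Return the deployment ids whose running digest is known to be vulnerable."""
    return sorted(dep for dep, digest in fleet.items() if digest in vulnerable)


def patch_report(fleet: dict[str, str], vulnerable: set[str]) -> dict[str, str]:
    """Label every deployment so the patched/unpatched split can be shown."""
    return {
        dep: ("vulnerable" if digest in vulnerable else "ok")
        for dep, digest in fleet.items()
    }


# Sample fleet: deployment id -> running image digest (invented values).
fleet = {
    "acme/production": "sha256:aaa",
    "acme/staging": "sha256:bbb",
    "globex/production": "sha256:aaa",
}
vulnerable = {"sha256:aaa"}

targets = affected(fleet, vulnerable)

# Simulate one deployment picking up the patched build.
fleet["acme/production"] = "sha256:ccc"
report = patch_report(fleet, vulnerable)
```

The same data answers both questions: `targets` says who needs the patch, and `report` is the evidence of who still runs the vulnerable image afterwards.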

## 3. Rollback Path

| Question | Answer |
|---|---|
| What makes a release rollbackable? | |
| Who can trigger rollback? | |
| Can one customer roll back without affecting the fleet? | |
| What happens to migrations or state changes? | |
| How is rollback audited? | |

Pass condition:

- [ ] Rollback is a supported operation, not an incident improvisation.
- [ ] Rollback can be scoped to one customer, one cohort, or the whole fleet.
- [ ] Stateful changes have a written recovery plan.
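
Scoped rollback can be sketched as a selection step that runs before anything is changed. This assumes each deployment carries a customer and a cohort label; the `Deployment` type, the scope names, and `rollback_targets` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    customer: str
    cohort: str  # illustrative grouping, e.g. "early-access"


def rollback_targets(
    fleet: list[Deployment], scope: str, value: Optional[str] = None
) -> list[Deployment]:
    """Select the deployments a rollback would touch for the given scope."""
    if scope == "fleet":
        return list(fleet)
    if scope == "cohort":
        return [d for d in fleet if d.cohort == value]
    if scope == "customer":
        return [d for d in fleet if d.customer == value]
    raise ValueError(f"unknown rollback scope: {scope!r}")


fleet = [
    Deployment("acme", "early-access"),
    Deployment("globex", "general"),
    Deployment("initech", "early-access"),
]
one_customer = rollback_targets(fleet, "customer", "acme")
one_cohort = rollback_targets(fleet, "cohort", "early-access")
everyone = rollback_targets(fleet, "fleet")
```

Rejecting unknown scopes outright keeps a typo from silently turning into a fleet-wide rollback.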

## 4. Telemetry Contract

| Signal | What leaves the customer environment? | Redaction / bounds | Retention |
|---|---|---|---|
| Health | | | |
| Logs | | | |
| Metrics | | | |
| Traces | | | |
| Deployment state | | | |
| Command results | | | |

Pass condition:

- [ ] The customer approved the telemetry contract before install.
- [ ] Operational telemetry excludes customer payload data by default.
- [ ] Every signal is tagged by deployment, release, and environment.
- [ ] Support can diagnose common failures without asking for a log dump.
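
The tagging and default-exclusion conditions above could be enforced at the point where an event is built. This is a minimal sketch assuming an allowlist approach; `ALLOWED_FIELDS`, `build_event`, and the field names are hypothetical.

```python
# Hypothetical allowlist: only these operational fields may leave the
# customer environment. Anything else is dropped by default.
ALLOWED_FIELDS = {"status", "latency_ms", "error_code"}


def build_event(
    signal: str,
    fields: dict,
    *,
    deployment: str,
    release: str,
    environment: str,
) -> dict:
    """Tag an event and strip every field that is not explicitly allowed."""
    safe = {k: v for k, v in fields.items() if k in ALLOWED_FIELDS}
    return {
        "signal": signal,
        "deployment": deployment,
        "release": release,
        "environment": environment,
        "fields": safe,
    }


event = build_event(
    "health",
    {"status": "degraded", "latency_ms": 840, "request_body": "customer payload"},
    deployment="acme-prod",
    release="1.4.2",
    environment="production",
)
```

Here `request_body` never makes it into `event`, because exclusion is the default and inclusion has to be listed.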

## 5. Command Path

| Question | Answer |
|---|---|
| What commands can support run? | |
| Where are handlers defined in source? | |
| What permissions does each handler need? | |
| What does each handler return? | |
| Who approves new handlers? | |
| How are command invocations audited? | |

Pass condition:

- [ ] There is no generic shell or `exec` endpoint.
- [ ] Commands are named handlers the customer can review.
- [ ] Command responses are bounded and redacted where needed.
- [ ] Every invocation is logged with actor, time, deployment, input, and result.
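
A command path that meets these conditions might look like a registry of named handlers with an audit record per call. The sketch below is illustrative only: the `handler` decorator, `invoke`, the `queue_depth` handler, and the `MAX_RESPONSE_CHARS` bound are all hypothetical, and the handler returns a fixed string instead of querying a real deployment.

```python
from datetime import datetime, timezone
from typing import Callable

MAX_RESPONSE_CHARS = 200  # hypothetical bound on what a handler may return

HANDLERS: dict[str, Callable[[dict], str]] = {}
AUDIT_LOG: list[dict] = []


def handler(name: str):
    """Register a named handler; anything not registered cannot be run."""
    def register(fn: Callable[[dict], str]) -> Callable[[dict], str]:
        HANDLERS[name] = fn
        return fn
    return register


@handler("queue_depth")
def queue_depth(args: dict) -> str:
    # A real handler would inspect the deployment; this returns a fixed value.
    return "queue=orders depth=17"


def invoke(actor: str, deployment: str, name: str, args: dict) -> str:
    """Run a named handler, bound its response, and audit the attempt."""
    if name not in HANDLERS:
        result = "rejected: unknown command"
    else:
        result = HANDLERS[name](args)[:MAX_RESPONSE_CHARS]
    AUDIT_LOG.append({
        "actor": actor,
        "time": datetime.now(timezone.utc).isoformat(),
        "deployment": deployment,
        "command": name,
        "input": args,
        "result": result,
    })
    return result


ok = invoke("alice", "acme-prod", "queue_depth", {})
denied = invoke("alice", "acme-prod", "exec", {"cmd": "ls"})
```

Rejected attempts are logged too, so the audit trail shows what was tried as well as what ran.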

## 6. Customer Approval Gates

| Question | Answer |
|---|---|
| Which customers require change approval? | |
| Which changes require approval? | |
| Who approves on the customer side? | |
| How do we ship to everyone else while waiting? | |
| How do we handle emergency security fixes? | |

Pass condition:

- [ ] Approval gates are per-customer, not fleet-wide.
- [ ] Waiting customers remain visible as waiting, not silently stale.
- [ ] Emergency patch handling is agreed before the emergency.
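
Per-customer gating can be sketched as a status computed for each deployment independently, so one pending approval never holds up anyone else. The `Rollout` type, the status strings, and the `status` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    customer: str
    current: str             # release the customer is running
    approval_required: bool  # does this customer gate changes?
    approved: bool = False   # has the customer approved the target release?


def status(r: Rollout, target: str) -> str:
    """Classify one customer against the target release."""
    if r.current == target:
        return "current"
    if r.approval_required and not r.approved:
        return "waiting_for_approval"
    return "eligible"


fleet = [
    Rollout("acme", "1.4.2", approval_required=True),
    Rollout("globex", "1.4.2", approval_required=False),
    Rollout("initech", "1.5.0", approval_required=True, approved=True),
]
board = {r.customer: status(r, "1.5.0") for r in fleet}
```

A gated customer shows up explicitly as `waiting_for_approval`, which is what distinguishes "waiting" from "silently stale".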

## 7. Revocation and Offboarding

| Question | Answer |
|---|---|
| What credential, role, identity, or agent does the customer revoke? | |
| What stops immediately after revocation? | |
| What data remains in the customer's account? | |
| What telemetry or metadata remains in your management service? | |
| How do you prove access is gone? | |

Pass condition:

- [ ] The customer has a documented revocation action.
- [ ] Revocation does not require deleting customer data.
- [ ] After revocation, your team can no longer update, run commands on, or observe the deployment.
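
One possible way to show that access is gone is to attempt each capability after revocation and record that every attempt is denied. The sketch assumes each probe raises `PermissionError` when denied; `verify_revoked` and the stand-in probes are hypothetical and do not call any real cloud API.

```python
from typing import Callable


def verify_revoked(probes: dict[str, Callable[[], None]]) -> dict[str, bool]:
    """Return True for each capability whose probe was denied."""
    results = {}
    for name, probe in probes.items():
        try:
            probe()
            results[name] = False  # the call succeeded, so access remains
        except PermissionError:
            results[name] = True
    return results


def denied() -> None:
    """Stand-in for a call that fails once access is revoked."""
    raise PermissionError("access revoked")


def still_allowed() -> None:
    """Stand-in for a call that unexpectedly still works."""
    return None


results = verify_revoked({
    "update": denied,
    "command": denied,
    "observe": still_allowed,
})
fully_revoked = all(results.values())
```

In this sample run `observe` still succeeds, so `fully_revoked` is false and offboarding is not finished.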

## 8. Incident Drill

Pick one realistic failure and walk it through.

Failure:

- [ ] Bad release breaks startup.
- [ ] CVE in base image.
- [ ] Customer changed an IAM policy or rotated credentials.
- [ ] Queue trigger stopped delivering.
- [ ] Customer has no internet egress.
- [ ] Other:

Timeline:

| Minute | What happens | Who acts | Evidence |
|---|---|---|---|
| 0 | Detection | | |
| 15 | Triage | | |
| 30 | Customer impact known | | |
| 60 | Fix / rollback started | | |
| 120 | Proof of recovery | | |

Final decision:

- [ ] We can operate this deployment model.
- [ ] We need managed customer-cloud deployment before launch.
- [ ] We should narrow the customer promise.
- [ ] We should not sell this install path yet.
