Customer-cloud day-2 operations checklist

A checklist for operating software after it is deployed into customer cloud environments.

Use this before shipping a Helm chart, Terraform module, Docker image, appliance, or customer-cloud deployment. The question is not "can the customer install it?" The question is whether you can operate it after it leaves your cloud.

Product:

Customer / segment:

Deployment target:

  • AWS
  • GCP
  • Azure
  • Kubernetes
  • Single VM
  • Air-gapped / offline

Install method:

  • CloudFormation
  • Terraform
  • Helm
  • CLI
  • Manual runbook
  • Other:

1. Version Inventory

Question | Answer
Which release is each customer running right now?
Which environments exist per customer: production, staging, dev, POC, region-specific?
Which image digest is running?
Which config schema is installed?
Which dependency / base-image versions are present?
How quickly can support answer this without asking the customer?

Pass condition:

  • Support can list every customer deployment, environment, and version from one place.
  • The inventory includes image digests, config version, and release channel.
  • The customer does not have to run a command for you to know what is installed.
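
The inventory described above can be sketched as one record per customer environment. This is a minimal illustration, not a prescribed schema: the `Deployment` class, its field names, and the helper functions are assumptions chosen to mirror the questions in this section.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Deployment:
    """One row of a hypothetical inventory: one customer environment."""
    customer: str
    environment: str       # e.g. "production", "staging", "poc"
    release: str
    image_digest: str
    config_schema: str
    release_channel: str


def inventory_for(deployments, customer):
    """Return every environment recorded for one customer, sorted by name."""
    matches = [d for d in deployments if d.customer == customer]
    return sorted(matches, key=lambda d: d.environment)


def missing_fields(deployment):
    """List fields that are empty, i.e. facts support would have to ask for."""
    return [name for name, value in asdict(deployment).items() if not value]
```

A non-empty result from `missing_fields` is a quick signal that the pass condition is not met for that deployment.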

2. Security Patch Path

Question | Answer
How do we know which deployments are affected by a CVE?
How do we ship the patched build?
Which customers require approval before the patch rolls?
How do we prove who is patched and who is not?
What emergency patch deadline (time from disclosure to deployed fix) do we commit to internally?

Pass condition:

  • A patch can target only vulnerable deployments.
  • The release path works without SSH, VPN, or a support screen-share.
  • Approval-gated customers can wait without blocking everyone else.
  • The team can produce a patched/unpatched report.
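
One way to picture the targeting and reporting above is to classify each deployment by its running image digest. This is a sketch under assumptions: the dictionary keys, the three bucket names, and the idea of matching on digest alone are illustrative, not a required design.

```python
def patch_report(deployments, vulnerable_digests, approval_required):
    """Classify deployments for one CVE.

    deployments: list of dicts with "customer", "environment", "image_digest".
    vulnerable_digests: set of image digests known to contain the CVE.
    approval_required: set of customers who must approve before a patch rolls.
    """
    report = {"not_affected": [], "pending_rollout": [], "awaiting_approval": []}
    for d in deployments:
        key = (d["customer"], d["environment"])
        if d["image_digest"] not in vulnerable_digests:
            report["not_affected"].append(key)
        elif d["customer"] in approval_required:
            # Gated customers wait without blocking anyone else.
            report["awaiting_approval"].append(key)
        else:
            report["pending_rollout"].append(key)
    return report
```

Only the `pending_rollout` bucket receives the patched build immediately; the other two buckets make up the rest of the patched/unpatched report.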

3. Rollback Path

Question | Answer
What makes a release rollbackable?
Who can trigger rollback?
Can one customer roll back without affecting the fleet?
What happens to migrations or state changes?
How is rollback audited?

Pass condition:

  • Rollback is a supported operation, not an incident improvisation.
  • Rollback can be scoped to one customer, one cohort, or the whole fleet.
  • Stateful changes have a written recovery plan.
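
Scoped rollback can be sketched as a selection step followed by a state check. The function names, the `cohort` field, and the `reversible` flag on migrations are hypothetical; they only illustrate the three scopes and the stateful-change question above.

```python
def rollback_targets(deployments, scope, value=None):
    """Select deployments to roll back.

    scope is "customer", "cohort", or "fleet"; value names the customer
    or cohort and is ignored for "fleet".
    """
    if scope == "fleet":
        return list(deployments)
    if scope in ("customer", "cohort"):
        return [d for d in deployments if d[scope] == value]
    raise ValueError(f"unknown rollback scope: {scope}")


def needs_recovery_plan(release):
    """True if any migration in the release cannot simply be reversed."""
    return any(not m["reversible"] for m in release["migrations"])
```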

4. Telemetry Contract

Signal | What leaves the customer environment? | Redaction / bounds | Retention
Health
Logs
Metrics
Traces
Deployment state
Command results

Pass condition:

  • The customer approved the telemetry contract before install.
  • Operational telemetry excludes customer payload data by default.
  • Every signal is tagged by deployment, release, and environment.
  • Support can diagnose common failures without asking for a log dump.
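
The tagging and redaction rules above can be sketched as an allow-list applied before anything leaves the customer environment. The field names in `ALLOWED_FIELDS`, the length bound, and the event shape are all assumptions for illustration; the actual contract is whatever the customer approved.

```python
# Hypothetical allow-list: only these operational fields may leave.
ALLOWED_FIELDS = {"status", "error_code", "latency_ms"}
MAX_VALUE_LEN = 200  # assumed bound on any string value


def build_event(signal, fields, deployment, release, environment):
    """Build an outbound telemetry event.

    Fields not on the allow-list are dropped (payload data is excluded by
    default), string values are truncated, and every event carries the
    deployment, release, and environment tags.
    """
    safe = {}
    for key, value in fields.items():
        if key not in ALLOWED_FIELDS:
            continue
        safe[key] = value[:MAX_VALUE_LEN] if isinstance(value, str) else value
    return {
        "signal": signal,
        "deployment": deployment,
        "release": release,
        "environment": environment,
        "fields": safe,
    }
```

An allow-list is used here rather than a deny-list so that a newly added field stays inside the customer environment until someone deliberately adds it to the contract.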

5. Command Path

Question | Answer
What commands can support run?
Where are handlers defined in source?
What permissions does each handler need?
What does each handler return?
Who approves new handlers?
How are command invocations audited?

Pass condition:

  • There is no generic shell or exec endpoint.
  • Commands are named handlers the customer can review.
  • Command responses are bounded and redacted where needed.
  • Every invocation is logged with actor, time, deployment, input, and result.
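
A command path with named handlers and no generic exec endpoint might look like the registry below. This is a sketch, not a reference implementation: the `handler` decorator, the `get_release` command, and the in-memory `AUDIT_LOG` are hypothetical stand-ins for whatever the product actually defines.

```python
from datetime import datetime, timezone

HANDLERS = {}    # command name -> function; the only things support can run
AUDIT_LOG = []   # in-memory stand-in for a durable audit store


def handler(name):
    """Register a function as a named, reviewable command."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register


@handler("get_release")
def get_release(deployment, params):
    """Example handler: return a small, bounded result."""
    return {"release": deployment["release"]}


def invoke(actor, deployment, name, params):
    """Run a named handler and log the invocation; unknown names are refused."""
    if name in HANDLERS:
        result = HANDLERS[name](deployment, params)
    else:
        result = {"error": "unknown command"}
    AUDIT_LOG.append({
        "actor": actor,
        "time": datetime.now(timezone.utc).isoformat(),
        "deployment": deployment["id"],
        "command": name,
        "input": params,
        "result": result,
    })
    return result
```

Because `invoke` only dispatches to registered names, an arbitrary shell string is refused and still recorded in the audit log.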

6. Customer Approval Gates

Question | Answer
Which customers require change approval?
Which changes require approval?
Who approves on the customer side?
How do we ship to everyone else while waiting?
How do we handle emergency security fixes?

Pass condition:

  • Approval gates are per-customer, not fleet-wide.
  • Waiting customers remain visible as waiting, not silently stale.
  • Emergency patch handling is agreed before the emergency.
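
Per-customer gates can be sketched as a rollout plan that splits the fleet into customers who ship now and customers who are visibly waiting. The function name, its parameters, and the notion of an emergency pre-approval set are assumptions made for illustration.

```python
def plan_rollout(requires_approval, approved, emergency=False,
                 emergency_preapproved=frozenset()):
    """Split customers into "ship" and "waiting" for one release.

    requires_approval: dict of customer -> True if change approval is needed.
    approved: set of gated customers who have approved this release.
    emergency_preapproved: gated customers who agreed in advance that
        emergency security fixes may ship without per-release approval.
    """
    plan = {"ship": [], "waiting": []}
    for customer, gated in sorted(requires_approval.items()):
        cleared = (
            not gated
            or customer in approved
            or (emergency and customer in emergency_preapproved)
        )
        plan["ship" if cleared else "waiting"].append(customer)
    return plan
```

Keeping the `waiting` list explicit is what makes gated customers show up as waiting rather than silently stale.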

7. Revocation and Offboarding

Question | Answer
What credential, role, identity, or agent does the customer revoke?
What stops immediately after revocation?
What data remains in the customer's account?
What telemetry or metadata remains in your management service?
How do you prove access is gone?

Pass condition:

  • The customer has a documented revocation action.
  • Revocation does not require deleting customer data.
  • After revocation, your team can no longer update, command, or observe the deployment.

8. Incident Drill

Pick one realistic failure and walk it through.

Failure:

  • Bad release breaks startup.
  • CVE in base image.
  • Customer changed an IAM policy.
  • Queue trigger stopped delivering.
  • Customer has no internet egress.
  • Other:

Timeline:

Minute | What happens | Who acts | Evidence
0 | Detection | |
15 | Triage | |
30 | Customer impact known | |
60 | Fix / rollback started | |
120 | Proof of recovery | |

Final decision:

  • We can operate this deployment model.
  • We need managed customer-cloud deployment before launch.
  • We should narrow the customer promise.
  • We should not sell this install path yet.