Incident Response Cookbook: Responding to Multi‑Vendor Cloud Outages
A practical SRE playbook for handling concurrent AWS and Cloudflare outages — with runbooks, SLA checks, and automation recipes.
When two cloud giants sneeze, your stack can catch a cold — fast
Multi‑vendor outages are no longer a rare headline. In late 2025 and early 2026, simultaneous incidents across edge and cloud providers (Cloudflare, AWS, and major social platforms) reminded teams that a single dependency failure can cascade into a full product blackout. If your team doesn't have a tested outage playbook for cross‑vendor incidents, recovery time and cost both spike.
Executive summary: What this incident response cookbook gives you
This is a practical, actionable playbook for developers and SREs facing concurrent provider failures. You'll get:
- Detect and classify checks that combine provider status APIs, active health probes, and telemetry
- Decision matrices for failover vs. degrade vs. rollback
- Runbooks for Cloudflare + AWS dual failures, with push‑button automations
- SLA and cost checks so you know when to invoke contractual remedies or scale back to reduce bill shock
- Sample scripts and Terraform snippets you can adapt and test in staging
Why multi‑vendor outages are the new normal (2026 perspective)
In 2026, three trends converge:
- Large providers introduce regional and sovereign clouds (for example, the AWS European Sovereign Cloud launched in early 2026), which increases isolation but also adds complexity for multi‑region architectures.
- Edge providers like Cloudflare now host critical DNS, WAF, and edge compute — making them single points of failure for many web apps.
- Shared control plane dependencies and Internet routing events mean separate vendors can fail concurrently, so multi‑vendor redundancy is required rather than optional.
That makes this playbook timely: you need repeatable actions that cover concurrent AWS + Cloudflare outages and cost‑aware mitigations.
First 10 minutes: Immediate triage checklist
Use this quick checklist the moment alerts spike. These steps prioritize detection, communications, and a stopgap mitigation.
- Confirm — aggregate provider status APIs and observability signals.
- Classify — is it DNS/edge, origin compute, storage, or network? Multiple classes may be affected.
- Communicate — open an incident bridge and post an initial status page update with expected cadence.
- Mitigate — apply preconfigured traffic shifts or partial degradations (read below for automation).
- Cost guardrails — enable auto‑throttles to prevent runaway scale and bills during retries.
Scripted status aggregation (example)
Run this quick aggregator to get provider states; it queries Cloudflare and AWS health endpoints and gives a one‑line summary. Adapt for your incident tooling (PagerDuty, Opsgenie, Slack).
#!/usr/bin/env bash
# status-check.sh: aggregate provider status into a quick summary (requires curl and jq)
set -euo pipefail
echo 'Cloudflare status:'
curl -s 'https://www.cloudflarestatus.com/api/v2/summary.json' | jq -r '.status.description'
echo 'AWS Health (public events):'
curl -s 'https://health.aws.amazon.com/public/status' | head -n 4
echo 'Third-party signals: Downdetector (headlines)'
curl -s 'https://downdetector.example/api/summary' | jq -r '.headline'
Classify the incident: Decision matrix
Use this matrix to decide whether to failover, degrade, or stay put. Base decisions on two axes: user impact and control plane availability.
- High user impact + control plane unavailable: trigger emergency failover or temporary redirect to static experience.
- High user impact + control plane available: scale read replicas, enable degraded feature flags.
- Low user impact + control plane unavailable: monitor and prepare; avoid risky mass changes.
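The two‑axis matrix above can be encoded so humans and automations reach the same verdict. This is a minimal sketch; the function and action names are assumptions for illustration, not part of any vendor API:

```python
# Hypothetical encoding of the decision matrix: two boolean axes
# (user impact, control plane availability) map to one recommended action.
def decide_action(high_user_impact: bool, control_plane_available: bool) -> str:
    """Return the playbook action for a given impact/control-plane state."""
    if high_user_impact and not control_plane_available:
        return "emergency_failover"  # or temporary redirect to a static experience
    if high_user_impact and control_plane_available:
        return "degrade"             # scale read replicas, flag off heavy features
    if not high_user_impact and not control_plane_available:
        return "monitor"             # prepare, avoid risky mass changes
    return "no_action"               # low impact, control plane healthy
```

Keeping the mapping in code makes game-day rehearsals reproducible: the same inputs always yield the same recommendation.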
Runbooks: Play actions for common multi‑vendor scenarios
Scenario A — Cloudflare CDN/DNS outage while AWS origin is healthy
- Open incident bridge and tag vendors: cloudflare, aws.
- Confirm Cloudflare status via API and Cloudflare dashboard.
- If DNS is impacted, switch authoritative DNS to a preconfigured backup (Route 53 or secondary DNS provider). Use preapproved DNS TTLs and migration steps.
- Temporarily reduce dynamic features that require heavy edge computation (e.g., image resizing at edge).
- Deploy a public static status page from S3 with CloudFront signed URLs as a fallback if DNS propagation is slow.
Scenario B — AWS S3/EC2 region outage while Cloudflare is unaffected
- Determine affected services via AWS Health API and service limits.
- Shift read traffic to replicas in unaffected regions using Route 53 weighted records or AWS Global Accelerator.
- For writes, enable a degraded mode: queue writes in a durable queue (SQS or third‑party) and surface a warning to users.
- Engage AWS Support with an incident key and log SLA‑relevant timestamps.
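The queue-and-degrade step for writes can be sketched as follows. An in-memory deque stands in purely for illustration; in production the buffer would be SQS or another durable queue, and `DegradedWriteQueue` is a hypothetical name:

```python
from collections import deque

class DegradedWriteQueue:
    """Buffer writes while the primary region is down, then replay in order.
    A durable queue (SQS or similar) would back this in production; the
    in-memory deque only illustrates the enqueue/replay contract."""

    def __init__(self):
        self._pending = deque()
        self.degraded = False  # flipped on by the incident playbook

    def write(self, record, commit_fn):
        """Commit immediately when healthy; defer (and warn users) when degraded."""
        if self.degraded:
            self._pending.append(record)
            return "queued"
        commit_fn(record)
        return "committed"

    def drain(self, commit_fn):
        """Replay queued writes once the region recovers; returns count replayed."""
        replayed = 0
        while self._pending:
            commit_fn(self._pending.popleft())
            replayed += 1
        return replayed
```

Surfacing the "queued" status to the caller is what lets the UI show the warning mentioned in the runbook step.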
Scenario C — Simultaneous Cloudflare + AWS partial outages
This is the hardest case because both edge and origin may be compromised. Use a conservative approach:
- Switch DNS to backup provider but keep TTLs low and only promote a backup when health checks pass.
- Expose a static fallback hosted in an alternative cloud or sovereign region (for example, a small S3/CloudFront deployment in the AWS European Sovereign Cloud if your contract permits).
- Enable feature flags to disable heavy backend features and reduce error amplification.
Automation recipes: Automate detection and safe mitigation
Automation reduces toil — but it must be safe and reversible. Store runbook versions in Git, run automations through CI, and require two‑step approvals for high‑impact actions.
Automated status -> action pipeline
- Poll provider status APIs and active health checks every 30s.
- Push normalized events to your incident bus (Kafka, Pub/Sub).
- Trigger playbooks in your orchestration tool (RunDeck, StackStorm, or serverless lambdas) with preconditions.
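Before events reach the bus, normalize each provider's payload into one shape. A sketch, assuming a Statuspage-style Cloudflare summary and a hypothetical `open_events` list produced by your AWS health poller:

```python
# Hypothetical normalizer: map each provider's status payload into one event
# shape before publishing to the incident bus. Field names are assumptions.
def normalize_status(provider: str, raw: dict) -> dict:
    if provider == "cloudflare":
        # Statuspage-style summary: {"status": {"indicator": "minor", ...}}
        indicator = raw.get("status", {}).get("indicator", "unknown")
        degraded = indicator not in ("none", "unknown")
    elif provider == "aws":
        # Assumed shape from our poller: {"open_events": [...]}
        degraded = len(raw.get("open_events", [])) > 0
    else:
        degraded = False  # unknown providers never trigger playbooks
    return {"provider": provider, "degraded": degraded, "raw": raw}
```

Defaulting unknown providers to "not degraded" is deliberate: automations should only fire on signals you have explicitly mapped.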
Example: Automatic DNS failover using Route 53 and a health probe
# Pseudocode: fail over when Cloudflare reports a partial outage but the origin is healthy
if cloudflare_status != 'operational' && origin_health == 'healthy' then
  # Drain the primary weighted record (the UPSERT must stay a valid record set: TTL and records required; 198.51.100.10 is a documentation placeholder)
  aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"www.example.com.","Type":"A","SetIdentifier":"primary","Weight":0,"TTL":60,"ResourceRecords":[{"Value":"198.51.100.10"}]}}]}'
  # Promote the backup record to take all traffic
  aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"www.example.com.","Type":"A","SetIdentifier":"backup","Weight":100,"TTL":60,"ResourceRecords":[{"Value":"203.0.113.10"}]}}]}'
fi
Note: protect DNS automations with multi‑person approval and audit trails.
Circuit breaker & retry policies
Implement a circuit breaker in your API gateway or client SDKs. When a backend crosses an error threshold, stop retries to avoid increasing load and cost.
# example: pseudo config for circuit breaker
circuit_breaker:
  failure_threshold: 0.05   # 5% errors in 60s
  cooldown_seconds: 120
  max_concurrent_requests: 200
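A minimal rolling-window breaker matching that config might look like the sketch below. Concurrency limiting is omitted, and the class and field names are illustrative, not a real gateway API:

```python
import time

class CircuitBreaker:
    """Open the circuit when the error rate over a rolling window crosses
    failure_threshold; reject calls until cooldown_seconds elapse, then
    half-open to let probe traffic through."""

    def __init__(self, failure_threshold=0.05, cooldown_seconds=120,
                 window=60, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.window = window
        self.clock = clock      # injectable for testing
        self.events = []        # (timestamp, ok) samples
        self.opened_at = None

    def allow(self) -> bool:
        """True if a request may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            self.opened_at = None   # half-open: probe traffic allowed again
            self.events.clear()
            return True
        return False

    def record(self, ok: bool):
        """Record a request outcome and open the circuit if the windowed
        failure rate exceeds the threshold."""
        now = self.clock()
        self.events.append((now, ok))
        self.events = [(t, o) for t, o in self.events if now - t <= self.window]
        failures = sum(1 for _, o in self.events if not o)
        if self.events and failures / len(self.events) > self.failure_threshold:
            self.opened_at = now
```

The injectable clock keeps the breaker testable in game days without real two-minute waits.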
SLA checks and remediation steps
During an outage you need accurate SLA tracking to decide whether to pursue credits and whether to escalate commercially.
- Collect timestamps for incident start and end from both your telemetry and provider status pages.
- Record the exact regions and API endpoints impacted.
- Calculate downtime against SLA using your contract definitions (e.g., monthly uptime percentage).
Quick SLA formula
Compute downtime percentage for the month:
# downtime_percent = (total_downtime_minutes_in_month / total_minutes_in_month) * 100
Keep in mind many agreements exclude scheduled maintenance and force majeure. Log everything.
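One common interpretation of that formula, with excluded maintenance windows subtracted from the countable downtime, can be sketched as below. Check your own contract: some agreements adjust the denominator instead.

```python
def downtime_percent(downtime_minutes: float, days_in_month: int,
                     excluded_minutes: float = 0.0) -> float:
    """Monthly downtime percentage after removing contractually excluded
    minutes (scheduled maintenance, force majeure) from the numerator."""
    countable = max(downtime_minutes - excluded_minutes, 0.0)
    total_minutes = days_in_month * 24 * 60
    return (countable / total_minutes) * 100.0
```

For example, 43.2 minutes of downtime in a 30-day month is 0.1% downtime, i.e. 99.9% uptime, which is exactly the boundary in many monthly-uptime SLAs.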
Cost optimization during incidents
Incidents can cause runaway costs: autoscaling floods, repeated retries, and cross‑region data transfer. Use these strategies to limit spend while preserving essential functionality.
- Enable cost quotas: temporary caps on autoscaling groups and serverless concurrency.
- Throttle retries: reduce exponential backoff windows and add jitter to prevent stampedes.
- Limit cross‑region replication: suspend noncritical replication tasks during incident windows.
- Static degrade mode: route users to a low‑cost static site to preserve brand and reduce dynamic compute.
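The "throttle retries and add jitter" guardrail above is commonly implemented as a full-jitter backoff schedule; here is a sketch with illustrative defaults for base delay, cap, and attempt count:

```python
import random

def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random.random):
    """Full-jitter retry schedule: each delay is a random amount in
    [0, min(cap, base * 2**attempt)], so clients back off exponentially
    but never retry in lockstep (which would stampede a recovering backend)."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]
```

The cap matters during incidents: without it, deep retry attempts would compute huge windows, and with it, sustained failures settle into a bounded, jittered trickle instead of a thundering herd.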
Observability: what to monitor during a multi‑vendor outage
Combine synthetic and real user monitoring with provider signals.
- Synthetics: global HTTP checks at 30s cadence, DNS resolution checks, TLS handshake tests.
- RUM (Real User Monitoring): capture geographic distribution of failures to isolate region impact.
- Provider APIs: Cloudflare status API, AWS Health API, Cloud provider incident timelines.
- Cost telemetry: track egress, API request counts, and serverless invocations in real time.
Communication templates: Internal and external
Keep messages short and scheduled. Use this cadence:
- Initial: within 10 minutes — what we know, what we are doing, next update in N minutes.
- Progress: every 15–30 minutes — changes and mitigations.
- Resolution: summary and root cause timeline plus next steps (postmortem schedule).
External update (example)
We are currently experiencing service disruption affecting parts of our web and API traffic due to concurrent incidents reported by our edge and cloud providers. Our engineers are executing the multi‑vendor playbook: switching DNS to our secondary provider and enabling a static degraded mode. Next update in 30 minutes.
Testing and rehearsal: Make your playbook battle‑ready
Runbooks are only useful if rehearsed. Schedule quarterly game days focused on multi‑vendor scenarios. Include:
- Controlled DNS failovers with whitelisted traffic.
- Simulated provider status reports to validate automations.
- Cost impact drills: measure bill impact during a scaled failover.
Postmortem & continuous improvement
After resolution, produce a blameless postmortem with timestamped evidence, decision rationale, and follow‑ups. Mandatory outputs:
- Root cause and propagation path.
- What the runbook executed and what failed.
- Action items with owners and deadlines (e.g., add a secondary DNS provider, add additional health probes).
Advanced patterns & future proofing (2026+)
As cloud vendors introduce sovereign and specialized clouds, your architecture must evolve.
- Multi‑control‑plane architecture: run minimal control capabilities in two or more provider ecosystems to reduce single control plane failure risk.
- Edge fallback islands: prebuild small, cold‑startable static experiences in multiple clouds that can be promoted rapidly.
- Immutable automation policies: signed and auditable runbooks stored in Git, executed through automated playbooks requiring one or two approvals depending on blast radius.
- Contract mapping: keep a living map of vendor SLAs, P1 contacts, and legal remedies for each region, including sovereign cloud differences.
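The contract map can be as simple as a keyed lookup kept in the runbook repo. Every vendor, region, SLA figure, and contact below is a placeholder, not real contract data:

```python
# Hypothetical contract map: one record per vendor/region with the SLA
# uptime target and the P1 escalation contact. All values are placeholders.
CONTRACTS = {
    ("aws", "eu-central-1"):  {"sla_uptime": 99.99, "p1_contact": "aws-tam@example.com"},
    ("aws", "eu-sovereign"):  {"sla_uptime": 99.9,  "p1_contact": "sovereign@example.com"},
    ("cloudflare", "global"): {"sla_uptime": 100.0, "p1_contact": "cf-ent@example.com"},
}

def sla_breached(vendor: str, region: str, measured_uptime: float) -> bool:
    """True only when a mapped contract exists and measured uptime fell below it."""
    entry = CONTRACTS.get((vendor, region))
    return entry is not None and measured_uptime < entry["sla_uptime"]
```

Returning False for unmapped vendor/region pairs is intentional: you should never claim a contractual breach you cannot tie to a specific agreement.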
Sample playbook checklist (copy & paste)
- [ ] Alert triage: confirm multi‑vendor incident
- [ ] Open incident bridge and assign roles (comm, tech lead, vendor liaison)
- [ ] Run status aggregation script
- [ ] Classify incident (DNS/Edge, Origin, Network)
- [ ] If DNS impacted: prepare DNS failover procedure
- [ ] If origin impacted: enable degraded mode and queue writes
- [ ] Start cost guardrails (throttle autoscaling, cap serverless)
- [ ] Communicate externally at agreed cadence
- [ ] Record all timeline events for SLA calculations
- [ ] After resolution: run postmortem and follow up
Real‑world example: What happened in Jan 2026
In mid‑January 2026, multiple platforms reported increased outage signals tied to edge and social networks. Teams that recovered fastest shared these traits: prebuilt static fallbacks, DNS secondary providers, and automated health checks that triggered safe failovers. Use that incident as a model: assume edge and origin might fail simultaneously and script accordingly.
Final takeaways
- Prepare for multi‑vendor incidents: don't assume a single provider SLA shields you from simultaneous failures.
- Automate carefully: use preconditions, approvals, and rehearsals.
- Measure cost impact: guardrails reduce bill shock during recovery.
- Document and iterate: postmortems should drive runbook improvements and vendor negotiations.
Call to action
Start your first multi‑vendor playbook today: copy the checklist above into your incident runbook repository, schedule a tabletop exercise for this quarter, and run the status aggregation script against your production footprint. Need help designing a tailored playbook or running a game day? Contact our SRE consultants or sign up for our workshop to build and test your multi‑vendor outage runbooks.