AWS Spot Instances: Complete Guide to 90% Cost Cuts

AWS Spot Instances offer discounts of up to 90% versus on-demand pricing, one of the largest cost levers in cloud computing. The catch: AWS can reclaim them with two minutes' notice. Here's how to architect for Spot safely and which workloads are genuinely suitable.

How Spot Instances Work

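Spot Instances run on AWS's spare EC2 capacity. You no longer bid for that capacity: since AWS changed the pricing model in 2017, you pay the current Spot price, which moves gradually with supply and demand, and you can optionally set a maximum price. If AWS needs the capacity back, it sends a 2-minute interruption notice, then terminates, stops, or hibernates your instance, depending on the interruption behavior you configured. In exchange, you typically get a 60–90% discount versus on-demand prices.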

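Spot interruption rates are lower than most engineers assume. For many instance types in major regions, the interruption frequency is in the range of 5–10% per month or lower; the AWS Spot Instance Advisor lists the current figure for each instance type and region. Graviton (arm64) instance types often show low interruption rates, plausibly because their capacity pools are separate from the x86 pools and see different demand.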

Workloads That Work Well on Spot

// good for spot

CI/CD build pipelines
Batch data processing (Spark, EMR)
ML training jobs
Stateless web tier (behind load balancer)
Video transcoding
Game servers (with checkpoint support)
Selenium test grids

// not suitable

Primary databases
Single-instance critical services
Stateful workloads without checkpointing
Long-running jobs without restart logic
Anything requiring guaranteed uptime SLA

Handling Interruptions Gracefully

AWS sends a 2-minute interruption notice via instance metadata and EventBridge. Your application must handle this. The pattern:

# Request an IMDSv2 session token (valid for up to 6 hours)
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# Check for a Spot interruption notice (returns 404 until one is issued)
STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
  -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action)

# If 200 returned: drain connections, checkpoint state, graceful shutdown
if [ "$STATUS" = "200" ]; then
  echo "Spot interruption notice received"
fi
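The same check can be made from application code. The following is a minimal sketch, assuming the notice body served at the IMDS path spot/instance-action is a JSON document with an `action` field and a UTC `time` field. The helper names (`fetch_instance_action`, `parse_interruption_notice`, `seconds_until`) are hypothetical, not part of any AWS SDK.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw notice body, or None if no interruption is scheduled (HTTP 404)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # no interruption notice yet
        raise


def parse_interruption_notice(body):
    """Parse the notice JSON into (action, interruption time as an aware datetime)."""
    data = json.loads(body)
    when = datetime.strptime(data["time"], "%Y-%m-%dT%H:%M:%SZ")
    return data["action"], when.replace(tzinfo=timezone.utc)


def seconds_until(when, now=None):
    """Seconds left before the interruption; never negative."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

A worker could call `fetch_instance_action()` every few seconds and, once it returns a body, use `seconds_until` to decide how much draining or checkpointing still fits in the remaining window.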

For batch jobs: checkpoint progress every 5–10 minutes, so an interruption costs at most that much rework. For stateless web: rely on your load balancer's connection draining. For Kubernetes: run the AWS Node Termination Handler (available as a Helm chart), which cordons and drains the node when the notice arrives.
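For the batch case, the checkpoint-and-resume idea might look like the sketch below. It is illustrative only: the function names are hypothetical, it checkpoints by item count rather than wall-clock time to stay deterministic, and it writes to a local file, whereas a real job would keep the checkpoint on storage that outlives the instance (S3 or EFS, for example).

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item (0 if no checkpoint exists)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)  # atomic rename over the old checkpoint


def run_job(items, process, path, checkpoint_every=100, should_stop=lambda: False):
    """Process items starting at the last checkpoint; return the index reached."""
    i = load_checkpoint(path)
    while i < len(items):
        if should_stop():  # e.g. an interruption notice was seen
            break
        process(items[i])
        i += 1
        if i % checkpoint_every == 0:
            save_checkpoint(path, i)
    save_checkpoint(path, i)  # final checkpoint on completion or early stop
    return i
```

Passing a `should_stop` callback that checks for the interruption notice lets the job write a final checkpoint inside the 2-minute window; a replacement instance then calls `run_job` with the same path and continues where the previous one stopped.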

Architecture Patterns

Pattern 1: Diversified instance fleet. Never use a single instance type. Request 6–10 instance types across several families simultaneously (m5, m5a, m5n, m4, m6i, r5, etc.). Each instance type in each Availability Zone is a separate capacity pool, so if one pool is interrupted, capacity remains in the others.
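To see why diversification helps, it can be useful to count the capacity pools a request draws from. The sketch below assumes a pool is identified by an (instance type, Availability Zone) pair; the function name and the AZ names are illustrative.

```python
from itertools import product


def capacity_pools(instance_types, availability_zones):
    """Enumerate Spot capacity pools: one per (instance type, Availability Zone) pair."""
    return [f"{t}/{az}" for t, az in product(instance_types, availability_zones)]


types = ["m5.xlarge", "m5a.xlarge", "m5n.xlarge", "m4.xlarge", "m6i.xlarge", "r5.xlarge"]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # example AZ names

pools = capacity_pools(types, zones)
print(len(pools))  # 18 pools, versus 3 with a single instance type
```

With six instance types across three zones, a reclaim in any one pool touches only a small slice of the fleet's possible sources of capacity.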

Pattern 2: Spot + On-Demand baseline. Run 20–30% of your fleet as on-demand, 70–80% as Spot. The on-demand instances absorb traffic during Spot interruptions. Net savings depend on the discount the Spot share actually gets: at a 70–90% Spot discount, roughly 50–70% versus all on-demand.
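The net-savings figure follows from simple arithmetic: only the Spot share of the fleet earns the discount. The sketch below assumes all instances have the same on-demand price; `blended_savings` is a hypothetical helper name.

```python
def blended_savings(spot_share, spot_discount):
    """Fraction saved versus an all on-demand fleet of the same size.

    Only the Spot share of the fleet earns the discount; the on-demand
    share is billed at full price.
    """
    if not (0.0 <= spot_share <= 1.0 and 0.0 <= spot_discount <= 1.0):
        raise ValueError("shares and discounts must be between 0 and 1")
    return spot_share * spot_discount


for share in (0.7, 0.8):
    for discount in (0.6, 0.75, 0.9):
        print(f"spot share {share:.0%}, discount {discount:.0%}: "
              f"saves {blended_savings(share, discount):.0%}")
```

Under these assumptions the result ranges from 42% (70% Spot at a 60% discount) to 72% (80% Spot at a 90% discount), so the upper end of the savings range requires discounts close to 90%.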

Mixed Instance Fleet with Auto Scaling

# Example: EC2 Auto Scaling mixed fleet configuration
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2          # Always keep 2 on-demand
    OnDemandPercentageAboveBaseCapacity: 20  # 20% on-demand above base
    SpotAllocationStrategy: "price-capacity-optimized"
  LaunchTemplate:
    # LaunchTemplateSpecification (the base launch template) omitted for brevity
    Overrides:
      - InstanceType: m5.xlarge
      - InstanceType: m5a.xlarge
      - InstanceType: m6i.xlarge
      - InstanceType: m6a.xlarge
      - InstanceType: m4.xlarge

Use the price-capacity-optimized allocation strategy. It favors pools with the most available Spot capacity while still accounting for price, which reduces interruption risk compared with pure lowest-price selection.
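How OnDemandBaseCapacity and OnDemandPercentageAboveBaseCapacity in the configuration above turn into instance counts can be sketched as follows. This is a rough model, not the Auto Scaling implementation: the function name is hypothetical, and rounding a fractional on-demand share up is an assumption.

```python
import math


def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) instance counts for a desired capacity.

    Assumption: a fractional on-demand share above the base is rounded up.
    """
    base = min(desired, on_demand_base)  # the base is filled with on-demand first
    above_base = desired - base
    extra_on_demand = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + extra_on_demand
    return on_demand, desired - on_demand


print(split_capacity(12, 2, 20))  # (4, 8)
```

With a desired capacity of 12, the base takes 2 on-demand instances, 20% of the remaining 10 adds 2 more, and the other 8 run as Spot.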

// FAQ

Is Spot safe for production?

Yes — with proper architecture. Stateless services behind a load balancer with diversified instance types and a 20% on-demand baseline run on Spot reliably. Companies like Netflix, Lyft, and Airbnb run significant production workloads on Spot.

Should I use a tool like Spot.io instead of managing Spot directly?

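For teams running $20K+/month of compute, it is often worth it. Spot.io automates fleet diversification, interruption handling, and fallback logic, typically delivering 60–80% savings while limiting the impact of interruptions. It cannot remove interruption risk entirely, because AWS can still reclaim the underlying capacity. The ROI is often cited as 10–30x the tool cost, but it depends on your spend and on how much of this you would otherwise build yourself.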
