This page is part of the FinOpsForge ontology, a structured library of named FinOps entities, each treated with consistent operations: define, implement, compare, calculate.
Crawl: Get Baseline Visibility
Step 1: Inventory Your AI Spend Sources
Before optimizing anything, know where you're spending. AI spend typically comes from three sources: direct provider APIs (OpenAI, Anthropic, Google AI — billed via credit card or direct invoice), cloud-native AI services (AWS Bedrock, Azure OpenAI Service, Google Vertex AI — appear in your cloud bill), and GPU infrastructure (EC2 P/G instances, GKE GPU nodes — traditional cloud compute). Each source requires a different visibility approach.
Step 2: Centralize API Keys
When each team manages its own API keys in separate accounts, there is no central view of spend. Consolidate under an organization-level account in each provider's dashboard and issue project-level keys beneath it. OpenAI Projects, Anthropic Workspaces, and Google Cloud projects all support this without requiring a single shared key. Set spend limits and email alerts per project, for example at $500 and $1,000 per month. This is the fastest path from zero visibility to baseline visibility.
Step 3: Set Spend Alerts
Every major AI provider supports spend alerts. Configure them before anything else — a runaway agent or a misconfigured batch job can generate thousands in unexpected spend within hours. Hard limits (which pause API access when reached) are appropriate for non-production environments. Soft alerts (email notification) are appropriate for production where uninterrupted service matters.
Walk: Attribution and Accountability
Step 4: Deploy an LLM Gateway
An LLM gateway (LiteLLM, Helicone, or Portkey) sits between your application and all AI providers. It provides a unified API across providers, centralized cost tracking with metadata attribution, automatic cost calculation per call, and logging for debugging and quality monitoring. Every LLM call through the gateway captures model, token counts, cost, latency, and any custom metadata you pass.
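To make the per-call record concrete, here is a minimal sketch of what a gateway might store and how cost could be derived from token counts. It does not use any real gateway's API: the `CallRecord` class, the model names, and the prices in `PRICES_PER_MILLION` are illustrative assumptions, not values taken from a provider.

```python
from dataclasses import dataclass, field

# Hypothetical price table in USD per 1M tokens; replace with current provider rates.
PRICES_PER_MILLION = {
    "small-model": {"input": 0.15, "output": 0.60},
    "large-model": {"input": 2.50, "output": 10.00},
}


@dataclass
class CallRecord:
    """One logged LLM call: model, token counts, latency, and custom metadata."""
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    metadata: dict = field(default_factory=dict)

    @property
    def cost_usd(self) -> float:
        # Cost = tokens * price per token, summed over input and output.
        price = PRICES_PER_MILLION[self.model]
        return (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        ) / 1_000_000


record = CallRecord(
    model="large-model",
    input_tokens=1_200,
    output_tokens=300,
    latency_ms=840.0,
    metadata={"team": "search", "environment": "production"},
)
print(f"{record.model}: ${record.cost_usd:.6f} ({record.metadata})")
```

In practice a gateway maintains the price table for you; the point of the sketch is only that cost is a function of model and token counts, so it can be attached to every call automatically.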
Step 5: Define Attribution Metadata
Decide what dimensions matter for your showback model before instrumenting. Standard dimensions: team, feature or product area, customer tier (enterprise/SMB/free), environment (production/staging/development), experiment ID. Consistency matters more than completeness — define the standard, enforce it in code review, and don't change it retroactively.
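One way to enforce a consistent standard is to validate attribution tags in code before they are attached to a call. The sketch below is an assumption about how that could look: the `Attribution` class, its field names, and the allowed values are hypothetical and simply mirror the dimensions listed above.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# Allowed values mirror the dimensions in the text; adjust to your own standard.
ALLOWED_TIERS = {"enterprise", "smb", "free"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "development"}


@dataclass(frozen=True)
class Attribution:
    team: str
    feature: str
    customer_tier: str
    environment: str
    experiment_id: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject incomplete or non-standard tags at construction time.
        if not self.team or not self.feature:
            raise ValueError("team and feature are required")
        if self.customer_tier not in ALLOWED_TIERS:
            raise ValueError(f"unknown customer_tier: {self.customer_tier!r}")
        if self.environment not in ALLOWED_ENVIRONMENTS:
            raise ValueError(f"unknown environment: {self.environment!r}")

    def as_metadata(self) -> dict:
        """Flatten to a dict suitable for passing as call metadata, omitting unset fields."""
        return {k: v for k, v in asdict(self).items() if v is not None}


tags = Attribution(
    team="search",
    feature="autocomplete",
    customer_tier="enterprise",
    environment="production",
)
print(tags.as_metadata())
```

Rejecting a value such as `environment="prod"` at construction time keeps variants of the same dimension from fragmenting the reports later.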
Step 6: Build Team-Level Showback Reports
Produce monthly reports showing each team's AI spend by model, with token count breakdowns and the trend versus the prior month. Deliver them where engineers already work: a Slack digest or a panel embedded in existing dashboards. The same cultural change management that works for cloud showback applies here: visibility first, accountability second.
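The aggregation behind such a report can be sketched in a few lines. This assumes call records are available as dictionaries with `team`, `model`, `cost_usd`, `input_tokens`, and `output_tokens` keys; those field names and the sample figures are hypothetical.

```python
from collections import defaultdict


def showback(records):
    """Sum cost and tokens per (team, model) pair."""
    totals = defaultdict(lambda: {"cost_usd": 0.0, "tokens": 0})
    for r in records:
        key = (r["team"], r["model"])
        totals[key]["cost_usd"] += r["cost_usd"]
        totals[key]["tokens"] += r["input_tokens"] + r["output_tokens"]
    return dict(totals)


def trend_pct(current, prior):
    """Percent change versus the prior month; None when there is no prior spend."""
    if prior == 0:
        return None
    return (current - prior) / prior * 100


# Hypothetical records for one month.
this_month = [
    {"team": "search", "model": "large-model", "cost_usd": 1.5,
     "input_tokens": 400_000, "output_tokens": 50_000},
    {"team": "search", "model": "large-model", "cost_usd": 2.5,
     "input_tokens": 700_000, "output_tokens": 75_000},
    {"team": "support", "model": "small-model", "cost_usd": 0.5,
     "input_tokens": 2_000_000, "output_tokens": 300_000},
]

report = showback(this_month)
for (team, model), row in sorted(report.items()):
    print(f"{team:10s} {model:12s} ${row['cost_usd']:.2f} {row['tokens']:,} tokens")
```

`trend_pct` returns `None` rather than raising when a team had no spend in the prior month, so new teams can be shown as "new" instead of breaking the report.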
Run: Optimization and Unit Economics
Step 7: Implement Prompt Caching
Anthropic (Claude), OpenAI (GPT-4o), and Google (Gemini) all offer prompt caching: repeated identical prompt prefixes are cached server-side, and cached tokens are billed at roughly 10–50% of standard input token pricing, depending on the provider. For applications with large system prompts, RAG context chunks, or repeated document content, caching can reduce input token costs by 60–80% on cache hits where the provider's discount is deep enough; the realized saving depends on both the cached-token rate and your hit rate. Audit your highest-volume prompts for caching eligibility first.
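The arithmetic behind a caching estimate is simple enough to sketch. The function names, the token volume, and the price below are hypothetical, and the model is deliberately simplified: it assumes a fixed fraction of input tokens is served from cache and ignores any surcharge a provider may apply when writing to the cache.

```python
def input_cost_with_caching(input_tokens, cached_fraction,
                            price_per_million, cached_price_ratio):
    """Input cost in USD when a fraction of tokens is billed at the cached rate.

    cached_price_ratio is the cached price as a share of the standard price,
    e.g. 0.10 for "10% of standard input pricing".
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    return (
        uncached * price_per_million
        + cached * price_per_million * cached_price_ratio
    ) / 1_000_000


def savings_pct(input_tokens, cached_fraction,
                price_per_million, cached_price_ratio):
    """Percent reduction in input cost relative to no caching."""
    baseline = input_tokens * price_per_million / 1_000_000
    with_cache = input_cost_with_caching(
        input_tokens, cached_fraction, price_per_million, cached_price_ratio
    )
    return (baseline - with_cache) / baseline * 100


# Hypothetical workload: 10M input tokens/month, 80% served from cache.
for ratio in (0.10, 0.50):
    pct = savings_pct(10_000_000, 0.8, 2.50, ratio)
    print(f"cached price at {ratio:.0%} of standard: {pct:.1f}% saved on input")
```

Running the same workload at both ends of the discount range shows why the audit should start with the prompts that have the largest repeated prefix: the saving scales with the cached fraction as much as with the discount.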
Step 8: Model Routing
Not every LLM task requires a frontier model. Classification, intent detection, extraction, and simple QA often perform equivalently on smaller, cheaper models (GPT-4o mini at $0.15 per 1M input tokens vs GPT-4o at $2.50 per 1M, roughly a 17x cost difference). Implement a routing layer that assigns each request to the cheapest capable model. A/B test quality impact before routing production traffic.
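A routing layer of this kind can start as a lookup table. The sketch below is one possible shape, not a prescribed design: the model names, prices, and the task-to-model capability sets are hypothetical and would in practice come from your own quality benchmarks.

```python
# Hypothetical catalog: input price in USD per 1M tokens and the task types
# each model has been benchmarked as "good enough" for.
MODELS = [
    {
        "name": "small-model",
        "input_price": 0.15,
        "tasks": {"classification", "intent_detection", "extraction", "simple_qa"},
    },
    {
        "name": "large-model",
        "input_price": 2.50,
        "tasks": {"classification", "intent_detection", "extraction",
                  "simple_qa", "complex_reasoning"},
    },
]


def route(task_type):
    """Return the name of the cheapest model capable of the given task type."""
    capable = [m for m in MODELS if task_type in m["tasks"]]
    if not capable:
        # Unbenchmarked task types fall back to the most expensive model,
        # used here as a stand-in for "most capable".
        return max(MODELS, key=lambda m: m["input_price"])["name"]
    return min(capable, key=lambda m: m["input_price"])["name"]


for task in ("classification", "complex_reasoning", "summarization"):
    print(f"{task} -> {route(task)}")
```

Falling back to the larger model for unknown task types keeps quality safe by default; a task only moves to the cheaper model after it has been benchmarked and added to that model's capability set.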
| Optimization Lever | Typical Savings | Complexity | Prerequisites |
|---|---|---|---|
| Centralize API keys + alerts | Catch waste within hours | Low | Admin access |
| LLM gateway (LiteLLM) | Enables all downstream optimizations | Low–Medium | Engineering time (1–2 days) |
| Prompt caching | 60–80% on input tokens (cache hits) | Low | Long system prompts or repeated context |
| Model routing | 50–80% on mixed workloads | Medium | Quality benchmarking per task type |
| Output length control | 10–30% on output tokens | Low | max_tokens parameter tuning |
| Self-hosted inference | 60–80% vs API at volume | High | 1B+ tokens/month, GPU infrastructure |
Step 9: Track Unit Economics
Cost per API call is an operational metric. Cost per user interaction, cost per document processed, cost per successful task completion — these are unit economics that belong in product reviews alongside revenue metrics. Calculate them monthly and put them next to performance and reliability metrics. When a new model or prompt change increases quality but also increases cost, unit economics makes the trade-off explicit and discussable.
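As a worked illustration, the unit metrics named above are each a ratio of monthly AI cost to a business-level count. The figures and dictionary keys below are hypothetical and exist only to show the calculation.

```python
def cost_per_unit(total_cost_usd, units):
    """Average AI cost in USD per business unit (interaction, document, task)."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total_cost_usd / units


# Hypothetical monthly figures for one feature.
monthly = {
    "ai_cost_usd": 4200.0,
    "user_interactions": 300_000,
    "documents_processed": 60_000,
    "successful_tasks": 21_000,
}

unit_economics = {
    name: cost_per_unit(monthly["ai_cost_usd"], monthly[name])
    for name in ("user_interactions", "documents_processed", "successful_tasks")
}

for name, value in unit_economics.items():
    print(f"cost per {name}: ${value:.4f}")
```

Dividing by successful tasks rather than all attempts is the stricter metric: a cheaper model that fails more often can end up with a higher cost per successful completion.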
Step 10: Anomaly Response Automation
Set automated spend limits per team or feature that pause non-critical AI workloads when spend exceeds a threshold, rather than relying on alerts that require a human to intervene. Production-critical workloads get soft alerts; batch and background workloads get hard stops. This prevents the scenario where a weekend pipeline runs unchecked and generates $10,000 in unexpected API costs before anyone notices on Monday morning.
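The decision rule described here can be sketched as a small function. The `Action` enum, the `budget_action` name, and the thresholds in the example are hypothetical; how the pause is actually enforced (revoking a key, scaling a worker to zero, rejecting requests at the gateway) is left out.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"   # under budget, nothing to do
    ALERT = "alert"   # soft alert: notify humans, keep serving
    PAUSE = "pause"   # hard stop: pause the workload automatically


def budget_action(spend_usd, threshold_usd, production_critical):
    """Decide what to do for one team or feature given its current spend."""
    if spend_usd <= threshold_usd:
        return Action.ALLOW
    # Over threshold: production-critical workloads are never paused automatically.
    return Action.ALERT if production_critical else Action.PAUSE


# Hypothetical workloads: (name, spend so far, threshold, production-critical?)
workloads = [
    ("checkout-assistant", 1_250.0, 1_000.0, True),
    ("weekend-batch-embeddings", 3_400.0, 500.0, False),
    ("internal-search", 120.0, 500.0, False),
]

for name, spend, threshold, critical in workloads:
    print(f"{name}: {budget_action(spend, threshold, critical).value}")
```

Keeping the rule as a pure function makes it easy to run on a schedule against gateway spend data and to test before it is allowed to pause anything.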