First: Know Your Baseline
You can't optimize what you don't measure. Before applying any strategy, calculate your current monthly cost using the formula:
Log token usage per request for one week. You'll likely find 20% of requests consume 60% of your tokens — and those are the ones to optimize first.
🔤 Calculate your LLM baseline cost
Enter your current model, token counts, and request volume to get your monthly baseline before optimizing.
Open LLM Cost Calculator →7 Strategies to Cut Your Bill
Enable Prompt Caching
Anthropic (Claude) and Google (Gemini) offer prompt caching — repeated context blocks (system prompts, RAG documents, few-shot examples) are cached server-side after the first call. Cache hits cost 10–25% of normal input price.
Example: 3,000-token system prompt × 200,000 requests/month = 600M tokens. At Claude 3.5 Sonnet ($3/M), that's $1,800/month uncached. With caching enabled: first call $3, repeat calls at $0.30/M = $180/month. Saves $1,620/month.
Implement Model Routing
Not every request needs a flagship model. A rule-based or ML-based router sends simple tasks to cheap models and only escalates to premium models when needed.
Routing logic:
- GPT-4o mini / Claude Haiku: Classification, keyword extraction, short summaries, simple Q&A, intent detection
- GPT-4o / Claude Sonnet: Complex analysis, multi-step reasoning, code review, nuanced writing
- o1 / Claude Opus: Hard math, architectural decisions, long-chain reasoning — use sparingly
Example: A customer support app routes 80% of tickets to GPT-4o mini ($0.15/$0.60 per million) and 20% to GPT-4o ($2.50/$10). Blended cost drops from $10/M to $2.12/M output. Saves 79% on output tokens.
Use Batch API for Async Workloads
OpenAI and Anthropic both offer a Batch API that processes requests asynchronously (up to 24h turnaround) at exactly 50% of standard pricing. Zero complexity, pure savings.
Eligible workloads: document processing, content moderation, dataset labeling, overnight analytics, SEO content generation, email personalization, report generation.
Example: Legal-tech firm processing 10,000 contracts/month at $175 standard. Batch API: $87.50/month.
Audit and Shrink Your System Prompt
Your system prompt is charged on every single request. Most system prompts contain 30–50% removable content: outdated instructions, redundant examples, filler phrasing, and guidelines the model already follows by default.
- Remove examples if the model already performs correctly without them
- Cut "Be helpful, honest, and harmless" — the model knows this
- Use bullet points instead of prose (shorter, equally effective)
- Move rarely-needed instructions to conditional injection
Example: Trimming from 2,000 to 800 tokens × 500,000 requests/month = 600M fewer tokens. At GPT-4o ($2.50/M): saves $1,500/month.
Implement Context Truncation
In multi-turn conversations, input tokens grow with every exchange — turn 1 sends 500 tokens, turn 20 sends 10,000 tokens for the same conversation. Without truncation, costs scale quadratically.
Three approaches:
- Sliding window: Keep only the last N turns. Simple, loses older context.
- Summarization: Periodically compress older turns into a summary. Preserves context, adds one cheap summary call.
- Selective retrieval: Store turns as embeddings, retrieve only the semantically relevant ones. Best quality, most complex.
Teams that launch multi-turn features without truncation often see a 10× cost increase within 30 days as conversations grow. Implement a strategy before launch, not after.
Cache Responses at the Application Layer
Many LLM calls in production are semantically identical. FAQ answers, static content generation, template-based outputs — these don't need a fresh API call every time.
- Exact caching: Hash the full prompt, cache the response (Redis/Memcached). Zero cost on cache hit.
- Semantic caching: Embed the user query, find a cached response with cosine similarity above 0.95. Tools: GPTCache, Langchain caching, custom embedding store.
Example: A documentation chatbot with 40% cache hit rate on a $2,000/month API bill saves $800/month with Redis caching. Infrastructure cost: $20/month. Net saving: $780/month.
Set max_tokens Explicitly — Always
Without a max_tokens limit, models will generate to their maximum context window. A response that occasionally runs to 4,000 tokens when you only need 500 quadruples your output cost on that call.
Measure your actual P95 output token length in production for one week, add 20% buffer, and cap there. This alone can cut output costs 30–50% for apps where the model tends to over-generate.
Combined Savings Example
A mid-size AI app spending $5,000/month on LLM APIs applies all 7 strategies:
| Strategy | Monthly Saving |
|---|---|
| Prompt caching (large system prompt) | −$1,200 |
| Model routing (80% to mini) | −$1,500 |
| Batch API (30% of workload async) | −$450 |
| System prompt trim (2,000→900 tokens) | −$320 |
| Context truncation (sliding window) | −$280 |
| Response caching (35% hit rate) | −$190 |
| max_tokens cap | −$110 |
| Total savings | −$4,050 (81%) |
| New monthly bill | $950 |
FAQ
How much can I save with prompt caching?
Up to 90% on input tokens for cached content. Anthropic charges 10% of normal input price for cache hits. A 2,000-token system prompt on 100,000 requests/month saves 180M tokens — roughly $540/month on Claude 3.5 Sonnet.
What is model routing?
Sending different request types to different models based on complexity. Simple tasks go to cheap mini models (10–15× cheaper), complex tasks go to flagships. A well-tuned router cuts costs 50–80% with near-identical quality for most production workloads.
Does the Batch API affect output quality?
No — the Batch API uses identical models and parameters. The only difference is that responses are delivered asynchronously within 24 hours instead of in real-time. Quality is identical.
🔤 See your potential savings
Enter your current model and usage — then compare the cost after switching models or adjusting token counts.
Open LLM Cost Calculator →Last updated: June 2, 2026. Suggest an optimization we missed →