What is LLM model routing?

Model routing is sending different types of requests to different models based on complexity. Simple tasks (classification, extraction, short Q&A) go to cheap mini models. Complex reasoning, long-form generation, or code goes to flagship models. A well-tuned router can cut costs 50-80% with near-identical output quality.

Does the OpenAI Batch API reduce costs?

Yes — OpenAI's Batch API (and Anthropic's equivalent) processes requests asynchronously with a 24-hour turnaround, at exactly 50% of the standard API price. For any workload that doesn't need real-time response — document processing, content generation, analytics — batch is free money.

How to Reduce LLM API Costs by 60%: 7 Proven Strategies (2026)

Most teams overpay for LLM APIs by 40–70% without knowing it. The waste comes from a handful of fixable patterns: oversized system prompts, wrong model selection, synchronous calls where async would work, and uncapped output tokens. This guide walks through 7 strategies — each with real dollar savings — that you can implement this week.

First: Know Your Baseline

You can't optimize what you don't measure. Before applying any strategy, calculate your current monthly cost using the formula:

monthly_cost = (avg_input_tokens × input_price/1M + avg_output_tokens × output_price/1M) × monthly_requests

Log token usage per request for one week. You'll likely find 20% of requests consume 60% of your tokens — and those are the ones to optimize first.
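The formula above translates directly into a small helper. This is a minimal sketch: the function name, the token counts, and the request volume are illustrative assumptions, and the prices are the GPT-4o list prices quoted elsewhere in this guide.

```javascript
// Hypothetical helper: prices are USD per 1M tokens, as in the formula above.
function monthlyCost({ avgInputTokens, avgOutputTokens, inputPricePerM, outputPricePerM, monthlyRequests }) {
  const perRequest =
    (avgInputTokens * inputPricePerM + avgOutputTokens * outputPricePerM) / 1_000_000;
  return perRequest * monthlyRequests;
}

// Illustrative numbers only: 1,500 input and 400 output tokens per request.
const baseline = monthlyCost({
  avgInputTokens: 1500,
  avgOutputTokens: 400,
  inputPricePerM: 2.5,
  outputPricePerM: 10,
  monthlyRequests: 100_000,
});
console.log(`Baseline: $${baseline.toFixed(2)}/month`);
```

Feeding it your logged weekly averages rather than guesses is what makes the later savings estimates meaningful.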

7 Strategies to Cut Your Bill

Strategy 01

SAVINGS: 10–90% on input tokens

Enable Prompt Caching

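Anthropic (Claude) and Google (Gemini) offer prompt caching: repeated context blocks (system prompts, RAG documents, few-shot examples) are cached server-side after the first call. Cache hits cost 10–25% of normal input price. On Anthropic, the call that writes the cache is billed at a 25% premium over the normal input price, so caching pays off only when the same block is reused before the cache expires.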

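Example: 3,000-token system prompt × 200,000 requests/month = 600M tokens. At Claude 3.5 Sonnet ($3/M), that's $1,800/month uncached. With caching enabled, the call that writes the cache costs about $0.01 (3,000 tokens at $3.75/M) and repeat calls read at $0.30/M, roughly $180/month. Saves about $1,620/month, assuming traffic is steady enough to keep the cache warm.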

// Anthropic: mark cacheable blocks with cache_control
messages: [{
  role: "user",
  content: [
    {
      type: "text",
      text: systemPrompt,
      cache_control: { type: "ephemeral" } // cached with a 5-minute TTL
    },
    { type: "text", text: userMessage }
  ]
}]
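The savings arithmetic for the 3,000-token example can be sketched as follows. The function and the `cacheWrites` parameter are hypothetical; the sketch assumes cache reads bill at 10% of the input price and cache writes at 125%, and it ignores the uncached part of each request.

```javascript
// Assumptions: reads cost 10% of the input price, writes cost 125%.
// `cacheWrites` is how many times per month the cache has to be (re)written.
function cachedPromptCost({ promptTokens, requests, inputPricePerM, cacheWrites = 1 }) {
  const perToken = inputPricePerM / 1_000_000;
  const uncached = promptTokens * requests * perToken;
  const writes = promptTokens * cacheWrites * perToken * 1.25;
  const reads = promptTokens * (requests - cacheWrites) * perToken * 0.1;
  const cached = writes + reads;
  return { uncached, cached, saved: uncached - cached };
}

const result = cachedPromptCost({ promptTokens: 3000, requests: 200_000, inputPricePerM: 3 });
console.log(result.uncached.toFixed(2), result.cached.toFixed(2), result.saved.toFixed(2));
```

With a 5-minute TTL, bursty traffic triggers more cache writes than the best case shown here, so it is worth rerunning the estimate with a larger `cacheWrites` value that matches your traffic pattern.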

Strategy 02

SAVINGS: 50–80% for mixed workloads

Implement Model Routing

Not every request needs a flagship model. A rule-based or ML-based router sends simple tasks to cheap models and only escalates to premium models when needed.

Routing logic:

GPT-4o mini / Claude Haiku: Classification, keyword extraction, short summaries, simple Q&A, intent detection
GPT-4o / Claude Sonnet: Complex analysis, multi-step reasoning, code review, nuanced writing
o1 / Claude Opus: Hard math, architectural decisions, long-chain reasoning — use sparingly

Example: A customer support app routes 80% of tickets to GPT-4o mini ($0.15/$0.60 per million) and 20% to GPT-4o ($2.50/$10). Blended output cost drops from $10/M to $2.48/M (0.8 × $0.60 + 0.2 × $10). Saves about 75% on output tokens.
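A rule-based router for the tiers listed above can be very small. The sketch below is one possible shape, not a reference implementation: the task labels are hypothetical, and it assumes an upstream step has already tagged each request with a task type. The `blendedPrice` helper computes the per-million price for a traffic mix like the 80/20 split in the example.

```javascript
// Hypothetical task labels; the tiers mirror the routing list above.
const MODEL_BY_TIER = { simple: "gpt-4o-mini", standard: "gpt-4o", hard: "o1" };
const SIMPLE_TASKS = new Set(["classification", "extraction", "short_summary", "simple_qa", "intent"]);
const HARD_TASKS = new Set(["hard_math", "architecture", "long_chain_reasoning"]);

function routeModel(taskType) {
  if (SIMPLE_TASKS.has(taskType)) return MODEL_BY_TIER.simple;
  if (HARD_TASKS.has(taskType)) return MODEL_BY_TIER.hard;
  return MODEL_BY_TIER.standard; // unknown task types default to the mid tier
}

// Blended price per 1M tokens for a traffic mix: [{ share, pricePerM }, ...]
function blendedPrice(mix) {
  return mix.reduce((sum, { share, pricePerM }) => sum + share * pricePerM, 0);
}

const blendedOutput = blendedPrice([
  { share: 0.8, pricePerM: 0.6 }, // GPT-4o mini output price
  { share: 0.2, pricePerM: 10 },  // GPT-4o output price
]);
console.log(routeModel("classification"), blendedOutput.toFixed(2));
```

Defaulting unknown task types to the mid tier is a deliberate choice here: misrouting a hard request to a mini model usually costs more in quality than the occasional over-spend.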

Strategy 03

SAVINGS: 50% flat on eligible workloads

Use Batch API for Async Workloads

OpenAI and Anthropic both offer a Batch API that processes requests asynchronously (up to 24h turnaround) at 50% of standard pricing. The integration cost is low: you submit requests in bulk and collect the results later instead of waiting on each call.

Eligible workloads: document processing, content moderation, dataset labeling, overnight analytics, SEO content generation, email personalization, report generation.

// OpenAI Batch API: save 50% on these requests
const batch = await openai.batches.create({
  input_file_id: fileId,
  endpoint: "/v1/chat/completions",
  completion_window: "24h" // process overnight, pay half
});

Example: Legal-tech firm processing 10,000 contracts/month at $175 standard. Batch API: $87.50/month.
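The `input_file_id` in the snippet above refers to an uploaded JSONL file with one request per line. A minimal sketch of building that file's contents is shown below; the helper name, the prompt strings, the `custom_id` scheme, and the `max_tokens` value are illustrative assumptions.

```javascript
// Each JSONL line is one request: custom_id, method, url, and the usual request body.
function toBatchJsonl(prompts, model = "gpt-4o-mini") {
  return prompts
    .map((prompt, i) =>
      JSON.stringify({
        custom_id: `request-${i + 1}`, // used to match results back to inputs
        method: "POST",
        url: "/v1/chat/completions",
        body: { model, messages: [{ role: "user", content: prompt }], max_tokens: 500 },
      })
    )
    .join("\n");
}

const jsonl = toBatchJsonl(["Summarize contract A.", "Summarize contract B."]);
console.log(jsonl);
```

The resulting text is what you would write to disk and upload before creating the batch. Results come back in no guaranteed order, which is why each line carries its own `custom_id`.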

Strategy 04

SAVINGS: $0.01–$5.00 per 1,000 requests

Audit and Shrink Your System Prompt

Your system prompt is charged on every single request. Most system prompts contain 30–50% removable content: outdated instructions, redundant examples, filler phrasing, and guidelines the model already follows by default.

Remove examples if the model already performs correctly without them
Cut "Be helpful, honest, and harmless" — the model knows this
Use bullet points instead of prose (shorter, equally effective)
Move rarely-needed instructions to conditional injection

Example: Trimming from 2,000 to 800 tokens × 500,000 requests/month = 600M fewer tokens. At GPT-4o ($2.50/M): saves $1,500/month.
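Conditional injection, the last item in the list above, can be as simple as appending optional instructions only when the user message calls for them. The base prompt, the rules, and the trigger patterns below are all hypothetical placeholders; a real system might use an intent classifier instead of regular expressions.

```javascript
// Hypothetical rule set: each optional instruction is added only when its trigger matches.
const BASE_PROMPT = "You are a support assistant. Answer concisely.";
const CONDITIONAL_RULES = [
  { trigger: /refund|chargeback/i, text: "For refunds, cite the 30-day policy and ask for the order ID." },
  { trigger: /gdpr|delete my data/i, text: "For data requests, point to the privacy request form." },
];

function buildSystemPrompt(userMessage) {
  const extras = CONDITIONAL_RULES
    .filter((rule) => rule.trigger.test(userMessage))
    .map((rule) => rule.text);
  return [BASE_PROMPT, ...extras].join("\n");
}

console.log(buildSystemPrompt("How do I reset my password?")); // base prompt only
console.log(buildSystemPrompt("I want a refund for my order")); // base prompt plus the refund rule
```

Most requests then pay only for the base prompt, and the rarely needed rules are billed only on the requests that trigger them.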

Strategy 05

SAVINGS: 30–70% in multi-turn apps

Implement Context Truncation

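In multi-turn conversations, input tokens grow with every exchange: turn 1 sends 500 tokens, while turn 20 of the same conversation sends 10,000. Per-request input grows linearly with the turn number, so without truncation the cumulative cost of a conversation scales quadratically with its length.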
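To see the growth concretely, the sketch below totals the input tokens for a 20-turn conversation, using the 500-tokens-per-turn figure from the paragraph above. The function is hypothetical and assumes every turn adds the same number of tokens and that the whole history is resent on each call; the optional window parameter shows the effect of capping history at a fixed number of turns.

```javascript
// Assumption: each turn adds `tokensPerTurn` tokens and the whole history is resent every time.
function cumulativeInputTokens(turns, tokensPerTurn, windowTurns = Infinity) {
  let total = 0;
  for (let t = 1; t <= turns; t++) {
    // Turn t sends t turns of history, capped at the window size if one is set.
    total += Math.min(t, windowTurns) * tokensPerTurn;
  }
  return total;
}

const untruncated = cumulativeInputTokens(20, 500); // 105000
const windowed = cumulativeInputTokens(20, 500, 6); // 52500
console.log(untruncated, windowed);
```

In this example a six-turn window halves the input tokens for the conversation, and the gap widens as conversations get longer.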

Three approaches:

Sliding window: Keep only the last N turns. Simple, loses older context.
Summarization: Periodically compress older turns into a summary. Preserves context, adds one cheap summary call.
Selective retrieval: Store turns as embeddings, retrieve only the semantically relevant ones. Best quality, most complex.
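Of the three approaches, the sliding window is the quickest to implement. A minimal sketch, assuming messages use the usual `{ role, content }` shape and that user and assistant messages alternate (the function name is hypothetical):

```javascript
// Keeps every system message plus the last `maxTurns` user/assistant exchanges.
function slidingWindow(messages, maxTurns) {
  const system = messages.filter((m) => m.role === "system");
  const dialogue = messages.filter((m) => m.role !== "system");
  // One turn = one user message plus one assistant reply, so keep 2 * maxTurns messages.
  const recent = maxTurns > 0 ? dialogue.slice(-2 * maxTurns) : [];
  return [...system, ...recent];
}

const history = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "u1" }, { role: "assistant", content: "a1" },
  { role: "user", content: "u2" }, { role: "assistant", content: "a2" },
  { role: "user", content: "u3" }, { role: "assistant", content: "a3" },
];
console.log(slidingWindow(history, 2).map((m) => m.content));
```

The system prompt is always preserved because dropping it changes model behaviour far more than dropping an old exchange does.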

⚠ Most common mistake

Teams that launch multi-turn features without truncation often see a 10× cost increase within 30 days as conversations grow. Implement a strategy before launch, not after.

Strategy 06

SAVINGS: 20–60% depending on repeat rate

Cache Responses at the Application Layer

Many LLM calls in production are semantically identical. FAQ answers, static content generation, template-based outputs — these don't need a fresh API call every time.

Exact caching: Hash the full prompt, cache the response (Redis/Memcached). Zero cost on cache hit.
Semantic caching: Embed the user query, find a cached response with cosine similarity above 0.95. Tools: GPTCache, LangChain caching, custom embedding store.

Example: A documentation chatbot with 40% cache hit rate on a $2,000/month API bill saves $800/month with Redis caching. Infrastructure cost: $20/month. Net saving: $780/month.
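Both caching styles described above can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the `Map` stands in for Redis or Memcached, `callModel` is a hypothetical async function that performs the real API request, and the cosine helper shows only the comparison step of semantic caching (producing the embeddings is left out).

```javascript
const crypto = require("node:crypto");

// In-memory stand-in for Redis/Memcached.
const cache = new Map();

// Exact caching: hash the model name plus the full message list.
function cacheKey(model, messages) {
  return crypto.createHash("sha256").update(JSON.stringify({ model, messages })).digest("hex");
}

// `callModel` is a hypothetical async function that makes the real API call.
async function cachedCompletion(model, messages, callModel) {
  const key = cacheKey(model, messages);
  if (cache.has(key)) return { text: cache.get(key), cacheHit: true };
  const text = await callModel(model, messages);
  cache.set(key, text);
  return { text, cacheHit: false };
}

// Semantic caching compares query embeddings; this is the similarity measure.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Including the model name in the key prevents a response generated by one model from being served for another. In production you would also set an expiry on each entry so stale answers age out.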

Strategy 07

SAVINGS: 10–50% of current output cost

Set max_tokens Explicitly — Always

Without a max_tokens limit, a model can keep generating until it emits a stop token or reaches its maximum output length. A response that occasionally runs to 4,000 tokens when you only need 500 costs eight times as much in output tokens on that call.

// Always set this; measure your P95 output length first
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: messages,
  max_tokens: 800, // your measured P95 + 20% buffer
  temperature: 0.7
});

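Measure your actual P95 output token length in production for one week, add 20% buffer, and cap there. This alone can cut output costs 30–50% for apps where the model tends to over-generate. Responses that hit the cap are cut off mid-output, so track how often finish_reason comes back as "length" and raise the cap if it happens more than occasionally.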
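The "P95 plus 20%" rule can be computed from logged output token counts with a short helper. This sketch uses the nearest-rank percentile method, which is one of several reasonable definitions; the function names and the sample data are hypothetical.

```javascript
// Nearest-rank percentile over logged output token counts.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Suggested cap: P95 of observed output lengths plus a safety buffer.
function suggestMaxTokens(outputTokenCounts, bufferRatio = 0.2) {
  return Math.ceil(percentile(outputTokenCounts, 95) * (1 + bufferRatio));
}

// Illustrative sample: 100 logged responses of 10, 20, ..., 1000 tokens.
const samples = Array.from({ length: 100 }, (_, i) => (i + 1) * 10);
console.log(percentile(samples, 95), suggestMaxTokens(samples));
```

Compute the cap per endpoint or per task type rather than globally, since a summarizer and a classifier will have very different output lengths.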

Combined Savings Example

A mid-size AI app spending $5,000/month on LLM APIs applies all 7 strategies:

Strategy	Monthly Saving
Prompt caching (large system prompt)	−$1,200
Model routing (80% to mini)	−$1,500
Batch API (30% of workload async)	−$450
System prompt trim (2,000→900 tokens)	−$320
Context truncation (sliding window)	−$280
Response caching (35% hit rate)	−$190
max_tokens cap	−$110
Total savings	−$4,050 (81%)
New monthly bill	$950

FAQ

How much can I save with prompt caching?

Up to 90% on input tokens for cached content. Anthropic charges 10% of normal input price for cache hits. A 2,000-token system prompt on 100,000 requests/month is 200M input tokens, or $600 at Claude 3.5 Sonnet's $3/M. At the cache-hit rate that falls to about $60, saving roughly $540/month.

What is model routing?

Sending different request types to different models based on complexity. Simple tasks go to cheap mini models (roughly 10–17× cheaper at the prices quoted in this guide), complex tasks go to flagships. A well-tuned router cuts costs 50–80% with near-identical quality for most production workloads.

Does the Batch API affect output quality?

No. The Batch API uses the same models and parameters; the only difference is that responses are delivered asynchronously within 24 hours instead of in real time.

APICalculators Team

We build free, privacy-first cost calculators for developers and AI engineers. Pricing data is sourced directly from official provider documentation and verified monthly.

Last updated: June 2, 2026.
