Speech-to-Text · Updated 2026-06-11

Whisper vs Deepgram: STT API Comparison 2026

OpenAI Whisper API charges $0.006/min. Deepgram Nova-2 charges $0.0043/min. At 1,000 hours/month that gap is $102. Here's every dimension that matters beyond raw price.

Feature & Pricing Breakdown

Side-by-side comparison across the dimensions that decide your vendor choice.

Dimension | OpenAI Whisper API | Deepgram Nova-2
Price per minute | $0.006 | $0.0043 (28% cheaper)
Free tier | None (pay per use) | First $200 credit free
Base / economy model | Single model tier | $0.0025/min (Base)
Real-time streaming | No (file upload only) | Yes (WebSocket API)
Latency (batch) | ~1–3s for short files | ~0.5–1.5s
Latency (streaming) | Not available | <300ms first word
Languages supported | 99 languages (best) | ~30 languages
WER on clean English | ~4–5% | ~3–4% (best)
WER on noisy/accented | ~8–12% (best) | ~12–18%
Speaker diarization | No | Yes (included)
Word-level timestamps | Yes | Yes (phoneme-level)
Confidence scores per word | No | Yes
Custom vocabulary | No | Yes (keyword boosting)
Self-host option | Yes (open weights) | No (cloud only)

Cost by Monthly Volume

Three realistic usage tiers — from a small podcast app to an enterprise call-center.

Small

100 hrs/mo = 6,000 minutes
Whisper API: $36.00/mo
Deepgram Nova-2: $25.80/mo
Deepgram saves: 28% ($10.20)

Medium

1,000 hrs/mo = 60,000 minutes
Whisper API: $360.00/mo
Deepgram Nova-2: $258.00/mo
Deepgram saves: $102/mo

Large

10,000 hrs/mo = 600,000 minutes
Whisper API: $3,600/mo
Deepgram Nova-2: $2,580/mo
Deepgram saves: $1,020/mo

Self-Host Tip

At 800+ hours/month, self-hosting Whisper large-v3 on a GPU server (~$53/mo on Hetzner CCX33) costs roughly $0.001/min. That is roughly six times cheaper than the Whisper API and roughly four times cheaper than Deepgram Nova-2. Worth the engineering investment at scale.
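
The volume tiers above follow directly from hours × 60 × price per minute. A minimal sketch that reproduces them, using only the per-minute prices quoted in this article (the function and dictionary names are illustrative, not part of any vendor SDK):

```python
# Per-minute list prices as quoted in this article (USD).
PRICE_PER_MIN = {
    "whisper_api": 0.006,
    "deepgram_nova2": 0.0043,
    "deepgram_base": 0.0025,
}


def monthly_cost(hours: float, price_per_min: float) -> float:
    """Monthly cost in USD for `hours` of audio, rounded to cents."""
    return round(hours * 60 * price_per_min, 2)


def compare(hours: float) -> dict[str, float]:
    """Monthly cost for every managed option in PRICE_PER_MIN."""
    return {name: monthly_cost(hours, price) for name, price in PRICE_PER_MIN.items()}


if __name__ == "__main__":
    for hours in (100, 1_000, 10_000):
        costs = compare(hours)
        saving = round(costs["whisper_api"] - costs["deepgram_nova2"], 2)
        print(f"{hours:>6} hrs/mo  {costs}  Nova-2 saves ${saving}")
```

Swap in your own negotiated rates if they differ from list price; the arithmetic stays the same.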

Technical Deep Dive

What the pricing table doesn't tell you.

Streaming vs Batch: The Architecture Decision

The most fundamental difference between Whisper API and Deepgram is not price — it is architecture. The OpenAI Whisper API accepts completed audio files and returns a full transcript. There is no WebSocket connection, no streaming endpoint, and no way to receive partial results mid-audio. This means you must buffer the entire audio recording before submitting it, which introduces irreducible latency equal to the duration of the audio itself plus processing time.

Deepgram's WebSocket API flips this model entirely. You open a persistent connection before the user starts speaking, stream raw audio bytes in chunks as small as 100ms, and receive word-level transcription results back in under 300 milliseconds from when each word was spoken. This is the architecture that powers live meeting transcription tools, voice bots, and real-time captioning systems. If your product requires the transcript to appear while the user is still talking, Deepgram is the only choice between these two.
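
To make the 100ms chunking concrete, here is a small sketch of the client-side framing step only. It assumes 16kHz, 16-bit, mono linear PCM (an assumption; other encodings change the byte math), and it deliberately leaves out the WebSocket send, which depends on the client library you use. The function names are hypothetical.

```python
from typing import Iterator


def chunk_size_bytes(
    sample_rate_hz: int = 16_000,
    bytes_per_sample: int = 2,
    channels: int = 1,
    chunk_ms: int = 100,
) -> int:
    """Number of raw PCM bytes that cover `chunk_ms` milliseconds of audio."""
    return sample_rate_hz * bytes_per_sample * channels * chunk_ms // 1000


def iter_chunks(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes; the last may be shorter."""
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]


if __name__ == "__main__":
    size = chunk_size_bytes()
    one_second_of_silence = bytes(16_000 * 2)  # 1 s of zeroed 16-bit mono samples
    chunks = list(iter_chunks(one_second_of_silence, size))
    print(size, len(chunks))
```

In a live pipeline each yielded chunk would be written to the open connection as soon as the microphone produces it, rather than sliced from a finished buffer as in this demo.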

Accuracy Where It Actually Matters

Word Error Rate (WER) benchmarks tell a nuanced story. On clean, native English audio, such as a typical call-center recording at 16kHz, Deepgram Nova-2 achieves WER in the 3–4% range, slightly edging out the Whisper API's 4–5%. But clean audio is the easy case. When you introduce background noise, non-native accents, technical jargon, or overlapping speakers, the rankings shift. The original Whisper models were trained on 680,000 hours of multilingual, noisy, real-world audio harvested from the internet (large-v3 builds on an even larger corpus), giving the model family remarkable robustness to adverse conditions. WER on accented English can run 8–12% for Whisper vs 12–18% for Deepgram Nova-2, a gap wide enough to materially affect downstream applications that parse transcript text programmatically.
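
If you want to check these figures against your own audio, WER is simple to compute: it is the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. A minimal sketch follows. It only lowercases and splits on whitespace, whereas published benchmarks usually apply heavier text normalization, so treat the output as indicative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox", "the quick brown dog"))
```

Run both vendors over the same hand-corrected sample of your production audio and compare the two numbers; a few hours of representative recordings is usually more informative than a public benchmark.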

Deepgram's Operator Features

Beyond raw transcription, Deepgram ships several features that matter in production deployments. Speaker diarization automatically labels each utterance with a speaker identifier, critical for call recordings and meeting minutes. Confidence scores per word let your application flag low-confidence segments for human review rather than blindly trusting the transcript. Phoneme-level timestamp alignment allows subtitle generation tools to sync text to video with frame-accurate precision. Keyword boosting lets you supply a vocabulary of domain-specific terms (product names, medical terminology, ticker symbols) that the model will prefer over acoustically similar alternatives. The Whisper API returns word-level timestamps, but none of the other features listed here.
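
As an illustration of the human-review workflow, the sketch below merges consecutive low-confidence words into spans a reviewer can jump to. The word shape (`word`, `start`, `end`, `confidence`) is an assumed, simplified structure for this example, not a guaranteed copy of any vendor's response schema, and the 0.8 threshold is arbitrary.

```python
from typing import TypedDict


class Word(TypedDict):
    # Assumed per-word record; adapt the keys to the response you actually get.
    word: str
    start: float
    end: float
    confidence: float


def _merge(run: list[Word]) -> tuple[float, float, str]:
    """Collapse a run of adjacent words into one (start, end, text) span."""
    return (run[0]["start"], run[-1]["end"], " ".join(w["word"] for w in run))


def low_confidence_spans(
    words: list[Word], threshold: float = 0.8
) -> list[tuple[float, float, str]]:
    """Group consecutive words whose confidence is below `threshold`."""
    spans: list[tuple[float, float, str]] = []
    run: list[Word] = []
    for w in words:
        if w["confidence"] < threshold:
            run.append(w)
            continue
        if run:
            spans.append(_merge(run))
            run = []
    if run:
        spans.append(_merge(run))
    return spans


if __name__ == "__main__":
    sample: list[Word] = [
        {"word": "please", "start": 0.0, "end": 0.4, "confidence": 0.99},
        {"word": "refill", "start": 0.4, "end": 0.8, "confidence": 0.55},
        {"word": "metformin", "start": 0.8, "end": 1.5, "confidence": 0.42},
        {"word": "today", "start": 1.5, "end": 1.9, "confidence": 0.97},
    ]
    print(low_confidence_spans(sample))
```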

Whisper's 99-Language Advantage

If your application operates across languages, Whisper's breadth is decisive. Deepgram Nova-2 covers approximately 30 languages with highest quality concentrated in English, Spanish, French, German, and a handful of others. Whisper supports 99 languages including lower-resource ones like Swahili, Welsh, Icelandic, Azerbaijani, and Malay. For a global product that transcribes user-submitted audio in whatever language the user happens to speak, Whisper API is the only viable managed-API solution. Deepgram's language roadmap is expanding, but the gap remains large in 2026.

The Self-Hosting Equation

Whisper's open-weights license is a genuine cost lever at scale. The model weights for Whisper large-v3 are freely downloadable and can be run on any NVIDIA GPU with sufficient memory. A Hetzner CCX33 instance with a GPU attachment costs approximately $53/month and can process around 800 hours of audio per month using faster-whisper, a CTranslate2-optimized inference backend. That works out to roughly $0.001 per minute ($53 / 48,000 minutes), which is roughly four times cheaper than Deepgram Nova-2 and roughly six times cheaper than the Whisper API. At 10,000 hours per month you would need 13 such machines (10,000 / 800 = 12.5, rounded up) for about $690/month total, compared to $2,580 for Deepgram Nova-2 or $3,600 for the Whisper API. The trade-off is the engineering effort: you need to manage GPU instances, build a job queue, handle failures, and monitor throughput. Deepgram's API eliminates all of that operational overhead in exchange for a higher per-minute cost.
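
The fleet-sizing arithmetic can be sketched as follows. The $53/month price and 800 hours/month throughput are this article's figures and are used here as default assumptions; measure your own throughput before relying on them. The function name is illustrative.

```python
import math


def self_host_estimate(
    hours_per_month: float,
    machine_cost_usd: float = 53.0,         # article's quoted monthly price
    machine_capacity_hours: float = 800.0,  # article's quoted throughput
) -> tuple[int, float, float]:
    """Return (machines needed, total monthly USD, effective USD per minute)."""
    machines = math.ceil(hours_per_month / machine_capacity_hours)
    total = machines * machine_cost_usd
    per_minute = total / (hours_per_month * 60)
    return machines, total, per_minute


if __name__ == "__main__":
    machines, total, per_minute = self_host_estimate(10_000)
    print(f"{machines} machines, ${total:,.0f}/mo, ${per_minute:.4f}/min")
```

Because machines come in whole units, the effective per-minute cost is lowest when volume sits just under a multiple of one machine's capacity, and it rises sharply when a fleet is mostly idle.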

Deepgram Base Model: The Budget Tier

Deepgram also offers a Base model tier at $0.0025/minute, which is about 42% below the Nova-2 price and less than half the Whisper API price. The Base model has lower accuracy than Nova-2, particularly on noisy audio or non-US accents, but for high-volume workloads on clean, controlled audio it represents the cheapest managed STT option available without operating your own GPU infrastructure. At 10,000 hours per month, Deepgram Base costs $1,500/month vs $2,580 for Nova-2 and $3,600 for Whisper API.

When to Use Each Service

Choose Deepgram Nova-2 when your application needs real-time streaming transcription, operates on clean English audio, requires speaker diarization or word-level confidence scores, or when you need to stay under $0.005/minute without managing infrastructure. Choose Whisper API when you need broad language coverage beyond the top 30, when your audio contains heavy accents or significant background noise, or when batch throughput matters more than latency. Consider self-hosting Whisper large-v3 when your monthly volume exceeds 500 hours and you have the engineering capacity to operate GPU workers — the economics become compelling quickly at that scale.
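
The guidance above can be written down as a small rule function. This is one possible encoding, not an official decision procedure: the parameter names are hypothetical, and the precedence (hard feature requirements first, then self-hosting economics, then audio conditions) is an assumption layered on top of the text.

```python
def recommend_stt(
    *,
    needs_streaming: bool,
    needs_diarization_or_confidence: bool,
    language_covered_by_deepgram: bool,
    noisy_or_accented_audio: bool,
    monthly_hours: float,
    can_operate_gpu_workers: bool,
) -> str:
    """Map the article's selection criteria to a single recommendation."""
    # Per the comparison table, streaming, diarization and per-word
    # confidence are only offered on the Deepgram side.
    if needs_streaming or needs_diarization_or_confidence:
        return "Deepgram Nova-2"
    # Self-hosting is suggested above 500 hours/month when a team can run GPUs.
    if monthly_hours > 500 and can_operate_gpu_workers:
        return "Self-hosted Whisper large-v3"
    # Broad language coverage and difficult audio favor Whisper.
    if not language_covered_by_deepgram or noisy_or_accented_audio:
        return "Whisper API"
    # Clean audio in a covered language: the cheaper managed option.
    return "Deepgram Nova-2"


if __name__ == "__main__":
    print(
        recommend_stt(
            needs_streaming=False,
            needs_diarization_or_confidence=False,
            language_covered_by_deepgram=False,
            noisy_or_accented_audio=False,
            monthly_hours=50,
            can_operate_gpu_workers=False,
        )
    )
```

Note that a workload needing both live streaming and a language outside Deepgram's set is not served well by either managed API; the function returns Deepgram in that case only because streaming is checked first.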

Verdict

Choose Deepgram Nova-2 if you need…

  • Real-time streaming (<300ms latency)
  • Speaker diarization out of the box
  • Word-level confidence scores
  • Lower cost on English audio
  • Phoneme-level subtitle alignment
  • Custom keyword vocabulary boosting

Choose Whisper API if you need…

  • 99-language multilingual support
  • Better accuracy on accented speech
  • Better accuracy on noisy audio
  • Self-host option (open weights)

Not available with the Whisper API: real-time streaming and per-word confidence scores.

Frequently Asked Questions

Can I self-host Whisper for free?

Yes. Whisper is open-weights and the model weights are publicly available on GitHub and Hugging Face. You can run Whisper large-v3 on any NVIDIA GPU without paying OpenAI anything. A Hetzner CCX33 dedicated server costs around $53/month and can process roughly 800 hours of audio per month, bringing your per-minute cost to about $0.001 — six times cheaper than the OpenAI Whisper API. The trade-off is infrastructure management, cold-start latency, and the engineering time required to build a reliable queue and worker system.

Does Deepgram support real-time streaming transcription?

Yes, real-time streaming is one of Deepgram's primary strengths. Deepgram exposes a WebSocket API that accepts live audio chunks and returns transcription results within approximately 200–300 milliseconds of speech. This makes it well-suited for applications like live meeting transcription, call-center agent assist, voice bots, and real-time captioning. The OpenAI Whisper API, by contrast, only processes completed audio files — it has no WebSocket or streaming endpoint, so it cannot be used for real-time use cases without additional engineering.

Which is more accurate for accented speech?

OpenAI Whisper large-v3 generally outperforms Deepgram Nova-2 on accented speech and noisy audio conditions. Whisper was trained on 680,000 hours of multilingual audio with broad accent coverage, giving it strong generalization on difficult inputs. Word Error Rate benchmarks on accented or multilingual content consistently show Whisper's advantage. For English-only business calls in decent audio conditions, the gap narrows significantly and Deepgram's lower latency and confidence scores often make it the better operational choice.

What languages does Deepgram support?

Deepgram Nova-2 supports a focused set of languages with highest accuracy in English, Spanish, French, German, Italian, Portuguese, Dutch, Hindi, Japanese, and Korean — approximately 30 languages total. Deepgram's language support is deliberately narrower than Whisper's 99-language coverage because Deepgram prioritizes accuracy and latency over breadth. If your application requires transcription in lower-resource languages such as Swahili, Welsh, Tamil, or Icelandic, Whisper is the significantly better choice.
