Deepgram Nova 2/3 for Voice AI: The Developer's Guide to Production-Grade STT
Deep dive into Deepgram Nova 2 and Nova 3 for voice AI applications. Compare accuracy, latency, pricing, and learn how to optimize STT quality with Burki's integration.
When your voice AI misunderstands a customer's order or mishears a critical support request, the consequences cascade through your entire application. A 5% word error rate might sound acceptable in benchmarks, but in production, that translates to frustrated users, failed transactions, and eroded trust. Speech-to-text accuracy isn't just a metric to optimize—it's the foundation that determines whether your voice AI succeeds or fails.
After evaluating dozens of STT providers for voice AI applications, Deepgram's Nova models consistently deliver the accuracy and latency profile that production systems demand. In this guide, we'll examine exactly what makes Nova 2 and Nova 3 different, when to use each, and how Burki's integration helps you extract maximum performance from both models.
Why Deepgram Leads Voice AI Transcription
Deepgram built their speech recognition models specifically for conversational AI workloads. Unlike general-purpose transcription services that prioritize pre-recorded content, Deepgram optimized for the real-time, low-latency requirements that voice assistants demand.
Nova 3: Setting New Accuracy Benchmarks
Released in February 2025, Nova 3 represents Deepgram's most accurate model to date. The numbers tell the story:
- 54.3% reduction in word error rate for streaming audio compared to previous models
- Median WER of 6.84% on real-world datasets—a 54.2% improvement over the next-best alternative at 14.92%
- Sub-300ms transcription latency enabling natural conversational flow
These aren't cherry-picked benchmarks. Deepgram tested Nova 3 against diverse, real-world audio including accented speech, background noise, and overlapping conversations. The model maintains high accuracy even in challenging acoustic conditions that cause other STT systems to struggle.
Nova 2: The Proven Workhorse
Nova 2 established Deepgram's reputation for production-grade speech recognition. Key achievements include:
- 18% more accurate than the original Nova model
- 36% relative WER improvement over OpenAI Whisper (large)
- 30% average WER reduction compared to leading alternatives for both pre-recorded and real-time transcription
- Support for 30+ languages with consistent accuracy
For many voice AI applications, Nova 2 delivers sufficient accuracy at a lower cost point. The model has been battle-tested across thousands of production deployments, making it a reliable choice for teams prioritizing stability.
Nova 2 vs Nova 3: Making the Right Choice
The decision between Nova 2 and Nova 3 depends on your specific accuracy requirements, latency tolerance, and budget constraints.
Accuracy Comparison
| Metric | Nova 2 | Nova 3 |
|---|---|---|
| Streaming WER | ~12-15% | ~6.84% |
| WER vs. Competitors | 30% better | 54% better |
| Noisy Audio Handling | Good | Excellent |
| Accented Speech | Good | Excellent |
Nova 3's accuracy advantage is most pronounced in challenging audio conditions. If your voice AI handles diverse accents, background noise, or overlapping speech, Nova 3's improvements translate directly to better user experience.
Latency Profile
Both models deliver sub-300ms transcription latency for streaming audio, which is essential for conversational voice AI. However, Nova 3's architecture was specifically optimized for streaming workloads, delivering more consistent performance under load.
For batch processing of pre-recorded audio, Nova 2 can transcribe roughly an hour of audio in about 30 seconds with diarization enabled—fast enough for near-real-time analytics, though not instant.
Streaming vs Batch Processing
Understanding Deepgram's processing modes is critical for cost optimization:
Streaming (Real-time):
- WebSocket connection for continuous audio
- Sub-300ms latency
- Essential for live voice AI conversations
- Higher per-minute cost
Batch Processing:
- Upload complete audio files
- Processing runs many times faster than real time
- Ideal for call recordings, voicemail transcription
- Significantly lower cost
For voice AI applications, streaming is typically required during live calls. Batch processing makes sense for post-call analytics and transcript generation.
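The streaming-versus-batch tradeoff above is easy to quantify. A minimal sketch using the per-minute rates quoted later in this article ($0.0077/min streaming, $0.0043/min Nova 2 batch); the function name is illustrative:

```python
# Illustrative cost comparison between Deepgram processing modes,
# using the per-minute rates quoted in this article.

STREAMING_RATE = 0.0077  # USD per minute (Nova 2/3 streaming)
BATCH_RATE = 0.0043      # USD per minute (Nova 2 pre-recorded)

def monthly_stt_cost(minutes: float, streaming: bool) -> float:
    """Return the monthly STT cost in USD for a given audio volume."""
    rate = STREAMING_RATE if streaming else BATCH_RATE
    return round(minutes * rate, 2)

# 50,000 minutes/month of live calls vs. the same volume as recordings:
live = monthly_stt_cost(50_000, streaming=True)       # 385.0
recorded = monthly_stt_cost(50_000, streaming=False)  # 215.0
```

A hybrid architecture—streaming during live calls, batch for post-call transcripts—lets you pay the higher rate only where latency actually matters.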
Deepgram Pricing Breakdown (2026)
Understanding Deepgram's pricing structure helps you architect cost-effective voice AI systems.
Nova 3 Pricing
| Plan | Per-Minute Rate | Hourly Equivalent |
|---|---|---|
| Pay-As-You-Go | $0.0077/min | $0.46/hour |
| Growth Plan | $0.0065/min | $0.39/hour |
The Growth plan offers a 16% discount for committed usage. New accounts receive $200 in free credits to evaluate the platform.
Nova 2 Pricing
| Processing Mode | Per-Minute Rate | Hourly Equivalent |
|---|---|---|
| Batch (Pre-recorded) | $0.0043/min | $0.26/hour |
| Streaming | $0.0077/min | $0.46/hour |
Nova 2 batch processing is approximately 44% cheaper than streaming, making it the optimal choice for non-real-time workloads.
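The "approximately 44% cheaper" figure follows directly from the two published rates:

```python
# Deriving the batch-vs-streaming savings percentage from the rates above.
streaming_rate = 0.0077  # USD/min, Nova 2 streaming
batch_rate = 0.0043      # USD/min, Nova 2 batch

savings_pct = round((streaming_rate - batch_rate) / streaming_rate * 100)
# savings_pct == 44
```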
Cost Comparison vs. Competitors
| Provider | Per 1,000 Minutes |
|---|---|
| Deepgram Nova 3 (streaming) | $7.70 |
| Deepgram Nova 2 (batch) | $4.30 |
| OpenAI Whisper API | $6.00 |
| Google Speech-to-Text | $16.00 |
Deepgram undercuts the major cloud providers by 2x or more while delivering superior accuracy for conversational AI workloads.
Additional Costs
Speaker diarization adds approximately $0.001-0.002/min to the base transcription cost. For voice AI applications where you need to distinguish between caller and assistant speech, factor this into your budget.
Burki + Deepgram Integration
Burki's voice AI platform integrates natively with Deepgram Nova 2 and Nova 3, plus the streaming-optimized Deepgram Flux variant. This integration abstracts away the complexity of WebSocket connections, audio format handling, and error recovery.
Supported Deepgram Models in Burki
| Provider | Models | Languages | Features |
|---|---|---|---|
| Deepgram | Nova 2, Nova 3 | 30+ | Real-time, Confidence scoring, Smart formatting |
| Deepgram Flux | Nova 2 (streaming optimized) | 30+ | Ultra-low latency streaming |
Why Flux Matters
Deepgram Flux is a streaming-optimized variant specifically designed for voice AI applications. It delivers even lower latency than standard Nova 2 by optimizing the entire audio pipeline for real-time conversational workloads.
Configuring Deepgram STT in Burki
Burki provides granular control over Deepgram's STT behavior. Here's how to configure optimal settings for voice AI.
Model Selection
Choose your Deepgram model based on the accuracy-cost tradeoff:
STT Provider: Deepgram
Model: Nova 3 (highest accuracy) | Nova 2 (cost-optimized) | Flux (lowest latency)
Language Configuration
Set your primary language for optimal recognition:
Language: en-US (or appropriate locale)
Auto-detection: Enabled/Disabled based on use case
For single-language deployments, disabling auto-detection can improve accuracy by eliminating language-switching overhead.
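If you connect to Deepgram directly rather than through Burki's UI, model and language are passed as query parameters on the streaming endpoint. A minimal sketch—the parameter names follow Deepgram's documented query-string conventions, but verify them against the current API reference:

```python
from urllib.parse import urlencode

def listen_url(model: str = "nova-3", language: str = "en-US",
               interim_results: bool = True) -> str:
    """Build a Deepgram streaming endpoint URL with common options.

    Parameter names (model, language, punctuate, smart_format,
    interim_results) follow Deepgram's query-string conventions;
    check the current docs before relying on them.
    """
    params = {
        "model": model,
        "language": language,
        "punctuate": "true",        # automatic punctuation
        "smart_format": "true",     # numbers, dates, currency formatting
        "interim_results": str(interim_results).lower(),
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)
```

Burki builds an equivalent connection for you; the sketch just makes the underlying knobs visible.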
Endpointing Configuration
Endpointing determines when Deepgram considers an utterance complete. Proper configuration is essential for natural conversation flow:
| Parameter | Purpose | Recommended Value |
|---|---|---|
| Silence Threshold | Minimum silence before utterance end | 500-800ms |
| Minimum Silence Duration | Prevents premature cutoffs | 300ms |
| Utterance End Timeout | Maximum wait for speech resumption | 1000-1500ms |
| VAD Turnoff | Voice activity detection sensitivity | 500ms |
Shorter thresholds create faster responses but risk cutting off speakers mid-thought. Longer thresholds feel more natural but increase latency.
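The table's recommendations can be captured as named profiles, so an assistant picks a point on the speed-versus-patience spectrum instead of tuning four raw numbers. This is a hypothetical helper, not a Burki or Deepgram API:

```python
# Hypothetical endpointing profiles built from the recommended values
# in the table above; names and structure are illustrative.

def endpointing_profile(style: str) -> dict:
    """Map a conversation style to a set of endpointing thresholds (ms)."""
    profiles = {
        # Fast responses; slightly higher risk of cutting speakers off.
        "snappy":  {"silence_threshold_ms": 500, "min_silence_ms": 300,
                    "utterance_end_ms": 1000, "vad_turnoff_ms": 500},
        # More patient; feels natural but adds latency.
        "natural": {"silence_threshold_ms": 800, "min_silence_ms": 300,
                    "utterance_end_ms": 1500, "vad_turnoff_ms": 500},
    }
    if style not in profiles:
        raise ValueError(f"unknown style: {style}")
    return profiles[style]
```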
Smart Formatting Options
Enable Deepgram's formatting features for cleaner transcripts:
- Automatic Punctuation: Adds periods, commas, question marks
- Smart Formatting: Converts numbers, dates, currency to proper formats
- Interim Results: Provides partial transcripts for real-time display
Bring Your Own API Key
Burki supports per-assistant Deepgram API keys, allowing you to:
- Use your own Deepgram account for direct billing
- Apply custom vocabulary and trained models
- Access enterprise features and higher rate limits
Configure your Deepgram API key at the assistant level or fall back to organization-level credentials.
Optimizing Accuracy for Voice AI
Raw model accuracy is just the starting point. Production voice AI systems require additional optimization.
Audio Denoising with RNNoise
Burki integrates RNNoise for real-time audio denoising before STT processing. This neural network-based noise reduction can significantly improve accuracy in noisy environments:
- Per-Assistant Toggle: Enable selectively based on expected audio quality
- Hardware Acceleration: GPU support for low-latency processing
- Transparent Integration: Denoising happens automatically in the audio pipeline
For call center environments, office settings, or any scenario with predictable background noise, enabling RNNoise provides measurable accuracy improvements.
Keyword Boosting
Deepgram supports keyword boosting to improve recognition of domain-specific terms. This is critical for:
- Product names and model numbers
- Industry jargon and acronyms
- Proper nouns and brand names
Boosted keywords can achieve up to 90% higher keyword recall rate compared to baseline recognition.
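Keyword boosts are sent to Deepgram as repeated `keywords=term:intensifier` query parameters. A small sketch of encoding them—the `term:intensifier` form follows Deepgram's documented syntax, but confirm supported boost ranges in the current docs:

```python
from urllib.parse import urlencode

def keyword_params(keywords: dict[str, float]) -> str:
    """Encode keyword boosts as repeated `keywords=term:boost` params."""
    pairs = [("keywords", f"{term}:{boost:g}") for term, boost in keywords.items()]
    return urlencode(pairs)

keyword_params({"Burki": 2})       # "keywords=Burki%3A2"
keyword_params({"Nova": 1.5})      # "keywords=Nova%3A1.5"
```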
Handling Filler Words
Configure whether to transcribe filler words like "uh" and "um":
- Include Fillers: More natural transcripts, useful for sentiment analysis
- Exclude Fillers: Cleaner output, reduced token usage for LLM processing
For voice AI that feeds transcripts to an LLM for response generation, excluding fillers often produces better results.
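If your STT configuration does include fillers, you can still strip them before the transcript reaches the LLM. A minimal post-processing sketch; the filler list is an illustrative starting point:

```python
import re

# Illustrative filler set; extend for your language and domain.
FILLERS = {"uh", "um", "uhm", "er", "hmm"}

def strip_fillers(transcript: str) -> str:
    """Remove standalone filler words before sending a transcript to the LLM."""
    words = [w for w in transcript.split()
             if re.sub(r"\W", "", w).lower() not in FILLERS]
    return " ".join(words)

strip_fillers("Um, I need to, uh, reschedule my appointment")
# -> "I need to, reschedule my appointment"
```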
Confidence Score Monitoring
Deepgram provides per-word confidence scores with each transcript. Burki surfaces these for quality monitoring:
- Track average confidence across calls
- Flag low-confidence segments for review
- Identify patterns in recognition failures
Low confidence scores often indicate audio quality issues, network problems, or out-of-vocabulary terms that need boosting.
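Monitoring can be as simple as walking the per-word scores in each response. This sketch assumes the `channel.alternatives[0].words` shape of Deepgram's JSON responses, with a `confidence` field on each word; the 0.6 threshold is an illustrative choice:

```python
def low_confidence_words(result: dict, threshold: float = 0.6) -> list[str]:
    """Flag words below a confidence threshold in a Deepgram-style result."""
    words = result["channel"]["alternatives"][0]["words"]
    return [w["word"] for w in words if w["confidence"] < threshold]

def average_confidence(result: dict) -> float:
    """Average per-word confidence for a single transcript result."""
    words = result["channel"]["alternatives"][0]["words"]
    return sum(w["confidence"] for w in words) / len(words)
```

Track the average per call and alert when it dips; recurring low-confidence words are strong candidates for keyword boosting.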
Voice AI Pipeline Integration
STT is one component of Burki's real-time conversation pipeline:
Incoming Audio -> STT (Deepgram) -> LLM -> TTS -> Audio Output
(every stage connected over real-time WebSocket streaming)
Latency Optimization
Burki optimizes the entire pipeline, not just STT:
- Pipeline Overlapping: STT, LLM, and TTS processing overlap for minimal latency
- Interim Results: Begin LLM processing on partial transcripts
- Streaming TTS: Start audio playback before full response generation
With Deepgram's sub-300ms STT latency combined with Burki's pipeline optimization, total response times of 0.8-1.2 seconds are achievable.
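A back-of-the-envelope latency budget shows how overlapping stages reaches that range. Stage timings below are illustrative assumptions, not measurements; only the sub-300ms STT figure comes from this article:

```python
# Rough pipeline latency budget. Overlap credits time where STT, LLM,
# and TTS run concurrently (interim results, streaming TTS).

def total_latency_ms(stages: dict, overlap_ms: int = 0) -> int:
    """Sum per-stage latencies, crediting overlap between stages."""
    return sum(stages.values()) - overlap_ms

pipeline = {
    "stt_ms": 300,              # Deepgram streaming (sub-300ms)
    "llm_first_token_ms": 500,  # assumed time-to-first-token
    "tts_first_audio_ms": 300,  # assumed time-to-first-audio
}

total_latency_ms(pipeline)                  # 1100 (fully sequential)
total_latency_ms(pipeline, overlap_ms=300)  # 800 (with pipeline overlap)
```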
Interruption Handling
Natural conversation requires handling interruptions gracefully. Burki's STT integration supports:
- Interruption Threshold: Words required before allowing interruption
- Minimum Speaking Time: Grace period before interruption detection
- Transcript Buffering: Accumulates speech during cooldown periods
- LLM Cancellation: Cancels pending responses on interruption
These features work together to create conversational flow that feels natural while avoiding constant interruptions.
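The first two features—an interruption threshold and a minimum speaking time—can be sketched as a simple gate. This is a hypothetical illustration of the logic, not Burki's implementation; names and defaults are invented:

```python
# Hypothetical barge-in gate: allow an interruption only after the caller
# has produced enough words AND the assistant has had a short grace period.

class InterruptionGate:
    def __init__(self, min_words: int = 3, grace_ms: int = 500):
        self.min_words = min_words  # interruption threshold (words)
        self.grace_ms = grace_ms    # minimum assistant speaking time (ms)

    def should_interrupt(self, interim_words: int, assistant_speaking_ms: int) -> bool:
        """Decide whether interim STT output justifies cutting off TTS."""
        return (interim_words >= self.min_words
                and assistant_speaking_ms >= self.grace_ms)
```

Gating on word count (rather than raw voice activity) filters out coughs and backchannel noises that would otherwise cancel the assistant mid-sentence.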
Cost Analysis: Real-World Scenarios
Let's examine Deepgram costs for typical voice AI deployments.
Scenario 1: Customer Support Bot
- Volume: 10,000 calls/month
- Average Duration: 5 minutes
- Model: Nova 3 (streaming)
- Monthly Cost: 50,000 minutes x $0.0077 = $385/month
With Nova 2 streaming at the same rate, cost would be identical. The savings from Nova 2 come from batch processing, which isn't applicable for live calls.
Scenario 2: Appointment Scheduling
- Volume: 5,000 calls/month
- Average Duration: 3 minutes
- Model: Nova 2 (streaming)
- Monthly Cost: 15,000 minutes x $0.0077 = $115.50/month
Since Nova 2 and Nova 3 share the same streaming rate, there's no direct cost benefit to choosing Nova 2 for live calls—pick it when its accuracy is sufficient and you value its longer production track record.
Scenario 3: Post-Call Analytics
- Volume: 20,000 call recordings/month
- Average Duration: 8 minutes
- Model: Nova 2 (batch)
- Monthly Cost: 160,000 minutes x $0.0043 = $688/month
Batch processing for recordings delivers significant savings—44% less than streaming rates.
Scenario 4: High-Accuracy Sales Calls
- Volume: 2,000 calls/month
- Average Duration: 15 minutes
- Model: Nova 3 with Growth Plan
- Monthly Cost: 30,000 minutes x $0.0065 = $195/month
The Growth plan's 16% discount adds up for committed usage.
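All four scenarios follow the same arithmetic—volume times duration times the applicable rate:

```python
# Recomputing the four scenarios above from their stated volumes and rates.
scenarios = [
    # (name, total minutes/month, USD per minute)
    ("support_bot",  10_000 * 5,  0.0077),  # Nova 3 streaming
    ("scheduling",    5_000 * 3,  0.0077),  # Nova 2 streaming
    ("analytics",    20_000 * 8,  0.0043),  # Nova 2 batch
    ("sales_growth",  2_000 * 15, 0.0065),  # Nova 3, Growth plan
]

costs = {name: round(minutes * rate, 2) for name, minutes, rate in scenarios}
# {'support_bot': 385.0, 'scheduling': 115.5,
#  'analytics': 688.0, 'sales_growth': 195.0}
```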
Frequently Asked Questions
What's the accuracy difference between Nova 2 and Nova 3?
Nova 3 achieves approximately 54% lower word error rate than Nova 2 for streaming transcription. On real-world datasets, Nova 3 delivers median WER of 6.84% compared to Nova 2's approximately 12-15%. The difference is most noticeable with challenging audio: accents, background noise, and overlapping speech.
Should I use Nova 3 for all voice AI applications?
Not necessarily. Nova 2 provides excellent accuracy for many use cases at the same streaming price point. If your audio quality is good and your vocabulary is standard, Nova 2's accuracy may be sufficient. Nova 3 shines when you need the absolute best recognition for challenging conditions.
How does Deepgram compare to OpenAI Whisper?
Deepgram outperforms Whisper on real-time streaming transcription—Nova 2 alone delivers a 36% relative WER improvement over Whisper (large), and Nova 3 widens the gap—with significantly lower latency. Whisper's batch processing is competitive for pre-recorded audio but lacks the streaming capabilities voice AI requires.
What languages does Deepgram support?
Both Nova 2 and Nova 3 support 30+ languages and dialects for real-time and recorded audio. English receives the most optimization, but major world languages are well-supported.
Can I use custom vocabulary with Deepgram?
Yes. Deepgram supports keyword boosting for improved recognition of custom terms. For enterprise needs, custom model training is available. Through Burki's BYO API key feature, you can leverage any Deepgram account features including custom models.
What's the minimum audio quality required?
Deepgram handles a wide range of audio quality, but optimal results require:
- Sample rate: 8kHz minimum, 16kHz+ recommended
- Clear speech without heavy compression
- Background noise within reasonable limits
Burki's RNNoise integration can improve results for noisy audio.
How does pricing work for speaker diarization?
Speaker diarization adds approximately $0.001-0.002 per minute on top of base transcription costs. For voice AI calls, diarization helps distinguish between caller and assistant speech for analytics and training purposes.
What happens if Deepgram is unavailable?
Burki supports fallback STT providers. Configure Azure Speech as a backup to maintain availability if Deepgram experiences issues.
Getting Started with Deepgram on Burki
Setting up Deepgram STT on Burki takes minutes:
- Create a Burki account and access the assistant builder
- Navigate to STT configuration in your assistant settings
- Select Deepgram as your provider
- Choose your model (Nova 3 for accuracy, Nova 2 for cost optimization, Flux for lowest latency)
- Configure endpointing based on your conversation style
- Enable audio denoising if handling noisy environments
- Test with the web call interface before going live
For production deployments, consider bringing your own Deepgram API key for direct billing and access to enterprise features.
Conclusion
Deepgram's Nova models deliver the accuracy, latency, and cost profile that voice AI applications demand. Nova 3's 54% WER improvement makes it the clear choice for high-stakes conversations, while Nova 2 remains a cost-effective option for simpler use cases.
Burki's native integration eliminates the complexity of managing Deepgram connections, audio formats, and error handling. Combined with RNNoise denoising, intelligent endpointing, and pipeline optimization, you get production-grade STT performance without the infrastructure overhead.
Ready to build voice AI with industry-leading speech recognition? Start your free Burki trial with 200 minutes of calls included—no credit card required.
Last updated: January 2026