Deepgram Nova 2/3 for Voice AI: The Developer's Guide to Production-Grade STT
Deep dive into Deepgram Nova 2 and Nova 3 for voice AI applications. Compare accuracy, latency, pricing, and learn how to optimize STT quality with Burki's integration.
When your voice AI misunderstands a customer's order or mishears a critical support request, the consequences cascade through your entire application. A 5% word error rate might sound acceptable in benchmarks, but in production, that translates to frustrated users, failed transactions, and eroded trust. Speech-to-text accuracy isn't just a metric to optimize—it's the foundation that determines whether your voice AI succeeds or fails.
After evaluating dozens of STT providers for voice AI applications, Deepgram's Nova models consistently deliver the accuracy and latency profile that production systems demand. In this guide, we'll examine exactly what makes Nova 2 and Nova 3 different, when to use each, and how Burki's integration helps you extract maximum performance from both models.
Why Deepgram Leads Voice AI Transcription
Deepgram built their speech recognition models specifically for conversational AI workloads. Unlike general-purpose transcription services that prioritize pre-recorded content, Deepgram optimized for the real-time, low-latency requirements that voice assistants demand.
Nova 3: Setting New Accuracy Benchmarks
Released in February 2025, Nova 3 represents Deepgram's most accurate model to date. The numbers tell the story:
- 54.3% reduction in word error rate for streaming audio compared to previous models
- Median WER of 6.84% on real-world datasets—a 54.2% improvement over the next-best alternative at 14.92%
- Sub-300ms transcription latency enabling natural conversational flow
These aren't cherry-picked benchmarks. Deepgram tested Nova 3 against diverse, real-world audio including accented speech, background noise, and overlapping conversations. The model maintains high accuracy even in challenging acoustic conditions that cause other STT systems to struggle.
Nova 2: The Proven Workhorse
Nova 2 established Deepgram's reputation for production-grade speech recognition. Key achievements include:
- 18% more accurate than the original Nova model
- 36% relative WER improvement over OpenAI Whisper (large)
- 30% average WER reduction compared to leading alternatives for both pre-recorded and real-time transcription
- Support for 30+ languages with consistent accuracy
For many voice AI applications, Nova 2 delivers sufficient accuracy at a lower cost point. The model has been battle-tested across thousands of production deployments, making it a reliable choice for teams prioritizing stability.
Nova 2 vs Nova 3: Making the Right Choice
The decision between Nova 2 and Nova 3 depends on your specific accuracy requirements, latency tolerance, and budget constraints.
Accuracy Comparison
| Metric | Nova 2 | Nova 3 |
|---|---|---|
| Streaming WER | ~12-15% | ~6.84% |
| WER vs. Competitors | 30% better | 54% better |
| Noisy Audio Handling | Good | Excellent |
| Accented Speech | Good | Excellent |
Nova 3's accuracy advantage is most pronounced in challenging audio conditions. If your voice AI handles diverse accents, background noise, or overlapping speech, Nova 3's improvements translate directly to better user experience.
Latency Profile
Both models deliver sub-300ms transcription latency for streaming audio, which is essential for conversational voice AI. However, Nova 3's architecture was specifically optimized for streaming workloads, delivering more consistent performance under load.
For batch processing of pre-recorded audio, Nova 2 can transcribe roughly an hour of audio in about 30 seconds with diarization enabled—fast enough for near-real-time analytics, though not instant.
Streaming vs Batch Processing
Understanding Deepgram's processing modes is critical for cost optimization:
Streaming (Real-time):
- WebSocket connection for continuous audio
- Sub-300ms latency
- Essential for live voice AI conversations
- Higher per-minute cost
Batch Processing:
- Upload complete audio files
- Processing runs many times faster than real time
- Ideal for call recordings, voicemail transcription
- Significantly lower cost
For voice AI applications, streaming is typically required during live calls. Batch processing makes sense for post-call analytics and transcript generation.
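The streaming-versus-batch tradeoff above is easy to quantify. A minimal sketch using the per-minute rates quoted later in this article ($0.0077/min streaming, $0.0043/min Nova 2 batch); the function name is illustrative:

```python
# Illustrative cost comparison between Deepgram processing modes,
# using the per-minute rates quoted in this article.

STREAMING_RATE = 0.0077  # USD per minute (Nova 2/3 streaming)
BATCH_RATE = 0.0043      # USD per minute (Nova 2 pre-recorded)

def monthly_stt_cost(minutes: float, streaming: bool) -> float:
    """Return the monthly STT cost in USD for a given audio volume."""
    rate = STREAMING_RATE if streaming else BATCH_RATE
    return round(minutes * rate, 2)

# 50,000 minutes/month of live calls vs. the same volume as recordings:
live = monthly_stt_cost(50_000, streaming=True)       # 385.0
recorded = monthly_stt_cost(50_000, streaming=False)  # 215.0
```

A hybrid architecture—streaming during live calls, batch for post-call transcripts—lets you pay the higher rate only where latency actually matters.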
Deepgram Pricing Breakdown (2026)
Understanding Deepgram's pricing structure helps you architect cost-effective voice AI systems.
Nova 3 Pricing
| Plan | Per-Minute Rate | Hourly Equivalent |
|---|---|---|
| Pay-As-You-Go | $0.0077/min | $0.46/hour |
| Growth Plan | $0.0065/min | $0.39/hour |
The Growth plan offers a 16% discount for committed usage. New accounts receive $200 in free credits to evaluate the platform.
Nova 2 Pricing
| Processing Mode | Per-Minute Rate | Hourly Equivalent |
|---|---|---|
| Batch (Pre-recorded) | $0.0043/min | $0.26/hour |
| Streaming | $0.0077/min | $0.46/hour |
Nova 2 batch processing is approximately 44% cheaper than streaming, making it the optimal choice for non-real-time workloads.
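The "approximately 44% cheaper" figure follows directly from the two published rates:

```python
# Deriving the batch-vs-streaming savings percentage from the rates above.
streaming_rate = 0.0077  # USD/min, Nova 2 streaming
batch_rate = 0.0043      # USD/min, Nova 2 batch

savings_pct = round((streaming_rate - batch_rate) / streaming_rate * 100)
# savings_pct == 44
```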
Cost Comparison vs. Competitors
| Provider | Per 1,000 Minutes |
|---|---|
| Deepgram Nova 3 (streaming) | $7.70 |
| Deepgram Nova 2 (batch) | $4.30 |
| OpenAI Whisper API | $6.00 |
| Google Speech-to-Text | $16.00 |
Deepgram undercuts the major cloud providers by 2x or more while delivering superior accuracy for conversational AI workloads.
Additional Costs
Speaker diarization adds approximately $0.001-0.002/min to the base transcription cost. For voice AI applications where you need to distinguish between caller and assistant speech, factor this into your budget.
Burki + Deepgram Integration
Burki's voice AI platform integrates natively with Deepgram Nova 2 and Nova 3, plus the streaming-optimized Deepgram Flux variant. This integration abstracts away the complexity of WebSocket connections, audio format handling, and error recovery.
Supported Deepgram Models in Burki
| Provider | Models | Languages | Features |
|---|---|---|---|
| Deepgram | Nova 2, Nova 3 | 30+ | Real-time, Confidence scoring, Smart formatting |
| Deepgram Flux | Nova 2 (streaming optimized) | 30+ | Ultra-low latency streaming |
Why Flux Matters
Deepgram Flux is a streaming-optimized variant specifically designed for voice AI applications. It delivers even lower latency than standard Nova 2 by optimizing the entire audio pipeline for real-time conversational workloads.
Configuring Deepgram STT in Burki
Burki provides granular control over Deepgram's STT behavior. Here's how to configure optimal settings for voice AI.
Model Selection
Choose your Deepgram model based on the accuracy-cost tradeoff:
STT Provider: Deepgram
Model: Nova 3 (highest accuracy) | Nova 2 (cost-optimized) | Flux (lowest latency)
Language Configuration
Set your primary language for optimal recognition:
Language: en-US (or appropriate locale)
Auto-detection: Enabled/Disabled based on use case
For single-language deployments, disabling auto-detection can improve accuracy by eliminating language-switching overhead.
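If you connect to Deepgram directly rather than through Burki's UI, model and language are passed as query parameters on the streaming endpoint. A minimal sketch—the parameter names follow Deepgram's documented query-string conventions, but verify them against the current API reference:

```python
from urllib.parse import urlencode

def listen_url(model: str = "nova-3", language: str = "en-US",
               interim_results: bool = True) -> str:
    """Build a Deepgram streaming endpoint URL with common options.

    Parameter names (model, language, punctuate, smart_format,
    interim_results) follow Deepgram's query-string conventions;
    check the current docs before relying on them.
    """
    params = {
        "model": model,
        "language": language,
        "punctuate": "true",        # automatic punctuation
        "smart_format": "true",     # numbers, dates, currency formatting
        "interim_results": str(interim_results).lower(),
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)
```

Burki builds an equivalent connection for you; the sketch just makes the underlying knobs visible.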
Endpointing Configuration
Endpointing determines when Deepgram considers an utterance complete. Proper configuration is essential for natural conversation flow:
| Parameter | Purpose | Recommended Value |
|---|---|---|
| Silence Threshold | Minimum silence before utterance end | 500-800ms |
| Minimum Silence Duration | Prevents premature cutoffs | 300ms |
| Utterance End Timeout | Maximum wait for speech resumption | 1000-1500ms |
| VAD Turnoff | Voice activity detection sensitivity | 500ms |
Shorter thresholds create faster responses but risk cutting off speakers mid-thought. Longer thresholds feel more natural but increase latency.
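The table's recommendations can be captured as named profiles, so an assistant picks a point on the speed-versus-patience spectrum instead of tuning four raw numbers. This is a hypothetical helper, not a Burki or Deepgram API:

```python
# Hypothetical endpointing profiles built from the recommended values
# in the table above; names and structure are illustrative.

def endpointing_profile(style: str) -> dict:
    """Map a conversation style to a set of endpointing thresholds (ms)."""
    profiles = {
        # Fast responses; slightly higher risk of cutting speakers off.
        "snappy":  {"silence_threshold_ms": 500, "min_silence_ms": 300,
                    "utterance_end_ms": 1000, "vad_turnoff_ms": 500},
        # More patient; feels natural but adds latency.
        "natural": {"silence_threshold_ms": 800, "min_silence_ms": 300,
                    "utterance_end_ms": 1500, "vad_turnoff_ms": 500},
    }
    if style not in profiles:
        raise ValueError(f"unknown style: {style}")
    return profiles[style]
```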
Smart Formatting Options
Enable Deepgram's formatting features for cleaner transcripts:
- Automatic Punctuation: Adds periods, commas, question marks
- Smart Formatting: Converts numbers, dates, currency to proper formats
- Interim Results: Provides partial transcripts for real-time display
Bring Your Own API Key
Burki supports per-assistant Deepgram API keys, allowing you to:
- Use your own Deepgram account for direct billing
- Apply custom vocabulary and trained models
- Access enterprise features and higher rate limits
Configure your Deepgram API key at the assistant level or fall back to organization-level credentials.
Optimizing Accuracy for Voice AI
Raw model accuracy is just the starting point. Production voice AI systems require additional optimization.
Audio Denoising with RNNoise
Burki integrates RNNoise for real-time audio denoising before STT processing. This neural network-based noise reduction can significantly improve accuracy in noisy environments:
- Per-Assistant Toggle: Enable selectively based on expected audio quality
- Hardware Acceleration: GPU support for low-latency processing
- Transparent Integration: Denoising happens automatically in the audio pipeline
For call center environments, office settings, or any scenario with predictable background noise, enabling RNNoise provides measurable accuracy improvements.
Keyword Boosting
Deepgram supports keyword boosting to improve recognition of domain-specific terms. This is critical for:
- Product names and model numbers
- Industry jargon and acronyms
- Proper nouns and brand names
Boosted keywords can achieve up to 90% higher keyword recall rate compared to baseline recognition.
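Keyword boosts are sent to Deepgram as repeated `keywords=term:intensifier` query parameters. A small sketch of encoding them—the `term:intensifier` form follows Deepgram's documented syntax, but confirm supported boost ranges in the current docs:

```python
from urllib.parse import urlencode

def keyword_params(keywords: dict[str, float]) -> str:
    """Encode keyword boosts as repeated `keywords=term:boost` params."""
    pairs = [("keywords", f"{term}:{boost:g}") for term, boost in keywords.items()]
    return urlencode(pairs)

keyword_params({"Burki": 2})       # "keywords=Burki%3A2"
keyword_params({"Nova": 1.5})      # "keywords=Nova%3A1.5"
```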
Handling Filler Words
Configure whether to transcribe filler words like "uh" and "um":
- Include Fillers: More natural transcripts, useful for sentiment analysis
- Exclude Fillers: Cleaner output, reduced token usage for LLM processing
For voice AI that feeds transcripts to an LLM for response generation, excluding fillers often produces better results.
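If your STT configuration does include fillers, you can still strip them before the transcript reaches the LLM. A minimal post-processing sketch; the filler list is an illustrative starting point:

```python
import re

# Illustrative filler set; extend for your language and domain.
FILLERS = {"uh", "um", "uhm", "er", "hmm"}

def strip_fillers(transcript: str) -> str:
    """Remove standalone filler words before sending a transcript to the LLM."""
    words = [w for w in transcript.split()
             if re.sub(r"\W", "", w).lower() not in FILLERS]
    return " ".join(words)

strip_fillers("Um, I need to, uh, reschedule my appointment")
# -> "I need to, reschedule my appointment"
```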
Confidence Score Monitoring
Deepgram provides per-word confidence scores with each transcript. Burki surfaces these for quality monitoring:
- Track average confidence across calls
- Flag low-confidence segments for review
- Identify patterns in recognition failures
Low confidence scores often indicate audio quality issues, network problems, or out-of-vocabulary terms that need boosting.
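Monitoring can be as simple as walking the per-word scores in each response. This sketch assumes the `channel.alternatives[0].words` shape of Deepgram's JSON responses, with a `confidence` field on each word; the 0.6 threshold is an illustrative choice:

```python
def low_confidence_words(result: dict, threshold: float = 0.6) -> list[str]:
    """Flag words below a confidence threshold in a Deepgram-style result."""
    words = result["channel"]["alternatives"][0]["words"]
    return [w["word"] for w in words if w["confidence"] < threshold]

def average_confidence(result: dict) -> float:
    """Average per-word confidence for a single transcript result."""
    words = result["channel"]["alternatives"][0]["words"]
    return sum(w["confidence"] for w in words) / len(words)
```

Track the average per call and alert when it dips; recurring low-confidence words are strong candidates for keyword boosting.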
Voice AI Pipeline Integration
STT is one component of Burki's real-time conversation pipeline:
Incoming Audio -> STT (Deepgram) -> LLM -> TTS -> Audio Output
(every stage connected over real-time WebSocket streaming)
Latency Optimization
Burki optimizes the entire pipeline, not just STT:
- Pipeline Overlapping: STT, LLM, and TTS processing overlap for minimal latency
- Interim Results: Begin LLM processing on partial transcripts
- Streaming TTS: Start audio playback before full response generation
With Deepgram's sub-300ms STT latency combined with Burki's pipeline optimization, total response times of 0.8-1.2 seconds are achievable.
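A back-of-the-envelope latency budget shows how overlapping stages reaches that range. Stage timings below are illustrative assumptions, not measurements; only the sub-300ms STT figure comes from this article:

```python
# Rough pipeline latency budget. Overlap credits time where STT, LLM,
# and TTS run concurrently (interim results, streaming TTS).

def total_latency_ms(stages: dict, overlap_ms: int = 0) -> int:
    """Sum per-stage latencies, crediting overlap between stages."""
    return sum(stages.values()) - overlap_ms

pipeline = {
    "stt_ms": 300,              # Deepgram streaming (sub-300ms)
    "llm_first_token_ms": 500,  # assumed time-to-first-token
    "tts_first_audio_ms": 300,  # assumed time-to-first-audio
}

total_latency_ms(pipeline)                  # 1100 (fully sequential)
total_latency_ms(pipeline, overlap_ms=300)  # 800 (with pipeline overlap)
```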
Interruption Handling
Natural conversation requires handling interruptions gracefully. Burki's STT integration supports:
- Interruption Threshold: Words required before allowing interruption
- Minimum Speaking Time: Grace period before interruption detection
- Transcript Buffering: Accumulates speech during cooldown periods
- LLM Cancellation: Cancels pending responses on interruption
These features work together to create conversational flow that feels natural while avoiding constant interruptions.
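The first two features—an interruption threshold and a minimum speaking time—can be sketched as a simple gate. This is a hypothetical illustration of the logic, not Burki's implementation; names and defaults are invented:

```python
# Hypothetical barge-in gate: allow an interruption only after the caller
# has produced enough words AND the assistant has had a short grace period.

class InterruptionGate:
    def __init__(self, min_words: int = 3, grace_ms: int = 500):
        self.min_words = min_words  # interruption threshold (words)
        self.grace_ms = grace_ms    # minimum assistant speaking time (ms)

    def should_interrupt(self, interim_words: int, assistant_speaking_ms: int) -> bool:
        """Decide whether interim STT output justifies cutting off TTS."""
        return (interim_words >= self.min_words
                and assistant_speaking_ms >= self.grace_ms)
```

Gating on word count (rather than raw voice activity) filters out coughs and backchannel noises that would otherwise cancel the assistant mid-sentence.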
Cost Analysis: Real-World Scenarios
Let's examine Deepgram costs for typical voice AI deployments.
Scenario 1: Customer Support Bot
- Volume: 10,000 calls/month
- Average Duration: 5 minutes
- Model: Nova 3 (streaming)
- Monthly Cost: 50,000 minutes x $0.0077 = $385/month
With Nova 2 streaming at the same rate, cost would be identical. The savings from Nova 2 come from batch processing, which isn't applicable for live calls.
Scenario 2: Appointment Scheduling
- Volume: 5,000 calls/month
- Average Duration: 3 minutes
- Model: Nova 2 (streaming)
- Monthly Cost: 15,000 minutes x $0.0077 = $115.50/month
Since Nova 2 and Nova 3 share the same streaming rate, there's no direct cost benefit to choosing Nova 2 for live calls—pick it when its accuracy is sufficient and you value its longer production track record.
Scenario 3: Post-Call Analytics
- Volume: 20,000 call recordings/month
- Average Duration: 8 minutes
- Model: Nova 2 (batch)
- Monthly Cost: 160,000 minutes x $0.0043 = $688/month
Batch processing for recordings delivers significant savings—44% less than streaming rates.
Scenario 4: High-Accuracy Sales Calls
- Volume: 2,000 calls/month
- Average Duration: 15 minutes
- Model: Nova 3 with Growth Plan
- Monthly Cost: 30,000 minutes x $0.0065 = $195/month
The Growth plan's 16% discount adds up for committed usage.
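All four scenarios follow the same arithmetic—volume times duration times the applicable rate:

```python
# Recomputing the four scenarios above from their stated volumes and rates.
scenarios = [
    # (name, total minutes/month, USD per minute)
    ("support_bot",  10_000 * 5,  0.0077),  # Nova 3 streaming
    ("scheduling",    5_000 * 3,  0.0077),  # Nova 2 streaming
    ("analytics",    20_000 * 8,  0.0043),  # Nova 2 batch
    ("sales_growth",  2_000 * 15, 0.0065),  # Nova 3, Growth plan
]

costs = {name: round(minutes * rate, 2) for name, minutes, rate in scenarios}
# {'support_bot': 385.0, 'scheduling': 115.5,
#  'analytics': 688.0, 'sales_growth': 195.0}
```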
Frequently Asked Questions
What's the accuracy difference between Nova 2 and Nova 3?
Nova 3 achieves approximately 54% lower word error rate than Nova 2 for streaming transcription. On real-world datasets, Nova 3 delivers median WER of 6.84% compared to Nova 2's approximately 12-15%. The difference is most noticeable with challenging audio: accents, background noise, and overlapping speech.
Should I use Nova 3 for all voice AI applications?
Not necessarily. Nova 2 provides excellent accuracy for many use cases at the same streaming price point. If your audio quality is good and your vocabulary is standard, Nova 2's accuracy may be sufficient. Nova 3 shines when you need the absolute best recognition for challenging conditions.
How does Deepgram compare to OpenAI Whisper?
Deepgram outperforms Whisper on real-time streaming transcription—Nova 2 alone delivers a 36% relative WER improvement over Whisper (large), and Nova 3 widens the gap—with significantly lower latency. Whisper's batch processing is competitive for pre-recorded audio but lacks the streaming capabilities voice AI requires.
What languages does Deepgram support?
Both Nova 2 and Nova 3 support 30+ languages and dialects for real-time and recorded audio. English receives the most optimization, but major world languages are well-supported.
Can I use custom vocabulary with Deepgram?
Yes. Deepgram supports keyword boosting for improved recognition of custom terms. For enterprise needs, custom model training is available. Through Burki's BYO API key feature, you can leverage any Deepgram account features including custom models.
What's the minimum audio quality required?
Deepgram handles a wide range of audio quality, but optimal results require:
- Sample rate: 8kHz minimum, 16kHz+ recommended
- Clear speech without heavy compression
- Background noise within reasonable limits
Burki's RNNoise integration can improve results for noisy audio.
How does pricing work for speaker diarization?
Speaker diarization adds approximately $0.001-0.002 per minute on top of base transcription costs. For voice AI calls, diarization helps distinguish between caller and assistant speech for analytics and training purposes.
What happens if Deepgram is unavailable?
Burki supports fallback STT providers. Configure Azure Speech as a backup to maintain availability if Deepgram experiences issues.
Getting Started with Deepgram on Burki
Setting up Deepgram STT on Burki takes minutes:
- Create a Burki account and access the assistant builder
- Navigate to STT configuration in your assistant settings
- Select Deepgram as your provider
- Choose your model (Nova 3 for accuracy, Nova 2 for cost optimization, Flux for lowest latency)
- Configure endpointing based on your conversation style
- Enable audio denoising if handling noisy environments
- Test with the web call interface before going live
For production deployments, consider bringing your own Deepgram API key for direct billing and access to enterprise features.
Conclusion
Deepgram's Nova models deliver the accuracy, latency, and cost profile that voice AI applications demand. Nova 3's 54% WER improvement makes it the clear choice for high-stakes conversations, while Nova 2 remains a cost-effective option for simpler use cases.
Burki's native integration eliminates the complexity of managing Deepgram connections, audio formats, and error handling. Combined with RNNoise denoising, intelligent endpointing, and pipeline optimization, you get production-grade STT performance without the infrastructure overhead.
Ready to build voice AI with industry-leading speech recognition? Start your free Burki trial with 200 minutes of calls included—no credit card required.
Last updated: January 2026