
Deepgram Nova 2/3 for Voice AI: The Developer's Guide to Production-Grade STT

Deep dive into Deepgram Nova 2 and Nova 3 for voice AI applications. Compare accuracy, latency, pricing, and learn how to optimize STT quality with Burki's integration.

Meeran Malik
13 min read

When your voice AI misunderstands a customer's order or mishears a critical support request, the consequences cascade through your entire application. A 5% word error rate might sound acceptable in benchmarks, but in production, that translates to frustrated users, failed transactions, and eroded trust. Speech-to-text accuracy isn't just a metric to optimize—it's the foundation that determines whether your voice AI succeeds or fails.

After evaluating dozens of STT providers for voice AI applications, Deepgram's Nova models consistently deliver the accuracy and latency profile that production systems demand. In this guide, we'll examine exactly what makes Nova 2 and Nova 3 different, when to use each, and how Burki's integration helps you extract maximum performance from both models.

Why Deepgram Leads Voice AI Transcription

Deepgram built their speech recognition models specifically for conversational AI workloads. Unlike general-purpose transcription services that prioritize pre-recorded content, Deepgram optimized for the real-time, low-latency requirements that voice assistants demand.

Nova 3: Setting New Accuracy Benchmarks

Released in February 2025, Nova 3 represents Deepgram's most accurate model to date. The numbers tell the story:

  • 54.3% reduction in word error rate for streaming audio compared to previous models
  • Median WER of 6.84% on real-world datasets—a 54.2% improvement over the next-best alternative at 14.92%
  • Sub-300ms transcription latency enabling natural conversational flow

These aren't cherry-picked benchmarks. Deepgram tested Nova 3 against diverse, real-world audio including accented speech, background noise, and overlapping conversations. The model maintains high accuracy even in challenging acoustic conditions that cause other STT systems to struggle.
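As a sanity check, the headline improvement follows directly from the two medians quoted above:

```python
# Relative WER improvement: (baseline - new) / baseline
baseline_wer = 14.92  # next-best alternative, median WER (%)
nova3_wer = 6.84      # Nova 3 median WER (%)

improvement = (baseline_wer - nova3_wer) / baseline_wer * 100
print(f"Relative improvement: {improvement:.1f}%")  # → 54.2%
```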

Nova 2: The Proven Workhorse

Nova 2 established Deepgram's reputation for production-grade speech recognition. Key achievements include:

  • 18% more accurate than the original Nova model
  • 36% relative WER improvement over OpenAI Whisper (large)
  • 30% average WER reduction compared to leading alternatives for both pre-recorded and real-time transcription
  • Support for 30+ languages with consistent accuracy

For many voice AI applications, Nova 2 delivers sufficient accuracy at a lower cost point. The model has been battle-tested across thousands of production deployments, making it a reliable choice for teams prioritizing stability.

Nova 2 vs Nova 3: Making the Right Choice

The decision between Nova 2 and Nova 3 depends on your specific accuracy requirements, latency tolerance, and budget constraints.

Accuracy Comparison

Metric                 Nova 2       Nova 3
Streaming WER          ~12-15%      ~6.84%
WER vs. Competitors    30% better   54% better
Noisy Audio Handling   Good         Excellent
Accented Speech        Good         Excellent

Nova 3's accuracy advantage is most pronounced in challenging audio conditions. If your voice AI handles diverse accents, background noise, or overlapping speech, Nova 3's improvements translate directly to better user experience.

Latency Profile

Both models deliver sub-300ms transcription latency for streaming audio, which is essential for conversational voice AI. However, Nova 3's architecture was specifically optimized for streaming workloads, delivering more consistent performance under load.

For batch processing of pre-recorded audio, Nova 2 transcribes an hour of audio in approximately 29.8 seconds with diarization enabled—fast enough for near-real-time applications but not instant.

Streaming vs Batch Processing

Understanding Deepgram's processing modes is critical for cost optimization:

Streaming (Real-time):

  • WebSocket connection for continuous audio
  • Sub-300ms latency
  • Essential for live voice AI conversations
  • Higher per-minute cost

Batch Processing:

  • Upload complete audio files
  • Processing completes far faster than real time
  • Ideal for call recordings, voicemail transcription
  • Significantly lower cost

For voice AI applications, streaming is typically required during live calls. Batch processing makes sense for post-call analytics and transcript generation.
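Both modes hit the same `/v1/listen` path on Deepgram's API: streaming upgrades to a WebSocket, while batch is a plain HTTPS POST of an audio file or URL. A minimal sketch of how the two request URLs differ (pure string construction, no network call, no SDK assumed):

```python
from urllib.parse import urlencode

def streaming_url(model: str = "nova-3", language: str = "en-US") -> str:
    """WebSocket endpoint for live transcription (Deepgram /v1/listen)."""
    params = urlencode({"model": model, "language": language})
    return f"wss://api.deepgram.com/v1/listen?{params}"

def batch_url(model: str = "nova-2") -> str:
    """HTTPS endpoint for pre-recorded audio (same path, POST a file or URL)."""
    return f"https://api.deepgram.com/v1/listen?{urlencode({'model': model})}"

print(streaming_url())
print(batch_url())
```

Burki handles this plumbing for you; the sketch just shows what differs under the hood.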

Deepgram Pricing Breakdown (2026)

Understanding Deepgram's pricing structure helps you architect cost-effective voice AI systems.

Nova 3 Pricing

Plan            Per-Minute Rate   Hourly Equivalent
Pay-As-You-Go   $0.0077/min       $0.46/hour
Growth Plan     $0.0065/min       $0.39/hour

The Growth plan offers a 16% discount for committed usage. New accounts receive $200 in free credits to evaluate the platform.

Nova 2 Pricing

Processing Mode        Per-Minute Rate   Hourly Equivalent
Batch (Pre-recorded)   $0.0043/min       $0.26/hour
Streaming              $0.0077/min       $0.46/hour

Nova 2 batch processing is approximately 44% cheaper than streaming, making it the optimal choice for non-real-time workloads.
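That 44% figure falls straight out of the two rates above:

```python
streaming_rate = 0.0077  # $/min, Nova 2 streaming
batch_rate = 0.0043      # $/min, Nova 2 batch

discount = (streaming_rate - batch_rate) / streaming_rate * 100
print(f"Batch discount vs streaming: {discount:.0f}%")  # → 44%
```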

Cost Comparison vs. Competitors

Provider                  Per 1,000 Minutes
Deepgram Nova 2 (batch)   $4.30
OpenAI Whisper API        $6.00
Google Speech-to-Text     $16.00

Deepgram delivers 2x+ cost savings compared to cloud providers while offering superior accuracy for conversational AI workloads.

Additional Costs

Speaker diarization adds approximately $0.001-0.002/min to the base transcription cost. For voice AI applications where you need to distinguish between caller and assistant speech, factor this into your budget.

Burki + Deepgram Integration

Burki's voice AI platform integrates natively with Deepgram Nova 2 and Nova 3, plus the streaming-optimized Deepgram Flux variant. This integration abstracts away the complexity of WebSocket connections, audio format handling, and error recovery.

Supported Deepgram Models in Burki

Provider        Models                         Languages   Features
Deepgram        Nova 2, Nova 3                 30+         Real-time, confidence scoring, smart formatting
Deepgram Flux   Nova 2 (streaming optimized)   30+         Ultra-low latency streaming

Why Flux Matters

Deepgram Flux is a streaming-optimized variant specifically designed for voice AI applications. It delivers even lower latency than standard Nova 2 by optimizing the entire audio pipeline for real-time conversational workloads.

Configuring Deepgram STT in Burki

Burki provides granular control over Deepgram's STT behavior. Here's how to configure optimal settings for voice AI.

Model Selection

Choose your Deepgram model based on the accuracy-cost tradeoff:

STT Provider: Deepgram
Model: Nova 3 (highest accuracy) | Nova 2 (cost-optimized) | Flux (lowest latency)

Language Configuration

Set your primary language for optimal recognition:

Language: en-US (or appropriate locale)
Auto-detection: Enabled/Disabled based on use case

For single-language deployments, disabling auto-detection can improve accuracy by eliminating language-switching overhead.

Endpointing Configuration

Endpointing determines when Deepgram considers an utterance complete. Proper configuration is essential for natural conversation flow:

Parameter                  Purpose                                Recommended Value
Silence Threshold          Minimum silence before utterance end   500-800ms
Minimum Silence Duration   Prevents premature cutoffs             300ms
Utterance End Timeout      Maximum wait for speech resumption     1000-1500ms
VAD Turnoff                Voice activity detection sensitivity   500ms

Shorter thresholds create faster responses but risk cutting off speakers mid-thought. Longer thresholds feel more natural but increase latency.
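Under the hood, settings like these map to Deepgram's streaming query parameters. `endpointing`, `utterance_end_ms`, `interim_results`, and `vad_events` are real Deepgram options; the exact mapping from Burki's labels to them is an assumption for illustration:

```python
from urllib.parse import urlencode

def endpointing_params(silence_ms: int = 500, utterance_end_ms: int = 1000) -> str:
    """Build the endpointing-related query string for Deepgram live streaming.

    endpointing: ms of trailing silence before an utterance is finalized
    utterance_end_ms: max wait for speech to resume (requires interim_results)
    """
    return urlencode({
        "endpointing": silence_ms,
        "utterance_end_ms": utterance_end_ms,
        "interim_results": "true",
        "vad_events": "true",
    })

print(endpointing_params())
```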

Smart Formatting Options

Enable Deepgram's formatting features for cleaner transcripts:

  • Automatic Punctuation: Adds periods, commas, question marks
  • Smart Formatting: Converts numbers, dates, currency to proper formats
  • Interim Results: Provides partial transcripts for real-time display

Bring Your Own API Key

Burki supports per-assistant Deepgram API keys, allowing you to:

  • Use your own Deepgram account for direct billing
  • Apply custom vocabulary and trained models
  • Access enterprise features and higher rate limits

Configure your Deepgram API key at the assistant level or fall back to organization-level credentials.

Optimizing Accuracy for Voice AI

Raw model accuracy is just the starting point. Production voice AI systems require additional optimization.

Audio Denoising with RNNoise

Burki integrates RNNoise for real-time audio denoising before STT processing. This neural network-based noise reduction can significantly improve accuracy in noisy environments:

  • Per-Assistant Toggle: Enable selectively based on expected audio quality
  • Hardware Acceleration: GPU support for low-latency processing
  • Transparent Integration: Denoising happens automatically in the audio pipeline

For call center environments, office settings, or any scenario with predictable background noise, enabling RNNoise provides measurable accuracy improvements.

Keyword Boosting

Deepgram supports keyword boosting to improve recognition of domain-specific terms. This is critical for:

  • Product names and model numbers
  • Industry jargon and acronyms
  • Proper nouns and brand names

Boosted keywords can achieve up to 90% higher keyword recall rate compared to baseline recognition.
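On Nova 2, boosting is passed as repeated `keywords=term:intensifier` query parameters (Nova 3 moves to Deepgram's newer keyterm prompting, so check current docs for that model). A small sketch with hypothetical terms:

```python
from urllib.parse import urlencode

def keyword_params(boosts: dict[str, int]) -> str:
    """Build repeated `keywords` query params (term:intensifier) for Deepgram.

    The example terms below are hypothetical, not from a real deployment.
    """
    pairs = [("keywords", f"{term}:{boost}") for term, boost in boosts.items()]
    return urlencode(pairs)

print(keyword_params({"Burki": 5, "RNNoise": 3}))
# → keywords=Burki%3A5&keywords=RNNoise%3A3
```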

Handling Filler Words

Configure whether to transcribe filler words like "uh" and "um":

  • Include Fillers: More natural transcripts, useful for sentiment analysis
  • Exclude Fillers: Cleaner output, reduced token usage for LLM processing

For voice AI that feeds transcripts to an LLM for response generation, excluding fillers often produces better results.

Confidence Score Monitoring

Deepgram provides per-word confidence scores with each transcript. Burki surfaces these for quality monitoring:

  • Track average confidence across calls
  • Flag low-confidence segments for review
  • Identify patterns in recognition failures

Low confidence scores often indicate audio quality issues, network problems, or out-of-vocabulary terms that need boosting.
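Here's a quick sketch of how you might flag low-confidence words yourself. The sample payload is synthetic, but it mirrors the `results.channels[].alternatives[].words[]` shape Deepgram returns for transcripts:

```python
def low_confidence_words(response: dict, threshold: float = 0.6) -> list[str]:
    """Flag words below a confidence threshold in a Deepgram-style response."""
    words = response["results"]["channels"][0]["alternatives"][0]["words"]
    return [w["word"] for w in words if w["confidence"] < threshold]

# Synthetic payload mirroring Deepgram's response shape
sample = {"results": {"channels": [{"alternatives": [{
    "words": [
        {"word": "reschedule", "confidence": 0.98},
        {"word": "Burki", "confidence": 0.41},  # likely needs keyword boosting
    ]
}]}]}}

print(low_confidence_words(sample))  # → ['Burki']
```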

Voice AI Pipeline Integration

STT is one component of Burki's real-time conversation pipeline:

Incoming Audio -> STT (Deepgram) -> LLM -> TTS -> Audio Output
                     |
              Real-time WebSocket Streaming

Latency Optimization

Burki optimizes the entire pipeline, not just STT:

  • Pipeline Overlapping: STT, LLM, and TTS processing overlap for minimal latency
  • Interim Results: Begin LLM processing on partial transcripts
  • Streaming TTS: Start audio playback before full response generation

With Deepgram's sub-300ms STT latency combined with Burki's pipeline optimization, total response times of 0.8-1.2 seconds are achievable.
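As a rough illustration of how such a budget adds up, here's one plausible breakdown. The per-stage numbers are assumptions chosen to land inside that 0.8-1.2 second range, not measured Burki figures:

```python
# Illustrative per-stage latency budget (ms); values are assumptions,
# not measurements, chosen to match the article's 0.8-1.2s range.
budget_ms = {
    "stt_final_transcript": 300,   # Deepgram sub-300ms streaming
    "llm_first_token": 400,
    "tts_first_audio": 250,
    "network_overhead": 100,
}

total = sum(budget_ms.values())
print(f"End-to-end: {total} ms")  # → 1050 ms
```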

Interruption Handling

Natural conversation requires handling interruptions gracefully. Burki's STT integration supports:

  • Interruption Threshold: Words required before allowing interruption
  • Minimum Speaking Time: Grace period before interruption detection
  • Transcript Buffering: Accumulates speech during cooldown periods
  • LLM Cancellation: Cancels pending responses on interruption

These features work together to create conversational flow that feels natural while avoiding constant interruptions.
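A toy version of the first two rules might look like this; the function name and default thresholds are hypothetical, not Burki's actual API:

```python
def should_interrupt(words_heard: int, speaking_ms: int,
                     word_threshold: int = 3, min_speaking_ms: int = 500) -> bool:
    """Hypothetical interruption gate: the caller must say enough words,
    and the assistant must have spoken past a grace period, before we cut TTS."""
    return words_heard >= word_threshold and speaking_ms >= min_speaking_ms

print(should_interrupt(1, 800))  # → False (too few words)
print(should_interrupt(4, 800))  # → True
print(should_interrupt(4, 200))  # → False (within grace period)
```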

Cost Analysis: Real-World Scenarios

Let's examine Deepgram costs for typical voice AI deployments.

Scenario 1: Customer Support Bot

  • Volume: 10,000 calls/month
  • Average Duration: 5 minutes
  • Model: Nova 3 (streaming)
  • Monthly Cost: 50,000 minutes x $0.0077 = $385/month

With Nova 2 streaming at the same rate, cost would be identical. The savings from Nova 2 come from batch processing, which isn't applicable for live calls.

Scenario 2: Appointment Scheduling

  • Volume: 5,000 calls/month
  • Average Duration: 3 minutes
  • Model: Nova 2 (streaming)
  • Monthly Cost: 15,000 minutes x $0.0077 = $115.50/month

Note that Nova 2 streaming costs the same per minute as Nova 3, so there is no direct savings from the model choice here. Pick Nova 2 when its accuracy is sufficient for the use case, not to cut costs.

Scenario 3: Post-Call Analytics

  • Volume: 20,000 call recordings/month
  • Average Duration: 8 minutes
  • Model: Nova 2 (batch)
  • Monthly Cost: 160,000 minutes x $0.0043 = $688/month

Batch processing for recordings delivers significant savings—44% less than streaming rates.

Scenario 4: High-Accuracy Sales Calls

  • Volume: 2,000 calls/month
  • Average Duration: 15 minutes
  • Model: Nova 3 with Growth Plan
  • Monthly Cost: 30,000 minutes x $0.0065 = $195/month

The Growth plan's 16% discount adds up for committed usage.
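All four scenarios reduce to the same arithmetic, which you can sanity-check in a few lines:

```python
def monthly_cost(calls: int, avg_minutes: float, rate_per_min: float) -> float:
    """Monthly STT cost: total minutes x per-minute rate, in dollars."""
    return round(calls * avg_minutes * rate_per_min, 2)

# The four scenarios from this section
print(monthly_cost(10_000, 5, 0.0077))   # → 385.0  (support bot, Nova 3)
print(monthly_cost(5_000, 3, 0.0077))    # → 115.5  (scheduling, Nova 2)
print(monthly_cost(20_000, 8, 0.0043))   # → 688.0  (analytics, Nova 2 batch)
print(monthly_cost(2_000, 15, 0.0065))   # → 195.0  (sales, Nova 3 Growth)
```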

Frequently Asked Questions

What's the accuracy difference between Nova 2 and Nova 3?

Nova 3 achieves approximately 54% lower word error rate than Nova 2 for streaming transcription. On real-world datasets, Nova 3 delivers median WER of 6.84% compared to Nova 2's approximately 12-15%. The difference is most noticeable with challenging audio: accents, background noise, and overlapping speech.

Should I use Nova 3 for all voice AI applications?

Not necessarily. Nova 2 provides excellent accuracy for many use cases at the same streaming price point. If your audio quality is good and your vocabulary is standard, Nova 2's accuracy may be sufficient. Nova 3 shines when you need the absolute best recognition for challenging conditions.

How does Deepgram compare to OpenAI Whisper?

Deepgram outperforms Whisper on real-time streaming transcription: Nova 2 already achieved 36% lower WER than Whisper (large), and Nova 3 widens that gap further, with significantly lower latency. Whisper's batch processing is competitive for pre-recorded audio but lacks the streaming capabilities voice AI requires.

What languages does Deepgram support?

Both Nova 2 and Nova 3 support 30+ languages and dialects for real-time and recorded audio. English receives the most optimization, but major world languages are well-supported.

Can I use custom vocabulary with Deepgram?

Yes. Deepgram supports keyword boosting for improved recognition of custom terms. For enterprise needs, custom model training is available. Through Burki's BYO API key feature, you can leverage any Deepgram account features including custom models.

What's the minimum audio quality required?

Deepgram handles a wide range of audio quality, but optimal results require:

  • Sample rate: 8kHz minimum, 16kHz+ recommended
  • Clear speech without heavy compression
  • Background noise within reasonable limits

Burki's RNNoise integration can improve results for noisy audio.

How does pricing work for speaker diarization?

Speaker diarization adds approximately $0.001-0.002 per minute on top of base transcription costs. For voice AI calls, diarization helps distinguish between caller and assistant speech for analytics and training purposes.

What happens if Deepgram is unavailable?

Burki supports fallback STT providers. Configure Azure Speech as a backup to maintain availability if Deepgram experiences issues.

Getting Started with Deepgram on Burki

Setting up Deepgram STT on Burki takes minutes:

  1. Create a Burki account and access the assistant builder
  2. Navigate to STT configuration in your assistant settings
  3. Select Deepgram as your provider
  4. Choose your model (Nova 3 for accuracy, Nova 2 for cost optimization, Flux for lowest latency)
  5. Configure endpointing based on your conversation style
  6. Enable audio denoising if handling noisy environments
  7. Test with the web call interface before going live

For production deployments, consider bringing your own Deepgram API key for direct billing and access to enterprise features.

Conclusion

Deepgram's Nova models deliver the accuracy, latency, and cost profile that voice AI applications demand. Nova 3's 54% WER improvement makes it the clear choice for high-stakes conversations, while Nova 2 remains a cost-effective option for simpler use cases.

Burki's native integration eliminates the complexity of managing Deepgram connections, audio formats, and error handling. Combined with RNNoise denoising, intelligent endpointing, and pipeline optimization, you get production-grade STT performance without the infrastructure overhead.

Ready to build voice AI with industry-leading speech recognition? Start your free Burki trial with 200 minutes of calls included—no credit card required.


Last updated: January 2026
