Cartesia Sonic
Sub-100ms TTS
Cartesia's text-to-speech model. Sonic-3 reports a time-to-first-audio of 40–90 milliseconds, among the fastest in production. It is built on a state-space-model (SSM) architecture (the founding team came out of state-space-model research at Stanford, the line of work that led to Mamba) and is optimized for live voice agents and call-center deployments, where latency kills conversation quality.
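Time-to-first-audio is the delay between sending a synthesis request and receiving the first playable audio chunk. The sketch below shows one way to measure it against any streaming source. The names time_to_first_audio and fake_tts_stream are hypothetical and used only for illustration; the fake stream stands in for a real streaming client and is not Cartesia's API.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, bytes]:
    """Return (milliseconds until the first non-empty chunk, that chunk).

    The timer starts when iteration begins, so pass a lazy stream whose
    request is only issued once the first chunk is pulled.
    """
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return (clock() - start) * 1000.0, chunk
    raise ValueError("stream ended without producing audio")


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (not a real API)."""
    time.sleep(delay_s)        # simulated network + model latency
    yield b"\x00\x01" * 160    # first audio chunk (320 bytes)
    yield b"\x00\x01" * 160    # subsequent chunk


if __name__ == "__main__":
    ms, first = time_to_first_audio(fake_tts_stream())
    print(f"time to first audio: {ms:.1f} ms ({len(first)} bytes)")
```

In a real measurement the figure also includes network round-trip time, so results depend on where the client runs relative to the service.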
Why it matters
Sonic made low-latency TTS competitive with quality-tier alternatives, and it is a major reason voice agents stopped feeling laggy in 2024–2025.
Core Capabilities
Generative
Produces images, video, audio, or other media.
Audio
Speech, music, or other audio understanding/synthesis.
Multimodal
Combines text, vision, and audio in one model.
Context Window
Not disclosed
Availability
API
Available
Product / App
Not available
Open Source
Not released
Enterprise
Contact sales
Pricing Model
Pay per token
Input and output billed separately.
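As a rough illustration of separate input and output billing, the sketch below adds the two charges for a single request. The function name usage_cost and the rates in the example are placeholders chosen for the arithmetic, not published Cartesia prices.

```python
from decimal import Decimal


def usage_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: Decimal,
    output_rate_per_million: Decimal,
) -> Decimal:
    """Total cost when input and output tokens are billed at separate rates.

    Rates are expressed per one million tokens. Decimal avoids the
    rounding drift that floats introduce in currency arithmetic.
    """
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) * input_rate_per_million / million
    output_cost = Decimal(output_tokens) * output_rate_per_million / million
    return input_cost + output_cost


if __name__ == "__main__":
    # Placeholder rates for illustration only (assumed, not real pricing).
    cost = usage_cost(2_000, 50_000, Decimal("1.00"), Decimal("10.00"))
    print(f"estimated cost: {cost}")
```

Because the two rates differ, a workload that produces far more output than input is dominated by the output rate.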
What it feels like
- Audio model from Cartesia — see the linked sources below for benchmark and review coverage
- Audio synthesis or transcription per the published model card
Best use cases
- Audio synthesis / transcription tasks per the model card
- See the model spec and sources block for benchmarked use cases
Not ideal for
- Tasks far outside the modalities listed in this model's spec
- Workflows where a more recent successor in the same family scores higher