AUDIO MODEL · May 2025 · Cartesia · Last updated: Apr 29, 2026

Cartesia Sonic

Sub-100ms TTS

Cartesia's text-to-speech model. Sonic-3 reaches a 40–90 millisecond time-to-first-audio, billed as the fastest in production. It is built on a state-space-model architecture (the team came out of the Stanford research group behind Mamba) and is optimized for live voice agents and call-center deployments, where latency kills conversation quality.
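Time-to-first-audio is the delay between sending a request and receiving the first playable audio chunk. A minimal sketch of how that figure could be measured against any streaming TTS client is shown here. The `fake_tts_stream` generator is a hypothetical stand-in with a simulated delay, not Cartesia's API, and `time_to_first_audio` is an illustrative helper name.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_audio(
    start_stream: Callable[[], Iterable[bytes]],
) -> Tuple[float, bytes]:
    """Return (milliseconds until the first non-empty chunk, that chunk)."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # ignore empty keep-alive chunks
            elapsed_ms = (time.perf_counter() - started) * 1000.0
            return elapsed_ms, chunk
    raise RuntimeError("stream ended before any audio arrived")


def fake_tts_stream(delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (assumption)."""
    time.sleep(delay_s)  # simulated model and network latency
    for _ in range(chunks):
        yield b"\x00" * 320  # placeholder bytes, not real audio


if __name__ == "__main__":
    ttfa_ms, first_chunk = time_to_first_audio(fake_tts_stream)
    print(f"time to first audio: {ttfa_ms:.1f} ms ({len(first_chunk)} bytes)")
```

To measure a real service, pass a zero-argument callable that opens the stream and yields audio chunks. The timer starts before the request is issued, so connection setup counts toward the result.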


Why it matters

Sonic made low-latency TTS competitive with quality-tier alternatives. Cartesia is a major reason voice agents stopped feeling laggy in 2024–2025.

Core Capabilities

Generative

Produces images, video, audio, or other media.

Audio

Speech, music, or other audio understanding/synthesis.

Multimodal

Combines text, vision, and audio in one model.

Context Window

Context window not disclosed.

Availability

API: Available
Product / App: Not available
Open Source: Not released
Enterprise: Contact sales

Pricing Model

Pay per token

Input and output billed separately.
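Under separate input and output billing, the cost of a request is the sum of two terms, each a token count multiplied by its own rate. The sketch below shows that arithmetic. The function name, the per-million-token rate convention, and the numbers in the usage example are all assumptions for illustration, since this card does not list prices.

```python
from decimal import Decimal


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: Decimal,
    output_rate_per_million: Decimal,
) -> Decimal:
    """Cost of one request when input and output are billed separately.

    Rates are expressed per one million tokens (an assumed convention).
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    million = Decimal(1_000_000)
    input_cost = Decimal(input_tokens) * input_rate_per_million / million
    output_cost = Decimal(output_tokens) * output_rate_per_million / million
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical rates, not published prices.
    cost = request_cost(2_000, 500, Decimal("1.00"), Decimal("4.00"))
    print(f"estimated cost: {cost}")
```

`Decimal` is used instead of floats so that small per-request amounts add up without rounding drift when summed over many requests.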


What it feels like

Audio model from Cartesia; see the linked sources below for benchmark and review coverage
Audio synthesis or transcription per the published model card

Best use cases

Audio synthesis / transcription tasks per the model card
See the model spec and sources block for benchmarked use cases

Not ideal for

Tasks far outside the modalities listed in this model's spec
Workflows where a more recent successor in the same family scores higher