AUDIO MODEL Cartesia

Cartesia Sonic

Sub-100ms TTS

Cartesia's text-to-speech model. Sonic-3 hits 40–90 millisecond time-to-first-audio, the fastest in production. Built on a state-space-model architecture (the team came out of the Stanford research group behind the state-space-model work that led to Mamba), it is optimized for live voice agents and call-center deployments, where latency kills conversation quality.
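
Time-to-first-audio is the delay between sending a synthesis request and receiving the first audible chunk of the response. The sketch below shows one way to measure it over any stream of audio chunks. It is an illustration, not Cartesia's API: `time_to_first_audio` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in that simulates a streaming TTS response with a fixed delay.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Return (seconds until the first non-empty chunk, total bytes received).

    The first element is None if the stream never yields audio.
    """
    start = clock()
    first: Optional[float] = None
    total = 0
    for chunk in chunks:
        # Only the first non-empty chunk counts as "first audio".
        if chunk and first is None:
            first = clock() - start
        total += len(chunk)
    return first, total


def fake_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (not a real API)."""
    time.sleep(delay_s)  # simulated model and network latency
    for _ in range(n_chunks):
        yield b"\x00" * 320  # placeholder audio bytes


if __name__ == "__main__":
    ttfa, total = time_to_first_audio(fake_stream())
    if ttfa is not None:
        print(f"time-to-first-audio: {ttfa * 1000:.0f} ms, {total} bytes")
```

To measure a real service, pass its streaming response iterator in place of `fake_stream()`. Start the timer before the request is sent, so that network round-trip time is included in the figure.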

Why it matters

Sonic made low-latency TTS competitive with quality-tier alternatives. Cartesia is a major reason voice agents stopped feeling laggy in 2024–2025.

Core Capabilities

Generative
Produces images, video, audio, or other media.
Audio
Speech, music, or other audio understanding/synthesis.
Multimodal
Combines text, vision, and audio in one model.

Context Window

Context window not disclosed.

Availability

API
Available
Product / App
Not available
Open Source
Not released
Enterprise
Contact sales

Pricing Model

Pay per token
Input and output billed separately.
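
With input and output billed separately, the cost of a request is the sum of two usage-times-rate terms. A minimal sketch of that arithmetic follows. The `Rates` class and `estimate_cost` function are hypothetical names, and the rates are placeholders to be filled in from the provider's pricing page. No actual Cartesia prices are assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Placeholder per-1,000-token rates; take real values from the pricing page."""
    input_per_1k: float
    output_per_1k: float


def estimate_cost(input_tokens: int, output_tokens: int, rates: Rates) -> float:
    """Estimate request cost when input and output are billed at separate rates."""
    input_cost = (input_tokens / 1000) * rates.input_per_1k
    output_cost = (output_tokens / 1000) * rates.output_per_1k
    return input_cost + output_cost
```

Calling `estimate_cost(1000, 500, Rates(input_per_1k=1.0, output_per_1k=2.0))` adds one unit of input cost to one unit of output cost. The figures are arbitrary and only show how the two terms combine.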

What it feels like

  • Audio model from Cartesia; see the linked sources below for benchmark and review coverage
  • Speech synthesis (text-to-speech) per the published model card

Best use cases

  • Text-to-speech for live voice agents and call-center deployments, where latency matters most
  • See the model spec and sources block for benchmarked use cases

Not ideal for

  • Tasks far outside the modalities listed in this model's spec
  • Workflows where a more recent successor in the same family scores higher