Text-to-Speech & Realtime Voice

🎙️ Speech & Audio APIs 15 min 250 BASE XP

Generating Speech

The gpt-4o-mini-tts model generates natural-sounding speech and lets you steer tone, emotion, and delivery style through a plain-language instructions parameter.

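import fs from "node:fs/promises";
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const audio = await openai.audio.speech.create({
  model: "gpt-4o-mini-tts",
  voice: "coral",
  input: "Welcome to the Infinity Tech Stack Academy!",
  instructions: "Speak with enthusiasm and energy, like a tech conference host.",
  response_format: "mp3"
});
// The response body is binary audio, so write the bytes to a file.
await fs.writeFile("welcome.mp3", Buffer.from(await audio.arrayBuffer()));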

Steerable TTS

Traditional TTS reads text with a fixed delivery. gpt-4o-mini-tts also accepts an instructions string that controls how the text is spoken: tone, pacing, emotion, accent, and emphasis.
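
One way to keep instructions consistent across an app is to assemble them from structured style fields instead of hand-writing each string. The sketch below is only an illustration: buildInstructions, buildSpeechRequest, and the style field names are hypothetical helpers, not part of the OpenAI SDK. The object it returns has the same shape as the argument passed to openai.audio.speech.create in the example above.

```javascript
// Hypothetical helper: turn structured style fields into one instruction string.
function buildInstructions(style = {}) {
  const parts = [];
  if (style.tone) parts.push(`Tone: ${style.tone}.`);
  if (style.pacing) parts.push(`Pacing: ${style.pacing}.`);
  if (style.emotion) parts.push(`Emotion: ${style.emotion}.`);
  if (style.emphasis) parts.push(`Emphasize: ${style.emphasis}.`);
  return parts.join(" ");
}

// Hypothetical helper: build the request object for audio.speech.create.
function buildSpeechRequest(text, style = {}, voice = "coral") {
  if (typeof text !== "string" || text.trim() === "") {
    throw new Error("input text must be a non-empty string");
  }
  const request = {
    model: "gpt-4o-mini-tts",
    voice,
    input: text,
    response_format: "mp3",
  };
  const instructions = buildInstructions(style);
  if (instructions) request.instructions = instructions; // omitted when no style is given
  return request;
}

// Same text, two deliveries: only the instructions differ.
const calmRequest = buildSpeechRequest("Your build has finished.", {
  tone: "calm and reassuring",
  pacing: "slow",
});
const hypeRequest = buildSpeechRequest("Your build has finished.", {
  tone: "excited",
  emphasis: "the word finished",
});
console.log(calmRequest.instructions); // Tone: calm and reassuring. Pacing: slow.
console.log(hypeRequest.instructions); // Tone: excited. Emphasize: the word finished.
```

Keeping the style as data makes it easy to reuse one delivery profile across many lines of text, while the free-form instructions string remains available for anything the fields do not cover.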

Available Voices

Voice      Character
alloy      Neutral, balanced
echo       Warm, conversational
fable      Expressive, storytelling
onyx       Deep, authoritative
nova       Friendly, upbeat
shimmer    Soft, calm
coral      Clear, professional
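
A mistyped voice name is easy to miss until a request fails, so it can help to check the name locally first. The following is a minimal sketch that mirrors the table above. VOICES and describeVoice are hypothetical names, the character descriptions are the lesson's own, and the API may accept voices that are not listed here.

```javascript
// Voice names and characters copied from the table above (not an exhaustive API list).
const VOICES = Object.freeze({
  alloy: "Neutral, balanced",
  echo: "Warm, conversational",
  fable: "Expressive, storytelling",
  onyx: "Deep, authoritative",
  nova: "Friendly, upbeat",
  shimmer: "Soft, calm",
  coral: "Clear, professional",
});

// Hypothetical helper: normalize a voice name and fail early if it is not in the table.
function describeVoice(name) {
  const key = String(name).trim().toLowerCase();
  if (!Object.prototype.hasOwnProperty.call(VOICES, key)) {
    const known = Object.keys(VOICES).join(", ");
    throw new Error(`Unknown voice "${name}". Expected one of: ${known}`);
  }
  return { voice: key, character: VOICES[key] };
}

console.log(describeVoice("Coral")); // { voice: 'coral', character: 'Clear, professional' }
```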

The Realtime API

For ultra-low-latency voice applications, the Realtime API maintains a persistent WebSocket connection and streams audio in both directions with gpt-realtime-1.5.

  • Voice Activity Detection (VAD): Automatically detects when users stop speaking
  • Tool Calling in Voice: The model can trigger backend tools during a spoken conversation
  • Audio Reasoning: Understands tone, inflection, and urgency
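
The first two features are driven by JSON events exchanged over the WebSocket. The sketch below builds a session.update event that enables server-side VAD and registers one tool, then turns a completed function call from the server into the events a client would send back. It rests on several assumptions: the field layout follows the shape documented for the beta Realtime API and may differ in the version you target, the get_order_status tool and its stubbed result are hypothetical, and no socket is opened here, so the functions only produce plain objects.

```javascript
// Hypothetical tool registry; a real run() would call your backend.
const tools = {
  get_order_status: {
    description: "Look up the status of an order by its ID.",
    parameters: {
      type: "object",
      properties: { order_id: { type: "string" } },
      required: ["order_id"],
    },
    run: ({ order_id }) => ({ order_id, status: "shipped" }), // stubbed result
  },
};

// Build the session.update event (assumed beta-style field layout).
function buildSessionUpdate(toolMap) {
  return {
    type: "session.update",
    session: {
      // Let the server decide when the user has stopped speaking.
      turn_detection: { type: "server_vad", silence_duration_ms: 500 },
      tools: Object.entries(toolMap).map(([name, tool]) => ({
        type: "function",
        name,
        description: tool.description,
        parameters: tool.parameters,
      })),
    },
  };
}

// Map one server event to the client events to send in reply (empty if none).
function handleServerEvent(event, toolMap) {
  const item = event.item;
  if (event.type !== "response.output_item.done" || !item || item.type !== "function_call") {
    return [];
  }
  const known = Object.prototype.hasOwnProperty.call(toolMap, item.name);
  const output = known
    ? toolMap[item.name].run(JSON.parse(item.arguments))
    : { error: `unknown tool: ${item.name}` };
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: item.call_id,
        output: JSON.stringify(output),
      },
    },
    { type: "response.create" }, // ask the model to continue, now with the tool result
  ];
}
```

With an open socket, each returned object would be serialized with JSON.stringify and sent as one message. Keeping the event logic free of network code makes it straightforward to unit test.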
🎯 Use Case Decision: Use TTS for pre-generated audio (podcasts, notifications). Use the Realtime API for interactive voice conversations (phone agents, assistants).
SYNAPSE VERIFICATION
QUERY 1 // 3
What makes gpt-4o-mini-tts different from traditional TTS?
  • It's faster
  • It accepts natural language instructions to control tone, emotion, and delivery style
  • It only works in English
  • It generates video
Text-to-Speech & Realtime Voice | Speech & Audio APIs — OpenAI Academy