Grok Voice is the voice layer around Grok experiences, spanning real-time voice agents, speech-to-text, and text-to-speech workflows.

What can Grok Voice be used for?

Teams can use Grok Voice for live assistants, customer support flows, call routing, transcription pipelines, and expressive speech playback in web or telephony products.

Does Grok Voice support both STT and TTS?

Yes. Grok Voice includes speech-to-text for transcription and text-to-speech for generated playback, alongside real-time voice-agent interactions.

Is this page about APIs or the app experience?

Both. This page is a compact overview of the Grok Voice stack, covering developer-facing APIs and the product experience in one place.

Voice Layer For Grok

Grok Voice for live AI conversations.

A product-first overview of the Grok Voice stack: low-friction voice agents, speech-to-text, text-to-speech, and the interaction patterns teams need when voice has to feel fast, calm, and production-ready.


Real-time agents / Speech-to-text / Text-to-speech / Tool-aware flows

Live session: Voice console

Input

Route the caller, search current docs, then answer in a natural voice.

Response

Intent captured. Tools armed. Voice response ready in one continuous loop.

Modes: Agent / STT / TTS
Interface: Web, mobile, telephony
Flow: Listen, think, answer
Tools: Search + MCP hooks

At A Glance

One voice surface, three core building blocks.

Voice agents

Real-time conversations that can listen, reason, search, and respond without collapsing into brittle IVR behavior.

Speech-to-text

Transcription flows for recorded files and live streams, tuned for accents, domain language, and long-running sessions.

Text-to-speech

Expressive playback for apps, assistants, and call experiences where the synthetic voice still needs range and control.

Operational reach

Built to stretch across support, sales, concierge, intake, and internal workflows where voice is both interface and output.

Voice Stack

The three parts of Grok Voice.

Agent, STT, and TTS workflows differ in shape, best fit, and interface. The agent mode is detailed below.

Real-time orchestration

Voice Agents

Grok Voice can anchor live agent sessions that combine speech, reasoning, tool use, and web-aware lookups in a single conversational thread.

Designed for live sessions rather than batched back-office calls.
Useful for support desks, reception, intake, and guided product flows.
Fits products where the response must feel immediate and context-aware.

Shape: Speak -> Think -> Act
Best fit: Interactive voice apps
Interfaces: WebSocket + product UI

Voice Flow

How a Grok Voice interaction typically moves.

Capture intent

Listen to the user in natural speech instead of forcing short keypad-style prompts.

Resolve context

Bring in search, product data, or tools so the model has something grounded to act on.

Return a voiced answer

Respond in text, speech, or both depending on whether the product is silent, spoken, or mixed.

Keep the thread alive

Continue the conversation without resetting context every time the user asks for clarification.
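
The four steps above can be sketched as a small loop. This is an illustration in plain Python, not the Grok Voice API: transcribe, lookup, synthesize, and VoiceSession are hypothetical stand-ins, and "audio" is modelled as plain strings and bytes so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Optional


def transcribe(audio: str) -> str:
    """Hypothetical STT stand-in: the 'audio' is already text here."""
    return audio.strip().lower()


def lookup(query: str, docs: dict) -> Optional[str]:
    """Hypothetical context step: return the first doc whose key appears in the query."""
    for key, text in docs.items():
        if key in query:
            return text
    return None


def synthesize(text: str) -> bytes:
    """Hypothetical TTS stand-in: encode the text instead of generating audio."""
    return text.encode("utf-8")


@dataclass
class VoiceSession:
    docs: dict
    # (heard, reply) pairs kept across turns so context is not reset
    history: list = field(default_factory=list)

    def turn(self, audio: str):
        heard = transcribe(audio)               # 1. capture intent
        context = lookup(heard, self.docs)      # 2. resolve context
        if context is None and self.history:
            # 4. keep the thread alive: reuse the previous reply as context
            context = self.history[-1][1]
        reply = context or "Sorry, I could not find that."
        self.history.append((heard, reply))
        return reply, synthesize(reply)         # 3. answer in text and speech
```

A follow-up such as "Can you repeat that?" matches no document, so the session falls back to the previous turn instead of starting over. A real agent would hand the full history to the model rather than reuse the last reply.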

Why Grok Voice

Built for products where voice is not a novelty layer.

Fast enough for live interaction

The experience is framed around responsive spoken loops, not asynchronous transcription dumped into a queue.

Broader than a single endpoint

Voice agents, transcription, and playback sit together as one surface, so teams can design workflows rather than isolated demos.

Closer to operator reality

Support, concierge, and intake flows are messy. The value is in handling interruptions, follow-ups, and grounded responses under pressure.
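
Handling an interruption usually means stopping playback as soon as the caller starts talking. The sketch below shows one way to do that with asyncio from the Python standard library. It assumes a hypothetical speak coroutine that "plays" one word at a time and an asyncio.Event that some other component sets when the caller speaks; neither is part of any Grok Voice interface.

```python
import asyncio


async def speak(text: str, spoken: list) -> None:
    """Hypothetical playback: record each word, pausing briefly between words."""
    for word in text.split():
        spoken.append(word)
        await asyncio.sleep(0.01)


async def answer_with_barge_in(text: str, interrupt: asyncio.Event):
    """Play a reply, but stop early if the interrupt event is set."""
    spoken = []
    playback = asyncio.create_task(speak(text, spoken))
    barge_in = asyncio.create_task(interrupt.wait())

    # Wait for whichever happens first: playback ends or the caller interrupts.
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_in in done

    # Cancel whatever is still running and let the cancellations settle.
    for task in (playback, barge_in):
        task.cancel()
    await asyncio.gather(playback, barge_in, return_exceptions=True)

    return spoken, interrupted
```

The returned spoken list records how much of the reply was played before the interruption, so the next turn can take that into account.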

FAQ

Common Grok Voice questions.

What is Grok Voice?

Grok Voice is the voice-facing layer around Grok experiences, spanning live agents, speech recognition, and spoken playback.

Is Grok Voice only for developers?

No. The underlying stack is API-friendly, but the product story also maps to user-facing assistants on web, mobile, and other voice surfaces.

Can Grok Voice handle both incoming and outgoing speech?

Yes. A typical implementation can listen to speech, convert it into text or actions, then answer with generated spoken output.

Why separate voice agents, STT, and TTS?

Because teams often need different compositions: full live assistants, transcription-only flows, or spoken output without an always-on agent loop.
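
The three compositions named above can be pictured as different chains built from the same stages. This is a minimal sketch in plain Python: stt, agent, tts, and compose are hypothetical names used only for illustration, and each stage simply tags its input so the order of execution is visible.

```python
from typing import Callable

Stage = Callable[[str], str]


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into a single callable."""
    def run(value: str) -> str:
        for stage in stages:
            value = stage(value)
        return value
    return run


# Hypothetical stages: each wraps its input in a tag instead of doing real work.
def stt(audio: str) -> str:
    return f"text({audio})"


def agent(text: str) -> str:
    return f"reply({text})"


def tts(text: str) -> str:
    return f"speech({text})"


live_assistant = compose(stt, agent, tts)   # full listen-think-answer loop
transcription_only = compose(stt)           # STT without an agent
playback_only = compose(tts)                # spoken output without an agent loop
```

Keeping the stages separate means a transcription pipeline does not carry the cost of an always-on agent, and a playback feature can be added without any speech input.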