AsyncAPI description of Inworld AI's publicly documented runtime WebSocket surface. Inworld exposes three independent WebSocket endpoints:

* **TTS streaming**: bidirectional text-to-speech synthesis with per-context configuration, flush, and close semantics (`/tts/v1/voice:streamBidirectional`).
* **STT streaming**: bidirectional speech-to-text transcription that accepts a config message, base64-encoded audio chunks, and end-of-turn signals (`/stt/v1/transcribe:streamBidirectional`).
* **Realtime API**: end-to-end speech-to-speech sessions using an OpenAI-Realtime-API-compatible event protocol (`/api/v1/realtime/session`).

All three endpoints authenticate with HTTP Basic, sending a Base64-encoded Inworld API key in the `Authorization` header, per the published documentation.

This document reconstructs the message shapes published at https://docs.inworld.ai (TTS WebSocket, STT WebSocket, and Realtime WebSocket reference pages) without adding invented fields. Field sets reflect the documented JSON payloads; many Realtime server-sent events are documented only narratively by Inworld and are modeled here as open objects.
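The HTTP Basic scheme described above can be sketched as a small header builder. This is a minimal sketch, not Inworld's client code: the function name `basic_auth_header` and the `already_encoded` flag are hypothetical, and it assumes the key copied from the Inworld portal is already Base64 text that can be sent as-is.

```python
import base64


def basic_auth_header(api_key: str, already_encoded: bool = True) -> dict:
    """Build an Authorization header for the WebSocket handshake.

    Assumption: the API key is already Base64 text, so it is passed
    through unchanged by default. With already_encoded=False the raw
    credential string is Base64-encoded first.
    """
    if already_encoded:
        token = api_key
    else:
        token = base64.b64encode(api_key.encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

The returned dictionary can be passed as extra handshake headers to whichever WebSocket client library is in use.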
Tags: AI, Artificial Intelligence, Voice, Text To Speech, Speech To Text, Realtime, LLM Routing, Voice Cloning, Conversational AI, Game AI, AsyncAPI, Webhooks, Events
Channels
/tts/v1/voice:streamBidirectional
publish ttsClientMessage
Messages sent from client to Inworld TTS.
Bidirectional WebSocket endpoint for streaming text-to-speech synthesis. A single connection may host up to five contexts; the account is limited to 20 concurrent connections. The connection auto-closes after 10 minutes of inactivity across all contexts.
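The per-connection limits above lend themselves to client-side bookkeeping, so that a client fails fast instead of waiting for the server to reject a sixth context. The sketch below is illustrative only: `TtsContextTracker` and its methods are hypothetical names, and it assumes that any context open or close counts as activity for the 10-minute idle timer.

```python
import time

MAX_CONTEXTS_PER_CONNECTION = 5   # limit stated in the channel description
IDLE_TIMEOUT_SECONDS = 10 * 60    # auto-close window stated in the channel description


class TtsContextTracker:
    """Client-side bookkeeping for one TTS connection (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._open = set()
        self._last_activity = clock()

    @property
    def open_count(self) -> int:
        return len(self._open)

    def touch(self) -> None:
        """Record activity; assumed to reset the idle timer."""
        self._last_activity = self._clock()

    def open_context(self, context_id: str) -> None:
        if context_id in self._open:
            raise ValueError(f"context {context_id!r} is already open")
        if len(self._open) >= MAX_CONTEXTS_PER_CONNECTION:
            raise RuntimeError("connection already hosts the maximum of 5 contexts")
        self._open.add(context_id)
        self.touch()

    def close_context(self, context_id: str) -> None:
        self._open.discard(context_id)
        self.touch()

    def seconds_until_idle_close(self) -> float:
        """Time left before the connection would be closed for inactivity."""
        idle = self._clock() - self._last_activity
        return max(0.0, IDLE_TIMEOUT_SECONDS - idle)
```

The 20-connection account limit is not modeled here because it spans processes; a shared counter or connection pool would be needed for that.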
/stt/v1/transcribe:streamBidirectional
publish sttClientMessage
Messages sent from client to Inworld STT.
Bidirectional WebSocket endpoint for streaming speech-to-text transcription. The first client message must be a TranscribeConfig; it is followed by AudioChunk messages and, optionally, EndTurn / CloseStream control messages.
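The required message order can be sketched as a generator that yields one JSON frame at a time. The ordering (config first, then audio, then optional control messages) and the base64 encoding of audio come from this description; every JSON key name below (`transcribeConfig`, `audioChunk`, `content`, `endTurn`, `closeStream`) is a hypothetical placeholder, so the real field names should be taken from the Inworld STT reference before use.

```python
import base64
import json
from typing import Iterable, Iterator


def stt_client_messages(
    config: dict,
    audio_chunks: Iterable[bytes],
    end_turn: bool = True,
    close_stream: bool = True,
) -> Iterator[str]:
    """Yield client frames in the documented order.

    NOTE: all JSON key names are placeholders, not verified field names.
    """
    # 1. The config message must be the first frame on the connection.
    yield json.dumps({"transcribeConfig": config})

    # 2. Audio follows as base64-encoded chunks.
    for chunk in audio_chunks:
        encoded = base64.b64encode(chunk).decode("ascii")
        yield json.dumps({"audioChunk": {"content": encoded}})

    # 3. Optional control messages close the turn and then the stream.
    if end_turn:
        yield json.dumps({"endTurn": {}})
    if close_stream:
        yield json.dumps({"closeStream": {}})
```

Each yielded string would be sent as one text frame on the open WebSocket.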
/api/v1/realtime/session
publish realtimeClientEvent
Events sent from client to the Realtime session.
Bidirectional WebSocket endpoint for Inworld Realtime sessions. The protocol is intentionally compatible with the OpenAI Realtime API event protocol; existing OpenAI Realtime clients can connect by changing the base URL and authentication header. Inworld extends the session object with a `providerData` field for STT, TTS, memory, backchannel, and responsiveness extensions (see https://docs.inworld.ai/realtime/provider-data).
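As an illustration of the `providerData` extension, the sketch below builds a `session.update` event in the OpenAI-Realtime-style envelope (`type` plus `session`) and attaches `providerData` to the session object. The helper name `session_update_event` is hypothetical, and the contents of `providerData` are deliberately left to the caller because their exact shape is defined in the Inworld documentation, not here.

```python
import json
from typing import Optional


def session_update_event(session_fields: dict, provider_data: Optional[dict] = None) -> str:
    """Serialize a session.update client event.

    provider_data is passed through untouched; its keys are defined by
    Inworld's provider-data documentation and are not validated here.
    """
    session = dict(session_fields)  # copy so the caller's dict is not mutated
    if provider_data:
        session["providerData"] = provider_data
    return json.dumps({"type": "session.update", "session": session})
```

A call such as `session_update_event({"instructions": "Be brief."}, {"tts": {}})` would produce one frame to send after the connection opens; the `tts` key in that call is a placeholder, not a verified field.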
Messages
✉
TtsCreateContext
Create TTS context
Establishes an independent synthesis stream within the connection (max 5 per connection).
✉
TtsSendText
Send text for synthesis
Sends up to 1000 characters of text to a context.
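Because each TtsSendText message carries at most 1000 characters, longer input has to be split on the client. The following is a minimal sketch of one way to do that, preferring breaks at spaces; `split_for_tts` is a hypothetical helper, and it assumes the limit counts Python string characters (code points), which may differ from how the service counts.

```python
MAX_TEXT_CHARS = 1000  # per-message limit stated for TtsSendText


def split_for_tts(text: str, limit: int = MAX_TEXT_CHARS) -> list:
    """Split text into pieces of at most `limit` characters.

    Breaks at the last space that fits; falls back to a hard split when
    a run of `limit` characters contains no usable space.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    pieces = []
    rest = text
    while len(rest) > limit:
        cut = rest.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit  # no space to break on: hard split
        pieces.append(rest[:cut])
        rest = rest[cut:].lstrip(" ")  # drop the space(s) at the break
    if rest:
        pieces.append(rest)
    return pieces
```

Each returned piece would be sent as its own TtsSendText message to the same context, in order.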
✉
TtsFlushContext
Flush buffered text
Manually triggers synthesis of buffered text.
✉
TtsCloseContext
Close TTS context
Terminates a context and releases resources, flushing pending text.
✉
TtsContextCreated
Context created (server)
✉
TtsAudioChunk
Audio chunk (server)
✉
TtsFlushCompleted
Flush completed (server)
✉
TtsContextClosed
Context closed (server)
✉
SttTranscribeConfig
Transcribe config (client, first message)
✉
SttAudioChunk
Audio chunk (client)
✉
SttEndTurn
End turn (client)
✉
SttCloseStream
Close stream (client)
✉
SttTranscription
Transcription (server)
✉
SttSpeechStarted
Speech started (server)
✉
SttSpeechStopped
Speech stopped (server)
✉
SttUsage
Usage (server, coming soon)
✉
RealtimeSessionUpdate
session.update (client)
✉
RealtimeConversationItemCreate
conversation.item.create (client)
✉
RealtimeConversationItemTruncate
conversation.item.truncate (client)
✉
RealtimeConversationItemDelete
conversation.item.delete (client)
✉
RealtimeConversationItemRetrieve
conversation.item.retrieve (client)
✉
RealtimeResponseCreate
response.create (client)
✉
RealtimeResponseCancel
response.cancel (client)
✉
RealtimeInputAudioBufferAppend
input_audio_buffer.append (client)
✉
RealtimeInputAudioBufferCommit
input_audio_buffer.commit (client)
✉
RealtimeInputAudioBufferClear
input_audio_buffer.clear (client)
✉
RealtimeOutputAudioBufferClear
output_audio_buffer.clear (client)
✉
RealtimeSessionCreated
session.created (server)
✉
RealtimeSessionUpdated
session.updated (server)
✉
RealtimeGenericServerEvent
Other Realtime server event
Inworld documents additional server events that follow the OpenAI Realtime API protocol — including (non-exhaustive) `response.start`, `response.content_part.added`, `response.content_part.delta`, `response.done`, `conversation.item.created`, `input_audio_buffer.speech_started`, and `input_audio_buffer.speech_stopped`. The exact field set for these events is described narratively in the Inworld docs; they are represented here as a generic typed event so consumers do not assume unverified field shapes.
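A consumer can mirror this generic modeling by routing on the `type` field alone and treating the rest of each event as an open dictionary. The sketch below shows that approach; `RealtimeEventRouter` is a hypothetical class, and it assumes only that every server event is a JSON object with a string `type`.

```python
import json
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class RealtimeEventRouter:
    """Dispatch server events by their `type`, without assuming other fields."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.unhandled: List[Dict[str, Any]] = []

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type] = handler

    def dispatch(self, raw: str) -> str:
        """Parse one text frame, route it, and return its event type."""
        event = json.loads(raw)
        if not isinstance(event, dict) or not isinstance(event.get("type"), str):
            raise ValueError("server event must be a JSON object with a string 'type'")
        handler = self._handlers.get(event["type"])
        if handler is None:
            self.unhandled.append(event)  # keep unknown events for inspection
        else:
            handler(event)
        return event["type"]
```

Keeping unknown events in `unhandled` rather than raising means a client keeps working when the server starts emitting event types it was not written for.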
Servers
wss
production api.inworld.ai
Inworld production WebSocket host. Used for all three runtime WebSocket endpoints (TTS, STT, Realtime).
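Putting the server entry together with the channel paths, the three endpoint URLs can be assembled as follows. This is a sketch under the assumption that the production host is `api.inworld.ai` and the scheme is `wss`; the short service keys (`tts`, `stt`, `realtime`) and the function name are hypothetical.

```python
HOST = "api.inworld.ai"  # assumption: production host from the server entry

# Channel paths as listed in this document.
PATHS = {
    "tts": "/tts/v1/voice:streamBidirectional",
    "stt": "/stt/v1/transcribe:streamBidirectional",
    "realtime": "/api/v1/realtime/session",
}


def endpoint_url(service: str, host: str = HOST) -> str:
    """Return the wss:// URL for one of the three runtime endpoints."""
    try:
        path = PATHS[service]
    except KeyError:
        raise ValueError(f"unknown service {service!r}; expected one of {sorted(PATHS)}")
    return f"wss://{host}{path}"
```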