xAI · AsyncAPI Specification

xAI Realtime WebSocket APIs

Version 1.0.0

AsyncAPI 2.6 description of xAI's documented WebSocket APIs:
- Real-time Speech-to-Text (STT) streaming at wss://api.x.ai/v1/stt
- Voice Agent (bidirectional speech-to-speech) at wss://api.x.ai/v1/realtime

Every channel, message, and field below is sourced from the public xAI developer documentation at https://docs.x.ai/. No events have been invented; events that are unsupported (per the docs) are excluded.

Tags: AI, LLM, Foundation Models, Grok, Generative AI, AsyncAPI, Webhooks, Events

Channels

stt/audio
publish sttSendAudioChunk
Stream a chunk of raw audio to the STT engine.
Client streams audio to the STT server. Audio is sent as binary WebSocket frames containing raw PCM (16-bit signed little-endian), mu-law, or A-law samples matching the `encoding` query parameter. Recommended chunk size for 16 kHz PCM is ~100 ms (3,200 bytes).
stt/control
publish sttSendAudioDone
Signal that no more audio will be sent.
Client signals end-of-audio. After the server emits transcript.done, the connection is closed.
stt/events
subscribe sttReceiveEvents
Receive STT lifecycle, transcript, and error events.
Server-to-client transcription events for the STT session.
realtime/client
publish realtimeSendClientEvent
Send a client event to the voice agent.
Client-to-server events for the voice agent: session configuration, conversation item creation, response requests, and input audio buffer management.
realtime/server
subscribe realtimeReceiveServerEvent
Receive a server event from the voice agent.
Server-to-client events for the voice agent: session lifecycle, conversation item events, response streaming (text + audio), transcription of user input, function-call argument completion, and error events.
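
The chunk-size recommendation for stt/audio follows from simple arithmetic: bytes per chunk = sample rate x bytes per sample x chunk duration. The sketch below illustrates that calculation and is not taken from the xAI documentation. The helper names `chunk_bytes` and `iter_chunks` are hypothetical, and the code assumes mono audio with 2 bytes per sample for 16-bit PCM and 1 byte per sample for G.711 mu-law and A-law.

```python
# Hypothetical helpers; names and defaults are illustrative, not from the xAI docs.
# Assumption: mono audio. 16-bit PCM uses 2 bytes per sample, G.711 uses 1.
BYTES_PER_SAMPLE = {"pcm": 2, "mulaw": 1, "alaw": 1}


def chunk_bytes(sample_rate: int, encoding: str = "pcm", chunk_ms: int = 100) -> int:
    """Return the size in bytes of one mono audio chunk lasting chunk_ms milliseconds."""
    if encoding not in BYTES_PER_SAMPLE:
        raise ValueError(f"unsupported encoding: {encoding!r}")
    samples = sample_rate * chunk_ms // 1000
    return samples * BYTES_PER_SAMPLE[encoding]


def iter_chunks(audio: bytes, size: int):
    """Yield consecutive slices of `audio`, each at most `size` bytes long."""
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]
```

With these helpers, `chunk_bytes(16000, "pcm")` gives the 3,200 bytes quoted above, and each slice produced by `iter_chunks` would be sent as one binary WebSocket frame.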

Messages

SttAudioChunk
STT raw audio chunk (binary frame)
A binary WebSocket frame carrying raw audio samples.
SttAudioDone
audio.done
Client signal that no more audio will be sent.
SttTranscriptCreated
transcript.created
Emitted once when the STT session is ready.
SttTranscriptPartial
transcript.partial
Intermediate transcript event (interim or chunk-final).
SttTranscriptDone
transcript.done
Final transcript emitted after the client sends audio.done.
SttError
error
STT error event.
RtSessionUpdate
session.update
Configure the voice session.
RtConversationItemCreate
conversation.item.create
Add a message or function output to the conversation.
RtResponseCreate
response.create
Request the model to generate a response.
RtInputAudioBufferAppend
input_audio_buffer.append
Append base64-encoded audio to the input buffer.
RtInputAudioBufferCommit
input_audio_buffer.commit
Mark the end of audio input for manual turn detection.
RtInputAudioBufferClear
input_audio_buffer.clear
Discard uncommitted audio from the input buffer.
RtSessionCreated
session.created
Sent when the WebSocket connection is established.
RtSessionUpdated
session.updated
Sent after a successful session.update.
RtConversationItemCreated
conversation.item.created
A conversation item (message or function output) was added.
RtInputAudioTranscriptionCompleted
conversation.item.input_audio_transcription.completed
Transcription of user speech for a conversation item.
RtResponseCreated
response.created
Model has begun generating a response.
RtResponseDone
response.done
Response generation has completed.
RtResponseOutputTextDelta
response.output_text.delta
Streamed text chunk from the model.
RtResponseOutputAudioDelta
response.output_audio.delta
Streamed audio chunk (base64) from the model.
RtResponseFunctionCallArgumentsDone
response.function_call_arguments.done
A function call has been triggered with complete arguments.
RtError
error
Realtime API error event.
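
A client mostly needs to do two things with the messages listed above: interpret incoming events and serialize outgoing ones. The following is a minimal sketch under stated assumptions, not an official client. The function names are hypothetical; the field names (`is_final`, `speech_final`, `audio`, `call_id`, `output`) are those declared in the payload schemas of the raw specification. The combination is_final=false with speech_final=true is not described in the specification, so the sketch reports it as "unknown" rather than guessing.

```python
import base64
import json


def partial_state(event: dict) -> str:
    """Classify a transcript.partial event by its is_final / speech_final flags."""
    is_final = event["is_final"]
    speech_final = event["speech_final"]
    if is_final and speech_final:
        return "utterance-final"  # complete utterance
    if is_final:
        return "chunk-final"      # text for this chunk is locked
    if not speech_final:
        return "interim"          # text may still change
    return "unknown"              # combination not described in the spec


def audio_append_event(audio: bytes) -> str:
    """Serialize an input_audio_buffer.append event carrying base64 audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(audio).decode("ascii"),
    })


def function_output_event(call_id: str, result: object) -> str:
    """Serialize a conversation.item.create event that returns a function result.

    The result is JSON-encoded into the `output` string, and `call_id` echoes
    the value received in response.function_call_arguments.done.
    """
    return json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": call_id,
            "output": json.dumps(result),
        },
    })
```

Each function returns a JSON string suitable for a text WebSocket frame; sending it over an actual connection is left out so the sketch stays independent of any particular WebSocket library.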

Servers

wss
stt api.x.ai/v1/stt
Real-time Speech-to-Text streaming endpoint. Configuration is supplied via URL query parameters (no setup message is required). Authenticate with a Bearer token in the Authorization header (or an ephemeral token). Source: https://docs.x.ai/developers/model-capabilities/audio/speech-to-text
wss
realtime api.x.ai/v1/realtime
Voice Agent realtime endpoint (bidirectional speech-to-speech, OpenAI Realtime API compatible using beta event naming). The target model is selected with the `model` query parameter (e.g. grok-voice-latest). Source: https://docs.x.ai/developers/model-capabilities/audio/voice-agent
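
Because both endpoints are configured through the URL, a client has to assemble the query string before opening the socket. The sketch below shows one way to do that with the Python standard library. The function names are hypothetical; the parameter names (`sample_rate`, `encoding`, `interim_results`, `endpointing`, `model`) and their defaults are the server variables declared in the raw specification. Serializing booleans as `true`/`false` follows the enum values listed there, and how the headers are attached to the upgrade request depends on the WebSocket library in use.

```python
from urllib.parse import urlencode

STT_BASE = "wss://api.x.ai/v1/stt"
REALTIME_BASE = "wss://api.x.ai/v1/realtime"


def build_stt_url(sample_rate: int = 16000, encoding: str = "pcm",
                  interim_results: bool = False, endpointing: int = 0) -> str:
    """Build the STT WebSocket URL with its configuration query parameters."""
    if encoding not in ("pcm", "mulaw", "alaw"):
        raise ValueError(f"unsupported encoding: {encoding!r}")
    if not 0 <= endpointing <= 5000:
        raise ValueError("endpointing must be between 0 and 5000 ms")
    params = {
        "sample_rate": sample_rate,
        "encoding": encoding,
        "interim_results": str(interim_results).lower(),  # 'true' or 'false'
        "endpointing": endpointing,
    }
    return f"{STT_BASE}?{urlencode(params)}"


def build_realtime_url(model: str = "grok-voice-latest") -> str:
    """Build the Voice Agent WebSocket URL, selecting the model by query parameter."""
    return f"{REALTIME_BASE}?{urlencode({'model': model})}"


def auth_headers(api_key: str) -> dict:
    """Return the Authorization header for the WebSocket upgrade request."""
    return {"Authorization": f"Bearer {api_key}"}
```

For example, `build_stt_url(interim_results=True, endpointing=500)` yields a URL that requests interim transcripts and a 500 ms silence threshold.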

AsyncAPI Specification

asyncapi: 2.6.0
info:
  title: xAI Realtime WebSocket APIs
  version: '1.0.0'
  description: |
    AsyncAPI 2.6 description of xAI's documented WebSocket APIs:
      - Real-time Speech-to-Text (STT) streaming at wss://api.x.ai/v1/stt
      - Voice Agent (bidirectional speech-to-speech) at wss://api.x.ai/v1/realtime
    Every channel, message, and field below is sourced from the public xAI
    developer documentation at https://docs.x.ai/. No events have been
    invented; events that are unsupported (per the docs) are excluded.
  contact:
    name: xAI
    url: https://docs.x.ai/
  license:
    name: Proprietary
    url: https://x.ai/legal/terms-of-service
externalDocs:
  description: xAI Audio / Voice documentation
  url: https://docs.x.ai/developers/model-capabilities/audio/voice

defaultContentType: application/json

servers:
  stt:
    url: api.x.ai/v1/stt
    protocol: wss
    description: |
      Real-time Speech-to-Text streaming endpoint. Configuration is supplied
      via URL query parameters (no setup message is required). Authenticate
      with a Bearer token in the Authorization header (or an ephemeral token).
      Source: https://docs.x.ai/developers/model-capabilities/audio/speech-to-text
    variables:
      sample_rate:
        description: Audio sample rate in Hz. Supported values include 8000, 16000, 22050, 24000, 44100, 48000.
        default: '16000'
        enum:
          - '8000'
          - '16000'
          - '22050'
          - '24000'
          - '44100'
          - '48000'
      encoding:
        description: Raw audio encoding (raw formats only).
        default: pcm
        enum:
          - pcm
          - mulaw
          - alaw
      interim_results:
        description: Emit interim (partial) transcripts approximately every 500 ms.
        default: 'false'
        enum:
          - 'true'
          - 'false'
      endpointing:
        description: Silence duration in milliseconds (0-5000) that triggers an utterance-final event.
        default: '0'
      multichannel:
        description: Enable independent per-channel transcription.
        default: 'false'
        enum:
          - 'true'
          - 'false'
      channels:
        description: Number of channels when multichannel=true (2-8).
        default: '2'
    security:
      - bearerAuth: []
  realtime:
    url: api.x.ai/v1/realtime
    protocol: wss
    description: |
      Voice Agent realtime endpoint (bidirectional speech-to-speech, OpenAI
      Realtime API compatible using beta event naming). The target model is
      selected with the `model` query parameter (e.g. grok-voice-latest).
      Source: https://docs.x.ai/developers/model-capabilities/audio/voice-agent
    variables:
      model:
        description: Voice model identifier.
        default: grok-voice-latest
    security:
      - bearerAuth: []

channels:
  # ---------------------------------------------------------------------
  # Real-time STT channels (wss://api.x.ai/v1/stt)
  # ---------------------------------------------------------------------
  stt/audio:
    description: |
      Client streams audio to the STT server. Audio is sent as binary
      WebSocket frames containing raw PCM (16-bit signed little-endian),
      mu-law, or A-law samples matching the `encoding` query parameter.
      Recommended chunk size for 16 kHz PCM is ~100 ms (3,200 bytes).
    servers:
      - stt
    publish:
      operationId: sttSendAudioChunk
      summary: Stream a chunk of raw audio to the STT engine.
      message:
        $ref: '#/components/messages/SttAudioChunk'

  stt/control:
    description: |
      Client signals end-of-audio. After the server emits transcript.done,
      the connection is closed.
    servers:
      - stt
    publish:
      operationId: sttSendAudioDone
      summary: Signal that no more audio will be sent.
      message:
        $ref: '#/components/messages/SttAudioDone'

  stt/events:
    description: |
      Server-to-client transcription events for the STT session.
    servers:
      - stt
    subscribe:
      operationId: sttReceiveEvents
      summary: Receive STT lifecycle, transcript, and error events.
      message:
        oneOf:
          - $ref: '#/components/messages/SttTranscriptCreated'
          - $ref: '#/components/messages/SttTranscriptPartial'
          - $ref: '#/components/messages/SttTranscriptDone'
          - $ref: '#/components/messages/SttError'

  # ---------------------------------------------------------------------
  # Voice Agent channels (wss://api.x.ai/v1/realtime)
  # ---------------------------------------------------------------------
  realtime/client:
    description: |
      Client-to-server events for the voice agent: session configuration,
      conversation item creation, response requests, and input audio
      buffer management.
    servers:
      - realtime
    publish:
      operationId: realtimeSendClientEvent
      summary: Send a client event to the voice agent.
      message:
        oneOf:
          - $ref: '#/components/messages/RtSessionUpdate'
          - $ref: '#/components/messages/RtConversationItemCreate'
          - $ref: '#/components/messages/RtResponseCreate'
          - $ref: '#/components/messages/RtInputAudioBufferAppend'
          - $ref: '#/components/messages/RtInputAudioBufferCommit'
          - $ref: '#/components/messages/RtInputAudioBufferClear'

  realtime/server:
    description: |
      Server-to-client events for the voice agent: session lifecycle,
      conversation item events, response streaming (text + audio),
      transcription of user input, function-call argument completion,
      and error events.
    servers:
      - realtime
    subscribe:
      operationId: realtimeReceiveServerEvent
      summary: Receive a server event from the voice agent.
      message:
        oneOf:
          - $ref: '#/components/messages/RtSessionCreated'
          - $ref: '#/components/messages/RtSessionUpdated'
          - $ref: '#/components/messages/RtConversationItemCreated'
          - $ref: '#/components/messages/RtInputAudioTranscriptionCompleted'
          - $ref: '#/components/messages/RtResponseCreated'
          - $ref: '#/components/messages/RtResponseDone'
          - $ref: '#/components/messages/RtResponseOutputTextDelta'
          - $ref: '#/components/messages/RtResponseOutputAudioDelta'
          - $ref: '#/components/messages/RtResponseFunctionCallArgumentsDone'
          - $ref: '#/components/messages/RtError'

components:
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: API key (or ephemeral token)
      description: |
        Pass `Authorization: Bearer <XAI_API_KEY>` on the WebSocket upgrade
        request. xAI recommends proxying WebSocket connections through your
        backend and never exposing the key client-side.

  messages:
    # -----------------------------------------------------------------
    # STT messages
    # -----------------------------------------------------------------
    SttAudioChunk:
      name: SttAudioChunk
      title: STT raw audio chunk (binary frame)
      summary: A binary WebSocket frame carrying raw audio samples.
      contentType: application/octet-stream
      payload:
        type: string
        format: binary
        description: |
          Raw audio bytes. Encoding is determined by the `encoding` query
          parameter (pcm = signed 16-bit little-endian, mulaw = G.711 mu-law,
          alaw = G.711 A-law). Sample rate is set by `sample_rate`.

    SttAudioDone:
      name: SttAudioDone
      title: audio.done
      summary: Client signal that no more audio will be sent.
      payload:
        type: object
        required: [type]
        properties:
          type:
            type: string
            const: audio.done

    SttTranscriptCreated:
      name: SttTranscriptCreated
      title: transcript.created
      summary: Emitted once when the STT session is ready.
      payload:
        type: object
        required: [type]
        properties:
          type:
            type: string
            const: transcript.created

    SttTranscriptPartial:
      name: SttTranscriptPartial
      title: transcript.partial
      summary: Intermediate transcript event (interim or chunk-final).
      payload:
        type: object
        required: [type, text, is_final, speech_final]
        properties:
          type:
            type: string
            const: transcript.partial
          text:
            type: string
            description: Partial transcript text.
          words:
            type: array
            description: Word-level segments.
            items:
              $ref: '#/components/schemas/Word'
          is_final:
            type: boolean
            description: |
              When true the text for this chunk is locked (no further changes).
          speech_final:
            type: boolean
            description: |
              When true the utterance has ended (e.g. silence threshold met).
              States:
                - is_final=false, speech_final=false: interim text (may change)
                - is_final=true,  speech_final=false: chunk-final (~3s locked)
                - is_final=true,  speech_final=true:  complete utterance
          start:
            type: number
            description: Start time in seconds for the partial segment.
          duration:
            type: number
            description: Duration in seconds for the partial segment.
          channel_index:
            type: integer
            description: Channel index (only present when multichannel=true).

    SttTranscriptDone:
      name: SttTranscriptDone
      title: transcript.done
      summary: Final transcript emitted after the client sends audio.done.
      payload:
        type: object
        required: [type, duration]
        properties:
          type:
            type: string
            const: transcript.done
          text:
            type: string
            description: Final transcript text.
          duration:
            type: number
            description: Total audio duration in seconds.
          channel_index:
            type: integer
            description: Channel index (only present when multichannel=true).

    SttError:
      name: SttError
      title: error
      summary: STT error event.
      payload:
        type: object
        required: [type, message]
        properties:
          type:
            type: string
            const: error
          message:
            type: string
            description: Human-readable error description.

    # -----------------------------------------------------------------
    # Voice Agent (realtime) messages
    # -----------------------------------------------------------------
    RtSessionUpdate:
      name: RtSessionUpdate
      title: session.update
      summary: Configure the voice session.
      payload:
        type: object
        required: [type, session]
        properties:
          type:
            type: string
            const: session.update
          session:
            $ref: '#/components/schemas/RtSession'

    RtConversationItemCreate:
      name: RtConversationItemCreate
      title: conversation.item.create
      summary: Add a message or function output to the conversation.
      payload:
        type: object
        required: [type, item]
        properties:
          type:
            type: string
            const: conversation.item.create
          item:
            $ref: '#/components/schemas/RtConversationItem'

    RtResponseCreate:
      name: RtResponseCreate
      title: response.create
      summary: Request the model to generate a response.
      payload:
        type: object
        required: [type]
        properties:
          type:
            type: string
            const: response.create

    RtInputAudioBufferAppend:
      name: RtInputAudioBufferAppend
      title: input_audio_buffer.append
      summary: Append base64-encoded audio to the input buffer.
      payload:
        type: object
        required: [type, audio]
        properties:
          type:
            type: string
            const: input_audio_buffer.append
          audio:
            type: string
            description: Base64-encoded audio bytes (PCM/PCMU/PCMA per session config).

    RtInputAudioBufferCommit:
      name: RtInputAudioBufferCommit
      title: input_audio_buffer.commit
      summary: Mark the end of audio input for manual turn detection.
      payload:
        type: object
        required: [type]
        properties:
          type:
            type: string
            const: input_audio_buffer.commit

    RtInputAudioBufferClear:
      name: RtInputAudioBufferClear
      title: input_audio_buffer.clear
      summary: Discard uncommitted audio from the input buffer.
      payload:
        type: object
        required: [type]
        properties:
          type:
            type: string
            const: input_audio_buffer.clear

    RtSessionCreated:
      name: RtSessionCreated
      title: session.created
      summary: Sent when the WebSocket connection is established.
      payload:
        type: object
        required: [type, session]
        properties:
          type:
            type: string
            const: session.created
          session:
            $ref: '#/components/schemas/RtSession'

    RtSessionUpdated:
      name: RtSessionUpdated
      title: session.updated
      summary: Sent after a successful session.update.
      payload:
        type: object
        required: [type, session]
        properties:
          type:
            type: string
            const: session.updated
          session:
            $ref: '#/components/schemas/RtSession'

    RtConversationItemCreated:
      name: RtConversationItemCreated
      title: conversation.item.created
      summary: A conversation item (message or function output) was added.
      payload:
        type: object
        required: [type, item]
        properties:
          type:
            type: string
            const: conversation.item.created
          item:
            $ref: '#/components/schemas/RtConversationItem'

    RtInputAudioTranscriptionCompleted:
      name: RtInputAudioTranscriptionCompleted
      title: conversation.item.input_audio_transcription.completed
      summary: Transcription of user speech for a conversation item.
      payload:
        type: object
        required: [type, item_id, content_index, transcript]
        properties:
          type:
            type: string
            const: conversation.item.input_audio_transcription.completed
          item_id:
            type: string
          content_index:
            type: integer
          transcript:
            type: string

    RtResponseCreated:
      name: RtResponseCreated
      title: response.created
      summary: Model has begun generating a response.
      payload:
        type: object
        required: [type, response]
        properties:
          type:
            type: string
            const: response.created
          response:
            $ref: '#/components/schemas/RtResponse'

    RtResponseDone:
      name: RtResponseDone
      title: response.done
      summary: Response generation has completed.
      payload:
        type: object
        required: [type, response]
        properties:
          type:
            type: string
            const: response.done
          response:
            $ref: '#/components/schemas/RtResponse'

    RtResponseOutputTextDelta:
      name: RtResponseOutputTextDelta
      title: response.output_text.delta
      summary: Streamed text chunk from the model.
      payload:
        type: object
        required: [type, text]
        properties:
          type:
            type: string
            const: response.output_text.delta
          text:
            type: string
            description: Text chunk.
          response_id:
            type: string

    RtResponseOutputAudioDelta:
      name: RtResponseOutputAudioDelta
      title: response.output_audio.delta
      summary: Streamed audio chunk (base64) from the model.
      payload:
        type: object
        required: [type]
        properties:
          type:
            type: string
            const: response.output_audio.delta
          delta:
            type: string
            description: |
              Base64-encoded audio chunk (per the official quick-start sample).
          audio:
            type: string
            description: |
              Base64-encoded audio chunk (alternate field name used in the
              event reference).
          response_id:
            type: string

    RtResponseFunctionCallArgumentsDone:
      name: RtResponseFunctionCallArgumentsDone
      title: response.function_call_arguments.done
      summary: A function call has been triggered with complete arguments.
      payload:
        type: object
        required: [type, name, call_id, arguments]
        properties:
          type:
            type: string
            const: response.function_call_arguments.done
          name:
            type: string
            description: Function name.
          call_id:
            type: string
            description: Unique call identifier; echoed back with function_call_output.
          arguments:
            type: string
            description: JSON-encoded string of function arguments.
          response_id:
            type: string

    RtError:
      name: RtError
      title: error
      summary: Realtime API error event.
      payload:
        type: object
        required: [type, error]
        properties:
          type:
            type: string
            const: error
          error:
            type: object
            required: [message]
            properties:
              type:
                type: string
                description: Error category.
              message:
                type: string
                description: Human-readable error description.
              param:
                type: string
                description: Parameter that caused the error, when applicable.

  schemas:
    Word:
      type: object
      description: Word-level transcript segment.
      properties:
        text:
          type: string
        start:
          type: number
          description: Word start time in seconds.
        end:
          type: number
          description: Word end time in seconds.

    RtAudioFormat:
      type: object
      description: Audio format descriptor used by the realtime session config.
      properties:
        type:
          type: string
          enum:
            - audio/pcm
            - audio/pcmu
            - audio/pcma
        rate:
          type: number
          description: Sample rate (PCM only).

    RtTurnDetection:
      type: object
      description: Turn detection configuration. Set `type` to null for manual turn detection.
      properties:
        type:
          type: [string, 'null']
          enum:
            - server_vad
            - null
        threshold:
          type: number
          minimum: 0.1
          maximum: 0.9
          description: VAD activation threshold.
        silence_duration_ms:
          type: number
          minimum: 0
          maximum: 10000
          description: Silence duration before a turn ends.
        prefix_padding_ms:
          type: number
          minimum: 0
          maximum: 10000
          description: Audio padding before detected speech start.

    RtSession:
      type: object
      description: Voice session configuration.
      properties:
        voice:
          type: string
          description: Built-in voice (eve, ara, rex, sal, leo) or a custom voice ID.
        instructions:
          type: string
          description: System prompt.
        tools:
          type: array
          description: Tools available to the agent (file_search, web_search, x_search, mcp, function).
          items:
            type: object
        turn_detection:
          $ref: '#/components/schemas/RtTurnDetection'
        audio:
          type: object
          properties:
            input:
              type: object
              properties:
                format:
                  $ref: '#/components/schemas/RtAudioFormat'
            output:
              type: object
              properties:
                format:
                  $ref: '#/components/schemas/RtAudioFormat'

    RtMessageContent:
      type: object
      properties:
        type:
          type: string
          description: Content block type (e.g. input_text, input_audio).
        text:
          type: string
          description: Text content (for input_text blocks).
        audio:
          type: string
          description: Base64-encoded audio (for input_audio blocks).

    RtConversationItem:
      type: object
      description: A conversation item (message or function call output).
      properties:
        id:
          type: string
        type:
          type: string
          enum:
            - message
            - function_call_output
        role:
          type: string
          enum:
            - user
            - assistant
          description: Required for message items.
        content:
          type: array
          description: Content blocks for message items.
          items:
            $ref: '#/components/schemas/RtMessageContent'
        call_id:
          type: string
          description: For function_call_output items, the matching function call ID.
        output:
          type: string
          description: For function_call_output items, the JSON-encoded function result.

    RtResponse:
      type: object
      description: Response object metadata.
      properties:
        id:
          type: string
        status:
          type: string