elevenlabs · AsyncAPI Specification

ElevenLabs Text to Speech Streaming Events

Version 1.0

The ElevenLabs Text to Speech WebSocket API enables bidirectional streaming for text-to-speech conversion. Clients send text chunks incrementally and receive audio chunks as they are generated, enabling ultra-low latency speech synthesis for real-time applications.
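The client-to-server messages defined by this spec can be sketched as small JSON builders. This is an illustrative sketch, not the official SDK: field names (`text`, `voice_settings`, `flush`, `try_trigger_generation`, `model_id`) come from the payload schemas below, while the default stability and similarity values are placeholder assumptions.

```python
import json


def init_message(text=" ", stability=0.5, similarity_boost=0.8, model_id=None):
    """First message of a session: initial text plus voice settings.

    A single space for `text` starts an empty session, per the InitPayload
    schema. The numeric defaults here are illustrative, not documented values.
    """
    msg = {
        "text": text,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    }
    if model_id is not None:
        msg["model_id"] = model_id
    return json.dumps(msg)


def text_chunk(text, try_trigger_generation=False):
    """Incremental text input (TextChunkMessage)."""
    return json.dumps({"text": text, "try_trigger_generation": try_trigger_generation})


def flush_message():
    """Force generation of any buffered text (FlushMessage)."""
    return json.dumps({"text": "", "flush": True})


def close_message():
    """Signal that no more text will be sent (CloseMessage)."""
    return json.dumps({"text": ""})
```

A session is then: send `init_message(...)` first, any number of `text_chunk(...)` messages, and `close_message()` when done.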


Channels

/stream-input
publish sendTextChunk
Send text chunks for synthesis
subscribe receiveAudioChunk
Receive generated audio chunks
Bidirectional WebSocket channel for streaming text-to-speech. Clients send text chunks and receive audio chunks in real time as the model generates speech.
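On the receive side, the three server event types can be told apart by the fields present in the payload: a final event carries `isFinal: true`, an audio chunk carries `audio`, and alignment data carries character arrays. A hedged sketch (the async loop assumes the third-party `websockets` package and is not exercised here):

```python
import json


def classify_event(raw):
    """Return ('final' | 'audio' | 'alignment', payload) for a server message.

    Discrimination by payload shape is an assumption based on the schemas
    in this spec; the server does not send an explicit type tag.
    """
    event = json.loads(raw)
    if event.get("isFinal"):
        return "final", event
    if event.get("audio"):
        return "audio", event
    return "alignment", event


async def receive_audio(ws, on_audio):
    """Drain server events until the final event arrives.

    `ws` is assumed to be a connected `websockets` client connection.
    """
    while True:
        kind, event = classify_event(await ws.recv())
        if kind == "audio":
            on_audio(event["audio"])  # base64-encoded audio chunk
        elif kind == "final":
            break
```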

Messages

AudioChunkEvent
Audio Chunk
Generated audio data chunk
AlignmentEvent
Alignment Data
Character-level timing alignment data
FinalEvent
Final Event
Signals the end of audio generation
InitMessage
Initialization Message
Initial configuration for the streaming session
TextChunkMessage
Text Chunk
Text input for speech synthesis
FlushMessage
Flush
Forces generation of remaining audio
CloseMessage
Close
Signals the end of text input
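Since each AudioChunkEvent carries base64-encoded audio, a client typically accumulates decoded bytes until the final chunk arrives. A minimal sketch, using only the `audio` and `isFinal` fields named in the schemas below:

```python
import base64


class AudioBuffer:
    """Accumulates decoded audio bytes from AudioChunkEvent payloads."""

    def __init__(self):
        self.data = bytearray()

    def add_chunk(self, event):
        """Decode one chunk into the buffer; return True when it was final."""
        if event.get("audio"):
            self.data.extend(base64.b64decode(event["audio"]))
        return bool(event.get("isFinal"))
```

The resulting `data` is raw audio in whatever `output_format` the session was initialized with, ready to write to a file or feed to a player.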

Servers

wss
production wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input
ElevenLabs Text to Speech WebSocket server for bidirectional streaming synthesis.
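Connecting means expanding the `{voice_id}` variable in the server URL and supplying the API key in the `xi-api-key` header (per the `apiKeyHeader` security scheme). A sketch of the URL and header construction; the `model_id` query parameter is an assumption, not part of the URL template above:

```python
from urllib.parse import quote

BASE = "wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input"


def stream_url(voice_id, model_id=None):
    """Expand the server URL template with a concrete voice_id."""
    url = BASE.format(voice_id=quote(voice_id, safe=""))
    if model_id:
        # Hypothetical query parameter; check current API docs before relying on it.
        url += f"?model_id={model_id}"
    return url


def auth_headers(api_key):
    """xi-api-key header required by the apiKeyHeader security scheme."""
    return {"xi-api-key": api_key}
```

With a WebSocket client library, these would be passed as the connect URL and extra headers.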

AsyncAPI Specification

asyncapi: 2.6.0
info:
  title: ElevenLabs Text to Speech Streaming Events
  description: >-
    The ElevenLabs Text to Speech WebSocket API enables bidirectional
    streaming for text-to-speech conversion. Clients send text chunks
    incrementally and receive audio chunks as they are generated, enabling
    ultra-low latency speech synthesis for real-time applications.
  version: '1.0'
  contact:
    name: ElevenLabs Support
    url: https://help.elevenlabs.io
servers:
  production:
    url: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input
    protocol: wss
    description: >-
      ElevenLabs Text to Speech WebSocket server for bidirectional
      streaming synthesis.
    variables:
      voice_id:
        description: >-
          Identifier of the voice to use for synthesis.
    security:
      - apiKeyHeader: []
channels:
  /stream-input:
    description: >-
      Bidirectional WebSocket channel for streaming text-to-speech. Clients
      send text chunks and receive audio chunks in real time as the model
      generates speech.
    publish:
      operationId: sendTextChunk
      summary: Send text chunks for synthesis
      description: >-
        Text chunks published by the client for incremental speech
        synthesis, beginning with the initial configuration message and
        followed by text input messages.
      message:
        oneOf:
          - $ref: '#/components/messages/InitMessage'
          - $ref: '#/components/messages/TextChunkMessage'
          - $ref: '#/components/messages/FlushMessage'
          - $ref: '#/components/messages/CloseMessage'
    subscribe:
      operationId: receiveAudioChunk
      summary: Receive generated audio chunks
      description: >-
        Audio chunks sent from the server as the text-to-speech model
        generates speech from the provided text input.
      message:
        oneOf:
          - $ref: '#/components/messages/AudioChunkEvent'
          - $ref: '#/components/messages/AlignmentEvent'
          - $ref: '#/components/messages/FinalEvent'
components:
  securitySchemes:
    apiKeyHeader:
      type: httpApiKey
      in: header
      name: xi-api-key
      description: >-
        ElevenLabs API key for WebSocket authentication.
  messages:
    AudioChunkEvent:
      name: audio_chunk
      title: Audio Chunk
      summary: Generated audio data chunk
      description: >-
        Contains a base64-encoded chunk of generated audio. Chunks are
        sent as they are produced by the model for low-latency playback.
      payload:
        $ref: '#/components/schemas/AudioChunkPayload'
    AlignmentEvent:
      name: alignment
      title: Alignment Data
      summary: Character-level timing alignment data
      description: >-
        Contains timing information mapping generated audio to the input
        text, enabling synchronized text highlighting.
      payload:
        $ref: '#/components/schemas/AlignmentPayload'
    FinalEvent:
      name: final
      title: Final Event
      summary: Signals the end of audio generation
      description: >-
        Sent when the server has finished generating all audio for the
        provided text input.
      payload:
        $ref: '#/components/schemas/FinalPayload'
    InitMessage:
      name: init
      title: Initialization Message
      summary: Initial configuration for the streaming session
      description: >-
        The first message sent by the client to configure the streaming
        session, including model selection, voice settings, and output
        format preferences.
      payload:
        $ref: '#/components/schemas/InitPayload'
    TextChunkMessage:
      name: text_chunk
      title: Text Chunk
      summary: Text input for speech synthesis
      description: >-
        Contains a chunk of text to be converted to speech. Text can be
        sent incrementally as it becomes available.
      payload:
        $ref: '#/components/schemas/TextChunkPayload'
    FlushMessage:
      name: flush
      title: Flush
      summary: Forces generation of remaining audio
      description: >-
        Triggers the model to generate audio for any buffered text that
        has not yet been processed. Useful for ensuring all pending text
        is synthesized.
      payload:
        $ref: '#/components/schemas/FlushPayload'
    CloseMessage:
      name: close
      title: Close
      summary: Signals the end of text input
      description: >-
        Sent by the client to indicate that no more text will be sent,
        triggering final audio generation and connection cleanup.
      payload:
        $ref: '#/components/schemas/ClosePayload'
  schemas:
    AudioChunkPayload:
      type: object
      properties:
        audio:
          type: string
          description: >-
            Base64-encoded audio data chunk.
        isFinal:
          type: boolean
          description: >-
            Whether this is the final audio chunk.
    AlignmentPayload:
      type: object
      properties:
        chars:
          type: array
          description: >-
            Character-level alignment data.
          items:
            type: string
        charStartTimesMs:
          type: array
          description: >-
            Start times in milliseconds for each character.
          items:
            type: number
        charDurationsMs:
          type: array
          description: >-
            Durations in milliseconds for each character.
          items:
            type: number
    FinalPayload:
      type: object
      properties:
        isFinal:
          type: boolean
          const: true
          description: >-
            Indicates this is the final message for the session.
    InitPayload:
      type: object
      required:
        - text
      properties:
        text:
          type: string
          description: >-
            Initial text to begin generation. Can be a space to start
            an empty session.
        voice_settings:
          type: object
          description: >-
            Voice settings for the session.
          properties:
            stability:
              type: number
              description: >-
                Voice stability setting.
              minimum: 0
              maximum: 1
            similarity_boost:
              type: number
              description: >-
                Voice similarity boost setting.
              minimum: 0
              maximum: 1
        generation_config:
          type: object
          description: >-
            Generation configuration.
          properties:
            chunk_length_schedule:
              type: array
              description: >-
                Schedule of chunk lengths for audio generation.
              items:
                type: integer
        xi_api_key:
          type: string
          description: >-
            API key for authentication if not provided in headers.
        model_id:
          type: string
          description: >-
            The TTS model to use for generation.
        output_format:
          type: string
          description: >-
            The desired audio output format.
    TextChunkPayload:
      type: object
      required:
        - text
      properties:
        text:
          type: string
          description: >-
            A chunk of text to convert to speech.
        try_trigger_generation:
          type: boolean
          description: >-
            Whether to attempt immediate generation of available text.
    FlushPayload:
      type: object
      properties:
        text:
          type: string
          const: ""
          description: >-
            Empty text string signals a flush.
        flush:
          type: boolean
          const: true
          description: >-
            Flag to trigger flushing of buffered text.
    ClosePayload:
      type: object
      properties:
        text:
          type: string
          const: ""
          description: >-
            Empty text string signaling the end of input.
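As a worked example of consuming AlignmentPayload: pairing each character with its start time and duration yields `(char, start_ms, end_ms)` spans suitable for synchronized text highlighting. Field names follow the schema above; the helper itself is illustrative.

```python
def char_spans(alignment):
    """Convert an AlignmentPayload into (char, start_ms, end_ms) tuples."""
    return [
        (char, start, start + duration)
        for char, start, duration in zip(
            alignment["chars"],
            alignment["charStartTimesMs"],
            alignment["charDurationsMs"],
        )
    ]
```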