elevenlabs · AsyncAPI Specification

ElevenLabs Text to Speech Streaming Events

Version 1.0

The ElevenLabs Text to Speech WebSocket API enables bidirectional streaming for text-to-speech conversion. Clients send text chunks incrementally and receive audio chunks as they are generated, enabling ultra-low latency speech synthesis for real-time applications.
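The client-to-server messages defined by this spec can be sketched as small JSON builders. This is an illustrative sketch, not the official SDK: field names (`text`, `voice_settings`, `flush`, `try_trigger_generation`, `model_id`) come from the payload schemas below, while the default stability and similarity values are placeholder assumptions.

```python
import json


def init_message(text=" ", stability=0.5, similarity_boost=0.8, model_id=None):
    """First message of a session: initial text plus voice settings.

    A single space for `text` starts an empty session, per the InitPayload
    schema. The numeric defaults here are illustrative, not documented values.
    """
    msg = {
        "text": text,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    }
    if model_id is not None:
        msg["model_id"] = model_id
    return json.dumps(msg)


def text_chunk(text, try_trigger_generation=False):
    """Incremental text input (TextChunkMessage)."""
    return json.dumps({"text": text, "try_trigger_generation": try_trigger_generation})


def flush_message():
    """Force generation of any buffered text (FlushMessage)."""
    return json.dumps({"text": "", "flush": True})


def close_message():
    """Signal that no more text will be sent (CloseMessage)."""
    return json.dumps({"text": ""})
```

A session is then: send `init_message(...)` first, any number of `text_chunk(...)` messages, and `close_message()` when done.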


Channels

/stream-input
publish sendTextChunk
Send text chunks for synthesis
subscribe receiveAudioChunk
Receive generated audio chunks
Bidirectional WebSocket channel for streaming text-to-speech. Clients send text chunks and receive audio chunks in real time as the model generates speech.
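On the receive side, the three server event types can be told apart by the fields present in the payload: a final event carries `isFinal: true`, an audio chunk carries `audio`, and alignment data carries character arrays. A hedged sketch (the async loop assumes the third-party `websockets` package and is not exercised here):

```python
import json


def classify_event(raw):
    """Return ('final' | 'audio' | 'alignment', payload) for a server message.

    Discrimination by payload shape is an assumption based on the schemas
    in this spec; the server does not send an explicit type tag.
    """
    event = json.loads(raw)
    if event.get("isFinal"):
        return "final", event
    if event.get("audio"):
        return "audio", event
    return "alignment", event


async def receive_audio(ws, on_audio):
    """Drain server events until the final event arrives.

    `ws` is assumed to be a connected `websockets` client connection.
    """
    while True:
        kind, event = classify_event(await ws.recv())
        if kind == "audio":
            on_audio(event["audio"])  # base64-encoded audio chunk
        elif kind == "final":
            break
```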

Messages

AudioChunkEvent
Audio Chunk
Generated audio data chunk
AlignmentEvent
Alignment Data
Character-level timing alignment data
FinalEvent
Final Event
Signals the end of audio generation
InitMessage
Initialization Message
Initial configuration for the streaming session
TextChunkMessage
Text Chunk
Text input for speech synthesis
FlushMessage
Flush
Forces generation of remaining audio
CloseMessage
Close
Signals the end of text input
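Since each AudioChunkEvent carries base64-encoded audio, a client typically accumulates decoded bytes until the final chunk arrives. A minimal sketch, using only the `audio` and `isFinal` fields named in the schemas below:

```python
import base64


class AudioBuffer:
    """Accumulates decoded audio bytes from AudioChunkEvent payloads."""

    def __init__(self):
        self.data = bytearray()

    def add_chunk(self, event):
        """Decode one chunk into the buffer; return True when it was final."""
        if event.get("audio"):
            self.data.extend(base64.b64decode(event["audio"]))
        return bool(event.get("isFinal"))
```

The resulting `data` is raw audio in whatever `output_format` the session was initialized with, ready to write to a file or feed to a player.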

Servers

wss
production wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input
ElevenLabs Text to Speech WebSocket server for bidirectional streaming synthesis.
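Connecting means expanding the `{voice_id}` variable in the server URL and supplying the API key in the `xi-api-key` header (per the `apiKeyHeader` security scheme). A sketch of the URL and header construction; the `model_id` query parameter is an assumption, not part of the URL template above:

```python
from urllib.parse import quote

BASE = "wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input"


def stream_url(voice_id, model_id=None):
    """Expand the server URL template with a concrete voice_id."""
    url = BASE.format(voice_id=quote(voice_id, safe=""))
    if model_id:
        # Hypothetical query parameter; check current API docs before relying on it.
        url += f"?model_id={model_id}"
    return url


def auth_headers(api_key):
    """xi-api-key header required by the apiKeyHeader security scheme."""
    return {"xi-api-key": api_key}
```

With a WebSocket client library, these would be passed as the connect URL and extra headers.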

AsyncAPI Specification

asyncapi: 2.6.0
info:
  title: ElevenLabs Text to Speech Streaming Events
  description: >-
    The ElevenLabs Text to Speech WebSocket API enables bidirectional
    streaming for text-to-speech conversion. Clients send text chunks
    incrementally and receive audio chunks as they are generated, enabling
    ultra-low latency speech synthesis for real-time applications.
  version: '1.0'
  contact:
    name: ElevenLabs Support
    url: https://help.elevenlabs.io
servers:
  production:
    url: wss://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream-input
    protocol: wss
    description: >-
      ElevenLabs Text to Speech WebSocket server for bidirectional
      streaming synthesis.
    variables:
      voice_id:
        description: >-
          Identifier of the voice to use for synthesis.
    security:
      - apiKeyHeader: []
channels:
  /stream-input:
    description: >-
      Bidirectional WebSocket channel for streaming text-to-speech. Clients
      send text chunks and receive audio chunks in real time as the model
      generates speech.
    publish:
      operationId: sendTextChunk
      summary: Send text chunks for synthesis
      description: >-
        Text chunks published by the client for incremental speech
        synthesis, beginning with the initial configuration message and
        followed by text input messages.
      message:
        oneOf:
          - $ref: '#/components/messages/InitMessage'
          - $ref: '#/components/messages/TextChunkMessage'
          - $ref: '#/components/messages/FlushMessage'
          - $ref: '#/components/messages/CloseMessage'
    subscribe:
      operationId: receiveAudioChunk
      summary: Receive generated audio chunks
      description: >-
        Audio chunks sent from the server as the text-to-speech model
        generates speech from the provided text input.
      message:
        oneOf:
          - $ref: '#/components/messages/AudioChunkEvent'
          - $ref: '#/components/messages/AlignmentEvent'
          - $ref: '#/components/messages/FinalEvent'
components:
  securitySchemes:
    apiKeyHeader:
      type: httpApiKey
      in: header
      name: xi-api-key
      description: >-
        ElevenLabs API key for WebSocket authentication.
  messages:
    AudioChunkEvent:
      name: audio_chunk
      title: Audio Chunk
      summary: Generated audio data chunk
      description: >-
        Contains a base64-encoded chunk of generated audio. Chunks are
        sent as they are produced by the model for low-latency playback.
      payload:
        $ref: '#/components/schemas/AudioChunkPayload'
    AlignmentEvent:
      name: alignment
      title: Alignment Data
      summary: Character-level timing alignment data
      description: >-
        Contains timing information mapping generated audio to the input
        text, enabling synchronized text highlighting.
      payload:
        $ref: '#/components/schemas/AlignmentPayload'
    FinalEvent:
      name: final
      title: Final Event
      summary: Signals the end of audio generation
      description: >-
        Sent when the server has finished generating all audio for the
        provided text input.
      payload:
        $ref: '#/components/schemas/FinalPayload'
    InitMessage:
      name: init
      title: Initialization Message
      summary: Initial configuration for the streaming session
      description: >-
        The first message sent by the client to configure the streaming
        session, including model selection, voice settings, and output
        format preferences.
      payload:
        $ref: '#/components/schemas/InitPayload'
    TextChunkMessage:
      name: text_chunk
      title: Text Chunk
      summary: Text input for speech synthesis
      description: >-
        Contains a chunk of text to be converted to speech. Text can be
        sent incrementally as it becomes available.
      payload:
        $ref: '#/components/schemas/TextChunkPayload'
    FlushMessage:
      name: flush
      title: Flush
      summary: Forces generation of remaining audio
      description: >-
        Triggers the model to generate audio for any buffered text that
        has not yet been processed. Useful for ensuring all pending text
        is synthesized.
      payload:
        $ref: '#/components/schemas/FlushPayload'
    CloseMessage:
      name: close
      title: Close
      summary: Signals the end of text input
      description: >-
        Sent by the client to indicate that no more text will be sent,
        triggering final audio generation and connection cleanup.
      payload:
        $ref: '#/components/schemas/ClosePayload'
  schemas:
    AudioChunkPayload:
      type: object
      properties:
        audio:
          type: string
          description: >-
            Base64-encoded audio data chunk.
        isFinal:
          type: boolean
          description: >-
            Whether this is the final audio chunk.
    AlignmentPayload:
      type: object
      properties:
        chars:
          type: array
          description: >-
            Character-level alignment data.
          items:
            type: string
        charStartTimesMs:
          type: array
          description: >-
            Start times in milliseconds for each character.
          items:
            type: number
        charDurationsMs:
          type: array
          description: >-
            Durations in milliseconds for each character.
          items:
            type: number
    FinalPayload:
      type: object
      properties:
        isFinal:
          type: boolean
          const: true
          description: >-
            Indicates this is the final message for the session.
    InitPayload:
      type: object
      required:
        - text
      properties:
        text:
          type: string
          description: >-
            Initial text to begin generation. Can be a space to start
            an empty session.
        voice_settings:
          type: object
          description: >-
            Voice settings for the session.
          properties:
            stability:
              type: number
              description: >-
                Voice stability setting.
              minimum: 0
              maximum: 1
            similarity_boost:
              type: number
              description: >-
                Voice similarity boost setting.
              minimum: 0
              maximum: 1
        generation_config:
          type: object
          description: >-
            Generation configuration.
          properties:
            chunk_length_schedule:
              type: array
              description: >-
                Schedule of chunk lengths for audio generation.
              items:
                type: integer
        xi_api_key:
          type: string
          description: >-
            API key for authentication if not provided in headers.
        model_id:
          type: string
          description: >-
            The TTS model to use for generation.
        output_format:
          type: string
          description: >-
            The desired audio output format.
    TextChunkPayload:
      type: object
      required:
        - text
      properties:
        text:
          type: string
          description: >-
            A chunk of text to convert to speech.
        try_trigger_generation:
          type: boolean
          description: >-
            Whether to attempt immediate generation of available text.
    FlushPayload:
      type: object
      properties:
        text:
          type: string
          const: ""
          description: >-
            Empty text string signals a flush.
        flush:
          type: boolean
          const: true
          description: >-
            Flag to trigger flushing of buffered text.
    ClosePayload:
      type: object
      properties:
        text:
          type: string
          const: ""
          description: >-
            Empty text string signaling the end of input.
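As a worked example of consuming AlignmentPayload: pairing each character with its start time and duration yields `(char, start_ms, end_ms)` spans suitable for synchronized text highlighting. Field names follow the schema above; the helper itself is illustrative.

```python
def char_spans(alignment):
    """Convert an AlignmentPayload into (char, start_ms, end_ms) tuples."""
    return [
        (char, start, start + duration)
        for char, start, duration in zip(
            alignment["chars"],
            alignment["charStartTimesMs"],
            alignment["charDurationsMs"],
        )
    ]
```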