PlayHT · AsyncAPI Specification

PlayAI Realtime WebSocket APIs

Version 1.0.0

AsyncAPI 2.6 description of the PlayAI (formerly PlayHT) realtime WebSocket APIs. Covers the Text-to-Speech (TTS) streaming WebSocket used to synthesize audio from text in real time, and the Voice Agents WebSocket used to operate audio-in / audio-out conversational agents.

The TTS WebSocket URL is obtained dynamically from the HTTPS endpoint POST https://api.play.ai/api/v1/tts/websocket-auth using the Authorization (Bearer) and X-User-Id headers. The response contains a `webSocketUrls` map keyed by model (Play3.0-mini, PlayDialog, PlayDialogArabic, PlayDialogHindi, PlayDialogLora, PlayDialogMultilingual) along with an `expiresAt` timestamp. The returned URLs currently point to fal-hosted WebSocket gateways (e.g. wss://ws.fal.run/playht-fal/...).

Voice Agents are reached directly at wss://api.play.ai/v1/talk/{agentId} and are authenticated by a `setup` message containing the API key.

Sources:
- https://docs.play.ai/api-reference/text-to-speech/websocket.md
- https://docs.play.ai/api-reference/agents/websocket.md
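
The authentication step for the TTS WebSocket can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not an official client: the function names are hypothetical, the `Bearer <api key>` header value is inferred from the "Authorization (Bearer)" wording, and sending an empty POST body is an assumption.

```python
import json
import urllib.request

AUTH_URL = "https://api.play.ai/api/v1/tts/websocket-auth"


def auth_headers(api_key: str, user_id: str) -> dict:
    """Headers for the websocket-auth call (the Bearer format is an assumption)."""
    return {"Authorization": f"Bearer {api_key}", "X-User-Id": user_id}


def fetch_websocket_auth(api_key: str, user_id: str) -> dict:
    """POST to the auth endpoint and return the decoded JSON body."""
    request = urllib.request.Request(
        AUTH_URL,
        data=b"",  # empty body is an assumption; the spec only names the headers
        headers=auth_headers(api_key, user_id),
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def pick_tts_url(auth_response: dict, model: str) -> str:
    """Look up the WebSocket URL for one model in the `webSocketUrls` map."""
    urls = auth_response.get("webSocketUrls", {})
    if model not in urls:
        raise KeyError(f"no WebSocket URL issued for model {model!r}")
    return urls[model]
```

A caller would pass the result of `fetch_websocket_auth(...)` to `pick_tts_url(..., "PlayDialog")` and open a WebSocket to the returned URL with a library of its choice.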

Channels

/playht-tts/stream
publish sendTtsCommand
Send a TTS synthesis command.
TTS streaming channel for the Play3.0-mini model. Clients send JSON TTS command frames and receive JSON `start` / `end` control frames interleaved with binary audio chunks. The `fal_jwt_token` query parameter is obtained from the websocket-auth endpoint. The `/playht-tts-ldm/stream` path is used for the PlayDialog model, and similar per-model paths are returned for the other PlayDialog variants.
/v1/talk/{agentId}
publish sendAgentClientMessage
Send a client message to the voice agent.
Voice Agents audio-in / audio-out WebSocket. The connection is established with the target agent identifier; the first client message must be a `setup` frame carrying the API key and the desired audio configuration. The server sends `audioStream` chunks, voice activity events, `newAudioStream` markers, and `error` messages.
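
The TTS channel's frame sequence (a JSON `start` frame, binary audio chunks, then a JSON `end` frame per command) can be illustrated with a small demultiplexer. This is a sketch only: the function names are hypothetical, and it assumes the WebSocket library delivers text frames as `str` and binary frames as `bytes`.

```python
import json
from typing import Iterable, Optional, Union


def tts_command(text: str, voice: str, request_id: Optional[str] = None,
                output_format: Optional[str] = None) -> str:
    """Serialize a TTS command; optional fields are omitted when unset."""
    command = {"text": text, "voice": voice}
    if request_id is not None:
        command["request_id"] = request_id
    if output_format is not None:
        command["output_format"] = output_format
    return json.dumps(command)


def collect_tts_audio(frames: Iterable[Union[str, bytes]]) -> dict:
    """Group binary chunks by the request_id of the enclosing start/end pair."""
    audio = {}
    current = None  # request_id of the stream currently open, if any
    for frame in frames:
        if isinstance(frame, bytes):
            if current is None:
                raise ValueError("binary chunk received outside start/end")
            audio[current] += frame
            continue
        control = json.loads(frame)
        if control.get("type") == "start":
            # request_id is optional in the schema; fall back to an empty key
            current = control.get("request_id", "")
            audio.setdefault(current, b"")
        elif control.get("type") == "end":
            current = None
    return audio
```

Because responses arrive in the same order as the commands, tracking a single open stream at a time is enough here; a client that plays audio as it arrives would forward each chunk instead of accumulating it.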

Messages

TtsCommand
TTS Command
Synthesize text on the streaming TTS connection.
TtsStart
TTS Start
Marks the start of a TTS response stream for a given request_id.
TtsEnd
TTS End
Marks the end of a TTS response stream for a given request_id.
TtsAudioChunk
TTS Audio Chunk
Binary audio frame delivered between the `start` and `end` JSON frames. The payload bytes match the configured `output_format` (for example, MP3 data when `mp3` is requested).
AgentSetup
Agent Setup
First client message on a Voice Agents connection. Carries the API key and the desired audio in/out configuration.
AgentAudioIn
Agent Audio Input
Streams base64-encoded user audio into the agent.
AgentAudioStream
Agent Audio Stream
Base64-encoded chunk of the agent's spoken response.
AgentNewAudioStream
Agent New Audio Stream
Indicates the start of a new agent response stream. Clients should clear their playback buffer and start playing the new stream.
AgentVoiceActivityStart
Voice Activity Start
Server detected the user started speaking.
AgentVoiceActivityEnd
Voice Activity End
Server detected the user stopped speaking.
AgentError
Agent Error
Error message emitted by the Voice Agents server.
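
The Voice Agents messages above can be exercised with a short Python sketch. It is illustrative rather than authoritative: `setup_message`, `audio_in_message`, and `AgentPlayback` are hypothetical names, and raising `RuntimeError` on an `error` frame is one possible design choice. The error code labels are taken from the `AgentErrorPayload` schema in the specification.

```python
import base64
import json

# Error codes documented in the AgentErrorPayload schema.
AGENT_ERROR_CODES = {
    1001: "invalid authorization token",
    1002: "invalid agent ID",
    1003: "invalid authorization credentials",
    1005: "insufficient credits",
    4400: "invalid parameters / message format",
    4401: "unauthorized access",
    4429: "maximum concurrent connections exceeded",
    4500: "internal server error",
}


def setup_message(api_key: str, **options) -> str:
    """Build the first client frame; options are extra setup fields."""
    return json.dumps({"type": "setup", "apiKey": api_key, **options})


def audio_in_message(chunk: bytes) -> str:
    """Wrap raw user audio as a base64-encoded `audioIn` frame."""
    data = base64.b64encode(chunk).decode("ascii")
    return json.dumps({"type": "audioIn", "data": data})


class AgentPlayback:
    """Tracks the playback buffer and user-speaking state from server frames."""

    def __init__(self) -> None:
        self.buffer = bytearray()
        self.user_speaking = False

    def handle(self, raw: str) -> None:
        message = json.loads(raw)
        kind = message.get("type")
        if kind == "audioStream":
            self.buffer += base64.b64decode(message["data"])
        elif kind == "newAudioStream":
            self.buffer.clear()  # drop stale audio before the new response
        elif kind == "voiceActivityStart":
            self.user_speaking = True
        elif kind == "voiceActivityEnd":
            self.user_speaking = False
        elif kind == "error":
            code = message["code"]
            label = AGENT_ERROR_CODES.get(code, "undocumented code")
            raise RuntimeError(f"agent error {code} ({label}): {message['message']}")
```

For example, `setup_message("<api key>", outputFormat="mp3", outputSampleRate=44100)` produces the first frame to send, after which each captured audio chunk goes through `audio_in_message` and each received text frame through `AgentPlayback.handle`.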

Servers

wss
tts ws.fal.run/playht-fal
Dynamically issued PlayAI TTS WebSocket gateway. The exact URL (including the `fal_jwt_token` query parameter and the model-specific path such as `/playht-tts/stream` for Play3.0-mini or `/playht-tts-ldm/stream` for PlayDialog) is returned by POST https://api.play.ai/api/v1/tts/websocket-auth. Connections last for up to 1 hour before re-authentication is required.
wss
agents api.play.ai
PlayAI Voice Agents WebSocket gateway. Connect to wss://api.play.ai/v1/talk/{agentId} and authenticate by sending a `setup` message that includes your API key.
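
The two server entries suggest two small client-side helpers, sketched below in Python. Both are assumptions for illustration: the helper names are hypothetical, percent-encoding the agent identifier is a defensive choice rather than a documented requirement, and `expiresAt` is assumed to be an ISO 8601 timestamp with a UTC offset or trailing `Z`.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import quote


def agent_url(agent_id: str) -> str:
    """Build the Voice Agents URL, percent-encoding the agent identifier."""
    return f"wss://api.play.ai/v1/talk/{quote(agent_id, safe='')}"


def needs_reauth(expires_at: str, now: datetime,
                 margin: timedelta = timedelta(minutes=1)) -> bool:
    """True when the issued TTS URL is expired or within `margin` of expiry.

    `now` must be timezone-aware, e.g. datetime.now(timezone.utc).
    """
    # Assumes ISO 8601; a trailing "Z" is normalized for older Python versions.
    expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
    return now >= expiry - margin
```

A long-running TTS client could call `needs_reauth` before each new connection and request fresh URLs from the websocket-auth endpoint when it returns true.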

AsyncAPI Specification

asyncapi: '2.6.0'
info:
  title: PlayAI Realtime WebSocket APIs
  version: '1.0.0'
  description: >-
    AsyncAPI 2.6 description of the PlayAI (formerly PlayHT) realtime WebSocket
    APIs. Covers the Text-to-Speech (TTS) streaming WebSocket used to synthesize
    audio from text in real time, and the Voice Agents WebSocket used to
    operate audio-in / audio-out conversational agents.

    The TTS WebSocket URL is obtained dynamically from the HTTPS endpoint
    POST https://api.play.ai/api/v1/tts/websocket-auth using the
    Authorization (Bearer) and X-User-Id headers. The response contains a
    `webSocketUrls` map keyed by model (Play3.0-mini, PlayDialog,
    PlayDialogArabic, PlayDialogHindi, PlayDialogLora, PlayDialogMultilingual)
    along with an `expiresAt` timestamp. The returned URLs currently point to
    fal-hosted WebSocket gateways (e.g. wss://ws.fal.run/playht-fal/...).

    Voice Agents are reached directly at wss://api.play.ai/v1/talk/{agentId}
    and are authenticated by a `setup` message containing the API key.

    Sources:
      - https://docs.play.ai/api-reference/text-to-speech/websocket.md
      - https://docs.play.ai/api-reference/agents/websocket.md
  contact:
    name: PlayAI Developer Support
    url: https://docs.play.ai
  license:
    name: PlayAI Terms of Service
    url: https://play.ht/terms

defaultContentType: application/json

servers:
  tts:
    url: ws.fal.run/playht-fal
    protocol: wss
    description: >-
      Dynamically issued PlayAI TTS WebSocket gateway. The exact URL (including
      the `fal_jwt_token` query parameter and the model-specific path such as
      `/playht-tts/stream` for Play3.0-mini or `/playht-tts-ldm/stream` for
      PlayDialog) is returned by POST
      https://api.play.ai/api/v1/tts/websocket-auth. Connections last for up to
      1 hour before re-authentication is required.
  agents:
    url: api.play.ai
    protocol: wss
    description: >-
      PlayAI Voice Agents WebSocket gateway. Connect to
      wss://api.play.ai/v1/talk/{agentId} and authenticate by sending a
      `setup` message that includes your API key.

channels:
  /playht-tts/stream:
    description: >-
      TTS streaming channel for the Play3.0-mini model. Clients send JSON TTS
      command frames and receive JSON `start` / `end` control frames
      interleaved with binary audio chunks. The `fal_jwt_token` query
      parameter is obtained from the websocket-auth endpoint. The
      `/playht-tts-ldm/stream` path is used for the PlayDialog model, and
      similar per-model paths are returned for the other PlayDialog variants.
    servers:
      - tts
    bindings:
      ws:
        bindingVersion: '0.1.0'
        query:
          type: object
          required:
            - fal_jwt_token
          properties:
            fal_jwt_token:
              type: string
              description: Short-lived session token from /api/v1/tts/websocket-auth.
    publish:
      operationId: sendTtsCommand
      summary: Send a TTS synthesis command.
      description: >-
        Send a JSON TTS command. If a sequence of commands is sent on the same
        connection, audio output is returned in the same order as the
        requests.
      message:
        oneOf:
          - $ref: '#/components/messages/TtsCommand'
    subscribe:
      operationId: receiveTtsStream
      summary: Receive TTS synthesis events and audio.
      description: >-
        Receive a `start` JSON frame, one or more binary audio chunks, and
        then an `end` JSON frame for each TTS command. Binary frames carry
        audio data in the configured `output_format`.
      message:
        oneOf:
          - $ref: '#/components/messages/TtsStart'
          - $ref: '#/components/messages/TtsAudioChunk'
          - $ref: '#/components/messages/TtsEnd'

  /v1/talk/{agentId}:
    description: >-
      Voice Agents audio-in / audio-out WebSocket. The connection is
      established with the target agent identifier; the first client message
      must be a `setup` frame carrying the API key and the desired audio
      configuration. The server sends `audioStream` chunks, voice activity
      events, `newAudioStream` markers, and `error` messages.
    servers:
      - agents
    parameters:
      agentId:
        description: PlayAI agent identifier.
        schema:
          type: string
    publish:
      operationId: sendAgentClientMessage
      summary: Send a client message to the voice agent.
      message:
        oneOf:
          - $ref: '#/components/messages/AgentSetup'
          - $ref: '#/components/messages/AgentAudioIn'
    subscribe:
      operationId: receiveAgentServerMessage
      summary: Receive messages from the voice agent.
      message:
        oneOf:
          - $ref: '#/components/messages/AgentAudioStream'
          - $ref: '#/components/messages/AgentNewAudioStream'
          - $ref: '#/components/messages/AgentVoiceActivityStart'
          - $ref: '#/components/messages/AgentVoiceActivityEnd'
          - $ref: '#/components/messages/AgentError'

components:
  messages:
    TtsCommand:
      name: TtsCommand
      title: TTS Command
      summary: Synthesize text on the streaming TTS connection.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/TtsCommandPayload'

    TtsStart:
      name: TtsStart
      title: TTS Start
      summary: Marks the start of a TTS response stream for a given request_id.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/TtsStartPayload'

    TtsEnd:
      name: TtsEnd
      title: TTS End
      summary: Marks the end of a TTS response stream for a given request_id.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/TtsEndPayload'

    TtsAudioChunk:
      name: TtsAudioChunk
      title: TTS Audio Chunk
      summary: >-
        Binary audio frame delivered between the `start` and `end` JSON
        frames. The payload bytes match the configured `output_format`
        (for example, MP3 data when `mp3` is requested).
      contentType: application/octet-stream
      payload:
        type: string
        format: binary
        description: Raw binary audio data for one chunk of the TTS response.

    AgentSetup:
      name: AgentSetup
      title: Agent Setup
      summary: >-
        First client message on a Voice Agents connection. Carries the API
        key and the desired audio in/out configuration.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/AgentSetupPayload'

    AgentAudioIn:
      name: AgentAudioIn
      title: Agent Audio Input
      summary: Streams base64-encoded user audio into the agent.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/AgentAudioInPayload'

    AgentAudioStream:
      name: AgentAudioStream
      title: Agent Audio Stream
      summary: Base64-encoded chunk of the agent's spoken response.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/AgentAudioStreamPayload'

    AgentNewAudioStream:
      name: AgentNewAudioStream
      title: Agent New Audio Stream
      summary: >-
        Indicates the start of a new agent response stream. Clients should
        clear their playback buffer and start playing the new stream.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/AgentNewAudioStreamPayload'

    AgentVoiceActivityStart:
      name: AgentVoiceActivityStart
      title: Voice Activity Start
      summary: Server detected the user started speaking.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/AgentVoiceActivityStartPayload'

    AgentVoiceActivityEnd:
      name: AgentVoiceActivityEnd
      title: Voice Activity End
      summary: Server detected the user stopped speaking.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/AgentVoiceActivityEndPayload'

    AgentError:
      name: AgentError
      title: Agent Error
      summary: Error message emitted by the Voice Agents server.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/AgentErrorPayload'

  schemas:
    TtsCommandPayload:
      type: object
      required:
        - text
        - voice
      properties:
        text:
          type: string
          description: Text to synthesize.
        voice:
          type: string
          description: >-
            Voice identifier (PlayAI voice URL or ID) to use for synthesis.
        request_id:
          type: string
          description: >-
            Optional client-supplied request identifier. Echoed back on the
            corresponding `start` and `end` frames.
        output_format:
          type: string
          description: >-
            Desired audio output format for the streamed binary chunks
            (matches the TTS streaming API formats, for example `mp3`).
        temperature:
          type: number
          minimum: 0.0
          maximum: 1.0
          description: Sampling temperature.
        speed:
          type: number
          minimum: 0.5
          maximum: 2.0
          description: Playback speed multiplier.

    TtsStartPayload:
      type: object
      required:
        - type
      properties:
        type:
          type: string
          const: start
          description: Discriminator value.
        request_id:
          type: string
          description: >-
            Identifier of the TTS command this stream corresponds to.

    TtsEndPayload:
      type: object
      required:
        - type
      properties:
        type:
          type: string
          const: end
          description: Discriminator value.
        request_id:
          type: string
          description: >-
            Identifier of the TTS command whose stream has ended.

    AgentSetupPayload:
      type: object
      required:
        - type
        - apiKey
      properties:
        type:
          type: string
          const: setup
        apiKey:
          type: string
          description: PlayAI API key.
        inputEncoding:
          type: string
          description: Format of the audio the client will send.
          enum:
            - media-container
            - mulaw
            - linear16
            - flac
            - amr-nb
            - amr-wb
            - opus
            - speex
            - g729
          default: media-container
        inputSampleRate:
          type: integer
          description: >-
            Sample rate of incoming audio, in Hz. Required for headerless
            formats.
        outputFormat:
          type: string
          description: Format the server should use for `audioStream` chunks.
          enum:
            - mp3
            - raw
            - wav
            - ogg
            - flac
            - mulaw
          default: mp3
        outputSampleRate:
          type: integer
          description: Sample rate for outgoing audio, in Hz.
          default: 44100
        customGreeting:
          type: string
          description: Overrides the agent's default greeting.
        prompt:
          type: string
          description: Additional behavioral instructions for the agent.
        continueConversation:
          type: string
          description: >-
            Conversation ID of a prior session to resume.

    AgentAudioInPayload:
      type: object
      required:
        - type
        - data
      properties:
        type:
          type: string
          const: audioIn
        data:
          type: string
          format: byte
          description: >-
            Base64-encoded audio chunk matching the configured
            `inputEncoding` and `inputSampleRate`.

    AgentAudioStreamPayload:
      type: object
      required:
        - type
        - data
      properties:
        type:
          type: string
          const: audioStream
        data:
          type: string
          format: byte
          description: >-
            Base64-encoded audio chunk matching the configured
            `outputFormat` and `outputSampleRate`.

    AgentNewAudioStreamPayload:
      type: object
      required:
        - type
      properties:
        type:
          type: string
          const: newAudioStream

    AgentVoiceActivityStartPayload:
      type: object
      required:
        - type
      properties:
        type:
          type: string
          const: voiceActivityStart

    AgentVoiceActivityEndPayload:
      type: object
      required:
        - type
      properties:
        type:
          type: string
          const: voiceActivityEnd

    AgentErrorPayload:
      type: object
      required:
        - type
        - code
        - message
      properties:
        type:
          type: string
          const: error
        code:
          type: integer
          description: >-
            Numeric error code. Documented codes include 1001 (invalid
            authorization token), 1002 (invalid agent ID), 1003 (invalid
            authorization credentials), 1005 (insufficient credits), 4400
            (invalid parameters / message format), 4401 (unauthorized
            access), 4429 (maximum concurrent connections exceeded), and
            4500 (internal server error).
        message:
          type: string
          description: Human-readable error description.