Deepgram · AsyncAPI Specification

Deepgram Speech-to-Text Streaming Events

Version 1.0

The Deepgram Speech-to-Text streaming API provides real-time transcription of audio over a WebSocket connection. The client sends audio as binary WebSocket messages and receives transcription results as JSON messages in real time, including interim results, final results, speaker diarization, and speech detection events. The API supports the same model family and feature parameters as the pre-recorded API.

Tags: Artificial Intelligence · Speech-To-Text · Text-To-Speech · Transcription · Voice AI · AsyncAPI · Webhooks · Events

Channels

/v1/listen
publish sendAudioData
Send audio data for real-time transcription
WebSocket channel for real-time speech-to-text streaming. The client sends binary audio frames and receives JSON transcription events. Connection parameters include model, language, punctuate, diarize, smart_format, interim_results, utterance_end_ms, vad_events, and encoding options.
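The connection parameters listed above are passed as a query string when opening the WebSocket. A minimal sketch of building the session URL (the specific values shown, such as the model name and sample rate, are illustrative assumptions, not a fixed list):

```python
from urllib.parse import urlencode

def build_listen_url(base: str = "wss://api.deepgram.com/v1/listen",
                     **params: str) -> str:
    """Append session configuration as query parameters to the channel URL."""
    return f"{base}?{urlencode(params)}" if params else base

# Hypothetical session configuration; adjust to your account and audio source.
url = build_listen_url(
    model="nova-2",            # assumed model name
    language="en",
    punctuate="true",
    interim_results="true",
    utterance_end_ms="1000",
    vad_events="true",
    encoding="linear16",       # raw 16-bit PCM
    sample_rate="16000",
)
```

The resulting URL is then opened with any WebSocket client, passing the API key per the spec's `bearerAuth` scheme (a `token` query parameter or an Authorization header).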

Messages

AudioFrame
Audio Frame
Binary audio data frame
CloseStream
Close Stream
Signal to close the audio stream
KeepAlive
Keep Alive
Keep the connection alive
TranscriptResult
Transcript Result
Real-time transcription result
SpeechStarted
Speech Started
Speech activity detected
UtteranceEnd
Utterance End
End of utterance detected
StreamMetadata
Stream Metadata
Stream metadata information
StreamError
Stream Error
Stream error event
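Each server-to-client message carries a `type` field (`Results`, `SpeechStarted`, `UtteranceEnd`, `Metadata`, `Error`) that identifies which payload schema applies. A minimal dispatcher sketch over parsed events (field access follows the payload schemas in the spec below; the output formatting is illustrative):

```python
import json

def handle_event(raw: str) -> str:
    """Route a server JSON message by its `type` discriminator."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "Results":
        alt = event["channel"]["alternatives"][0]
        marker = "final" if event.get("is_final") else "interim"
        return f"[{marker}] {alt.get('transcript', '')}"
    if kind == "SpeechStarted":
        return f"speech started at {event.get('timestamp')}s"
    if kind == "UtteranceEnd":
        return f"utterance ended at {event.get('last_word_end')}s"
    if kind == "Metadata":
        return f"session {event.get('request_id')}"
    if kind == "Error":
        return f"error: {event.get('description')}"
    return f"unhandled message type: {kind!r}"
```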

Servers

wss
production wss://api.deepgram.com/v1/listen
Deepgram production WebSocket server for real-time speech-to-text streaming. Connect with query parameters to configure the transcription session.
wss
eu wss://api.eu.deepgram.com/v1/listen
Deepgram EU WebSocket server for real-time speech-to-text streaming.

AsyncAPI Specification

asyncapi: 2.6.0
info:
  title: Deepgram Speech-to-Text Streaming Events
  description: >-
    The Deepgram Speech-to-Text streaming API provides real-time transcription
    of audio over a WebSocket connection. The client sends audio as binary
    WebSocket messages and receives transcription results as JSON messages in
    real time, including interim results, final results, speaker diarization,
    and speech detection events. The API supports the same model family and
    feature parameters as the pre-recorded API.
  version: '1.0'
  contact:
    name: Deepgram Support
    url: https://developers.deepgram.com
servers:
  production:
    url: 'wss://api.deepgram.com/v1/listen'
    protocol: wss
    description: >-
      Deepgram production WebSocket server for real-time speech-to-text
      streaming. Connect with query parameters to configure the transcription
      session.
    security:
      - bearerAuth: []
  eu:
    url: 'wss://api.eu.deepgram.com/v1/listen'
    protocol: wss
    description: >-
      Deepgram EU WebSocket server for real-time speech-to-text streaming.
    security:
      - bearerAuth: []
channels:
  /v1/listen:
    description: >-
      WebSocket channel for real-time speech-to-text streaming. The client
      sends binary audio frames and receives JSON transcription events.
      Connection parameters include model, language, punctuate, diarize,
      smart_format, interim_results, utterance_end_ms, vad_events, and
      encoding options.
    publish:
      operationId: sendAudioData
      summary: Send audio data for real-time transcription
      description: >-
        Client sends binary audio data frames to the WebSocket connection.
        Audio should be sent as binary WebSocket messages. Send a JSON close
        message to signal end of audio stream.
      message:
        oneOf:
          - $ref: '#/components/messages/AudioFrame'
          - $ref: '#/components/messages/CloseStream'
          - $ref: '#/components/messages/KeepAlive'
    subscribe:
      operationId: receiveTranscriptionEvents
      summary: Receive transcription events
      description: >-
        Server sends JSON messages containing transcription results, metadata,
        and stream lifecycle events.
      message:
        oneOf:
          - $ref: '#/components/messages/TranscriptResult'
          - $ref: '#/components/messages/SpeechStarted'
          - $ref: '#/components/messages/UtteranceEnd'
          - $ref: '#/components/messages/StreamMetadata'
          - $ref: '#/components/messages/StreamError'
components:
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      description: >-
        Deepgram API key passed as a token query parameter or Authorization
        header when establishing the WebSocket connection.
  messages:
    AudioFrame:
      name: AudioFrame
      title: Audio Frame
      summary: Binary audio data frame
      description: >-
        Raw binary audio data sent as a WebSocket binary message. The audio
        encoding format should be specified via connection query parameters.
      contentType: application/octet-stream
      payload:
        type: string
        format: binary
        description: >-
          Raw binary audio data in the configured encoding format.
    CloseStream:
      name: CloseStream
      title: Close Stream
      summary: Signal to close the audio stream
      description: >-
        JSON message sent by the client to signal the end of the audio
        stream, triggering final processing of any remaining audio.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/CloseStreamPayload'
    KeepAlive:
      name: KeepAlive
      title: Keep Alive
      summary: Keep the connection alive
      description: >-
        JSON message sent by the client to keep the WebSocket connection
        alive during periods of silence without closing the stream.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/KeepAlivePayload'
    TranscriptResult:
      name: TranscriptResult
      title: Transcript Result
      summary: Real-time transcription result
      description: >-
        JSON message containing transcription results. Can be an interim
        result (is_final=false) or a final result (is_final=true) depending
        on the interim_results connection parameter.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/TranscriptResultPayload'
    SpeechStarted:
      name: SpeechStarted
      title: Speech Started
      summary: Speech activity detected
      description: >-
        Event indicating that speech activity has been detected in the
        audio stream. Sent when vad_events is enabled.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/SpeechStartedPayload'
    UtteranceEnd:
      name: UtteranceEnd
      title: Utterance End
      summary: End of utterance detected
      description: >-
        Event indicating that the end of an utterance has been detected
        based on the configured utterance_end_ms threshold.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/UtteranceEndPayload'
    StreamMetadata:
      name: StreamMetadata
      title: Stream Metadata
      summary: Stream metadata information
      description: >-
        Metadata about the streaming session including request ID, model
        information, and session configuration.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/StreamMetadataPayload'
    StreamError:
      name: StreamError
      title: Stream Error
      summary: Stream error event
      description: >-
        Error event indicating an issue with the streaming session.
      contentType: application/json
      payload:
        $ref: '#/components/schemas/StreamErrorPayload'
  schemas:
    CloseStreamPayload:
      type: object
      required:
        - type
      properties:
        type:
          type: string
          const: CloseStream
          description: >-
            Message type identifier.
    KeepAlivePayload:
      type: object
      required:
        - type
      properties:
        type:
          type: string
          const: KeepAlive
          description: >-
            Message type identifier.
    TranscriptResultPayload:
      type: object
      properties:
        type:
          type: string
          const: Results
          description: >-
            Message type identifier.
        channel_index:
          type: array
          items:
            type: integer
          description: >-
            Index of this channel and the total channel count (e.g. [0, 1]).
        duration:
          type: number
          format: float
          description: >-
            Duration of audio processed in seconds.
        start:
          type: number
          format: float
          description: >-
            Start time of this result in seconds.
        is_final:
          type: boolean
          description: >-
            Whether this is a final or interim result.
        speech_final:
          type: boolean
          description: >-
            Whether the speech endpoint has been detected.
        channel:
          type: object
          properties:
            alternatives:
              type: array
              items:
                $ref: '#/components/schemas/StreamAlternative'
              description: >-
                Alternative transcriptions ordered by confidence.
          description: >-
            Channel transcription data.
    StreamAlternative:
      type: object
      properties:
        transcript:
          type: string
          description: >-
            Transcript text for this alternative.
        confidence:
          type: number
          format: float
          description: >-
            Confidence score for this alternative.
          minimum: 0
          maximum: 1
        words:
          type: array
          items:
            $ref: '#/components/schemas/StreamWord'
          description: >-
            Individual words with timing information.
    StreamWord:
      type: object
      properties:
        word:
          type: string
          description: >-
            The transcribed word.
        start:
          type: number
          format: float
          description: >-
            Start time of the word in seconds.
        end:
          type: number
          format: float
          description: >-
            End time of the word in seconds.
        confidence:
          type: number
          format: float
          description: >-
            Confidence score for this word.
        speaker:
          type: integer
          description: >-
            Speaker identifier when diarization is enabled.
        punctuated_word:
          type: string
          description: >-
            The word with punctuation applied.
    SpeechStartedPayload:
      type: object
      properties:
        type:
          type: string
          const: SpeechStarted
          description: >-
            Message type identifier.
        channel:
          type: array
          items:
            type: integer
          description: >-
            Channel indices where speech was detected.
        timestamp:
          type: number
          format: float
          description: >-
            Timestamp in seconds when speech was detected.
    UtteranceEndPayload:
      type: object
      properties:
        type:
          type: string
          const: UtteranceEnd
          description: >-
            Message type identifier.
        channel:
          type: array
          items:
            type: integer
          description: >-
            Channel indices for the utterance.
        last_word_end:
          type: number
          format: float
          description: >-
            Timestamp in seconds of the last word in the utterance.
    StreamMetadataPayload:
      type: object
      properties:
        type:
          type: string
          const: Metadata
          description: >-
            Message type identifier.
        transaction_key:
          type: string
          description: >-
            Transaction key for this session.
        request_id:
          type: string
          description: >-
            Unique request identifier for this session.
        sha256:
          type: string
          description: >-
            SHA-256 hash identifier.
        created:
          type: string
          format: date-time
          description: >-
            Timestamp when the session was created.
        duration:
          type: number
          format: float
          description: >-
            Total duration of audio processed, in seconds.
        channels:
          type: integer
          description: >-
            Number of audio channels.
        models:
          type: array
          items:
            type: string
          description: >-
            Model identifiers used for transcription.
        model_info:
          type: object
          additionalProperties: true
          description: >-
            Detailed model information.
    StreamErrorPayload:
      type: object
      properties:
        type:
          type: string
          const: Error
          description: >-
            Message type identifier.
        description:
          type: string
          description: >-
            Human-readable error description.
        message:
          type: string
          description: >-
            Error message.
        variant:
          type: string
          description: >-
            Error variant classifier.
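To close with a sketch of the client side of this contract: binary frames carry audio, while `KeepAlive` and `CloseStream` are the only JSON messages the client sends. The helpers below serialize those control messages and split raw linear16 PCM into frame-sized binary messages (the 20 ms frame size is a common choice, not something this spec mandates):

```python
import json

def control_message(kind: str) -> str:
    """Serialize a client control message as defined by the payload schemas."""
    if kind not in ("KeepAlive", "CloseStream"):
        raise ValueError(f"unknown control message: {kind}")
    return json.dumps({"type": kind})

def chunk_audio(pcm: bytes, frame_ms: int = 20,
                sample_rate: int = 16000, bytes_per_sample: int = 2) -> list[bytes]:
    """Split raw PCM into frame-sized chunks to send as binary WebSocket messages."""
    frame_bytes = sample_rate * bytes_per_sample * frame_ms // 1000
    return [pcm[i:i + frame_bytes] for i in range(0, len(pcm), frame_bytes)]
```

A client would send each chunk as a binary WebSocket message, emit `control_message("KeepAlive")` during long silences, and finish with `control_message("CloseStream")` to trigger final processing of any remaining audio.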