Livestream Speaker Diarization not distinguishing different speakers consistently #283

ali-rafiei · 2023-07-19T22:17:21Z

ali-rafiei
Jul 19, 2023

Which Deepgram product are you using?

Deepgram API

Details

I've been testing Deepgram's Livestream Speaker Diarization, and I've noticed that it's struggling to distinguish different speakers, especially when the tones of the different speakers are similar. I can't seem to find a fix, and I'm not sure if it's something on my end that I'm doing wrong or on Deepgram's end. I'm using test_suite.py (https://github.com/deepgram/streaming-test-suite) with some added code to make each speaker more visible when printed.

the following is a transcription attempt between two people in the beginning of this sample YouTube video (Daily English Conversation Practice)

If you are making a request to the Deepgram API, what is the full Deepgram URL you are making a request to?

wss://api.deepgram.com/v1/listen?tier=enhanced&model=meeting&punctuate=true&diarize=true

If you are making a request to the Deepgram API and have a request ID, please paste it below:

No response

If possible, please attach your code or paste it into the text box.

import pyaudio
import argparse
import asyncio
import aiohttp
import json
import os
import sys
import wave
import websockets

from datetime import datetime

startTime = datetime.now()

all_mic_data = []
all_transcripts = []

speaker = 0     # NEW CODE INSERTED HERE
speaker_transcript = ""
# ---------------------------------------

FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000
CHUNK = 8000

audio_queue = asyncio.Queue()

# Mimic sending a real-time stream by sending this many seconds of audio at a time.
# Used for file "streaming" only.
REALTIME_RESOLUTION = 0.250

subtitle_line_counter = 0


def subtitle_time_formatter(seconds, separator):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds - int(seconds)) * 1000)
    return f"{hours:02}:{minutes:02}:{secs:02}{separator}{millis:03}"


def subtitle_formatter(response, format):
    global subtitle_line_counter
    subtitle_line_counter += 1

    start = response["start"]
    end = start + response["duration"]
    transcript = response.get("channel", {}).get("alternatives", [{}])[0].get("transcript", "")

    separator = "," if format == "srt" else '.'
    prefix = "- " if format == "vtt" else ""
    subtitle_string = (
        f"{subtitle_line_counter}\n"
        f"{subtitle_time_formatter(start, separator)} --> "
        f"{subtitle_time_formatter(end, separator)}\n"
        f"{prefix}{transcript}\n\n"
    )

    return subtitle_string


# Used for microphone streaming only.
def mic_callback(input_data, frame_count, time_info, status_flag):
    audio_queue.put_nowait(input_data)
    return (input_data, pyaudio.paContinue)


async def run(key, method, format, **kwargs):
    deepgram_url = f'{kwargs["host"]}/v1/listen?tier=enhanced&model=meeting&punctuate=true&diarize=true'

    if method == "mic":
        deepgram_url += "&encoding=linear16&sample_rate=16000"

    elif method == "wav":
        data = kwargs["data"]
        deepgram_url += f'&channels={kwargs["channels"]}&sample_rate={kwargs["sample_rate"]}&encoding=linear16'

    # Connect to the real-time streaming endpoint, attaching our credentials.
    async with websockets.connect(
        deepgram_url, extra_headers={"Authorization": "Token {}".format(key)}
    ) as ws:
        print(f'ℹ️  Request ID: {ws.response_headers.get("dg-request-id")}')
        print("🟢 (1/5) Successfully opened Deepgram streaming connection")

        async def sender(ws):
            print(
                f'🟢 (2/5) Ready to stream {method if (method == "mic" or method == "url") else kwargs["filepath"]} audio to Deepgram{". Speak into your microphone to transcribe." if method == "mic" else ""}'
            )

            if method == "mic":
                try:
                    while True:
                        mic_data = await audio_queue.get()
                        all_mic_data.append(mic_data)
                        await ws.send(mic_data)
                except websockets.exceptions.ConnectionClosedOK:
                    await ws.send(json.dumps({"type": "CloseStream"}))
                    print(
                        "🟢 (5/5) Successfully closed Deepgram connection, waiting for final transcripts if necessary"
                    )

                except Exception as e:
                    print(f"Error while sending: {str(e)}")
                    raise

            elif method == "url":
                # Listen for the connection to open and send streaming audio from the URL to Deepgram
                async with aiohttp.ClientSession() as session:
                    async with session.get(kwargs["url"]) as audio:
                        while True:
                            remote_url_data = await audio.content.readany()
                            await ws.send(remote_url_data)

                            # If no data is being sent from the live stream, then break out of the loop.
                            if not remote_url_data:
                                break

            elif method == "wav":
                nonlocal data
                # How many bytes are contained in one second of audio?
                byte_rate = (
                    kwargs["sample_width"] * kwargs["sample_rate"] * kwargs["channels"]
                )
                # How many bytes are in `REALTIME_RESOLUTION` seconds of audio?
                chunk_size = int(byte_rate * REALTIME_RESOLUTION)

                try:
                    while len(data):
                        chunk, data = data[:chunk_size], data[chunk_size:]
                        # Mimic real-time by waiting `REALTIME_RESOLUTION` seconds
                        # before the next packet.
                        await asyncio.sleep(REALTIME_RESOLUTION)
                        # Send the data
                        await ws.send(chunk)

                    await ws.send(json.dumps({"type": "CloseStream"}))
                    print(
                        "🟢 (5/5) Successfully closed Deepgram connection, waiting for final transcripts if necessary"
                    )
                except Exception as e:
                    print(f"🔴 ERROR: Something happened while sending, {e}")
                    raise e

            return

        async def receiver(ws):
            """Print out the messages received from the server."""
            first_message = True
            first_transcript = True
            transcript = ""

            async for msg in ws:
                res = json.loads(msg)
                if first_message:
                    print(
                        "🟢 (3/5) Successfully receiving Deepgram messages, waiting for finalized transcription..."
                    )
                    first_message = False
                try:
                    # handle local server messages
                    if res.get("msg"):
                        print(res["msg"])
                    if res.get("is_final"):
                        transcript = (
                            res.get("channel", {})
                            .get("alternatives", [{}])[0]
                            .get("transcript", "")
                        )
                        words = (
                            res.get("channel", {})
                            .get("alternatives", [{}])[0]
                            .get("words", [{}])
                        )
                        if words:
                            if first_transcript:
                                print("🟢 (4/5) Began receiving transcription")
                                # if using webvtt, print out header
                                if format == "vtt":
                                    print("WEBVTT\n")
                                first_transcript = False
                            if format == "vtt" or format == "srt":
                                transcript = subtitle_formatter(res, format)
                            
                            # NEW CODE INSERTED HERE -----------------------------------------------------------
                            first_word = True
                            for word in words:
                                if first_word == True:
                                    first_word = False
                                    speaker = word['speaker']
                                    speaker_transcript = "Speaker "+str(speaker)+": "
                                if speaker == word['speaker']:
                                    speaker_transcript += word["punctuated_word"] + ' ' # could use "word" instead of "punctuated_word" aswell
                                else:
                                    print(speaker_transcript)
                                    all_transcripts.append(speaker_transcript+'\n')
                                    speaker = word['speaker']
                                    speaker_transcript = "Speaker "+str(speaker)+": "+word["punctuated_word"]
                            print(speaker_transcript)
                            all_transcripts.append(speaker_transcript+'\n')
                            # -------------------------------------------------------------------------------
                            

                            # print("Speaker "+str(speaker)+": "+transcript)
                            # all_transcripts.append(transcript)

                        # if using the microphone, close stream if user says "goodbye"
                        if method == "mic" and "goodbye" in transcript.lower():
                            await ws.send(json.dumps({"type": "CloseStream"}))
                            print(
                                "🟢 (5/5) Successfully closed Deepgram connection, waiting for final transcripts if necessary"
                            )

                    # handle end of stream
                    if res.get("created"):
                        # save subtitle data if specified
                        if format == "vtt" or format == "srt":
                            data_dir = os.path.abspath(
                                os.path.join(os.path.curdir, "data")
                            )
                            if not os.path.exists(data_dir):
                                os.makedirs(data_dir)

                            transcript_file_path = os.path.abspath(
                                os.path.join(
                                    data_dir,
                                    f"{startTime.strftime('%Y%m%d%H%M')}.{format}",
                                )
                            )
                            with open(transcript_file_path, "w") as f:
                                f.write("".join(all_transcripts))
                            print(f"🟢 Subtitles saved to {transcript_file_path}")

                            # also save mic data if we were live streaming audio
                            # otherwise the wav file will already be saved to disk
                            if method == "mic":
                                wave_file_path = os.path.abspath(
                                    os.path.join(
                                        data_dir,
                                        f"{startTime.strftime('%Y%m%d%H%M')}.wav",
                                    )
                                )
                                wave_file = wave.open(wave_file_path, "wb")
                                wave_file.setnchannels(CHANNELS)
                                wave_file.setsampwidth(SAMPLE_SIZE)
                                wave_file.setframerate(RATE)
                                wave_file.writeframes(b"".join(all_mic_data))
                                wave_file.close()
                                print(f"🟢 Mic audio saved to {wave_file_path}")

                        print(
                            f'🟢 Request finished with a duration of {res["duration"]} seconds. Exiting!'
                        )
                except KeyError:
                    print(f"🔴 ERROR: Received unexpected API response! {msg}")

        # Set up microphone if streaming from mic
        async def microphone():
            audio = pyaudio.PyAudio()
            stream = audio.open(
                format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK,
                stream_callback=mic_callback,
            )

            stream.start_stream()

            global SAMPLE_SIZE
            SAMPLE_SIZE = audio.get_sample_size(FORMAT)

            while stream.is_active():
                await asyncio.sleep(0.1)

            stream.stop_stream()
            stream.close()

        functions = [
            asyncio.ensure_future(sender(ws)),
            asyncio.ensure_future(receiver(ws)),
        ]

        if method == "mic":
            functions.append(asyncio.ensure_future(microphone()))

        await asyncio.gather(*functions)


def validate_input(input):
    if input.lower().startswith("mic"):
        return input

    elif input.lower().endswith("wav"):
        if os.path.exists(input):
            return input

    elif input.lower().startswith("http"):
        return input

    raise argparse.ArgumentTypeError(
        f'{input} is an invalid input. Please enter the path to a WAV file, a valid stream URL, or "mic" to stream from your microphone.'
    )


def validate_format(format):
    if (
        format.lower() == ("text")
        or format.lower() == ("vtt")
        or format.lower() == ("srt")
    ):
        return format

    raise argparse.ArgumentTypeError(
        f'{format} is invalid. Please enter "text", "vtt", or "srt".'
    )

def validate_dg_host(dg_host):
    if (
        # Check that the host is a websocket URL
        dg_host.startswith("wss://")
        or dg_host.startswith("ws://")
    ):
        # Trim trailing slash if necessary
        if dg_host[-1] == '/':
            return dg_host[:-1]
        return dg_host 

    raise argparse.ArgumentTypeError(
            f'{dg_host} is invalid. Please provide a WebSocket URL in the format "{{wss|ws}}://hostname[:port]".'
    )

def parse_args():
    """Parses the command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Submits data to the real-time streaming endpoint."
    )
    parser.add_argument(
        "-k", "--key", required=True, help="YOUR_DEEPGRAM_API_KEY (authorization)"
    )
    parser.add_argument(
        "-i",
        "--input",
        help='Input to stream to Deepgram. Can be "mic" to stream from your microphone (requires pyaudio), the path to a WAV file, or the URL to a direct audio stream. Defaults to the included file preamble.wav',
        nargs="?",
        const=1,
        default="preamble.wav",
        type=validate_input,
    )
    parser.add_argument(
        "-f",
        "--format",
        help='Format for output. Can be "text" to return plain text, "VTT", or "SRT". If set to VTT or SRT, the audio file and subtitle file will be saved to the data/ directory. Defaults to "text".',
        nargs="?",
        const=1,
        default="text",
        type=validate_format,
    )
    #Parse the host
    parser.add_argument(
        "--host",
        help='Point the test suite at a specific Deepgram URL (useful for on-prem deployments). Takes "{{wss|ws}}://hostname[:port]" as its value. Defaults to "wss://api.deepgram.com".',
        nargs="?",
        const=1,
        default="wss://api.deepgram.com",
        type=validate_dg_host,
    )
    return parser.parse_args()


def main():
    """Entrypoint for the example."""
    # Parse the command-line arguments.
    args = parse_args()
    input = args.input
    format = args.format.lower()
    host = args.host

    try:
        if input.lower().startswith("mic"):
            asyncio.run(run(args.key, "mic", format, host=host))

        elif input.lower().endswith("wav"):
            if os.path.exists(input):
                # Open the audio file.
                with wave.open(input, "rb") as fh:
                    (
                        channels,
                        sample_width,
                        sample_rate,
                        num_samples,
                        _,
                        _,
                    ) = fh.getparams()
                    assert sample_width == 2, "WAV data must be 16-bit."
                    data = fh.readframes(num_samples)
                    asyncio.run(
                        run(
                            args.key,
                            "wav",
                            format,
                            data=data,
                            channels=channels,
                            sample_width=sample_width,
                            sample_rate=sample_rate,
                            filepath=args.input,
                            host=host,
                        )
                    )
            else:
                raise argparse.ArgumentTypeError(
                    f"🔴 {args.input} is not a valid WAV file."
                )

        elif input.lower().startswith("http"):
            asyncio.run(run(args.key, "url", format, url=input, host=host))

        else:
            raise argparse.ArgumentTypeError(
                f'🔴 {input} is an invalid input. Please enter the path to a WAV file, a valid stream URL, or "mic" to stream from your microphone.'
            )

    except websockets.exceptions.InvalidStatusCode as e:
        print(f'🔴 ERROR: Could not connect to Deepgram! {e.headers.get("dg-error")}')
        print(
            f'🔴 Please contact Deepgram Support with request ID {e.headers.get("dg-request-id")}'
        )
        return
    except websockets.exceptions.ConnectionClosedError as e:
        error_description = f"Unknown websocket error."
        print(
            f"🔴 ERROR: Deepgram connection unexpectedly closed with code {e.code} and payload {e.reason}"
        )

        if e.reason == "DATA-0000":
            error_description = "The payload cannot be decoded as audio. It is either not audio data or is a codec unsupported by Deepgram."
        elif e.reason == "NET-0000":
            error_description = "The service has not transmitted a Text frame to the client within the timeout window. This may indicate an issue internally in Deepgram's systems or could be due to Deepgram not receiving enough audio data to transcribe a frame."
        elif e.reason == "NET-0001":
            error_description = "The service has not received a Binary frame from the client within the timeout window. This may indicate an internal issue in Deepgram's systems, the client's systems, or the network connecting them."

        print(f"🔴 {error_description}")
        # TODO: update with link to streaming troubleshooting page once available
        # print(f'🔴 Refer to our troubleshooting suggestions: ')
        print(
            f"🔴 Please contact Deepgram Support with the request ID listed above."
        )
        return

    except websockets.exceptions.ConnectionClosedOK:
        return

    except Exception as e:
        print(f"🔴 ERROR: Something went wrong! {e}")
        return


if __name__ == "__main__":
    sys.exit(main() or 0)

If possible, please attach an example audio file to reproduce the issue.

I used the following YouTube video (Daily English Conversation Practice) on my phone and held it up to my mic to simulate a meeting room with one microphone.
The command I use for the python code above is the following: python3 test_suite.py -k API_KEY_HERE -i mic

Answered by shirgoldbird

Jul 26, 2023

@ali-rafiei I have good news: we have a new live-streaming diarization model currently in development, and we’re looking for beta testers! If you’d be interested in joining the beta program, please email me with your project ID: shir(dot)goldberg(at)deepgram(dot)com.

Any other information you’re willing to share about your diarization usecase and how the feature currently performs for you would be greatly appreciated as well.

View full answer

jjmaldonis · 2023-07-20T14:18:46Z

jjmaldonis
Jul 20, 2023
Maintainer

Hey @ali-rafiei, if a mic is held up to your computer's speaker the audio will be extremely low quality. Most meeting recording software records the audio within the app, even for a multi-person meeting with one mic. When a phone is held up to another speaker, the audio goes into the original microphone that the speakers used, comes out of your computer's speaker, then into your phone's mic. Whereas a meeting room with multiple people only has the original input from the speakers.

Transcribing multiperson meetings is difficult because speakers talk at the same time, are different distances away from the mic, may not be looking at the mic when speaking, and have different speakers all speaking on the same channel. These are difficult problems to overcome - even for a human listening to the conversation - and the problems are different than the "mic -> speaker -> mic" situation that you are using. Every time audio goes into a mic or out of a speaker, the sound wave gets degraded. The better the mic/speaker, the less this happens, but our phone mics and computer speakers are rarely high quality. So your sound wave is being degraded three times, which is likely worse (when performing transcription) than multiple people talking in the same room. Even if the audio sounds okay to your ear, the sound wave itself will be significantly transformed.

This poor sound quality is likely one reason for the poor diarization. In addition, diarization improves the longer the audio. If you have a 30 second audio file, the diarization results will be much worse than for a 30 minute audio file.

The code in our streaming test suite works well so I'm guessing your code is good. It's likely the audio that is resulting in the poor results.

Also, how often is the transcription identifying the speakers incorrectly?

4 replies

ali-rafiei Jul 20, 2023
Author

Yeah, I can confirm that when I replaced my output audio as input audio, it was a lot more consistent, but still inconsistent to a slight degree. My goal objective is to implement this into counselling sessions where they may be using their phones as microphones (since people may not be able to afford good quality microphones). Here's an example of both audio and transcript generated of the type of setting the Livestream Speaker Diarization would be used in (Google Drive Link). Here you can see how often the transcription identifies a speaker incorrectly.

I'm wondering if there's any way to limit the speakers if you know how many speakers there will be in conversation before it starts, as well as, if speakers can record something before they start the live conversation, so the algorithm can learn their voices?

I really appreciate the help so far, thank you!

shirgoldbird Jul 26, 2023

@ali-rafiei I have good news: we have a new live-streaming diarization model currently in development, and we’re looking for beta testers! If you’d be interested in joining the beta program, please email me with your project ID: shir(dot)goldberg(at)deepgram(dot)com.

Any other information you’re willing to share about your diarization usecase and how the feature currently performs for you would be greatly appreciated as well.

Answer selected by ali-rafiei

fahnub Nov 20, 2023

Hey @shirgoldbird I am working on a similar problem and I face a lot of consistency issues with diarization.
Can I also get to try the beta to see if that improves the functionality within my application?
Would be very grateful.

jjmaldonis Nov 20, 2023
Maintainer

Hey @fahnub, we are working on getting these updates into production soon. I'll set a reminder to update this thread when they are live.

fahnub · 2023-11-20T16:19:26Z

fahnub
Nov 20, 2023

Thanks for the reply @jjmaldonis, if you can provide with an ETA that would be super.
And yes, please get back to me on this thread or at itsfahadnadeem(at)gmail(dot)com

5 replies

jjmaldonis Nov 20, 2023
Maintainer

The plan is ~ two weeks from now, but that may change depending on e.g. other deadlines, if there are unexpected changes, or if benchmarking takes longer than expected. I'll reply here so it's public.

fahnub Nov 20, 2023

Thanks a lot, I am following this thread. Looking forward to hear from you :)

sofwanmd13 Nov 2, 2024

Any updates? Was the model launched? Last time I tried it, I was still having issues with accuracy of the diarization so wanted to know if it’s better now.

jjmaldonis Nov 5, 2024
Maintainer

Yes we did release an improvement for diarization last year. We highly recommend sending multichannel audio to Deepgram rather than mono channel, and setting multichannel=true in the query parameters. This will effectively result in perfect diarization as long as the speakers are speaking on different channels. More and more commonly, products offer support for generating multichannel audio, so we recommend enabling that whenever possible.

sofwanmd13 Nov 6, 2024

I apologize if I am wrong, but don't you need 2 different microphones for multichannel audio? And given both people talking are in close proximity and use 2 different mics separately, will deepgram accurately pick it up? since both the microphones would very well receive audio inputs for most of the other person's input as well. Would love some insight into this

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deepgram

Livestream Speaker Diarization not distinguishing different speakers consistently #283

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 9 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Livestream Speaker Diarization not distinguishing different speakers consistently #283

Which Deepgram product are you using?

Details

If you are making a request to the Deepgram API, what is the full Deepgram URL you are making a request to?

If you are making a request to the Deepgram API and have a request ID, please paste it below:

If possible, please attach your code or paste it into the text box.

If possible, please attach an example audio file to reproduce the issue.

Replies: 2 comments · 9 replies

jjmaldonis Jul 20, 2023 Maintainer

ali-rafiei Jul 20, 2023 Author

jjmaldonis Nov 20, 2023 Maintainer

jjmaldonis Nov 20, 2023 Maintainer

jjmaldonis Nov 5, 2024 Maintainer

Replies: 2 comments 9 replies

jjmaldonis
Jul 20, 2023
Maintainer

ali-rafiei Jul 20, 2023
Author

jjmaldonis Nov 20, 2023
Maintainer

jjmaldonis Nov 20, 2023
Maintainer

jjmaldonis Nov 5, 2024
Maintainer