Combining Transcription with Diarization (speaker identification) #99

MustafaCQN · 2023-03-31T11:21:14Z

Hi everyone!

I am working on a project that takes an audio file and transcribes the meeting.

The problem I am facing is that I need to diarize the speakers and label them with their names. I was using the Pyannote package to identify speakers, but transcription and diarization use different models, which produce different timecodes. Because the timecodes differ, I am not able to match the transcription with the speaker names.

Does anybody know how I can work around this problem, or is there a product/model/method that I can use both for transcribing and for matching the text with the speaker names using timecodes?

Left: speaker identification with timecodes (pyannote). Right: transcription with timecodes (faster-whisper).

Thank you!

apzl · 2023-04-04T11:44:09Z

Have you checked out WhisperX?

4 replies

ciekawy Apr 13, 2023

It's slower. Can faster-whisper be used within WhisperX?

landemou Apr 13, 2023

It would be great to have diarization in faster-whisper, but it is surely very hard to set up!

MustafaCQN Apr 13, 2023
Author

> Have you checked out WhisperX?

I will check and reply here once I see the results. Right now I am using a separate thread to detect who is speaking in Google Meet. But as you can guess, that only solves the problem on one platform, and it can't be used for cases like phone recordings of in-person meetings.

> it would be great to have the diarization on faster-whisper but surely very hard to set up !

I would love it if faster-whisper released this! I was using Pyannote, and then manual scraping from Google Meet, but in the end both gave me different timecodes than faster-whisper, so I had to combine them to get an understandable diarization. I solved this by widening the timecodes in both lists so that each segment ends where the next one begins.
Example:
[5.0 -> 10.0] hi
[12.0 -> 14.0] how are you

becomes
[5.0 -> 12.0] hi
[12.0 -> 14.0] how are you

Then I take the midpoint of each transcription segment's start and end time and look up the speaker name at that time in the diarization list.
This gives a diarization with approx. 80% accuracy, using either 2 different models or 1 model and 1 automation thread.
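
The widening and midpoint lookup described above can be sketched roughly as follows. This is a minimal illustration, not the actual code from the project: the tuple layouts and the function names `widen`, `speaker_at`, and `assign_speakers` are assumptions made for the example, and the diarization turns are assumed to be already available as plain `(start, end, speaker)` tuples.

```python
from typing import List, Optional, Tuple

# Assumed layouts: (start_sec, end_sec, label), where label is the
# transcribed text for a segment or the speaker name for a turn.
Span = Tuple[float, float, str]


def widen(spans: List[Span]) -> List[Span]:
    """Stretch each span's end to the start of the next span.

    Gaps between consecutive spans are closed; the last span is unchanged.
    Spans that already reach or overlap the next one keep their end time.
    """
    widened = []
    for i, (start, end, label) in enumerate(spans):
        if i + 1 < len(spans):
            end = max(end, spans[i + 1][0])
        widened.append((start, end, label))
    return widened


def speaker_at(turns: List[Span], t: float) -> Optional[str]:
    """Return the speaker whose turn contains time t, or None."""
    for start, end, speaker in turns:
        if start <= t < end:
            return speaker
    return None


def assign_speakers(
    segments: List[Span], turns: List[Span]
) -> List[Tuple[float, float, Optional[str], str]]:
    """Label each transcription segment with the speaker at its midpoint."""
    turns = widen(turns)
    labelled = []
    for start, end, text in widen(segments):
        midpoint = (start + end) / 2
        labelled.append((start, end, speaker_at(turns, midpoint), text))
    return labelled


if __name__ == "__main__":
    segments = [(5.0, 10.0, "hi"), (12.0, 14.0, "how are you")]
    turns = [(4.5, 11.0, "Alice"), (11.5, 15.0, "Bob")]  # made-up example data
    for row in assign_speakers(segments, turns):
        print(row)
```

A segment whose midpoint falls outside every turn gets `None` as its speaker, so those cases can be reviewed by hand rather than silently mislabelled.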

ciekawy Apr 13, 2023

Shouldn't it be the same combination of pyannote and Whisper as in the WhisperX / pyannote-whisper projects?

NavodPeiris · 2024-01-23T12:36:05Z

Check out this repo: https://github.com/Navodplayer1/speechlib
It uses pyannote diarization to segment the audio according to the start and end times of each speaker turn, then applies faster-whisper transcription to each segment. Finally, it outputs a transcript with timing from the pyannote diarization and transcribed text from faster-whisper.

You will get accurate timing.

You can also do speaker recognition if you provide voices_folder. Then the transcription will contain actual speaker names!
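
The diarize-first, transcribe-per-segment idea can be sketched as below. This is only an illustration of the general approach, not speechlib's actual code or API: `transcribe_by_turn` is a hypothetical helper, the audio is assumed to be an in-memory sequence of samples, and `transcribe` is a placeholder callable that you would back with your own speech-to-text call (for example, a wrapper around faster-whisper).

```python
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[float, float, str]  # assumed layout: (start_sec, end_sec, speaker)


def transcribe_by_turn(
    samples: Sequence[float],
    sample_rate: int,
    turns: List[Turn],
    transcribe: Callable[[Sequence[float]], str],
) -> List[Tuple[float, float, str, str]]:
    """Cut the audio at each diarization turn and transcribe each piece.

    Timing and speaker come from the diarization turns, so there is no
    second set of timecodes that needs to be reconciled afterwards.
    """
    results = []
    for start, end, speaker in turns:
        # Convert seconds to sample indices and clamp to the audio length.
        lo = max(0, int(start * sample_rate))
        hi = min(len(samples), int(end * sample_rate))
        if hi <= lo:
            continue  # turn lies outside the audio or is empty
        text = transcribe(samples[lo:hi]).strip()
        if text:
            results.append((start, end, speaker, text))
    return results


if __name__ == "__main__":
    # Stand-in transcriber for demonstration: reports the chunk length.
    def fake_transcribe(chunk: Sequence[float]) -> str:
        return f"{len(chunk)} samples"

    audio = [0.0] * 10  # 10 seconds at 1 Hz, made-up data
    turns = [(0.0, 4.0, "A"), (4.0, 10.0, "B")]
    print(transcribe_by_turn(audio, 1, turns, fake_transcribe))
```

One trade-off of this design is that very short turns give the transcription model little context, so accuracy on brief interjections may be lower than when transcribing the full recording at once.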

1 reply

RustX2802 May 17, 2024

Hi @NavodPeiris, is it possible to apply diarization with speechlib to real-time transcription? Have you tried this?
