[Unexpected Performance Drop] Using 44.1K sample_rate vs. default 16K leads to better performance in `pyannote/speaker-diarization-3.1` #1755

ai-nikolai · 2024-09-05T13:32:19Z

Tested versions

Reproducible in 3.1, 3.3

System information

M2 Pro, 3.3

Issue description

Sample rate mismatch:

Using the pre-trained pipeline pyannote/speaker-diarization-3.1, which has a default sample rate of 16K, we get vastly different segmentation results on simple audio files (e.g. from YouTube) depending on the sample rate at which the audio is loaded, with 44.1K performing better.

Questions:

What are the default sample rates based on?
How does the up/down-sampling work concretely (i.e. what could be the reason for such different behaviour)? How does the pipeline work if one passes a sample rate of 44.1K?
Would this depend on the original encoding quality?

@hbredin

Minimal reproduction example (MRE)

N/A

ai-nikolai · 2024-09-05T13:32:54Z

@hbredin

ai-nikolai · 2024-09-06T07:42:04Z

Thank you for responding @hbredin. I will try to add a minimal reproducible script in the coming days. In the meantime, however, I have a quick question.

Basically, what could be the reason for a big difference between loading audio files at 44.1K vs. 16K and passing them as waveforms to pyannote/speaker-diarization-3.1?

qalabeabbas49 · 2024-09-06T07:47:56Z

As far as I know, pyannote will convert any audio to mono 16 kHz.
In my experience, audio files recorded at a higher sample rate (44.1 kHz) generally perform well, simply because they carry more information even after downsampling to 16 kHz,
while a file recorded at 16 kHz has less information.
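
For illustration, here is a minimal sketch of what "convert to mono 16 kHz" involves: averaging the channels, then changing the rate by the ratio 16000/44100. This is not pyannote's actual code; `downmix_to_mono`, `resample_plan`, and the sample values are hypothetical names and data chosen for the example, and the sketch only describes the conversion rather than performing the resampling itself.

```python
from fractions import Fraction


def downmix_to_mono(channels):
    """Average equally long channels sample by sample."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]


def resample_plan(src_rate, dst_rate, n_samples):
    """Describe a sample-rate conversion without performing it."""
    ratio = Fraction(dst_rate, src_rate)  # 16000/44100 reduces to 160/441
    return {
        "ratio": ratio,
        # Number of output samples for n_samples input samples.
        "n_out": n_samples * ratio.numerator // ratio.denominator,
        # Highest frequency representable at the target rate.
        "nyquist_hz": dst_rate / 2,
    }


# Hypothetical 2-channel clip with three samples per channel.
stereo = [[0.5, 0.25, -1.0], [0.0, 0.25, 0.5]]
mono = downmix_to_mono(stereo)

# One second of 44.1K audio converted to 16K.
plan = resample_plan(44100, 16000, n_samples=44100)

print(mono)
print(plan)
```

Whatever the source rate, a 16 kHz signal can only represent content below 8 kHz, so any benefit of a higher source rate has to come from how the conversion is done rather than from extra bandwidth in the final signal.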

ai-nikolai · 2024-09-06T09:24:28Z

Thank you, qalabeabbas49. I guess what I find interesting is that the audio file is the same in both cases (whether it was originally 44.1K or originally 16K). And then:
The original file gets loaded at either 44.1K or 16K, and then pyannote converts it to 16K (as you said). What makes the difference is loading the file at 44.1K, not whether the file was originally 44.1K.

(Loading via ffmpeg -ar 16000 or ffmpeg -ar 44100.)
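
One possible contributor (an assumption, not something confirmed in this thread) is that the two paths use different resamplers: with `-ar 16000` ffmpeg does the conversion, while with `-ar 44100` the conversion happens later, inside the pipeline. Two resamplers can produce noticeably different 16 kHz signals from the same input. The pure-Python sketch below contrasts a naive linear-interpolation resampler with a windowed-sinc low-pass resampler on a test tone. Neither function is ffmpeg's or pyannote's implementation; all names and parameters are hypothetical.

```python
import math


def tone(freq_hz, sample_rate, n_samples):
    """Unit-amplitude sine wave."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def resample_linear(x, src_rate, dst_rate):
    """Naive resampling: linear interpolation, no anti-aliasing filter."""
    n_out = len(x) * dst_rate // src_rate
    out = []
    for m in range(n_out):
        t = m * src_rate / dst_rate          # position in source samples
        i = int(t)
        frac = t - i
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        out.append((1 - frac) * x[i] + frac * nxt)
    return out


def resample_sinc(x, src_rate, dst_rate, half_width=32):
    """Hann-windowed sinc resampling with cutoff at the lower Nyquist."""
    fc = min(src_rate, dst_rate) / 2 / src_rate  # cutoff, cycles per source sample
    n_out = len(x) * dst_rate // src_rate
    out = []
    for m in range(n_out):
        t = m * src_rate / dst_rate
        centre = int(t)
        acc = 0.0
        for k in range(centre - half_width, centre + half_width + 1):
            if 0 <= k < len(x):
                d = t - k
                arg = 2 * fc * d
                sinc = 1.0 if arg == 0 else math.sin(math.pi * arg) / (math.pi * arg)
                hann = 0.5 * (1 + math.cos(math.pi * d / (half_width + 1)))
                acc += x[k] * 2 * fc * sinc * hann
        out.append(acc)
    return out


def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))


def compare(freq_hz, src_rate=44100, dst_rate=16000, n_samples=4410):
    """RMS of a resampled tone for both methods, ignoring filter edge effects."""
    x = tone(freq_hz, src_rate, n_samples)
    edge = 20  # output samples dropped at each end
    return (rms(resample_linear(x, src_rate, dst_rate)[edge:-edge]),
            rms(resample_sinc(x, src_rate, dst_rate)[edge:-edge]))


# 1 kHz is below the 8 kHz Nyquist of the target rate; 12 kHz is above it.
for f in (1000, 12000):
    lin, snc = compare(f)
    print(f"{f} Hz tone: linear rms={lin:.3f}, windowed-sinc rms={snc:.3f}")
```

In this sketch the 1 kHz tone survives both resamplers almost unchanged, while the 12 kHz tone is largely removed by the filtered resampler but folds back into the audible band as aliasing with the naive one. Comparing the two 16 kHz waveforms that actually reach the model would show whether something similar happens here.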

lockmatrix · 2024-10-27T05:22:36Z

Same for me: 44.1K performs better.

hbredin added the cannot_reproduce label Sep 5, 2024
