[Unexpected Performance Drop] Using 44.1K sample_rate vs. default 16K leads to better performance in pyannote/speaker-diarization-3.1 #1755

Open
ai-nikolai opened this issue Sep 5, 2024 · 5 comments

Comments

@ai-nikolai

ai-nikolai commented Sep 5, 2024

Tested versions

Reproducible in 3.1, 3.3

System information

M2 Pro, 3.3

Issue description

Sample rate mismatch:

  • Using the pre-trained pipeline pyannote/speaker-diarization-3.1,
  • which has a default sample rate of 16K,
  • we get vastly different segmentation performance on simple audio files (e.g. from YouTube), with 44.1K performing better.

Question:

  • What are the default sample rates based on?
  • How does the up- / down-sampling work concretely? (I.e. what could be the reason for such different behaviour?) How does the pipeline work if one passes a sample rate of 44.1K?
  • Would this depend on the original encoding quality?
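For the second question, the arithmetic alone is informative. The sketch below assumes nothing about pyannote's internals. It only shows that going from 44.1K to 16K is a non-integer ratio, so a resampler has to interpolate between input samples rather than simply drop some of them. The helper names `resample_ratio` and `resampled_length` are hypothetical and exist only for this illustration.

```python
from fractions import Fraction
import math


def resample_ratio(src_rate: int, dst_rate: int) -> Fraction:
    """Return dst_rate / src_rate as a reduced fraction."""
    return Fraction(dst_rate, src_rate)


def resampled_length(num_samples: int, src_rate: int, dst_rate: int) -> int:
    """Number of output samples covering the same duration (rounded up)."""
    return math.ceil(num_samples * resample_ratio(src_rate, dst_rate))


ratio = resample_ratio(44_100, 16_000)
print(ratio)  # 160/441
print(resampled_length(44_100, 44_100, 16_000))  # 16000
```

Every 441 input samples map to 160 output samples, so most output samples fall between two input samples and their values depend on the interpolation filter the resampler uses.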

@hbredin

Minimal reproduction example (MRE)

N/A

@ai-nikolai
Author

@hbredin

@ai-nikolai
Author

Thank you for responding, @hbredin. I will try to add a minimal reproducible script in the coming days. In the meantime, I have a quick question:

  • What could be the reason for a big difference between loading audio files at 44.1K vs. 16K and passing them as waveforms to pyannote/speaker-diarization-3.1?

@qalabeabbas49

As far as I know, pyannote will convert any audio into mono 16 kHz.
In my experience, audio files recorded at a higher sample rate (44.1 kHz) generally perform better, because they retain more information even after downsampling to 16 kHz,
while a file recorded at 16 kHz has less information to begin with.
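To make "mono 16 kHz" concrete, here is a minimal sketch. It assumes the downmix is a plain average over channels, which is an assumption for illustration and not a statement about pyannote's code. It also computes the Nyquist limit, the highest frequency a given sample rate can represent. The function names are hypothetical.

```python
def downmix_to_mono(channels: list[list[float]]) -> list[float]:
    """Average equally long channels sample by sample (assumed downmix rule)."""
    num_channels = len(channels)
    return [sum(frame) / num_channels for frame in zip(*channels)]


def nyquist_hz(sample_rate: int) -> float:
    """Highest frequency representable at the given sample rate."""
    return sample_rate / 2


stereo = [[1.0, 3.0, -1.0], [3.0, 5.0, 1.0]]
print(downmix_to_mono(stereo))  # [2.0, 4.0, 0.0]
print(nyquist_hz(16_000))       # 8000.0
print(nyquist_hz(44_100))       # 22050.0
```

Whatever rate a file was recorded at, a 16 kHz signal carries content only up to 8 kHz, so any extra information in a 44.1 kHz recording has to survive the resampling step to matter.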

@ai-nikolai
Author

ai-nikolai commented Sep 6, 2024

Thank you, qalabeabbas49. What I find interesting is that the audio file is the same in both cases, whether it was originally 44.1K or originally 16K.
The original file gets loaded at either 44.1K or 16K, and then pyannote converts it to 16K (as you said). What makes the difference is loading the file at 44.1K, not whether the file was originally 44.1K.

(Loading via ffmpeg -ar 16000 or ffmpeg -ar 44100.)
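One hypothesis, not confirmed anywhere in this thread, is that the two paths reach 16K through different resamplers, and different resamplers need not produce identical samples. The sketch below does not reproduce ffmpeg's or pyannote's resampling. It uses a deliberately crude linear interpolator and a moving-average low-pass filter, both chosen only for illustration, to show that two routes from 44.1K to 16K can yield clearly different signals.

```python
import math


def linear_resample(x, src_rate, dst_rate):
    """Resample by linear interpolation, with no anti-aliasing filter."""
    n_out = int(len(x) * dst_rate / src_rate)
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate  # fractional index into x
        i = int(pos)
        frac = pos - i
        nxt = x[min(i + 1, len(x) - 1)]
        out.append((1 - frac) * x[i] + frac * nxt)
    return out


def moving_average(x, width):
    """Crude low-pass filter: mean of the last `width` samples."""
    out = []
    for n in range(len(x)):
        window = x[max(0, n - width + 1): n + 1]
        out.append(sum(window) / len(window))
    return out


def rms(x):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


SRC, DST = 44_100, 16_000
# 0.1 s of a 10 kHz tone, which lies above the 8 kHz Nyquist limit of 16 kHz.
tone = [math.sin(2 * math.pi * 10_000 * n / SRC) for n in range(4_410)]

route_a = linear_resample(tone, SRC, DST)                     # no filtering
route_b = linear_resample(moving_average(tone, 4), SRC, DST)  # filter first

print(len(route_a), len(route_b))
print(rms(route_a), rms(route_b))
```

Route A keeps most of the 10 kHz tone as aliased energy below 8 kHz, while route B attenuates it before resampling. Real resamplers use far better filters than a moving average, so the differences between them would be smaller, but a reproduction script that saves both 16K waveforms and compares them would show whether this is what happens here.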

@lockmatrix

Same for me: 44.1K performs better.

4 participants