
Speakers with similar pitch are difficult to distinguish #1712

Open
ChristianNSchmitz opened this issue May 17, 2024 · 3 comments

Comments


ChristianNSchmitz commented May 17, 2024

Tested versions

3.1

System information

Ubuntu 24.04, pyannote 3.1.1

Issue description

Dear pyannote team, we are using pyannote speaker segmentation 3.1.1 to distinguish speakers in dialogues of 2-3 people for further analysis. However, when speakers have similar pitch (e.g. two men with deep voices), pyannote often misclassifies them, even though the human ear distinguishes the two speakers easily, so the pitch differences must be only slight. I would therefore like to ask whether you have tips on parameter settings or preprocessing to improve the classification. Thanks a lot!

Minimal reproduction example (MRE)

Can be provided if necessary
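Since the question asks about preprocessing: below is a minimal, generic sketch of input cleanup (DC-offset removal and RMS loudness normalization) in plain NumPy. This is not a pyannote-specific recipe, and whether it actually helps diarization on a given recording is an empirical question; the `preprocess` function and `target_rms` value are illustrative choices, not library defaults.

```python
import numpy as np

def preprocess(audio: np.ndarray, target_rms: float = 0.1) -> np.ndarray:
    """Remove DC offset and scale the waveform to a target RMS loudness.

    Generic cleanup sometimes applied before embedding-based diarization;
    not a pyannote API, just a plain-NumPy sketch.
    """
    audio = audio - audio.mean()            # remove DC offset
    rms = np.sqrt(np.mean(audio ** 2))
    if rms > 0:
        audio = audio * (target_rms / rms)  # normalize loudness
    return audio

# Example: a quiet 440 Hz tone with a large DC offset (1 s at 16 kHz)
t = np.linspace(0, 1, 16000, endpoint=False)
x = 0.01 * np.sin(2 * np.pi * 440 * t) + 0.5
y = preprocess(x)
```

Normalizing both channels/recordings to the same loudness removes one trivial cue the model might latch onto, so any remaining confusion is down to the voices themselves.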

ChristianNSchmitz changed the title from "Speakers with similar voices are difficult to distinguish" to "Speakers with similar pitch are difficult to distinguish" on May 17, 2024
metal3d (Contributor) commented Jun 26, 2024

I've got exactly the same problem. Three men are speaking and I force num_speakers to 3, but the model matches only one voice about 90% of the time.

That means I cannot, at this time, use pyannote for what I want to do 😢

hbredin (Member) commented Jun 26, 2024

As with any machine learning approach, a train/test domain mismatch is usually the culprit.

Fine-tuning internal models and pipelines to your use case data is usually the best solution.

Did you try alternative speaker diarization tools? Do they perform better? I’d love to have a look at those files where it does not work.

lockmatrix commented

I encountered the same issue.

I suggest trying to optimize the synthesis method for mixed training data. For example:

- Construct mixtures from speakers with similar pronunciation, e.g. by selecting voices with similar speaker embeddings.
- Introduce diversity in volume across voices, e.g. mixing at both equal and varying volumes.
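The two suggestions above can be sketched in plain NumPy. Note that `pick_similar_pair` and `mix_with_random_gains` are illustrative helper names (not pyannote APIs), and the embeddings and waveforms here are random placeholders standing in for real speaker embeddings and audio.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_similar_pair(embeddings: np.ndarray) -> tuple:
    """Return indices of the two most similar speaker embeddings,
    i.e. the pair hardest to tell apart and most useful to mix."""
    n = len(embeddings)
    best, pair = -1.0, (0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine_similarity(embeddings[i], embeddings[j])
            if s > best:
                best, pair = s, (i, j)
    return pair

def mix_with_random_gains(a: np.ndarray, b: np.ndarray,
                          low: float = 0.5, high: float = 1.0) -> np.ndarray:
    """Mix two waveforms with independently sampled gains,
    so training sees both balanced and imbalanced volumes."""
    ga, gb = rng.uniform(low, high, size=2)
    return ga * a + gb * b

# Hypothetical setup: 4 speakers with 8-dim embeddings, 1 s of audio at 16 kHz
emb = rng.normal(size=(4, 8))
i, j = pick_similar_pair(emb)
wave_a = rng.normal(size=16000)
wave_b = rng.normal(size=16000)
mixture = mix_with_random_gains(wave_a, wave_b)
```

In a real pipeline the embeddings would come from a speaker-embedding model and the waveforms from the training corpus; the point is only to bias the synthesized mixtures toward confusable speaker pairs and varied loudness ratios.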
