prosody of sample output. fmin, fmax. Glowtts #1091
Unanswered
michaellin99999
asked this question in
General Q&A
-
What I mean by monotone is that in the dataset the speaker's intonation rises and falls, but the output of GlowTTS + Griffin-Lim completely lacks those rises and falls.
-
Do fmin and fmax in the audio parameters have an effect on the prosody of the generated sample?
I have them set to 220 and 3200 Hz, and the sample output sounds monotone. I trained GlowTTS for 700k steps on 20 hours of audio from a female speaker and used Griffin-Lim to produce the sample.
We are trying to use GlowTTS to synthesize a speaker's voice. Our dataset consists of 22 hours of female speech, and we made sure the audio is of good quality (I have attached some dataset sentences; they are the files with encoded names).
However, when we trained GlowTTS from scratch and ran inference with Griffin-Lim at 700k steps, the output sounded incredibly monotone and robotic (sample 1 and sample 2). We do not know why this happened.
When we trained GlowTTS, we made the following adjustment: mel_fmin set to 220 and mel_fmax set to 3200. We found these values to produce the best audio quality using the Coqui audio parameter testing environment. Our config is listed below.
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:220
| > mel_fmax:3200
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:../data/scale_stats.npy
| > base:10
| > hop_length:256
| > win_length:1024
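For reference, a few quantities follow directly from the values in this config. The sketch below is plain arithmetic on those values and assumes nothing about how the library uses them internally; the variable names are only illustrative.

```python
import math

# Values copied from the config above.
sample_rate = 22050
fft_size = 1024
hop_length = 256
win_length = 1024
mel_fmin = 220
mel_fmax = 3200

# Time between successive frames and length of each analysis window.
frame_shift_ms = 1000.0 * hop_length / sample_rate
frame_length_ms = 1000.0 * win_length / sample_rate

# Frequency spacing between neighbouring FFT bins.
bin_width_hz = sample_rate / fft_size

# FFT bins whose centre frequency falls inside [mel_fmin, mel_fmax].
first_bin = math.ceil(mel_fmin / bin_width_hz)
last_bin = math.floor(mel_fmax / bin_width_hz)
bins_in_range = last_bin - first_bin + 1
total_bins = fft_size // 2 + 1

print(f"frame shift:  {frame_shift_ms:.2f} ms")
print(f"frame length: {frame_length_ms:.2f} ms")
print(f"FFT bin width: {bin_width_hz:.2f} Hz")
print(f"bins inside mel range: {bins_in_range} of {total_bins}")
```

With these settings, only the FFT bins between roughly 237 Hz and 3187 Hz contribute to the mel features.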
I am really stuck on what could have caused this problem. Is there an explanation for it? Or is it because we set mel_fmin and mel_fmax too low?
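To make the question concrete, here is a minimal sketch of which frequencies 80 mel bands cover when they are spaced between mel_fmin and mel_fmax. It assumes the HTK mel formula for simplicity; the mel filterbank actually used in training may use a different mel scale variant, so the exact numbers may differ. The helper names (`hz_to_mel`, `mel_to_hz`, `mel_band_edges`) are hypothetical and not part of any library.

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to mel (HTK formula, an assumption of this sketch)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(fmin, fmax, num_mels):
    """Return num_mels + 2 edge frequencies, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (num_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_mels + 2)]

# Values from the config above: num_mels=80, mel_fmin=220, mel_fmax=3200.
edges = mel_band_edges(220.0, 3200.0, 80)
centers = edges[1:-1]  # each band peaks at an interior edge point

print(f"lowest band centre:  {centers[0]:.1f} Hz")
print(f"highest band centre: {centers[-1]:.1f} Hz")
```

Energy below mel_fmin or above mel_fmax is not represented in the mel spectrogram at all. This does not by itself show that the chosen range causes the monotone output; it only shows what the model and Griffin-Lim can and cannot see.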
dataset audio clips.zip
GlowTTS Output samples with GriffinLim.zip