Our code is primarily built on TANGO (https://github.com/declare-lab/tango) and Diffsound (https://github.com/yangdongchao/Text-to-sound-Synthesis).
To download the AudioCaps dataset, run (the download code is from https://github.com/MorenoLaQuatra/audiocaps-download):

```bash
python download_data.py
```
To train TANGO with FLAN-T5 on the AudioCaps dataset, run:

```bash
cd tango-master
python train.py \
  --text_encoder_name="google/flan-t5-large" \
  --scheduler_name="stabilityai/stable-diffusion-2-1" \
  --unet_model_config="configs/diffusion_model_config.json" \
  --freeze_text_encoder --augment --snr_gamma 5
```
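The `--snr_gamma 5` flag most likely corresponds to Min-SNR loss weighting (Hang et al., 2023), which clips each timestep's signal-to-noise ratio at a ceiling `gamma` so that easy, low-noise timesteps do not dominate the diffusion loss. A minimal sketch of that weighting, assuming access to the scheduler's cumulative alpha schedule (the function name `min_snr_weight` is ours, not from the repo):

```python
import numpy as np

def min_snr_weight(alphas_cumprod, timesteps, gamma=5.0):
    """Min-SNR loss weights (a sketch, not TANGO's exact code).

    snr_t = alpha_bar_t / (1 - alpha_bar_t)
    weight_t = min(snr_t, gamma) / snr_t

    Timesteps with SNR above gamma (early, nearly-clean samples) are
    down-weighted; timesteps with SNR below gamma keep weight 1.
    """
    alpha_bar = alphas_cumprod[timesteps]
    snr = alpha_bar / (1.0 - alpha_bar)
    return np.minimum(snr, gamma) / snr
```

In training, these weights would multiply the per-example MSE between predicted and true noise before averaging over the batch.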
To train TANGO with CLIP on the AudioCaps dataset, run:

```bash
cd tango-master
python train.py \
  --text_encoder_name="stabilityai/stable-diffusion-2-1" \
  --scheduler_name="stabilityai/stable-diffusion-2-1" \
  --unet_model_config="configs/diffusion_model_config.json" \
  --freeze_text_encoder --augment --snr_gamma 5
```
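The `--augment` flag in TANGO reportedly mixes pairs of training examples: two waveforms are overlaid and their captions joined, giving the model synthetic multi-event audio. A hypothetical sketch of that idea (the function name, the mixing-weight range, and the `" and "` joiner are our assumptions, not TANGO's exact implementation):

```python
import numpy as np

def mix_examples(wave_a, caption_a, wave_b, caption_b, rng=None):
    """Hypothetical audio-text mixup (a sketch of the augmentation idea).

    Overlays two waveforms with a random relative gain and joins their
    captions, producing one composite training pair.
    """
    rng = rng if rng is not None else np.random.default_rng()
    w = rng.uniform(0.3, 0.7)          # random mixing weight (assumed range)
    n = min(len(wave_a), len(wave_b))  # truncate to the shorter clip
    mixed = w * wave_a[:n] + (1.0 - w) * wave_b[:n]
    return mixed, caption_a + " and " + caption_b
```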
To run inference on the AudioCaps dataset with Diffsound, run:

```bash
cd Diffsound
python evaluation/generate_samples_batch.py
```
To run inference on the AudioCaps dataset with TANGO, run:

```bash
cd tango-master
python inference.py \
  --original_args="saved/*/summary.jsonl" \
  --model="saved/*/best/pytorch_model_2.bin"
```
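The `saved/*/...` wildcards above point at per-run output directories. A small sketch of how such paths could be expanded to locate each run's training args and best checkpoint (the directory layout is assumed from the flags above; the helper name is ours):

```python
import glob
import json
import os

def resolve_run_artifacts(save_root="saved"):
    """Expand the wildcard layout saved/<run>/summary.jsonl and
    saved/<run>/best/pytorch_model_2.bin into concrete paths.

    Returns one dict per run containing its directory, the parsed
    summary.jsonl records, and the checkpoint path.
    """
    runs = []
    for summary in sorted(glob.glob(os.path.join(save_root, "*", "summary.jsonl"))):
        run_dir = os.path.dirname(summary)
        ckpt = os.path.join(run_dir, "best", "pytorch_model_2.bin")
        if not os.path.exists(ckpt):
            continue  # skip runs without a saved best checkpoint
        with open(summary) as f:
            records = [json.loads(line) for line in f if line.strip()]
        runs.append({"run_dir": run_dir, "args": records, "checkpoint": ckpt})
    return runs
```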