Requirements

Computation

We ran this program on two GPUs: a 1050 Mobile and a Tesla V100. We did not run formal benchmarks, but the V100 was roughly 400x faster. Runtime also depends on how much data you download, so any server-grade GPU should be sufficient.
Storage

This program generates a lot of files (downloaded and otherwise). Each audio file is about 96 KiB in size. For 7k unique audio clips with a 70/30 train/validation split, it occupied ~120 GiB of storage space. Plan for at least 1 TB if you download more audio clips.
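Before starting a large download, it may help to confirm that the target drive has enough room. The following is a minimal sketch using only the standard library; the `has_free_space` helper is illustrative and not part of this repository, and its 1 TB default simply mirrors the figure above.

```python
import shutil


def has_free_space(path=".", required_bytes=10**12):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` free (default: 1 TB, an assumed threshold)."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes


# Example: warn instead of failing hard.
if not has_free_space("."):
    print("Warning: less than 1 TB free on this drive")
```

Pass a smaller `required_bytes` if you only plan to download a subset of the dataset.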
Memory

A minimum of 4 GB of VRAM is required, which can handle a batch size of 2. At a batch size of 20 on two GPUs, it occupied 16 GiB of VRAM on each GPU.

Setup

If you are using Docker, run the following inside the container:

./setup.sh && ./install.sh

Otherwise

Set up the directory structure
```
./setup.sh
```
Install dependencies
```
pip install -r requirements.txt
```
Additional dependencies:

i. ffmpeg
ii. libav-tools
iii. youtube-dl
iv. sox
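To check whether the command-line tools listed above are already on your `PATH`, a short script along these lines can be used. It is a sketch, not part of the repository: `missing_tools` is a hypothetical helper, and `libav-tools` is left out because it is a package name rather than a single executable.

```python
import shutil

# Executable names assumed to match the dependency names above.
REQUIRED_TOOLS = ["ffmpeg", "youtube-dl", "sox"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print("missing:", missing_tools() or "none")
```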
Install
```
./install.sh
```

During inference

from src import generate_audio, load_model

Run

Run all these files as scripts.

cd src/loader

NOTE: Make sure the AVSpeech dataset is in the data/audio_visual/ folder. Downloading it requires a Google account.

Download the video dataset - interruptible

python3 download.py

Extract sound from the video

Videos can be longer than 3 seconds, so multiple audio clips are extracted from a single video file.

python3 extract_audio.py
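The idea of cutting one video into several 3-second audio clips can be sketched as follows. This is an illustration only and not the code in `extract_audio.py`: the helper names, the output naming scheme, the 16 kHz mono format, and the choice to drop a trailing partial clip are all assumptions. The commands are built but not executed.

```python
def clip_offsets(duration, clip_len=3.0):
    """Start times (seconds) of consecutive full-length clips in `duration`.
    A trailing partial clip is dropped (an assumption)."""
    count = int(duration // clip_len)
    return [i * clip_len for i in range(count)]


def ffmpeg_commands(video_path, duration, clip_len=3.0, sample_rate=16000):
    """Build one ffmpeg command per clip; nothing is run here."""
    commands = []
    for index, start in enumerate(clip_offsets(duration, clip_len)):
        out_path = f"{video_path}_part{index}.wav"  # hypothetical naming
        commands.append([
            "ffmpeg", "-y",
            "-ss", str(start),        # seek to the clip start
            "-t", str(clip_len),      # read clip_len seconds
            "-i", video_path,
            "-vn",                    # drop the video stream
            "-ac", "1",               # mono (assumed)
            "-ar", str(sample_rate),  # sample rate (assumed)
            out_path,
        ])
    return commands


for cmd in ffmpeg_commands("video.mp4", duration=10.0):
    print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run` once ffmpeg is installed.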

Mix the audio - interruptible

Synthetically mix clean audio. This can take up a lot of disk space, approximately 96 KiB per file. The total number of files can be as large as $\binom{\text{total\_files}}{\text{input\_audio\_size}}$ (total_files choose input_audio_size) for each of train and val.

python3 audio_mixer_generator.py
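To get a feel for how quickly the number of mixtures grows, the upper bound and the resulting disk usage can be estimated with a few lines. This is a rough sketch: the helper names are made up for illustration, the 96 KiB file size is the approximate figure quoted above, and the example of 1000 clips mixed in pairs is arbitrary.

```python
import math

FILE_SIZE_BYTES = 96 * 1024  # approximate size of one mixed file


def mixture_count(total_files, input_audio_size):
    """Number of ways to choose `input_audio_size` clips from `total_files`."""
    return math.comb(total_files, input_audio_size)


def storage_gib(n_files, file_size=FILE_SIZE_BYTES):
    """Disk space in GiB for `n_files` files of `file_size` bytes each."""
    return n_files * file_size / 1024**3


pairs = mixture_count(1000, 2)
print(f"{pairs} mixtures, about {storage_gib(pairs):.1f} GiB")
```

The count applies separately to the train and val splits, so the total is the sum of both.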

Remove empty audio

Generating a large volume of synthetically mixed audio (100+ files per second) produces many empty audio files, which need to be removed.

python3 remove_empty_audio.py
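One way such a cleanup could work is shown below. This is a sketch rather than the logic of `remove_empty_audio.py`: it assumes 16-bit PCM WAV files and treats a file as empty when it has zero bytes, an unreadable header, no frames, or only zero-valued samples. Both helper names are hypothetical.

```python
import os
import wave


def is_empty_wav(path):
    """True if the file is zero bytes, unreadable, or contains only zeros."""
    if os.path.getsize(path) == 0:
        return True
    try:
        with wave.open(path, "rb") as wav:
            frames = wav.readframes(wav.getnframes())
    except (wave.Error, EOFError):
        return True  # broken header: treat as empty
    return not any(frames)  # no frames, or every byte is zero


def remove_empty(directory):
    """Delete empty .wav files in `directory` and return the removed paths."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(".wav") and is_empty_wav(path):
            os.remove(path)
            removed.append(path)
    return removed
```

Calling `remove_empty` on a directory of mixed audio deletes the empty files in place, so it is worth trying it on a copy first.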

Convert the path inside the generated dataframe

Paths differ between src and src/loader. Both directories contain files that need to access the data/ directory, so this step creates a copy of the dataframe with the correct paths in src/loader/.

python3 transform_df.py
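The underlying path problem can be illustrated with the standard library alone. The sketch below is not taken from `transform_df.py`; `rebase_path` is a hypothetical helper and the example path is invented. It rewrites a relative path that is valid from one directory so that it is valid from another.

```python
import os


def rebase_path(path, old_base, new_base):
    """Rewrite relative `path` (valid from `old_base`) so that it
    resolves to the same file when used from `new_base`."""
    target = os.path.normpath(os.path.join(old_base, path))
    return os.path.relpath(target, start=new_base)


# A path stored relative to src/ gains one more "../" when used
# from src/loader/.
print(rebase_path("../data/train/audio/0.wav", "src", "src/loader"))
```

Applying such a function to every path column of the dataframe would produce the copy described above.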

Run to cache all embeddings

Create video embeddings from all the video files. This step also records which videos are corrupted. Corrupted videos include those in which no face was detected.

python3 generate_video_embedding.py

Remove corrupt frames

Remove the corrupted video frames recorded in the previous step.

python3 remove_corrupt.py

Run to cache all spectrograms (optional)

Cache all the spectrograms. This takes a lot of storage: tens to hundreds of GB.

python3 convert_to_spec.py

Train the model - interruptible

python3 train.py --bs 20 --workers 4 --cuda True
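For readers unfamiliar with flags such as `--cuda True`, here is a sketch of how a command line of this shape can be parsed with `argparse`. It is not the parser used in `train.py`: the defaults and help strings are assumptions, and only the three flag names come from the command above. An explicit string-to-bool converter is used because `type=bool` would treat any non-empty string, including `False`, as true.

```python
import argparse


def str_to_bool(value):
    """Convert 'True'/'False' style strings to a bool."""
    text = value.strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    """Hypothetical parser mirroring the flags in the command above."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    parser.add_argument("--bs", type=int, default=2, help="batch size")
    parser.add_argument("--workers", type=int, default=0,
                        help="number of data loader workers")
    parser.add_argument("--cuda", type=str_to_bool, default=False,
                        help="run on GPU")
    return parser


args = build_parser().parse_args(
    ["--bs", "20", "--workers", "4", "--cuda", "True"])
print(args.bs, args.workers, args.cuda)
```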

Results

Unfortunately, we could not train on a bigger dataset.

Example Prediction after 37 epochs (Suffering from overfitting)

Loss Plot

SNR Plot

References

Looking to Listen at a Cocktail Party: https://arxiv.org/abs/1804.03619
Discriminative Loss: https://arxiv.org/abs/1502.04149
PyTorch: https://pytorch.org
Catalyst: https://github.com/catalyst-team/catalyst
mir_eval: https://github.com/craffel/mir_eval
pysndfx: https://github.com/carlthome/python-audio-effects/tree/master/pysndfx

vitrioil/Speech-Separation
