This script uses a Tor proxy to download YouTube videos using the youtube-dl command line tool.
This lets you work around YouTube's download restrictions when downloading large datasets.
- Python 3.7+
- youtube-dl
- tor
You must first install Tor and configure it to run from the command line.
(MacOS)
brew install tor
(Linux)
sudo apt-get install tor
Once Tor is installed, copy the example torrc configuration file.
cp /usr/local/etc/tor/torrc.sample /usr/local/etc/tor/torrc
Create a password to access the local Tor proxy (remember this password for later) using:
tor --hash-password **your_password_here**
Copy the hash output from the terminal (it should look like this: 16:E3EAD3E61428CHSO20EA72221528EE489BDD9D21E937331E1D810694B2)
Edit the torrc file:
sudo nano /usr/local/etc/tor/torrc
Locate the line #HashedControlPassword, remove the comment mark (#) and paste in the output from the previous step. The line should now look like:
HashedControlPassword 16:E3EAD3E61428CHSO20EA72221528EE489BDD9D21E937331E1D810694B2
Remove the comment mark (#) from the ControlPort line.
ControlPort 9051
Save and close the file (Ctrl+X, then Y, if you are using nano to edit the file).
Congrats - you're ready to start running your Tor proxy.
You must start the Tor proxy before you run the Python script. To start the Tor proxy, run:
tor
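The ControlPort and password configured above are what a client uses to talk to the running Tor process, for example to ask for a fresh circuit. Whether downloader.py does exactly this is not shown in this README, so treat the following as a minimal sketch of the Tor control protocol commands (AUTHENTICATE, SIGNAL NEWNYM, QUIT) using only the standard library. The function names are hypothetical and the default port 9051 matches the ControlPort line above.

```python
import socket


def quote_control_string(value: str) -> str:
    """Quote a string for the control protocol, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_newnym_commands(password: str) -> bytes:
    """Return the bytes that authenticate, request a new circuit, and close the session."""
    lines = [
        f"AUTHENTICATE {quote_control_string(password)}",
        "SIGNAL NEWNYM",
        "QUIT",
    ]
    return "".join(line + "\r\n" for line in lines).encode("utf-8")


def request_new_identity(password: str, host: str = "127.0.0.1", port: int = 9051) -> str:
    """Send the commands to a running Tor control port and return the raw reply text."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_newnym_commands(password))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # Tor closes the connection after QUIT
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling request_new_identity with the plain text password (not the hash) should return lines starting with 250 when authentication succeeds, and a line starting with 515 when the password is wrong.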
Create a config.json file by copying and renaming config.example.json.
{
"tor_password": "enter_your_password_here",
"verbose_logging": false
}
If you find you're having issues running the tool, enable youtube-dl verbose logging to see what the issue is.
You can do this by editing config.json and setting verbose_logging to true (set it back to false to turn it off). Note that JSON booleans are lowercase.
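For illustration, a script could read these two settings as sketched below. This is an assumption about how the file might be consumed, not the actual code in downloader.py; load_config and DEFAULTS are hypothetical names.

```python
import json
from pathlib import Path

# Fallback values used when a key is missing from config.json (assumed defaults).
DEFAULTS = {"tor_password": "", "verbose_logging": False}


def load_config(path="config.json"):
    """Read config.json, fill in defaults, and check that verbose_logging is a boolean."""
    config_path = Path(path)
    if not config_path.exists():
        raise FileNotFoundError(
            f"{config_path} not found; copy config.example.json to config.json first"
        )
    with config_path.open(encoding="utf-8") as handle:
        loaded = json.load(handle)
    config = {**DEFAULTS, **loaded}
    if not isinstance(config["verbose_logging"], bool):
        raise ValueError("verbose_logging must be true or false (a JSON boolean)")
    return config
```

A value such as "True" in quotes would be rejected by this check, which catches the most likely editing mistake early.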
The script uses three files (below) to manage the state of downloading your dataset.
File | Description |
---|---|
dataset.csv | List of YouTube IDs in your dataset |
completed_downloads.csv | After each file is downloaded the YouTube ID will be added to this file |
error_files.csv | If a file fails to download the YouTube ID will be added to this file |
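One way to picture how the three files work together: the IDs still to download are those in dataset.csv that appear in neither of the other two files. The sketch below assumes one ID per line in the first column and is not taken from downloader.py; read_ids, pending_ids, and the retry_errors option are hypothetical.

```python
import csv
from pathlib import Path


def read_ids(path):
    """Return the first column of each non-empty row; a missing file counts as empty."""
    file_path = Path(path)
    if not file_path.exists():
        return []
    with file_path.open(newline="", encoding="utf-8") as handle:
        return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


def pending_ids(dataset="dataset.csv",
                completed="completed_downloads.csv",
                errors="error_files.csv",
                retry_errors=False):
    """List dataset IDs that have not been downloaded yet, keeping dataset order."""
    done = set(read_ids(completed))
    if not retry_errors:
        # By default, IDs that already failed are skipped as well.
        done |= set(read_ids(errors))
    return [video_id for video_id in read_ids(dataset) if video_id not in done]
```

Keeping the state in plain CSV files means an interrupted run can pick up where it stopped without re-downloading completed videos.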
Start by adding YouTube IDs to the dataset.csv file. This file will be used by the script to download the YouTube videos into the /videos folder.
Files will be downloaded in .mp4 format.
Each line should contain one YouTube ID. The ID is the value found at the end of a YouTube link, e.g. in https://www.youtube.com/watch?v=dQw4w9WgXcQ the YouTube ID is dQw4w9WgXcQ.
nQPXu-T9uWc
bX2KCrEAc5w
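If your source data contains full links rather than bare IDs, a small helper like the one below can pull out the ID before you write dataset.csv. It is a sketch that only handles the watch?v= link form shown above; extract_video_id is a hypothetical name and is not part of this tool.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url):
    """Return the value of the v query parameter, or None if the link has none."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


print(extract_video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
```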
Start the script using
python downloader.py
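To give a rough idea of what happens for each ID, the sketch below builds a youtube-dl command that routes traffic through Tor. This README does not show how downloader.py actually invokes youtube-dl, so the details are assumptions: the SOCKS port 9050 is Tor's default, the output template and the build_youtube_dl_command name are hypothetical, and only the documented youtube-dl options --proxy, -f, -o, and --verbose are used.

```python
def build_youtube_dl_command(video_id, verbose=False, socks_port=9050, output_dir="videos"):
    """Build (but do not run) a youtube-dl command line that downloads through Tor."""
    command = [
        "youtube-dl",
        "--proxy", f"socks5://127.0.0.1:{socks_port}",  # Tor's local SOCKS proxy
        "-f", "mp4",                                     # request the .mp4 format
        "-o", f"{output_dir}/%(id)s.%(ext)s",            # name each file after its ID
    ]
    if verbose:
        command.append("--verbose")
    command.append(f"https://www.youtube.com/watch?v={video_id}")
    return command
```

The resulting list can be passed to subprocess.run(command, check=True), which raises an error when youtube-dl exits with a failure code.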
SocketError: [Errno 61] Connection refused
- Check that you have Tor running and configured correctly. This error means that either Tor is not running, the password is incorrect, or the ControlPort line has not been uncommented in the torrc file.
IncorrectPassword: Authentication failed: Password did not match HashedControlPassword value from configuration
- Check that your script is using the plain text password that you set when you configured the Tor password (see above)
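When diagnosing a connection refused error, it can help to confirm first that something is listening on the control port at all. The following is a small standard-library sketch; port_is_open is a hypothetical helper and the default port 9051 matches the ControlPort line in torrc.

```python
import socket


def port_is_open(host="127.0.0.1", port=9051, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("Tor control port reachable:", port_is_open())
```

If this prints False while tor is running, the ControlPort line is most likely still commented out.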