Aumit Leon edited this page Nov 24, 2017 · 6 revisions

Welcome to the million-songs wiki!

A machine learning approach to the Million Song Dataset, by Aumit Leon and Mariana Echeverria

The data directory has subdirectories that act like volumes. If you go deep enough, you'll find the H5 files that correspond to each song.


The Dataset

The Million Song Dataset has data on 1,000,000 songs by 44,745 unique artists, along with user-supplied tags from the MusicBrainz website.

Converting the data to a usable format

The data is given to us in HDF5 format (https://support.hdfgroup.org/HDF5/whatishdf5.html).

HDF5 files are binary, so they can't be read or opened in a spreadsheet as given. To extract data from the h5 files, use get_data.py.

The Million Song Dataset provides Python wrappers in hdf5_getters.py for reading individual fields from an h5 file. Calling them while recursively looping through each subdirectory lets you extract the features you want from every song.

get_data.py visits every subdirectory, starting from the path you give as indir, and writes the data extracted from each h5 file to a CSV. You don't need to put the script anywhere special; just give it a valid path for indir. The output.csv file is created in the same directory as the script, so be sure not to commit that CSV file to Git :)
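The directory walk described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of get_data.py: the function names find_h5_files and write_csv, the extract_row callback, and the header argument are all hypothetical. In practice extract_row would open the h5 file and call the getter wrappers; here it is left as a parameter so the sketch only depends on the standard library.

```python
import csv
import os


def find_h5_files(indir):
    """Yield the path of every .h5 file under indir, visiting subdirectories recursively."""
    for root, dirs, files in os.walk(indir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if name.endswith(".h5"):
                yield os.path.join(root, name)


def write_csv(indir, outpath, extract_row, header):
    """Write one CSV row per .h5 file found under indir.

    extract_row is a hypothetical callback: it takes a file path and returns
    a list of values (for example, fields read with the getter wrappers).
    Returns the number of files processed.
    """
    count = 0
    with open(outpath, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for path in find_h5_files(indir):
            writer.writerow(extract_row(path))
            count += 1
    return count
```

Keeping the traversal separate from the extraction makes it easy to try the walk on a small subdirectory first, before running it over the whole dataset.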

Year Prediction Dataset

The Year Prediction dataset is a simplified subset of the Million Song Dataset. It has 90 attributes (features) per song: 12 timbre averages and 78 timbre covariances. The dataset is available at: http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD
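As a rough sketch of how one row of this dataset could be split up in Python, the snippet below assumes that each line holds the release year first, followed by the 12 timbre averages and then the 78 timbre covariances. That column order is an assumption taken from the dataset's description page, and the names parse_row and read_rows are hypothetical.

```python
import csv

N_TIMBRE_AVG = 12  # timbre average features
N_TIMBRE_COV = 78  # timbre covariance features


def parse_row(fields):
    """Split one row into (year, timbre_avg, timbre_cov).

    Assumes the layout: year, then 12 averages, then 78 covariances.
    """
    expected = 1 + N_TIMBRE_AVG + N_TIMBRE_COV
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    year = int(fields[0])
    values = [float(x) for x in fields[1:]]
    return year, values[:N_TIMBRE_AVG], values[N_TIMBRE_AVG:]


def read_rows(lines):
    """Yield parsed rows from an iterable of comma-separated lines, skipping blank lines."""
    for fields in csv.reader(lines):
        if fields:
            yield parse_row(fields)
```

read_rows accepts any iterable of lines, so an open file handle on the downloaded text file can be passed in directly and the rows are parsed one at a time without loading the whole file into memory.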

When you download the dataset, you get a large, comma-separated text file. Because the data is already comma-separated, all you need to open it in Excel is a copy with a .csv extension: cd into the directory where you downloaded the dataset and run the following: cat YearPredictionMSD.txt > yp.csv

Things to work on

  • Finish extracting the data and pick out which features we want to use
  • Pick which aspect of the data we want to run experiments on
  • Prototype some crude ML models!

Data preprocessing and other information

The dataset uses the Echo Nest API to collect quantitative information about songs (danceability, tempo, loudness, segment analysis, etc.). The Echo Nest was acquired by Spotify and integrated into Spotify's APIs, where it is still available for use by developers. To learn more, visit: https://developer.spotify.com/spotify-echo-nest-api/