Request for Assistance in Replicating Re-Identification Risk Experiment #267
Comments
Thank you for your interest in our paper "Measuring Re-identification Risk." We are delighted to see interest from the research community in replicating our work, and we are happy to assist you. First, we would like to clarify that the MSD dataset was used in the paper exclusively to allow academic researchers to experiment with a public dataset using our open source code. For this reason, we used a public dataset that has been part of many academic papers in the past. However, we did not intend the MSD dataset to be considered similar or related to the Topics API, since the dataset is not based on browsing histories or Topics API outputs. We refer to Section 8.5 of the paper (Measuring Re-identification Risk), where we discuss specifically how we generated samples from the MSD dataset to test the probability of correctly matching a sample based on the song ids. Notice that this data generation process is not similar to the Topics API sampling method, and the results on this dataset do not have implications for the re-id risk of the Topics API. Concerning the data, the dataset appears to be still available at this repository: http://millionsongdataset.com/tasteprofile/
I think the issue is with the API key. We can't access the user data without a key, but since the MSD changed ownership, it no longer seems to be publicly accessible. Do you know how to work around this problem, or can you confirm that this is the case?
Hi, please let me know if you have any other questions.
Hi @aleepasto,
Hi, we use the song ids associated with a given user in the dataset, without attaching any meaning to the ids. As reported in Section 8.5 of the paper, we simulate a system that outputs a sample of r songs for each user, independently, to generate two different databases. Then, we measure the match rate across the two databases for a fixed r.
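For readers trying to reproduce this setup, the procedure described above can be sketched roughly as follows. This is only an illustration, not the paper's code: the function names, the synthetic users, and in particular the matching rule (link each record to the unique record in the other database with the largest song overlap) are assumptions made for the sketch. The actual experiment is specified in Section 8.5 of the paper.

```python
import random
from collections import defaultdict


def make_synthetic_users(n_users, catalog_size, songs_per_user, rng):
    """Hypothetical stand-in for the real data: each user gets a random set of opaque song ids."""
    catalog = range(catalog_size)
    return {u: set(rng.sample(catalog, songs_per_user)) for u in range(n_users)}


def sample_database(user_songs, r, rng):
    """Release up to r song ids per user, drawn uniformly without replacement."""
    return {
        user: frozenset(rng.sample(sorted(songs), min(r, len(songs))))
        for user, songs in user_songs.items()
    }


def match_rate(db_a, db_b):
    """Fraction of users in db_a whose unique best-overlap record in db_b is their own.

    Assumed matching rule (not taken from the paper): pick the db_b record that
    shares the most song ids; ties and zero-overlap cases count as failures.
    """
    # Inverted index: song id -> users whose db_b sample contains it.
    index = defaultdict(set)
    for user, songs in db_b.items():
        for song in songs:
            index[song].add(user)

    correct = 0
    for user, songs in db_a.items():
        overlap = defaultdict(int)
        for song in songs:
            for candidate in index.get(song, ()):
                overlap[candidate] += 1
        if not overlap:
            continue
        best = max(overlap.values())
        winners = [c for c, count in overlap.items() if count == best]
        if winners == [user]:
            correct += 1
    return correct / len(db_a)


if __name__ == "__main__":
    rng = random.Random(0)
    users = make_synthetic_users(n_users=200, catalog_size=1000, songs_per_user=30, rng=rng)
    for r in (1, 5, 10, 20):
        db_a = sample_database(users, r, rng)  # first independent release
        db_b = sample_database(users, r, rng)  # second independent release
        print(f"r={r:2d}  match rate={match_rate(db_a, db_b):.3f}")
```

To run this on the Taste Profile data instead of synthetic users, one would replace `make_synthetic_users` with a loader that builds the same `user -> set of song ids` mapping; the rest of the sketch does not depend on what the ids mean.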
Hi, this discussion is really interesting. I just wanted to clarify something about the Million Song implementation. Ideally, re-identification against the Topics API is going to be based on the attack model's ability to understand user behavior. These attacks would depend strongly on the somewhat deterministic nature of the topic frequency counts in every epoch/week. However, I was confused about why r random songs were chosen for these users instead of applying the same frequency counting. Won't the randomness prevent any patterns from forming?
Thanks for the comment. Given the fundamental differences between the MSD dataset and the real Topics API implementation, we did not intend to use the MSD dataset to model the Topics API in any way. For this reason, we did not attempt to mimic any part of the API behavior (e.g., fixing the top k = 5 songs per user). We only included the dataset in the paper to allow external researchers to validate the theoretical and empirical modeling of our paper in a different context. I hope this answers your question.
Hello,
I'm attempting to replicate the re-identification risk experiment detailed in the paper "Measuring Re-identification Risk." However, I'm encountering difficulties in accessing the Million Song Dataset, which was used in the empirical analysis. Unfortunately, the Echonest website appears to be down, preventing me from obtaining the necessary API key to access the dataset.
I would greatly appreciate any guidance on how to obtain the Million Song Dataset and replicate the experiment's results. Additionally, I'm seeking information on attribute mappings for the MSD dataset, specifically to simulate a scenario similar to the Topics API, which requires data such as browser history and the frequency of visits within a week for topic calculations.
Thank you for your assistance.
Best regards,
Yash Maurya