Lindsey Rojtas | [email protected] | University of Pittsburgh Class of 2023
This repository will act as my platform for my final project for Advanced Special Topics in Computing and Info (CMPINF 1999) at the University of Pittsburgh. The special topics we are covering are comparative digital privacies.
Informed consent is difficult to acquire through an online platform. Privacy policies are a commonplace digital version of informed consent forms, but a user cannot be informed if they don't read the policy. Whether that is the fault of the user or of the policy's writer depends on the language used in the policy. I will be tokenizing privacy policies from a variety of websites and analyzing their language to determine whether the policies are short and simple enough for the average Internet user to read.
This project is an investigative analysis of the length in words, average word length, and type-token ratio (word uniqueness) of 113 different privacy policies from the OPP-115 corpus in the form of HTML files (see below for more information on the corpus). These policies were found in 2016 and include, but are not limited to, those of Google, Amazon, New York Times, NBC Universal, Barnes and Noble, and other news and business sites. You will find my code and my explanations for it in the code folder of this repository, which contains code.ipynb. It is viewable online, so there is no need to download it. The notebook includes comments and annotations to make my thought process easier to follow. In samples, you will find an HTML file from IMDb's website that I have put in my repository under Fair Use. Information on the LICENSE can be found under the licensing heading of this README file.
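As a rough illustration of the three metrics named above, here is a minimal sketch in Python using only the standard library. It is not the code from code.ipynb: the names TextExtractor, html_to_text, tokenize, and policy_metrics are hypothetical, and the tokenization rule (lowercased alphabetic words, apostrophes kept inside a word) is an assumption that will change the numbers if chosen differently.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from an HTML document, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def tokenize(text):
    """Assumed rule: lowercased alphabetic words, apostrophes allowed inside a word."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def policy_metrics(html):
    """Compute word count, average word length, and type-token ratio for one policy."""
    tokens = tokenize(html_to_text(html))
    if not tokens:
        return {"word_count": 0, "avg_word_length": 0.0, "type_token_ratio": 0.0}
    return {
        "word_count": len(tokens),
        # Apostrophes inside a token count toward its length under this rule.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Unique words (types) divided by total words (tokens).
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


if __name__ == "__main__":
    sample = "<html><body><p>We collect data. We share data.</p></body></html>"
    print(policy_metrics(sample))
```

One caveat with this kind of measure: type-token ratio tends to fall as a text gets longer, so comparing it across policies of very different lengths should be done with care.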
It is important to note going in that this should not be taken as a definitive analysis. In this project, a good privacy policy does not mean that I agree with the policy in practice. This analysis is an attempt to determine, through computational methods, whether a policy is human readable.
Another small bit of motivation is to look at this issue with a polymathic approach. These issues with privacy policies are both computer/information issues and linguistic issues. As such, I am using skills that I learned in both my CS classes and my linguistics classes to complete this project.
If you are my CMPINF 1999 professor, Mr. Song Shi at the University of Pittsburgh, hello! I hope you have fun looking at and grading this repository. The entire repository is my submission, as my code file also contains annotations and documentation to illustrate my thought process.
If you're anyone else, welcome! I really enjoyed working on this project and I hope you enjoy looking at it too. I may expand on it in the future.
Special thanks to Professor Shi and my CMPINF 1999 peers. It's been a great semester!
From the Usable Privacy Policy Project's website:
"The OPP-115 Corpus (Online Privacy Policies, set of 115) is a collection of website privacy policies (i.e., in natural language) with annotations that specify data practices in the text. Each privacy policy was read and annotated by three graduate students in law.
The dataset is made available for research, teaching, and scholarship purposes only, with further parameters in the spirit of a Creative Commons Attribution-NonCommercial License. Contact Prof. Norman Sadeh with any questions.
If you use this dataset as part of a publication, you must cite the following paper:
The creation and analysis of a website privacy policy corpus. Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, and Norman Sadeh. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, August 2016.
The above paper is also an essential read for understanding the structure and contents of the corpus."
This project is licensed under CC-BY-NC 3.0. The full license text is in LICENSE.md, and a human-readable summary is available from Creative Commons. A sample of the data in the form of an HTML file is published under fair use.