1. What Is in This Repo?
2. What Did We Do?
3. Links to Datasets and Final Models
4. What Libraries, Technologies, and Techniques Are We Using?
5. Known Issues
6. Thank You!
Privacy Pioneer is a browser extension for helping people understand the privacy implications of their visits to websites.
This repo contains everything machine learning-related to Privacy Pioneer:
- The datasets we used for training, validation, and testing of Privacy Pioneer's machine learning model.
- Links to all models we experimented with (in their final state).
- The model that is used in Privacy Pioneer. When a user installs Privacy Pioneer, the model is served from here.
The machine learning for Privacy Pioneer is developed and maintained by the Privacy Pioneer team.
We had to create a way to integrate Machine Learning into a browser extension. This brought up a few questions to answer:
- Where do we want this model to be stored?
- Which pre-trained model should we use?
- What is the process for model training?
- How do we get it into the extension?
The Privacy Pioneer extension is built around surfacing information about how websites take your data and send it to third parties without your explicit consent or knowledge. Thus, we want this machine learning functionality to run locally in a user's browser, which means we need a small, fast, and accurate model stored and running within the extension itself.
BERT is a pre-trained language-understanding model built by Google, originally for search optimization. It uses a transformer neural network architecture to improve results on Natural Language Processing (NLP) tasks. The original BERT paper released two models, bert-base-uncased (a 500mb model) and bert-large-uncased (a 1.2gb model). However, we decided to use a smaller model, because large models both consume a large amount of data and have long analysis times. Numerous smaller models have since been created that achieve fairly comparable results. We chose TinyBERT as our pre-trained model, weighing in at 59mb. As reported in the TinyBERT paper, it performs 1.2 billion floating point operations per inference (actually a fairly small amount), making it 9.4 times faster than bert-base-uncased. Yet its average score across all of the datasets used in the TinyBERT study was 77%, just slightly below the 79.5% of bert-base-uncased, at almost 1/8th the size.
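As a quick sanity check on those size numbers, here is a minimal sketch that loads TinyBERT through Hugging Face Transformers and counts its parameters. The checkpoint ID is the public 4-layer general TinyBERT release and is an assumption for illustration, not necessarily the exact checkpoint we trained from:

```python
# Minimal sketch: load the public 4-layer TinyBERT checkpoint (an assumed
# stand-in for the checkpoint we used) and verify its small size.
from transformers import AutoModel, AutoTokenizer

checkpoint = "huawei-noah/TinyBERT_General_4L_312D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# ~14.5M parameters at 4 bytes each works out to roughly the 59mb quoted above.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters, ~{n_params * 4 / 1e6:.0f}MB as float32")
```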
Here is a brief overview of the steps we took to get to our final models:
- We trained hundreds of models on our datasets, using TinyBERT and bert-base-uncased as our pre-trained models, to see if we could get results from a small model that were similar to those of a larger model like bert-base-uncased. We found that our small models performed fairly similarly to the larger models, so we knew we could use the small models in Privacy Pioneer. We utilized Hugging Face Transformers, PyTorch, Weights and Biases, and Google Colab for this training process (see the sketch after this list).
- To create our human-labeled datasets, we had only taken a sample of all the data we collected. Now that we had models that were very good at correctly labeling our data, we labeled the previously unlabeled data using the models we had trained from bert-base-uncased, and then used that newly machine-labeled data to train new TinyBERT models. These models were thus trained on a much larger dataset, which we hoped would give better results. This process is a form of knowledge distillation. However, the distilled models performed slightly worse than the normally trained models, so we learned that we did not want to use them.
- At this point, we had 5 different models, each trained on only one type of data (recall that the 5 types we are looking for are City, Region, Latitude, Longitude, and Zip Code). We then explored training a single model on all of our human-labeled data and checked its results against the 5 individual models. Interestingly, the single multitask model performed almost exactly the same as the 5 individual models. This meant we could use only 59mb on a user's computer as opposed to the 295mb that the 5 individual models would require. Thus, we knew that a single multitask model would perform well in our extension.
- Now we knew that our model for Privacy Pioneer would be a TinyBERT-based, non-distilled, multitask model.
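To make the first and third bullets concrete, here is a condensed, hedged sketch of the fine-tuning setup: a single TinyBERT classifier with an independent sigmoid output per data type, trained with the Hugging Face Trainer and logged to Weights and Biases. The example rows, hyperparameters, checkpoint ID, and directory name are illustrative stand-ins, not the exact values from our training scripts:

```python
# Illustrative sketch of the multitask fine-tuning step (not our exact
# training script): one TinyBERT classifier with a sigmoid output per label.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

LABELS = ["city", "region", "latitude", "longitude", "zip"]
checkpoint = "huawei-noah/TinyBERT_General_4L_312D"  # public 4-layer TinyBERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Stand-in rows; the real training data is the annotated HTTP-request text
# in ./annotatedData. Labels are multi-hot floats, one slot per data type.
rows = [
    {"text": "geo=41.55,-72.65&zip=06459", "labels": [0.0, 0.0, 1.0, 1.0, 1.0]},
    {"text": "utm_source=newsletter", "labels": [0.0, 0.0, 0.0, 0.0, 0.0]},
]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train = Dataset.from_list(rows).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE loss, sigmoid per label
)

args = TrainingArguments(
    output_dir="tinybert-multitask",  # illustrative directory name
    per_device_train_batch_size=32,
    num_train_epochs=3,
    report_to="wandb",  # stream metrics and hyperparameters to Weights and Biases
)

trainer = Trainer(model=model, args=args, train_dataset=train)
trainer.train()
trainer.save_model("tinybert-multitask")
tokenizer.save_pretrained("tinybert-multitask")
```

The distillation experiment from the second bullet reused this same loop, except that the training rows were first labeled by our fine-tuned bert-base-uncased models rather than by human annotators.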
Once we had a final model that met our requirements, we used Hugging Face and TensorFlow to convert our model (a PyTorch saved model) into the TensorFlow.js GraphModel format. Thus, we had a model that was able to run in JavaScript in the browser. We placed that model into the Privacy Pioneer extension, where it gets downloaded to IndexedDB and loaded into the browser for use while our extension is running.
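The conversion path can be sketched as follows. The directory names are illustrative (the output folder mirrors the multitaskModelForJSWeb folder mentioned below), and the converter invocation uses the documented tensorflowjs_converter CLI from the tensorflowjs pip package:

```python
# Sketch of the PyTorch -> TensorFlow -> TensorFlow.js conversion path.
# Directory names are illustrative; requires the transformers, tensorflow,
# and tensorflowjs packages.
import subprocess
from transformers import TFAutoModelForSequenceClassification

# 1. Reload the fine-tuned PyTorch weights into the TensorFlow version of
#    the same architecture.
tf_model = TFAutoModelForSequenceClassification.from_pretrained(
    "tinybert-multitask", from_pt=True
)

# 2. Export a TensorFlow SavedModel; transformers writes it under
#    tinybert-multitask/saved_model/1.
tf_model.save_pretrained("tinybert-multitask", saved_model=True)

# 3. Convert the SavedModel to a TensorFlow.js GraphModel with the official
#    converter CLI (equivalent to running the same command in a shell).
subprocess.run(
    [
        "tensorflowjs_converter",
        "--input_format=tf_saved_model",
        "tinybert-multitask/saved_model/1",
        "multitaskModelForJSWeb",
    ],
    check=True,
)
```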
Datasets:
- City:
- Region:
- Latitude:
- Longitude:
- Zip:
- All:
Models:
- City:
- Region:
- Latitude:
- Longitude:
- Zip:
- Multitask:
Results: (note: each value is the Average F1 Score)
- Training/validation/test set data
- The datasets we are using are located within the Privacy Pioneer Google Drive Folder and are also in this repo under the ./annotatedData folder. We use an 80% training, 10% validation, 10% test split of the data (see the sketch at the end of this section). The test dataset is set up to be the data that was labeled by multiple independent labelers.
- Hugging Face
- Hugging Face is a machine learning library, ML/AI community, and dedicated API set up to assist with the creation, storage, and distribution of machine learning programs and datasets.
- Google Colab
- Google Colab is a hosted environment for running Jupyter Notebooks in the cloud on GPUs and TPUs, independently of your own computer's runtime. This enables us to quickly build and test ML models using minimal local resources.
- Weights and Biases
- Weights and Biases is a framework for tracking metrics and hyperparameters during training and evaluation of Machine Learning models. It assisted greatly with understanding where we could optimize our ML pipeline.
- Base Model Details
- We primarily make use of TinyBERT and bert-base-uncased as the base models that we retrain for our specific use case. We use TinyBERT as the main model because it is only 59mb, which we deem an acceptable size: accurate enough while remaining small enough to be loaded into a browser extension. We also explored knowledge distillation from bert-base-uncased to TinyBERT to better achieve accuracy at a small model size.
- How do the different scripts, frameworks, data interact?
- Our datasets are located on the privacy-tech-lab Hugging Face Team. Our files, scripts, and models are within the Privacy Pioneer Google Drive Folder's Machine Learning section and also on Hugging Face (see below).
- Where is the model you use in Privacy Pioneer?
- The folder ./convertMultiModel/multitaskModelForJSWeb is our model that we load into the Privacy Pioneer browser extension.
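To make the 80/10/10 split described under Training/validation/test set data concrete, here is a minimal sketch using the Hugging Face datasets library. The CSV file name is a placeholder, and note that our actual test set is the independently multi-labeled subset rather than a random slice, so the random split below is only a stand-in:

```python
# Illustrative 80/10/10 split; "annotatedData/city.csv" is a placeholder
# file name, and our real test set is the multi-labeler subset, not a
# random slice.
from datasets import load_dataset

data = load_dataset("csv", data_files="annotatedData/city.csv")["train"]

# Carve off 20%, then split that holdout in half: 10% validation, 10% test.
split = data.train_test_split(test_size=0.2, seed=42)
holdout = split["test"].train_test_split(test_size=0.5, seed=42)

train, validation, test = split["train"], holdout["train"], holdout["test"]
print(len(train), len(validation), len(test))
```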
As we developed our models in Python and then converted them to JavaScript, we noticed that the conversion resulted in a classification performance decrease of 4.7 percentage points on average across precision, recall, and F1 score. We used Google's official libraries for the conversion and confirmed with Google that our conversion methodology was correct. We have opened an issue with Google.
We would like to thank our supporters!
Major financial support provided by Google.
Additional financial support provided by Wesleyan University and the Anil Fernando Endowment.
Conclusions reached or positions taken are our own and not necessarily those of our financial supporters, their trustees, officers, or staff.