With the current state of the world, I'm sure everyone has second-guessed whether a news headline was real or satire. To save people the trouble of actually reading the article, I built a classifier that takes in your headline and estimates the probability that it came from The Onion, the largest satirical news network on the web. The heart of the process is a pre-trained DistilBERT model, which eats a tokenized version of your headline and kindly spits out a fixed-length embedding vector that represents it. A logistic regression model trained on top of those embeddings then turns the vector into a probability that the headline is from The Onion. To train the logistic regression, I used 30k headlines gathered from two subreddits, r/TheOnion and r/NotTheOnion. On this balanced dataset, where random guessing would score about 50%, the logistic regression reaches 87% accuracy on the training set and 85% accuracy on the test set.
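For readers who want to see the shape of that pipeline, here is a minimal sketch, not the project's actual code. It makes several assumptions the text does not confirm: the `distilbert-base-uncased` checkpoint, the use of the first-token (`[CLS]`) hidden state as the headline embedding, and the hypothetical helper names `embed_headline` and `onion_probability`. The weights and bias stand in for whatever the trained logistic regression learned.

```python
import math
from typing import List, Sequence

# Assumption: the text does not name the exact DistilBERT checkpoint.
MODEL_NAME = "distilbert-base-uncased"


def embed_headline(headline: str) -> List[float]:
    """Return one embedding vector for a headline from a pre-trained DistilBERT.

    Assumption: the hidden state at the first ([CLS]) position is used as the
    headline embedding; the original project may pool the tokens differently.
    """
    # Imported lazily so the logistic-regression step below can run
    # without torch/transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    inputs = tokenizer(headline, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch, tokens, hidden); take batch 0, token 0.
    return outputs.last_hidden_state[0, 0].tolist()


def onion_probability(
    embedding: Sequence[float], weights: Sequence[float], bias: float
) -> float:
    """Logistic regression: sigmoid of the weighted sum of embedding features."""
    if len(embedding) != len(weights):
        raise ValueError("embedding and weights must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, embedding))
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

If the regression were fitted with scikit-learn, for instance, `weights` and `bias` would correspond to the fitted model's `coef_[0]` and `intercept_[0]`, and `onion_probability(embed_headline("Some headline"), weights, bias)` would give the score for a single headline.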
This project is just for fun, but if you'd like more details, check out the notebook that walks through my process 🤗.
The data was originally sourced from https://github.com/lukefeilberg/onion