In this assignment we use Naive Bayes (NB) for its two greatest strengths:
- Exploration of a data set
- Classification of new data based on training data
The databases are not small, so I've put them on Canvas. When you download those files, move them into the repository folder.
In this section, you will build a Naive Bayes classifier on the 2024 convention speeches, using the words of each speech to predict the speaker's party (Republican or Democratic). Your starting notebook walks you through the steps of fitting and using a Naive Bayes model from the NLTK package. This repo includes some code to help you limit the number of words you consider in your model, which can improve run time. We have asked you to fill in some observations from the fitted model.
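The details live in the starter notebook, but the overall workflow looks roughly like the sketch below. The toy `convention_data` list, the `word_cutoff` value, and the helper names are illustrative stand-ins, not part of the notebook's scaffolding.

```python
import nltk
from string import punctuation

# Hypothetical input: a list of (speech_text, party) pairs pulled from the
# convention database. The toy strings here stand in for the real speeches.
convention_data = [
    ("we will cut taxes and grow the economy", "Republican"),
    ("we will protect health care and voting rights", "Democratic"),
]

# Keep only words that appear at least `word_cutoff` times across all speeches.
# Raising the cutoff shrinks the feature space, which is the kind of limit
# that can improve run time on the full data set.
word_cutoff = 1  # use something larger (e.g., 5) on the real speeches
word_counts = nltk.FreqDist(
    word.strip(punctuation).lower()
    for text, _ in convention_data
    for word in text.split()
)
feature_words = {w for w, count in word_counts.items() if count >= word_cutoff}

def conv_features(text, fw):
    """Map each feature word that appears in `text` to True."""
    words = {word.strip(punctuation).lower() for word in text.split()}
    return {w: True for w in words if w in fw}

# NLTK's Naive Bayes classifier trains on (feature_dict, label) pairs.
featuresets = [(conv_features(text, feature_words), party)
               for text, party in convention_data]
classifier = nltk.NaiveBayesClassifier.train(featuresets)

# The fitted model can report which words most strongly separate the parties.
classifier.show_most_informative_features(10)
```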
We have a pretty gigantic database of tweets (and other data) from everyone running for Congress in 2018. As an exercise, we'll try to use this convention model to classify those tweets.
The notebook walks you through the steps in broad terms (a short sketch of the loop follows the list):
- Pull data from the congressional DB.
- Clean, tokenize, and build your feature dictionary for a tweet.
- Use the classifier from Part 1 to estimate the party of the tweeter.
- Compare this estimate to their actual party.
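Strung together, that loop looks something like the sketch below. It assumes the `conv_features`, `feature_words`, and `classifier` objects from the earlier sketch, and the database file, table, and column names are guesses; check the starter notebook for the real ones.

```python
import sqlite3
from string import punctuation

# File, table, and column names below are illustrative guesses; use the
# ones in the starter notebook. `conv_features`, `feature_words`, and
# `classifier` are assumed to come from the convention sketch above.
cong_db = sqlite3.connect("congressional_data.db")
cur = cong_db.cursor()

rows = cur.execute(
    """
    SELECT party, tweet_text     -- hypothetical column names
    FROM tweets                  -- hypothetical table name
    LIMIT 1000
    """
).fetchall()

score, total = 0, 0
for actual_party, tweet_text in rows:
    # Clean and tokenize: decode if needed, lowercase, strip punctuation.
    if isinstance(tweet_text, bytes):
        tweet_text = tweet_text.decode("utf-8", errors="ignore")
    cleaned = " ".join(w.strip(punctuation).lower() for w in tweet_text.split())

    # Build the feature dictionary exactly as for the speeches, then let
    # the convention classifier estimate the tweeter's party.
    estimated_party = classifier.classify(conv_features(cleaned, feature_words))

    # Compare the estimate to the actual party.
    total += 1
    score += (estimated_party == actual_party)

print(f"Estimated party matched the actual party on {score} of {total} tweets.")
```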
If you're looking to go further on the assignment, consider building separate models from the 2024, 2020, and 2016 convention data. Which one classifies the tweets most accurately? My instinct is that the 2016 model will, but I haven't tested it.
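The comparison itself is just a loop over years: train a model per year and score the same batch of tweets with each. In this rough sketch, `load_convention_data` is a hypothetical loader, the toy tweet rows stand in for the congressional DB, and `conv_features` is the feature builder from the first sketch.

```python
import nltk

# Everything here is a stand-in: `load_convention_data` is a hypothetical
# loader, the toy tweet rows replace the congressional DB, and
# `conv_features` comes from the first sketch.
def load_convention_data(year):
    """Return (speech_text, party) pairs for one convention year."""
    return [("we will cut taxes and grow the economy", "Republican"),
            ("we will protect health care and voting rights", "Democratic")]

tweet_rows = [("Republican", "cut taxes now"),
              ("Democratic", "health care is a right")]

for year in (2016, 2020, 2024):
    data = load_convention_data(year)
    # Train a convention model for this year; on the real speeches you
    # would apply the word-count cutoff from the first sketch.
    fw = {w.lower() for text, _ in data for w in text.split()}
    featuresets = [(conv_features(text, fw), party) for text, party in data]
    clf = nltk.NaiveBayesClassifier.train(featuresets)

    # Score the same batch of tweets with each year's model.
    correct = sum(clf.classify(conv_features(text, fw)) == party
                  for party, text in tweet_rows)
    print(year, f"{correct}/{len(tweet_rows)} tweets matched")
```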