Name: Varun Venkatesh
Email: [email protected]
Date: 04/25/2023
This project aims to find markers of age in blog posts from the Blog Authorship Corpus and in comments from a custom, age-grouped Reddit dataset.
The "found" dataset used for this project is the Blog Authorship Corpus, which contains over 600,000 posts written on a wide range of topics. The dataset is available on Kaggle and was originally compiled by Schler, Koppel, Argamon, and Pennebaker (2006).
The project also involves collecting, processing, and using a custom Reddit dataset that is divided into age groups. More info can be found here, as well as in the reddit_data/ folder.
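As a rough illustration of what "divided into age groups" can mean in practice, the sketch below buckets records by author age. The bracket labels and boundaries, the `age` and `text` field names, and the helper names are all assumptions made for this example; they are not taken from the project's notebooks.

```python
from collections import defaultdict

# Hypothetical brackets for illustration only; the project's real grouping may differ.
AGE_BRACKETS = [
    ("teens", 13, 19),
    ("twenties", 20, 29),
    ("thirties_plus", 30, 120),
]


def age_bracket(age):
    """Return the label of the bracket containing `age`, or None if none matches."""
    for label, low, high in AGE_BRACKETS:
        if low <= age <= high:
            return label
    return None


def group_texts_by_age(records):
    """Group the `text` of each record under the bracket of its `age` field.

    `records` is any iterable of dict-like rows (for example, rows from
    csv.DictReader) with assumed `age` and `text` keys.
    """
    groups = defaultdict(list)
    for record in records:
        label = age_bracket(int(record["age"]))
        if label is not None:  # skip ages outside every bracket
            groups[label].append(record["text"])
    return dict(groups)
```

Because the function accepts any iterable of dict-like rows, the same helper could be applied to both the blog data and the Reddit data, provided each is first mapped to these two assumed fields.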
final_report.md: A report detailing the final results of the project
progress_report.md: A report detailing the progress made, with steps, process, timeline, etc.
data_samples/: Contains the raw data files as well as processed data samples
scripts/: Contains Python scripts for various tasks (preprocessing, feature extraction, model training, and model evaluation)
notebooks/: Contains IPython (Jupyter) notebooks for the main work like discovery, processing, EDA, analysis, etc.
Feedback and comments on the project are in the Guestbook.
blog_final.ipynb: A Jupyter notebook detailing the analysis of the Blog Authorship Corpus
reddit_final.ipynb: A Jupyter notebook detailing the analysis of the Reddit dataset
process_blog_data.ipynb: A Jupyter notebook detailing the extraction of the Blog Authorship Corpus
process_reddit_data.ipynb: A Jupyter notebook detailing the extraction of the raw Reddit data obtained using this script
Note: If the Jupyter notebooks fail to load on GitHub, use the nbviewer links provided at the top of the corresponding files.
To replicate any work, run the following:
# create virtual environment
python3 -m venv /path/to/new/virtual/environment
# activate virtual environment
source /path/to/new/virtual/environment/bin/activate
# install requirements
pip3 install -r requirements.txt
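
Before running the notebooks, it can help to confirm that the interpreter in use really is the one from the virtual environment. The check below is a minimal sketch using only the standard library; the function name `in_virtual_environment` is made up for this example and is not part of the project.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    if in_virtual_environment():
        print("Virtual environment active:", sys.prefix)
    else:
        print("No virtual environment detected; activate it before installing requirements.")
```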