Linguistic Features for Age Detection in Text

Name: Varun Venkatesh

Date: 04/25/2023

Description

This project aims to find linguistic markers of age in blog posts from the Blog Authorship Corpus and in comments from a custom, age-grouped Reddit dataset.

Dataset

The "found" dataset used for this project is the Blog Authorship Corpus, which contains over 600,000 posts written on a wide range of topics. The dataset is available on Kaggle and was originally compiled by Dr. Jichan Zeng at the University of Illinois at Urbana-Champaign.

The project also collects, processes, and uses a custom Reddit dataset that is divided into age groups. More info can be found here, as well as in the reddit_data/ folder.

Important Files and Folders

final_report.md: A report detailing the final results of the project
progress_report.md: A report detailing the progress made, with steps, process, timeline, etc.
data_samples/: Contains the raw data files as well as processed data samples
scripts/: Contains Python scripts for various tasks (preprocessing, feature extraction, model training, and model evaluation)
notebooks/: Contains Jupyter notebooks for the main work (discovery, processing, EDA, analysis, etc.)
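
As a rough illustration of what a feature-extraction step like the one in scripts/ might compute, here is a minimal sketch. The function name `extract_features` and the four features chosen are assumptions made for illustration; they are not taken from the project's scripts, which may use different features and tooling.

```python
import re


def extract_features(text: str) -> dict:
    """Compute a few simple surface-level features of a text.

    Hypothetical helper: the feature set is illustrative only.
    """
    # Lowercased word tokens (letters and apostrophes only).
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]

    n_tokens = len(tokens)
    if n_tokens == 0:
        return {
            "n_tokens": 0,
            "avg_word_len": 0.0,
            "type_token_ratio": 0.0,
            "avg_sentence_len": 0.0,
        }

    return {
        "n_tokens": n_tokens,
        # Mean number of characters per token.
        "avg_word_len": sum(len(t) for t in tokens) / n_tokens,
        # Distinct tokens divided by total tokens (lexical variety).
        "type_token_ratio": len(set(tokens)) / n_tokens,
        # Mean number of tokens per sentence.
        "avg_sentence_len": n_tokens / max(len(sentences), 1),
    }
```

Calling `extract_features` on each post or comment yields one dictionary per text, which could then be collected into a table and compared across age groups.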

Feedback and comments on the project are in the Guestbook

Most Important Jupyter Notebooks

blog_final.ipynb: A Jupyter notebook detailing the analysis of the Blog Authorship Corpus
reddit_final.ipynb: A Jupyter notebook detailing the analysis of the Reddit dataset
process_blog_data.ipynb: A Jupyter notebook detailing the extraction of the Blog Authorship Corpus
process_reddit_data.ipynb: A Jupyter notebook detailing the extraction of the raw Reddit data obtained using this script

Note: If the Jupyter notebooks fail to load on GitHub, use the nbviewer links provided at the top of the corresponding files.

To replicate any work, run the following:

# create virtual environment
python3 -m venv /path/to/new/virtual/environment
# activate virtual environment
source /path/to/new/virtual/environment/bin/activate
# install requirements
pip3 install -r requirements.txt
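
Before installing the requirements, it can help to confirm that the virtual environment is actually active. The following is a minimal sketch of such a check; the helper name `in_virtualenv` is an assumption and is not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv.

    Hypothetical helper, not part of this repository.
    """
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

Running this with `python3` after the `source` step should report that the environment is active; if it does not, the activation step likely did not take effect in the current shell.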
