
Linguistic Features for Age Detection in Text

Name: Varun Venkatesh

Email: [email protected]

Date: 04/25/2023

Description

This project aims to identify linguistic markers of age in blog posts from the Blog Authorship Corpus and in comments from a custom, age-grouped Reddit dataset.

Dataset

The "found" dataset used for this project is the Blog Authorship Corpus, which contains over 600,000 posts written on a wide range of topics. The dataset is available on Kaggle and was originally compiled by Dr. Jichan Zeng at the University of Illinois at Urbana-Champaign.

The project also involves collecting, processing, and using a custom Reddit dataset divided into age groups. More information can be found here, as well as in the reddit_data/ folder.

Important Files and Folders

  • final_report.md: A report detailing the final results of the project
  • progress_report.md: A report detailing the progress made, with steps, process, timeline, etc.
  • data_samples/: Contains the raw data files as well as processed data samples
  • scripts/: Contains Python scripts for various tasks (preprocessing, feature extraction, model training, and model evaluation)
  • notebooks/: Contains Jupyter (IPython) notebooks for the main work, such as discovery, processing, EDA, and analysis
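
As a rough illustration of the kind of feature extraction the scripts/ folder covers, here is a minimal sketch that computes a few simple linguistic features per age group. It is not the project's actual code: the age bands, the function names (`age_band`, `extract_features`, `mean_features_by_band`), the chosen features, and the sample records are all assumptions made for the example.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical age bands; the project's actual grouping may differ.
AGE_BANDS = [(13, 17, "teens"), (18, 29, "twenties"), (30, 120, "thirties_plus")]


def age_band(age):
    """Return the label of the band containing `age`, or None if out of range."""
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return None


def extract_features(text):
    """Compute token count, average word length, and type-token ratio."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    if not tokens:
        return {"n_tokens": 0, "avg_word_len": 0.0, "type_token_ratio": 0.0}
    return {
        "n_tokens": len(tokens),
        "avg_word_len": mean(len(t) for t in tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


def mean_features_by_band(records):
    """Average each feature over all (age, text) records in the same band."""
    grouped = defaultdict(list)
    for age, text in records:
        label = age_band(age)
        if label is not None:
            grouped[label].append(extract_features(text))
    return {
        label: {key: mean(f[key] for f in feats) for key in feats[0]}
        for label, feats in grouped.items()
    }


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        (15, "omg lol that was so fun"),
        (34, "The quarterly meeting was rescheduled again."),
    ]
    for label, feats in mean_features_by_band(sample).items():
        print(label, feats)
```

Surface features like these are only a starting point; the notebooks and reports describe the features the project actually uses.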

Feedback and comments on the project are in the Guestbook.

Most Important Jupyter Notebooks

Note: If the Jupyter notebooks fail to load on GitHub, use the nbviewer links provided at the top of the corresponding files.

To replicate any work, run the following:

# create virtual environment
python3 -m venv /path/to/new/virtual/environment
# activate virtual environment
source /path/to/new/virtual/environment/bin/activate
# install requirements
pip3 install -r requirements.txt
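
After activating the environment, one quick way to confirm that Python is actually running inside it is the small check below. This is a sketch and not part of the repository; the function name `in_virtualenv` is made up for the example. It relies on `sys.prefix` differing from `sys.base_prefix` inside a `venv` environment.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtualenv():
        print("Virtual environment active:", sys.prefix)
    else:
        print("No virtual environment detected; packages would install globally.")
```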