UCI Adult Prediction with PyTorch

Hi there! Thank you for checking out my repository! This README.md file describes the neural networks used, while the README_2.md file covers simpler machine learning algorithms to compare their effectiveness with that of deep learning.

My chosen neural network (82.227% test accuracy), which is described below, performs about 3.2 percentage points above the 0.75 quantile (79.037% test accuracy) of all reported neural networks used for this dataset (see https://archive.ics.uci.edu/dataset/2/adult). I hope you enjoy reading through my project!

Overview

This repository contains code for a PyTorch neural network in UCI_Adult_PyTorch.py, along with various machine learning algorithms in UCI_Adult_Scikit-Learn.py, all applied to the UCI Adult dataset, which is stored in UCI_Adult_Data.csv. This README focuses on the former. For details on the simpler machine learning models, please see the README_2.md file.

Getting Started

Below are some instructions on how to get the project up and running.

Prerequisites

Main dependencies:

imbalanced_learn==0.11.0
imblearn==0.0
matplotlib==3.7.2
numpy==1.25.2
pandas==2.1.1
scikit_learn==1.3.0
seaborn==0.13.0
torch==2.0.1

Installation

# Clone the repository
git clone https://github.com/willbrasic/UCI_Adult_PyTorch_Scikit-Learn.git

# Navigate to the project directory
cd UCI_Adult_PyTorch_Scikit-Learn

# Install dependencies
pip install -r requirements.txt

Dataset

This project uses the UCI Adult dataset, which is included in the repository as UCI_Adult_Data.csv. Details can be found at https://archive.ics.uci.edu/dataset/2/adult. The data cleaning procedure is in UCI_Adult_PyTorch.py. Here are some graphs that summarize the outcome of interest (income) along with the covariates used for prediction:

Picture 1

Picture 2

Also, the data has a slight class imbalance, with class 1 (individuals making more than $50,000) being under-represented. I tested whether SMOTE could improve this. While recall did increase, overall validation accuracy decreased, so I did not use this method: I prioritize overall accuracy over a decrease in the number of false negatives (predicting income <= 50K when the true label is > 50K). A decrease in the number of false negatives means that sensitivity (recall) rises.

For the sake of completeness, here is the confusion matrix when model_1, which is discussed in the Training section, is used on the data when SMOTE is applied:

Picture 3
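
To make the recall versus accuracy trade-off described above concrete, here is a minimal sketch that computes both metrics from confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not taken from this repository's results.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp: int, fn: int) -> float:
    """Sensitivity: share of true class-1 cases that are predicted as class 1."""
    return tp / (tp + fn)


# Hypothetical validation counts (NOT the repository's actual numbers).
baseline = {"tp": 500, "fp": 300, "fn": 500, "tn": 2700}
resampled = {"tp": 750, "fp": 800, "fn": 250, "tn": 2200}

for name, cm in [("baseline", baseline), ("resampled", resampled)]:
    acc = accuracy(**cm)
    rec = recall(cm["tp"], cm["fn"])
    print(f"{name}: accuracy={acc:.4f}, recall={rec:.4f}")
```

In this made-up example, oversampling halves the false negatives, so recall rises, but the extra false positives outweigh that gain and overall accuracy falls. This is the same pattern reported for SMOTE above.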

Training

I create three neural networks: model_0, which contains only linear activations; model_1, a more complex network with multiple hidden layers, BatchNorm, dropout, and ELU activations; and model_2, which contains multiple hidden linear layers (no non-linear activations) along with BatchNorm and dropout. ELU (https://pytorch.org/docs/stable/generated/torch.nn.ELU.html) is a good alternative to ReLU that, with its default parameter value of 1, avoids the non-differentiability at zero. All networks use learning rate α = 0.01 and Nesterov momentum with parameter γ = 0.9 to improve optimization performance. Moreover, all neural networks implement early stopping.
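
The setup described above might be sketched as follows. This is an illustration under stated assumptions, not the code in UCI_Adult_PyTorch.py: the layer widths, dropout rate, feature count, patience value, the helper names `build_model` and `EarlyStopping`, and the choice of `BCEWithLogitsLoss` are all hypothetical. Only the building blocks (BatchNorm, dropout, ELU), the learning rate of 0.01, and Nesterov momentum of 0.9 come from the text.

```python
import torch
from torch import nn


def build_model(n_features: int, hidden: int = 64, p_drop: float = 0.2) -> nn.Sequential:
    """Hypothetical model_1-style network: Linear -> BatchNorm -> ELU -> Dropout blocks."""
    return nn.Sequential(
        nn.Linear(n_features, hidden),
        nn.BatchNorm1d(hidden),
        nn.ELU(),
        nn.Dropout(p_drop),
        nn.Linear(hidden, hidden),
        nn.BatchNorm1d(hidden),
        nn.ELU(),
        nn.Dropout(p_drop),
        nn.Linear(hidden, 1),  # a single logit for the binary income label
    )


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


torch.manual_seed(0)
model = build_model(n_features=14)  # assumed feature count, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
loss_fn = nn.BCEWithLogitsLoss()  # assumed loss for a binary target

# One optimization step on random stand-in data (not the Adult dataset).
X = torch.randn(32, 14)
y = torch.randint(0, 2, (32, 1)).float()
model.train()
optimizer.zero_grad()
loss = loss_fn(model(X), y)
loss.backward()
optimizer.step()

# Early stopping on a made-up sequence of validation losses.
stopper = EarlyStopping(patience=2)
stop_flags = [stopper.step(v) for v in [0.50, 0.45, 0.46, 0.47]]
```

With a patience of 2, the stopper fires on the second consecutive epoch without improvement. Removing the `nn.ELU()` layers from `build_model` would give a model_2-style variant, whose stacked linear layers have no non-linear activations between them.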

The training and validation loss and accuracy for model_0 over epochs, along with its confusion matrix, look as follows:

Picture 4

Picture 5

The mean training and validation accuracy rates for model_0 over thirty epochs are 81.6424% and 81.8092%, respectively.

The training and validation loss and accuracy for model_1 over epochs, along with its confusion matrix, look as follows:

Picture 6

Picture 7

The mean training and validation accuracy rates for model_1 over thirty epochs are 80.8788% and 82.5968%, respectively.

The training and validation loss and accuracy for model_2 over epochs, along with its confusion matrix, look as follows:

Picture 8

Picture 9

The mean training and validation accuracy rates for model_2 over thirty epochs are 80.6205% and 81.8572%, respectively.

As evidenced by the accuracy over epochs for each model, the models perform very similarly. However, model_1 has a slightly better mean validation accuracy than model_0 and model_2, by roughly 0.7 to 0.8 percentage points. Thus, model_1 is selected for testing.
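
The selection step can be reproduced from the mean validation accuracies reported in this section. The snippet below is a small sketch of that comparison only; the variable names are illustrative and do not come from the repository.

```python
# Mean validation accuracy (%) over thirty epochs, as reported above.
mean_val_acc = {
    "model_0": 81.8092,
    "model_1": 82.5968,
    "model_2": 81.8572,
}

# Pick the model with the highest mean validation accuracy.
best = max(mean_val_acc, key=mean_val_acc.get)

# Margin of the best model over each of the others, in percentage points.
margins = {
    name: round(mean_val_acc[best] - acc, 4)
    for name, acc in mean_val_acc.items()
    if name != best
}

print(best)
print(margins)
```

The margins come out to about 0.79 percentage points over model_0 and about 0.74 over model_2.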

Results

The chosen model_1 has a test accuracy of 82.227%. Here is its confusion matrix:

Picture 10
