Objective

Identify malicious websites from a given link. This project is a phishing checker.

Resources for datasets

Malicious URLs Data: https://sysnet.ucsd.edu/projects/url/

Current Training Data: https://archive.ics.uci.edu/dataset/327/phishing+websites

Workflow steps

Preprocessing
Features to be used
Improving Dataset
Removing Class Imbalance (current step)
Logistic Regression
Other Classification algorithms

Installation

Clone the repository.

git clone https://github.com/RafaeSyed/Phishing-analysis.git

Install the requirements. Note: requirements.txt lists many libraries, most of which are redundant. If you already have a working environment, adding pandas, numpy, seaborn, matplotlib, scipy, and scikit-learn should be enough. To install directly from requirements.txt:
```
pip install -r requirements.txt
```

Preprocessing

Preprocessing is being done in phishing_analysis.ipynb.

Features

The features are the same as those in the UCI dataset. The dataset documents the reasoning behind each feature, and the features appear sufficient to predict whether a website is malicious.

Information on features can be found here: https://archive.ics.uci.edu/dataset/327/phishing+websites

Improving dataset

Another dataset (https://sysnet.ucsd.edu/projects/url/) is to be added to the current UCI dataset. This dataset contains 2.4 million URLs. We will extract the URLs and write code that maps each URL to the individual features we use, producing a data point that can be appended to our existing data. (Not necessary, but fun.)
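As a rough illustration of mapping a URL to features, the sketch below derives three columns from a raw URL string. The function name `url_to_features` is hypothetical, and the thresholds and the -1/0/1 encoding (phishing/suspicious/legitimate) are assumptions based on the UCI feature description; check them against the feature documentation before relying on them.

```python
import ipaddress
from urllib.parse import urlparse


def url_to_features(url):
    """Map a raw URL to a few UCI-style features (hypothetical helper).

    Assumed encoding: 1 = legitimate, 0 = suspicious, -1 = phishing.
    """
    host = urlparse(url).hostname or ""

    # An IP address used in place of a domain name is treated as phishing.
    try:
        ipaddress.ip_address(host)
        having_ip = -1
    except ValueError:
        having_ip = 1

    # Assumed length thresholds: < 54 legitimate, 54 to 75 suspicious, > 75 phishing.
    length = len(url)
    if length < 54:
        url_length = 1
    elif length <= 75:
        url_length = 0
    else:
        url_length = -1

    # An "@" anywhere in the URL is treated as phishing.
    having_at = -1 if "@" in url else 1

    return {
        "having_IP_Address": having_ip,
        "URL_Length": url_length,
        "having_At_Symbol": having_at,
    }


ip_example = url_to_features("http://125.98.3.123/fake.html")
at_example = url_to_features("http://user@evil.example/login")
```

Each returned dictionary can become one row of a DataFrame. The remaining UCI features (for example those that depend on page content or WHOIS data) would need their own extraction code and are not covered here.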

Preprocessing

The UCI training data is cleaned so that it works with classification algorithms. Preprocessing is being done in phishing_analysis.ipynb.
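A minimal sketch of one possible cleaning step, not necessarily what the notebook does: load an ARFF file with scipy, decode the nominal values (which arrive as bytes), and split features from the label. The helper name `load_arff_as_frame` and the label column name `Result` are assumptions; a small inline sample stands in for the real training file.

```python
from io import StringIO

import pandas as pd
from scipy.io import arff


def load_arff_as_frame(source, label_column="Result"):
    """Load ARFF data and return (features, labels) as integer pandas objects.

    `source` may be a file path or a file-like object.
    `label_column` is an assumed name; adjust it to match the dataset.
    """
    data, _meta = arff.loadarff(source)
    frame = pd.DataFrame(data)

    # Nominal attributes are read as bytes such as b"-1"; decode and cast to int.
    for column in frame.columns:
        if frame[column].dtype == object:
            frame[column] = frame[column].str.decode("utf-8").astype(int)

    features = frame.drop(columns=[label_column])
    labels = frame[label_column]
    return features, labels


# Tiny inline sample so the sketch runs without the real dataset file.
SAMPLE = """@relation phishing_sample
@attribute having_IP_Address {-1,1}
@attribute URL_Length {1,0,-1}
@attribute Result {-1,1}
@data
-1,1,-1
1,0,1
1,-1,1
"""

X, y = load_arff_as_frame(StringIO(SAMPLE))
```

To use it on the repository's data, a path such as `"Training Dataset.arff"` could be passed instead of the `StringIO` sample.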

Removing Class Imbalance

There are few positive instances of malicious links, so the data is skewed. Different techniques will have to be tried to remove the imbalance; random sampling is the most likely choice.
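One way random sampling could look is random oversampling: minority-class rows are resampled with replacement until every class matches the largest one. This is a sketch under assumptions, not the project's chosen method; the helper name `oversample_minority`, the column name `Result`, and the toy data are all illustrative.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(frame, label_column, random_state=0):
    """Randomly oversample smaller classes up to the size of the largest class."""
    counts = frame[label_column].value_counts()
    majority_size = int(counts.max())

    parts = []
    for label, count in counts.items():
        group = frame[frame[label_column] == label]
        if count < majority_size:
            # Draw rows with replacement until this class matches the majority.
            group = resample(
                group,
                replace=True,
                n_samples=majority_size,
                random_state=random_state,
            )
        parts.append(group)

    # Shuffle so the classes are not grouped together in the result.
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data: 8 rows of one class and 2 of the other (assumed -1/1 labels).
toy = pd.DataFrame({"feature": range(10), "Result": [1] * 8 + [-1] * 2})
balanced = oversample_minority(toy, "Result")
```

Oversampling is usually applied to the training split only, so that duplicated rows do not leak into the evaluation data.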

Logistic Regression

Logistic regression is tried first to see how well a model performs with the given features.
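A minimal sketch of such a baseline with scikit-learn follows. The synthetic -1/0/1 features and the labeling rule are made up purely so the example runs on its own; they are not the UCI data, and the scores say nothing about real performance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: five features with values in {-1, 0, 1}.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(400, 5))
# Made-up rule for the label; -1 is assumed to mean "phishing".
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Hold out a stratified test split so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
# Recall on the assumed phishing class, which matters more than raw accuracy
# when the classes are imbalanced.
phishing_recall = recall_score(y_test, predictions, pos_label=-1)
print(f"accuracy={accuracy:.3f} phishing_recall={phishing_recall:.3f}")
```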

Other Classification Algorithms

Other algorithms will be checked next, because logistic regression is not very powerful; the goal is to find a more robust algorithm (for example, XGBoost).
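A comparison between candidate models could be sketched with cross-validation as below. Note the substitution: scikit-learn's `GradientBoostingClassifier` is used here as a stand-in for XGBoost so the example needs no extra dependency. The synthetic data uses a deliberately non-linear labeling rule, which is an assumption chosen to show where a linear model can fall short.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with a non-linear (interaction) labeling rule.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(400, 5))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    # Stand-in for XGBoost; both are gradient-boosted tree ensembles.
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Balanced accuracy averages per-class recall, so a skewed class ratio
# does not inflate the score.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy").mean()
    for name, model in candidates.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On this toy data the tree ensemble should score higher because the label depends on an interaction between two features; results on the real dataset may differ.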

The code will eventually be split into one .py file per algorithm instead of a single .ipynb notebook. Until then, install the Jupyter Notebook extension in VS Code to view the code. Do not commit any changes to the main branch; fork the repository if you want to work on it.
