A phishing checker that identifies whether a given link points to a malicious website.
Malicious URLs Data: https://sysnet.ucsd.edu/projects/url/
Current Training Data: https://archive.ics.uci.edu/dataset/327/phishing+websites
- Preprocessing
- Features to be used
- Improving Dataset
- Removing Class Imbalance (current step)
- Logistic Regression
- Other Classification algorithms
- Clone the repository.
git clone https://github.com/RafaeSyed/Phishing-analysis.git
- Install requirements.
Note: requirements.txt lists many libraries, but most are redundant. As long as you have a working environment with pandas, numpy, seaborn, matplotlib, scipy, and scikit-learn installed, you should be good.
To install directly from requirements:
pip install -r requirements.txt
The analysis is being done in phishing_analysis.ipynb.
The features are the same as those in the UCI dataset. The dataset documentation gives clear reasoning for each feature, and together they look sufficient to predict whether a website is malicious.
Information on features can be found here: https://archive.ics.uci.edu/dataset/327/phishing+websites
A second dataset (https://sysnet.ucsd.edu/projects/url/), which contains 2.4 million URLs, is to be added to the current UCI dataset. We will extract the URLs and write code that maps each URL to the individual features we use, producing a data point that can be appended to the existing data. (Not necessary, but fun.)
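The URL-to-feature mapping described above could start along these lines. This is a minimal sketch, not the code in the repository: the function name `extract_features`, the four features chosen, and the length thresholds (54 and 75 characters) are assumptions for illustration, as is the encoding of -1 for phishing-like, 0 for suspicious, and 1 for legitimate-looking values. Check each rule against the UCI feature documentation before relying on it.

```python
import ipaddress
from urllib.parse import urlparse


def _host_is_ip(host):
    """Return True when the host part of the URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def extract_features(url):
    """Map one URL to a few UCI-style features (hypothetical helper).

    Assumed encoding: -1 = phishing-like, 0 = suspicious, 1 = legitimate-looking.
    """
    host = urlparse(url).hostname or ""

    # Assumed thresholds: short URLs look legitimate, very long ones look phishing-like.
    length = len(url)
    if length < 54:
        url_length = 1
    elif length <= 75:
        url_length = 0
    else:
        url_length = -1

    return {
        "having_ip_address": -1 if _host_is_ip(host) else 1,
        "url_length": url_length,
        "having_at_symbol": -1 if "@" in url else 1,
        "prefix_suffix": -1 if "-" in host else 1,  # hyphen in the domain name
    }


print(extract_features("http://192.168.0.1/login"))
```

Only features that can be computed from the URL string alone are covered here. Features that depend on page content or external lookups would need separate code.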
The UCI training data is cleaned so that it works with classification algorithms. Preprocessing is being done in phishing_analysis.ipynb.
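As a rough illustration of what this cleaning step might involve, the sketch below drops incomplete and duplicated rows, casts the features to integers, and splits off the label. It is not taken from the notebook. The helper name `clean_frame`, the column names in the toy frame, the label column `Result`, and the assumption that -1 marks a phishing row are all illustrative and should be checked against the actual data.

```python
import pandas as pd


def clean_frame(df, label_col="Result"):
    """Return (X, y) ready for scikit-learn estimators (hypothetical helper)."""
    # Drop rows with missing values and exact duplicates (an assumed cleaning policy).
    cleaned = df.dropna().drop_duplicates().reset_index(drop=True)
    cleaned = cleaned.astype(int)

    # Assumption: -1 marks phishing in the raw labels; remap to 1 = phishing, 0 = legitimate.
    y = (cleaned[label_col] == -1).astype(int)
    X = cleaned.drop(columns=[label_col])
    return X, y


# Toy frame standing in for the real data; column names are placeholders.
raw = pd.DataFrame(
    {
        "URL_Length": [1, -1, -1, None],
        "having_At_Symbol": [1, -1, -1, 1],
        "Result": [1, -1, -1, 1],
    }
)
X, y = clean_frame(raw)
print(X.shape, y.tolist())
```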
There are few positive (malicious) instances, so the data is skewed. Different techniques will have to be tried to remove the imbalance; random sampling is the most likely choice.
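One form of random sampling is random oversampling, where minority-class rows are duplicated at random until every class is as large as the biggest one. The sketch below shows that idea with `sklearn.utils.resample`. The helper name `oversample_minority`, the label column `Result`, and the toy data are assumptions; random undersampling of the majority class would be an equally valid reading of "random sampling".

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(df, label_col, random_state=0):
    """Randomly duplicate rows of smaller classes until all classes match the largest."""
    counts = df[label_col].value_counts()
    target = counts.max()

    parts = []
    for label, n in counts.items():
        group = df[df[label_col] == label]
        if n < target:
            # Sample with replacement to create the extra rows for this class.
            extra = resample(
                group, replace=True, n_samples=target - n, random_state=random_state
            )
            group = pd.concat([group, extra])
        parts.append(group)

    # Shuffle so the classes are not grouped together.
    balanced = pd.concat(parts)
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data: 8 rows of one class, 2 of the other.
toy = pd.DataFrame({"feature": range(10), "Result": [1] * 8 + [-1] * 2})
balanced = oversample_minority(toy, "Result")
print(balanced["Result"].value_counts())
```

Resampling should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.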
Logistic regression is tried first to see how well a model performs with the given features.
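A baseline along these lines could look like the following sketch. It uses synthetic data from `make_classification` as a stand-in for the real feature matrix, so the printed score says nothing about the phishing dataset. The helper name `fit_logreg_baseline`, the 75/25 split, and the choice of F1 as the metric are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def fit_logreg_baseline(X, y, random_state=0):
    """Fit logistic regression on a stratified split and return (model, held-out F1)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, f1_score(y_test, y_pred)


# Synthetic, imbalanced stand-in data with 30 features (not the real dataset).
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=10, weights=[0.8, 0.2], random_state=0
)
model, f1 = fit_logreg_baseline(X, y)
print(f"F1 on held-out data: {f1:.3f}")
```

F1 is used here instead of accuracy because accuracy can look good on skewed data even when the minority class is mostly missed.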
Other algorithms will be checked next, because logistic regression is not very powerful; the goal is to find a more robust algorithm (for example, XGBoost).
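One way to run such a comparison is cross-validation over a small set of candidate models, as sketched below. To keep the example dependent on scikit-learn only, `GradientBoostingClassifier` is swapped in as a stand-in for XGBoost; it is a related boosted-tree method, not the same library. The helper name `compare_models`, the synthetic data, and the F1 metric are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_models(X, y, cv=5):
    """Return the mean cross-validated F1 score for each candidate model."""
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        # Stand-in for XGBoost so the sketch needs only scikit-learn.
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    }
    return {
        name: cross_val_score(estimator, X, y, cv=cv, scoring="f1").mean()
        for name, estimator in candidates.items()
    }


# Synthetic stand-in data (not the real dataset).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=10, weights=[0.8, 0.2], random_state=0
)
scores = compare_models(X, y)
for name, score in scores.items():
    print(f"{name}: mean F1 = {score:.3f}")
```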