GitHub - mlorinc/malicious-url-classifier: Experiments with RNN and CNN in malicious URL detection tasks. Includes automation tooling to make training and evaluating more seamless.

Malicious URL detection using deep learning methods

A simple automation framework for training and evaluating Natural Language Processing neural networks, demonstrated on malicious URL classification.


This project was created as part of my Secure Hardware Devices university course, in which I had to design neural networks based on LSTM, GRU, or Bi-LSTM. As I ran more experiments, the cohesion of the code base degraded with each iteration until the code became hard to maintain. This automation framework was therefore created as a side product of the course project.

The training datasets and the trained URL classification models are not part of the repository because of their size. The model definitions, however, can be found in the models module. The tool cannot currently be configured to use a custom module, so the best way to work with it is to clone the repository and add models there.

Only strings are used for classification; before they are fed into a model, they are converted into vector embeddings. The embedding dimension is fixed at 128, but the other parameters can be set in JSON configuration files.
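The vectorization step that precedes the embedding can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation (which presumably relies on Keras); the function names `build_vocabulary` and `vectorize`, the reserved indices, and the sample URLs are assumptions made for illustration. Each URL is mapped to a fixed-length sequence of integer ids, which an embedding layer would then turn into 128-dimensional vectors.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids (assumption): padding and out-of-vocabulary


def build_vocabulary(urls, max_features):
    """Map the most frequent characters to integer ids starting at 2."""
    counts = Counter(ch for url in urls for ch in url)
    # Two slots are reserved for PAD and UNK, the rest go to frequent characters.
    most_common = counts.most_common(max(max_features - 2, 0))
    return {ch: idx for idx, (ch, _) in enumerate(most_common, start=2)}


def vectorize(url, vocabulary, max_length):
    """Convert a URL into exactly max_length ids (truncated or zero-padded)."""
    ids = [vocabulary.get(ch, UNK) for ch in url[:max_length]]
    return ids + [PAD] * (max_length - len(ids))


# Hypothetical sample data; the real datasets are not in the repository.
urls = ["http://example.com/login", "http://example.org/a?b=1"]
vocab = build_vocabulary(urls, max_features=64)
vector = vectorize("http://example.net", vocab, max_length=32)
print(len(vector), vector[:8])
```

Here `max_features` and `max_length` play the same role as the configuration parameters of the same name: the first caps the vocabulary size, the second fixes the sequence length seen by the model.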

The following parameters can be configured:

Parameter	Description
batch_size	training batch size; a trade-off between training performance and accuracy
max_features	maximum number of characters/words used by the vectorizer
split	how strings are split into tokens: either character or whitespace
standardize	name of the standardizer
max_length	maximum length of the string
model	name of a model definition that resides in the models module
stringify	make the model accept strings instead of vectors
output_path	path where the final model will be saved
epoch	number of training epochs
patience	number of epochs after which training is stopped due to worsening validation loss
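As an illustration of how these parameters might fit together, the sketch below builds one configuration, serializes it to JSON, and runs a few sanity checks. Only the parameter names come from the table above. Every value, the flat one-object layout, and the `check_config` helper are assumptions; the actual files in configs may be structured differently.

```python
import json

# Parameter names follow the table above; all values are illustrative guesses.
example_config = {
    "batch_size": 256,
    "max_features": 128,
    "split": "character",
    "standardize": "lower",              # placeholder standardizer name
    "max_length": 200,
    "model": "some_model",               # placeholder model definition name
    "stringify": True,
    "output_path": "./models/some_model",
    "epoch": 20,
    "patience": 3,
}

EXPECTED_KEYS = set(example_config)
POSITIVE_INT_KEYS = ("batch_size", "max_features", "max_length", "epoch", "patience")


def check_config(config):
    """Return a list of human-readable problems found in a configuration."""
    missing = sorted(EXPECTED_KEYS - config.keys())
    problems = [f"missing parameter: {key}" for key in missing]
    if config.get("split") not in ("character", "whitespace"):
        problems.append("split must be 'character' or 'whitespace'")
    for key in POSITIVE_INT_KEYS:
        if key not in config:
            continue  # already reported as missing
        value = config[key]
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer")
    return problems


text = json.dumps(example_config, indent=2)
print(text)
print(check_config(json.loads(text)))
```

Validating a configuration before training starts is cheap and avoids discovering a typo only after the dataset has been loaded.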

Finally, only binary classifiers are currently supported, and the column names are hardcoded: training values are read from url and labels from type. It should not be complicated to modify the tool for multi-class classification, though.
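A minimal sketch of what the hardcoded columns imply for the input data is given below. The column names url and type come from the text above; the label values ("benign", "phishing"), the sample rows, and the `load_binary_samples` helper are hypothetical, since the real datasets are not part of the repository.

```python
import csv
import io

# Hypothetical sample in the expected layout: a url column and a type column.
SAMPLE_CSV = """url,type
http://example.com/,benign
http://example.test/login.php,phishing
http://example.org/docs,benign
"""


def load_binary_samples(handle, benign_label="benign"):
    """Read (url, label) pairs, encoding benign as 0 and anything else as 1."""
    reader = csv.DictReader(handle)
    return [(row["url"], 0 if row["type"] == benign_label else 1) for row in reader]


samples = load_binary_samples(io.StringIO(SAMPLE_CSV))
for url, label in samples:
    print(label, url)
```

A multi-class variant would map each distinct value of type to its own integer index instead of collapsing all non-benign values into 1.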

Built With

To run the project, make sure you have the following libraries installed:

Keras
TensorFlow
Pandas
Seaborn

License

Distributed under the Apache-2.0 License. See LICENSE.txt for more information.

Usage

The program features four commands, which are invoked through the src/main.py file. A typical workflow runs them in the following order:

Merge datasets: src/main.py dataset:fuse data/fused_dataset.csv <datasets...>
Split dataset into ./dataset folder: src/main.py dataset:split ./dataset data/fused_dataset.csv 0.6 0.2 0.2
Train NN: src/main.py model:train ./dataset ./configs/some.json
Evaluate NN and save plot to mygallery: src/main.py model:evaluate ./mygallery dataset <models...>
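The three numbers passed to dataset:split can be read as split ratios. The sketch below shows one plausible interpretation, not the tool's actual code: it assumes the order is train, validation, test, and that rows are shuffled before being cut. The `split_dataset` function and its `seed` parameter are hypothetical.

```python
import random


def split_dataset(rows, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into consecutive parts sized by ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducible splits

    parts, start, cumulative = [], 0, 0.0
    for ratio in ratios[:-1]:
        cumulative += ratio
        end = round(cumulative * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last part takes the remainder
    return parts


# Assumed order of the parts: train, validation, test.
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Computing the cut points from cumulative ratios, and giving the remainder to the last part, guarantees that every row ends up in exactly one part even when the dataset size is not divisible by the ratios.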

Malicious URL classification result reproduction

To reproduce the data located in gallery, perform the following steps:

Train CNN-RNN hybrids: src/main.py model:train ./dataset ./configs/model.configs.json
Train RNN: src/main.py model:train ./dataset ./configs/recurrent.model.configs.json
Train CNN-RNN-Dense hybrids: src/main.py model:train ./dataset ./configs/combined.model.configs.json
Perform evaluation: src/main.py model:evaluate ./figures dataset ./models/* ./memory_models/* ./combined_models/*

