A simple automation framework for training and evaluating Natural Language Processing neural networks.
Demonstrated on malicious URL classification.
License · Usage · Reproduction steps · Report Bug
This project grew out of my Secure Hardware Devices university course, in which I had to design neural networks based on LSTM, GRU, or Bi-LSTM. As I ran more experiments, the cohesion of the code base got worse with each iteration, until it became hard to maintain. This framework was therefore created as a side product of the course project.
The full training datasets and the trained URL classification models are not part of the repository because of their size. The model definitions, however, can be found in the models module.
Unfortunately, the module cannot be swapped for a custom one through configuration, so it is best to clone the repository and work with the tool directly.
Furthermore, only strings are used as classification input, and they are converted into vector embeddings before being fed into the models. The embedding dimension is fixed at 128, but the other parameters can be set in JSON configurations.
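
To illustrate what happens before the embedding step, here is a minimal, dependency-free sketch of character-level vectorization. It is not the project's code: the function names, the reserved ids 0 (padding) and 1 (unknown), and the sample URLs are assumptions made for this example. The actual tool presumably relies on its deep learning library for this step, and the embedding layer would then map each integer id to a 128-dimensional vector.

```python
from collections import Counter


def build_vocabulary(texts, max_features):
    """Map the most frequent characters to integer ids.

    Ids 0 and 1 are reserved (an assumption of this sketch) for padding
    and unknown characters, so at most max_features - 2 characters
    receive their own id.
    """
    counts = Counter(ch for text in texts for ch in text)
    most_common = [ch for ch, _ in counts.most_common(max_features - 2)]
    return {ch: idx + 2 for idx, ch in enumerate(most_common)}


def vectorize(text, vocabulary, max_length):
    """Convert a string into a list of exactly max_length integer ids."""
    # Truncate long strings, map unknown characters to id 1.
    ids = [vocabulary.get(ch, 1) for ch in text[:max_length]]
    # Pad short strings with id 0.
    return ids + [0] * (max_length - len(ids))


# Hypothetical sample data, standing in for the url column.
urls = ["http://example.com/login", "https://example.org/"]
vocab = build_vocabulary(urls, max_features=64)
vector = vectorize(urls[1], vocab, max_length=32)
```

Here `max_features` caps the vocabulary size and `max_length` fixes the length of every vector, which mirrors the roles of the configuration parameters with the same names.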
The following parameters can be configured:
Parameter | Description |
---|---|
batch_size | training batch size, a tradeoff between training speed and accuracy |
max_features | maximum number of characters/words to be used by vectorizer |
split | either character or whitespace |
standardize | name of the standardizer |
max_length | maximum length of the string |
model | model definition which resides in models module |
stringify | make the model accept strings instead of vectors |
output_path | the path where final model will be saved |
epoch | number of training epochs |
patience | number of epochs until the training is stopped due to worsening validation loss |
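
As a rough sketch of how these parameters might fit together, the snippet below builds one configuration as a Python dictionary, checks it, and serializes it to JSON. Every value is an illustrative placeholder rather than a project default, the `check_config` helper is hypothetical, and the real file layout may differ (for example, a single file may hold several such entries).

```python
import json

# Parameter names come from the table above; the values are placeholders.
config = {
    "batch_size": 256,
    "max_features": 128,
    "split": "character",          # either "character" or "whitespace"
    "standardize": "lower",        # assumed standardizer name
    "max_length": 200,
    "model": "example_model",      # hypothetical name from the models module
    "stringify": True,
    "output_path": "./models/example_model",
    "epoch": 10,
    "patience": 3,
}

REQUIRED_KEYS = {
    "batch_size", "max_features", "split", "standardize", "max_length",
    "model", "stringify", "output_path", "epoch", "patience",
}


def check_config(candidate):
    """Hypothetical helper: reject configs with missing keys or a bad split."""
    missing = REQUIRED_KEYS - candidate.keys()
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    if candidate["split"] not in ("character", "whitespace"):
        raise ValueError("split must be 'character' or 'whitespace'")
    return candidate


config_json = json.dumps(check_config(config), indent=2)
```

Writing `config_json` to a file under ./configs would give a starting point to adapt to the schema the tool actually expects.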
Finally, only binary classifiers are currently supported, and the input column is hardcoded to url and the label column to type. That said, modifying the code for multi-class classification is not complicated.
To run the project, make sure you have the following libraries installed:
- Keras
- TensorFlow
- Pandas
- Seaborn
Distributed under the Apache-2.0 License. See LICENSE.txt for more information.
The program features four commands, which are invoked through the src/main.py file.
The typical workflow is as follows:
- Merge datasets:
src/main.py dataset:fuse data/fused_dataset.csv <datasets...>
- Split dataset into ./dataset folder:
src/main.py dataset:split ./dataset data/fused_dataset.csv 0.6 0.2 0.2
- Train NN:
src/main.py model:train ./dataset ./configs/some.json
- Evaluate NN and save plot to mygallery:
src/main.py model:evaluate ./mygallery dataset <models...>
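
The three numbers passed to dataset:split can be read as split ratios. The sketch below shows one plausible way such a split could work; it is not the project's implementation. It assumes the ratios are given in the order train, validation, test, that rows are shuffled before splitting, and the function name and seed are hypothetical.

```python
import random


def split_dataset(rows, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle rows and cut them into train, validation, and test parts."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    train_end = round(len(shuffled) * ratios[0])
    val_end = train_end + round(len(shuffled) * ratios[1])
    # The test part takes whatever remains, so no row is lost to rounding.
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


rows = list(range(10))
train, validation, test = split_dataset(rows)
```

With ten rows and the ratios 0.6, 0.2, and 0.2, the three parts hold six, two, and two rows respectively.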
To reproduce the data located in the gallery, follow these steps:
- Train CNN-RNN hybrids:
src/main.py model:train ./dataset ./configs/model.configs.json
- Train RNN:
src/main.py model:train ./dataset ./configs/recurrent.model.configs.json
- Train CNN-RNN-Dense hybrids:
src/main.py model:train ./dataset ./configs/combined.model.configs.json
- Perform evaluation:
src/main.py model:evaluate ./figures dataset ./models/* ./memory_models/* ./combined_models/*