This dataset was curated by simulating a wide variety of intrusions in a military network environment. The simulation aimed to replicate a typical US Air Force LAN, providing an authentic environment to acquire raw TCP/IP dump data. The LAN was designed to mirror real-world conditions and was subjected to numerous attack scenarios.
- Environment: The dataset was created by simulating a typical US Air Force LAN.
- Data Type: Raw TCP/IP dump data.
- Connection: A connection in this dataset refers to a sequence of TCP packets that start and end over a certain time duration. During this duration, data flows from a source IP address to a target IP address under a specific protocol.
- Labels: Each connection is labeled as either:
- Normal: Indicating typical, non-malicious traffic.
- Anomalous: Indicating traffic that is indicative of an attack. Each anomalous connection is further labeled with a specific attack type.
The dataset comprises 41 features for each TCP/IP connection. These features are a mix of quantitative and qualitative data:
- Quantitative Features: 38
- Qualitative Features: 3
The features provide insights into the nature of the traffic, helping in distinguishing between normal and anomalous connections.
The class variable in the dataset categorizes the connections into two main categories:
- Normal: Connections that represent regular, non-malicious traffic.
- Anomalous: Connections that represent potential threats or attacks.
We trained several machine learning models on this dataset and evaluated their performance using two key metrics: Train Score and Test Score. Here are the results:
Model | Train Score | Test Score |
---|---|---|
K-Nearest Neighbors (KNN) | 0.977 | 0.977 |
Logistic Regression | 0.916 | 0.908 |
Decision Tree | 1.000 | 0.995 |
Random Forest | 0.999 | 0.996 |
Gradient Boosting Machine (GBM) | 0.996 | 0.993 |
XGBoost | 1.000 | 0.996 |
Adaboost | 0.986 | 0.986 |
Light GBM | 1.000 | 0.995 |
CatBoost | 0.999 | 0.995 |
Naive Bayes Model | 0.896 | 0.897 |
Voting Ensemble | 0.999 | 0.996 |
Support Vector Machine (SVM) | 0.964 | 0.963 |
- The models have been evaluated using Train and Test Scores, with Test Scores ranging from 0.895 to 0.996.
- Several models, such as Decision Tree, Random Forest, XGBoost, Light GBM, and the Voting Ensemble, have achieved high Test Scores, indicating strong performance in distinguishing between normal and anomalous network connections.
- The dataset's class distribution and the diversity of features have likely contributed to the model's performance.
- Further analysis and domain-specific considerations may be necessary for selecting the most suitable model for real-world network intrusion detection.
You can use the code and data in this repository to:
- Train and evaluate machine learning models on similar network intrusion detection datasets.
- Modify and fine-tune existing models or develop new models for improved performance.
- Gain insights into the effectiveness of different machine learning algorithms for network security tasks.
While the initial model evaluation has yielded promising results, there are several additional approaches and strategies you can explore to enhance the performance of network intrusion detection models:
Fine-tuning model hyperparameters can significantly improve model performance. Consider using techniques like grid search, random search, or Bayesian optimization to find optimal hyperparameter settings for your chosen models. This can lead to better generalization and increased accuracy.
Feature engineering plays a crucial role in network intrusion detection. Experiment with different feature selection methods, dimensionality reduction techniques, or the creation of new features to capture more relevant information from network connections.
Ensemble methods, such as stacking or blending, can further boost model performance. Combining the predictions of multiple well-tuned models can often yield better results than individual models. Explore different ensemble strategies to find the most effective combination.
Consider incorporating specialized anomaly detection algorithms into your model pipeline. These algorithms are designed to detect unusual patterns and can complement traditional classification models. Techniques like Isolation Forests, One-Class SVM, or autoencoders may be worth exploring.
Ensure robust model evaluation by using cross-validation techniques. This helps assess the model's generalization performance and reduces the risk of overfitting. Stratified K-Fold cross-validation is a common choice for imbalanced datasets.
Since network intrusion datasets often suffer from class imbalance, employ techniques to handle this issue. Options include oversampling the minority class, undersampling the majority class, or using synthetic data generation methods like SMOTE (Synthetic Minority Over-sampling Technique).
Consider experimenting with deep learning architectures, such as convolutional neural networks (CNNs) or recurrent neural networks (RNNs), especially if the dataset is large and complex. Deep learning models can automatically learn hierarchical features from raw data.
Enhance model interpretability to gain insights into the features and decisions made by the models. Techniques like SHAP (SHapley Additive exPlanations) values or LIME (Local Interpretable Model-agnostic Explanations) can help explain model predictions.
Feel free to explore these approaches and adapt them to your specific use case. Remember to keep track of your experiments, document your findings, and iterate on model improvements to achieve even better results in network intrusion detection.
Contributions to this project are highly encouraged. If you have insights, enhancements, or new approaches to share, please open an issue or submit a pull request. Collaboration can lead to more robust and effective network intrusion detection models.
Feel free to contribute to this project by providing enhancements, additional models, or domain-specific expertise.