Class imbalance is a common problem in security datasets, and several strategies, such as class resampling and cost-sensitive training, have been proposed to address it. In this paper, we propose a data augmentation strategy to oversample the minority classes in a dataset. Using our Sort-Augment-Combine (SAC) technique, we split the dataset into subsets by class label and generate synthetic data from each subset. The synthetic data are then used to oversample the minority classes. Once oversampling is complete, the individual classes are combined to form an augmented training dataset for model fitting. Measured by accuracy, recall (sensitivity) and true negative rate (specificity), models trained on the augmented datasets outperform models trained on the original dataset. Similarly, on a binary-class dataset, SAC performed well, and the combination of SAC and ROSE improved overall accuracy, sensitivity and specificity compared with the Random Forest model trained on the original, ROSE-augmented and SMOTE-augmented datasets.
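The three SAC steps described above can be illustrated with a minimal sketch. The abstract does not specify the synthetic data generator, so this example swaps in a simple stand-in (resampling rows with replacement and adding small Gaussian noise); the function name `sac_oversample` and the parameters `noise_scale` and `seed` are hypothetical and not taken from the paper.

```python
import numpy as np


def sac_oversample(X, y, noise_scale=0.05, seed=0):
    """Sketch of Sort-Augment-Combine: balance classes with synthetic rows.

    Assumption: the generator below (jittered resampling) is a stand-in,
    not the method used in the paper.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()  # bring every class up to the majority size

    parts_X, parts_y = [], []
    for label, count in zip(labels, counts):
        # Sort: isolate the rows belonging to one class label.
        subset = X[y == label]

        n_new = target - count
        if n_new > 0:
            # Augment: draw existing rows with replacement, then perturb
            # each feature by noise proportional to its within-class spread.
            idx = rng.integers(0, count, size=n_new)
            spread = subset.std(axis=0)
            noise = rng.normal(0.0, 1.0, size=(n_new, X.shape[1]))
            synthetic = subset[idx] + noise * spread * noise_scale
            subset = np.vstack([subset, synthetic])

        parts_X.append(subset)
        parts_y.append(np.full(len(subset), label, dtype=y.dtype))

    # Combine: merge the per-class subsets and shuffle the result.
    X_aug = np.vstack(parts_X)
    y_aug = np.concatenate(parts_y)
    order = rng.permutation(len(y_aug))
    return X_aug[order], y_aug[order]
```

In a sketch like this, the function would be applied to the training split only, so that the test data contain no synthetic rows and the reported sensitivity and specificity reflect real samples.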
OTOKWALA, U., PETROVSKI, A. and KALUTARAGE, H. 2021. Improving intrusion detection through training data augmentation. In Moradpoor, N., Elçi, A. and Petrovski, A. (eds.) Proceedings of 14th International conference on Security of information and networks 2021 (SIN 2021), 15-17 December 2021, [virtual conference]. Piscataway: IEEE [online], article 17. Available from: https://doi.org/10.1109/SIN54109.2021.9699293