Effective detection of cyber attack in a cyber-physical power grid system.

Advances in technology and the adoption of smart devices in the operation of power grid systems have made it imperative to protect the cyber-physical power grid system adequately against cyber-attacks. This is because the contemporary cyber-attack landscape has made a device's first line of defense (i.e. authentication and authorization) hardly enough to withstand attacks. To detect these attacks, this paper proposes a detection methodology based on machine learning techniques. The dataset used in this experiment was obtained from the synchrophasor measurements and data logs of Snort, simulated control panels and relays of a smart power grid transmission system. The dataset was preprocessed, scaled and analyzed before four classifiers were fitted: Random Forest, Support Vector Machine, Linear Discriminant Analysis and K-Nearest Neighbor. The different classifiers were fitted in order to find the algorithm with the best output. Upon completion of the experiment, the results of the classifiers were tabulated; the Random Forest model was the most effective, with an accuracy of 92% and a markedly low misclassification rate. The Random Forest model also shows a high true positive rate, which is critical for security.


Introduction
The Purdue model for Industrial Control Systems (ICS) has bridged the gap between Information Technology (IT) and Operational Technology (OT) through the deployment of Wireless Sensor Networks (WSN) and robots. As a result, the cyber-physical power grid system, also known as the smart grid system, has advanced tremendously as Intelligent Electronic Devices (IEDs) and other internet-enabled devices have been incorporated into its structure for effective monitoring and added value in its operations [1]. In fact, Cedric et al. [2] proposed that "next generation of electric power grid system and other critical infrastructures will rely mainly on advanced technologies such as: industrial automation control systems, error diagnostics, preventive maintenance, automatic safety switching, advance metering infrastructure, and synchrophasor systems". These advancements, however, have exposed the system to a new landscape of cyber-attacks that are clearly intended to undermine the smart grid system, cause system misuse and prevent it from playing its critical role in society.
Cyber-attacks on the smart grid system occur when an unauthorized user exploits the flaws and vulnerabilities of internet-enabled devices to gain access to them. Some of these vulnerabilities include weak passwords, unpatched firmware, weak encryption and insecure web links [3]. According to Alasdair Gilchrist [3], hackers have in recent times resorted to looking for older firmware, especially versions with known vulnerabilities, to perform their attacks. For example, power grid infrastructure systems, which were once isolated and ran only on proprietary software, now run on Commercial-Off-The-Shelf (COTS) components. According to reports [4] [5], several cyber-attacks have targeted these systems because the COTS components are not resilient enough and because the built-in safeguards against cyber-attacks are not properly hardened, maintained or updated [6]. It is also noteworthy that, until recently, most cyber-attacks were restricted to the IT infrastructure of critical organizations; however, with the convergence of OT and IT infrastructure, there has been a significant shift in cyber-attacks towards OT infrastructure [7]. These breaches often result in the reset of phasor parameters, system shutdown and disruption of the power grid system [6]. Usually, the Operating System (OS) provides the abstraction and support mechanisms that protect hardware and applications from misuse [8]; however, cyber-attacks and threats, especially from non-state actors, have become more sophisticated in recent times. This makes the effective detection and prevention of cyber-attacks on the smart power grid system very important [9] [10].

1.1 Structure of a smart grid system
A typical structure of a cyber-physical power grid system, with its components, is shown in Fig. 1.

Fig. 1. Structure of a Cyber-physical Power Grid System [11]

A typical power grid system has power generators on both ends to supply electricity to the grid. The devices labeled R1, R2, R3 and R4 are Intelligent Electronic Devices (IEDs), each connected to one of the circuit breakers BRK1 through BRK4. The role of an IED is to monitor events on the grid and to switch the circuit breakers on or off. According to the authors of [11], two events can cause the circuit breakers to trip: (a) an alert within a line (L1 or L2) that causes the IEDs to trip the breakers, and (b) an operator manually issuing a command to an IED to break the circuit. In both instances, the IEDs use a distance protection algorithm that trips the circuit breakers irrespective of the cause of the command, i.e. whether it is valid or invalid. The following scenarios, derived from the two events above, can result in line tripping:
(a) Short-circuit fault: a short circuit between lines (two or more lines touching each other). This often results in very high current that could lead to massive damage.
(b) Line maintenance: the line is intentionally disabled to allow for maintenance.
(c) Remote tripping command: a possible attack in which an attacker breaches the device's defense and sends a command to a relay, thereby causing a breaker to open.
(d) Relay setting change: another form of attack in which the attacker, upon penetrating the device's defense, reconfigures the relay's settings and disables the relay function so that the relay will not trip even for a valid fault or a valid command.
(e) Data injection: another form of attack in which the attacker, upon entry, fakes a seemingly valid fault by changing the phasor values of current, voltage and other parameters so that the line trips.
It is apparent from the scenarios highlighted above that successful attacks against the power system can cripple the power grid system and render it incapable of providing efficient power. Given these inadequacies and the insufficient scalability of the smart power grid system to mitigate the cyber challenges [12], there is a need to identify cyber-attacks and secure the power grid system infrastructure.

Objective, Contribution and structure of paper
The objective of this paper is to find an effective cyber-attack detection model by fitting different machine learning classifiers on a simulated smart power grid system dataset. The results are then compared, and the most effective model is evaluated using different performance metrics. The demonstrated effectiveness of our model is therefore our contribution to the detection of cyber-attacks in the smart power grid system. The rest of this paper is organized as follows. Section II reviews related literature. Section III discusses data analysis. Section IV covers model fitting and performance evaluation. Section V concludes the paper.

Related literature
While many research papers have been written on intrusion detection in the smart grid system, a number of them appear static in their approach, especially in looking out for particular anomalous behaviors. Since contemporary attacks on smart grid systems have become dynamic, detection approaches should also be dynamic and holistic, so that detection remains effective even in multiclass situations. For example, cyber-attacks such as Relay Setting Change are common in smart grid systems and are often subtle and obfuscated in order to anonymize the attack. This kind of attack may not display the anomalous behaviour that some of the proposed IDSs rely on for detection.
Some of these works are reviewed below.
The WAMS was adapted through a concerted effort by several organizations to monitor power grid systems widely and in real time within a "neighboring grids cluster". Basically, the WAMS monitors cyber-physical system parameters such as the phasors of voltage and current and the status of the IEDs, relays, circuit breakers, etc. [2]. The real-time data generated at the multiple remote points are synchronized by the WAMS and then transmitted for measurement by the Phasor Measurement Unit (PMU). The PMU is a device used to estimate the magnitude and phase angle of the phasor parameters (voltage, current, etc.) in the electricity grid. The monitoring and synchronization are done to ensure accuracy while looking out for deviations and malicious values that could lead to downtime resulting from attacks [13] [14].

Specification-based Intrusion Detection System.
Unlike signature-based and anomaly-based Intrusion Detection Systems, the Specification-based IDS is a behavior-rule specification technique for intrusion detection that was introduced by Ko in 1996. It is applied mostly in medical cyber-physical systems, electrical cyber-physical grid systems, software engineering and the network protocols of some critical infrastructures [2] [15]. In this IDS, the rules represent the behavior of the system as a state machine at every instant of time. According to Pan et al. [2], the state machine behaviors are represented by a sequence of states according to the specified policies. The devices are then monitored and tracked for intrusions, changes and anomalous behaviors that could drive the system from a safe state to an unsafe state. Any sequence of behaviors that falls outside the predefined specifications is flagged as an intrusion. In a nutshell, the authors averred that the Specification-based IDS can be likened to a complement of the anomaly-based IDS.
The Semi-Supervised Anomaly IDS, proposed by Park et al. [16], is another form of behavior-based IDS. Though this model was targeted at Medical Cyber-Physical Devices (MCPD) for assisted living environments, it could also be adopted to detect anomalous behavior in the power grid system. Basically, this IDS audits a series of events called episodes. Each episode consists of a sensor ID, a start time and the duration of the event. Using the Hidden Markov Model (HMM) technique, a comparative analysis is done to determine the current state of events and what happens thereafter. Based on the observed behaviour, the behaviour is classified as a low-level state or a high-level state in order to infer whether it is abnormal or normal.
Faisal et al. [17] proposed the use of a Data-Stream-Based Mining IDS for monitoring intrusions in the smart grid Advanced Metering Infrastructure (AMI). The structure of this IDS is similar to that of the anomaly-based IDS, but it mines a stream of data rather than using the conventional static mining techniques often found in the anomaly technique. The application of this proposal is, however, limited to the smart meter. Therefore, this model is not suitable for detecting cyber-attacks across the cyber-physical smart power grid system.
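The idea of flagging any state sequence that falls outside a predefined specification can be illustrated with a minimal Python sketch. The state names and the set of allowed transitions below are hypothetical and chosen only for illustration; they are not taken from [2] or [15].

```python
# Hypothetical specification: the set of state transitions considered safe.
ALLOWED_TRANSITIONS = {
    ("normal", "fault_detected"),
    ("fault_detected", "breaker_open"),
    ("breaker_open", "normal"),
    ("normal", "maintenance"),
    ("maintenance", "normal"),
}


def find_violations(states):
    """Return (index, previous, current) for each transition outside the specification."""
    violations = []
    for i in range(1, len(states)):
        step = (states[i - 1], states[i])
        if step not in ALLOWED_TRANSITIONS:
            violations.append((i, states[i - 1], states[i]))
    return violations


# A breaker that opens without a preceding fault is not in the specification.
observed = ["normal", "fault_detected", "breaker_open", "normal", "breaker_open"]
violations = find_violations(observed)
print(violations)  # [(4, 'normal', 'breaker_open')]
```

In this sketch everything not explicitly allowed is treated as an intrusion, which mirrors the whitelist character of a specification-based approach.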

3 Data Analysis

Description of dataset
The dataset used in this paper is the power system dataset [2] [18]. It consists of 129 variables (128 predictors and 1 response variable with 3 classes). The dataset contains measurements of electric transmission on a smart power grid system. These measurements were taken using 4 synchrophasors, with each Phasor Measurement Unit (PMU) measuring 29 features of the events, for a total of 116 features. The PMUs use a common time source to synchronize the various measurements, and the measured events were classified as attacks or benign data. The benign data consists of Normal traffic and NoEvents. These measurements were obtained using Snort, a simulated control panel, and relays. The parameters measured are: the voltage phase angle (PA1:VH - PA3:VH), the voltage phase magnitude (PM1:V - PM3:V), the current phase angle (PA4:IH - PA6:IH) and the current phase magnitude (PM4:I - PM6:I). Others are: the zero voltage phase angle (PA7:VH - PA9:VH), the zero voltage phase magnitude (PM7:V - PM9:V), the zero current phase angle (PA10:VH - PA12:VH) and the zero current phase magnitude (PM10:V - PM12:V). In addition, the following parameters were measured: frequency for relay (F), frequency delta (DF), appearance impedance for relays (PA:Z), appearance impedance angle for relays (PA:ZH) and status flag for relays (S). Other descriptions in the dataset are fault location, line maintenance and load condition. The entire setup was aimed at measuring both the normal traffic transmission in the grid and the attacks (cyber intrusions) that could impact the power grid system.
To visualise the distribution of the classes in the response variable of our dataset, we plotted a barplot of the values using the RStudio Integrated Development Environment (IDE). See the plot in Fig. 2 and the R code in Appendix A. Although the class representation and the barplot show the Attack class as the majority class over the benign classes, the dataset does not fit the description of an imbalanced dataset in cybersecurity, considering the ratios between the classes. If we treat the dataset as binary (benign versus attack), the ratio of benign to attack is 1 : 2. The ratio of Natural to Attack is 1 : 3 and that of NoEvent to Attack is 1 : 11. In a typical intrusion dataset, a minority-to-majority ratio of 1 : 10 or beyond is expected before the dataset is classified as imbalanced. More importantly, since our target class is the Attack class, and it is the majority class, we elected to proceed with the dataset, with a view to achieving a high recall rate and a high Area Under the Curve (AUC) for the ROC curves.
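As a quick consistency check on the feature counts, the measurement list above can be enumerated in a few lines of Python. This is only a sketch: it assumes that each range (for example PA1 to PA3) denotes three consecutive measurements, and the `R1-` style column prefix is a hypothetical naming convention, not necessarily the one used in the dataset files.

```python
# Per-PMU measurements named in the dataset description.
phase_angles = [f"PA{i}" for i in range(1, 13)]      # PA1 ... PA12
phase_magnitudes = [f"PM{i}" for i in range(1, 13)]  # PM1 ... PM12
relay_extras = ["F", "DF", "PA:Z", "PA:ZH", "S"]

per_pmu = phase_angles + phase_magnitudes + relay_extras

# One copy of the per-PMU features for each of the 4 synchrophasors.
# The "R<n>-" prefix is an illustrative assumption.
columns = [f"R{r}-{name}" for r in range(1, 5) for name in per_pmu]

print(len(per_pmu), len(columns))  # 29 116
```

The 12 phase angles, 12 phase magnitudes and 5 relay quantities give the 29 features per PMU stated in the text, and 4 PMUs give 116.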
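The class ratios discussed above can be computed directly from the response variable. The sketch below uses made-up class counts, chosen only so that the ratios resemble those quoted in the text; they are not the counts of the actual dataset, and the function name is hypothetical.

```python
from collections import Counter


def majority_to_minority_ratios(labels, majority):
    """Return, for every other class, how many majority instances exist per instance."""
    counts = Counter(labels)
    return {cls: counts[majority] / n for cls, n in counts.items() if cls != majority}


# Made-up counts for illustration only.
labels = ["Attack"] * 330 + ["Natural"] * 110 + ["NoEvents"] * 30

ratios = majority_to_minority_ratios(labels, "Attack")
print(ratios)  # {'Natural': 3.0, 'NoEvents': 11.0}

# Binary view: Attack versus all benign instances combined.
benign = sum(1 for label in labels if label != "Attack")
binary_ratio = labels.count("Attack") / benign
print(round(binary_ratio, 2))  # 2.36
```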
Data cleaning and pre-processing prepare the dataset for eventual use and ensure that all the data points contribute to the model without bias. This involves outlier removal, feature selection and data normalisation. However, in our experiment, we only performed outlier removal and data normalisation using the scaling function.
Outlier removal. While summarising and visualizing the dataset, we observed that it was fraught with outliers that needed to be removed. Closer inspection showed that the anomaly was caused by a fault of 10% - 19% on the relay of Line 1, which results in "Inf" values. The same outliers were found in Line 2 and in relays 3 and 4 of the power line. Our observation also suggested that these outliers may have resulted from the disabling of a single relay for line maintenance, a remote tripping command to a single relay, or a fault. In all these cases, the percentage of the disabling function lies between 10% and 29%. In view of this, and of the need to visualise the data points that clearly deviate from the others, we used the boxplot function in RStudio to visualise the outliers in the dataset [19]. From the plot (Fig. 3), data points that deviated significantly from the rest were identified and removed. As can be seen in the figure, the outliers are "Inf" values and they were found in the following variables: "R1.PA.Z", "R2.PA.Z", "R3.PA.Z", and "R4.PA.Z".

Fig. 3. Boxplot representation of values and outliers
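The removal of the "Inf" outliers can be sketched in Python as follows. This is not the authors' R code; it assumes that an observation is dropped whenever any of the four impedance variables is infinite, and the sample rows and the `marker` key are invented for illustration.

```python
import math

# The four variables in which the "Inf" outliers were found.
IMPEDANCE_COLUMNS = ["R1.PA.Z", "R2.PA.Z", "R3.PA.Z", "R4.PA.Z"]


def drop_infinite_rows(rows, columns=IMPEDANCE_COLUMNS):
    """Keep only rows whose values in the given columns are all finite."""
    return [row for row in rows if all(math.isfinite(row[c]) for c in columns)]


# Invented sample rows; "marker" stands in for the response variable.
rows = [
    {"R1.PA.Z": 12.5, "R2.PA.Z": 11.9, "R3.PA.Z": 13.1, "R4.PA.Z": 12.0, "marker": "Attack"},
    {"R1.PA.Z": math.inf, "R2.PA.Z": 11.7, "R3.PA.Z": 12.8, "R4.PA.Z": 12.2, "marker": "Natural"},
    {"R1.PA.Z": 12.1, "R2.PA.Z": 12.3, "R3.PA.Z": 12.9, "R4.PA.Z": 11.8, "marker": "NoEvents"},
]

clean = drop_infinite_rows(rows)
print(len(rows), len(clean))  # 3 2
```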
Data Normalization. Data normalization in multivariate analysis enables each variable to contribute equally to the analysis. The normalization method we used was scaling: we scaled the first through the 128th variable, leaving out the response variable, which is a factor variable. Upon completion of the scaling, we appended the response variable before applying the classifiers for modeling. See Appendix C for the code snippet on data scaling.
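A minimal Python sketch of this step is given below. It assumes that scaling means centering each predictor on its mean and dividing by its sample standard deviation, which is the default behaviour of R's `scale` function; the paper does not state the exact settings, and the toy rows are invented.

```python
from statistics import mean, stdev


def scale_predictors(rows, n_predictors):
    """Z-score the first n_predictors entries of each row; leave the rest unchanged.

    Assumes every predictor column has non-zero variance.
    """
    columns = list(zip(*[row[:n_predictors] for row in rows]))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]  # sample standard deviation (n - 1)
    scaled = []
    for row in rows:
        z = [(row[j] - means[j]) / sds[j] for j in range(n_predictors)]
        scaled.append(z + list(row[n_predictors:]))  # re-append the response
    return scaled


# Toy data: two predictors followed by the response variable.
rows = [[1.0, 10.0, "Attack"], [2.0, 20.0, "Natural"], [3.0, 30.0, "NoEvents"]]
scaled = scale_predictors(rows, n_predictors=2)
print(scaled[0])  # [-1.0, -1.0, 'Attack']
```

For the dataset described here, `n_predictors` would be 128, with the response variable in the final position.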

Model Fitting and Performance Evaluation
At this stage of our experiment, we applied several machine learning algorithms to the dataset using a Windows 10 machine with an Intel Core i5 processor and the RStudio IDE. The aim was to fit several models and compare their results in order to determine which has the best accuracy, sensitivity and specificity. We used both linear and non-linear classifiers because we observed that a few of the classifiers are highly likely to be biased toward the majority class in their output. Before applying the classifiers, we ensured that the dataset was free of all factors that might affect our output. After data preprocessing, the dataset had 52,885 observations and 129 variables. We then partitioned the dataset into training and testing data, assigning 37,000 observations (about 70%) to training the classifiers. The remainder of the dataset, about 30% of the observations, was used for validation. After the split, we fitted the model using the different classifiers.
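The partition sizes can be reproduced with a short Python sketch. The paper does not say how the observations were assigned to each partition, so the random shuffle and the seed below are assumptions, and integers stand in for the actual observations.

```python
import random


def train_test_split(rows, n_train, seed=42):
    """Shuffle row indices reproducibly and split them into training and test sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


rows = list(range(52885))  # stand-in for the 52,885 preprocessed observations
train, test = train_test_split(rows, n_train=37000)

print(len(train), len(test))             # 37000 15885
print(round(len(train) / len(rows), 3))  # 0.7
```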

Linear Discriminant Analysis (LDA).
The LDA [20] was the first classifier we used. It is a robust linear classifier that also performs dimension reduction when applied to datasets. It works by dividing the data space into N disjoint regions, with probability densities calculated on the assumption that the data is Gaussian and that each attribute has the same variance around its mean. This classifier produced an accuracy of 71% with a high misclassification rate. Table 1 contains the sensitivity and specificity values of this classifier. The R code snippet for the model is in Appendix D.

Support Vector Machine (SVM).
Support Vector Machine (SVM) [21] is a non-linear classifier that is used for both regression and classification problems. SVM produces significant accuracy with less computation power. It seeks the decision boundary (hyperplane) that maximizes the margin between the classes. The data points closest to the hyperplane, known as support vectors, determine the position and orientation of the hyperplane. Our accuracy when using this classifier to fit our model was 72%. This model also showed a high misclassification rate, hence our decision to tune the kernel parameters to improve performance. See Appendix E for the R code and Table 1 for the sensitivity and specificity values.

SVM Tuning.
Since the accuracy of our SVM model was not very high, especially considering the high misclassification rate, we decided to tune the SVM kernel parameters in order to improve the accuracy and reduce the misclassification cost [21]. An SVM kernel takes data points as inputs and outputs a similarity score that shapes the class boundaries. With the radial kernel, the closer two data points are to each other, the higher their similarity score. A better SVM classifier output therefore requires a better measure of closeness, which can only be achieved with the right values of the kernel parameters. We tried several values of gamma and cost in search of an optimal combination that would yield better accuracy and recall. We also applied different kinds of kernel: radial, polynomial, sigmoid and linear. In the end, we obtained a gamma value of 0.1 and a cost parameter value of 20 with the radial kernel. With these values, the accuracy improved from 72% to 77%, with a small reduction in the misclassification rate. However, the misclassification rate was still high, hence the need to apply other non-linear classifiers. The sensitivity and specificity values are provided in Table 1 and the R code snippets are in Appendix F.
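To make the role of gamma concrete, the following Python sketch evaluates the standard radial (RBF) kernel, exp(-gamma * ||x - y||^2), with the tuned value gamma = 0.1. The sample points are invented, and the sketch illustrates only the kernel itself, not the authors' tuning procedure.

```python
import math


def rbf_kernel(x, y, gamma=0.1):
    """Radial kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points
near = rbf_kernel([1.0, 2.0], [1.5, 2.0])  # squared distance 0.25
far = rbf_kernel([1.0, 2.0], [4.0, 6.0])   # squared distance 25

print(same, round(near, 4), round(far, 4))  # 1.0 0.9753 0.0821
```

A larger gamma makes the similarity decay faster with distance, giving a more local and flexible decision boundary, while the cost parameter controls how heavily misclassified training points are penalised.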

K-Nearest Neighbour (KNN)
The K-Nearest Neighbor (KNN) [22] is another non-linear classifier that we used to model our work. KNN uses the Euclidean distance to measure the distance between a data point and its neighbors. Based on the size of our dataset, we calculated the value of K as 192 and 193 (nearest neighbours); we then fitted the model and computed the confusion matrix. The accuracy of the fitted KNN model was 71%, with a very high misclassification rate, as the sensitivity and specificity were very low. See Table 1 for the values and Appendix G for the code.
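The paper does not state how K was calculated. One common rule of thumb is the square root of the number of training observations, which for 37,000 observations falls between 192 and 193; treating this as the authors' method is an assumption. The Python sketch below shows that calculation together with a minimal KNN classifier using Euclidean distance and majority voting, on invented toy points.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Rule-of-thumb K for a training set of 37,000 observations (assumed heuristic).
k_estimate = math.sqrt(37000)
print(math.floor(k_estimate), math.ceil(k_estimate))  # 192 193

# Toy example with two well-separated classes.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["Natural", "Natural", "Natural", "Attack", "Attack", "Attack"]
print(knn_predict(train_X, train_y, [0.5, 0.5], k=3))  # Natural
print(knn_predict(train_X, train_y, [5.5, 5.5], k=3))  # Attack
```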

Random Forest
Random Forest (RF) [23] builds decision trees from randomly selected data samples, obtains a prediction from each tree and then selects the best solution by means of voting. Usually, the more trees the classifier creates, the more robust the forest is. It is an ensemble approach based on the divide-and-conquer method. Individual trees are usually generated using an attribute selection indicator. Applying the Random Forest classifier improved the accuracy of the model to 92% at a 95% confidence interval (CI). The model's detection rate of true positives (sensitivity) and its specificity also improved. The improved accuracy makes the model quite relevant for detecting instances of attacks in a multiclass dataset such as the one we are using. Furthermore, the balanced accuracy across the three classes was also very high, which indicates the suitability of the classifier for our experiment. It is also worth adding that, with a Kappa value of 82%, the model can be said to have performed very well in identifying and detecting the attack classes. See Table 1 for more on the detected values and Appendix H for a snippet of the code.
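Two ideas from this paragraph, majority voting across trees and the Kappa statistic, can be sketched in Python. The tree predictions and the confusion matrix below are invented for illustration and are not the results reported in this paper.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: predicted, columns: actual)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


print(forest_vote(["Attack", "Natural", "Attack", "Attack", "NoEvents"]))  # Attack

# Invented 3-class confusion matrix with 100 observations.
example = [
    [50, 2, 1],
    [3, 30, 2],
    [1, 1, 10],
]
kappa = cohen_kappa(example)
print(round(kappa, 3))  # 0.828
```

Kappa compares the observed agreement with the agreement expected by chance given the class totals, so it is less flattering than raw accuracy when one class dominates.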

Experimental result comparison
We computed the Confusion Matrix of each of the classifiers and tabulated the values of the classes in Table 1. For the purpose of this experiment, we restricted the values to the computed Accuracy, Sensitivity and Specificity.

Confusion Matrix of best model
From the comparison of the values in Table 1, the Random Forest model gave the best result of all the classifiers. In addition, the RF model gave the lowest misclassification rate of all the models, hence the confusion matrix in Table 2. The numbers along the diagonal represent the correct decisions, and the numbers off the diagonal represent the errors, otherwise known as misclassifications, for the various classes. The confusion matrix code is in Appendix I. To further explain our recall and precision values: of all the instances predicted as Attack, the proportion that were correctly predicted gives a precision of 0.92 (92%). Also, a recall of 0.98 shows that, of all the instances that should be labeled Attack, our model correctly captured 98%.
F-Measure. The F-Measure, also known as the F-score or F1, is another metric for measuring the accuracy of a classifier, especially on a dataset whose class distribution is slightly skewed towards the majority class. Our dataset fits into this category, hence our decision to also compute the F-score of our model. It is defined as the harmonic mean of precision and recall, and it is the most common metric used on uneven or imbalanced classification problems. An F-score of 1 indicates perfect precision and recall. Therefore, with our model's F-score tending to 1 (F1≈1), we can infer that the RF model was able to classify and detect the attacks. Also, at our confidence level of 95%, which corresponds to a significance level of 0.05, the computed P-value (see Appendix I) is less than the significance level; we can therefore infer that the result is statistically significant and supports the adoption of the RF model as suitable for the detection of attacks.
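The way precision and recall for the Attack class are read off a multiclass confusion matrix can be sketched in Python. The matrix below is hypothetical: its counts were chosen so that the Attack class reproduces the quoted precision of 0.92 and recall of 0.98, and it is not the matrix in Table 2.

```python
def precision_recall(matrix, labels, target):
    """One-vs-rest precision and recall for target.

    matrix[i][j] counts instances predicted as labels[i] whose true class is labels[j].
    """
    k = labels.index(target)
    true_positive = matrix[k][k]
    predicted_positive = sum(matrix[k])              # everything predicted as target
    actual_positive = sum(row[k] for row in matrix)  # everything truly in target
    return true_positive / predicted_positive, true_positive / actual_positive


labels = ["Attack", "Natural", "NoEvents"]
# Hypothetical counts for illustration only.
matrix = [
    [2254, 150, 46],
    [40, 600, 10],
    [6, 20, 100],
]

precision, recall = precision_recall(matrix, labels, "Attack")
print(round(precision, 2), round(recall, 2))  # 0.92 0.98
```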
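As a worked example, the harmonic mean of the precision (0.92) and recall (0.98) reported above for the Attack class can be computed in a few lines of Python. The function name is ours; the formula is the standard F1 definition.

```python
def f_score(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f_score(0.92, 0.98)
print(round(f1, 3))  # 0.949
```

Because the harmonic mean is pulled towards the smaller of the two values, the F-score stays high only when precision and recall are both high.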

Cutoff value, Receiver Operating Characteristics (ROC) and AUC
Cut-off value - The ROC curve is used to determine the optimum cut-off value, as it shows the trade-off between the true positives and the false positives at different cut-off marks. Basically, it evaluates the hit rate and the false alarm rate at varying thresholds (Figure 4) [24]. From Figure 4, it can be observed that the accuracy of our model tends to increase as the cutoff value increases, and that the model achieved its maximum accuracy at a threshold below the default cutoff (0.5). The code snippet is in Appendix J.
ROC Curve - The ROC curve is a valuable tool for visualizing and evaluating a classifier's performance, and it is independent of the class distribution. The closer the ROC curve is to the top-left corner of the graph, the better the performance. The ROC curve of our RF model (Figure 5) tends to the top-left corner of the graph, which points to the ability of the model to predict the true positives correctly.
AUC - The total area of the ROC graph is 1, so the AUC ranges from 0.0 to 1.0. To measure the predictive accuracy of a model, the AUC of the curve is computed; it is the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. An AUC of 0.5 indicates that the ROC curve lies on the baseline (the diagonal) where FPR = TPR, which means that the model has no predictive value and that detection could at best happen only by chance. With an AUC of 0.978, our RF model has a high chance of detecting true positives.
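The search for a cut-off value can be sketched in Python by sweeping a threshold over predicted attack probabilities and recording the accuracy at each step. The scores and labels are invented for illustration, and the problem is treated as binary (1 = Attack, 0 = benign), which is a simplification of the three-class setting.

```python
def accuracy_at_cutoff(scores, labels, cutoff):
    """Accuracy when instances scoring at or above the cutoff are predicted positive."""
    predictions = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Invented predicted probabilities of the Attack class and true labels.
scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.70, 0.85, 0.95]
labels = [0, 0, 0, 1, 1, 1, 1, 1]

cutoffs = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
best = max(cutoffs, key=lambda c: accuracy_at_cutoff(scores, labels, c))

print(best, accuracy_at_cutoff(scores, labels, best))  # 0.4 1.0
print(accuracy_at_cutoff(scores, labels, 0.5))         # 0.875
```

In this toy data the best cutoff lies just below the default of 0.5, which is the kind of behaviour a plot of accuracy against cutoff is meant to reveal.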
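The rank-based interpretation of the AUC can be checked with a small Python sketch that compares every positive score with every negative score. The scores and labels are invented, and the sketch assumes a binary setting (1 = Attack, 0 = benign).

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented scores: one negative (0.7) outranks one positive (0.6).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

area = auc(scores, labels)
print(round(area, 3))  # 0.889
```

A classifier that gives every instance the same score wins half of each tied pair, which yields the chance-level value of 0.5.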

Conclusion
In dealing with the growing integration and complexity of the cyber-physical smart grid system, it is necessary to explore effective approaches to detecting, monitoring, optimizing and, more importantly, securing the smart power grid system. This paper has proposed an effective method for detecting cyber-attacks in a smart grid system. Because the dataset we used has a multiclass response variable, our focus was on correctly classifying and detecting the true positives (Attacks) with a commensurate level of accuracy. To achieve this objective, we applied several machine learning classifiers in search of high accuracy and a high true positive rate. The classifiers we applied, after the necessary data cleaning and preparation, were Linear Discriminant Analysis, Support Vector Machine, K-Nearest Neighbor and Random Forest. Of all the classifiers, the Random Forest model gave the highest accuracy, the best detection rate of true positives and the best specificity. We then evaluated the performance of our model using metrics such as precision, recall, F-score, ROC and Area Under the Curve. All the metrics supported our model, indicating a very high probability of detecting anomalies in a smart grid system.

Future work
The smart power grid system is experiencing a number of domain-specific forms of cyber-attack. These attacks include data injection, remote tripping command injection, relay reset and others. Future work should look at identifying and classifying these forms of cyber-attack against the smart grid system infrastructure.