Streaming Multi-layer Ensemble Selection using Dynamic Genetic Algorithm

In this study, we introduce a novel framework for non-stationary data stream classification problems by modifying the Genetic Algorithm to search for the optimal configuration of a streaming multi-layer ensemble. We aim to connect the two sub-fields of non-stationary stream classification and evolutionary dynamic optimization. First, we present the Streaming Multi-layer Ensemble (SMiLE), a novel classification algorithm for non-stationary data streams which comprises multiple layers of different classifiers. Second, we develop an ensemble selection method to obtain an optimal subset of classifiers for each layer of SMiLE. We formulate the selection process as a dynamic optimization problem and then solve it by adapting the Genetic Algorithm to the stream setting, generating a new classification framework called SMiLE_GA. Finally, we apply the proposed framework to address a real-world problem of insect stream classification, which relates to the automatic recognition of insects through optical sensors in real-time. The experiments showed that the proposed method achieves better prediction accuracy than several state-of-the-art benchmark algorithms for non-stationary data stream classification.


I. INTRODUCTION
In the era of big data, machine learning is becoming increasingly popular for analyzing complex data to save the cost and time of performing manual tasks. However, when dealing with real-world big data, traditional machine learning algorithms suffer from three major drawbacks: storing the whole dataset is infeasible; models fail to handle very high-speed data; and changes in the data distribution cause models to break down (concept drift [1]). In many applications, including sensor networks, video streaming, and traffic monitoring systems, data are even generated as a real-time stream, which requires machine learning models to be updated continuously and rapidly. Data streams are potentially non-stationary because the process generating them may change over time, leading to the concept drift issue. In particular, prediction models can get stuck in the concept of old data and never adapt readily to the new distribution. In such scenarios, online learning with an associated concept drift handling mechanism is one of the best schemes to adapt to distribution changes in data streams while maintaining good prediction performance [2]-[4].
The field of optimization plays a crucial role in almost all machine learning algorithms. For example, the Deep Neural Network (DNN), one of the most successful machine learning models, needs an optimization algorithm to search for its optimal weights. Gradient-based optimization methods such as Stochastic Gradient Descent and Adam [5] are well suited for optimizing DNNs because the loss function can be differentiated with respect to the weights. However, these optimization methods are not applicable to more complicated scenarios, for example when the loss function is not differentiable.
The dynamic nature of many real-world problems can affect their objective functions and constraints, disrupting the behavior of traditional optimization methods. In the literature, optimization problems whose components change over time are called Dynamic Optimization Problems (DOPs). Solving DOPs is particularly difficult because the changing optimal solution(s) must be tracked over time. For complex problems like DOPs, Evolutionary Computation-based methods are an effective choice since their behaviors are inspired by biological evolution and self-organized populations operating in continuously changing environments.
In this paper, we propose a novel streaming classification framework by introducing a DOP solver that works in the data stream setting, which connects the two sub-fields and opens a new research direction for the machine learning community. Our contributions in this work are summarized as follows:
1) Streaming Multi-layer Ensemble: We introduce a cascade structure to combine different online learning algorithms into a multi-layer ensemble, which is able to learn incrementally from non-stationary data streams.
2) Ensemble selection for SMiLE: We propose a mechanism that makes the Genetic Algorithm applicable to the SMiLE selection problem in a non-stationary stream setting.
3) Real-world application: We apply the proposed framework to address the insect stream classification problem. The goal is to recognize insects related to public health problems. The data streams in this problem were generated by using an optical sensor over time [6].
4) Experimental analysis: We compare the proposed methods with several state-of-the-art algorithms on the insect streaming data. The experiments show that the proposed method achieves higher prediction accuracy than the benchmark algorithms.
In the next section, we discuss the background and related work (Section II), followed by the proposed methods (Section III), experimental setting (Section IV), and results and discussion (Section V). Finally, we draw some conclusions in Section VI.

A. Data stream learning
In the data stream setting (or online setting), learning models are expected to start making predictions at any time before obtaining the whole dataset since the stream of data may never end. Furthermore, they need to be incremental and fast due to the high-speed characteristic of data streams. Here we discuss two types of algorithms for data stream classification: single classifiers and ensemble systems.

Single classifiers
Some batch learning methods are naturally incremental and fast, making them directly applicable to classifying streaming data. The most notable method with a low computational cost is the well-known Naïve Bayes (NB) classifier. It performs instance-incremental prediction by making the naive assumption that all feature variables are mutually independent conditional on each class. However, this simple assumption is also the drawback of the NB method since it is generally invalid in many real-world scenarios. Other methods that naturally support online learning are Perceptron and Stochastic Gradient Descent (SGD). Perceptron tries to linearly separate the data into different classes, while SGD is an incremental gradient-based optimization method for differentiable objective functions, especially convex loss functions such as log loss or hinge loss. Both SGD and Perceptron are very fast and cost-efficient, but they can only handle simple datasets, for instance those that are linearly separable.
Another way to produce online classifiers is to 'streamify' batch learning algorithms. Decision Tree attracts the most attention in the literature because it retains high performance and theoretical support when ported to stream environments. It is also a good base learner for many streaming ensembles with state-of-the-art prediction accuracy [2], [7], [8]. Very Fast Decision Tree (VFDT) [9], also known as Hoeffding Tree, was the first successful adaptation of Decision Tree to the data stream setting. To determine the best split attribute when building a tree, VFDT tries not to revisit old instances by waiting for new ones to arrive. An interesting characteristic of VFDT is that it asymptotically converges to a batch learning Decision Tree when given enough data. Hulten et al. introduced Concept-adapting Very Fast Decision Tree (CVFDT) [10] as an upgraded version of VFDT for non-stationary data streams. There are also many other variants of the Hoeffding Tree model in the literature, for example Extremely Fast Decision Tree (EFDT) [11], Random Hoeffding Tree (RHT) [12], Hoeffding Option Tree (HOT) [13], and Hoeffding Adaptive Tree (HAT) [14]. They all use the Hoeffding bound to check the condition for splits at each node.
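As a rough illustration of the split check mentioned above: the Hoeffding bound states that, with probability 1 - δ, the true mean of a random variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of its empirical mean after n observations. The sketch below is not taken from any of the cited implementations; the function names, the value of δ, and the example gain values are assumptions chosen for illustration.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split once the observed gap between the two best attributes exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Hypothetical numbers: 200 instances seen at a leaf, delta = 1e-7, range 1.0.
    eps = hoeffding_bound(1.0, 1e-7, 200)
    print(f"epsilon = {eps:.4f}")
    print(should_split(0.30, 0.05, 1.0, 1e-7, 200))  # gap 0.25 vs. epsilon
    print(should_split(0.30, 0.20, 1.0, 1e-7, 200))  # gap 0.10 vs. epsilon
```

Because ε shrinks as n grows, a leaf that cannot yet separate its two best attributes simply waits for more instances, which matches the description of VFDT above.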

Ensemble systems
Almost all the best-performing models for non-stationary data streams in terms of prediction accuracy are ensemble-based methods, mainly because they can selectively exploit the advantages of various single classifiers at once. The most well-known ensemble-based system for data streams is the Online Bagging method, which was introduced by Oza [15]. It adapted the classical Bagging algorithm to the stream setting by employing the Poisson(1) distribution to simulate the bootstrap technique in an online manner. The author also proposed Online Boosting in his work, but it is less popular than Online Bagging due to its slower speed and lower prediction accuracy. There are better variants of Boosting for non-stationary data streams, such as the Boosting-like Online Learning Ensemble (BOLE) [3] and the Online Smooth Boost (OSB) [16]. BOLE improved the performance of Online Boosting by weakening the condition for an expert to vote and making use of the Drift Detection Method (DDM) [17] to handle changes in data. In the OSB method, the online weak learner was redefined, and only smooth distributions were generated to avoid assigning too much weight to a single ensemble member (also known as an ensemble expert). Recently, van Rijn et al. proposed the BLAST ensemble [18], which made use of the Online Performance Estimation framework to adaptively select a subset of best-performing base classifiers to form the voting panel. BLAST works well in practice when a diverse set of base classifiers is available. Another approach to handling streaming data is to use chunk-based ensembles. A well-known ensemble in this category is Learn++.NSE [19], which generalized the Learn++ method [20] for non-stationary environments. Learn++.NSE exploits the ensemble error on a new data chunk to assign weights to the instances. Very recently, Montiel et al. introduced the Adaptive XGBoost (AXGB) ensemble system [4], an adaptation of the classical eXtreme Gradient Boosting (XGB) [21] to the stream setting. In this method, new ensemble members are generated from mini-batches of incoming instances. The learning process continues even when a fixed maximum ensemble size is reached because the ensemble keeps updating to adapt to the latest concept of the data.
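The Poisson(1) trick used by Online Bagging can be sketched as follows: each ensemble member sees every incoming instance k times, where k is drawn from Poisson(1), which approximates bootstrap sampling without storing the stream. This is a minimal sketch rather than the implementation of [15]; the `MajorityClass` base learner and the `learn_one`/`predict_one` method names are assumptions introduced only to keep the example self-contained.

```python
import math
import random
from collections import Counter


def poisson1(rng: random.Random) -> int:
    """Draw from Poisson(lambda=1) using Knuth's multiplication method."""
    limit = math.exp(-1.0)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


class MajorityClass:
    """Hypothetical stand-in base learner: predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class OnlineBagging:
    def __init__(self, make_learner, n_members: int, seed: int = 0):
        self.members = [make_learner() for _ in range(n_members)]
        self.rng = random.Random(seed)

    def learn_one(self, x, y):
        # Each member trains on the instance k ~ Poisson(1) times (possibly zero).
        for member in self.members:
            for _ in range(poisson1(self.rng)):
                member.learn_one(x, y)

    def predict_one(self, x):
        votes = Counter(member.predict_one(x) for member in self.members)
        votes.pop(None, None)  # ignore members that have not been trained yet
        return votes.most_common(1)[0][0] if votes else None
```

Any incremental classifier exposing the same two methods could replace `MajorityClass` in this sketch.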

B. Evolutionary computing algorithms for Dynamic Optimization Problems
Most optimization solvers in the literature were designed for problems with static fitness functions and constraints. However, in reality, these static assumptions may be invalid due to the dynamic characteristics of the environments in which the problem is set up. In these cases, the objective functions and constraints can vary over time, causing static optimization algorithms to break down or even become inapplicable. This dynamic context can be found in many of today's real-world applications, including Social Networks, IoT Devices, Smart Homes, and Smart Traffic Monitoring Systems. Evolutionary Computing (EC) and Swarm Intelligence (SI) methods are

Active approach
In the active approach, algorithms possess a mechanism to explicitly detect changes in the problem formulation. When a change is detected, the algorithm alters its original search behavior to adapt to that change. Note that the performance of this approach depends heavily on the efficiency of the associated change detector. The most notable work following the active approach is the introduction of the hyper-mutation operator [22], which triggers an increase in the mutation rate when a change occurs. Another successful work involves actively diversifying the pool of candidates by migrating individuals inside a subpopulation in a multi-population paradigm [23], [24]. The active approach faces the challenge of determining how much diversity should be introduced when a change is detected, which makes it difficult to apply to data stream classification, where little or no information is revealed at the beginning of the stream.
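The hyper-mutation idea can be illustrated with a short sketch: the mutation rate stays low while the environment is stable and is raised temporarily once a change is flagged. The rates and the change test used here (re-evaluating a stored individual and comparing its fitness) are assumptions for illustration, not the settings of [22].

```python
import random


def change_detected(stored_fitness: float, reevaluated_fitness: float,
                    tol: float = 1e-9) -> bool:
    """Assumed detector: the same individual no longer receives the same fitness."""
    return abs(stored_fitness - reevaluated_fitness) > tol


def current_rate(changed: bool, base_rate: float = 0.01,
                 hyper_rate: float = 0.5) -> float:
    """Use the (hypothetical) hyper-mutation rate only after a detected change."""
    return hyper_rate if changed else base_rate


def mutate(bits, rate: float, rng: random.Random):
    """Bit-flip mutation: each gene flips independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


if __name__ == "__main__":
    rng = random.Random(42)
    individual = [1, 0, 1, 1, 0, 0, 1, 0]
    changed = change_detected(stored_fitness=0.80, reevaluated_fitness=0.55)
    print(mutate(individual, current_rate(changed), rng))
```

The difficulty noted above shows up directly in this sketch: both `hyper_rate` and the detector tolerance must be chosen in advance.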

Passive approach
By contrast, in the passive approach, population diversity is continuously maintained over time during the search procedure. In detail, diversity in the population/swarm is promoted at the cost of some search performance to prevent the optimization process from converging too quickly to the optimum and, therefore, from being stuck there when the solution becomes out of date. There are many proposed ways to accomplish this idea in the literature. For example, Simões and Costa introduced the transformation operator, which was inspired by the somatic hypermutation of B-cells [25]. When an individual undergoes the transformation operator, one gene segment is first randomly chosen from a random gene pool. Then, the selected segment substitutes the genes located after a random transformation locus. In [26], robust optimization over time (ROOT) was proposed as a new approach to solve DOPs. This framework uses an adapted radial-basis-function to locally approximate the fitness, and an autoregressive model is employed to predict it. This method then searches for robust solutions by exploiting the information of local fitness approximation and prediction. Recently, Yazdani et al. followed the idea of ROOT to propose a multi-swarm Particle Swarm Optimization (PSO) algorithm for DOPs [27]. This method allows different swarms to track peaks and collect data about their search behaviors, which is then analyzed to determine the next robust solution. The average number of environments over which a solution is retained is maximized while the quality of solutions is kept acceptable. The passive approach can be easily applied to real-world problems such as stream classification since its performance is comparable to that of the active approach while there is no need to employ a change detector.
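The transformation operator described above can be sketched in a few lines. This is an illustrative reading of the description, not the code of [25]: the choice to start writing at the position immediately after the locus, and to wrap around the end of the chromosome, are assumptions, as is the requirement that a segment is no longer than the chromosome.

```python
import random


def transform(chromosome, segment_pool, rng: random.Random):
    """Overwrite the genes after a random locus with a random segment from the pool."""
    segment = rng.choice(segment_pool)
    if len(segment) > len(chromosome):
        raise ValueError("segment must not be longer than the chromosome")
    locus = rng.randrange(len(chromosome))
    child = list(chromosome)  # leave the parent untouched
    for offset, gene in enumerate(segment):
        # Assumption: the chromosome is treated as circular.
        child[(locus + 1 + offset) % len(child)] = gene
    return child


if __name__ == "__main__":
    rng = random.Random(7)
    parent = [0, 0, 0, 0, 0, 0, 0, 0]
    pool = [[1, 1, 1], [1, 0, 1, 1]]
    print(transform(parent, pool, rng))
```

Unlike crossover, the operator draws genetic material from a separate pool, so diversity can be injected even when the population itself has converged.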

A. Problem formulation
A data stream is defined as an infinite sequence of data points $X = \{x_1, x_2, \ldots, x_\infty\}$, in which $x_k$ is a $d$-dimensional feature vector, with an associated sequence of class labels $Y = \{y_1, y_2, \ldots, y_\infty\}$, where $y_k \in \{l_1, l_2, \ldots, l_M\}$ is the true label of the sample $x_k$ in $X$. A common assumption in the data stream literature is that the true label $y_k$ of $x_k$ is obtainable before the next data point $x_{k+1}$ arrives. Generally, there are two main approaches to processing data streams: (1) use a single data instance $\{x_k, y_k\}$ at a time to update the classifier; (2) divide the incoming stream into equally sized chunks $C_1, C_2, \ldots, C_\infty$ and then use all instances of a chunk to update the classifier.
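The chunk-based processing mode can be illustrated with a small generator that groups a (possibly endless) iterable of labeled instances into fixed-size chunks. The function name and the handling of a final partial chunk for finite inputs are assumptions made for this sketch.

```python
from itertools import islice


def chunked(stream, size: int):
    """Yield consecutive lists of `size` items; a finite stream may end with a shorter list."""
    if size <= 0:
        raise ValueError("size must be positive")
    iterator = iter(stream)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    toy_stream = ((k, k % 2) for k in range(7))  # (x_k, y_k) pairs
    for c in chunked(toy_stream, 3):
        print(c)
```

Because the generator pulls items lazily, it never needs the whole stream in memory, in line with the streaming constraints described above.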
A data stream is stationary if all its instances (excluding the outliers) are generated from the same distribution $D$. By contrast, a non-stationary stream exhibits concept drift [1]; in other words, the underlying data distribution may change over time. Many types of concept drift are introduced in the literature, most notably Abrupt Drift, Gradual Drift, and Incremental Drift. If an abrupt drift occurs, the current data distribution is immediately replaced by a new distribution, which often severely damages the classification performance if the learning model fails to react in a timely manner. Meanwhile, gradual and incremental drifts happen over a longer period of time, making them more challenging to detect. The types of concept drift in evolving data streams are very similar to the types of changes in dynamic optimization problems [28]. Therefore, it is natural to employ a DOP solver to address the problem of non-stationary data stream classification.

B. Streaming Multi-layer Ensemble (SMiLE)
Inspired by the cascade structure of the Multi-layer Perceptron (or Neural Networks), we propose the Streaming Multi-layer Ensemble (SMiLE) for data stream classification. The main idea is to use layer-by-layer processing of the features to perform representation learning. In particular, the output of a layer is used as the input data for the next layer [29]. The proposed method is illustrated in Fig. 1.
Each layer of SMiLE is a heterogeneous ensemble containing various types of online classifiers. Table I shows the classifier list used in each layer. We choose this list of online classifiers based on their prediction accuracy and speed. Since
The classification phase of SMiLE is slightly simpler than the update process discussed above. Each layer consecutively makes predictions for the $K \times M$-dimensional output vector obtained from the previous layer, except for layer 1, which predicts from the raw feature vector $x$. The last layer $L$ gives us $K$ different $M$-dimensional output prediction vectors. The final prediction of SMiLE is obtained by aggregating these probability vectors using an ensemble combining method such as the Sum Rule or the Majority Vote Rule [30].
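The classification phase described above can be sketched with NumPy. The base classifiers here are hypothetical stubs (fixed random linear softmax models) standing in for the online classifiers of Table I, and the helper names are assumptions; only the data flow follows the text: layer 1 sees the raw $d$-dimensional vector, each later layer sees the concatenated $K \times M$ outputs of the previous layer, and the Sum Rule combines the last layer.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def make_stub_classifier(in_dim, n_classes, rng):
    """Hypothetical stand-in for an online classifier returning class probabilities."""
    w = rng.normal(size=(n_classes, in_dim))
    return lambda v: softmax(w @ v)


def build_stub_smile(d, n_classes, k, n_layers, seed=0):
    """Create n_layers layers of k stub classifiers with matching input sizes."""
    rng = np.random.default_rng(seed)
    in_dims = [d] + [k * n_classes] * (n_layers - 1)
    return [[make_stub_classifier(dim, n_classes, rng) for _ in range(k)]
            for dim in in_dims]


def smile_predict(layers, x):
    """Cascade prediction: returns (predicted class index, summed probability vector)."""
    if not layers:
        raise ValueError("at least one layer is required")
    layer_input = np.asarray(x, dtype=float)
    for layer in layers:
        outputs = [clf(layer_input) for clf in layer]  # K vectors of length M
        layer_input = np.concatenate(outputs)          # K*M-dimensional input for the next layer
    summed = np.sum(outputs, axis=0)                   # Sum Rule over the last layer
    return int(np.argmax(summed)), summed


if __name__ == "__main__":
    layers = build_stub_smile(d=5, n_classes=4, k=3, n_layers=4)
    print(smile_predict(layers, [0.1, 0.2, 0.3, 0.4, 0.5]))
```

Replacing the Sum Rule with majority voting would only change the last two lines of `smile_predict`.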

C. Ensemble selection for SMiLE
We introduce a novel selection method to improve the proposed SMiLE. It is inspired by the mechanism of the Dropout method [31], a technique that is central to the success of Deep Learning [32]. The main idea is to drop a subset of neurons at each layer to reduce the complexity of the whole Neural Network (here, the Multi-layer Ensemble), which avoids overfitting in many cases. Here, we selectively choose which classifiers are used in each layer by formulating this process as a DOP and then solving it using the Genetic Algorithm. Our method differs from the original Dropout method, in which a number of neurons are dropped blindly at random.
To formulate the selection process, we first introduce the concept of a SMiLE configuration. We employ a binary vector to represent the selection decision at layer $i$: $s_i = [s_{i,1}, s_{i,2}, \ldots, s_{i,K}]^T$, $s_{i,j} \in \{0, 1\}$, where $s_{i,j} = 1$ means the $j$-th classifier of layer $i$ is selected, and $s_{i,j} = 0$ otherwise. Therefore, a selection configuration for SMiLE can be represented by an $LK$-length vector:
$$[\underbrace{s_{1,1}, \ldots, s_{1,K}}_{\text{layer 1}}, \underbrace{s_{2,1}, \ldots, s_{2,K}}_{\text{layer 2}}, \ldots, \underbrace{s_{L,1}, \ldots, s_{L,K}}_{\text{layer } L}]$$
Each configuration now corresponds to a specific multi-layer ensemble which is simpler than the original one. We call this simpler multi-layer ensemble a refined ensemble. For example, for the configuration [... 1 1 1 0 1 1 1 1 0 1], number of classes $M = 4$, number of features $d = 5$, and classifier list {Naïve Bayes (NB), Perceptron (Perc), Stochastic Gradient Descent (SGD)}, the refined ensemble is shown in Fig. 2.
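To make the encoding concrete, the sketch below splits an $LK$-length binary configuration into per-layer masks and keeps only the selected classifiers. The function names and the example configuration are hypothetical; in particular, the example is not the configuration of Fig. 2, and the sketch does not address what should happen if a layer ends up with no selected classifier.

```python
def split_configuration(config, n_layers: int, k: int):
    """Cut a flat L*K binary vector into L masks of length K."""
    if len(config) != n_layers * k:
        raise ValueError("configuration length must equal n_layers * k")
    return [list(config[i * k:(i + 1) * k]) for i in range(n_layers)]


def refine(layers, config):
    """Return the refined ensemble: per layer, the classifiers whose bit is 1."""
    k = len(layers[0])
    masks = split_configuration(config, len(layers), k)
    return [[clf for clf, keep in zip(layer, mask) if keep]
            for layer, mask in zip(layers, masks)]


if __name__ == "__main__":
    names = ["NB", "Perc", "SGD"]
    full = [list(names), list(names)]          # L = 2 layers, K = 3 classifiers
    print(refine(full, [1, 0, 1, 0, 1, 1]))    # [['NB', 'SGD'], ['Perc', 'SGD']]
```

Note that dropping a classifier from layer $i$ also shrinks the input dimension of layer $i+1$, so a refined ensemble is a genuinely different model and not just a masked copy of the full one.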
[Fig. 2: refined ensemble; legend: Classification Phase, Update Phase]
The classification phase and update phase of a refined ensemble follow the same procedures as the original SMiLE, which are discussed in sub-section III-B.
The fitness associated with each configuration is evaluated as follows. First, we use the chunk-based approach to store a chunk of the $N$ latest instances of the data stream. This chunk is then used to calculate the fitness as follows:
• The first $N/2$ instances are used solely for updating the refined ensemble.
• For the remaining $N/2$ instances, we use the interleaved test-then-train method to evaluate the ensemble. In particular, the ensemble makes a prediction for an incoming data point to obtain a predicted label, which is compared with the true label to compute the 0-1 loss. After that, the data point along with its true label is used for updating the ensemble. This is a very common technique for evaluating classifiers in data streams [2], [4], [33].
Algorithm 1 (excerpt):
11: Update the segment pool
12: Choose the refined ensemble corresponding to the best candidate to classify $C_{k+1}$
13: end for
changes over time due to the non-stationary characteristic of evolving data streams. Hence, we need a DOP solver instead of a static one to search for the best configuration of SMiLE.
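The chunk-based fitness evaluation described above can be sketched as follows. Since equation (1) is not reproduced in this text, the sketch assumes that the fitness is the accuracy on the second half of the chunk, that is, one minus the mean 0-1 loss; this is an assumption, as are the `learn_one`/`predict_one` method names and the toy `MajorityClass` model.

```python
from collections import Counter


class MajorityClass:
    """Hypothetical stand-in for a refined ensemble: predicts the most frequent label."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


def evaluate_fitness(model, chunk):
    """Train on the first half, then run interleaved test-then-train on the second half."""
    if len(chunk) < 2:
        raise ValueError("chunk must contain at least two instances")
    half = len(chunk) // 2
    for x, y in chunk[:half]:
        model.learn_one(x, y)                        # update only
    errors = 0
    for x, y in chunk[half:]:
        errors += int(model.predict_one(x) != y)     # test first (0-1 loss)
        model.learn_one(x, y)                        # then train
    return 1.0 - errors / (len(chunk) - half)        # assumed fitness: accuracy


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
    chunk = [(None, y) for y in labels]
    print(evaluate_fitness(MajorityClass(), chunk))  # one error in five tests
```

In the toy run the model is wrong only on the single instance labeled 1, so the assumed fitness is 0.8.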
In particular, we used the Genetic Algorithm (GA) with the transformation operator introduced in [25], which is in charge of injecting diversity into the population. The proposed method SMiLE_GA is detailed in Algorithm 1. When a new chunk of data is available, we update the population max_iter times. In each iteration, we first recalculate the fitness of all candidates in the current population (line 5). This is done by using equation (1) for every candidate after updating them using the first half of the instances of the chunk. Next, two different individuals are selected from the population using the Roulette Wheel Selection technique with probabilities:
$$p_i = \frac{\mathit{fitness}_i}{\sum_{j=1}^{nPop} \mathit{fitness}_j}$$
where $p_i$ and $\mathit{fitness}_i$ are the selection probability and fitness score of the $i$-th candidate, respectively, and $nPop$ is the size of the population used in GA (line 6). The two candidates are transformed with a probability of $P_t$ (line 7). The transformation operator works as follows. First, a random segment is selected from the pool. Then, a position in the configuration is chosen at random. After that, the selected segment substitutes the genes located right after the chosen position (see Fig. 3) [25]. In the next stage, the two selected chromosomes are mutated with a probability of $P_m$ (line 8).
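The Roulette Wheel Selection step can be sketched with the standard library: each candidate is drawn with probability proportional to its fitness. How the two individuals are forced to be different is not specified in the text; drawing the second one from the remaining candidates, and falling back to a uniform draw when all fitness values are zero, are assumptions of this sketch.

```python
import random


def selection_probabilities(fitness):
    """p_i = fitness_i / sum_j fitness_j (uniform if all fitness values are zero)."""
    total = sum(fitness)
    if total <= 0:
        return [1.0 / len(fitness)] * len(fitness)
    return [f / total for f in fitness]


def select_two_distinct(fitness, rng: random.Random):
    """Pick two different candidate indices by fitness-proportional sampling."""
    if len(fitness) < 2:
        raise ValueError("need at least two candidates")
    probs = selection_probabilities(fitness)
    indices = list(range(len(fitness)))
    first = rng.choices(indices, weights=probs, k=1)[0]
    rest = [i for i in indices if i != first]
    rest_weights = [probs[i] for i in rest]
    if sum(rest_weights) <= 0:
        second = rng.choice(rest)  # all remaining candidates have zero fitness
    else:
        second = rng.choices(rest, weights=rest_weights, k=1)[0]
    return first, second


if __name__ == "__main__":
    rng = random.Random(0)
    print(selection_probabilities([1.0, 1.0, 2.0]))  # [0.25, 0.25, 0.5]
    print(select_two_distinct([0.6, 0.7, 0.9, 0.5], rng))
```

Because accuracies of competing configurations are often close to each other, proportional selection of this kind applies only mild selection pressure, which fits the passive, diversity-preserving strategy adopted in this work.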
After evolving the population, we then update the segment pool as follows: 70% of the segments are taken from the individuals of the current population, while the remaining 30% are newly generated at random (line 11). The segment sizes are also generated randomly. At the final step, the ensemble corresponding to the highest-fitness chromosome is chosen to predict for instances in the next data chunk (line 12).
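The segment pool update can be sketched as below. The 70%/30% split follows the text; everything else is an assumption made for illustration, including how a segment is cut from an individual (a contiguous slice at a random position) and the range of segment sizes (between 1 and the chromosome length).

```python
import random


def update_segment_pool(population, pool_size: int, rng: random.Random,
                        from_population: float = 0.7):
    """Build a new pool: a share copied from current individuals, the rest random."""
    length = len(population[0])
    n_from_population = round(pool_size * from_population)
    pool = []
    for _ in range(n_from_population):
        parent = rng.choice(population)
        size = rng.randint(1, length)                 # random segment size
        start = rng.randrange(length - size + 1)      # slice stays inside the parent
        pool.append(list(parent[start:start + size]))
    for _ in range(pool_size - n_from_population):
        size = rng.randint(1, length)
        pool.append([rng.randint(0, 1) for _ in range(size)])  # fresh random genes
    return pool


if __name__ == "__main__":
    rng = random.Random(1)
    population = [[rng.randint(0, 1) for _ in range(12)] for _ in range(30)]
    pool = update_segment_pool(population, pool_size=30, rng=rng)
    print(len(pool), [len(segment) for segment in pool[:5]])
```

With a pool size of 30, this yields 21 segments copied from the population and 9 random ones.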

A. Datasets
We applied the proposed framework to solve the problem of insect stream classification. We used six datasets recently published by Souza et al. [6], which relate to the automatic recognition of disease-carrying insects through optical sensors in real-time. To collect data, a smart trap was used to capture selected species, especially those that are vectors of mosquito-borne diseases and agricultural pests. This trap frees all other species, alleviating the negative influence of the device on the ecological balance. Before these datasets were released, the data stream literature completely lacked knowledge of when and how data distributions change in real-world data streams. Souza et al. collected data in an artificial non-stationary environment for about three months to construct the insect stream datasets with concept drifts. Note that the true class labels were obtained by building separate collector devices, each of which held many specimens of only one species. Many types of concept drift in data streams can be introduced by changing the temperature as follows:
• Abrupt: The first period of the stream was collected at a temperature of 30°C, then it suddenly changed to 20°C. After a period of time, the temperature increased back to around 35°C. Three other abrupt drifts similarly appeared until the stream ended.
• Incremental-gradual: The beginning instances were collected at around 37°C, and the temperature gradually decreased to 35°C. For a period after that, the temperature alternated between 35°C and 23°C until it changed completely to 23°C. Finally, the temperature gradually increased to 27°C.
• Incremental: The stream of instances was collected while incrementally increasing the temperature from 20°C to 40°C.
• Incremental-abrupt-reoccurring: There were three consecutive cycles of incremental changes of temperature from 20°C to 24°C.
• Incremental-reoccurring: There were three consecutive cycles of incremental changes where the temperature first gradually increased from 20°C to 40°C, then slowly decreased from 40°C to 20°C, and finally rose incrementally from 20°C back to 40°C.
• Out-of-control: There was no pattern in the changes of the temperature. This dataset is drift-free since all instances were collected in uniformly random order and each example was sampled uniformly at a time during the stream.
The details of the datasets are summarized in Table II.

B. Benchmark algorithms and parameters
The benchmark algorithms used in our experiments were Online Bagging (OB) [15], Online Smooth Boost (OSB) [16], Adaptive XGBoost (AXGB) [4], and Learn++.NSE (LNSE) [19]. Note that OB and OSB are two instance-incremental ensemble systems, while AXGB and LNSE are two chunk-incremental ensemble systems. Since AXGB was developed only for binary classification problems, we wrapped it with a one-vs-all classifier when dealing with multi-class data streams. We compared these algorithms with our proposed methods SMiLE and SMiLE_GA.
Regarding the hyper-parameters, we used the default values reported in the original papers unless stated otherwise here. The ensemble size of the benchmark algorithms was set to 30. The number of layers of SMiLE and SMiLE_GA was set to 4 when comparing to the benchmark algorithms. Each layer contained 7 different learning algorithms. We used the Majority Vote Rule for combining the outputs of the last layer. The chunk size was set to 500. The parameters for the GA module were set as follows: the transformation probability was set to $P_t = 0.75$; the mutation rate was set to $P_m = 0.05$; the population size and the segment pool size were both set to 30.
We compared the proposed methods to the benchmark algorithms concerning prediction accuracy, which is one of the most common metrics used in data stream evaluation [2], [3], [8], [33]. To evaluate the prediction performance, we used the interleaved-test-then-train strategy (also known as the prequential evaluation method), in which an instance is first used for testing and then for training.

A. Proposed methods vs. baselines
The accuracy results of SMiLE, SMiLE_GA, and 7 base learners on 6 datasets are reported in Table III. Overall, the proposed method SMiLE_GA achieved the best prediction accuracy results. It ranks first on 4 datasets and second on the remaining 2 datasets, demonstrating the considerable improvement of the proposed method in comparison to its base classifiers. We can see that SGD performed extremely badly, obtaining only about 16% accuracy on all datasets. This degrades the performance of SMiLE, making its accuracy even worse than that of the single learner Perceptron. Fortunately, by using the ensemble selection module, SMiLE_GA can exclude SGD whenever this classifier harms the performance of the whole framework. In comparison to SMiLE, the upgraded version SMiLE_GA performs better on all datasets, especially on the Incremental-gradual and Abrupt datasets, where there are large accuracy gaps of around 10% and 6% respectively between SMiLE_GA and SMiLE. This observation demonstrates the benefit of using the ensemble selection module.

B. Proposed methods vs. benchmark algorithms
Table IV shows the accuracy results of SMiLE, SMiLE_GA, and 4 benchmark methods. Across all datasets, the proposed method SMiLE_GA achieved the best overall results with an average accuracy of 68.9692% and an average ranking of 1.1667, followed by its counterpart SMiLE. The SMiLE_GA framework ranked first on 5/6 datasets and second on the remaining dataset, demonstrating the effectiveness of our method in comparison to available streaming learning algorithms in the literature. In particular, on the dataset with abrupt concept drifts, the SMiLE method led the second-best method by a large margin of 6.5% accuracy. By contrast, the worst-performing method was OSB with an average accuracy of 57.5764% and an average ranking of 4.6667. This poor performance can be attributed to the fact that the OSB method has no strategy to deal with changes in the distribution of data. Meanwhile, AXGB, LNSE, and OB obtained average performance in our experiments.
Table IV gives a general view of the prediction performance of the algorithms. However, when dealing with data streams, we are often interested in observing these performances over incoming instances, which helps us to better understand how changes happen during the stream. Therefore, we present Fig. 4 to show an individual evaluation for each dataset.
Fig. 4a shows the accuracy results over time for the Abrupt data. The accuracies of all algorithms tended to decrease when concept drift occurred, except at the second drift. OB, AXGB, and OSB suffered the most significant drops in performance when changes occurred. By contrast, the performance of our proposed method SMiLE_GA was very stable during the stream. It recovered its prediction accuracy very quickly every time a drift happened, leading to the best accuracy at the end of the stream.
Fig. 4b illustrates the results over time for the Incremental-gradual data stream. In this figure, there were significant decreases in the performance of all methods right after the gradual drift occurred. As noted in the description of this dataset, two different concepts were present in this period, making it difficult for all algorithms to detect and react to this change. The proposed method SMiLE continued to perform best over almost the whole stream.
Fig. 4c shows the accuracy results over time for the Incremental dataset. Due to the slow rate of the incremental changes, all algorithms tended to perform stably with a gradual increase in the prediction accuracy. In this dataset, AXGB exhibited the highest accuracy from the point where 10000 instances had arrived until the end of the stream.
Fig. 4d illustrates the accuracy results for the Incremental-abrupt-reoccurring dataset. There were small decreases in the accuracy of all methods whenever a change occurred, except for SMiLE, whose performance kept increasing even when concept drifts happened. However, the SMiLE_GA method obtained higher accuracy than its counterpart SMiLE even with its decline in accuracy when changes appeared.
Fig. 4e shows the results for the Incremental-reoccurring dataset. The overall trend was very similar to that shown in Fig. 4d for the Incremental-abrupt-reoccurring dataset, except that in the current figure, the falls in the prediction accuracies of all methods were more significant after the final change occurred, especially for the OSB and OB methods, which do not have a module to deal with concept drifts.
Fig. 4f shows the results for the Out-of-control data. Even though this dataset presents undefined changes in type and number, all methods obtained stable performance over time with gradual increases in the accumulated accuracies. SMiLE_GA and AXGB were the best-performing methods on this dataset. Although SMiLE_GA obtained lower accuracy in the first period of the stream, its accuracy increased more quickly and surpassed the accuracy of AXGB at the end of the stream. Learn++.NSE seemed to perform poorly on this data, showing a minor downward trend in its accuracy during the data stream.
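As a quick arithmetic check of the reported average ranking of SMiLE_GA, ranking first on five datasets and second on one gives (5 × 1 + 1 × 2) / 6 = 7/6 ≈ 1.1667. The short sketch below reproduces this; the order of the ranks in the list is arbitrary and does not indicate which dataset received rank 2.

```python
def average_rank(ranks):
    """Mean of per-dataset ranks (1 = best)."""
    return sum(ranks) / len(ranks)


if __name__ == "__main__":
    smile_ga_ranks = [1, 1, 1, 1, 1, 2]  # first on 5 of 6 datasets, second on 1
    print(round(average_rank(smile_ga_ranks), 4))  # 1.1667
```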

VI. CONCLUSIONS
In this work, we have proposed the Streaming Multi-layer Ensemble (SMiLE) for data stream classification. It arranges its base classifiers in a cascade structure and performs layer-by-layer processing of the feature vector before utilizing an ensemble combining method to form the final prediction.
Furthermore, we introduced a novel data stream classification framework called SMiLE_GA, which modifies the Genetic Algorithm to solve the dynamic optimization problem representing the process of ensemble selection for the multi-layer ensemble. We applied the proposed methods to solve a real-world problem of insect stream classification. Our experiments showed that SMiLE_GA greatly improved the prediction accuracy of its base classifiers and surpassed the performance of many benchmark algorithms.