VEGAS: a variable length-based genetic algorithm for ensemble selection in deep ensemble learning.



Introduction
In recent years, deep learning has emerged as a hot research topic because of its breakthrough performance in diverse learning tasks. For instance, in computer vision, Convolutional Neural Network (CNN), a deep neural network (DNN), significantly outperforms traditional machine learning algorithms on the image classification task on some large scale datasets. Despite its many successes, there are some limitations of DNNs. Firstly, these deep models are usually very complex with many parameters and can only be trained on specially-designed hardware. Secondly, DNNs require a considerable amount of labeled data for the training process. When the cost of labeled data is too prohibitive, deep models might not bring about the expected gains in performance. It is well-recognized that there are many learning tasks where DNNs are poorer than traditional machine learning methods, especially state-of-the-art methods like Random Forest and XgBoost [12].
Meanwhile, ensemble learning is developed to obtain a better result than using single classifiers. By using an ensemble of classifiers (EoC), the poor predictions of several classifiers are likely to be compensated by those of others, which boosts the performance of the whole ensemble. Ensemble systems have been applied in many areas such as computer vision, software engineering, and bioinformatics. Traditionally, ensemble systems have been constructed with one layer of EoC. A combining algorithm combines the predictions of EoC for the final collaborated prediction [10].
The term "deep learning" is usually associated only with DNNs, which consist of multiple layers of parameterized differentiable nonlinear modules. In 2017, Zhou and Feng introduced a deep ensemble system called gcForest, which includes several layers of Random Forest-based classifiers [22]. The introduction of gcForest has shown that DNNs are only a subset of deep models; that is, deep learning models can also be constructed with multiple layers of non-differentiable learning modules. Experiments on some popular datasets have shown that deep ensemble models outperform not only DNNs but also several state-of-the-art ensemble methods [22], [15].
The predictive performance and computational/storage efficiency of ensemble systems can be further improved by obtaining a subset of classifiers that performs competitively with the whole ensemble. This research topic is known as ensemble selection (aka ensemble pruning or selective ensemble), an ensemble design stage that enhances ensemble performance by searching for an optimal EoC within the whole ensemble. In this study, we introduce an ensemble selection method to improve the performance and efficiency of deep ensemble models. A configuration of the deep model is given as a binary encoding showing which classifiers are selected. The length of the proposed encoding depends on the number of layers and the number of classifiers in each layer that we use to construct the deep model. The configuration of the deep model is thus given in a variable-length encoding. To find the optimal set of classifiers, we consider an optimization problem that maximizes the classification accuracy of the deep ensemble system on the validation data. In this work, we develop VEGAS, a Variable-length Genetic Algorithm (VLGA), to solve this ensemble selection optimization problem. The main contributions of our work are as follows:
- We propose a classifier selection approach for deep ensemble systems.
- We propose to encode the classifiers in all layers of the deep ensemble system in a variable-length encoding.
- We develop a VLGA to search for the optimal number of layers and the optimal classifiers.
- We experimentally show that VEGAS is better than some well-known benchmark algorithms on several datasets.

Ensemble learning and Ensemble Selection
Ensemble learning refers to a popular research topic in machine learning in which multiple classifiers are combined to obtain a better result than using single classifiers. The two main stages in building an ensemble system are how to generate diverse classifiers and how to combine them to make a collaborated decision. For the first stage, we train a learning algorithm on multiple training sets generated from the original training data [11] or train different learning algorithms on the original training data [10] to generate the EoC. For the second stage, a combining method works on the predictions of the generated classifiers to produce the final decision. Based on experiments on diverse datasets, Random Forest [1] and XgBoost [3] were reported as the top-performing ensemble methods. Inspired by the success of DNNs in many areas, several ensemble systems have been constructed with a number of layers of EoCs. Each layer receives the outputs of the preceding layer as its input training data and then outputs the training data for the next layer. The first deep ensemble system, called gcForest, was proposed in 2017; it includes many layers, with two Random Forests and two Completely Random Tree Forests working in each layer. After that, several deep ensemble systems were introduced, such as deep ensemble models of incremental classifiers [7], an ensemble of SVM classifiers with AdaBoost for finding model parameters [17], and deep ensemble models for multi-label learning [21]. Nguyen et al. [15] proposed MULES, a deep ensemble system with classifier and feature selection in each layer. The optimization problem was considered under two objectives: maximizing the classification accuracy and the diversity of the EoC in each layer.
Meanwhile, ensemble selection is an additional intermediate stage of the ensemble design process that aims to select a subset of classifiers from the ensemble to achieve higher predictive performance and computational/storage efficiency than using the whole ensemble. Ensemble selection can be formulated as an optimization problem that can be solved by either Evolutionary Computation methods or greedy search approaches. Nguyen et al. [13] used Ant Colony Optimisation (ACO) to search for the optimal combining algorithm and the optimal set of predictions of the selected classifiers in the ensemble system. Hill climbing search-based ensemble pruning, on the other hand, greedily selects the next configuration of selected classifiers in the neighborhood of the current configuration. Two crucial factors of the hill-climbing search-based ensemble pruning strategy are the search direction and the measure used to evaluate the different branches of the search process [16]. For an EoC, the measure determines the best single classifier to be added to or removed from the ensemble to optimize either the performance or the diversity of the ensemble. Examples of evaluation measures are Accuracy, Mean Cross-Entropy, F-score, ROC area [16], and Concurrency [2]. The search can proceed by forward selection, which starts from an empty EoC and adds base classifiers in sequence, or by backward selection, which prunes base classifiers from the whole set of classifiers until reaching the optimal subset. Dai et al. [4] introduced the Modified Backtracking Ensemble Pruning algorithm to enhance the search process in the backtracking method. The redundant solutions in the search space are reduced by using a binary encoding for the classifiers.

Variable-length encoding in evolutionary computation
Several variable-length encoding-based algorithms have been introduced recently. In [19], a VLGA is proposed to search for the best CNN structure. Each chromosome consists of three types of units corresponding to convolutional layers, pooling layers, and fully connected layers. Later in [20], a non-binary VLGA was also proposed to search for the best CNN structure. This variable-length encoding strategy used different representations for different layer types. A skip layer consists of two convolutional layers and one skip connection; its encoding is the number of feature maps of the two convolutional layers within this skip layer. The encoding of the pooling layers is the pooling operation type, i.e. mean pooling or maximum pooling.
For applications, in [18], a GA with variable-length chromosomes was used to solve path optimization problems. The path optimization problem is modeled as an abstract graph. Each chromosome is a set of nodes constituting a feasible solution and therefore has a length equal to the number of nodes. In [6], a GA with variable-length chromosomes was also used to solve path planning problems for an autonomous mobile robot. Each chromosome is a set of positions that represents a valid path solution. The length of the chromosome is the number of intermediate nodes. In [9], the proposed VLGA was also implemented to solve multi-objective multi-robot path planning problems.

Ensemble Selection for Deep Ensemble Systems
Let $\mathcal{D}$ be the training data of $N$ observations $\{(x_n, \hat{y}_n)\}_{n=1}^{N}$, where $x_n$ is the $D$-dimensional feature vector of the $n$-th training instance and $\hat{y}_n$ is its corresponding label. The true label $\hat{y}_n$ belongs to the label set $\mathcal{Y} = \{y_m\}$, $|\mathcal{Y}| = M$. We aim to learn a hypothesis $h$ (i.e., a classifier) that approximates the unknown relationship between a feature vector and its corresponding label, $g : x_n \to \hat{y}_n$, and then use this hypothesis to assign a label to each unlabeled instance. We also denote by $\mathcal{K} = \{\mathcal{K}_k\}_{k=1}^{K}$ the set of $K$ learning algorithms.

In deep ensemble learning consisting of $s$ layers, we train an EoC $\{h^{(1)}_k\}_{k=1}^{K}$ in the first layer by using the $K$ learning algorithms on the original training data $\mathcal{D}$. The first layer also generates the input data for the second layer by using the Stacking algorithm with the set of learning algorithms $\mathcal{K}$. Specifically, $\mathcal{D}$ is divided into $T_1$ disjoint parts of nearly equal cardinality. For each part, we train classifiers on its complement and use these classifiers to predict the observations of this part. Thus, each observation in $\mathcal{D}$ is tested exactly once. For observation $x_n$, let $p^{(1)}_{k,m}(x_n)$ denote the prediction of the $k$-th classifier in the first layer that the observation belongs to the class label $y_m$. The predictions for the $M$ class labels are given in the form of probabilities, so that $\sum_{m=1}^{M} p^{(1)}_{k,m}(x_n) = 1$. We denote by

$$P^{(1)}(x_n) = \left[ p^{(1)}_{1,1}(x_n), \ldots, p^{(1)}_{1,M}(x_n), \ldots, p^{(1)}_{K,1}(x_n), \ldots, p^{(1)}_{K,M}(x_n) \right]$$

the prediction vector of the EoC in the first layer for $x_n$. The prediction vectors for all observations in $\mathcal{D}$ form an $N \times (MK)$ matrix $P_1$.
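The T-fold generation of the prediction matrix described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: `fit_prior` is a hypothetical toy learner (it predicts the class frequencies of its training set) standing in for a real learning algorithm, and the round-robin assignment of observations to folds is an assumption.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# A fitted classifier maps a feature vector to M class probabilities.
Classifier = Callable[[Sequence[float]], Vector]
# A learner maps (features, labels) to a fitted classifier.
Learner = Callable[[List[Sequence[float]], List[int]], Classifier]


def fit_prior(n_classes: int) -> Learner:
    """Hypothetical toy learner: predicts the class frequencies of its training set."""
    def fit(X, y):
        counts = [0] * n_classes
        for label in y:
            counts[label] += 1
        probs = [c / len(y) for c in counts]
        return lambda x: list(probs)
    return fit


def out_of_fold_predictions(X, y, learners: List[Learner], n_folds: int) -> List[Vector]:
    """Return N prediction vectors of length M*K, one per observation.

    Each observation is predicted exactly once, by classifiers trained
    on the folds that do not contain it.
    """
    n = len(X)
    fold_of = [i % n_folds for i in range(n)]  # assumed round-robin split
    preds: List[Vector] = [[] for _ in range(n)]
    for t in range(n_folds):
        train_idx = [i for i in range(n) if fold_of[i] != t]
        test_idx = [i for i in range(n) if fold_of[i] == t]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        for learner in learners:
            clf = learner(X_tr, y_tr)
            for i in test_idx:
                preds[i].extend(clf(X[i]))  # append M probabilities
    return preds


def concatenate(X, P):
    """Next-layer features: original features followed by the predictions."""
    return [list(x) + list(p) for x, p in zip(X, P)]
```

With $K$ learners and $M$ classes, every row returned by `out_of_fold_predictions` has $MK$ entries, and `concatenate` yields rows of $D + MK$ features; the ground-truth label column is kept separately in this sketch.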
Let $L_1$ denote the new data generated by the first layer as the input for the second layer. Normally, $L_1$ is created by concatenating the original training data and the predictions of the classifiers:

$$L_1 = \mathcal{D} \oplus P_1,$$

in which $\oplus$ denotes the concatenation operator between two matrices: $\mathcal{D}$ of size $N \times (D + 1)$ and $P_1$ of size $N \times (MK)$. Thus $L_1$ is an $N \times (D + MK + 1)$ matrix including the $D$ features of the original data, the $MK$ features of the predictions, and the ground truth of the observations. A similar process is conducted on the next layers until reaching the last layer: at the $i$-th layer, we train the EoC of $K$ classifiers $h^{(i)}_k$, $k = 1, \ldots, K$, on the input data $L_{i-1}$ generated by the $(i-1)$-th layer and generate the input data $L_i$ for the $(i+1)$-th layer.

The predictions of the EoC of the last layer, i.e. the $s$-th layer, are combined for the collaborated decision. In this study, we use the Sum rule for combining [14]. For an instance $x$, the Sum rule sums the predictions of the EoC of the last layer for each class label. The label associated with the maximum value is assigned to this instance as follows:

$$h(x) = y_t, \quad t = \arg\max_{m = 1, \ldots, M} \sum_{k=1}^{K} p^{(s)}_{k,m}(x). \qquad (4)$$

In the classification process, each unseen instance is fed forward through the layers until reaching the last layer. The predictions of the $K$ classifiers at the last layer, i.e. $P^{(s)}(\cdot) = [p^{(s)}_{1,1}(\cdot), \ldots, p^{(s)}_{K,M}(\cdot)]$, are combined by the Sum rule in (4) to give the predicted label.

It is recognized that there may exist a subset of the EoC that performs better than the whole ensemble. Moreover, storing only a subset of the ensemble saves computational and storage costs. In this study, we propose an ensemble selection approach for deep ensemble systems. We propose to encode the classifiers in the deep ensemble system using the binary encoding in (5), showing which classifiers are present or absent:

$$E = \left[ e^{(1)}_1, \ldots, e^{(1)}_K, \ldots, e^{(s)}_1, \ldots, e^{(s)}_K \right], \quad e^{(i)}_k = \begin{cases} 1 & \text{if } h^{(i)}_k \text{ is present} \\ 0 & \text{if } h^{(i)}_k \text{ is absent.} \end{cases} \qquad (5)$$

For a deep ensemble system of $s$ layers, since there are $K$ classifiers in each layer, the encoding associated with the model of $s$ layers has $s \times K$ binary elements.
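As an illustration of the Sum rule in (4), the following minimal Python sketch sums the class probabilities over the classifiers of the last layer and returns the index of the class with the largest total. The function name `sum_rule` and the tie-breaking behaviour (smallest class index wins) are assumptions made for this example.

```python
from typing import List, Sequence


def sum_rule(predictions: Sequence[Sequence[float]]) -> int:
    """Return the index t of the class with the largest summed support.

    predictions[k][m] is the probability that classifier k assigns to
    class m; ties are broken in favour of the smallest class index.
    """
    n_classes = len(predictions[0])
    totals: List[float] = [0.0] * n_classes
    for probs in predictions:
        for m, p in enumerate(probs):
            totals[m] += p
    return max(range(n_classes), key=lambda m: totals[m])
```

For three classifiers predicting `[0.6, 0.4]`, `[0.3, 0.7]`, and `[0.2, 0.8]`, the summed supports are 1.1 and 1.9, so the second class is selected.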
It is noted that the length of the proposed encoding is not fixed and depends on the number of layers that we use to construct the deep model. If the number of layers $s$ satisfies $1 \le s \le S$, we have $S$ groups of encodings with the lengths $\{K, 2 \times K, \ldots, S \times K\}$. By using these groups of encodings, we aim to search for the optimal number of layers and the optimal set of classifiers in each layer for the deep ensemble system.
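To make the variable-length encoding concrete, the sketch below splits a flat binary chromosome into chunks of size K and lists the selected classifiers of each layer. The function name `decode` and the 0-based classifier indices are choices made for this example.

```python
from typing import List


def decode(encoding: List[int], k: int) -> List[List[int]]:
    """Split a flat binary encoding into per-layer lists of selected classifier indices."""
    if len(encoding) == 0 or len(encoding) % k != 0:
        raise ValueError("encoding length must be a positive multiple of k")
    layers = []
    for start in range(0, len(encoding), k):
        chunk = encoding[start:start + k]  # one layer = one chunk of k genes
        layers.append([j for j, bit in enumerate(chunk) if bit == 1])
    return layers
```

For example, with K = 3 the encoding `[1, 0, 1, 0, 1, 0]` describes a 2-layer model that uses the first and third classifiers in layer 1 and the second classifier in layer 2. How a layer with no selected classifier should be treated is not stated in the text, so the sketch simply returns an empty list for it.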

Optimization Problems and Algorithm
We consider optimization for the model selection problem. The objective is to maximize the accuracy of the classification task on a validation set $\mathcal{V}$:

$$\max_{E} \; \frac{1}{|\mathcal{V}|} \sum_{(x, \hat{y}) \in \mathcal{V}} [\![ h_E(x) = \hat{y} ]\!], \qquad (6)$$

where $h_E$ is the combining model using the Sum rule in (4) associated with encoding $E$, $|\cdot|$ denotes the cardinality of a set, and $[\![ \cdot ]\!]$ equals 1 if the condition is true and 0 otherwise. In this study, we develop a VLGA to solve this optimization problem. Genetic Algorithm (GA) is a search heuristic inspired by Charles Darwin's theory of natural evolution. It is widely recognized that GA commonly generates high-quality solutions for search problems [8]. Three operators of GA are considered in this study:

Selection: We apply the roulette wheel selection approach to select a pair of individuals for breeding. The probability of choosing an individual from the population is proportional to its fitness, so an individual has a higher chance of being chosen if its fitness is higher than those of the others. The probability of choosing the $i$-th individual is

$$p_i = \frac{f_i}{\sum_{j=1}^{\mathit{popSize}} f_j}, \qquad (7)$$

where $f_i$ is the accuracy of the deep ensemble model with the configuration of the $i$-th individual and $\mathit{popSize}$ is the size of the current generation.

Crossover: We define the probability $P_c$ for the crossover process, in which crossover occurs if the generated crossover probability is smaller than $P_c$. Here we develop a chunk-based crossover operator to generate new offspring. As mentioned before, since there are at most $K$ classifiers in each layer of a deep ensemble system with $s$ layers, each chromosome is given in the form of $s$ chunks in which the chunk size is $K$. On two selected parents with $s_1$ and $s_2$ layers, we generate two random numbers $r_1$ and $r_2$ which are multiples of $K$, i.e. $r_1 \in \{K, 2 \times K, \ldots, s_1 \times K\}$ and $r_2 \in \{K, 2 \times K, \ldots, s_2 \times K\}$. $r_1$ and $r_2$ divide each parent into two parts. Each parent exchanges its tail with the other while retaining its head.

Fig. 1. Illustration of the chunk-based crossover operator.
After crossover is performed, we have two new offspring chromosomes. We illustrate in Fig 1 how chunk-based crossover works on a deep ensemble model with 3 classifiers in each layer. Parent 1 encodes a 3-layer deep ensemble model, while parent 2 encodes a 4-layer deep ensemble model. On parents 1 and 2, two random numbers are generated as r 1 = 3 and r 2 = 9. By retaining heads and exchanging tails on these parents, we obtain two new offspring: the first one encodes a 2-layer deep ensemble, and the second encodes a 5-layer deep ensemble. By using this crossover operator, we can generate offspring with sizes different from those of their parents, thus improving the exploration of the search process.
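The chunk-based crossover can be sketched as follows. The function name `chunk_crossover` and the optional `r1`, `r2`, and `rng` arguments are assumptions made so that the example in the text (K = 3, cut points 3 and 9) can be reproduced deterministically; when the cut points are omitted they are drawn uniformly from the multiples of K, which is also an assumption.

```python
import random
from typing import List, Optional, Tuple


def chunk_crossover(p1: List[int], p2: List[int], k: int,
                    r1: Optional[int] = None, r2: Optional[int] = None,
                    rng: Optional[random.Random] = None) -> Tuple[List[int], List[int]]:
    """Swap the tails of two parents at cut points that are multiples of k."""
    rng = rng or random.Random()
    if r1 is None:
        r1 = k * rng.randint(1, len(p1) // k)  # r1 in {k, 2k, ..., s1*k}
    if r2 is None:
        r2 = k * rng.randint(1, len(p2) // k)  # r2 in {k, 2k, ..., s2*k}
    child1 = p1[:r1] + p2[r2:]  # head of parent 1 + tail of parent 2
    child2 = p2[:r2] + p1[r1:]  # head of parent 2 + tail of parent 1
    return child1, child2
```

With a 9-gene parent (3 layers), a 12-gene parent (4 layers), and cut points 3 and 9, the offspring have 6 and 15 genes, i.e. 2 and 5 layers. The sketch does not cap the offspring length at S layers; whether and how such a cap is enforced is not stated in the text.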
Mutation: Mutation operators introduce genetic diversity from one generation of a population to the next. Mutation also helps prevent the algorithm from becoming trapped in local optima by making the chromosomes in the population different from each other. We define the probability $P_m$ for the mutation process, in which mutation occurs if the generated mutation probability is smaller than $P_m$. In this study, we propose to apply a multiple point-based mutation operator to an offspring. First, we generate several random numbers which give the positions of the mutated genes in a chromosome. The values of these mutated genes are then flipped, i.e., from 0 to 1 or from 1 to 0. In this way, we obtain a new offspring which may differ entirely from the previous one; consequently, the GA can escape from local optima and reach a better solution.
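A minimal sketch of the multiple point-based mutation is given below. The text does not specify how many positions are mutated, so drawing the number of points uniformly between 1 and the chromosome length is an assumption, as are the function name `multi_point_mutation` and its `n_points` and `rng` parameters.

```python
import random
from typing import List, Optional


def multi_point_mutation(chromosome: List[int], n_points: Optional[int] = None,
                         rng: Optional[random.Random] = None) -> List[int]:
    """Return a copy of the chromosome with n_points distinct genes flipped."""
    rng = rng or random.Random()
    if n_points is None:
        n_points = rng.randint(1, len(chromosome))  # assumed distribution
    positions = rng.sample(range(len(chromosome)), n_points)
    mutated = list(chromosome)
    for pos in positions:
        mutated[pos] = 1 - mutated[pos]  # flip 0 -> 1 or 1 -> 0
    return mutated
```

The operator leaves the chromosome length, and therefore the number of layers, unchanged; only the selection of classifiers within the layers is altered.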
The pseudo-code of the VLGA is presented in Algorithm 1. The algorithm takes as inputs the training data D, the validation data V, and some parameters for the evolutionary process (the population size popSize, the number of generations nGen, the crossover probability $P_c$, and the mutation probability $P_m$). We first randomly generate a population with popSize individuals and then calculate the fitness of each individual on V by using Algorithm 2 (Steps 1 and 2). The probabilities for individual selection are computed by using (7) in Step 3. Two selected parents breed a pair of offspring if they satisfy the crossover check (Steps 7-8). These offspring then pass through mutation, in which some random positions are changed if mutation occurs (Steps 15-21). We also calculate the fitness of each offspring on V by using Algorithm 2 before adding it to the population. Steps 6-23 are repeated until popSize new offspring have been generated. From the population of 2 × popSize individuals, we keep the popSize best individuals for the next generation. The algorithm runs until it reaches the number of generations. We select the candidate with the best fitness value from the last population as the solution of the problem.
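The selection and survivor steps of the evolutionary loop can be sketched as follows. This is an illustration only: the function names are hypothetical, the fitness values are assumed to be positive, and the roulette wheel is implemented with a simple cumulative sum.

```python
import random
from typing import List, Sequence


def selection_probabilities(fitness: Sequence[float]) -> List[float]:
    """Selection probabilities as in (7): fitness divided by total fitness."""
    total = sum(fitness)  # assumed to be positive
    return [f / total for f in fitness]


def roulette_pick(probs: Sequence[float], rng: random.Random) -> int:
    """Draw one index with probability proportional to probs."""
    u = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def keep_best(population: list, fitness: Sequence[float], pop_size: int) -> list:
    """Keep the pop_size individuals with the highest fitness."""
    ranked = sorted(zip(fitness, population), key=lambda fp: fp[0], reverse=True)
    return [individual for _, individual in ranked[:pop_size]]
```

In a full loop, `roulette_pick` would be called twice to obtain a pair of parents, and `keep_best` would be applied to the 2 × popSize individuals at the end of each generation.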
Algorithm 2 calculates the fitness of an encoding and generates the associated deep model. The algorithm takes as inputs the training data D, the validation data V, an encoding E, and the number of folds T. From the configuration of E, we obtain the number of layers and which classifiers are selected in each layer. On the $i$-th layer, we perform two steps: (i) train the selected classifiers on the whole input data $L_{i-1}$ (Step 4) and (ii) generate the training data for the $(i+1)$-th layer by using T-fold Cross-Validation and the concatenation operator between the prediction data and the original training data (Steps 7-14). The classifiers $\{h^{(i)}_k\}$ predict on $V_{i-1}$, which is the prediction matrix for the observations of V at the $(i-1)$-th layer, to obtain the prediction $P_i(V)$ (Step 6). $P_i(V)$ is also concatenated with V to obtain the validation data for the $(i+1)$-th layer. After running through the last layer, i.e. the $s$-th layer, we apply the Sum rule to the prediction $P_s(V)$ to obtain the fitness value of E. We also obtain the classifiers $\{h^{(i)}_k\}$ of the deep model associated with E.

In the classification process, we assign a class label to an unlabeled test sample. In each layer, the input test data is predicted by the classifiers and then concatenated with the original test sample to generate the new test data for the next layer. The combining function in (4) is applied to the outputs of the classifiers of the last layer to give the final prediction.
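The feed-forward classification process described above can be sketched as follows. The fitted classifiers are represented as plain callables that map a feature vector to class probabilities; this representation and the function name `predict_deep` are assumptions for the example, and every layer is assumed to contain at least one selected classifier.

```python
from typing import Callable, List, Sequence

# A fitted classifier maps a feature vector to M class probabilities.
Classifier = Callable[[Sequence[float]], List[float]]


def predict_deep(x: Sequence[float], layers: List[List[Classifier]]) -> int:
    """Feed x through the layers and apply the Sum rule at the last layer."""
    features = list(x)
    outputs: List[List[float]] = []
    for classifiers in layers:
        outputs = [clf(features) for clf in classifiers]
        # Input of the next layer: original sample followed by this layer's predictions.
        features = list(x) + [p for probs in outputs for p in probs]
    n_classes = len(outputs[0])
    totals = [sum(probs[m] for probs in outputs) for m in range(n_classes)]
    return max(range(n_classes), key=lambda m: totals[m])
```

Applying `predict_deep` to every validation sample and counting the fraction of correct labels gives the accuracy that serves as the fitness of an encoding.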

Experimental Settings
We conducted the experiments on 20 datasets collected from different sources such as the UCI Machine Learning Repository and OpenML. We used 5 classifiers in each layer of VEGAS, in which these classifiers were generated by using the learning algorithms: Naïve Bayes classifiers with Gaussian distribution, XgBoost with 200 estimators, Random Forest with 200 estimators, and Logistic Regression. We used 5-fold Cross-Validation in each layer to generate the new training data for the next layer. 20% of the training data was used for validation purposes [22]. For the VLGA, the maximum number of generations was set to 50, the population size was set to 100, and the crossover and mutation probabilities were set to 0.9 and 0.1, respectively.
VEGAS was compared to several algorithms, including ensemble methods and deep learning models. Three well-known ensemble methods were used as the benchmark algorithms: Random Forest, XgBoost, and Rotation Forest. All these methods were constructed by using 200 learners. Three deep learning models were compared with VEGAS: gcForest (4 forests with 200 trees in each forest) [22], MULES [15], and Multiple Layer Perceptron (MLP). For MULES, we used the parameter settings of the original paper [15]. It is noted that the performance of MLP significantly depends on the network structure. To ensure a fair comparison, we experimented with MLP on a number of different network configurations: input-30-20-output, input-50-30-output, and input-70-50-output, following the experiments in [22]. We then reported the best performance of MLP among all configurations and used this result to compare with VEGAS. We used the Friedman test to compare the performance of the experimental methods on the experimental datasets. If the P-value of this test is smaller than a significance threshold, e.g. 0.05, we reject the null hypothesis and conduct the Nemenyi post-hoc test to compare each pair of methods [5].

Algorithm 1 Variable length Genetic Algorithm
Require: Training data D, validation data V, population size popSize, number of generations nGen, crossover probability Pc, mutation probability Pm
Ensure: Optimal configuration
 1: Randomly generate population
 2: Calculate fitness on V of each individual using Algorithm 2
 3: Calculate selection probabilities by (7)
 4: for i ← 1, nGen do
 5:   while current population size < 2 × popSize do
 6:     Select a pair of parents based on the selection probabilities
 7:     Generate a random number rc ∈ [0, 1]
 8:     if rc ≤ Pc then
 9:       Generate two random numbers r1, r2 which are multiples of K
10:       Divide parents into head and tail based on r1 and r2
11:       Swap tails of two parents to create two new offspring
        ...
15:     Generate a random number rm ∈ [0, 1]
16:     if rm ≤ Pm then
17:       for each offspring do
18:         Generate random number r of mutation points
19:         Flip the binary value associated with mutation points
20:       end for
21:     end if
22:     Calculate fitness on V of each offspring using Algorithm 2
23:     Add offspring to the population
24:   end while
25:   Keep popSize best individuals for the next generation
26:   Calculate selection probabilities by (7)
27: end for
28: Return individual (encoding and associated deep model) with the best fitness from the last generation

Algorithm 2
    ...
 7: for t ← 1, T do  % generate the training data for the next layer
    ...
 9:   1 ≤ j1, j2 ≤ T, j1 ≠ j2
10:   for all L ... do
        Use these classifiers to predict on L ...
    ...

Comparison to the benchmark algorithms
Table 1 shows the prediction accuracy of VEGAS and the benchmark algorithms. Based on the Friedman test, we reject the null hypothesis that all methods perform equally. The Nemenyi test in Fig 2 shows that VEGAS is better than all benchmark algorithms. In detail, VEGAS performs the best among all methods on 15 datasets. VEGAS ranks second on 5 datasets, and the differences in prediction accuracy between VEGAS and the first-ranked method are not significant (0.9610 vs 0.9756 of gcForest on the Breast-cancer dataset, for example). The outstanding performance of VEGAS over the benchmark algorithms comes from (i) the ...
Surprisingly, Random Forest ranks higher than the other benchmark algorithms in our experiment. Random Forest ranks first on two datasets, Hayes-Roth and Wine-white (about 2% better than VEGAS in prediction accuracy). In contrast, VEGAS is significantly better than Random Forest on some datasets such as Hill-valley (about 30% better), Sonar (about 6% better), Vehicle (about 6% better), and Tic-Tac-Toe (about 8% better).
MULES and XgBoost rank in the middle in our experiment. VEGAS outperforms MULES on all datasets. MULES looks for the optimal EoC of each layer by considering two objectives: maximising accuracy and diversity. Meanwhile, VEGAS learns the optimal configuration for all layers of the deep ensemble. This demonstrates the efficiency of the optimisation method of VEGAS, i.e. the VLGA. For MLP, although we ran the experiments on its 3 different configurations and reported its best result for the comparison, this method is worse than VEGAS on up to 18 datasets. gcForest is the worst among all methods on the experimental datasets. On some datasets such as Conn-bench-vowel, Hill-valley, Sonar, Texture, Tic-Tac-Toe, Vehicle, and Wine-white, gcForest performs poorly, far worse than VEGAS.
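For readers who wish to reproduce the statistical comparison, the Friedman test ranks the methods on each dataset (rank 1 for the highest accuracy) and compares their average ranks. The sketch below computes the average ranks and the Friedman chi-square statistic $\chi^2_F = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]$ for $N$ datasets and $k$ methods. It is a plain-Python illustration without tie correction of the statistic; the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(scores: Sequence[Sequence[float]]) -> List[float]:
    """scores[d][j] is the accuracy of method j on dataset d; rank 1 is the best."""
    n_methods = len(scores[0])
    rank_sums = [0.0] * n_methods
    for row in scores:
        order = sorted(range(n_methods), key=lambda j: -row[j])
        pos = 0
        while pos < n_methods:
            end = pos
            while end + 1 < n_methods and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # average rank for tied methods
            for j in order[pos:end + 1]:
                rank_sums[j] += shared
            pos = end + 1
    return [s / len(scores) for s in rank_sums]


def friedman_statistic(scores: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic computed from the average ranks."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
```

Under the null hypothesis that all methods perform equally, the statistic approximately follows a chi-square distribution with $k - 1$ degrees of freedom, from which the P-value is obtained.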

Discussions
VEGAS requires more training time than the two deep ensemble models gcForest and MULES. On the Tic-Tac-Toe dataset, for example, gcForest used only 311.78 seconds for the training process compared to 15192.08 seconds for VEGAS. On the other hand, although VEGAS creates more layers than gcForest and MULES (6.4 vs 3.8 and 2.75 on average), the classification time of VEGAS is competitive with those of gcForest and MULES. That is because on some datasets, VEGAS selects a small number of classifiers in each layer. On the Tic-Tac-Toe dataset, VEGAS takes only 0.016 seconds for classification with 4 layers and 6 classifiers in total. Meanwhile, gcForest (6 layers and 4800 classifiers in total) used 0.62 seconds to classify all test instances, and MULES (4 layers with 11 classifiers in total) used 0.26 seconds with its selected configuration [15].

Conclusions
Deep ensemble models have further improved the predictive accuracy of one-layer ensemble models. However, the presence of unsuitable classifiers in each layer reduces the predictive performance and the computational/storage efficiency of the deep models. In this study, we have introduced an ensemble selection method for deep ensemble systems called VEGAS. We design the deep ensemble system with multiple layers of EoCs. The training data is propagated through the layers by concatenating the predictions of the classifiers in the preceding layer with the original training data. The predictions of the classifiers in the last layer are combined by a combining method to obtain the final collaborated prediction. We proposed the VLGA to search for the optimal configuration, which maximizes the prediction accuracy of the deep ensemble model on each dataset. Three operators of the VLGA were considered in this study, namely selection, crossover, and mutation. The experiments on 20 datasets show that VEGAS is better than both well-known ensemble methods and other deep ensemble methods.