Weighted ensemble of deep learning models based on comprehensive learning particle swarm optimization for medical image segmentation.


School of Computing, Robert Gordon University, Aberdeen, UK
Email: t.dang1@rgu.ac.uk, t.nguyen11@rgu.ac.uk, c.moreno-garcia@rgu.ac.uk, e.elyan@rgu.ac.uk, j.mccall@rgu.ac.uk

Abstract-In recent years, deep learning has rapidly become a method of choice for segmentation of medical images. Deep neural architectures such as UNet and FPN have achieved high performance on many medical datasets. However, medical image analysis algorithms are required to be reliable, robust, and accurate for clinical applications, which can be difficult to achieve for some single deep learning methods. In this study, we introduce an ensemble of classifiers for semantic segmentation of medical images. The ensemble of classifiers here is a set of various deep learning-based classifiers, aiming to achieve better performance than using a single classifier. We propose a weighted ensemble method in which the weighted sum of segmentation outputs by classifiers is used to choose the final segmentation decision. We use a swarm intelligence algorithm, namely Comprehensive Learning Particle Swarm Optimization, to optimize the combining weights. Dice coefficient, a popular performance metric for image segmentation, is used as the fitness criterion. Experiments conducted on some medical datasets of the CAMUS competition on cardiographic image segmentation show that our method achieves better results than both the constituent segmentation models and the reported model of the CAMUS competition.
Index Terms-image segmentation, deep neural networks, ensemble learning, ensemble method, particle swarm optimization

I. INTRODUCTION
Image segmentation is the process of partitioning an input image into regions which correspond to different objects or parts of an object. Segmentation of medical images is considered very important in providing noninvasive information about human body structure [40], which has a vital role in numerous biomedical imaging applications, such as tissue volume quantification, diagnosis, pathology localization, study of anatomical structure, treatment planning, and computer-integrated surgery [42]. Automation of segmentation to integrate into clinical processes is therefore desirable. With the success of deep learning in image classification [14] in 2012, practitioners in medical image analysis took notice of these developments and applied them to the segmentation of medical images. It is well known that localization and interpolation of anatomical structures in medical images, which is a key step in radiological workflow, was performed by handcrafting image filters to extract anatomical signatures [15]. This is a time-consuming process and requires expert knowledge. By applying deep learning, practitioners have been able to achieve many successes such as organ detection using 3D dynamic contrast-enhanced MRI scans over a period of time [17] and automatic segmentation in brain images using a 3D convolutional deep learning architecture on mini-batches of multiple cubes of brain data [18].
Deep learning methods are highly effective when the number of samples available during the training stage is large. For example, in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), the dataset contained 1 million annotated images. Medical datasets meanwhile are considerably smaller, typically fewer than 1,000 images [15]. This poses a problem for creating deep models for medical imaging which are robust against overfitting. Another problem is that training deep neural networks using popular optimizers such as Stochastic Gradient Descent (SGD) generally requires much manual tuning of optimization parameters such as learning rates and convergence criteria [41]. In recent years there have been many alternative optimization methods for deep learning which require less parameter tuning, such as Adam [43]. However, these methods do not generalize as well as traditional methods such as SGD [44]. The manual parameter tuning makes it challenging to select suitable deep models for a specific problem. A solution to these difficulties is to combine multiple deep learning models trained on medical image datasets, which can yield better predictions than individual deep models.
Ensemble learning is a popular machine learning technique in which multiple learning methods are combined to solve a computational intelligence problem. The study in [48] tested 179 classifiers on 121 datasets, and the results indicated that ensemble-based methods achieved the top ranks. In this study, we introduce an ensemble of deep learning methods for the problem of semantic medical image segmentation. The ensemble includes a number of different segmentation algorithms whose outputs are combined by a combining algorithm to obtain the combined prediction. It is recognized that different segmentation algorithms will perform well on different subsets of examples because of the nature and size of the training sets they have been exposed to and because of method-intrinsic factors. Therefore we focus on improving the effectiveness of the ensemble by using a weight-based combining method. On a particular problem, some segmentation algorithms will contribute more to the final combining result by being assigned larger weights than the others. The final prediction is made by using a weighted sum of the outputs of the segmentation algorithms. The weights are chosen to maximize the Dice coefficient, which is a popular performance metric in segmentation, based on a cross-validation procedure on the training data. We empirically compared our proposed ensemble with some well-known deep learning benchmark algorithms on several medical datasets of the CAMUS competition on cardiographic image segmentation [38].
In Section 2, we briefly introduce ensemble learning, the weighted combining model, Particle Swarm Optimization and Comprehensive Learning, and techniques for the medical image segmentation problem. In Section 3, we give a detailed description of the proposed ensemble. Experimental studies on a number of datasets are provided in Section 4, followed by conclusions in Section 5.

II. BACKGROUND AND RELATED WORK
A. Ensemble System and Weighted Combining Model
Ensemble systems are typically built by generating diverse classifiers and then combining them to make a final decision. The first stage is done by training a learning algorithm on multiple training sets generated from the original training data, or by training different learning algorithms on the original training data, to generate an Ensemble of Classifiers (EoC) [1], [2]. The second stage uses a combining method working on the predictions of the generated classifiers for the final decision. Fixed combining methods are frequently applied to the predictions of classifiers to predict class labels. Popular fixed combining methods use fixed combining rules such as the Sum Rule, Product Rule, Min Rule, Max Rule, Median Rule, and Majority Vote Rule [3]. In simple fixed combining rules, all classifiers are treated equally in the aggregation step, i.e., all classifiers make an equal contribution to the final combined prediction. It is recognized that the equal contribution of classifiers may degrade the performance of the EoC because classifiers perform differently on a particular dataset and some classifiers need to contribute more than others. The weighted combining model, in contrast, assumes that each classifier contributes to the combining result with a different weight. The weights and predictions are used to generate a set of combinations associated with the class labels. The predicted class label for a sample is then decided by selecting the maximum value among these combinations. There are several techniques to obtain the combining weights. Nguyen et al. [1] searched for the weights by minimizing the distance between these combinations computed on the training data and the class labels of the training observations given in crisp form. Zhang and Zhou [4] proposed using linear programming to find the weights. Sen et al. [5] searched for the combining weights by minimizing the hinge loss function of the combination and the class labels of the training data. Pacheco et al. [29] performed ensemble selection and pruning of deep learning classifiers by learning the Dirichlet distribution of the output probabilities and optimizing the weights dynamically using a loss function based on Mahalanobis distance.

B. Particle Swarm Optimization and Comprehensive Learning
Evolutionary Computation (EC) is a family of algorithms inspired by biological evolution for global optimization. One of the most popular methods of EC is Particle Swarm Optimization (PSO) [6], a swarm-based algorithm inspired by the emergent motion of a flock of birds searching for food. This algorithm simultaneously performs local exploitation within each particle and global exploration across the whole swarm. For a $U$-dimensional optimization problem, PSO maintains a number of particles whose positions are defined by $x_i = (x_i^1, x_i^2, \ldots, x_i^U)$, and a velocity vector $v_i = (v_i^1, v_i^2, \ldots, v_i^U)$ is associated with each particle $x_i$. PSO ensures each particle learns from the whole swarm during its search by updating each particle's velocity based on its current velocity, local best position, and global best position.
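The velocity update described above can be sketched as follows. This is a minimal illustration of the canonical PSO update rather than code from this work: the function name `pso_step`, the inertia weight `w`, and the acceleration constants `c1` and `c2` (with their default values) are assumptions chosen for the example.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, rng, w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO update for a swarm stored as (n_particles, U) arrays.

    w, c1 and c2 are illustrative values, not settings from the paper.
    """
    r1 = rng.random(x.shape)  # random factors for the local-best term
    r2 = rng.random(x.shape)  # random factors for the global-best term
    # New velocity mixes current velocity, pull to pbest, and pull to gbest.
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v_new, v_new

rng = np.random.default_rng(0)
x = rng.random((4, 3))            # 4 particles in a 3-dimensional space
v = np.zeros_like(x)
x_next, v_next = pso_step(x, v, pbest=x.copy(), gbest=x[0], rng=rng)
```

Because every particle is pulled towards the same `gbest`, the swarm can collapse onto a single region early, which is the premature convergence that comprehensive learning is designed to mitigate.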
Because all particles learn from the global best position, PSO can converge prematurely at a local optimum [8]. In 2006, Liang et al. proposed Comprehensive Learning PSO (CLPSO) [8], which addresses this shortcoming by having each particle learn from all particles' local best positions. Specifically, each $U$-dimensional particle also has a $U$-dimensional exemplar vector $e_i = (e_i^1, e_i^2, \ldots, e_i^U)$ for comprehensive learning. The exemplar vector is introduced so that a particle can learn from its own local best (pbest) as well as those of all the other particles: the entry $e_i^u$ is the index of the particle whose pbest is used in the $u$-th dimension. For example, a particle with the position (0.13, 0.43, 0.22, 0.74, 0.11), the velocity (0.48, 0.25, 0.52, 0.13, -0.15), and the exemplar (6, 8, 4, 8, 4) would update its 3rd-dimension position value based on the 3rd-dimension position value of the 4th particle's pbest.
A particle is randomly assigned an exemplar vector at initialization. When a particle's pbest does not improve after a number of iterations, the exemplar is updated. To choose which particle to learn from in each dimension, the algorithm randomly selects two different particles, and the one with the higher fitness value is assigned as the exemplar for the updated particle in the corresponding dimension [8], [9]. Therefore, only one acceleration constant $c$ is needed. The update equation is given by:

$$v_i^u \leftarrow a\, v_i^u + c\, r_1 \left( pbest_{e_i^u}^{u} - x_i^u \right) \tag{1}$$

in which $a$ is the inertia weight which controls the velocity speeding rate, $c$ is an acceleration constant used to control the learning rate of the exemplars' local best, $pbest_{e_i^u}^{u}$ is the $u$-th dimension of the best position of the particle referred to by the $u$-th dimension of exemplar $e_i$, and $r_1$ is a random number drawn from a uniform distribution over [0, 1]. Considering that CLPSO has demonstrated state-of-the-art global search capabilities in various applications [45], such as optimizing reactive power dispatch [46] and optimizing network security [47], in this paper we use CLPSO as an optimization routine for our proposed method.
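To make the exemplar mechanism concrete, the sketch below implements the two-particle tournament and the single-constant velocity update described above. It is a simplified illustration under stated assumptions: the function names `choose_exemplars` and `clpso_step` are hypothetical, the inertia weight default `a=0.7` is an arbitrary choice, the random factor is drawn independently per dimension, and the learning-probability mechanism of the original CLPSO is omitted.

```python
import numpy as np

def choose_exemplars(pbest_fitness, n_dims, rng):
    """For each dimension, draw two distinct particles and keep the fitter one.

    Higher fitness is treated as better (e.g. a Dice coefficient).
    """
    n = len(pbest_fitness)
    exemplar = np.empty(n_dims, dtype=int)
    for u in range(n_dims):
        p, q = rng.choice(n, size=2, replace=False)
        exemplar[u] = p if pbest_fitness[p] >= pbest_fitness[q] else q
    return exemplar

def clpso_step(x_i, v_i, pbest, exemplar_i, rng, a=0.7, c=1.494):
    """Update one particle: v <- a*v + c*r*(pbest[e_u, u] - x_u), then x <- x + v.

    pbest has shape (n_particles, U); exemplar_i holds 0-based particle indices.
    """
    dims = np.arange(x_i.size)
    target = pbest[exemplar_i, dims]   # per-dimension pbest of the chosen exemplar
    r = rng.random(x_i.size)           # assumption: one random factor per dimension
    v_new = a * v_i + c * r * (target - x_i)
    return x_i + v_new, v_new

# The worked example from the text, with the exemplar converted to 0-based indices.
rng = np.random.default_rng(0)
pbest = rng.random((8, 5))                      # 8 particles, U = 5 (made-up values)
x = np.array([0.13, 0.43, 0.22, 0.74, 0.11])
v = np.array([0.48, 0.25, 0.52, 0.13, -0.15])
e = np.array([6, 8, 4, 8, 4]) - 1
x_next, v_next = clpso_step(x, v, pbest, e, rng)
```

Here the third dimension of `x` moves towards `pbest[3, 2]`, i.e. the third coordinate of the 4th particle's pbest, matching the example in the text.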

C. Medical Image Segmentation
Many research efforts have been made to apply deep learning to medical image segmentation. An example is UNet [20], which consists of an equal number of upsampling and downsampling layers. Each downsampling layer has a skip connection which concatenates its output feature map with the input of the corresponding upsampling layer. This allows the network to take into account the full context of the whole image, which is beneficial when performing the segmentation task. Other authors have extended this architecture to handle 3D medical data, such as VNet [22], which performs 3D image segmentation using 3D convolutional layers with an objective function based on the Dice coefficient. Although these specific architectures achieved remarkable results, many authors have also obtained excellent segmentation results via patch-based deep neural networks. One of the earliest papers on applying deep learning to medical image segmentation performed pixel-wise segmentation of membranes in electron microscopy imagery in a sliding-window fashion [25]. More recent papers use architectures based on the Fully Convolutional Neural Network (fCNN) [26] rather than sliding windows because of their computational efficiency. A notable example is vertebral body segmentation in MR images using 3D fCNNs to generate vertebral body likelihood maps for deformable models [27]. Some researchers have also applied graphical models such as Markov Random Fields (MRFs) [28] and Conditional Random Fields (CRFs) [19] on top of the likelihood maps produced by fCNNs to act as label regularizers.

III. PROPOSED METHOD
Let $D$ be the training set of $N$ observations $\{(I_n, Y_n)\}_{n=1}^{N}$, where $I_n = I_n(i, j)$, $1 \le i \le W$, $1 \le j \le H$, is an image in the training set and $Y_n$ is its corresponding ground truth. Each image is given with a number of channels. In this study, we work on grayscale images which have only one channel. The ground truth $Y_n$ is also an image of size $W \times H$ in which each pixel $Y_n(i, j) \in \mathcal{Y}$, where $\mathcal{Y} = \{y_m\}_{m=1}^{M}$ is a set of labels. In total, we have $N \times W \times H$ pixels and their corresponding labels. For the semantic image segmentation problem, we aim to learn a hypothesis $h$ (i.e., a classifier) based on the relationship between each pixel $I_n(i, j)$ and its corresponding label $Y_n(i, j)$ in the training data, and then use this hypothesis to assign a label to each pixel of an unsegmented image. The classifier $h$ is obtained by training a segmentation algorithm on the training data $D$. Given an image, $h$ assigns a class label to each pixel, and the segmentation result for all pixels of the input image constitutes the segmented image.
We develop an EoC for solving the image segmentation problem. We denote $\mathcal{K} = \{K_k\}_{k=1}^{K}$ as the set of $K$ segmentation algorithms. In the ensemble, we train an EoC including $K$ different classifiers $\{h_k\}_{k=1}^{K}$ and then use a combining algorithm $C$ to form the final decision: $\hat{h} = C\{h_k\}_{k=1}^{K}$. The EoC $\{h_k\}_{k=1}^{K}$ is generated by training the $K$ segmentation algorithms on the training set $D$. We then generate the predictions for the pixels of the training images and train the combining algorithm on these predictions. In detail, we use the Stacking algorithm [2] to generate the predictions for pixels of training images. First, we divide the training set $D$ into $T$ disjoint parts $D_1, \ldots, D_T$ and let $\tilde{D}_i = D \setminus D_i$. The segmentation algorithm $K_j$ is trained on $\tilde{D}_i$ to obtain a classifier $C_j^i$. $C_j^i$ then works on the images in $D_i$ to output, for each pixel, the probability reflecting how strongly the classifier supports each class label. The predictions for an image $I$ are given in a $(W \times H) \times (M \times K)$ matrix $P(I)$:

$$P(I) = \begin{bmatrix} P_1(y_1|I(1,1)) & \cdots & P_1(y_M|I(1,1)) & \cdots & P_K(y_1|I(1,1)) & \cdots & P_K(y_M|I(1,1)) \\ \vdots & & \vdots & & \vdots & & \vdots \\ P_1(y_1|I(W,H)) & \cdots & P_1(y_M|I(W,H)) & \cdots & P_K(y_1|I(W,H)) & \cdots & P_K(y_M|I(W,H)) \end{bmatrix} \tag{2}$$

in which $P_k(y_m|I(i, j))$ is the probability that the pixel $I(i, j)$ belongs to the class label $y_m$ given by the classifier generated by using $K_k$, for each $k = 1, \ldots, K$; $m = 1, \ldots, M$, and $\sum_{m=1}^{M} P_k(y_m|I(i, j)) = 1$ [12], [13]. The predictions for all images in the training set $D$ are given by an $(N \times W \times H) \times (M \times K)$ matrix $P$:

$$P = \begin{bmatrix} P(I_1) \\ \vdots \\ P(I_N) \end{bmatrix} \tag{3}$$

The next step is to train the combining algorithm on $P$. There are two combining models developed for ensemble systems, namely the representation-based model and the weighted combining-based model [13]. The representation-based model creates $M$ representations for the $M$ class labels from the predictions on the training data and then assigns the class label associated with the biggest value among the similarities (or the smallest value among the dissimilarities) between the prediction for each test sample and the $M$ representations [2], [12], [13]. Meanwhile, in the weighted combining-based model, classifiers contribute differently to the combination through different combining weights. The weights may vary for each classifier or for each classifier-class label pair. In this study, we use a weighted combining-based model based on the weight matrix $\mathbf{W} = \{w_{k,m}\}$, in which $w_{k,m}$ is the weight of the $k$-th classifier on the $m$-th class ($k = 1, \ldots, K$; $m = 1, \ldots, M$). Since the ground truths of the training images are given in advance, the weights of classifiers on the class labels can be obtained by discovering the relationship between the predictions $P$ and the class labels of the pixels of the training images. First, the class membership of a pixel $I(i, j)$ associated with the class $y_m$ is obtained by a linear combination of the predictions and the associated weights as:

$$CM_m(I(i, j)) = \sum_{k=1}^{K} w_{k,m}\, P_k(y_m|I(i, j)) = P_m \mathbf{W}_m \tag{4}$$

with $P_m = [P_1(y_m|I(i, j)), P_2(y_m|I(i, j)), \ldots, P_K(y_m|I(i, j))]$ and $\mathbf{W}_m = [w_{1,m}, \ldots, w_{K,m}]^T$. We then compare the class memberships associated with the class labels and assign the class label $y_s$ to pixel $I(i, j)$ if its associated class membership is the biggest among all memberships:

$$I(i, j) \in y_s \quad \text{if} \quad s = \arg\max_{m = 1, \ldots, M} CM_m(I(i, j)) \tag{5}$$
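The class-membership rule described above amounts to a weighted sum over classifiers followed by an arg-max over classes. The following NumPy sketch illustrates it on a single pixel; the function name `combine`, the three-dimensional array layout `(n_pixels, K, M)`, and the numerical values are assumptions made for the example, not the data layout or weights used in this work.

```python
import numpy as np

def combine(probs, weights):
    """Weighted combining of K classifiers over M classes.

    probs:   (n_pixels, K, M) array, probs[p, k, m] = P_k(y_m | pixel p)
    weights: (K, M) array, weights[k, m] = w_{k,m}
    Returns the class memberships (n_pixels, M) and the arg-max label per pixel.
    """
    cm = np.einsum("pkm,km->pm", probs, weights)  # sum over classifiers k
    return cm, cm.argmax(axis=1)

# One pixel, K = 2 classifiers, M = 2 classes (made-up probabilities).
probs = np.array([[[0.6, 0.4],
                   [0.3, 0.7]]])
equal = np.ones((2, 2))                  # equal weights behave like the Sum Rule
learned = np.array([[1.0, 0.2],
                    [0.1, 0.2]])         # hypothetical weights favouring classifier 1
_, label_equal = combine(probs, equal)       # memberships 0.9 vs 1.1 -> class index 1
_, label_learned = combine(probs, learned)   # memberships 0.63 vs 0.22 -> class index 0
```

The example shows how a class-specific weight matrix can overturn the decision that an equal-weight rule would make for the same classifier outputs.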
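In this study, we propose an approach to search for the combining weights $\mathbf{W}$ by maximizing the Dice coefficient computed on the predictions of the proposed ensemble with the combining weights $\mathbf{W}$ on the training data. Let $pred$ and $ground$ denote the final predictions and ground truths of all training pixels, and let $pred_m$ and $ground_m$ denote the sets of pixels predicted as, and labelled with, class $y_m$ respectively. The Dice coefficient associated with class $y_m$ is

$$Dice_m = \frac{2\,|pred_m \cap ground_m|}{|pred_m| + |ground_m|}$$

The Dice coefficient is the average of all Dice coefficients associated with the class labels:

$$Dice = \frac{1}{M} \sum_{m=1}^{M} Dice_m \tag{8}$$

We maximize the Dice coefficient to find $\mathbf{W}$. This optimization problem is solved by using the CLPSO method.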
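As an illustration of the fitness criterion, the sketch below computes the per-class Dice coefficient from crisp label arrays and averages it over the classes. The function name `mean_dice` and the convention that a class absent from both arrays scores 1.0 are assumptions of this example.

```python
import numpy as np

def mean_dice(pred, ground, n_classes):
    """Average of the per-class Dice coefficients over n_classes labels.

    pred and ground are integer label arrays of the same shape.
    """
    scores = []
    for m in range(n_classes):
        p = pred == m
        g = ground == m
        denom = p.sum() + g.sum()
        if denom == 0:
            # Assumption: a class missing from both arrays counts as a perfect match.
            scores.append(1.0)
        else:
            scores.append(2.0 * np.logical_and(p, g).sum() / denom)
    return float(np.mean(scores))

pred = np.array([0, 0, 1, 1])
ground = np.array([0, 1, 1, 1])
# Class 0: 2*1/(2+1) = 2/3; class 1: 2*2/(2+3) = 0.8; the mean is about 0.733.
score = mean_dice(pred, ground, n_classes=2)
```

In an optimization loop, a function of this kind would be evaluated once per candidate weight matrix on the crisp predictions that the candidate produces.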
In this study, we use three popular segmentation algorithms, namely UNet, LinkNet, and Feature Pyramid Network (FPN), to train the EoC. It is widely recognized that most segmentation algorithms based on deep learning are inspired by the Fully Convolutional Network (FCN) [26]. This architecture adapts an existing classification network, such as VGG16, to the segmentation problem by replacing the fully connected layers with convolutional layers, followed by upsampling to produce a dense pixel-level result. Deep networks specifically designed for medical image segmentation have also been introduced. A notable example is UNet [20], which consists of a contracting path and an expanding path. The contracting path consists of a number of downsampling operations on the input image in order to extract useful features, while the expanding path upsamples the image back to its original size for the final prediction. In order to help with localization, high-resolution features from the contracting path are concatenated with the upsampled output. This is an example of an encoder-decoder architecture, in which an image goes through an encoder which contracts the image size, and is then decoded back to the original size to get the segmentation result. Other examples include LinkNet [21], which takes the sum of the upsampled output and the corresponding features in the contracting path, and FPN [32], which concatenates features of all levels in the expanding path to help with the final prediction.

Algorithm 1: Training process of the proposed ensemble
...
7: Classify images in $D_i$ by these classifiers
8: Add outputs on samples in $D_i$ to $P$ (3)
9: Use the CLPSO method: for each candidate $\mathbf{W}$, compute the associated Dice coefficient using Algorithm 2
10: Select the optimal $\hat{\mathbf{W}}$ with the best Dice coefficient
11: return $\hat{\mathbf{W}}$ and $\{h_k\}_{k=1}^{K}$
The pseudo-code of the training process of the proposed system is presented in Algorithm 1. The algorithm takes as inputs the training images $D$, the $K$ segmentation algorithms $\{K_k\}_{k=1}^{K}$, and the parameters for CLPSO (the population size popSize, the number of iterations iter, and the learning rate controller $c$). First, we train the $K$ segmentation algorithms $\{K_k\}_{k=1}^{K}$ on $D$ to create the classifiers $\{h_k\}_{k=1}^{K}$. Then we generate the predictions $P$ for all pixels of the training images by using the Stacking algorithm (Steps 2-8). For each candidate $\mathbf{W}$ generated in CLPSO, we call Algorithm 2 to calculate its associated Dice coefficient. In Algorithm 2, for each row of $P$, i.e., the predictions of the $K$ classifiers for a pixel, we compute the class memberships associated with the class labels by using (4) and then assign a class label to this pixel by using (5). From the prediction results for all pixels of $P$, we obtain the final predictions $pred$ in the form of crisp labels. By using the ground truth of all pixels in the training set, we can calculate the Dice coefficient associated with each class label and the average Dice coefficient. CLPSO runs until it reaches the number of iterations. From the last swarm, we select the candidate $\hat{\mathbf{W}}$ which is associated with the best Dice coefficient as the solution of the problem.
In the classification process, we assign class labels to an unsegmented image $I$. We first obtain the predictions $P(I)$ for all pixels of $I$ by using the EoC $\{h_k\}_{k=1}^{K}$. The $M$ class memberships of each pixel are then calculated by using these predictions and the optimal weights $\hat{\mathbf{W}}$ (Steps 2-5). The classification rule in (5) is applied to the class memberships of each pixel to give the final prediction. The predictions for all pixels of $I$ constitute its segmentation result.

Algorithm 2: Dice coefficient of a candidate $\mathbf{W}$
...
2: for m ← 1 to M do
3: Compute $CM_m(I_n(i, j))$ by using (4)
4: Assign class label to $I_n(i, j)$ by using (5)
5: Generate pred
...

Algorithm 3: Classification process
...
4: Compute $CM_m(I(i, j))$ by using $P_m$ obtained from $P(I)$ and $\mathbf{W}_m$ from $\hat{\mathbf{W}}$
5: Assign label to $I(i, j)$ by using (5)
6: return Segmented result for $I$

IV. EXPERIMENTAL STUDIES
A. Experimental Settings
Two performance metrics were used for the evaluation of the base segmentation algorithms and the proposed ensemble: Dice coefficient and Mean Absolute Distance (MAD). Dice coefficient, defined in Equation 8, is one of the most popular metrics for medical image segmentation. However, its shortcoming is that it measures total volume difference without taking into account local discrepancies between contours, which are important in the context of medical image analysis [36]. Therefore, we also used a distance measure between geometrical contours for the evaluation. Let $GT_m$ and $PR_m$ be the sets of coordinate vectors of the ground truth contour and the prediction contour with respect to class $y_m$ respectively. The MAD for class $y_m$ [37] is defined as follows:

$$MAD(GT_m, PR_m) = \frac{1}{2} \left( \frac{1}{|GT_m|} \sum_{a \in GT_m} \min_{b \in PR_m} \|a - b\| + \frac{1}{|PR_m|} \sum_{b \in PR_m} \min_{a \in GT_m} \|a - b\| \right) \tag{9}$$

To evaluate the effectiveness of our proposed ensemble compared to the benchmark algorithms, we participated in the Cardiac Acquisitions for Multi-structure Ultrasound Segmentation (CAMUS) challenge [38], which is a competition for accurate segmentation of 2D echocardiographic images. The datasets provided by the competition consist of clinical exams from 500 patients. For each patient, 2D apical four-chamber and two-chamber cardiographic images and segmentations were recorded at two cardiographic positions, End Diastolic (ED) and End Systolic (ES), making a total of 4 datasets. Three expert cardiologists were involved in the manual segmentation of the datasets. Segmentation ground truth is provided for 450 patients, while the segmentations of the other 50 patients are not publicly available, and participants have to submit their results to a server for evaluation. The datasets have three classes: Left ventricle, Myocardium, and Left atrium, with an additional background class. Example images for two-chamber and four-chamber cases and their corresponding ground truths are shown in Figure 1. The evaluation server reports the aggregate results for ED and ES for both four-chamber and two-chamber cases. We report the best results achieved by the authors of this competition and the results of the constituent classifiers as benchmark algorithms.

For the proposed ensemble, we set $T = 5$ for the $T$-fold cross-validation procedure in the Stacking algorithm. For CLPSO, we set $c = 1.494$ as in [8], and maxT = 600, nPop = 10. Generating the predictions on the training set of one case (e.g., two-chamber ED) took approximately 18 hours running in parallel on the GPU. The CLPSO optimization for each of the four datasets in the CAMUS competition was run on the CPU and took approximately 26 hours. This can be considered a reasonable time compared to other similar works such as [49], in which the authors took 61 hours to optimize DNN hyperparameters for medical image segmentation.
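The contour-based MAD metric discussed above can be sketched as follows, assuming the common symmetric form in which each contour point is matched to its nearest point on the other contour under the Euclidean distance. The function name `mean_absolute_distance` and the toy contours are illustrative; the challenge's evaluation server may differ in details such as contour extraction and physical units.

```python
import numpy as np

def mean_absolute_distance(gt, pr):
    """Symmetric mean absolute distance between two contours.

    gt: (n, 2) array of ground-truth contour coordinates
    pr: (m, 2) array of predicted contour coordinates
    The result is in the same unit as the coordinates.
    """
    # Pairwise Euclidean distances between every gt point and every pr point.
    d = np.linalg.norm(gt[:, None, :] - pr[None, :, :], axis=2)  # shape (n, m)
    gt_to_pr = d.min(axis=1).mean()   # average nearest distance from gt to pr
    pr_to_gt = d.min(axis=0).mean()   # average nearest distance from pr to gt
    return 0.5 * (gt_to_pr + pr_to_gt)

gt = np.array([[0.0, 0.0], [0.0, 1.0]])
pr = np.array([[1.0, 0.0], [1.0, 1.0]])
mad = mean_absolute_distance(gt, pr)   # two parallel segments one unit apart
```

Unlike the Dice coefficient, this quantity is sensitive to where the two contours disagree, so a lower value indicates a closer match.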

B. Influence of Using Different Number of Segmentation Algorithms
We first explored the influence of using different numbers of segmentation algorithms on the performance of the proposed ensemble. We used the following architectures: UNet [20], LinkNet [21], and Feature Pyramid Network (FPN) [32], with two backbones, VGG16 [33] and ResNet34 [34], to obtain an ensemble of 6 segmentation algorithms (denoted by Proposed ensemble (6)). We then used these 3 architectures with the ResNet101 backbone to generate 3 more segmentation algorithms for the ensemble (denoted by Proposed ensemble (9)). All segmentation algorithms were run for 300 epochs when training classifiers. Figure 3 shows the comparison between the performance of Proposed ensemble (6) and Proposed ensemble (9). With respect to the Dice coefficient, it can be seen that both ensembles give similar results. For the ED case, Proposed ensemble (6) achieves a Dice coefficient of 0.946 and 0.959 on the Left ventricle and Myocardium classes respectively, while the result of Proposed ensemble (9) is higher by 0.1% in both classes. For the Left atrium class, Proposed ensemble (6) has a Dice coefficient of 0.902, which is lower than that of Proposed ensemble (9) by 0.4%. For the ES case, Proposed ensemble (6) is slightly better than Proposed ensemble (9) on the Left ventricle class, with Dice coefficients of 0.930 and 0.929 respectively. In contrast, Proposed ensemble (6) achieves a Dice coefficient of 0.933 on the Left atrium class, which is lower than that of Proposed ensemble (9) by 0.2%. Both ensembles achieve the same result on the Myocardium class at 0.954.

Fig. 3. The performance of the proposed ensemble using 6 and 9 segmentation algorithms.
With respect to the MAD, Proposed ensemble (9) achieves a better result than Proposed ensemble (6) for the ED case by a margin of 0.1 on all three classes (from 1.5 to 1.4 on Left ventricle, 1.7 to 1.6 on Myocardium, and 2.0 to 1.9 on Left atrium). The same improvement is observed for the Left atrium class in the ES case (from 1.7 to 1.6). However, adding the 3 segmentation algorithms with the ResNet101 backbone increases, and therefore worsens, the MAD on the two other classes in the ES case (from 1.4 to 1.5 on Left ventricle and 1.6 to 1.7 on Myocardium).

C. Comparison with Benchmark Segmentation Algorithms
We compared the performance of Proposed ensemble (9) with the benchmark algorithms. Tables I and II show the Dice coefficient and MAD measured for the ED case. It can be seen that the proposed ensemble achieves the best Dice coefficient for all three classes compared to the other benchmarks. For the Left ventricle class, the proposed ensemble achieves a score of 0.947, which is slightly higher than that of the second best, UNet-ResNet34 (0.946). Meanwhile, the authors' best achieves a score of only 0.936. For the two other classes, Myocardium [...]

[...] Tables III and IV. As with the ED case, the proposed ensemble achieved the best Dice coefficient on all three classes, and the benchmarks using the VGG16 backbone performed poorly. For the Left ventricle class, the proposed ensemble achieved a score of 0.929, which was higher than the second best (LinkNet-ResNet34) by 0.1%. For the Myocardium class, the proposed ensemble obtained the same Dice coefficient as the second best benchmark (LinkNet-ResNet34) at 0.954. UNet-ResNet34 and FPN-ResNet34 achieved slightly lower scores (0.952 and 0.953 respectively), while the other benchmarks obtained lower scores from 0.93 [...]

[...] (9), which was an increase of more than 2%. Similarly, there is an improvement of 1% for the ES Dice score (0.942 to 0.952). For the MAD score, the proposed ensemble has a better score by a margin of around 0.3.

Figure 2 shows an example of the predictions made by the benchmarks and the proposed ensemble. It can be seen that FPN-VGG16 (first row, second column) failed to make a correct prediction, while LinkNet-VGG16 did not segment the bottom left part of the Myocardium and mistook a part of the Left ventricle for Myocardium. UNet-VGG16 wrongly predicted an empty part in the top left as Myocardium [...] (6) and Proposed ensemble (9) improved on the base segmentation algorithms to achieve a better segmentation result.

Table VI shows the optimal weights found by CLPSO for the two-chamber ED case. It can be seen that overall the ResNet-based algorithms are assigned higher weights than the VGG16-based algorithms. However, there are cases where the VGG16-based algorithms are assigned relatively high weights. For example, with respect to the Left ventricle class, LinkNet-VGG16 and FPN-VGG16 were assigned weights of 0.766 and 0.675 respectively. This shows that the weights of the proposed ensemble are not biased towards well-performing methods. Instead, all the constituent segmentation algorithms contribute to the ensemble.

V. CONCLUSION
In this paper, we presented a novel weighted ensemble of deep learning models for the problem of medical image segmentation. The probability predictions of the segmentation algorithms are combined by weighted combining to form the final prediction. Comprehensive Learning Particle Swarm Optimization (CLPSO), a swarm intelligence algorithm, was used to find the combining weights which gave the best fitness value over a five-fold cross-validation procedure. Dice coefficient, a popular metric for medical image segmentation, was used as the fitness criterion. Our results on the datasets of the CAMUS competition show that the proposed ensemble achieves an overall improvement compared to several benchmark algorithms.