Toward an ensemble of object detectors.

The field of object detection has made great strides in recent years. With the wave of deep neural networks (DNN), many breakthroughs have been achieved on object detection problems that were previously thought to be difficult. However, DNN-based approaches have a limitation: some architectures are only suitable for particular types of objects. It is therefore desirable to combine the strengths of different methods to handle objects in different contexts. In this study, we propose an ensemble of object detectors in which individual detectors are adaptively combined to reach a collaborative decision. The combination is conducted on the outputs of the detectors, namely the predicted label and location of each object. We propose a detector selection method to select suitable detectors and a weight-based combining method to combine the predicted locations of the selected detectors. The parameters of these methods are optimized using Particle Swarm Optimization (PSO) to maximize the mean Average Precision (mAP) metric. Experiments conducted on the VOC2007 dataset with six object detectors show that our ensemble method outperforms each single detector.


Introduction
Object detection is a problem in which a learning machine has to locate the objects present in an image with bounding boxes and identify the type or class of each located object. Before the rise of Deep Neural Networks (DNN), traditional machine learning methods using handcrafted features [13,22] were used with only modest success, since the extracted features are not representative enough to describe many kinds of diverse objects and backgrounds. With the successes of DNN in image classification [11], researchers began to incorporate insights gained from Convolutional Neural Networks (CNN) into object detection. Some notable results in this direction include Faster RCNN [7] and You Only Look Once (YOLO) [16]. However, some object detectors are only suitable for specific types of objects. For example, YOLO struggles with small objects due to the strong spatial constraints imposed on bounding box predictions [15]. In this study, we propose to combine several object detectors into an ensemble system. By combining multiple learners to reach a collaborative decision, an ensemble can obtain better results than a single learner [20]. The key challenge of building ensembles of object detectors is to handle multiple outputs so that the final output can determine what objects are in a given image and where they are located.
The paper is organized as follows. In section 2, we briefly review the existing approaches related to object detection and ensemble learning. In section 3, we propose a novel weight-based ensemble method to combine the bounding box predictions of selected base detectors. The bounding boxes to be combined are found by a greedy process in which boxes whose Intersection-over-Union (IoU) values with each other are higher than a predetermined threshold are grouped together. We formulate an optimisation problem that maximizes the mean Average Precision (mAP) metric of the detection task. The parameters of the combining method are found by solving this optimisation problem with an evolutionary computation-based algorithm. The details of the experimental studies on the VOC2007 dataset [6] are described in section 4. Finally, the conclusion is given in section 5.

Object Detectors
Most early object detection systems were based on extracting handcrafted features from given images and then applying a conventional learning algorithm such as Support Vector Machines (SVM) or Decision Trees [13,22] to those features. The most notable handcrafted methods were the Viola-Jones detector [21] and Histogram of Oriented Gradients (HOG) [5]. However, these methods achieved only modest accuracy while requiring great expertise in handcrafting feature extractors. With the rise of deep learning, in 2014 Girshick et al. proposed Regions with Convolutional Neural Network (CNN) features (called RCNN), the first DNN-based approach to the object detection problem [8]. This architecture extracts a number of object proposals using a selective search method; each proposal is then fed to a CNN to extract relevant features before being classified by a linear SVM classifier. Since then, object detection methods have developed rapidly and fall into two groups: two-stage detection and one-stage detection. Two-stage detectors such as Fast-RCNN [7] and Faster-RCNN [17] follow the traditional object detection pipeline, generating region proposals first and then classifying each proposal into one of the object categories. Even though these networks give promising results, they still struggle with objects that span a broad range of scales, with less prototypical images, and with objects that require more precise localization. One-stage detection algorithms such as YOLO [15] and SSD [12] regard object detection as a regression or classification problem and adopt a unified architecture for both bounding box localization and classification.

Ensemble methods and optimization
Ensemble methods refer to learning models that combine multiple learners to make a collaborative decision [18,20]. The main premise of ensemble learning is that by combining multiple models, the errors of a single learner will likely be compensated by the others, thus yielding better overall predictive performance. Many ensemble methods have been introduced, and they are categorized into two main groups, namely homogeneous ensembles and heterogeneous ensembles [20]. The first group includes ensembles generated by training one learning algorithm on many different training sets derived from the original training set. The second group includes ensembles generated by training several different learning algorithms on the original training set.
Research on ensemble methods focuses on two stages of building an ensemble, namely generation and integration. For the generation stage, approaches focus on designing novel architectures for the ensemble system. Nguyen et al. [19] designed a deep ensemble method that involves multiple layers of ensembles of classifiers (EoC). A feature selection method works on the output of a layer to obtain the selected features as the input for the next layer. In the integration stage, besides several simple combining algorithms like Sum Rule and Majority Vote [10], Nguyen et al. [20] represented the predictions of the classifiers in the form of vectors of intervals, called granule prototypes, by using information granules. The combining algorithm then measures the distance between the predictions for a test sample and the granule prototypes to obtain the predicted label. Optimization methods have also been applied to improve the performance of existing ensemble systems through ensemble selection (ES), which aims to search for a suitable EoC that performs better than the whole ensemble. Chen et al. [2] used Ant Colony Optimization (ACO) to find the optimal EoC and the optimal combining algorithm.

General Description
In this study, we introduce a novel ensemble of object detectors to obtain higher performance than any single detector. Assume that we have $T$ base object detectors, denoted by $OD_i$ $(i = 1, ..., T)$. Each detector works on an image to identify the location and class label of objects in the form of prediction results $R_i = \{R_{i,j}\}$ ($j = 1, ..., n_i$, where $n_i$ is the number of objects detected by $OD_i$). The elements of $R_{i,j}$ are detailed as:
• Bounding box $BB_{i,j} = (x_{i,j}, y_{i,j}, w_{i,j}, h_{i,j})$, where $x_{i,j}$ and $y_{i,j}$ are the top-left coordinates and $w_{i,j}$ and $h_{i,j}$ are the width and height of the bounding box
• Prediction $(l_{i,j}, conf_{i,j})$, where $l_{i,j}$ is the predicted label and $conf_{i,j}$ is the confidence value, which is defined as the probability for the prediction of this label

Our proposed ensemble algorithm deals with the selection of suitable detectors among all given ones, as well as with combining the bounding boxes of the selected detectors. In order to select suitable detectors, we introduce selection variables $\alpha_j \in \{0, 1\}$, $j = 1, ..., T$, with each binary variable $\alpha_j$ representing whether detector $OD_j$ is selected or not. The combining process is conducted after the selection process. To combine the bounding boxes made by the selected detectors, we need to know which bounding boxes of the different detectors predict the same object. Our proposed method consists of two steps:
-Step 1: Measure the similarity between pairs of bounding boxes from the detection results of different detectors to create groups of similar bounding boxes
-Step 2: For each group, combine the bounding boxes

The similarity between bounding boxes is measured using Intersection over Union (IoU), which is very popular in object detection research [22]. With two bounding boxes $BB_{i,j}$ and $BB_{p,q}$, the IoU measure between them is given by:

$$IoU(BB_{i,j}, BB_{p,q}) = \frac{area(BB_{i,j} \cap BB_{p,q})}{area(BB_{i,j} \cup BB_{p,q})} \quad (1)$$

This measure is compared to a threshold $\theta$ $(0 \le \theta \le 1)$. If $IoU > \theta$, the two boxes are grouped together, eventually forming a number of box groups $G = (g_1, g_2, ..., g_K)$, where $K$ is the number of groups.
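The IoU measure can be illustrated with a short Python sketch. It assumes boxes are stored as `(x, y, w, h)` tuples with `(x, y)` the top-left corner; the function name `iou` and the `Box` alias are chosen here for illustration only.

```python
from typing import Tuple

# (x, y, w, h); (x, y) is assumed to be the top-left corner of the box.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    # Degenerate boxes with zero area are given an IoU of 0 (an assumption).
    return inter / union if union > 0 else 0.0
```

Two boxes would then be placed in the same group when `iou(a, b) > theta` for the chosen threshold.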
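Note that we do not consider the IoU between boxes made by the same detector ($i = p$), since we combine bounding boxes of different detectors. We also only combine bounding boxes that have the same predicted label. For each group, we perform the combination of its bounding boxes. Let $W_i^x, W_i^y, W_i^w, W_i^h \in [0, 1]$ be the weights of detector $OD_i$ $(i = 1, ..., T)$. Then the combined bounding box for group $g_k$ will be $BB_k = (x_k, y_k, w_k, h_k)$, in which:

$$coord_k = \frac{\sum_{i=1}^{T} \sum_{j=1}^{n_i} I[BB_{i,j} \in g_k] \, W_i^{coord} \, coord_{i,j}}{\sum_{i=1}^{T} \sum_{j=1}^{n_i} I[BB_{i,j} \in g_k] \, W_i^{coord}} \quad (2)$$

where $I[.]$ is the indicator function, $n_i$ is the number of boxes predicted by $OD_i$, and $coord_k \in \{x_k, y_k, w_k, h_k\}$. Therefore, our ensemble is completely determined by the following parameters: the bounding box weights $W_i^x, W_i^y, W_i^w, W_i^h$ $(i = 1, ..., T)$, the selection variables $\alpha_j$ $(j = 1, ..., T)$, and the IoU threshold $\theta$.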
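A weight-based combination of one group can be sketched in Python as follows. The sketch assumes the combination is a normalised weighted average of each coordinate, with one weight per detector and per coordinate. The function name `combine_group`, the `weights` dictionary layout, and the fallback to a plain mean when all weights are zero are illustrative assumptions rather than details taken from the method description.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def combine_group(
    boxes: Sequence[Box],
    dets: Sequence[int],
    weights: Dict[int, Sequence[float]],
) -> Box:
    """Weighted average of the boxes in one group, coordinate by coordinate.

    dets[n] is the index of the detector that produced boxes[n];
    weights[d] holds (W^x, W^y, W^w, W^h) for detector d.
    """
    combined: List[float] = []
    for c in range(4):  # c indexes x, y, w, h
        num = sum(weights[d][c] * box[c] for box, d in zip(boxes, dets))
        den = sum(weights[d][c] for d in dets)
        if den > 0:
            combined.append(num / den)
        else:
            # Assumption: fall back to the plain mean if all weights are zero.
            combined.append(sum(box[c] for box in boxes) / len(boxes))
    return tuple(combined)
```

For example, with weights 1 for the first detector and 3 for the second, `combine_group([(0, 0, 10, 10), (2, 2, 10, 10)], [0, 1], {0: (1, 1, 1, 1), 1: (3, 3, 3, 3)})` pulls the combined corner three quarters of the way toward the second box.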

Optimisation
The question that arises from the proposed method is how to search for the best parameters $(W_i^x, W_i^y, W_i^w, W_i^h, \alpha_j, \theta)$, where $W_i^x, W_i^y, W_i^w, W_i^h$ are the bounding box weights, $\alpha_j$ are the selection variables, and $\theta$ is the IoU threshold. We formulate an optimisation problem which we can solve to find the optimal values of these parameters. The fitness function is chosen to be the mean Average Precision (mAP), which is defined as the average of the Average Precision $AP_c$ over all classes $c$.

Algorithm 1. Combining the bounding boxes of the selected detectors
Input: bounding boxes $BB_i$, confidence values $conf_i$, predicted labels $l_i$ and detector indices $det_i$ $(i = 1, ..., nbb)$; parameters $(W_i^x, W_i^y, W_i^w, W_i^h, \alpha_j, \theta)$
Output: set of combined bounding boxes $E$
 1: Sort the selected bounding boxes in decreasing order of confidence value; set $assign_i \leftarrow 0$ for all $i$, $group\_idx \leftarrow 1$, $E \leftarrow \emptyset$
 2: for $i \leftarrow 1$ to $nbb$ do
 3:     if $assign_i \ne 0$ then
 4:         continue
 5:     $assign_i \leftarrow group\_idx$
 6:     for $j \leftarrow i + 1$ to $nbb$ do
 7:         if $assign_j \ne 0$ or $det_i == det_j$ or $l_i \ne l_j$ then
 8:             continue
 9:         if $IoU(BB_i, BB_j) > \theta$ then
10:             $assign_j \leftarrow group\_idx$
11:     $group\_idx \leftarrow group\_idx + 1$
12: $K \leftarrow group\_idx - 1$
13: $G \leftarrow \{g_1, g_2, ..., g_K\}$ where $g_k = \{BB_i\}$ such that $assign_i == k$
14: for $k \leftarrow 1$ to $K$ do
15:     Combine boxes in $g_k$ to get $BB_k = (x_k, y_k, w_k, h_k)$ by using Eq. 2
16:     $E$.insert($BB_k$)
17: return $E$

In order to calculate $AP_c$, we need to calculate the precision and recall, which are defined as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

where $TP$ (True Positive) is the number of correct detections, $FP$ (False Positive) is the number of cases where a predicted object does not exist, and $FN$ (False Negative) is the number of cases where an existing object is not predicted. The IoU measure between a predicted bounding box and a ground truth box determines whether the ground truth box is predicted by the algorithm. The AP summarises the shape of the precision/recall curve. It is evaluated by first computing a version of the measured precision/recall curve in which precision is monotonically decreasing, obtained by setting the precision for recall $r$ to the maximum precision obtained for any recall $r' \ge r$. The AP is then calculated as the area under this curve by numerical integration. This is done by sampling at all unique recall values at which the maximum precision drops. Let $p_{interp}$ be the interpolated precision values.
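The interpolation and the numerical integration described above can be sketched in Python as follows. The sketch assumes that precision and recall have already been computed at each rank of the confidence-sorted detections of one class. The function name `average_precision` and the sentinel points added at recall 0 and 1 are choices made for this illustration.

```python
from typing import Sequence


def average_precision(recall: Sequence[float], precision: Sequence[float]) -> float:
    """Area under the interpolated precision/recall curve of one class.

    `recall` must be in non-decreasing order, with `precision` aligned to it
    (one point per detection, ranked by decreasing confidence).
    """
    # Sentinel points at recall 0 and recall 1 (an implementation assumption).
    r = [0.0] + list(recall) + [1.0]
    p = [0.0] + list(precision) + [0.0]
    # Make precision monotonically decreasing:
    # p_interp(r) = max of p(r') over all r' >= r.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangle areas between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))
```

The mAP used as the fitness value would then be the mean of `average_precision` over all classes.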
Then the average precision is calculated as follows:

$$AP = \sum_{i} (r_{i+1} - r_i) \, p_{interp}(r_{i+1})$$

where $r_i$ are the sampled recall values in increasing order. Thus with $T$ detectors, the optimisation problem is given by:

$$\max_{W_i^x, W_i^y, W_i^w, W_i^h, \alpha_i, \theta} \; mAP \quad \text{s.t.} \quad W_i^x, W_i^y, W_i^w, W_i^h \in [0, 1], \; \alpha_i \in \{0, 1\}, \; \theta \in [0, 1], \; i = 1, ..., T$$

We use PSO [3,9] to find the optimal values for $(W_i^x, W_i^y, W_i^w, W_i^h, \alpha_i, \theta)$. Compared to other optimisation algorithms, PSO offers some advantages. Firstly, as a member of the family of evolutionary computation methods, it is well suited to handle non-linear, non-convex spaces with non-differentiable, discontinuous objective functions. Secondly, PSO is a highly efficient solver of continuous optimisation problems in a range of applications, typically requiring a low number of function evaluations in comparison to other approaches while still maintaining the quality of results [14]. Finally, PSO can be efficiently parallelized to reduce computational cost. To work with continuous variables in PSO, we convert each $\alpha_j$ into a continuous variable belonging to $[0, 1]$. If $\alpha_j$ is higher than 0.5, the corresponding detector is added to the ensemble. The average mAP value in a 5-fold cross-validation procedure is used as the fitness value.

The combining and training procedures are described in Algorithm 1. Algorithm 1 receives as inputs the bounding boxes made by the detectors ($BB_i$), the confidence values ($conf_i$), the predicted labels ($l_i$) and the parameters $(W_i^x, W_i^y, W_i^w, W_i^h, \alpha_j, \theta)$. Each bounding box $BB_i$ also has an associated variable $det_i$ which gives the index of the detector responsible for $BB_i$. For example, if $BB_i$ is predicted by the detector $OD_j$ then $det_i = j$. Line 1 sorts the selected bounding boxes in decreasing order of confidence value. Lines 3-10 assign each bounding box to a group. For each bounding box $BB_i$, we first check whether it has already been assigned to one of the existing groups; if not, it is assigned to the new group $group\_idx$ (lines 3-5). Then each unassigned bounding box $BB_j$ that is not made by the same detector as $BB_i$ and has the same predicted label is added to group $group\_idx$ if its IoU value with $BB_i$ is greater than $\theta$ (lines 6-10). After all boxes are grouped, lines 12 to 17 combine the boxes in each group and return the combined bounding boxes.
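The grouping and combining steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions: boxes are `(x, y, w, h)` tuples with a top-left corner, the inputs already contain only the boxes of the selected detectors, Eq. 2 is taken to be a normalised weighted average, and each combined box keeps the label shared by its group. The names `iou` and `group_and_combine` and the layout of `weights` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner assumed


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two (x, y, w, h) boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def group_and_combine(
    boxes: Sequence[Box],
    confs: Sequence[float],
    labels: Sequence[str],
    dets: Sequence[int],
    weights: Dict[int, Sequence[float]],
    theta: float,
) -> List[Tuple[Box, str]]:
    """Greedy grouping by IoU followed by a weighted combination per group."""
    # Line 1: visit boxes in decreasing order of confidence.
    order = sorted(range(len(boxes)), key=lambda i: confs[i], reverse=True)
    assign = {i: 0 for i in order}  # 0 means "not assigned to a group yet"
    group_idx = 1
    # Lines 2-11: each unassigned box seeds a new group and attracts
    # same-label boxes from other detectors that overlap it enough.
    for pos, i in enumerate(order):
        if assign[i] != 0:
            continue
        assign[i] = group_idx
        for j in order[pos + 1:]:
            if assign[j] != 0 or dets[i] == dets[j] or labels[i] != labels[j]:
                continue
            if iou(boxes[i], boxes[j]) > theta:
                assign[j] = group_idx
        group_idx += 1
    # Lines 12-17: combine the boxes of every group.
    combined: List[Tuple[Box, str]] = []
    for k in range(1, group_idx):
        members = [i for i in order if assign[i] == k]
        box = []
        for c in range(4):  # x, y, w, h
            den = sum(weights[dets[i]][c] for i in members)
            num = sum(weights[dets[i]][c] * boxes[i][c] for i in members)
            # Assumption: with all-zero weights, keep the most confident box.
            box.append(num / den if den > 0 else boxes[members[0]][c])
        combined.append((tuple(box), labels[members[0]]))
    return combined
```

As in the pseudocode, candidate boxes are compared only with the box that seeds the group, so the result depends on the confidence ordering.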

Experimental Setup
In the experiments, we used a number of popular object detection algorithms as base detectors for our ensemble method. The base detectors used are SSD-Resnet50, SSD-InceptionV2, SSD-MobilenetV1 [12], FRCNN-InceptionV2, FRCNN-Resnet50 [17], and RFCN-Resnet101 [4]. We used the default configuration for all of these methods. Training was run for 50000 iterations. For the PSO algorithm, the inertia weight $a$ was set to 0.9, while the two parameters $C_1$ and $C_2$ were set to 1.494. The number of iterations was set to 100 and the population size to 50. The VOC2007 dataset used in this paper contains 5011 images for training and validation, and 4952 images for testing. The evaluation metric used in the paper was mAP (mean Average Precision). Among the 9963 images in the VOC2007 dataset, there are 2715 images having at least one

[Figure 1: per-class AP of the proposed method versus RFCN-Resnet101. The red or blue color means better or poorer performance on an object.]
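The PSO settings listed in the experimental setup (inertia weight 0.9, $C_1 = C_2 = 1.494$, 100 iterations, population size 50) can be illustrated with a basic global-best PSO in Python. This is a sketch, not the implementation used in the experiments. The function names `pso_maximise` and `decode`, the clipping of positions to $[0, 1]$, the zero initial velocities, and the particle layout (4T weights, then T selection variables, then the threshold) are all assumptions. The `fitness` callable stands in for the cross-validated mAP.

```python
import random
from typing import Callable, List, Sequence, Tuple


def pso_maximise(
    fitness: Callable[[Sequence[float]], float],
    dim: int,
    n_particles: int = 50,
    n_iters: int = 100,
    inertia: float = 0.9,
    c1: float = 1.494,
    c2: float = 1.494,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Maximise `fitness` over the unit hypercube [0, 1]^dim with a basic PSO."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (
                    inertia * vel[i][d]
                    + c1 * r1 * (pbest[i][d] - pos[i][d])
                    + c2 * r2 * (gbest[d] - pos[i][d])
                )
                # Assumption: keep every parameter inside [0, 1] by clipping.
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


def decode(position: Sequence[float], n_detectors: int):
    """Split a particle into (box weights, selected detectors, IoU threshold)."""
    t = n_detectors
    weights = [list(position[4 * i: 4 * i + 4]) for i in range(t)]
    # A detector is selected when its continuous alpha is higher than 0.5.
    selected = [position[4 * t + j] > 0.5 for j in range(t)]
    theta = position[5 * t]
    return weights, selected, theta
```

Under this layout, an ensemble of six base detectors corresponds to a search space of 5 × 6 + 1 = 31 dimensions.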

Result and discussion
Table 1 (left) shows the mAP results of the proposed method and the base detectors. The proposed method has an mAP value of 67.23%, which outperforms the best base detector, RFCN-Resnet101, by 2.56%. Figure 1 shows a detailed comparison of AP values between the two methods for each class. It can be seen that the proposed method achieves a remarkable increase for the "dining table" object, from 35.04% to 56.17%. This is followed by "sofa" with an increase of 9.08%, from 54.19% to 63.27%. Other objects such as "dog" or "train" also saw a modest increase. On the other hand, "bicycle" and "bottle" saw a decrease, from 72.73% to 70.31% and from 49.26% to 45.98% respectively. It should be noted that ensemble methods aim to improve the overall result, even though some individual cases might be worse than with the base learners. In total, there are 14 object types

Figure 2 provides a comparison between the selected base detectors (those with $\alpha_i \ge 0.5$ after optimisation) and the proposed method. It can be seen that RFCN-Resnet101, SSD-Resnet50, and FRCNN-Resnet50 correctly identify two bicycles, but wrongly predict another bicycle that spans the two real bicycles. In addition, FRCNN-Resnet50 wrongly predicts three person objects in the image. Thanks to the combination procedure, the redundant bicycle and person objects have been removed. Also, the bounding box for the left person by SSD-InceptionV2 is slightly skewed to the right, but after applying the weighted sum of the bounding boxes of the base detectors, the combined box is positioned more accurately.

Conclusion
In this paper, we presented a novel method for combining a number of base object detectors into an ensemble that achieves better results than each of them. The combining method uses the PSO algorithm to search for a parameter set that optimises mAP. The parameters include selection indicators, which show whether each detector is selected or not, together with the bounding box weights and the IoU threshold. The bounding boxes of the selected detectors are then combined using a weight-based combining method. Our results on a benchmark dataset show that the proposed ensemble method is able to combine the strengths and mitigate the drawbacks of the base detectors, resulting in an improvement compared to each individual detector.