Aggregation of Classifiers: A Justifiable Information Granularity Approach

In this study, we introduce a new approach to combining multiple classifiers in an ensemble system. Instead of using the numeric membership values encountered in fixed combining rules, we construct interval membership values associated with each class prediction at the level of the meta-data of an observation by using concepts of information granules. In the proposed method, the uncertainty (diversity) of the findings produced by the base classifiers is quantified by interval-based information granules. The discriminative decision model is generated by considering both the bounds and the length of the obtained intervals. We selected ten and then fifteen learning algorithms to build a heterogeneous ensemble system and conducted experiments on a number of UCI datasets. The experimental results demonstrate that the proposed approach performs better than the benchmark algorithms, including six fixed combining methods, one trainable combining method, AdaBoost, Bagging, and Random Subspace.


Introduction
In supervised learning, the relationship between feature vectors and class labels of training observations is exploited to learn the discriminative decision model. As data gathered from different sources can vary quite substantially, a learning algorithm that achieves high accuracy on one dataset can perform less well on another dataset. Experiments have shown that there is no single learning algorithm that performs well on all data, and it is difficult to know a priori which learning algorithm is suitable for a particular dataset. Hence, the research on how to combine several learning algorithms into a single framework to obtain a better discriminative decision model has generated a great deal of interest [1][2][3].
In many classification systems, the outputs usually reflect the probabilities of an observation belonging to given classes. However, in many practical situations, one may not be able to associate a precise probability with every event, particularly when only limited information is available. In this case, interval probabilities with lower and upper bounds provide a more general and flexible way to describe the uncertainty of the underlying knowledge [4]. Interval probability models have been successfully applied to many applications involving probabilistic and statistical reasoning, especially when there is a conflict between different sources of information [5].
In ensemble systems, each learning algorithm uses a different methodology to learn a base classifier on a given training set, thereby introducing uncertainty to the outputs. In ensemble learning, the meta-data of an observation reflects the agreements and disagreements between the different base classifiers. A combiner which can explicitly represent knowledge with uncertainty is therefore desirable. Several combiners that exploit this idea have been proposed, such as fuzzy integral in neural network [6] and Decision Template [7]. In this study, instead of dealing with precise numerical membership values like those encountered in traditional classification systems, we propose a novel classifier combining algorithm that captures the uncertainty in the outputs of the base classifiers in an explicit manner using the notion of information granularity. Information granules and Granular Computing are directly attributed to the pioneering work by Zadeh [8-10] and were further developed in [11][12][13][14][15]. Specifically, the predictions of the base classifiers are processed by justifiable information granularity to generate interval class memberships associated with class labels. As mentioned before, interval values are a flexible way to describe the uncertainty in the underlying knowledge. Therefore, the proposed algorithm is more general than existing ensemble systems since it can output both interval values and crisp class memberships. Our experiments have confirmed that it performs significantly better than many existing ensemble systems.
The paper is organized as follows. In Section 2, we briefly discuss ensemble methods, with a focus on heterogeneous ensemble systems. The concept of information justifiability in the design of information granules is also emphasized. In Section 3, a novel fixed combining method based on the idea of justifiable granularity is discussed. Experimental results are presented in Section 4, in which we compare the results of the proposed method to a number of benchmark algorithms on twenty-one datasets. Finally, conclusions are presented in Section 5.

Heterogeneous ensemble systems and fixed combining method
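There are many taxonomies of ensemble methods that focus on different factors and views of the ensemble systems [1][16][17][18]. In [17], six strategies were introduced to build a sound combining system. The rationale behind these strategies is that "the more diverse the training set, the base classifiers, and the feature set, the better the performance of the ensemble system".
• Different classifiers (also called the Heterogeneity scenario [19]): A set of different learning algorithms is used on the same training dataset to generate different base classifiers; a combiner then makes a decision from the outputs (called Level1 data or meta-data) of these classifiers [24][25][26][27][28][29][30]. This approach focuses more on the algorithms that combine the meta-data to achieve higher accuracy than any single base classifier.
For an observation $\mathbf{x}$, let $P_k(y_m \mid \mathbf{x})$ denote the probability that $\mathbf{x}$ belongs to the class with label $y_m$ given by the $k$-th classifier, where $M$ is the number of classes and $K$ is the number of base classifiers. There are two popular types of output for $\mathbf{x}$ for each $k = 1, \dots, K$:
• Crisp (Boolean) Label: return only the class label, i.e. $P_k(y_m \mid \mathbf{x}) \in \{0, 1\}$ and $\sum_{m=1}^{M} P_k(y_m \mid \mathbf{x}) = 1$
• Soft Label: return the posterior probabilities that $\mathbf{x}$ belongs to the classes, i.e. $P_k(y_m \mid \mathbf{x}) \in [0, 1]$ and $\sum_{m=1}^{M} P_k(y_m \mid \mathbf{x}) = 1$
In this work, we focus only on the soft label. In this case, the posterior probability reflects the support of a class to an observation. The meta-data of an observation $\mathbf{x}$ is defined in the form of the following $K \times M$ matrix:
$$L(\mathbf{x}) = \begin{bmatrix} P_1(y_1 \mid \mathbf{x}) & \cdots & P_1(y_M \mid \mathbf{x}) \\ \vdots & \ddots & \vdots \\ P_K(y_1 \mid \mathbf{x}) & \cdots & P_K(y_M \mid \mathbf{x}) \end{bmatrix} \quad (1a)$$
while the meta-data of all $N$ training observations, an $N \times MK$ posterior probability matrix, is defined as:
$$\mathbf{L} = \begin{bmatrix} P_1(y_1 \mid \mathbf{x}_1) & \cdots & P_1(y_M \mid \mathbf{x}_1) & \cdots & P_K(y_1 \mid \mathbf{x}_1) & \cdots & P_K(y_M \mid \mathbf{x}_1) \\ \vdots & & \vdots & & \vdots & & \vdots \\ P_1(y_1 \mid \mathbf{x}_N) & \cdots & P_1(y_M \mid \mathbf{x}_N) & \cdots & P_K(y_1 \mid \mathbf{x}_N) & \cdots & P_K(y_M \mid \mathbf{x}_N) \end{bmatrix} \quad (1b)$$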
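As a small illustration of the meta-data of a single observation (one row per base classifier, one column per class), the following Python sketch stacks the soft labels returned by a few classifiers. The classifier functions `clf_a` and `clf_b` and the helper names `meta_data` and `flatten` are hypothetical and exist only for this example; any model that returns posterior probabilities could take their place.

```python
from typing import Callable, List, Sequence

# A base classifier is modelled here as any function that maps a feature
# vector to a list of class posterior probabilities (a soft label).
SoftClassifier = Callable[[Sequence[float]], List[float]]


def meta_data(x: Sequence[float], classifiers: List[SoftClassifier]) -> List[List[float]]:
    """Return the meta-data matrix of x: one row per classifier, one column per class."""
    rows = [list(clf(x)) for clf in classifiers]
    for row in rows:
        # Soft labels must be probabilities that sum to one.
        assert all(0.0 <= p <= 1.0 for p in row)
        assert abs(sum(row) - 1.0) < 1e-9
    return rows


def flatten(matrix: List[List[float]]) -> List[float]:
    """Concatenate the rows into one vector, i.e. one row of the training meta-data matrix."""
    return [p for row in matrix for p in row]


# Two toy classifiers (hypothetical) for a three-class problem.
clf_a = lambda x: [0.7, 0.2, 0.1]
clf_b = lambda x: [0.5, 0.4, 0.1]

L_x = meta_data([1.0, 2.0], [clf_a, clf_b])
support_for_first_class = [row[0] for row in L_x]  # one column of the matrix
```

Each column of `L_x` collects what all classifiers say about one class, which is the set of numbers that a combiner later has to summarize.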

Justifiable Information Granularity
If the probability distribution of data is known in advance, it is easy to represent the data by its distribution function. However, this information is usually unavailable in many real-world applications, and point estimates such as mean, median and skewness are often used to describe the data. Nevertheless, in many scenarios, pointwise information is less useful for subsequent reasoning [13]. Instead, information granularity explicitly models the inherent uncertainty present in the data.
The concept of information granularity has been defined on many formal ways of describing
• Experimental evidence: The designed information granule Ω should reflect the existing experimental data so that the numeric evidence accumulated within the bounds of Ω attains the highest value. When the granule is formalized as a set (interval), the more data included within the bounds of the granule, the more legitimate this set becomes.
• Sound semantics: This requirement implies that the information granule should have well-defined semantics and exhibit high specificity. This implies that the smaller (more compact) the information granule (higher information granularity) is, the better (higher specificity) it is.
For example, if the information granule comes in the form of an interval, the knowledge expressed as an interval [2, 4] is regarded to be more specific than the one residing within the interval [0, 10].
The principle of justifiable granularity is about constructing an information granule in the form of an interval that satisfies the two requirements outlined above. Note that these two requirements apply only to the form of information granule proposed in this paper; there are several different approaches to formalizing information granules, such as in [45,46]. In what follows, $\ell = |a - b|$ denotes the length of the interval $\Omega = [a, b]$, where $a$ and $b$ are the lower and upper bounds of the interval, respectively.
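It is obvious that the two requirements are in conflict, since increasing the cardinality will result in the reduction of the specificity. A compromise can be reached by using the product of these two functions:
$$V = f_1(\mathrm{card}\{x \in D \mid x \in \Omega\}) \times f_2(|a - b|) \quad (3)$$
where $f_1$ is an increasing function of the number of data points covered by $\Omega$ (experimental evidence) and $f_2$ is a decreasing function of the interval length (specificity).
To build the information granule $\Omega$ on a given dataset $D$, we select the median (denoted by $\mathrm{med}(D)$) as the numerical representative of the experimental data. Then, $\Omega = [a, b]$ is formed by specifying its lower and upper bounds, in which $a \le \mathrm{med}(D) \le b$. Since the upper and lower bounds are constructed independently, we only discuss the procedure to find $b$ ($a$ is determined in the same way). Based on (3) we have:
$$V(b) = f_1(\mathrm{card}\{x \in D \mid \mathrm{med}(D) < x \le b\}) \times f_2(|\mathrm{med}(D) - b|) \quad (4)$$
The optimal upper bound of the interval is determined by maximizing the values of $V(b)$, i.e.,
$$b_{opt} = \arg\max_{b} V(b) \quad (5)$$
The optimal lower bound is found in the same manner:
$$a_{opt} = \arg\max_{a} V(a) \quad (6)$$
The following algorithm summarizes the construction of the information granule.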
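The interval construction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: the coverage function is taken to be the raw count of points between the median and a candidate bound, the specificity function is taken to be `exp(-alpha * distance)` (a common choice in the justifiable granularity literature), and candidate bounds are restricted to the observed data points. The names `justifiable_interval` and `_best_bound` are hypothetical.

```python
import math
from statistics import median
from typing import List, Tuple


def _best_bound(data: List[float], med: float, alpha: float, side: int) -> float:
    """Pick the bound on one side of the median (side=+1 upper, side=-1 lower)
    that maximizes coverage * specificity."""
    best, best_score = med, 0.0
    # Visit candidates from the one closest to the median outwards, so that
    # ties are resolved in favour of the shorter (more specific) interval.
    for c in sorted(data, key=lambda v: abs(v - med)):
        reach = (c - med) * side
        if reach <= 0:
            continue  # candidate is the median itself or lies on the other side
        # Assumed coverage: number of points strictly beyond the median, up to c.
        card = sum(1 for v in data if 0 < (v - med) * side <= reach)
        # Assumed specificity: decays exponentially with the distance to the median.
        score = card * math.exp(-alpha * reach)
        if score > best_score:
            best, best_score = c, score
    return best


def justifiable_interval(data: List[float], alpha: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds built independently around the median."""
    med = median(data)
    return _best_bound(data, med, alpha, -1), _best_bound(data, med, alpha, +1)


sample = [0.1, 0.2, 0.25, 0.3, 0.9]
wide = justifiable_interval(sample, 0.0)      # specificity ignored
narrow = justifiable_interval(sample, 100.0)  # specificity dominates
```

With `alpha = 0` the interval spans all the data, and as `alpha` grows the bounds shrink towards the points nearest the median, which is the trade-off between the two requirements.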

The Proposed framework
We now construct a combining method based on the concept of information granularity for the classification problem. In the proposed method, justifiable granularity is applied to the meta-data of an observation to form the interval class memberships, and the predicted label is then obtained via a translation to numerical class memberships. As the generated interval class memberships depend on the parameter $\alpha$ used in constructing the information granules, the performance of the method depends on $\alpha$ too. In the training process described in Algorithm 2, we first introduce a method to find the optimal value of $\alpha$ from a set $\mathcal{A}$ by exploiting the meta-data of the training observations. In this algorithm, we divide the training set $\mathcal{D}$ into $T$ disjoint parts $\mathcal{D}_1, \dots, \mathcal{D}_T$, where $\mathcal{D} = \mathcal{D}_1 \cup \dots \cup \mathcal{D}_T$ and $|\mathcal{D}_1| \approx \dots \approx |\mathcal{D}_T|$, and their corresponding complements $\tilde{\mathcal{D}}_1, \dots, \tilde{\mathcal{D}}_T$, in which $\tilde{\mathcal{D}}_t = \mathcal{D} - \mathcal{D}_t$. Then, $T$-fold CV is applied to the training set $\mathcal{D}$ such that the meta-data of the observations in $\mathcal{D}_t$ is obtained by the classifiers generated by running the learning algorithms on the associated part $\tilde{\mathcal{D}}_t$ (denoted by $BC_k^{\tilde{t}}$ in Algorithm 2). The meta-data of all training observations in $\mathcal{D}$ form an $N \times MK$ matrix as in (1b) (for $N$ training observations, $M$ classes and $K$ base classifiers), in which the $n$-th row is the prediction (meta-data) for training observation $\mathbf{x}_n$. For each $\mathbf{x}_n$, we apply the principle of justifiable granularity to its meta-data to construct the interval membership values and then predict the class label of $\mathbf{x}_n$ based on a discriminative decision model operating on the intervals. In (1a), the $m$-th column is the output of the $K$ classifiers for predicting $\mathbf{x}_n$ to be in the $m$-th class. For each value of $\alpha$ in $\mathcal{A}$, we apply Algorithm 1 to the meta-data of $\mathbf{x}_n$ to obtain the interval class memberships
$$[\underline{P}(y_m \mid \mathbf{x}_n), \overline{P}(y_m \mid \mathbf{x}_n)], \quad m = 1, \dots, M \quad (8)$$
Reasoning can be done on the interval membership values, e.g. using interval arithmetic [47], to form the final classification result.
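The cross-validation step that produces the meta-data of the training observations can be sketched as below. This is an illustrative outline only: a learner is modelled as a function that takes training data and returns a model, a model is a function that returns class posteriors, and the folds are formed by interleaving indices, which is one of several ways to obtain disjoint parts of nearly equal size. The names `t_fold_indices`, `cv_meta_data` and `fit_prior` are hypothetical.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], List[float]]  # x -> class posteriors
Learner = Callable[[list, list], Model]           # (X, y) -> trained model


def t_fold_indices(n: int, T: int) -> List[List[int]]:
    """Split indices 0..n-1 into T disjoint parts of nearly equal size."""
    return [list(range(t, n, T)) for t in range(T)]


def cv_meta_data(X: list, y: list, learners: List[Learner], T: int) -> list:
    """Meta-data of every training observation, produced only by models
    that did not see that observation during training."""
    n = len(X)
    meta = [None] * n
    for part in t_fold_indices(n, T):
        held_out = set(part)
        rest = [i for i in range(n) if i not in held_out]
        # Train every learning algorithm on the complement of the current part.
        models = [fit([X[i] for i in rest], [y[i] for i in rest]) for fit in learners]
        for i in part:
            meta[i] = [model(X[i]) for model in models]  # one row per classifier
    return meta


def fit_prior(X: list, y: list) -> Model:
    """Toy learner (hypothetical): predicts the class frequencies of its training part."""
    freq = [sum(1 for label in y if label == c) / len(y) for c in (0, 1)]
    return lambda x: freq


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 0, 1]
meta = cv_meta_data(X, y, [fit_prior], T=2)
```

Because each observation is predicted by models trained without it, the resulting meta-data can be used to tune the combiner without reusing the labels the base classifiers were fitted on.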
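In this paper, we introduce a transformation from the intervals in (8) to numerical class memberships using the following expression:
$$NCM(\mathbf{x}_n \in y_m) = g(\underline{P}(y_m \mid \mathbf{x}_n), \overline{P}(y_m \mid \mathbf{x}_n)) \times h(|\overline{P}(y_m \mid \mathbf{x}_n) - \underline{P}(y_m \mid \mathbf{x}_n)|) \quad (9)$$
where $NCM(\mathbf{x}_n \in y_m)$ denotes the numerical class membership that $\mathbf{x}_n$ belongs to class $y_m$, $g(\cdot)$ is the function that generates the numerical representation of the interval by using the lower and upper bounds, while $h(\cdot)$ is a decreasing function of the length of the interval $[\underline{P}(y_m \mid \mathbf{x}_n), \overline{P}(y_m \mid \mathbf{x}_n)]$ which reflects the specificity (or weight) of the numerical value generated by $g$ in the de-granularization process.
In this work, the function $g(\cdot)$ is chosen in the form of:
$$g(\underline{P}(y_m \mid \mathbf{x}_n), \overline{P}(y_m \mid \mathbf{x}_n)) = \dots \quad (10)$$
while $h(\cdot)$ is given by one of these three expressions.
$$h_1(|\overline{P}(y_m \mid \mathbf{x}_n) - \underline{P}(y_m \mid \mathbf{x}_n)|) = 1 \quad (11)$$
The Boolean class label of $\mathbf{x}_n$ is then predicted to be the class with the maximum class membership grade:
$$\mathbf{x}_n \in y_t \quad \text{if} \quad t = \arg\max_{m = 1, \dots, M} NCM(\mathbf{x}_n \in y_m) \quad (14)$$
Since $\mathbf{x}_n$ is a training observation, its true class label $y(\mathbf{x}_n)$ is known in advance. After looping the procedure through all training observations, the classification error rate associated with each $\alpha \in \mathcal{A}$ can be computed as:
$$err(\alpha) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{I}[\hat{y}_\alpha(\mathbf{x}_n) \ne y(\mathbf{x}_n)] \quad (15)$$
in which $\hat{y}_\alpha(\mathbf{x}_n)$ is the label predicted by (14) for the given $\alpha$, and $\mathbb{I}[\Theta] = 1$ if $\Theta$ is true and 0 otherwise. The optimal value of $\alpha$ is the one that minimizes $err$. This optimal value will be used as the input of the next algorithm to predict the class label for unlabeled observations.
In the classification process, for an unlabeled observation $\mathbf{x}^u$, we use the trained base classifiers $BC_k$, $k = 1, \dots, K$, to obtain the meta-data of $\mathbf{x}^u$ as in (1a). In detail, the meta-data of $\mathbf{x}^u$ associated with base classifier $BC_k$ is obtained in the form of the vector $(P_k(y_1 \mid \mathbf{x}^u), \dots, P_k(y_M \mid \mathbf{x}^u))$, in which $P_k(y_m \mid \mathbf{x}^u)$ is the posterior probability that $\mathbf{x}^u$ belongs to class $y_m$ given by $BC_k$. After that, the interval membership values for each class prediction are computed from the meta-data as in (8), i.e. $[\underline{P}(y_m \mid \mathbf{x}^u), \overline{P}(y_m \mid \mathbf{x}^u)]$, $m = 1, \dots, M$. Finally, the classification is obtained by (14). We arrive at the following classification process based on justifiable granularity: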

Algorithm 3: Predicting label for unlabeled observation
Assign class label to $\mathbf{x}^u$ by (14)
End For

Clearly, the proposed method described above is a trainable combining method because the meta-data of the training observations is exploited to find the value of $\alpha$ in the training process. If a specific value of $\alpha$ is used, the proposed method becomes a fixed combining method in which the labels in the meta-data of the training set are not used to train the combiner. In the experiments, we evaluate the proposed method in both cases, i.e. as a trainable and as a fixed combining method.
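To make the difference between the trainable and the fixed variant concrete, the following Python sketch turns the meta-data of an observation into numerical class memberships, predicts the label with the largest membership, and optionally selects the granularity parameter from a grid by minimizing the training error. Several choices here are assumptions made only for illustration: the interval is built with a count-based coverage and an exponential specificity `exp(-alpha * distance)`, the numerical representative of an interval is its midpoint, and the default weight is the constant 1. All function names are hypothetical.

```python
import math
from statistics import median


def interval(values, alpha):
    """Interval around the median (assumed coverage = count, specificity = exp(-alpha * distance))."""
    med = median(values)

    def bound(side):
        best, best_score = med, 0.0
        for c in sorted(values, key=lambda v: abs(v - med)):
            reach = (c - med) * side
            if reach <= 0:
                continue  # not on this side of the median
            card = sum(1 for v in values if 0 < (v - med) * side <= reach)
            score = card * math.exp(-alpha * reach)
            if score > best_score:
                best, best_score = c, score
        return best

    return bound(-1), bound(+1)


def ncm(meta, alpha, h=lambda length: 1.0):
    """Numerical class memberships from a meta-data matrix (rows: classifiers, columns: classes)."""
    memberships = []
    for m in range(len(meta[0])):
        lo, hi = interval([row[m] for row in meta], alpha)
        midpoint = 0.5 * (lo + hi)  # assumed numerical representative of the interval
        memberships.append(midpoint * h(hi - lo))
    return memberships


def predict(meta, alpha):
    """Index of the class with the maximum numerical class membership."""
    scores = ncm(meta, alpha)
    return max(range(len(scores)), key=scores.__getitem__)


def select_alpha(metas, labels, grid):
    """Trainable variant: pick the grid value with the lowest training error rate."""
    def err(alpha):
        wrong = sum(1 for meta, y in zip(metas, labels) if predict(meta, alpha) != y)
        return wrong / len(labels)
    return min(grid, key=err)


# Five classifiers, two classes; one classifier strongly disagrees with the rest.
meta = [[0.2, 0.8], [0.55, 0.45], [0.6, 0.4], [0.65, 0.35], [0.7, 0.3]]
fixed_prediction = predict(meta, 0.0)               # fixed variant with a preset value
best_alpha = select_alpha([meta], [0], [0.0, 10.0])  # trainable variant
```

In this toy case a small parameter value lets the single dissenting classifier stretch both intervals and flip the decision, whereas a larger value keeps the intervals tight around the majority opinion.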

Datasets and Experimental Settings
To evaluate the performance of the proposed method, we carried out experiments on twenty-one UCI datasets, as shown in Table 2. These datasets are often used to assess the performance of classification systems [48].

Table 2
Dataset         # of attributes   # of observations   # of classes
Abalone         8                 4174                3
Artificial      10                700                 2
Australian      14                690                 2
Blood           4                 748                 2
Bupa            6                 345                 2
Contraceptive   9                 1473                3
Dermatology     34                358                 6
Fertility       9                 100                 2
Haberman        3                 306                 2
Heart           13                270                 2
Penbased        16                10992               10
Pima            8                 768                 2
Plant Margin    64                1600                100
Satimage        36                6435                6
Skin_NonSkin    3                 245057              2
Tae             20                151                 3
Texture         40                5500                10
Twonorm         20                7400                2
Vehicle         18                946                 4
Vertebral       6                 310                 3
Yeast           8                 1484                10

We performed extensive comparative studies with a number of existing algorithms as Classifier [49], Nearest Mean Classifier, and Logistic Linear [50], were chosen to construct the heterogeneous ensemble system. These learning algorithms were chosen to ensure diversity of the ensemble system. The proposed method is compared to the benchmark algorithms with respect to the classification error rate and F1 score (which is the harmonic mean of Precision and Recall) [51]. We performed 10-fold cross validation and ran the test 10 times to obtain 100 test results for each dataset.
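For reference, the two evaluation measures can be computed as in the short Python sketch below. The per-class F1 follows the harmonic-mean definition given above; averaging the per-class values with equal weight (macro averaging) is an assumption, since the text does not say how the multi-class score is aggregated. The function names are hypothetical.

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted label differs from the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)


def f1_for_class(y_true, y_pred, c):
    """Harmonic mean of precision and recall for class c."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed aggregation)."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
err = error_rate(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```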
All source code was implemented in Matlab and run on a PC with a 2.5 GHz Intel Core i5 processor and 4 GB of RAM. To assess the statistical significance of the results, i.e., to determine whether the difference in classification error rate is statistically meaningful, we used the Wilcoxon signed-rank test [52] (the level of significance was set to 0.05) to compare the classification results of our approach and each benchmark algorithm.
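A paired comparison of this kind can be sketched in plain Python as follows. This is a simplified version written for illustration, not the procedure used in the experiments: zero differences are discarded, tied absolute differences share their average rank, and the two-sided p-value comes from the normal approximation without tie or continuity corrections, so it is only indicative for small samples. The function name `wilcoxon_signed_rank` and the sample error rates are hypothetical.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Return (W, p) for paired samples a and b using a normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # shared by all tied absolute differences
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return w, p


# Hypothetical error rates (in percent) of two methods on ten datasets.
ours = [10, 12, 9, 15, 11, 14, 8, 13, 16, 7]
other = [11, 14, 12, 19, 16, 20, 15, 21, 25, 17]
w, p = wilcoxon_signed_rank(ours, other)
significant = p < 0.05
```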

The influence of $\alpha$ and $h$
We first analyzed the influence of the parameters on the classification results. Here, we evaluated the effect of $\alpha$ on the classification error rate by setting this parameter to one of the values in $\{0, 0.1, 0.2, \dots, 3.9, 4\}$. For each dataset, we ran the proposed method for each value of $\alpha$ and reported the classification error rate corresponding to the three functions $h_1$, $h_2$ and $h_3$. The relationships between $\alpha$ and the classification error rate on some datasets are displayed in Fig. 2.
Several observations can be made. First, it is interesting to see that the three $h$ functions have very similar error rate profiles in the proposed ensemble system on the two-class datasets. Meanwhile, on the other datasets, the error rates related to $h_1$ and $h_3$ are nearly equal and are lower than that of $h_2$. For example, on Contraceptive, Vehicle, Tae, and Yeast, the error rates related to $h_2$ are 3-5% higher than those of $h_1$ and $h_3$. It is noted that $h_2$ is more sensitive to the interval length than the others.
Specifically, if the interval length is too small, the function $h_2$ returns large values because it grows without bound as the interval length approaches zero. Since some information granule intervals can be very small (see Table A.1), we suggest using $h_1$ or $h_3$ to generate the numerical class memberships from the interval-based information granules. In the subsequent discussion, we only report the classification results for $h_3$.

Comparison with the benchmark algorithms
The mean and variance of the error rates and F1 scores of the ten learning algorithms, the benchmark algorithms, and the proposed method (using $h_3$) are reported in Tables A.2 to A.7. We first compared the average ranking of the proposed method to that of the ten learning algorithms [52]. Table 3 shows the average ranking of the ten learning algorithms and the proposed methods with respect to the error rate and F1 score on the experimental datasets. The Proposed CV10 and Proposed Specific10 are
The statistical test results in Table 4 show that the proposed method is significantly better than all benchmark algorithms on the experimental datasets. This demonstrates the benefit of using information granules to capture the uncertainty in class label prediction as opposed to just using pointwise information in the meta-data. Note that our framework is able to return not only the numerical class memberships for class label prediction but also the interval membership values that reflect the uncertainty associated with the class prediction by the base classifiers.
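An average ranking of this kind can be computed as in the following Python sketch: on each dataset the methods are ranked by error rate (rank 1 for the lowest), tied methods share their average rank, and the ranks are then averaged over the datasets. The tie handling is an assumption, and the function names and the error table are made up for the example.

```python
def ranks(values):
    """Rank values in ascending order (1 = smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def average_ranking(table):
    """table[d][a] is the error rate of method a on dataset d."""
    per_dataset = [ranks(row) for row in table]
    n_methods = len(table[0])
    return [sum(r[a] for r in per_dataset) / len(table) for a in range(n_methods)]


# Hypothetical error rates of three methods on three datasets.
errors = [
    [0.10, 0.20, 0.30],
    [0.25, 0.15, 0.25],
    [0.05, 0.05, 0.40],
]
avg = average_ranking(errors)
```

A lower average rank means the method was more often among the best across the datasets, independently of how large the error differences were.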
In detail, the proposed method with cross validation clearly outperformed all six fixed combining rules. Proposed CV10 also outperformed the trainable combining method Decision Template (12 wins vs 1 loss for error rate, and 7 wins vs 3 losses for F1 score). It also achieved better results than the three homogeneous ensemble methods: Bagging (12 wins vs 3 losses for error rate, and 12 wins vs 6 losses for F1 score), Random Subspace (16 wins vs 2 losses for both error rate and F1 score), and AdaBoost (18 wins vs 2 losses for error rate, and 16 wins vs 3 losses for F1 score).
When the specific value $\alpha = 1$ was used, the proposed method was still better than all the fixed combining rules. Proposed Specific10 also outperformed AdaBoost (17 wins vs 2 losses for error rate, and 14 wins vs 4 losses for F1 score), Bagging (9 wins vs 3 losses for error rate, and 8 wins vs 6 losses for F1 score), and Random Subspace (15 wins vs 2 losses for error rate, and 13 wins vs 2 losses for F1 score). It also outperformed Decision Template by 10 wins vs 2 losses for error rate and 6 wins vs 3 losses for F1 score.

Time complexity analysis
In the case of using a specific value of $\alpha = 1$, the time complexity of training the base classifiers is equal to that of other fixed combining counterparts such as Sum Rule and Product Rule. Meanwhile, in the case of using the optimal value of $\alpha$, the overall time complexity of the proposed method using cross validation is $O(\max(\max_{k=1,\dots,K} O(\mathcal{K}_k) \times T, \; N \times M \times K \times \log K))$, in which $O(\max_{k=1,\dots,K} O(\mathcal{K}_k) \times T)$ is the time complexity of generating the meta-data of the training set by running $T$-fold cross validation with the learning algorithms $\mathcal{K}_k$ ($k = 1, \dots, K$) having complexity $O(\mathcal{K}_k)$, and $O(N \times M \times K \times \log K)$ is the time complexity to obtain the interval class memberships for the training observations. The time complexity of the testing process is $O(M \times K \times \log K)$. Based on the experimental results, our testing process is slightly more complex than that of other fixed combining methods and has a longer running time.

Different number of learning algorithms
To demonstrate the effectiveness of the proposed method, five additional learning algorithms, and Proposed Specific15 can be found in the supplement material). First, the average rankings shown in Table 5 indicate the outstanding performance of the proposed method compared to the 15 learning algorithms: Proposed CV15 ranks first, with average rankings of 2.90 and 3.52 for error rate and F1 score, respectively, closely followed by Proposed Specific15 (with rankings of 4.33 and 4.55, respectively). In addition, the statistical test results in Table 6 show that both Proposed CV15 and Proposed Specific15 achieve significantly better performance than all the benchmark algorithms.
The bias-variance theorem is often used to demonstrate that ensemble methods can reduce bias without a tradeoff in variance [56].

Conclusions
In this paper, we have introduced a novel fixed combining method for classifier ensembles based on the justifiable granularity concept. Instead of using a single membership value given by pointwise statistics such as the mean, maximum, minimum, or median, we applied the justifiable granularity concept to the meta-data to find the interval associated with each class prediction. This interval reflects the uncertainty in the class prediction given by the base classifiers and is a richer representation of the information in the meta-data. The numerical class memberships can then be computed from these intervals by considering their bounds and interval length for class label prediction. Extensive experiments were conducted using ensemble systems of ten and fifteen base classifiers, and the performance with respect to classification error rate and F1 score was compared with several benchmark algorithms on twenty-one UCI datasets. Moreover, other designs of information granule such as in [45,46,57] could also be studied. These will be the directions of our future work.
•: The benchmark algorithm is equal to Proposed CV10, □: The benchmark algorithm is better than Proposed CV10, ■: The benchmark algorithm is worse than Proposed CV10
◊: The benchmark algorithm is equal to Proposed Specific10, ▲: The benchmark algorithm is better than Proposed Specific10, ▼: The benchmark algorithm is worse than Proposed Specific10