Sparse Representation Based Augmented Multinomial Logistic Extreme Learning Machine with Weighted Composite Features for Spectral Spatial Hyperspectral Image Classification

Although the extreme learning machine (ELM) has been successfully applied to a number of pattern recognition problems, it fails to provide sufficiently good results in hyperspectral image (HSI) classification due to two main drawbacks. The first is the random input weights and biases of the ELM, which may lead to ill-posed problems. The second is the lack of spatial information for classification. To tackle these two problems, in this paper we propose a new framework for ELM-based spectral-spatial classification of HSI, where probabilistic modelling with sparse representation and weighted composite features (WCFs) are employed to derive the optimized output weights and to extract spatial features, respectively. First, the ELM is represented as a concave logarithmic likelihood function under statistical modelling using the maximum a posteriori (MAP) criterion. Second, sparse representation with a Laplacian prior is applied to efficiently obtain a logarithmic posterior with a unique maximum, solving the ill-posed problem of the ELM. Variable splitting and the augmented Lagrangian are subsequently used to further reduce the computational complexity of the proposed algorithm, an approach proven to be efficient for speed improvement. Third, the spatial information is extracted using the weighted composite features (WCFs) to construct the spectral-spatial classification framework. In addition, the lower bound of the proposed method is derived by a rigorous mathematical proof. Experimental results on two publicly available HSI data sets demonstrate that the proposed methodology outperforms the ELM and a number of state-of-the-art approaches.


I. INTRODUCTION
With rich spectral and spatial information contained in a three-dimensional hypercube, hyperspectral images (HSIs) provide a unique way of characterizing objects in geographical scenes, especially in remote sensing [1]. However, classification of high-dimensional data such as HSIs is still challenging, particularly due to the unfavorable ratio between the limited number of training samples and the large number of spectral bands, i.e. the Hughes phenomenon [2]-[4]. To tackle this problem, a number of feature extraction and data classification approaches have been proposed [12]. These include singular spectrum analysis (SSA) [5]-[8], segmented auto-encoders [11], principal component analysis (PCA) and its variations [13]-[14], and spectral-spatial classification using multiple kernels and active learning [15], [16], where support vector machines (SVMs) [9], [10] are widely used for data classification. Although these approaches have produced quite good results, their performance can be further improved by addressing two main difficulties: (1) inaccurate classification under a large number of spectral bands yet limited training samples, and (2) relatively low efficiency in processing high-dimensional HSI data.
As a single-hidden-layer feedforward neural network, the extreme learning machine (ELM) is a fast and effective machine learning method and has received wide attention due to its good performance [17]-[19]. The ELM does not need to tune the hidden layer parameters once the number of hidden layer nodes is determined: the weight and bias vectors between the input layer and the hidden layer are randomly generated, independent of the training samples and the specific application [1]. The ELM has achieved good performance in many applications [20]-[23] and has also been widely applied to HSI classification. For example, in [24], [25], bilateral filtering and extended morphological profiles were used for feature extraction, respectively, with the ELM used for classification. In [26], [27], [28], superpixels, the watershed transform and Gabor filters were used for feature extraction, again with the ELM as the classifier. Although these ELM-based methods have achieved good performance, they simply combine the ELM with other feature extraction methods and ignore an important problem in the ELM: the randomly generated input weights and biases may cause ill-posed problems. From this perspective, we propose the augmented sparse multinomial logistic ELM (ASMLELM) for HSI classification. Building on the ASMLELM, we further propose weighted composite features (WCFs) to extract rich spatial information. A novel framework, ASMLELM-WCFs, is therefore proposed for spectral-spatial classification of HSI, and the main contributions can be highlighted as follows.
First, the ELM is represented by a maximum a posteriori (MAP) based probabilistic model, which is further represented by a concave logarithmic likelihood function (LLF). As the LLF value can be arbitrarily large if the training data are separable, a prior/regularization term on the LLF is critical [29], for which sparse representation is employed. To address the ill-posed problem in the ELM, a Laplacian prior/regularization term is applied to improve the sparsity of the learnt weights of the proposed ELM classifier. The Laplacian density is the heaviest-tailed log-concave density, and its sparsity-promoting nature has been theoretically well justified [29]-[32]. Combined with a concave logarithmic likelihood, it guarantees that the logarithmic posterior has a unique maximum [29]. As the ELM is represented by a probabilistic model under the MAP criterion, it enjoys this unique-maximum property, so the ill-posed problem in the ELM is addressed; the resulting model is named the sparse multinomial logistic ELM (SMLELM). Recently, variable splitting and the augmented Lagrangian [33] have been proposed for speed improvement. Hence, we apply variable splitting and the augmented Lagrangian to the proposed SMLELM, transforming it into a new, faster form, the augmented SMLELM (ASMLELM).
Second, by combining composite kernels (CK) [34] and weighted mean filters (WMFs) [35], the weighted composite features (WCFs) are utilized to extract spatial features and further improve the classification accuracy. Accordingly, three improved spectral-spatial classifiers are derived, which include the ELM, the nonlinear ELM (NLELM) and the kernel ELM (KELM) based classifiers.
Third, the generalization bounds of the proposed method are derived in a similar way to the margin bounds of the sparse multinomial logistic regression (SMLR) [29], [36], [37]. These bounds provide theoretical insight into, and justification for, the proposed methods.
The rest of this paper is organized as follows. In Section II, the background of the ELM is introduced. The proposed method is detailed in Section III. Section IV reports the experimental results in benchmarking with a few state-of-the-art approaches. Finally, some conclusions are drawn in Section V.

A. Basic Concepts of ELM
The ELM is a generalized single-hidden-layer feedforward neural network (SLFN) [1], [17]. The weight vectors and biases between the input layer and the hidden layer are randomly generated, and only the output weights are determined by the learning process. Once the input weights and biases are assigned, the hidden layer output matrix remains unchanged during learning [1].
Let X ≡ (x₁, x₂, …, x_N) ∈ ℝ^{N×d} be the training data of an HSI, which has N pixels, each with a d-dimensional feature. Let Y = (y₁, y₂, …, y_N) ∈ ℝ^{N×M} be a matrix representing the class labels of the training samples, where M is the number of classes in the data set. Given a pixel label y_i, if the pixel belongs to the k-th class, we have y_{i,k} = 1 if c_i = k and y_{i,k} = 0 otherwise. The model of a single hidden layer of the neural network with L hidden neurons and the activation function g(·) can be expressed as follows:

∑_{j=1}^{L} β_j g(w_j · x_i + b_j) = o_i,  i = 1, …, N,  (1)

where β_j represents the weight vector between the j-th hidden neuron and the output layer; w_j and b_j are the weight vector and bias from the inputs to the j-th hidden neuron, respectively; g(w_j · x_i + b_j) represents the output of the j-th hidden neuron with respect to the input sample x_i. Equation (1) can be further expressed in the following matrix form:

Hβ = Y,  (2)

where H ∈ ℝ^{N×L} is the hidden layer output matrix with H_{ij} = g(w_j · x_i + b_j), and β ∈ ℝ^{L×M} is the output weight matrix between the hidden layer and the output layer. From (2), β can be simply obtained as

β = H†Y,  (3)

where H† is the Moore-Penrose generalized inverse of the matrix H [17].
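As a concrete illustration of equations (1)-(3), the basic ELM reduces to a few lines of linear algebra. The sketch below is our own minimal implementation, not the authors' code: the sigmoid activation, the toy data and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, Y, L=50):
    """Basic ELM: random input weights/biases, pseudo-inverse output weights."""
    d = X.shape[1]
    W = rng.standard_normal((d, L))          # random input weights w_j
    b = rng.standard_normal(L)               # random biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid hidden-layer output matrix H
    beta = np.linalg.pinv(H) @ Y             # beta = H^dagger Y, as in (3)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                          # H beta approximates Y, as in (2)

# Toy usage: two Gaussian blobs with one-hot labels.
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(3, 1, (30, 5))])
Y = np.zeros((60, 2)); Y[:30, 0] = 1; Y[30:, 1] = 1
W, b, beta = elm_train(X, Y, L=40)
acc = (elm_predict(X, W, b, beta).argmax(1) == Y.argmax(1)).mean()
```

Since `np.linalg.pinv` computes the Moore-Penrose inverse H†, training costs a single pseudo-inverse and one matrix product.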

B. Constrained Optimization of the ELM
The constrained optimization of the ELM aims to achieve not only the smallest training error but also the smallest output weights [19]:

min ‖Hβ − Y‖² and ‖β‖².  (4)

According to Bartlett's neural network generalization theory [38], smaller weights result in better generalization of feedforward neural networks. According to this optimization theory, Eq. (4) can be rewritten as:

min_β (1/2)‖β‖² + (C/2) ∑_{i=1}^{N} ‖ξ_i‖²,  s.t. h(x_i)β = y_i − ξ_i,  i = 1, …, N,

where ξ_i is the training error for the training sample x_i, and C is the regularization parameter. Based on the Karush-Kuhn-Tucker (KKT) theorem [39], training the ELM is equivalent to solving the following dual optimization problem:

min (1/2)‖β‖² + (C/2) ∑_{i=1}^{N} ‖ξ_i‖² − ∑_{i=1}^{N} ∑_{k=1}^{M} α_{i,k} ( h(x_i)β_k − y_{i,k} + ξ_{i,k} ),

where β_k is the k-th column of the matrix β, and α_{i,k} is the Lagrange multiplier. From the KKT theorem, we can further derive

∂/∂β_k = 0 → β = Hᵀα,
∂/∂ξ_i = 0 → α_i = Cξ_i,  i = 1, …, N,

where α_i = [α_{i,1}, α_{i,2}, …, α_{i,M}] and α = [α₁, α₂, …, α_N]ᵀ. Then it can be shown that the output weight is:

β = Hᵀ ( I/C + HHᵀ )⁻¹ Y.  (10)

When the activation functions of the neurons in the hidden layer are unknown, any kernel satisfying Mercer's conditions can be used:

Ω = HHᵀ,  Ω_{ij} = h(x_i) · h(x_j)ᵀ = K(x_i, x_j).

In fact, the Gaussian kernel is one of the good choices:

K(x_i, x_j) = exp(−‖x_i − x_j‖²/σ²).

Based on the above analysis, two well-known constrained optimization methods of the ELM have been proposed. One defines β as in (10) without a kernel, namely the nonlinear ELM (NLELM). The other uses the kernel function, giving the kernel ELM (KELM).
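The two constrained-optimization variants can be sketched as follows. This is a hedged illustration of the standard NLELM solution β = Hᵀ(I/C + HHᵀ)⁻¹Y and of a Gaussian-kernel predictor; the function names and parameterization are our own, not the paper's code.

```python
import numpy as np

def nlelm_weights(H, Y, C=100.0):
    """NLELM output weights: beta = H^T (I/C + H H^T)^{-1} Y."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, Y)

def gaussian_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma ||a - b||^2), a Mercer kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kelm_predict(X_train, Y, X_test, C=100.0, gamma=1.0):
    """KELM: replace H H^T by the kernel matrix Omega and predict on X_test."""
    N = X_train.shape[0]
    alpha = np.linalg.solve(np.eye(N) / C + gaussian_kernel(X_train, X_train, gamma), Y)
    return gaussian_kernel(X_test, X_train, gamma) @ alpha
```

In the kernel case the hidden layer never needs to be formed explicitly; only the N×N kernel matrix is required.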
III. THE PROPOSED FRAMEWORK

A. Augmented Sparse Multinomial Logistic Extreme Learning Machine (ASMLELM)
The goal of a supervised learning algorithm is to design, from a set of N training samples, a classifier capable of distinguishing M classes on the basis of an input vector of length d [29]. Under the multinomial logistic regression model [40], β in equations (3), (13) and (14) can be transformed into a new form via a probabilistic model. If the training sample x_i belongs to the j-th class, the probability model can be represented by the following equation:

P(y_i = j | x_i, β) = exp(h(x_i)β_j) / ∑_{k=1}^{M} exp(h(x_i)β_k).  (15)

In (3), (13), (14) and (15), β may not be optimal due to the ill-posed problem. Therefore, it is important to find the optimal β to obtain high classification accuracy. To find the optimal parameter for the ELM, β is re-estimated after representing the ELM by a probabilistic model. To this end, the maximum likelihood (ML) estimation is introduced to the ELM. Let β = [β₁; β₂; ⋯; β_M] ∈ ℝ^{(L×M)×1} be a column vector with L × M elements. A simple maximization of the logarithmic likelihood can be expressed as follows:

L(β) = ∑_{i=1}^{N} [ ∑_{k=1}^{M} y_{i,k} h(x_i)β_k − log ∑_{k=1}^{M} exp(h(x_i)β_k) ].  (16)

In order to maximize L(β), consider the second-order Taylor expansion of L(β) evaluated at β′:

L(β) = L(β′) + (β − β′)ᵀ g(β′) + (1/2)(β − β′)ᵀ ∇²L(λβ + (1 − λ)β′)(β − β′),  (17)

where λ ∈ (0,1), g(·) is the gradient of L(·), and the Hessian is lower bounded by

B ≡ −(1/2) [ I − 11ᵀ/M ] ⊗ (HᵀH),  (18)

where I ∈ ℝ^{M×M} is an identity matrix, 1 = [1, 1, …, 1]ᵀ and ⊗ is the Kronecker matrix product. This result has been proved in [40], [41]. Then, the ML estimation can be expressed as follows:

β̂ = arg max_β { (β − β′)ᵀ g(β′) + (1/2)(β − β′)ᵀ B (β − β′) }.  (19)

Hence, β at the (t+1)-th iteration can be expressed by a simple update equation:

β^{(t+1)} = β^{(t)} − B⁻¹ g(β^{(t)}).  (20)

From (20), it can be seen that the update is very similar to an iteratively reweighted least squares (IRLS) algorithm [42], except that the Hessian matrix of the IRLS algorithm is replaced by the fixed matrix B. Since B⁻¹ can be precomputed once, whereas the Hessian in the IRLS algorithm must be inverted at every iteration [29], [43], the proposed approach is considerably more efficient than the IRLS algorithm.
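To make the advantage over IRLS concrete, the sketch below applies the same bound-optimization idea to the simpler binary logistic case, where the log-likelihood Hessian is bounded by the fixed matrix −XᵀX/4 (Böhning's bound), so the inverse is computed once rather than refactored per iteration. This is an illustrative analogue under our own simplifying assumptions, not the paper's multinomial implementation.

```python
import numpy as np

def logistic_bound_fit(X, y, iters=200):
    """Bound optimization for binary logistic regression (Boehning's bound).
    The log-likelihood Hessian -X^T S X is bounded below by B = -X^T X / 4,
    so B^{-1} is computed once, unlike the per-iteration inverse of IRLS."""
    n, d = X.shape
    Binv = np.linalg.inv(X.T @ X / 4.0 + 1e-8 * np.eye(d))  # precomputed once
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # current class probabilities
        beta = beta + Binv @ (X.T @ (y - p))    # ascent step with fixed curvature
    return beta
```

Because B bounds the Hessian, each step is a monotone ascent on the log-likelihood, mirroring the update in (20).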
However, the concave LLF value can be arbitrarily large if the training data are separable. From [29], it is known that a prior on the logarithmic likelihood is crucial. In order to address the ill-posed problem in the ELM, a prior/regularization term is imposed on β. Here, the Laplacian prior is used:

p(β) ∝ exp(−λ‖β‖₁),  (21)

where λ is a regularization parameter, ‖β‖₁ = ∑_j |β_j| denotes the l1 norm, and |β_j| = √(β_j²).
Consider the following inequality:

√u ≤ (√h + u/√h)/2,

which holds for any h > 0 and u > 0, with equality if and only if u = h. Applying it to |β_j| = √(β_j²) at any β′ yields a quadratic upper bound on ‖β‖₁. Therefore, (20) can be expressed by the following equation:

β^{(t+1)} = ( B − λΛ^{(t)} )⁻¹ ( Bβ^{(t)} − g(β^{(t)}) ),  Λ^{(t)} = diag( |β_j^{(t)}|⁻¹ ).  (25)

From the above, it can be seen that the Laplacian prior/regularization term is applied to β with λ acting as a regularization parameter. The Laplacian prior imposed on the sparse multinomial logistic ELM (SMLELM) controls the complexity of the SMLELM classifier and improves its generalization capacity, since the log-prior −λ‖β‖₁ in (22) forces many components of β to be zero.
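The sparsity-promoting effect of the l1 term is visible in its proximal operator, the soft-threshold function, which sets small coefficients exactly to zero. This is a generic illustration of why the Laplacian prior zeroes components of β, not a step of the authors' algorithm.

```python
import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1: shrinks coefficients toward zero
    and sets every component with |v_j| <= lam exactly to zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([2.5, -0.3, 0.05, -1.2])
w = soft_threshold(v, 0.5)   # small components become exactly zero
```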
Since the term L(β) in (16) is not quadratic and the regularizer in (22) is nonsmooth, solving the optimization problem defined in (25) is difficult. A method called majorization minimization [43] has been proposed to decompose this kind of problem [29], [44]-[47], but its computational complexity is high. In [48], logistic regression via variable splitting and an augmented Lagrangian (LORSAL) was used for speed improvement, and variable splitting with an augmented Lagrangian has likewise been used for speed improvement in related work [33], [16], [48]. This approach has been proven to greatly reduce the computational complexity [16], [45], [49]. As variable splitting and the augmented Lagrangian perform well in the corresponding works, we use this approach to reduce the running time of the proposed SMLELM by transforming it into a new form.
Variable splitting is a very simple procedure that consists in creating a new variable, which serves as a copy of β [50]. The problem defined in (25) is then equivalent to a constrained problem in which the smooth likelihood term and the nonsmooth l1 term are assigned to two separate variables coupled by an equality constraint. This optimization problem can be solved by applying the alternating direction method of multipliers (ADMM) [51] (see also [52] and the references therein). The resulting neural network is called the augmented SMLELM (ASMLELM).
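As a generic illustration of variable splitting with an augmented Lagrangian, the sketch below solves an l1-regularized least-squares problem by ADMM. The quadratic subproblem plays the role of the smooth likelihood term, and the matrix inverse is formed once and reused, which is the source of the speed-up; the quadratic data term and all names are our own simplifications, not the ASMLELM objective.

```python
import numpy as np

def admm_l1(A, b, lam=0.1, rho=1.0, iters=300):
    """Variable splitting + augmented Lagrangian (ADMM) for
    min 0.5*||A beta - b||^2 + lam*||omega||_1  s.t.  beta = omega."""
    n = A.shape[1]
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))   # factored once, reused every step
    Atb = A.T @ b
    beta = np.zeros(n); omega = np.zeros(n); u = np.zeros(n)
    for _ in range(iters):
        beta = M @ (Atb + rho * (omega - u))        # smooth (quadratic) subproblem
        z = beta + u
        omega = np.sign(z) * np.maximum(np.abs(z) - lam / rho, 0.0)  # l1 subproblem
        u = u + beta - omega                        # scaled dual update
    return omega
```

The split variable `omega` is updated by an exact soft-threshold, so the returned solution is exactly sparse.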

B. Weighted Composite Features Based ASMLELM (ASMLELM-WCFs)
From the above, it can be seen that the ASMLELM uses only the spectral information of the HSI data for classification. However, a pixel and its spatial neighbors are likely to belong to the same class [1], so the spatial information is very important and should be incorporated into the ASMLELM. Hence, the WCFs are used to perform spectral-spatial classification with the proposed ASMLELM.
For a given pixel x_i, let its pixel coordinates be (p, q); then the local pixel neighborhood centered at x_i is N(x_i) = { x = (p′, q′) | p′ ∈ [p − a, p + a], q′ ∈ [q − a, q + a] }, where a = (w_opt − 1)/2 and w_opt is the width of the neighborhood window. Let x_iʷ be the spectral feature of the training sample and x_iˢ be the spatial feature extracted from the local spatial neighborhood of the pixel, computed as a weighted mean of the neighboring pixels x_j ∈ N(x_i), where the weights depend on the spectral similarity between the central pixel and its neighbors with a decay parameter γ. Following the setting in [35], we set γ = 0.2 in this work. Then, the output matrix of the hidden layer defined in (2) and (13) can be expressed as a combination of the spectral and spatial features, where μ is a combination coefficient balancing the spectral and spatial information.
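One possible reading of the weighted-mean spatial feature is sketched below: each neighbor contributes in proportion to its spectral similarity to the central pixel, with decay parameter γ. The exact weighting in [35] may differ; this form is an illustrative assumption.

```python
import numpy as np

def wcf_spatial_feature(cube, p, q, a=4, gamma=0.2):
    """Weighted mean over the (2a+1)x(2a+1) neighborhood of pixel (p, q):
    spectrally similar neighbors get larger weights (decay parameter gamma)."""
    rows, cols, bands = cube.shape
    center = cube[p, q]
    acc = np.zeros(bands)
    wsum = 0.0
    for i in range(max(0, p - a), min(rows, p + a + 1)):
        for j in range(max(0, q - a), min(cols, q + a + 1)):
            w = np.exp(-gamma * np.sum((cube[i, j] - center) ** 2))
            acc += w * cube[i, j]
            wsum += w
    return acc / wsum
```

Clamping the loop bounds to the image extent handles pixels near the border, where the window is truncated.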
For the KELM defined in (14), the composite kernel is constructed from a spectral kernel and a spatial kernel, where σ_w and σ_s control the widths of the spectral and spatial Gaussian kernels, respectively. Now, three new methods for spectral-spatial HSI classification can be proposed by combining the ASMLELM with the WCFs: the ASMLBELM-WCFs, the ASMLNLELM-WCFs and the ASMLKELM-WCFs. The details are summarized in Algorithm 2, and Figure 1 shows the flowchart of the proposed ASMLELM-WCFs.
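The composite-kernel idea can be sketched as a weighted sum of a spectral Gaussian kernel and a spatial Gaussian kernel, with μ balancing the two sources. The inverse-width parameters `g_w` and `g_s` and the exact combination form are our own assumptions for illustration.

```python
import numpy as np

def composite_kernel(Xw, Xs, mu=0.1, g_w=1.0, g_s=1.0):
    """K = mu * K_spectral + (1 - mu) * K_spatial, each a Gaussian kernel.
    Xw: spectral features (N x d_w); Xs: spatial features (N x d_s)."""
    def gauss(A, g):
        d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
        return np.exp(-g * d2)
    return mu * gauss(Xw, g_w) + (1 - mu) * gauss(Xs, g_s)
```

A convex combination of two Mercer kernels is again a Mercer kernel, so the result can be plugged directly into the KELM solution.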

D. The lower bound of the ASMLELM
In this section, the lower bound of the proposed ASMLELM is derived. From (19) and (20), it is known that B is symmetric and negative definite, independent of β, and the iterate β^{(t+1)} at the (t+1)-th iteration is defined as in (20). To prove part (b) of the lemma, suppose that ‖∇L(β^{(t)})‖ is bounded away from zero by some positive value. From (b) of Lemma 1, the increments of the objective are then bounded below by a positive constant, which contradicts the boundedness of the objective. As a result, it can be concluded that the sequence ∇L(β^{(t)}) converges to 0.

A. HSI Data Sets
In this section, the performance of the proposed framework is evaluated on two well-known, publicly available HSI data sets: the Indian Pines data set and the Pavia University data set.
(1) Indian Pines data set: The Indian Pines HSI data set covers an agricultural scene collected by the AVIRIS sensor in June 1992. The image scene has 145×145 pixels with 200 spectral bands after 20 water-absorption bands are removed. The wavelength range is 0.4 μm to 2.5 μm and the spatial resolution of the HSI data is 20 m per pixel. There are 16 classes in total in the HSI data.
(2) Pavia University data set: The Pavia University HSI data set consists of data over the city of Pavia, Italy, acquired by the ROSIS instrument in 2001. The image scene has 610×340 pixels with 103 spectral bands after removing 12 noisy bands. The spatial resolution of the HSI data is 1.3 m per pixel. There are 9 classes in total in the HSI data.

B. Measurements and Parameter Setting
The parameter settings and the measurements are described below before conducting the experiments. The six proposed methods are compared with state-of-the-art methods including the SVM, the SVM with CK (SVM-CK), LORSAL, KLORSAL, SMLR-SpATV, BELM, NLELM and KELM. The LIBSVM software [53] is used for the implementation of the SVM and the SVM-CK. For the kernel-based methods, i.e. the SVM, the SVM-CK, the KELM, the ASMLKELM and the ASMLKELM-WCFs, the Gaussian kernel is used. The Gaussian kernel parameter σ and the penalty parameter C are automatically tuned by threefold cross-validation over the grids C = 2^ρ and σ = 2^τ, where ρ ∈ {1, 2, …, 15} and τ ∈ {−6, −5, −4, −3, −2, −1, 0, 1}. The other parameters of the SVM and the SVM-CK are set the same as in [1]. The parameters of LORSAL, KLORSAL (kernel-based LORSAL) and SMLR-SpATV (KLORSAL with the weighted Markov random field [54]-[55]) are chosen the same as in [43]. All experiments are conducted in MATLAB R2015a on a computer with a 2.9 GHz CPU and 32.0 GB RAM.
(1) For the proposed ASMLBELM, the total number of neurons in the hidden layer, L, and λ are very important parameters. They will be evaluated in the next subsection.
(2) For the ASMLNLELM, the parameter C is automatically tuned by threefold cross-validation. The effects of L and λ will be evaluated in the next subsection.
(3) The important parameters of the ASMLKELM are C, σ and λ. C and σ are automatically tuned by threefold cross-validation. The effect of λ will be evaluated in the next subsection.
(4) C, L, λ and μ are important parameters of the proposed ASMLBELM-WCFs. C is automatically tuned by threefold cross-validation. μ is empirically set to 0.1 in the experiments. The effects of the parameters L and λ will be evaluated in the next subsection.
(5) The proposed ASMLNLELM-WCFs has several parameters, namely L, C, λ and μ. The parameter C is automatically tuned by threefold cross-validation. μ is empirically set to 0.1 in the experiments. The effects of the parameters L and λ will be evaluated in the next subsection.
(6) For the ASMLKELM-WCFs, C and σ are automatically tuned by threefold cross-validation. μ is empirically set to 0.1 in the experiments. The effect of λ will be evaluated in the next subsection.

C. Discussion on Parameters
In this subsection, several important parameters of the proposed methods are evaluated and the performance is compared with the BELM and the NLELM. It is worth noting that the window size is set to 9 for the WCF-based methods in both Experiment 1 and Experiment 2, which means the width of the neighborhood window is set to 9. The effect of the window size on the proposed WCF-based methods will be evaluated after evaluating the effects of λ and L.
Experiment 1: In Fig. 2, the effect of the parameter a (λ = 2^a) on the proposed methods is evaluated. For convenience, in this experiment, the number of hidden layer neurons is set to L = 550 and L = 900 for the Indian Pines data set and the Pavia University data set, respectively. Fig. 2(a) and Fig. 2(b) plot the OA results as a function of a with 1043 and 3921 training samples (10% and 9% of the available samples) of the Indian Pines data set and the Pavia University data set, respectively. From Fig. 2, it can be seen that the three spectral and the three spectral-spatial classifiers remain stable and achieve good performance as the parameter a varies. It can also be seen that the proposed spectral classifiers achieve higher accuracy when a is relatively large. Overall, the six proposed classifiers achieve good and stable performance as a varies. For the following experiments, on both the Indian Pines and Pavia University data sets, we set a to −20 for the proposed ASMLBELM-WCFs, ASMLNLELM-WCFs and ASMLKELM-WCFs, and to −10 for the proposed ASMLBELM and ASMLNLELM. For the proposed ASMLKELM, we set a to −17 on the Indian Pines data set and −13 on the Pavia University data set.
Experiment 2: In Fig. 3, the OA results are plotted as a function of the number of hidden layer neurons L to show its effect on the proposed methods, the BELM and the NLELM. From Fig. 3, it can be seen that the ASMLBELM and the ASMLNLELM always achieve higher accuracies than the BELM and the NLELM.
It can also be seen that the proposed spectral-spatial classifiers, the ASMLBELM-WCFs and the ASMLNLELM-WCFs, greatly improve on the corresponding spectral classifiers, the ASMLBELM and the ASMLNLELM, because the spatial information is taken into account. Therefore, the ASMLBELM, the ASMLNLELM, the ASMLBELM-WCFs and the ASMLNLELM-WCFs achieve better performance than the BELM and the NLELM for any value of L. For all the following experiments, unless otherwise specified, we set L to 450 for the BELM and the ASMLBELM-based methods (including the ASMLBELM and the ASMLBELM-WCFs), and to 1000 for the NLELM and the ASMLNLELM-based methods (including the ASMLNLELM and the ASMLNLELM-WCFs) on the Indian Pines data set. On the Pavia University data set, we set L to 1100 for the BELM, the NLELM, the ASMLBELM, the ASMLNLELM, the ASMLBELM-WCFs and the ASMLNLELM-WCFs.
Experiment 3: In this experiment, the effect of the width of the window on the ASMLBELM-WCFs, the ASMLNLELM-WCFs and the ASMLKELM-WCFs is evaluated. Fig. 4 plots the OA results as a function of the window size for these three spectral-spatial classifiers. From Fig. 4, it can be seen that the results of the proposed methods are not very good when the window size is too small, but they become very good and stable when the window size is large, demonstrating the generalization ability of the spectral-spatial classifiers. For convenience, the window size is set to 13 for both the Indian Pines data set and the Pavia University data set in the following experiments.

D. Experiment Results and Comparisons in the Indian Pines Data Set and the Pavia University Data Set
In this subsection, the classification results are evaluated using the Indian Pines data set and the Pavia University data set. Table 1 and Table 2 show the training and testing samples of the two data sets. For the Indian Pines data set, 10% of the samples are used for training and the remaining samples for testing. Similarly, for the Pavia University data set, 9% of the samples are used for training and the remaining samples for testing.
The classification accuracies are shown in Table 1 and Table 2. It can be seen that the ASMLBELM, the ASMLNLELM and the ASMLKELM obtain higher accuracies than the BELM, the NLELM and the KELM, respectively. The performance of the three proposed spectral classifiers improves dramatically when the spatial information (WCFs) is considered: the ASMLBELM-WCFs, the ASMLNLELM-WCFs and the ASMLKELM-WCFs achieve higher classification accuracies than other spectral-spatial classifiers such as the SVM-CK and the SMLR-SpATV. Based on the above results, it can be concluded that the six proposed methods achieve very good performance. Fig. 5 and Fig. 6 show the ground truth and the classification maps obtained by the different methods for the Indian Pines data set and the Pavia University data set.

E. Experiment Results with Different Numbers of Training Samples for the Proposed Methods
The performance of the six proposed methods is evaluated under different numbers of training samples. The number of training samples per class is chosen as 5, 10, 15, 20, 25, 30, 35 and 40; when this number exceeds half of the total samples of a class, half of that class's samples are chosen. For the parameters of the proposed methods, the settings discussed in the previous subsections are employed. From Table 3 and Table 4, it can be seen that the classification accuracies of the three proposed spectral classifiers are better than those of the BELM, the NLELM and the KELM, respectively. The three proposed spectral-spatial classification algorithms also achieve better performance: compared with the other classification methods listed in Table 3 and Table 4, they achieve higher accuracies than the other spectral and spectral-spatial methods. An interesting result in Table 3 and Table 4 is that the classification accuracy of the BELM decreases as the number of training samples increases in most cases. This is because the ill-posed problem is particularly serious in the BELM; fortunately, the proposed ASMLBELM and ASMLBELM-WCFs alleviate this problem well. Table 3 and Table 4 also report the execution time of the six proposed algorithms and the other methods when using 100 training samples. On the Indian Pines and Pavia University data sets, the three proposed spectral algorithms, i.e. the ASMLBELM, the ASMLNLELM and the ASMLKELM, require more computation time than the BELM, the NLELM and the KELM, respectively, but less than the SVM. On the Indian Pines data set, the proposed ASMLBELM-WCFs and ASMLNLELM-WCFs require less computation time than the SVM-CK and the SMLR-SpATV; although the proposed ASMLKELM-WCFs requires more computation time than the SMLR-SpATV, it requires less than the SVM-CK.
On the Pavia University data set, the three proposed spectral-spatial algorithms require less computation time than the SVM-CK and the SMLR-SpATV. Based on the above analysis, it can be seen that the six proposed classification methods achieve very good performance, especially the three spectral-spatial classification algorithms.

V. CONCLUSION
This paper proposes three spectral algorithms and three spectral-spatial methods, which improve the ELM for HSI classification. First, the ELM is represented by a probabilistic model via the MAP and then by a concave logarithmic likelihood function whose maximum can be obtained. Second, the Laplacian prior is adopted, which improves the performance of the proposed framework. Third, the LORSAL algorithm is adopted in the proposed framework to further improve its efficiency. Fourth, the spatial information is incorporated and the spectral-spatial framework is employed for HSI classification to further improve the classification accuracy. Finally, the lower bounds of the proposed method are derived by a rigorous mathematical proof, which supports the good performance of the six proposed algorithms.
For future work, we will focus on improving the computational efficiency of the proposed ASMLKELM by further sparse representation. Moreover, we will also seek to improve the classification accuracy by resorting to the extended multi-attribute profiles (EMAPs) method [56].