Novel Gumbel-Softmax Trick Enabled Concrete Autoencoder With Entropy Constraints for Unsupervised Hyperspectral Band Selection

As an important topic in hyperspectral image (HSI) analysis, band selection has attracted increasing attention in the last two decades for dimensionality reduction in HSI. With the great success of deep learning (DL)-based models recently, a robust unsupervised band selection (UBS) neural network is highly desired, particularly due to the lack of sufficient ground truth information to train the DL networks. Existing DL models for band selection either depend on the class label information or have unstable results via ranking the learned weights. To tackle these challenging issues, in this article, we propose a Gumbel-Softmax (GS) trick enabled concrete autoencoder-based UBS framework (CAE-UBS) for HSI, in which the learning process is featured by the introduced concrete random variables and the reconstruction loss. By searching from the generated potential band selection candidates from the concrete encoder, the optimal band subset can be selected based on an information entropy (IE) criterion. The idea of the CAE-UBS is quite straightforward, which does not rely on any complicated strategies or metrics. The robust performance on four publicly available datasets has validated the superiority of our CAE-UBS framework in the classification of the HSIs.


I. INTRODUCTION
AS AN emerging technology in the past few years, hyperspectral images (HSIs) have become increasingly popular in nondestructive inspection and characterization, owing to their rich spectral information spanning from visible to (near-)infrared wavelengths. With the capability of identifying minor changes or differences in certain physical properties, such as moisture and temperature, and chemical components, HSIs have been successfully applied in a wide range of applications [1]-[4], especially in remote sensing, such as land cover analysis [5]-[7], precision agriculture [8], and object detection [9], [10]. Although the high-dimensional spectral data are beneficial in discriminating different materials and objects, they have inevitably led to the "Hughes phenomenon" [11], where the classification performance can be severely affected by insufficient training samples in comparison to a much higher spectral dimension. Moreover, the vast data volume of HSI also results in a huge computational cost and difficulties in data storage, transmission, and processing. Besides, the redundant information in HSIs may bring undesired properties and lower the efficiency of data analysis. Therefore, it is crucial to reduce the dimension of the HSI data whilst preserving the essential discriminative information.
Although most feature extraction methods, such as principal component analysis (PCA) [12], [13], independent component analysis (ICA) [14], the wavelet transform [15], and maximum noise fraction (MNF) [16], can generate a discriminative and low-dimensional feature set, the obtained features fail to preserve the physical characteristics of the data acquired from the optical sensors. In contrast, feature selection methods, also known as band selection, can choose a desired band subset and maintain the physical characteristics of the raw HSI [17]-[36].
According to whether the class label information is utilized, band selection methods can be grouped into three categories, i.e., supervised, semi-supervised, and unsupervised. With the aid of the prior knowledge of the labeled pixels, (semi-)supervised methods select the optimal subset of bands by optimizing a certain criterion [17], [20], [21]. However, they suffer from several intrinsic limitations. First, it is impractical to collect sufficient training samples for each category in real applications, especially for DL. Second, relying heavily on the classification performance can easily lead to
overfitting and poor generalization. Besides, the results can lack robustness, as the selected band subset is subject to the randomly chosen training samples. As label information is rarely available in practice, this article focuses on unsupervised band selection (UBS). Nowadays, DL methods have been successfully applied in many computer vision tasks and beyond [38]-[44], [46], [47]. In comparison to conventional methods, DL approaches can automatically generate favorable features without relying on manual intervention and subjective parameter settings. Many DL models have already been applied to HSI, such as the convolutional neural network (CNN) [39]-[41] and the autoencoder (AE) [44], [46], [47], mainly for feature extraction and data classification [41], anomaly detection [46], [47], etc. Unlike the aforementioned tasks in HSI, in band selection there is no available ground truth with which to evaluate the chosen band subset for training the DL networks. Therefore, it is extremely challenging to determine the desired band subset in DL-based UBS.
In this article, we propose a novel AE-based DL framework for UBS in HSI. By training an AE with the defined reconstruction loss, the optimal band subset can be determined for reconstructing the original HSI cube. Different from our previous work in [44], the optimal band subset is obtained directly from the trained AE, without ranking the significance of each band. The major contributions of this article are highlighted as follows.
1) A concrete end-to-end AE-based UBS framework (CAE-UBS) is proposed, in which the optimal band subset with the desired number of bands can be easily determined according to the best reconstruction of the original HSI. Rather than using continuous real numbers as the weights in the encoder module, a novel concrete layer is implemented with binary weights of 1 and 0 to indicate whether the corresponding band is selected or not. Owing to the introduced Gumbel-Softmax (GS) trick, the discrete weight matrix can be relaxed to continuous variables for optimizing the selected band subset during backpropagation. To the best of our knowledge, this is the first time the GS trick has been employed to obtain the desired band subset directly in AE-based DL UBS for HSI.
2) Being implemented in an unsupervised manner, the proposed CAE-UBS network is found to be efficient and robust for UBS according to the reconstruction loss and the classification accuracy of the HSI. With the aid of an information entropy (IE)-based criterion, the desired band subset can be determined at a much lower computational cost than other DL methods.
3) In the proposed CAE-UBS framework, a weight matrix from a fully connected (FC) layer is utilized to initialize the class probabilities, which can effectively improve the classification performance.
The superior performance of our proposed CAE-UBS framework has been validated on four commonly used HSI datasets to demonstrate its merits over a number of state-of-the-art (SOTA) UBS methods and one supervised method, in particular a more robust performance with fewer trainable parameters and no label information needed.
The rest of this article is organized as follows. Section II introduces the related UBS methods and AE-based DL methods. Section III details the proposed framework, including CAE-based band selection and optimal band subset searching. The experimental results on the four HSI datasets are presented and discussed in Section IV.
Finally, Section V concludes the article along with some future directions.
II. RELATED WORK

In the last two decades, a number of UBS approaches have been proposed, which can be grouped into four main categories, i.e., ranking-based, clustering-based, searching-based, and sparsity-based methods. For each category, a detailed literature review is summarized below. In addition, the background information on the AE and AE-based UBS methods will also be introduced in this section.
In ranking-based band selection, many efforts have been made to evaluate and rank the importance of the raw spectral bands so as to determine the most significant bands from the raw spectral cube. In [22], a maximum-variance PCA (MVPCA) criterion was utilized to estimate the band prioritization. As MVPCA considers the representative and discriminative ability of each individual band but ignores the correlation between the chosen bands, the selected band subset generally lacks robustness. Chang and Wang [23] proposed a constrained band selection (CBS) strategy for ranking-based UBS. Four criteria are adopted in the CBS framework for choosing the highly correlated dependent bands, including band correlation minimization (BCM), band dependence minimization (BDM), band correlation constraint (BCC), and band dependence constraint (BDC). Although noisy bands, which have low correlation to all other bands, will be discarded, the band subset selected by CBS, similar to that of MVPCA, still contains a high degree of redundancy. For ranking-based methods, the result is usually quite redundant because of the high correlation between the selected bands, due mainly to focusing only on the performance of each band rather than the relationship between different bands.
Unlike ranking-based approaches, clustering-based methods first group all the bands into clusters before selecting the most representative band from each cluster. By clustering adjacent bands together under various similarity metrics, the correlation between the bands chosen from different clusters can be naturally reduced. In [24], a hierarchical clustering algorithm (WaLuDi/WaLuMi) is proposed based on Ward's linkage, which clusters the bands by maximizing the inter-cluster variance whilst minimizing the intra-cluster variance. According to Ward's linkage theory, the band chosen from each cluster is the most representative one; hence, the formed band subset will be robust. However, the WaLuDi/WaLuMi method suffers from a huge computational cost due to its hierarchical architecture.
Some researchers have been dedicated to improving clustering-based methods by combining them with ranking strategies. Inspired by the fast density-peak-based clustering (FDPC) [45], an enhanced FDPC (E-FDPC) [26] was proposed to rank each band by considering the local density and the intra-cluster distance simultaneously, where the introduction of the intra-cluster distance has effectively reduced the correlation between the selected bands. Wang et al. [27] have proposed an optimal clustering framework (OCF) for UBS in HSI with two objective functions, inspired by the top-ranked cut and the normalized cut for effective band clustering. Afterward, three ranking strategies are utilized to rank the bands within each cluster for band selection, where the top-ranked band in each cluster is chosen to form the selected band subset. Although these clustering methods achieve a good performance, noisy bands are prone to forming a single cluster and lowering the robustness. To tackle this, an adaptive distance-based band hierarchy (ADBH) [28] has been proposed recently to reflect the hierarchical structure of HSI and produce any number of desired band subsets, whilst suppressing the effect of noisy bands. In clustering-based methods, choosing only the most representative band from each cluster may be insufficient, as the second most representative band in one cluster may contain more information than the first in another cluster. Thus, it is more important to rank the band subset as a whole rather than band by band, which can also avoid the effect of noisy bands, as they can easily form a separate cluster in such approaches.
With a given objective function and a search strategy, searching-based methods determine an optimal band subset by exploring different possible combinations of bands. In [30], the volume-gradient-based band selection (VGBS) method is introduced, where the defined "volume" information can be obtained from the estimated covariance matrix of all bands. By assuming the most redundant band has the maximum gradient, VGBS iteratively removes redundant bands until the desired number of bands is reached. By developing a structure-aware metric for measuring band informativeness and independence, Zhu et al. [34] proposed a dominant-set extraction UBS (DSEBS) method. As a greedy search-based method, DSEBS tackles UBS as a clustering problem. As searching for the optimal subset is an NP-hard problem and too costly, the meta-heuristic or evolutionary algorithms employed usually produce a suboptimal solution [34]. In [52], the relationship between each band and the entire hypercube is determined through linear reconstruction, and the desired band subset can be searched for by removing the effect of noisy bands; the proposed optimal neighborhood reconstruction (ONR) method has achieved a good performance in UBS.
Apart from the searching-based methods, the sparsity-based methods utilize sparse representation (SR) to explore the underlying structures within the HSI data [32]. The multitask sparsity pursuit (MTSP) [31] searches for the optimal band subset with the aid of the SR and the immune clonal strategy. Although in SR-based methods it is quite straightforward to select the informative bands based on the estimated sparse coefficients, the overall computational complexity is still quite high, especially in constructing the SR matrix for large-scale HSI datasets [27].
Recently, DL and its variants have shown great superiority in extracting more effective features from HSI. Cai et al. [40] have proposed an end-to-end CNN-based model for band selection, where the final band subset is determined by ranking the average of the learned weights for each band. Unlike other DL-based neural networks, the basic idea of AE-based feature selection is to learn hidden representations that can effectively reconstruct the input data. Due to its strong ability to explore both linear and nonlinear structures among the extracted features, the AE has been successfully applied for feature selection in high-dimensional data in an unsupervised manner [44]. For UBS in HSI, AE-based methods are not as widely used as the other aforementioned categories of methods. In our previous work [44], the input weights of the AE are utilized to select the most significant bands in an unsupervised way. However, there are several drawbacks to these kinds of methods. The representation generated by the encoder is more like a combination of the raw data, where the weight values of the nodes in the encoder layer can be both positive and negative. Some bands are chosen only because they have large absolute weights, which does not fully represent their significance. Besides, the aforementioned methods rely on the ranking value or the weight to choose the desired bands, which inevitably suffers from the disadvantages of ranking-based UBS methods, especially the high redundancy between the chosen bands. These issues will be tackled in our proposed approach, as detailed in Section III.

III. PROPOSED METHODOLOGY
In this section, our proposed CAE-UBS framework will be presented in detail, including the concept of CAE-based band selection, determining the optimal band subset, and computational complexity analysis. According to the flowchart shown in Fig. 1, an HSI hypercube is first taken as the input to the designed CAE. Potential band subsets can be acquired by minimizing the reconstruction error of the hypercube with the designed CAE. After calculating the IE of each candidate band subset, the band subset with the maximum IE will be chosen as the result of band selection. The relevant details are presented as follows.

A. CAE-Based Band Selection
In general, a standard AE includes one encoder and one decoder module. The encoder maps the input data to a hidden representation, while the decoder reconstructs the input data from that hidden representation. Let X = [X_1, . . . , X_m] ∈ R^{D×m} denote the projected data from a hypercube, where m represents the total number of samples in the HSI image and D is the number of spectral bands. Based on that, the encoder function can be depicted as H_i = σ_en(X_i W_en + b_en), and the decoder function that reconstructs the input data as X̂_i = σ_de(H_i W_de + b_de), where H_i is the hidden representation of the input data and X̂_i is the reconstructed data. σ_en and σ_de are the activation functions, and W and b are the weight matrices and bias vectors of each module, respectively. For the UBS task, the vector w_d^en within the input weight matrix W_en = (w_1^en, . . . , w_d^en, . . . , w_D^en) actually corresponds to the dth band and represents the contribution of the dth band to the reconstruction process. The AE can be trained with the supervision of the reconstruction loss

L = (1/m) Σ_{i=1}^{m} ‖X_i − X̂_i‖²  (1)

In our previous work [44] and other similar work [40], the desired band subset is chosen by ranking the learned weights W_en from the encoder part. The basic assumption here is that a highly ranked weight indicates the importance of the corresponding band. However, the weights learned by the AE, in general, cannot represent the significance of each band. For example, Fig. 2 shows one column of the learned input weight matrix W_en for the Indian Pines dataset. Although the positive values represent the contribution of this band, the column also contains several negative values. Besides, the motivation of AE-based band selection is to select the most significant bands for spectrum reconstruction, yet the input-weight-based band selection strategy is not linked to this objective. Therefore, it is inappropriate to choose bands according to the weight values.
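As a minimal illustration of the encoder-decoder pass and the reconstruction loss in (1), the following NumPy sketch uses assumed sizes and sigmoid activations; the actual layer widths and activations of the network are not specified here, so treat every shape and parameter as illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed, illustrative sizes: m samples (pixels), D bands, h hidden units.
m, D, h = 64, 200, 32
X = rng.random((m, D))                                  # one row per sample X_i

W_en = rng.normal(0, 0.1, (D, h)); b_en = np.zeros(h)   # encoder parameters
W_de = rng.normal(0, 0.1, (h, D)); b_de = np.zeros(D)   # decoder parameters

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = sigmoid(X @ W_en + b_en)      # hidden representation H_i
X_hat = sigmoid(H @ W_de + b_de)  # reconstruction X_hat_i

# Reconstruction loss (1): mean squared reconstruction error over all samples.
L = np.mean(np.sum((X - X_hat) ** 2, axis=1))
```

Training would then adjust W_en, W_de, b_en, and b_de by backpropagating L.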
As the purpose of AE-based band selection is to learn an important hidden representation from the input data for HSI reconstruction, it would be more reasonable to extract the desired band subset from the encoder part as the key latent features of the raw data. Inspired by this, we aim to determine a sparse input weight matrix, whose values can only be 1 and 0, indicating whether the corresponding band is selected or not. In this manner, the weight of a band that does not contribute to the reconstruction will be 0, and otherwise will be 1.

Fig. 3. Diagram of the designed concrete autoencoder: X_{i,D} represents the Dth band of the original HSI data X_i, and X̂_{i,D} is its corresponding reconstructed value; H_i denotes the chosen band subset with k bands.

Moreover, the extracted band subset will be optimal, as the weights of the chosen bands are jointly learned. However, this sparse weight matrix cannot be updated during backpropagation in a standard AE, as each column of the matrix is a one-hot vector, i.e., a non-differentiable discrete variable. To tackle this problem, we introduce a novel concrete AE for UBS, where the sparse matrix can be estimated with the aid of the concrete distribution [48], [49], as detailed in Section III-B.
In our CAE-UBS framework, we employ the above concrete random variables to select the input bands, as shown in Fig. 3. Let the desired number of bands in the band subset be k; a new sparse weight matrix S is then built with a size of D × k. For each column of the weight matrix S, a D-dimensional concrete random variable S_k is sampled following (3). In this way, the output of the encoder module is H_i = X_i S for an input sample X_i. As each column S_k is a one-hot vector, it selects one band for reconstructing the original data. Thus, the composed weight matrix S becomes the desired sparse matrix, from which the selected k bands can be directly identified without introducing another criterion. With the aid of the introduced concrete random variables and the reparameterization trick, the forward propagation generates a band subset, and the backpropagation optimizes the band selection result.
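To make the role of S concrete, the toy sketch below hard-codes idealized one-hot columns (the band indices are arbitrary, chosen only for illustration) and shows that the encoder output H = X S simply copies the k selected bands out of X.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, m = 20, 3, 5                 # illustrative sizes

chosen = [2, 7, 15]                # arbitrary band indices for the example
S = np.zeros((D, k))               # sparse weight matrix: one one-hot column per band
for j, d in enumerate(chosen):
    S[d, j] = 1.0

X = rng.random((m, D))
H = X @ S                          # encoder output: exactly the chosen bands

print(np.array_equal(H, X[:, chosen]))  # True: H holds the selected k bands
```

During training the columns of S are not hard-coded like this but sampled from the concrete distribution in (3), so that they stay differentiable while approaching one-hot vectors.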

B. Concrete Distribution
The GS distribution, also referred to as the concrete distribution, is defined to produce a continuous distribution over a discrete variable, e.g., a one-hot vector. As a reparameterization trick, the Gumbel-Softmax trick can efficiently sample z, i.e., a one-hot vector, from a categorical distribution with class probabilities α_k:

z = one_hot( arg max_k [g_k + log(α_k)] )  (2)

where g_k is a sample drawn from Gumbel(0, 1) and k is the element-wise index of the generated one-hot vector z. As the arg max operation is non-differentiable, it cannot be back-propagated through the network for optimization. To tackle this issue, the GS distribution [48] uses the softmax function as a continuously differentiable approximation to replace the arg max in (2), calculating a continuous relaxation of the one-hot vector z, where the kth element of the generated sample S from the concrete distribution is given by

S_k = exp((log(α_k) + g_k)/T) / Σ_j exp((log(α_j) + g_j)/T)  (3)

Here, the temperature parameter T controls the relaxation of the one-hot vector, where the maximal element S_k approaches 1 as T approaches 0. With the reparameterization trick, S_k becomes differentiable when estimating the gradient in the process of backpropagation.
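A self-contained NumPy sketch of the sampling in (2)-(3); the class probabilities below are arbitrary illustrative values, and the Gumbel noise is generated via the standard inverse-transform g = -log(-log(u)) with u ~ Uniform(0, 1).

```python
import numpy as np

rng = np.random.default_rng(42)

def gumbel_softmax(alpha, T, rng):
    """Sample from the concrete distribution, Eq. (3):
    softmax((log(alpha) + g) / T) with g ~ Gumbel(0, 1)."""
    g = -np.log(-np.log(rng.random(alpha.shape)))   # Gumbel(0, 1) noise
    logits = (np.log(alpha) + g) / T
    e = np.exp(logits - logits.max())               # numerically stable softmax
    return e / e.sum()

alpha = np.array([0.1, 0.2, 0.6, 0.1])              # illustrative class probabilities

soft = gumbel_softmax(alpha, T=5.0, rng=rng)        # high T: smooth relaxation
hard = gumbel_softmax(alpha, T=0.01, rng=rng)       # T -> 0: effectively one-hot
```

At low temperature the sample concentrates almost all its mass on one index, recovering the arg max in (2) while remaining differentiable with respect to alpha.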

C. Optimal Band Subset Searching
To search for the desired band subset efficiently, we randomly divide all samples from a hypercube into different batches, in a similar way to other DL models [38]. In this way, multiple band subsets can be obtained during each epoch. Let N be the number of band subsets determined in one epoch; it equals the number of iterations, i.e., the number of batches, in each epoch. Although a band subset is selected according to the minimized reconstruction error, it may be only a locally optimal solution due to the random selection of the batch, so searching for a globally optimal band subset is still needed. To this end, a simple yet robust IE-based searching strategy is introduced in our CAE-UBS framework, as detailed below [35].
Generally, there are several motivations for the global searching strategy. The first is to find an efficient way to determine the optimal band subset whilst avoiding a huge computational cost. Nowadays, most of the efficient UBS methods are still not DL-related, whereas in AE-based UBS the optimal band subset is assumed to be the one with the best reconstruction ability. We further speculate that the desired band group contains more information than other subsets, which is beneficial for spectrum reconstruction. To this end, we define a global searching strategy using the Shannon IE from information theory [35], where the IE of band X_i is defined as

IE(X_i) = − Σ P(X_i) log P(X_i)  (4)

where P(X_i) denotes the probability density function of X_i, which can usually be estimated as in [27], [35]. Based on the determined IE of each band, the band subset with the largest accumulated IE is chosen as the desired band subset, and the result is considered the globally optimal solution [27], [35]. As this search strategy is quite straightforward and efficient, it has been adopted in the proposed CAE-UBS approach.
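The search criterion can be sketched as follows. The histogram-based estimate of P(X_i), the bin count, and the random data and candidate subsets are all illustrative assumptions (the paper defers the pdf estimation to [27], [35]); only the scoring rule, i.e., picking the candidate with the largest accumulated IE, is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def band_entropy(band, bins=64):
    """Shannon IE of a single band, Eq. (4), with a histogram-estimated pdf."""
    counts, _ = np.histogram(band, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log2(p))

# Illustrative data: m pixels, D bands, and N candidate subsets of k bands each.
m, D, k, N = 1000, 30, 4, 5
X = rng.normal(size=(m, D))
candidates = [rng.choice(D, size=k, replace=False) for _ in range(N)]

ie = np.array([band_entropy(X[:, d]) for d in range(D)])  # IE per band
scores = [ie[c].sum() for c in candidates]                # accumulated IE per subset
best = candidates[int(np.argmax(scores))]                 # "globally optimal" candidate
```

Note the per-band IE is computed once, so scoring N candidates costs only N small sums on top of it.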

D. New Weights Initialization for Improved Efficiency
To further improve the efficiency of DL-based UBS in HSI, rapid convergence of the designed network is essential for significantly reducing the computational complexity. In existing GS-based methods [48], [49], the class probabilities α_k are often randomly initialized with small positive values to explore different linear combinations of the inputs, which may affect the convergence of the network and the result of band selection. In our CAE-UBS framework, we initialize the α_k with the weight matrix from an FC layer to regularize the learning process, where the initialized weight matrix has the same size as the composed weight matrix S. In this way, the α_k are initialized within (−D^{−1/2}, D^{−1/2}), adaptive to the number of bands, and are further normalized to (0, 1) to indicate the class probabilities. The efficacy of the proposed initialization has been further validated in the comparison experiments in Section III-E.
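Under the assumption that the FC-style initialization is uniform in (−D^{−1/2}, D^{−1/2}) (mirroring common linear-layer defaults; the exact distribution is not spelled out in the text), the normalization to (0, 1) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
D, k = 100, 10                      # illustrative sizes

# FC-layer-style initialization with assumed uniform bounds adaptive to D,
# matching the size of the composed weight matrix S.
bound = 1.0 / np.sqrt(D)
W = rng.uniform(-bound, bound, size=(D, k))

# Min-max normalize to (0, 1) so the values can serve as class probabilities.
alpha = (W - W.min()) / (W.max() - W.min())
alpha = np.clip(alpha, 1e-6, 1.0)   # keep log(alpha) finite for the GS sampling
```

The clipping step is a practical guard (an assumption, not from the text): the GS sampling in (3) takes log(alpha), so exact zeros must be avoided.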
To obtain the desired band subsets without too much computational cost, another key point is the efficiency of generating potential candidates. As one training epoch can produce N candidates, this will end up with a large search space after a few epochs. Besides, more training epochs increase the running time of the whole framework. To find the optimal band subset efficiently, we need to reduce the number of training epochs. With our proposed CAE, we have found that convergence is faster due to the smaller data volume: an HSI dataset contains around 100 000 pixels, i.e., several hundred MB, whereas RGB datasets are usually at the GB level. In Fig. 4, the training loss, i.e., the reconstruction loss, over 200 training epochs on the Indian Pines dataset is presented. As seen, the training loss is clearly reduced in each epoch. Based on this, we conclude that the proposed network can converge within only one epoch, and the optimal band subset can be chosen from the generated N candidates. In this manner, the efficiency of the proposed CAE-UBS framework can be ensured.

E. Merits of CAE-UBS
With the concrete random variable-based AE and the IE-based searching strategy, our CAE-UBS framework can determine an optimal band subset for the effective reconstruction of the original spectral data. Different from other AE-based band selection frameworks, we formulate band selection as a searching-based task by maximizing the accumulated IE of the desired band group instead of ranking the significance of each band. Moreover, the proposed CAE solves the problem of backpropagation even with a discrete variable in the UBS task, which enables the designed network to be trained with the reconstruction loss L. Being trained in a self-learning way without introducing any class label information, the proposed CAE-UBS has the potential to inspire more related research on DL-based band selection in the future. The whole process of the proposed CAE-UBS is summarized in Algorithm 1, and the performance is further discussed in Section IV.

Algorithm 1 CAE-UBS
…
8: Encoder module: H_i = X_i S;
9: Save the N band subsets;
10: Decoder module;
11: Update reconstruction loss L based on (1);
12: Backpropagation with optimizer;
13: end while
14: Global optimal band subset searching with the IE (4) of each band and the N band subsets;
15: Output: band subset.
16: END

IV. EXPERIMENTAL RESULTS
Due to the lack of the ground truth in UBS, the performance of band selection is usually indirectly assessed by evaluating the classification accuracy with the selected bands. In our experiments, the proposed CAE-UBS is compared with several SOTA methods as detailed below.

A. Datasets
Four commonly used HSI remote sensing datasets are used in our experiments. The first is the Indian Pines dataset, which was captured by the airborne visible/infrared imaging spectrometer (AVIRIS) sensor over northwestern Indiana, USA, in 1992. The raw data have 224 spectral bands with wavelengths ranging from 0.4 to 2.5 μm. The image has a spatial size of 145 × 145 pixels, in which 10 249 pixels are manually labeled in 16 land-cover categories. Often, the dataset is corrected to 200 bands after the removal of 24 noisy and water absorption bands.
The second is the Pavia University (PaviaU) dataset, which was collected by the reflective optics system imaging spectrometer (ROSIS) system over the Engineering School of the University of Pavia, Italy. The commonly used PaviaU dataset is a cropped version, which consists of 610×340 pixels with a spectral range of 0.43-0.86 μm. This dataset has 42 776 pixels labeled in nine land-cover classes.
The third is the Salinas dataset, which was also acquired by the AVIRIS sensor, over the Salinas Valley, CA, USA, in 1998. It therefore shares the same wavelength range as the Indian Pines dataset, with 224 spectral bands. The spatial size is 512 × 217 pixels, in which 54 129 pixels are labeled in 16 classes. After removing the noisy and water absorption bands, 204 bands remain for the experiments.
The last is the Botswana dataset, which was captured by the NASA EO-1 satellite sensor over the Okavango Delta, Botswana, in 2011. The original dataset contains 242 bands ranging from 400 to 2500 nm. With a spatial size of 1476 × 256 pixels, in total 3248 pixels are labeled in 16 semantic classes. After the removal of 97 noise-corrupted bands, a corrected dataset with 145 bands is often utilized.

B. Settings
For quantitative evaluation of the classification results with the selected bands as features, three commonly used metrics derived from the confusion matrix are adopted: the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient. OA represents the percentage of correctly classified pixels, and AA is the mean classification accuracy over all classes. The Kappa coefficient is introduced to estimate the reliability of the obtained results.
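These three metrics can be computed from a confusion matrix as below; the 2 × 2 matrix is a made-up example, not a result from the paper.

```python
import numpy as np

def metrics_from_confusion(C):
    """OA, AA, and Kappa from a confusion matrix C (rows: true, cols: predicted)."""
    n = C.sum()
    oa = np.trace(C) / n                                  # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))              # mean per-class accuracy
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / n ** 2   # expected chance agreement
    kappa = (oa - pe) / (1 - pe)                          # Kappa coefficient
    return oa, aa, kappa

C = np.array([[50, 10],
              [ 5, 35]])
oa, aa, kappa = metrics_from_confusion(C)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.85 0.8542 0.6939
```

Note that OA weights every pixel equally while AA weights every class equally, so the two diverge when class sizes are imbalanced, which is common in HSI ground truth.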
The compared methods are as follows.
1) OCF [27]: A SOTA clustering-based method with leading performance in the UBS of HSI.
2) DSEBS [34]: One of the most representative searching-based UBS methods. By developing a structure-aware measurement of band informativeness and independence, it tackles UBS as a greedy-searching problem and has achieved a relatively good performance on several public datasets.
3) VGBS [30]: Also a searching-based method, frequently cited in UBS [27], [28].
4) WaLuDi/WaLuMi [24]: Although proposed earlier than the other compared methods, they are classical clustering-based methods and frequently cited in the literature [27]-[29], [34].
5) E-FDPC [26]: Different from other ranking-based methods, an enhanced fast density-peak-based clustering proposed to rank each band by considering the local density and the intra-cluster distance simultaneously, with a leading performance among ranking-based methods.
6) ASPS [29]: A novel clustering-based method with a robust performance in the UBS of HSI.
7) ADBH [28]: An adaptive distance-based band hierarchy UBS method that reflects the hierarchical structure of HSI, easily producing any number of desired band subsets whilst suppressing the effect of noisy bands.
For a fair comparison, the original codes from the authors and the default parameters are used. Besides, the classification results from the original data are also included (shown as "Raw data" in this article).
The proposed CAE-UBS method also has several parameters. In the training process, we employ the Adam optimizer with a learning rate of 1e-3, where the number of training epochs is set to 1 for efficiency. In DL, a large batch size can improve the training efficiency compared with a small one, yet it may suffer from poor convergence and poor generalization. As a result, a proper batch size needs to be determined, which is suggested to be linked to the size of the image [38]. In our experiments, the batch sizes for the Indian Pines, PaviaU, Salinas, and Botswana datasets are empirically set to 512, 8192, 8192, and 8192, respectively, by considering their spatial sizes, i.e., the numbers of pixels. These parameters are found to produce particularly good results for band selection in our proposed approach. In addition, the activation function of the designed stacked decoder is ReLU. For the temperature parameter, we follow the schedule in [49].
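The exact temperature schedule is deferred to [49]; a commonly used form for concrete-layer training is exponential annealing from a start temperature T0 down to a final temperature TB over the course of training, sketched here with assumed values for T0, TB, and the step count.

```python
def temperature(step, total_steps, T0=10.0, TB=0.01):
    """Exponentially anneal the GS temperature from T0 to TB.
    T0, TB, and total_steps are illustrative assumptions, not values
    taken from the paper, which defers the schedule to [49]."""
    return T0 * (TB / T0) ** (step / total_steps)

Ts = [temperature(s, 100) for s in range(101)]   # one temperature per training step
```

Early in training the high temperature keeps the concrete samples smooth, so gradients reach all bands; as T decays, the columns of S harden toward one-hot selections.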
For the classification part, two commonly used classifiers, K-nearest neighbor (KNN) [50] and support vector machine (SVM) [11], are employed with the selected band subsets from each method as features. In our experiments, the parameters of the KNN and the SVM are optimized through 10-fold cross-validation. We use 10% of the randomly chosen labeled samples as the training set and the rest for testing. For the compared methods, the experiments are repeated ten times, and the average metrics are reported. As our approach is DL-based, the chosen band subset can be affected by stochastic factors; therefore, the output band subset may differ slightly in each run of the experiments.
Nowadays, DL-based methods usually report the best results from the trained models in other computer vision tasks such as image segmentation and object detection [38]. Considering that conventional, non-deep-learning approaches may produce fixed results, it would be unfair to compare them with the best results from DL approaches. Therefore, we randomly choose five groups of band selection results from our CAE-UBS framework, where the selected bands are taken as features for classification in ten repeated runs. Afterward, the average metrics of these five subsets over 50 total runs are reported for comparison with the peers.
For the hardware and software settings, the proposed CAE-UBS framework is implemented with the PyTorch 1.1.0 package without CUDA. All other band selection methods and the classification part are implemented on the

C. Results and Discussions
For performance assessment, the OA curves of all methods on the four HSI datasets are generated with 3-30 selected bands and shown in Figs. 5-8. As seen, most of the compared methods match the performance of the raw data when the number of selected bands is around 30. Besides, we have compared the average OA, AA, and Kappa coefficient and their corresponding standard deviations under various numbers of selected bands. A detailed comparison of each method on the four datasets is given in Tables I-IV, respectively, where the best result is highlighted in bold, except those from the raw data.
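For clarity, the three metrics can be computed from the confusion matrix in a few lines; the sketch below uses the standard definitions (OA as the trace ratio, AA as the mean per-class accuracy, and Cohen's kappa as chance-corrected agreement) with toy labels rather than any of the reported results.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """OA, AA, and Cohen's kappa from true and predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                         # rows: true, cols: predicted
    n = cm.sum()
    oa = np.trace(cm) / n                     # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # mean per-class accuracy
    # Chance agreement from the row (true) and column (predicted) marginals
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0])
oa, aa, kappa = classification_metrics(y_true, y_pred, n_classes=3)
# oa = 0.75; aa = 7/9; kappa = 27/43
```

Kappa discounts the accuracy that would be achieved by chance given the class marginals, which is why it is reported alongside OA and AA for imbalanced HSI datasets.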
To summarize the experimental results from the four datasets, some extended discussions are given below. In particular, we will discuss three aspects, i.e., the performance of our method, the comparison between our method and other DL-based UBS methods such as the SOTA BS-Net, and an analysis of the computational time of each method.

1) Comparison Results in Different Datasets:
For the Indian Pines dataset, the classification results from different approaches are presented in Fig. 5 and Table I. As seen in Fig. 5, our method has a robust performance with both the KNN and SVM classifiers. Although the performance is the second best on the KNN classifier, the difference to the first place, the DSEBS, is marginal. When more than 20 bands are selected, only the DSEBS, ADBH, and our CAE-UBS approaches perform well. For the SVM classifier, our results are also quite stable, especially when the number of selected bands is beyond 20. Although our approach does not outperform the others in all cases, its robust OA curve has validated its superiority. Table I shows the classification results of all methods. As seen, the proposed method, along with the ADBH and DSEBS methods, has better performance than the rest on the KNN classifier. However, the performance of the DSEBS on the SVM seems not as good as on the KNN classifier. Our approach has achieved the second best results with both classifiers, indicating its robustness.

TABLE II   CLASSIFICATION RESULTS FOR THE PAVIAU DATASET USING THE RAW DATA OR SELECTED BANDS (AVERAGED ON 3-30 BANDS)

TABLE III   CLASSIFICATION RESULTS FOR THE SALINAS DATASET USING THE RAW DATA OR SELECTED BANDS (AVERAGED ON 3-30 BANDS)
For the PaviaU dataset, the results are compared in Fig. 6 and Table II, and again our proposed method has shown quite stable performance. For the KNN classifier, our approach has a steadily increasing OA. Although the WaLuMi method produces the best results with the KNN classifier when more than 25 bands are selected, it does not perform well with a small number of selected bands, and its performance with the SVM does not seem robust, as shown in Fig. 6. With the SVM classifier, our method has achieved a more robust OA than all the others. Considering both classifiers, our OA curves are steadier, which has validated the robustness of our CAE-UBS method. This has been further verified by the quantitative results in Table II, where our approach has achieved the best OA on the SVM and the second best OA on the KNN classifier. Although the ASPS has achieved the best classification result on the KNN, its performance on the SVM is rather poor. The OCF and ADBH do not perform well on the PaviaU dataset.

Fig. 7 and Table III show the classification results for the Salinas dataset. In Fig. 7, our method again has achieved nearly the best performance on the KNN classifier and robust performance on the SVM. Although our approach is not the best on the KNN when fewer than 15 bands are chosen, its advantage grows when more bands are selected. Although the OCF has a better performance when fewer than 15 bands are selected, its OA curve on the KNN classifier is not as stable as ours. The VGBS and WaLuDi methods fail to produce satisfying results on the KNN. For the SVM classifier, most of the methods have achieved a robust performance except for the WaLuDi, whilst ours is the third best, slightly behind the ADBH and OCF methods. This is also consistent with the results in Table III, where our approach is the third best on both the KNN and the SVM, whilst the difference between ours and the two leading ones is minor.
For the Botswana dataset, the classification results from different UBS approaches are presented in Fig. 8 and Table IV. As seen in Fig. 8, our method has the most robust OA curves among all methods on both classifiers. Although the WaLuDi has a better performance when 5 or fewer bands are chosen, our method has a more stable OA curve. With the SVM classifier, the WaLuDi does not perform well when more than 5 bands are chosen. The VGBS has poor performance with fewer selected bands, even though it has the best result when 30 bands are chosen. As shown in Table IV, our method has the best average OA with the SVM classifier. For the KNN classifier, we have the second best average OA, with a marginal difference to the WaLuDi, which again demonstrates its efficacy.
2) Further Result Analysis: Although our method has obtained quite good results with the two popular classifiers on the four HSI datasets, the OA is not always the best, which can be explained as follows. In fact, the network architecture and the strategy for searching the optimal band subset used in our method are relatively simple. Taking the proposed CAE-UBS framework as a baseline, its performance can be further improved by introducing a larger neural network or certain regularization terms, such as spatial constraints. Nevertheless, the quite satisfactory results on four datasets from three different sensors, i.e., the AVIRIS, ROSIS, and NASA EO-1 sensors, have validated the robust performance and high generalization capability of the proposed network. To this end, it is safe to say that our method can generate a globally optimal solution in most cases.
As shown in the previous subsection, our proposed CAE-UBS framework can usually produce better results when more bands are selected. For example, our OA curve in Fig. 6 outperforms all the others when more than 15 bands are chosen. As our method is searching-based, a larger search space with more bands tends to produce better results. Therefore, the optimal band subset is more likely to be found from the increased number of band combinations, which validates the searching ability of our developed DL-based UBS method.

3) Comparison With Other DL-Based UBS Methods:
To further evaluate the effectiveness of the proposed method, we have compared it with a SOTA DL-based UBS method, the BS-Net [40], and with the AE-UBS [44]. For the BS-Net, the indexes of the selected bands provided by the authors are used to test the classification accuracy. For the three test datasets, Indian Pines, PaviaU, and Salinas, the numbers of selected bands given in [40] are 25, 15, and 20, respectively. As a result, we compare our approach with the BS-Net using the same number of selected bands. The selected bands from the BS-Net and our method are listed in the Appendix, where the BS-Net has two groups of results, i.e., using FC networks (BS-Net-FC) and convolutional neural networks (BS-Net-Conv), for evaluation and comparison. In addition, we have listed five groups of results from our approach and one group of results from our previous approach [44]. Taking the selected bands as the spectral features, we can then use the classification results as an indicator to evaluate the efficacy of the band selection approaches. In Table V, quantitative results in terms of OA, AA, and Kappa from the BS-Net, AE-UBS, and CAE-UBS are given for comparison, using the KNN and SVM classifiers on the three datasets. As seen in Table V, the BS-Net-Conv has the best performance on the Indian Pines dataset with both classifiers, followed slightly behind, especially for the SVM, by our CAE-UBS method. Nevertheless, our method significantly outperforms the BS-Net-FC with both classifiers. For the PaviaU dataset, the proposed approach has the best performance with the SVM classifier and the second best with the KNN classifier, while the BS-Net-FC has achieved the best performance with the KNN classifier. Surprisingly, the BS-Net-Conv has produced much worse results than the BS-Net-FC, especially for the KNN classifier, although it has the best results on the Indian Pines dataset.
This has indicated a certain lack of robustness of the BS-Net model across different datasets. Finally, for the Salinas dataset, our CAE-UBS method has yielded the best performance with both classifiers, though the BS-Net-FC seems to perform slightly better than the BS-Net-Conv. Besides, our proposed CAE-UBS method has outperformed the AE-UBS with both classifiers on the three validated datasets. It is worth noting that the reported band subsets chosen from the BS-Net are the selected best ones, which produce the highest classification accuracies. For the CAE-UBS, we have used the averaged classification results from five randomly chosen band subsets. To this end, the superior performance has validated the robustness and efficacy of our method. As our method is implemented using a less complicated network with only FC layers, its performance could be further improved by introducing convolutional kernels or adding more layers, which will be explored in the future.

4) Effect of α_k Initialization: Generally, the GS distribution initializes α_k with small positive values. In our CAE-UBS, as we assume the general GS initialization method cannot reflect the class probabilities, we have employed the weights from an FC layer to initialize α_k. To illustrate the effectiveness of our proposed initialization approach, we have compared the classification results of the general GS initialization method and ours. The classification results in terms of OA with the SVM classifier on the first three datasets are shown in Fig. 9. As seen, our initialization method has produced a more robust OA curve than the general one, especially on the PaviaU dataset. Accordingly, it can consistently produce improved classification accuracy under the same number of selected bands, which has validated the superiority of the proposed initialization scheme.

5) Analysis of IE:
To further analyze the effect of the utilized information entropy (IE) criterion in our proposed CAE-UBS approach, we have replaced it with the E-FDPC [26], a popular method to rank the band importance [27], [28]. The ranking values obtained by the E-FDPC are employed to determine the desired band subset, and the results with the SVM classifier are also compared in Fig. 9. As seen, the proposed CAE-UBS with the IE criterion has consistently outperformed the variant that uses the E-FDPC rather than the IE for band ranking, especially on the Salinas dataset. The robust performance here has validated the superiority of the IE criterion used in the proposed CAE-UBS approach.
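The accumulated IE criterion admits a short sketch: the Shannon entropy of each band is estimated from its histogram, and the candidate subset with the largest accumulated entropy is retained. The bin count, data, and candidate sets below are illustrative, not the paper's actual configuration.

```python
import numpy as np

def band_entropy(band, n_bins=64):
    """Shannon entropy (in bits) of one band's gray-level histogram."""
    hist, _ = np.histogram(band, bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]                            # 0 * log(0) is taken as 0
    return -(p * np.log2(p)).sum()

def pick_best_candidate(X, candidates):
    """Among candidate band subsets, keep the one whose accumulated IE
    (sum of per-band entropies) is largest."""
    scores = [sum(band_entropy(X[:, b]) for b in subset)
              for subset in candidates]
    return candidates[int(np.argmax(scores))]

# Toy example: band 2 is constant (zero IE), bands 0-1 are informative
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
X[:, 2] = 0.0                               # degenerate band carries no information
best = pick_best_candidate(X, candidates=[[0, 1], [2, 3]])
# best is [0, 1]: the subset avoiding the degenerate band wins
```

A constant band has all its mass in one histogram bin and hence zero entropy, so the criterion naturally steers the search away from uninformative bands.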

6) Comparison of Computational Time:
To evaluate the efficiency of the proposed approach, we have also compared in Table VI the computational times of the various methods with 30 selected bands. Meanwhile, the average OA from the SVM classifier over 3-30 selected bands on all four datasets is used to indicate the efficacy of these band selection algorithms. As seen in Table VI, our method has outperformed all the others, yet with a computational complexity comparable to the conventional methods without DL. Although the OCF seems quite efficient, its OA is about 0.9% lower than ours. For the WaLuDi and WaLuMi, their computational complexity is quite high due to the time-consuming calculation of the mutual information. As it only requires one training epoch, the proposed CAE-UBS approach has actually provided an efficient and effective solution for UBS in HSI.
As an indicator of the computational complexity of the DL-based approaches, the numbers of parameters of our proposed CAE-UBS approach and the BS-Net [40] are compared in Table VII. As seen, our CAE-UBS has far fewer trainable parameters than the BS-Net and AE-UBS. However, its classification accuracies are comparable or even superior to the BS-Net, as shown in Table V. Note that the reported computational time, including the training process for HSI reconstruction, is measured on a CPU; hence, the efficiency can be further improved with the aid of a GPU, as in other DL-based band selection approaches such as the BS-Nets [40]. In comparison to the BS-Nets implemented on an 11-GB GPU, our CAE-UBS algorithm implemented on a CPU is about 1000 times faster, yet the classification results are very comparable or superior. Thanks to the GS trick and the entropy constraints, this has validated again the great potential of CAE-based UBS in HSI.
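As a rough illustration of why an FC-only design keeps the parameter count small, the sketch below counts the trainable parameters of a hypothetical concrete-selector-plus-FC-decoder network; the layer sizes are illustrative and do not reproduce the exact CAE-UBS or BS-Net architectures reported in Table VII.

```python
def fc_param_count(layer_sizes):
    """Trainable parameters of a fully connected network:
    weights (in * out) plus biases (out) for each consecutive layer pair."""
    return sum(i * o + o for i, o in zip(layer_sizes[:-1], layer_sizes[1:]))

# Hypothetical shapes: a concrete selector choosing k of B bands
# (a k x B logit matrix, no bias) followed by a small stacked FC decoder.
B, k = 103, 15                     # e.g., PaviaU: 103 bands, 15 selected
selector_params = k * B            # the alpha logit matrix: 15 * 103 = 1545
decoder_params = fc_param_count([k, 64, B])   # 15->64->103 decoder: 7719
total = selector_params + decoder_params      # 9264 in this toy setting
```

Convolutional selector networks such as the BS-Net-Conv add kernel weights on top of such a count, which is consistent with the gap observed in Table VII.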
V. CONCLUSION

Although a few unsupervised approaches have been proposed for hyperspectral band selection in the last two decades, the results in general show a lack of robustness due to the band ranking schemes adopted, whilst the DL approaches often suffer from a huge computational burden due to the numerous training epochs required. In this article, we have proposed a novel CAE-UBS framework for unsupervised hyperspectral band selection. With the introduced CAE, the collaborative behavior of the bands during the HSI reconstruction process can be exploited for searching the candidates of potential band subsets. By implementing a novel encoder layer with the GS trick, a discrete matrix can be generated to choose the desired band subset, where the parameters of the proposed CAE can be learned under the constraint of the reconstruction loss. In addition, maximizing the accumulated IE is found to be an effective global searching strategy to determine the optimal band subset. As our CAE can produce satisfactory results with only one training epoch, its computational time has been significantly reduced to the same level as the conventional methods. The robust performance in experiments on four publicly available datasets has fully demonstrated the efficiency and efficacy of our CAE-UBS framework.
Although the proposed approach has produced overall the best performance, the results vary in the four datasets. In the future, we will focus on the development of a multi-task network for selecting more discriminative bands for classification, aiming to achieve more consistent performance in different datasets. In addition, we will explore other band selection applications in HSI beyond image classification, such as spectral unmixing and HSI reconstruction.