Spectral–Spatial Self-Attention Networks for Hyperspectral Image Classification

This study presents a spectral–spatial self-attention network (SSSAN) for classification of hyperspectral images (HSIs), which can adaptively integrate local features with long-range dependencies related to the pixel to be classified. Specifically, it has two subnetworks. The spatial subnetwork introduces the proposed spatial self-attention module to exploit rich patch-based contextual information related to the center pixel. The spectral subnetwork introduces the proposed spectral self-attention module to exploit the long-range spectral correlation over local spectral features. The extracted spectral and spatial features are then adaptively fused for HSI classification. Experiments conducted on four HSI datasets demonstrate that the proposed network outperforms several state-of-the-art methods.


I. INTRODUCTION
HYPERSPECTRAL images (HSIs) have hundreds of spectral bands, collecting abundant spectral and spatial information for monitoring the surface of the Earth [1]. Such valuable information enables them to discriminate more land-cover materials under various conditions, facilitating a wide range of applications, including environmental observation [2], resource assessment [3], and urban development monitoring [4]. Classification is one of the important tasks for these applications.
Over the past few decades, various HSI classification methods have been developed. Earlier methods are mainly focused on the spectral features, where typical approaches include support vector machines (SVMs) [5], multinomial logistic regression (MLR) [6], and manifold learning [7]. To mitigate the problem of dimensionality inherent in HSI, some dimensionality reduction strategies were proposed based on feature extraction [8] and band selection [9]. Some unmixing models were also proposed to address the spectral mixture issues in HSI, such as the extended linear mixing model (ELMM) [10] and the augmented linear mixing model (ALMM) [11]. Another unmixing model called sparsity-enhanced convolutional decomposition (SeCoDe) was proposed in [12], which uses the convolution operation to learn spatial contextual information to improve its unmixing performance.
An increasing number of methods incorporate spatial features to improve on the class representation obtained from spectral features alone. Some works extract spatial features via morphological operators [13], Gabor filters [14], and hypergraph structure [15] or apply Markov random fields (MRFs) [16], among others, and then combine them with spectral features for classification. Others directly extract the joint spectral-spatial features by using 3-D discrete wavelets [17], 3-D scattering wavelets [18], 3-D Gabor filters [19], and so on. Nevertheless, these traditional methods extract features from the original data in a shallow manner, which makes it difficult to achieve substantial performance gains.
In recent years, deep learning algorithms have successfully broken the limitations of the traditional feature extraction techniques. They can automatically extract hierarchical features from data, achieving significant progress in computer vision, including object detection [20], semantic segmentation [21], and image classification [22]. Furthermore, various deep learning models have been investigated in HSI classification. Multilayer perceptron (MLP) [23], stacked autoencoder (SAE) [24], and deep belief network (DBN) [25] were used for feature extraction of HSI. In [26], a recurrent neural network (RNN) was used to analyze the hyperspectral data as a sequence, which was then classified via network reasoning. In [27], a convolutional neural network (CNN) was used for deep spectral-spatial feature extraction and classification. Hong et al. [28] applied graph convolutional networks (GCNs) to capture long-range spatial features, as they can model the topological relations between samples through their graph structures.
The aforementioned studies have shown that feature extraction plays a key role in HSI classification, and it goes through an evolution from shallow to deep [29]. Among these deep learning algorithms, CNN generally outperforms others in feature extraction, mainly because its local connections and shared weights characteristics enable it to maintain the original structure while learning spatial features and greatly reduce the number of network parameters [30].
These unique characteristics make the CNN-based methods valuable in spectral-spatial classification of HSI [31]. In [32], CNNs were used to extract deep spatial features. Several end-to-end 2-D CNN models were designed to jointly exploit the spectral-spatial information by using different convolution kernels [33], [34]. More recently, 3-D CNN was used to extract the joint spectral-spatial features for HSI classification [35], [36]. While performance was improved by using the 3-D CNN, the significantly increased number of parameters may cause overfitting and bring additional computational cost. In [37]-[39], the spatial and spectral features were learned separately by 2-D CNN or other algorithms (e.g., SAE, 1-D CNN, and RNN) and then fused together. This scheme can achieve good performance while significantly reducing the computational load compared with 3-D CNNs [38]. Considering the insufficient labeled samples of HSIs, we also adopt this scheme in this article to minimize the number of parameters required for training and avoid overfitting.
In recent years, an increasing number of studies have demonstrated that deeper networks have stronger feature representation ability, but they are difficult to optimize, especially with limited labeled samples [33]. The emergence of the residual network (ResNet) [40] and the dense convolutional network (DenseNet) [41] makes it possible to train deeper networks to boost the performance of HSI classification. In [42], a spectral-spatial residual network (SSRN) was proposed to alleviate the declining-accuracy phenomenon. A fast and dense spectral-spatial convolution (FDSSC) network was proposed to overcome the overfitting problem [43]. To extract spectral-spatial features at multiple scales, a cascaded dual-scale crossover network based on SSRN was proposed in [44], where a dual-scale crossover module was designed to capture multiscale features by using different convolution kernels. In addition, a fully dense multiscale fusion network was developed to directly connect feature maps of different layers with different resolutions [45]. Despite these developments, the convolution filters of CNN-based methods still have two limitations: they treat the input content equally, and they model only local features. Generally, however, the spectral and spatial features extracted from the input contribute differently to classification.
Recently, the attention mechanism was developed by simulating the human visual system, which can selectively focus on salient parts instead of treating each part equally [1]. Embedding it into the network can promote the representation capacity of the extracted features, which has achieved good performance in computer vision [46]-[48]. Subsequently, many attention mechanisms proposed for scene segmentation of generic natural images in [1], [30], and [49]-[55] have been directly applied to the patch-based CNN for HSI classification. The squeeze-and-excitation (SE) block [55], which uses global pooling to generate the channel attention matrix, was applied to a patch-based CNN to recalibrate its channel-wise feature responses [51], [56]. Subsequently, many similar spectral attention modules [57], [58] were proposed for HSI classification to selectively excite informative channels and suppress useless ones. To make the network adaptively enhance and suppress information in both spectral and spatial dimensions, many spatial-spectral attention modules were proposed. In [1], a new spectral-spatial visual attention-driven module was incorporated into the ResNet to refine the extracted features. The convolutional block attention module (CBAM) proposed for scene segmentation of generic natural images in [48] was adopted in [30] and [49] for HSI classification. Its channel-wise attention module determines the weight of each channel via MaxPooling and AvgPooling layers along the spatial dimension, and the spatial-wise attention module determines the weight of each position in the feature maps via pooling layers along the channel axis. Similarly, a cooperative spectral-spatial attention module for HSI classification was proposed in [53], which generates the spectral and spatial attention maps by using pooling layers to squeeze the spatial and channel dimensions, respectively.
In [59], a dual attention network (DANet) was proposed for scene segmentation of generic natural images. Its position self-attention module captures the spatial correlation between any two positions of the feature maps, and the channel self-attention module captures the spectral correlation between any two channel maps. These self-attention modules [59] were adopted in [52] and [60] for HSI classification and achieved state-of-the-art performance.
In the aforementioned attention-based methods for HSI classification, some works (e.g., [49], [52], [56]) directly transferred the attention modules [48], [55], [59], which are embedded in pixel-based CNNs for scene segmentation of generic natural images, to their patch-based CNNs for HSI classification. Although some attention modules (e.g., [53], [54]) were proposed specifically for HSI classification, the way they compute the attention maps is similar to that of the modules embedded in pixel-based CNNs for scene segmentation. As can be seen, none of them is designed specifically according to the characteristics of the patch-based CNN. In a patch-based CNN, the input patch is used to predict its central pixel, and the neighboring pixels may contribute differently to the classification of the center pixel. Therefore, it is necessary for the patch-based CNN to explore the latent correlations between the center pixel and its neighbors in a global view.
To investigate this opportunity for better HSI classification, we propose a spectral-spatial self-attention network (SSSAN) with two subnetworks, designed for spectral and spatial feature extraction. Specifically, the spatial subnetwork introduces the proposed spatial self-attention module to capture the spatial feature correlations between the center pixel and its surroundings. Meanwhile, the spectral subnetwork introduces the proposed spectral self-attention module to exploit the long-range correlations over local spectral features. The "score weighted" fusion method [39] is then used to fuse the extracted spatial and spectral features for classification. The main contributions of this article can be summarized as follows.
1) A spatial self-attention module is proposed for the patch-based CNN to exploit the spatial feature correlation between the center pixel and its surroundings, which improves the spatial feature representation related to the center pixel specifically.
2) A spectral self-attention module is designed for 1-D CNN to capture long-range spectral correlations over local spectral features.
3) The proposed spatial and spectral self-attention modules are designed as add-on blocks so that they can be plugged into any patch-based CNN and 1-D CNN backbone networks, respectively, to generate high-quality discriminant features. Both modules are lightweight.
The remainder of this article is organized as follows. The related works of the proposed method are presented in Section II. In Section III, we describe the proposed method in detail. The experiments and results are presented and discussed in Section IV. Finally, concluding remarks are provided in Section V.

II. RELATED WORKS
In this section, we briefly introduce the basic techniques of the proposed methods, which are the DenseNet and attention mechanisms.

A. Dense Neural Networks
Generally, deeper networks have better performance, but as the network deepens, its parameters will increase, making it harder to train. The emergence of DenseNet mitigates this problem. As shown in Fig. 1, the DenseNet framework is mainly composed of dense blocks and transition layers. Relevant details of these components are presented as follows.
As can be seen, within each dense block the input of each layer comes from the outputs of all previous layers of that block, which can be expressed as

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \tag{1}$$

where $[x_0, x_1, \ldots, x_{l-1}]$ denotes that the output feature maps of layers 0 to $l-1$ are concatenated in the channel dimension. $H_l(\cdot)$ represents a composite function, consisting of a batch normalization (BN) layer, ReLU, and a convolutional (Conv) layer with a kernel size of 3 × 3 (denoted as BN-ReLU-Conv3 × 3 for short). It should be noted that each Conv layer outputs $k$ feature maps in the dense block, where $k$ is called the growth rate in [41]. Assuming that $k_0$ is the number of channels in the input layer, the $l$th layer will have $k_0 + k \times (l-1)$ input feature maps. Therefore, as the number of layers increases, the number of input channels becomes very large even though $k$ is set to be small. To alleviate this, a bottleneck layer (i.e., BN-ReLU-Conv1 × 1) is added before each 3 × 3 Conv to reduce the number of input channels.
The layers between the dense blocks are the transition layers, used to reduce the size of feature maps. Each transition layer consists of a BN layer, ReLU, and a 1 × 1 Conv layer followed by an average pooling (AvgPooling) layer.
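The channel bookkeeping of a dense block can be checked with a few lines of Python. This is only a minimal sketch: the function name `dense_layer_input_channels` and the value k0 = 64 are illustrative assumptions and do not come from the article (k = 22 is the growth rate chosen later in Section IV).

```python
def dense_layer_input_channels(k0: int, k: int, l: int) -> int:
    """Input feature maps seen by layer l (1-indexed) of a dense block.

    k0: channels entering the block; k: growth rate (maps added per layer).
    """
    if l < 1:
        raise ValueError("layers are numbered from 1")
    return k0 + k * (l - 1)


# Illustrative values only: k0 = 64 is an assumption, not taken from the article.
for layer in (1, 2, 4, 8):
    print(layer, dense_layer_input_channels(k0=64, k=22, l=layer))
```

The linear growth in the printed counts is what motivates the 1 × 1 bottleneck placed before each 3 × 3 convolution.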

B. Attention Mechanism
Attention mechanisms can not only adaptively emphasize or suppress information but also model long-range dependencies of data, and they have been widely used in many tasks [59]. Recently, many attention modules have been applied to HSI classification. The attention modules proposed in [48] and [59] are embedded in pixel-based CNNs for scene segmentation and have been directly applied to the patch-based CNN for HSI classification [49], [52], [60]. Other attention modules proposed in [53] and [61] for HSI classification are similar to the modules in [48] and [55], which are designed for scene segmentation. These attention modules can be mainly divided into two categories. The first category comprises the conventional attention modules [49], [53], [54], [61], which compute the spatial or spectral attention map(s) by using pooling and FC layers to exploit the inter-spatial or inter-channel relationship of the extracted features. The other category comprises the self-attention modules [52], [60], which generate the spatial and spectral self-attention maps by calculating the correlation between features.
A basic structure of the self-attention module is shown in Fig. 2. Its inputs consist of three matrices: Query ($Q$), Key ($K$), and Value ($V$), all of which come from the same input. The similarity is calculated between $Q$ and $K$, and the results are then normalized by a softmax function to obtain the self-attention matrix. Finally, the self-attention matrix is multiplied by the matrix $V$ to get the output. This operation can be described as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(\mathrm{sim}(Q, K))\, V \tag{2}$$

where $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity calculated between $Q$ and $K$.
Self-attention mechanisms can effectively strengthen global feature representations by using fewer parameters. The existing ones, however, are mainly designed for the scene segmentation task and are usually embedded in the pixel-based CNN. In HSI classification, few attention modules are designed based on the uniqueness of the patch-based CNN and the 1-D CNN.
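The three steps just described (similarity, softmax normalization, weighting of V) can be sketched in NumPy as follows. This is an illustrative sketch rather than the module of Fig. 2: the dot product is assumed as the similarity measure, and the projection matrices `w_q`, `w_k`, `w_v` and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """x: (n, d) features; w_q, w_k, w_v: (d, d_out) projections (hypothetical)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v   # Q, K, V all come from the same input
    scores = q @ k.T                      # assumed similarity: dot product
    attn = softmax(scores, axis=-1)       # each row is a distribution over positions
    return attn @ v, attn


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Every output row is a convex combination of the rows of V, which is how a single layer can mix information from all positions.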

III. PROPOSED METHOD

A. Overview
In this section, a novel self-attention module-based CNN architecture is proposed to optimize the discrimination of the extracted features. As shown in Fig. 3, the proposed network consists of three parts: the spatial self-attention module-based spatial subnetwork, the spectral self-attention module-based spectral subnetwork, and the "score weighted" fusion and classification. Specifically, denote $\mathcal{H} \in \mathbb{R}^{H \times W \times B}$ as an HSI data cube, where $H$, $W$, and $B$ denote, respectively, the height, the width, and the number of bands of $\mathcal{H}$. As PCA has no trainable parameters, we use it to reduce the number of bands from $B$ to $b$. After dimension reduction, for a pixel $p_i$ to be classified, a spatial patch $Z_i \in \mathbb{R}^{d \times d \times b}$ centered at $p_i$ is taken as the spatial subnetwork input. It passes through three spatial attention dense blocks, two 2-D transition layers, and a global 2-D average pooling, eventually producing the 1-D spatial features. Meanwhile, the spectrum of $p_i$ is taken as the spectral subnetwork input. It passes through three spectral attention dense blocks, two 1-D transition layers, and a global 1-D average pooling, eventually producing the 1-D spectral features. The extracted spatial and spectral features are fed into the "score weighted" fusion part for classification.
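How the two subnetwork inputs are taken from a data cube can be sketched as below. The helper name `extract_patch` and the reflect padding at the image border are assumptions made for illustration; the article does not state how border pixels are handled.

```python
import numpy as np


def extract_patch(cube, row, col, d):
    """Return the d x d x b patch of `cube` centered at pixel (row, col)."""
    if d % 2 == 0:
        raise ValueError("patch size d must be odd so that a center pixel exists")
    r = d // 2
    # Assumption: reflect-pad the two spatial axes so border pixels get full patches.
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    # Pixel (row, col) of `cube` sits at (row + r, col + r) in `padded`.
    return padded[row:row + d, col:col + d, :]


cube = np.arange(6 * 7 * 3, dtype=float).reshape(6, 7, 3)  # toy cube, b = 3
patch = extract_patch(cube, row=2, col=3, d=5)   # spatial subnetwork input
spectrum = cube[2, 3, :]                         # spectrum of the same pixel
print(patch.shape, spectrum.shape)
```

The center of the returned patch, `patch[d // 2, d // 2]`, is the pixel whose label is being predicted.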
After the network is built, its parameters are initialized with the He initialization [62] and regularized with the L2 weight decay penalty. The network is trained in an end-to-end manner. During the training process, the Adam [63] optimizer is used to update the parameters of the network by backpropagating the gradient of the cross-entropy cost function. In the following, relevant details of the three parts of the proposed network are presented.
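The training objective described here, cross-entropy plus an L2 penalty on the weights, can be written out as a small NumPy sketch. The function name and the coefficient `weight_decay=1e-4` are hypothetical; the article does not report the weight decay value.

```python
import numpy as np


def cross_entropy_with_l2(probs, labels, weights, weight_decay=1e-4):
    """Mean cross-entropy of predicted class probabilities plus an L2 penalty.

    probs: (n, K) rows of class probabilities; labels: (n,) integer classes;
    weights: iterable of parameter arrays; weight_decay: assumed coefficient.
    """
    n = probs.shape[0]
    picked = probs[np.arange(n), labels]          # probability of the true class
    ce = -np.mean(np.log(picked + 1e-12))         # small constant avoids log(0)
    l2 = weight_decay * sum(float(np.sum(w ** 2)) for w in weights)
    return ce + l2


probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy_with_l2(probs, labels, weights=[np.ones((3, 3))]))
```

In a framework such as Keras, the optimizer would minimize this quantity by backpropagation; the sketch only evaluates it.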

B. Spatial Self-Attention Module-Based Spatial Subnetwork
It is essential to explore discriminant spatial feature representations for more effective HSI classification. Over the past few years, many spatial attention modules [48], [54], [59] have been proposed to enhance the discriminability of spatial features. These modules encode where to emphasize or suppress by utilizing the inter-spatial relationship of features. However, none of them explores the latent correlation between the center pixel and its surroundings. In the patch-based CNN, the spatial support from neighbors near a class boundary is often invalid because these neighboring pixels may belong to a different category from the center pixel. During the convolution operation, such neighboring pixels have a negative effect on feature learning [64]. To resolve this problem, we design a spatial self-attention module for the patch-based CNN. It assigns weights to different features by measuring the similarity between the surrounding features and the central one. Therefore, it can adaptively strengthen the long-range features relevant to the center pixel while suppressing unnecessary ones, improving the spatial feature representation for predicting the center pixel.
Fig. 4(b) shows the operation of the proposed spatial self-attention module. Let $X \in \mathbb{R}^{w \times w \times c}$ be the input feature maps, where $w \times w$ denotes the spatial size and $c$ denotes the number of channels. Note that $w$ is always an odd number in the patch-based CNN since the size of the input patch is odd, and the convolution and pooling operations do not change the parity of their inputs. To facilitate the relational operation between spatial features, we first feed $X$ into two parallel 1 × 1 Conv layers to generate two new feature maps $A$ and $B$, where $A, B \in \mathbb{R}^{w \times w \times c}$. We denote the center vector of $A$ as $A_i \in \mathbb{R}^{1 \times c}$ and all its neighbors in $A$ as $A_{i,1}, A_{i,2}, A_{i,3}, \ldots, A_{i,n}$ with $n = w \times w$. The similarity between $A_i$ and its neighbors $A_{i,1}, A_{i,2}, A_{i,3}, \ldots, A_{i,n}$ is evaluated as

$$S_{i,t} = \mathrm{sim}(A_i, A_{i,t}), \quad t = 1, 2, \ldots, n \tag{3}$$

where $\mathrm{sim}(\cdot, \cdot)$ is the similarity measure and $S_{i,t}$ measures the feature correlation between the center vector $A_i$ and its neighborhood vector $A_{i,t}$. The softmax function is used to normalize $S \in \mathbb{R}^{w \times w}$ to obtain the spatial attention map

$$S'_{i,t} = \frac{\exp(S_{i,t})}{\sum_{t=1}^{n} \exp(S_{i,t})}. \tag{4}$$

A weighted matrix is obtained by

$$W_{\text{spa}} = S' \otimes B \tag{5}$$

where $\otimes$ denotes the element-wise product, in which the spatial attention values are broadcast along the channel dimension. $W_{\text{spa}} \in \mathbb{R}^{w \times w \times c}$ is the refined output, which focuses on more informative features related to the center pixel while suppressing unnecessary ones. A residual connection is then applied to generate the final output

$$V_{\text{spa}} = W_{\text{spa}} + X \tag{6}$$

where $V_{\text{spa}} \in \mathbb{R}^{w \times w \times c}$. It can be inferred from (6) that the features similar to the central pixel feature are enhanced, while dissimilar ones are suppressed, thus improving the feature representation ability for the center pixel. In other words, the module can aggregate patch-based contextual information related to the pixel to be classified according to the spatial attention map $S'$.
By embedding the spatial self-attention module before the concatenation operation, as shown in Fig. 4(a), the spatial attention dense block can be constructed. It can be expressed as follows:

$$x_l = f([x_0, x_1, \ldots, x_{l-1}]), \quad l = 1, \ldots, m, \qquad y = [x_0, x_1, \ldots, x_m] \tag{7}$$

where $x_0$ and $y$ denote the input and output features of the spatial attention dense block, respectively, and $x_l$ is the output of its $l$th layer.
$[\cdot]$ refers to the concatenation operation, and $f(\cdot)$ denotes the composite function, including BN-ReLU-Conv1 × 1, BN-ReLU-Conv3 × 3, and the spatial self-attention module. Note that all the output features of the attention module are passed to subsequent units, which can not only alleviate the vanishing gradient but also strengthen feature representations effectively.
The architecture of the spatial subnetwork is shown in Part 1 of Fig. 3. It consists of a 3 × 3 Conv layer, three spatial attention dense blocks, two transition layers, and a global average pooling layer. The output spatial features of this subnetwork are fed into Part 3 of Fig. 3 for fusion and data classification.
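The data flow of the spatial self-attention module (two 1 × 1 projections, center-to-neighbor similarity, softmax over the patch, broadcast weighting, residual) can be sketched in NumPy. Several points are assumptions: cosine similarity is used because it is the measure the article names for the spectral module, while the text here does not fix the spatial similarity function; the 1 × 1 Conv layers are modeled as bias-free per-pixel linear maps `w_a` and `w_b`; and the function name is hypothetical.

```python
import numpy as np


def spatial_self_attention(x, w_a, w_b, eps=1e-8):
    """x: (w, w, c) feature maps; w_a, w_b: (c, c) weights of two 1x1 convs."""
    w = x.shape[0]
    if x.shape[1] != w or w % 2 == 0:
        raise ValueError("expected a square patch with odd side length")
    a = x @ w_a                       # feature map A (1x1 conv as a linear map)
    b = x @ w_b                       # feature map B
    center = a[w // 2, w // 2]        # center vector A_i, shape (c,)
    # Assumed similarity: cosine between the center vector and every position.
    s = (a @ center) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(center) + eps)
    e = np.exp(s - s.max())
    s_norm = e / e.sum()              # softmax over all w*w positions
    w_spa = s_norm[..., None] * b     # attention broadcast along the channel axis
    return w_spa + x, s_norm          # residual connection


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5, 4))
w_a, w_b = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
v_spa, attn = spatial_self_attention(x, w_a, w_b)
print(v_spa.shape, attn.shape)
```

Under this assumption the center position always receives the largest attention weight, since its cosine similarity with itself is 1.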

C. Spectral Self-Attention Module-Based Spectral Subnetwork
With abundant spectral information in HSI, there are inevitably some correlations between spectral bands. Convolution kernels in a 1-D CNN can only represent local cross-channel interaction, i.e., they cannot explore the long-range channel correlation. A few studies [59], [65], [66] used the spectral attention mechanism to encode long-range dependencies and improve the spectral feature representation, but they were designed for 2-D CNNs. In this section, we design a spectral self-attention module, aiming to capture long-range spectral correlations of the 1-D CNN over the local spectral features. It uses the cosine similarity to exploit the interdependencies between channels, improving the spectral feature representation.
The process of the spectral self-attention module is shown in Fig. 5(b). The input is the spectral feature vectors $Y \in \mathbb{R}^{l \times f}$, where $l$ is the length of the spectral feature vectors and $f$ is the number of channels, equal to the number of filters in the Conv layer. Considering that the spectral self-attention module needs to calculate the relationship between different channels, we directly perform a similarity calculation between any two channels in $Y$ to maintain this relationship as follows:

$$Q_{u,v} = \frac{Y_u \cdot Y_v}{\lVert Y_u \rVert_2 \, \lVert Y_v \rVert_2} \tag{8}$$

where $Y_u \in \mathbb{R}^{l}$ denotes the $u$th channel of $Y$ and $Q_{u,v}$ measures the correlation between the $u$th channel and the $v$th channel. Then, we use the softmax function to normalize each column of $Q \in \mathbb{R}^{f \times f}$ to obtain the spectral attention probability map

$$Q'_{u,v} = \frac{\exp(Q_{u,v})}{\sum_{u=1}^{f} \exp(Q_{u,v})}. \tag{9}$$

A weighted matrix is obtained by

$$W_{\text{spe}}^{(v)} = \sum_{u=1}^{f} Q'_{u,v} \otimes Y_u, \quad v = 1, \ldots, f \tag{10}$$

where $W_{\text{spe}} \in \mathbb{R}^{l \times f}$, $W_{\text{spe}}^{(v)}$ is its $v$th channel, and $\otimes$ denotes the element-wise product (each scalar weight is broadcast over the channel it multiplies). It can be deduced from (10) that the features at each channel are the weighted sum of the features at all channels. Finally, a residual connection is performed to obtain the final output

$$V_{\text{spe}} = W_{\text{spe}} + Y \tag{11}$$

where $V_{\text{spe}} \in \mathbb{R}^{l \times f}$. The module can thus model the long-range correlations between spectral channels, boosting spectral feature discriminability.
Similar to the spatial self-attention module, we insert the spectral self-attention module before the concatenate operation in the spectral attention dense block, as shown in Fig. 5(a). In addition, we can see from Part 2 of Fig. 3 that the setting of the spectral subnetwork is the same as the spatial subnetwork, except that all the Conv and pooling operations in this subnetwork adopt 1-D computation. The output spectral features of this subnetwork are also fed into Part 3 of Fig. 3 for fusion and data classification.
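The spectral self-attention module can be sketched in the same way. The sketch follows the description in this section (cosine similarity between channels, column-wise softmax, weighted sum over channels, residual); the function name is hypothetical, and the attention map is taken to be f × f because it relates pairs of the f channels.

```python
import numpy as np


def spectral_self_attention(y, eps=1e-8):
    """y: (l, f) spectral features with l positions and f channels."""
    norms = np.linalg.norm(y, axis=0)                  # per-channel norms, (f,)
    q = (y.T @ y) / (np.outer(norms, norms) + eps)     # (f, f) cosine similarities
    e = np.exp(q - q.max(axis=0, keepdims=True))
    q_norm = e / e.sum(axis=0, keepdims=True)          # each column sums to 1
    # Channel v of the result is sum_u q_norm[u, v] * y[:, u].
    w_spe = y @ q_norm
    return w_spe + y, q_norm                           # residual connection


rng = np.random.default_rng(0)
y = rng.normal(size=(6, 3))
v_spe, q_norm = spectral_self_attention(y)
print(v_spe.shape, q_norm.sum(axis=0))
```

Because no learned projection is involved in this sketch, the module adds no parameters, which is consistent with the article's remark that the modules are lightweight.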

D. Weight Fusion and Classification
Considering that the obtained spatial and spectral features are in two separate domains, we adopt the "score weighted" fusion method in [39] to perform the classification. It can be simply understood that the final score vector is obtained by a weighted sum of the spatial and spectral scores. As shown in Part 3 of Fig. 3, the output features of each subnetwork are fed to an FC layer. Note that the number of neurons in the FC layer is equal to the number of classes, and the value of each neuron can be regarded as a class-specific response. The outputs of the FC layers corresponding to the spatial subnetwork and the spectral subnetwork are expressed as $F_{\text{spa}} \in \mathbb{R}^{K}$ and $F_{\text{spe}} \in \mathbb{R}^{K}$, respectively, where $K$ is the number of classes. Then, the fused probability over the classes is computed as

$$P = \sigma(\lambda F_{\text{spa}} + (1 - \lambda) F_{\text{spe}}) \tag{12}$$

where $\sigma(\cdot)$ denotes the softmax function. $\lambda$ is a weighting parameter in the range of [0, 1], which is initialized to 0.5 and then adjusted automatically during network optimization. Experiments and validations are presented and discussed in Section IV.
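A minimal sketch of the "score weighted" fusion is given below. It assumes that the weight λ multiplies the spatial scores and 1 − λ the spectral scores before the softmax; which branch receives λ is an assumption, and here λ is a fixed argument, whereas in the network it is learned. The function names are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def score_weighted_fusion(f_spa, f_spe, lam=0.5):
    """Fuse two K-dimensional class-score vectors into class probabilities."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Assumed form: weighted sum of the scores, then softmax.
    return softmax(lam * f_spa + (1.0 - lam) * f_spe)


f_spa = np.array([2.0, 0.5, -1.0])   # toy FC outputs, K = 3
f_spe = np.array([0.1, 1.5, 0.3])
probs = score_weighted_fusion(f_spa, f_spe, lam=0.5)
print(probs, probs.argmax())
```

Setting `lam` to 1 or 0 reduces the classifier to the spatial or the spectral branch alone, which makes the learned value easy to interpret.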

IV. EXPERIMENTS AND RESULTS

A. Description of the Datasets
We carried out experiments on four datasets: University of Pavia (PU), Salinas (SA), Kennedy Space Center (KSC), and University of Houston (UH). Details of these datasets are given as follows.
The PU dataset was taken over the University of Pavia, Northern Italy, by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor. It has 610 × 340 pixels with a spatial resolution of 1.3 m, composed of 115 bands covering the wavelength from 0.43 to 0.86 μm. After discarding 12 noisy and water absorption bands, only 103 bands were preserved. As summarized in Table I, there are nine land-cover classes, and the training and testing samples with the same settings as [39] are also given.
The SA image was recorded by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the area of Salinas Valley, CA, USA. The image size is 512 × 217 pixels. Details of the land-cover classes and the number of training and testing samples in each class are shown in Table II.
The UH image was gathered by an airborne sensor and covers the area of the University of Houston. It has 349 × 1905 pixels with a spatial resolution of 2.5 m and consists of 144 spectral channels ranging from 0.38 to 1.05 μm. These data include 15 land-cover classes. We adopted the standard training and testing sets given by the 2013 GRSS Data Fusion Contest. Details of these classes and the number of training and testing samples in each class are shown in Table III.
The KSC dataset covers an area of KSC, FL, USA, and was also gathered by the AVIRIS sensor. It consists of 512 × 614 pixels and 176 spectral bands after removing water absorption and low-SNR bands. It has a spatial resolution of 18 m and a spectral resolution of 10 nm ranging from 0.4 to 2.5 μm. Details of the land-cover types and the number of training and testing samples in each class are listed in Table IV, which are the same as in [39].
In deep learning, data normalization can unify data magnitude, promote network convergence, and prevent gradient explosion. Therefore, the HSI datasets were normalized to [0, 1] by using the min-max normalization before the training and testing in the following experiments. In addition, these datasets all used the same data augmentation strategy as in [37].
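Min-max normalization to [0, 1] can be sketched as follows. The sketch scales with the global minimum and maximum of the whole cube; whether the article normalizes globally or per band is not stated, so this choice and the function name are assumptions.

```python
import numpy as np


def min_max_normalize(cube):
    """Scale all values of `cube` to [0, 1] using its global min and max."""
    cube = np.asarray(cube, dtype=float)
    lo, hi = cube.min(), cube.max()
    if hi == lo:
        return np.zeros_like(cube)    # constant input: avoid division by zero
    return (cube - lo) / (hi - lo)


cube = np.array([[[2.0, 4.0], [6.0, 10.0]]])   # toy 1 x 2 x 2 cube
print(min_max_normalize(cube))
```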

B. Experiment Setting
To verify the performance of the proposed method, we conducted a series of experiments on these four datasets. First, we analyzed the impact of different hyperparameters on classification performance. Second, we evaluated the effects of the proposed spectral self-attention module and spatial self-attention module. We also compared the proposed network with other state-of-the-art CNN-related methods. All the experiments were implemented on Ubuntu 16.04 with an Nvidia GeForce RTX 2080 GPU. The classification performance was measured by four common quantitative metrics: the producer accuracy (PA) of each class, overall accuracy (OA), average accuracy (AA), and kappa coefficient (Kappa). PA measures the percentage of correctly classified pixels for a certain class, which can be derived for the training dataset or the testing dataset. AA is the average of the PA over all the classes. OA represents the overall percentage of correctly classified pixels for the whole dataset, including all classes, for either the training or testing dataset. The kappa coefficient is a score that measures the level of agreement between the classification results and the corresponding ground truth (GT). Its value ranges from −1 to 1, and the larger the value, the higher the level of agreement. To avoid biased estimation, all experiments were conducted with five independent tests, and the average values were reported for all the evaluation metrics.
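The four metrics can all be derived from a confusion matrix, as in the sketch below. The function name and the toy labels are illustrative; the sketch assumes every class has at least one reference sample so that each PA is defined.

```python
import numpy as np


def classification_metrics(y_true, y_pred, n_classes):
    """Return (PA per class, OA, AA, Kappa) from integer label arrays."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)         # rows: reference, cols: predicted
    total = cm.sum()
    pa = np.diag(cm) / cm.sum(axis=1)          # correct / reference samples per class
    oa = np.trace(cm) / total                  # overall fraction correct
    aa = pa.mean()                             # mean of the per-class accuracies
    # Chance agreement expected from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1.0 - pe)
    return pa, oa, aa, kappa


y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 0])
pa, oa, aa, kappa = classification_metrics(y_true, y_pred, n_classes=3)
print(pa, oa, aa, kappa)
```

In this toy case OA (4/6) exceeds AA (5/9) because the class that is entirely misclassified has only one sample, which illustrates why both are reported.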

C. Parameter Setting
For PCA-based dimensionality reduction, the numbers of the preserved principal components are 3, 4, 2, and 75 for the SA data, the PU data, the UH data, and the KSC data, respectively. These numbers were determined by retaining at least 99% of the total data variation in the original HSI. We trained the network for 25 epochs with a batch size of 100 and a learning rate of 0.0001. The proposed networks were implemented using the Keras framework with TensorFlow as the backend.
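Choosing the number of principal components from a variance threshold can be sketched as follows. This is an illustrative NumPy version based on the singular values of the centered pixel matrix; the function name and the synthetic data are assumptions, and the article's own PCA implementation is not specified.

```python
import numpy as np


def n_components_for_variance(pixels, threshold=0.99):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `threshold`. pixels: (n_pixels, n_bands)."""
    centered = pixels - pixels.mean(axis=0)
    # Squared singular values are proportional to the variance per component.
    s = np.linalg.svd(centered, compute_uv=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    return min(int(np.searchsorted(ratio, threshold)) + 1, len(s))


# Synthetic data: two dominant directions and eight near-constant bands.
rng = np.random.default_rng(0)
scales = np.array([10.0, 5.0] + [0.01] * 8)
pixels = rng.normal(size=(500, 10)) * scales
print(n_components_for_variance(pixels, threshold=0.99))
```

For a real cube, `pixels` would be the H × W × B array reshaped to (H·W, B).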
Besides, we also analyzed the effect of some key hyperparameters on the classification performance. Details are presented as follows.
1) Effect of the Number of Conv Layers: Fig. 6(a) shows the effect of the number of Conv layers (i.e., the depth of the network) on the OA of the proposed network. Here, the number of Conv layers is calculated within each subnetwork, excluding those in the self-attention modules. Deeper networks generally have more powerful feature representation ability, but too deep networks will cause gradient instability and network degradation. In Fig. 6(a), it is clear that 16 Conv layers lead to the best results on these four datasets. After that, the network performance remains unchanged or decreases slightly. Therefore, in the following experiments, the number of Conv layers is set to 16 for all datasets.
2) Effect of the Growth Rate k: Fig. 6(b) shows the performance for different growth rates k, which determine the width of the network. Increasing the width of the network enables each Conv layer to learn richer features and obtain better performance. However, the increased number of parameters also raises the possibility of overfitting. From Fig. 6(b), we can see that the OA reaches its peak at k = 22 on the PU, SA, and UH datasets. Although the OA of the KSC dataset still increases when k exceeds 22, the increase is minor. Therefore, for convenience, the growth rate is uniformly set to 22 for all the datasets.
3) Effect of the Input Patch Size: We also investigated the effect of patch size on the classification performance, which is shown in Fig. 6(c). It can be seen that the OA shows an upward trend as the patch size grows, but beyond 21 × 21 the gain is minor. Larger patches contain more spatial information, which is conducive to classification. However, when the patch size is too large, the patch may contain pixels that harm the classification of the center pixel, and the computational load also increases. Therefore, we set the patch size to 21 × 21 for all the datasets.

D. Contribution of the Self-Attention Modules
In this section, we conducted a series of tests to analyze the contribution of the proposed spatial self-attention module and spectral self-attention module. We separately tested the spatial subnetwork and spectral subnetwork with different numbers of training samples, where 50, 100, and 150 labeled samples per class were randomly selected from the PU, SA, and UH datasets, and 5%, 10%, and 15% of the samples per class were randomly selected from the KSC dataset.
1) Contribution of the Spatial Self-Attention Module: To verify the effectiveness of the proposed spatial self-attention module, we compared the classification performance of the spatial subnetwork with and without the spatial self-attention module (denoted as Spa-A and Spa, respectively). Fig. 7 shows the results for different numbers of training samples.
According to Fig. 7, it is clear that employing the spatial self-attention module consistently improves the performance with lower standard deviation. The spatial self-attention module can promote the discriminant feature learning ability of the network, especially with limited training samples. As shown in Fig. 7, the fewer the samples, the more significant the superiority of Spa-A. This is because the proposed spatial self-attention module can compensate for the limited information in a small training set by effectively capturing useful spatial information related to the pixels to be classified. These results demonstrate that the proposed spatial self-attention module can enhance the spatial feature representation of the network.
2) Contribution of the Spectral Self-Attention Module: Similarly, to demonstrate the effectiveness of the proposed spectral self-attention module, we tested and compared the spectral subnetwork with and without the spectral self-attention module (denoted as Spe-A and Spe, respectively). Experimental results are shown in Fig. 8.
As can be seen from Fig. 8, using the spectral self-attention module improves the performance remarkably. On all four datasets, Spe-A consistently outperforms Spe for different numbers of training samples. As shown in Fig. 8(c) and (d), Spe-A reaches a better OA with lower standard deviation on the UH and KSC datasets, especially with fewer training samples. These results demonstrate that the proposed spectral self-attention module greatly benefits the 1-D CNN for spectral feature extraction.

E. Comparison With State-of-the-Art Methods
To evaluate the performance of the proposed method for HSI classification, we compared it with existing state-of-the-art CNN-related methods, namely contextual CNN (CCNN) [33], SSRN [42], FDSSC [43], localized spectral features and multiscale spatial features network (LSMSC) [67], adaptive spectral-spatial multiscale network (ASSMN) [39], double-branch multiattention mechanism network (DBMA) [49], and double-branch dual-attention mechanism network (DBDA) [52]. Specifically, CCNN is a traditional spectral-spatial network, and SSRN and FDSSC are spectral-spatial networks based on ResNet and DenseNet, respectively. LSMSC and ASSMN are spectral-spatial multiscale networks, and DBMA and DBDA are spectral-spatial attention networks. All these methods were implemented using the open-source code with their optimal parameters as described in the corresponding references. Besides, for a fair comparison, all the methods were trained and tested on the same sample sets, as listed in Tables I-IV.
1) Quantitative Evaluation: Quantitative results of OA, AA, Kappa, and PA of each class are listed in Tables V-VIII. We can see that the proposed method achieved higher classification accuracy with lower standard deviation compared with other methods. Taking Table V for example, the proposed method achieved the highest accuracy of 98.86%, which exceeds CCNN, SSRN, FDSSC, LSMSC, ASSMN, DBMA, and DBDA by 7.18%, 1.44%, 0.59%, 0.86%, 2.53%, 0.81%, and 0.74%, respectively. Although the AA of SSSAN is slightly lower than that of ASSMN in Table VI, it achieved higher OA and Kappa. Besides, in Tables VII and VIII, the proposed method also achieved the highest OA, AA, and Kappa.
It can be seen from Tables VI-VIII that the accuracy of CCNN is lower than those of the other methods since it only uses a weak 2-D CNN to extract spectral and spatial features. Compared with CCNN, the accuracy of SSRN is significantly improved, due to its use of spectral and spatial residual blocks to consecutively learn spectral and spatial features. FDSSC uses densely connected structures to deeply learn features, obtaining better results than SSRN. To enhance feature learning, LSMSC fuses localized spectral features and multiscale spatial features by considering the correlations between different bands. ASSMN employs a multiscale strategy in the spectral and spatial domains simultaneously. However, FDSSC, LSMSC, and ASSMN cannot achieve good results on some datasets. For example, ASSMN achieved good performance on the SA and KSC datasets, but its accuracy is very low on the PU and UH datasets, especially on the UH dataset. DBMA and DBDA use attention mechanisms, achieving stable results on all four datasets, with DBDA performing better than DBMA. Furthermore, the proposed method consistently performs better than DBDA on all datasets because it is based on a powerful DenseNet baseline and the proposed self-attention modules. Overall, the proposed method provides better performance on all four datasets.
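For reference, the metrics reported in Tables V-VIII can be computed from a confusion matrix as in the following minimal sketch. Here PA is read as the per-class (producer's) accuracy, i.e., the fraction of each true class that is labeled correctly; this reading and the function name `classification_scores` are assumptions of the sketch.

```python
import numpy as np

def classification_scores(y_true, y_pred, num_classes):
    """Return (OA, AA, Kappa, per-class accuracy) from integer label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)      # rows: true class, columns: predicted class
    total = cm.sum()
    oa = np.trace(cm) / total               # overall accuracy
    pa = np.diag(cm) / cm.sum(axis=1)       # per-class accuracy; assumes every class occurs in y_true
    aa = pa.mean()                          # average accuracy
    # Expected agreement by chance, from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1 - pe)            # Cohen's kappa
    return oa, aa, kappa, pa
```

Because AA weights every class equally while OA weights every sample equally, a method can have a slightly lower AA and still a higher OA, as observed for SSSAN and ASSMN in Table VI.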
2) Qualitative Evaluation: The corresponding classification maps alongside false-color maps (FCMs) and GT are shown in Figs. 9-12. These maps are consistent with the quantitative results listed in Tables V-VIII. The classification maps obtained by our method have the least noise and the clearest object boundaries, and are very close to the GT maps. The proposed method can correctly label almost all classes, even some easily confused classes, such as Grapes_untrained and Vinyard_untrained in Fig. 10, which are marked with red circles.
3) Analyses of Running Time: To measure the efficiency of the proposed method, we compared it with the other seven methods in terms of training and test time on the PU, SA, UH, and KSC datasets. The results are listed in Table IX. It can be seen that the training time of the proposed method is shorter than those of SSRN, DBDA, FDSSC, ASSMN, and DBMA. A possible reason is that the proposed method uses a 1-D CNN and a 2-D CNN to learn spectral and spatial features, respectively, while the others use 3-D CNNs with a large number of parameters. In addition, the proposed method converged faster than these 3-D CNN-based methods (e.g., 25 epochs for the proposed method versus about 50-100 epochs for SSRN, DBDA, FDSSC, and DBMA). Since ASSMN uses ConvLSTM with multi-time-step calculation and multibranch architectures, it takes much longer than the other methods.
TABLE IX: TRAINING AND TESTING TIME OF DIFFERENT METHODS ON THE PU, SA, UH, AND KSC DATASETS
CCNN takes the shortest time because it is a simple 2-D CNN architecture with fewer trainable parameters. LSMSC takes the second shortest time, as it uses a band grouping strategy to reduce the computational burden. Although the running times of CCNN and LSMSC are shorter, their accuracy is lower than that of our method.
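The parameter argument above can be illustrated with a small calculation. The sketch below counts the weights of a single convolutional layer from its kernel shape and channel widths. The layer shapes used in the example (24 input and output channels, a 1-D kernel of length 7, a 3 × 3 2-D kernel, and a 3 × 3 × 7 3-D kernel) are hypothetical and are not taken from any of the compared networks.

```python
from math import prod

def conv_params(kernel_shape, in_channels, out_channels, bias=True):
    """Number of trainable parameters in one convolutional layer.

    kernel_shape is a tuple with one entry per convolved dimension,
    e.g. (7,) for 1-D, (3, 3) for 2-D, and (3, 3, 7) for 3-D.
    """
    weights = prod(kernel_shape) * in_channels * out_channels
    return weights + (out_channels if bias else 0)

# Hypothetical layer shapes, chosen only for illustration.
spectral_1d = conv_params((7,), 24, 24)         # 4056 parameters
spatial_2d = conv_params((3, 3), 24, 24)        # 5208 parameters
joint_3d = conv_params((3, 3, 7), 24, 24)       # 36312 parameters
```

Under these assumed shapes, one 1-D layer plus one 2-D layer has 9264 parameters, roughly a quarter of the single 3-D layer. The parameter count is only one factor in running time, so this should be read as a rough indication rather than an explanation of the timings in Table IX.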

V. CONCLUSION
In this article, a novel spectral-spatial self-attention CNN architecture is proposed for HSI classification. First, based on the proposed spatial self-attention module, the spatial subnetwork significantly enhances the patch-based long-range contextual information related to the center pixel while suppressing unnecessary information, improving the recognition accuracy for the center pixel. Meanwhile, based on the proposed spectral self-attention module, the spectral subnetwork extracts more discriminative spectral features by exploiting the long-range spectral correlations over local spectral features.
The weighted fusion of the extracted spectral and spatial features can further improve the classification accuracy. The proposed method is found to outperform a number of state-of-the-art methods, including CCNN, SSRN, LSMSC, ASSMN, DBMA, and DBDA. Future work includes further optimization of the network for fast parameter selection.