Fusion of PCA and Segmented-PCA Domain Multiscale 2-D-SSA for Effective Spectral-Spatial Feature Extraction and Data Classification in Hyperspectral Imagery

As hyperspectral imagery (HSI) contains rich spectral and spatial information, a novel principal component analysis (PCA) and segmented-PCA (SPCA)-based multiscale 2-D-singular spectrum analysis (2-D-SSA) fusion method is proposed for joint spectral–spatial HSI feature extraction and classification. Considering the overall spectra and adjacent band correlations of objects, the PCA and SPCA methods are utilized first for spectral dimension reduction, respectively. Then, multiscale 2-D-SSA is applied onto the SPCA dimension-reduced images to extract abundant spatial features at different scales, where PCA is applied again for dimensionality reduction. The obtained multiscale spatial features are then fused with the global spectral features derived from PCA to form multiscale spectral–spatial features (MSF-PCs). The performance of the extracted MSF-PCs is evaluated using the support vector machine (SVM) classifier. Experiments on four benchmark HSI data sets have shown that the proposed method outperforms other state-of-the-art feature extraction methods, including several deep learning approaches, when only a small number of training samples are available.

spatial information simultaneously. HSI has hundreds of contiguous spectral bands acquired from visible light to infrared, and the spectral profile of each pixel can act as a fingerprint for identification and discrimination of various materials [1]. This characteristic makes HSI great use in many areas, such as mineralogy [2], agriculture [3], target detection [4], and landcover classification [5]. However, how to efficiently classify HSI is still a major challenge.
Over the past decades, a lot of HSI classification techniques have been developed, including support vector machine (SVM) [6], multinomial logistic regression (MLR) [7], the k-nearest neighbor (KNN) [8], and sparse representation classifier (SRC) [9]. Among them, the SVM classifier has been widely used since its low generalization error rate and simple operation. However, the high spectral dimensionality of HSI usually causes the model parameters difficult to estimate when labeled samples are limited for training, which results in the "Hughes phenomenon" [10]. Meanwhile, the data redundancy and noise are also unavoidable, which brings obstacles to the classification [11]. Thus, effective spectral and spatial feature extraction is essential for HSI classification.
Specific to the abovementioned problems, some spectral feature extraction methods have been developed to reduce the spectral dimensionality. For instance, several widely used linear transformation methods include the principal component analysis (PCA) [12], the independent component analysis (ICA) [13], and linear discriminant analysis (LDA) [14], as well as some nonlinear dimension reduction methods, i.e., manifold learning [15]. PCA can transform highdimensional data into linearly uncorrelated variables, which is beneficial for data compression and visualization, as well as mining of compact effective information. There are also some improved PCA methods, including structured covariance PCA (SC-PCA) [16], segmented PCA (SPCA) [17], and folded-PCA (FPCA) [18], which not only reduces computational load and memory requirements but also incorporates local spectral characteristics.
The utilization of spatial context information has a potential improvement in classification accuracy. One popular approach for spatial feature extraction is based on the morphological profile (MP) [19], which can extract spatial geometrical details by opening and closing morphological transformation approaches with a series of structuring elements (SEs). Several extension methods of MP, including morphological attribute profile (MAP) [20], extended MP (EMP) [21], and the extended MAP (EMAP) [22], are developed to extract different kinds of spatial structural information. In addition, another interesting approach, i.e., the 2-D-singular spectrum analysis (2-D-SSA), has been successfully applied for spatial feature extraction of images [23], [24]. Similar to its 1-D case (SSA), based on the singular value decomposition (SVD), it decomposes the given signal into a few components and reconstitutes an approximate description of the original signal using several components [25]. 2-D-SSA can extract the different spatial structural content and fully explore the global spatial correlation of each HSI image, while eliminates the image noise. However, it has several drawbacks, such as its low utilization of spectral features and the undersmoothed or oversmoothed results of land covers due to the fixed embedding window, which limits the popularization of 2-D-SSA. The integration solutions of 2-D-SSA may effectively solve these issues.
Recently, a number of joint spectral-spatial feature extraction methods are preferred for HSI classification, which can further improve the classification performance [26]. These methods can be categorized into two main groups: shallow structure models and deep learning models. The shallow structure models mainly use one or more simple feature extraction techniques to jointly extract spectral feature and singlescale or multiscale spatial features, such as spectral-spatial Gabor surface feature (GSF) fusion [27], intrinsic image decomposition (IID) [28], and SPCA-based Gaussian pyramid features (SPCA-GPs) [29]. Duan et al. [30] proposed a novel multiscale total variation (MSTV) method to make full use of multiscale spatial information and spectral features while reducing the influence of textures and noise. Hong et al. [31] apply a sequence of well-designed attribute filters to extract invariant features locally from HSI in both spatial and frequency domains, and these features are robust to local pixel or material changes. In [32], a global spatial and local spectral similarity-based manifold learning group sparse representation (GSLS) is proposed to make full use of the intrinsic features of HSI, achieving good classification results.
Deep learning models, which have been proven to be very useful for HSI feature extraction, can hierarchically extract the abstract and discriminative features using multiple nonlinear networks [5]. Various deep learning networks have been designed, such as the stacked autoencoder [33], deep belief network (DBN) [34], and convolutional neural network (CNN) [35]. Due to the unique characteristics (e.g., local connections and shared weights) [36], CNN-based methods have been widely used for HSI spectral-spatial classification, such as 3-D CNN [37], ResNet [38], and its extensions spectral-spatial residual network (SSRN) [39] and multipath residual network (MPRN) [40]. Song et al. [41] proposed a novel deep feature fusion network (DFFN); combining with residual learning, it extracts features of different scales that can provide complementary yet correlated information for classification. Mu et al. [42] proposed a multiscale and multilevel spectral-spatial feature fusion network (MSSN) to fully integrate spatial and spectral features in different scales, resulting in superior classification performance.
Although deep learning-based methods could exploit highlevel features and have made remarkable progress in HSI classification, there are several drawbacks in HSI applications. One of the key challenges with deep modeling is the lack of insufficient training samples. The deep network structure used for feature extraction and classification is a time-consuming and error-prone process with a huge number of hyperparameters (e.g., learning rate and kernel sizes of convolutional filters). Furthermore, the extracted deep features can hardly be linked to any semantic meanings compared with handcrafted features. In other words, these deep features form a latent representation of the high-dimensional data, and thus, to some degree, they are human inexplicable. There is still a long way to go to study the interpretability of the deep learning model.
No matter shallow structured models or deep learning models, the extraction of the representative features of HSI is the key to improve the classification performance. Accordingly, several factors have been widely used in existing work for the extraction of representative features, including the joint utilization of spectral and spatial information, multiscale feature extraction, and the combination of the global and local features. In addition, considering the problems in the deep models, shallow models are still desirable because of their simple structures and fewer parameters. To this end, we aim to integrate several simple techniques for efficient and effective feature extraction and data classification of the HSI.
In this article, a fusion framework, PCA and SPCA domain multiscale 2-D-SSA, is proposed for this purpose. PCA and SPCA are used to extract overall and adjacent spectral correlation characteristics of objects, respectively, and dimension reduction. Besides, 2-D-SSA uses several embedding windows of different size to extract abundant multiscale spatial trend features by applying on the SPCA dimension-reduced images. Postprocessing via PCA is applied to reduce the dimension and enhance the differences of spectral pixels. The multiscale 2-D-SSA spatial features, as well as overall spectral features obtained by PCA, are fused as final spectral-spatial features (called MSF-PCs), as the input of SVM classifier. The major contributions of this method can be concluded as follows.
1) By combining PCA, SPCA, and 2-D-SSA, a novel approach is proposed to extract more concentrated spectral-spatial features of the HSI. 2) With the multiscale 2-D-SSA, spatial characteristics of complex scenes at different scales can be extracted for more accurate classification.
3) The global and local features of HSI are taken into account simultaneously to explore the interconnections between pixels or objects, which features multiscale 2-D-SSA for extracting spatial features and PCA/SPCA for spectral feature extraction. 4) The proposed method provides superior classification performance to several state-of-the-art methods, including CNN-based approaches, even with a small number of training samples. The remainder of this article is organized as follows. The principles of PCA, SPCA, and 2-D-SSA are briefly reviewed in Section II. Section III describes our proposed method. Experimental setup, results, and analysis of the four HSI data sets are given in Section IV and V, respectively. Finally, some concluding remarks are drawn in Section VI.
II. RELATED WORKS As the three main techniques adopted in our approach, PCA and its expanded version SPCA, as well as 2-D-SSA, are briefly introduced in the following two sections.

A. PCA and SPCA
As a standard linear transformation method, PCA has been widely used for data dimension reduction in HSI. The main processing of PCA is to construct a linear transformation matrix to preserve most spectral features of the HSI in the principal components (PCs) of fewer dimensions.
Given an HSI data H ∈ B×N with N pixels and B bands, though mean-adjusted, this matrix is first used for the covariance matrix C calculation, which is defined as whereH is the mean-adjusted matrix H . The eigendecomposition is applied for the covariance matrix C to obtain the first k (k < B) eigenvectors corresponding to bigger eigenvalues. These eigenvectors constitute the linear transformation matrix M with the size of B × k. This linear transformation can be written as where Y is the dimension-reduced matrix after PCA. Unlike conventional PCA, segmented-PCA (SPCA) is proposed to consider the local useful spectral characteristics of the HSI. In SPCA, the original HSI data are first partitioned into K subgroups, which consists of highly correlated bands. PCA is conducted separately on each band subgroup. By selecting several main components from each subgroup, it can avoid the low correlations between the highly correlated blocks and reduce the dimension while essentially preserve the important information in each subgroup of bands. The components selected can be regrouped and transformed again to compress the data further. Further details can be found in [17].

B. 2-D-Singular Spectrum Analysis
As a model-free technique to analyze 1-D time series, singular spectrum analysis (SSA) can extract different components of given signals, including trend, oscillations, and noise, leading to smoothing and denoising results. Its 2-D extension version, 2-D-SSA, is first presented in [43], which is believed to have similar capabilities with SSA. In 2010, Golyandina and Usevich [44] enriched the algorithm and theory of 2-D-SSA and popularized its application, especially in images.
Given a grayscale with a size h × w, define the embedding window of size u × v (with 1 ≤ u ≤ h and 1 ≤ v ≤ w), which moves from the top left to the bottom right of the image to construct the trajectory matrix T . The pixels in the window are expanded and joined as a vector V ∈ uv×1 and as a column in the trajectory matrix, which is shown as follows: Note that the trajectory matrix T has a structure called HbH, i.e., Hankel by Hankel. Then, the eigenvalues of TT T and their corresponding eigenvectors are denoted as (λ 1 ≥ λ 2 ≥ . . . ≥ λ L ) and U 1 , U 2 , . . . ,U L ), respectively. The trajectory matrix can be written as follows: where U i and V i are the empirical orthogonal functions and the PCs of the trajectory matrix, respectively. Afterward, we select T 1 as an approximation to T , mainly because it contains the most important spatial information that benefits for classification [23]. Finally, the matrix T 1 is converted to a new image of size h × w again, as the reconstructed image, by a two-step diagonal averaging process in the matrix antidiagonals in both each block and between blocks.
2-D-SSA has several advantages. First, it simultaneously considers the local and global spatial relevant information on remote sensing images. Different Hankel blocks involve horizontal or vertical spatial information of images, while lagged vectors contain local spatial characteristics of ground objects. This local information is put into the HbH matrix structure together to preserve the global spatial correlation. In addition, 2-D-SSA can effectively reduce the influence of serious image noise, and its reconstructed images by the key component have shown the great noise robustness both in HSI and non-HSI images [23], [45].

III. PROPOSED METHODOLOGY
The flowchart of our proposed spectral-spatial HSI classification method is shown in Fig. 1, which consists of three following steps. First, PCA and SPCA are applied to extract different spectral correlation characteristics and reduce the dimension of HSI, respectively. Then, the dimension-reduced images obtained by SPCA are used to extract and excavate spatial features via multiscale 2-D-SSA. Finally, the multiscale spatial features, as well as spectral features obtained by PCA, are fused to MSF-PCs features, and the SVM classifier is adopted to evaluate its performance.

A. Spectral Feature Extraction
In HSI, the object identification information contained in the spectral profile is beneficial for classification. However, the narrow hyperspectral bands and wide sensitization range have caused a huge dimension, which brings higher computational complexity. In order to reduce the feature dimension and preserve the expression ability of the feature, PCA is employed to extract global spectral features. The spectral feature constituted by the former PCs after PCA can be fused with the spatial features for classification.
However, the global spectral feature obtained by PCA ignores the diversity of spectral profiles in a different wavelength range. In other words, PCA can often fail to extract the local useful characteristics of the HSI. Using a unified projection in dimension reduction for an entire spectral vector is inappropriate. In addition, there are also highly relevant characteristics in adjacent spectral bands, which has brought information redundancy. Therefore, SPCA is employed to solve the abovementioned problems and extract local spectral features. In SPCA, highly correlated bands are selected as the subgroups based on pairwise separability measures, such as the Bhattacharyya distance, while this type of band selection usually requires a time-consuming clustering process. Based on the fact that the correlations between neighboring spectral bands are generally higher than for bands further apart [17], we use a simple averaging method to select subsets.
Denote X ∈ W ×H ×B as an HSI data cube, where W and H represent the rows and columns of data, the B bands are first partitioned equally into K subsets, and each subset X sub is given as follows: where [B/K ] represents the value of the smallest integer greater than B/K . X i is the i th band of the HSI data. Then, PCA is employed to each adjacent subset, and the first PCs of each band subsets, which contains the main information of local adjacent bands, are stacked together as local spectral features. SPCA concentrates several adjacent spectral contents and preserves the diversity of spectral features, which is beneficial to the subsequent spatial feature mining. Besides, it also reduces the data dimension and eliminates spectral redundancy, which can significantly improve the efficiency of the following-on analysis.
For convenience, the spectral features by PCA and SPCA are denoted as (7) where P(P B) and K (K B) represent the number of dimension after PCA and SPCA, respectively.

B. 2-D-SSA-Based Multiscale Spatial Feature Extraction
After SPCA, the retained K images have more concentrated spectral diversity characteristics and lower feature dimensions compared with raw HSI. 2-D-SSA with different sizes of embedding windows [23] is applied to Y SPCA , which can extract more spatial features than raw single-band image, as well as spend less computing time.
The 2-D-SSA with embedding window size u l ×v l performs on Y SPCA to obtain the lth scale spatial features (SF l ), which is calculated by where n is the number of total scales for different embedding windows. The size of SF l is exactly the same as Y SPCA . The multiscale 2-D-SSA can capture the multiscale spatial information of the hyperspectral image effectively. Fig. 2 shows the false-color image constituted by the first three PCs in each 2-D-SSA scale. It can be seen the diversity of spatial features and the discrimination information in different extraction scales. The spatial difference information is stacked to multiscale spatial features, which can supplement the feature information of different land covers.
However, the dimension of stacked features from different scales will have an inevitable increase in the computational cost, as well as redundant information. In addition, the resulted images obtained by different 2-D-SSA scales are continuous smooth, and some texture information is removed, which has a negative effect on the classification performance because the differences of pixels belonging to different classes are reduced distinctly. Therefore, PCA is applied to featured images at each scale again to solve the problem presented earlier. Let MSF l represents the first L components after applying PCA on the lth scale feature image SF l as follows: where L (L ≤ K ) represents the number of dimensions after PCA. The final multiscale spatial features can be stacked as Compared with SF l , the MSF l feature further highlights the local scene differences in different scales because of the ability of PCA in emphasizing the spectral differences among pixels, which is beneficial for classification. The final stacked features can further enrich the spatial features of ground objects.

C. Feature Fusion and Classification
In this section, the obtained multiscale spatial features MSF, as well as PCA-based spectral features Y PCA , are fused to MSF-PCs features, which can be represented as For any pixel vector in MSF-PCs, it not only has the main spectral feature of raw spectral profile but also the multiscale concentrated features containing neighborhood spatial and adjacent band information. Multiscale concentrated features can reduce the misclassification caused by similar features on a single scale and improve the identification ability to determine the corresponding label.
In addition, selecting a suitable classifier is crucial for the performance assessment of the abovementioned features, especially with a small training sample size in hyperspectral images. Among the various pixelwise classifiers, SVM is a widely used supervised statistical learning framework. By using a kernel function, it can map the data to the higher dimensional space via a nonlinear transformation, aiming to find the optimal hyperplane to separates samples belonging to different classes. Because of robustness to the variation of data dimensions, SVM has shown an outstanding performance in the classification of HSI [46]- [48]. Therefore, we use the fused feature vectors as input to the SVM classifier to evaluate its effectiveness for classification.
IV. EXPERIMENTAL SETUP In this section, we describe the details of three benchmark HSI data sets and experimental setup. Furthermore, the parameters' sensitivity in the proposed method is analyzed.
1) Indian Pines: This data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer sensor that has a spectral coverage ranging from 0.4 to 2.5 μm. It covers the Indian Pines study site in Northwest Indiana, USA. This data set has a spatial size of 145 × 145 pixels with a spatial resolution of 20 m per pixel and 220 spectral bands. In our experiment, the number of bands is reduced to 200 by removing 20 water absorption bands (104-108, 150-163, and 220). The color composite image and the ground-truth map, which contains 16 land cover classes, are shown in Fig. 6(a) and (b).
2) Kennedy Space Center: This data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer sensor over the KSC, Florida, on March 23, 1996. This data set contains 512 × 614 pixels with a spatial resolution of 18 m per pixel and 224 spectral bands in the wavelength range of 0.4-2.5 μm. After removing water absorption and low SNR bands, 176 bands were used for the analysis. The color composite image and its ground-truth image, which contains 13 land cover classes, are presented in Fig. 7(a) and (b).
3) Pavia Centre: This data set was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS-03) sensor during a flight campaign over Pavia, Italy, with a spectral coverage ranging from 0.43 to 0.86 μm. This data set contains 103 bands of size 1096 × 1096 pixels with a spatial resolution of 1.3 m per pixel. In our experiment, a subscene of 510 × 490, including all nine land cover classes, is employed; its color composite image and ground-truth image are shown in Fig. 8(a) and (b).

4) DFC2018:
This data set was distributed by the 2018 IEEE GRSS Data Fusion Contest (DFC). It was acquired by the IRTES CASI-1500 sensor at a GSD of 1 m over the campus of the University of Houston and its surrounding areas, in Houston, TX, USA. It contains 601 × 2384 pixels with 50 spectral bands sampling the wavelength of between 380 and 1050 nm at intervals of 10 nm. Its color composite image and ground-truth image, which contains 20 land cover classes, are shown in Fig. 8(a) and (b).

B. Parameters Sensitivity Analysis
In order to extract abundant spatial features, the scales of 2-D-SSA are important. As suggested experience in [23]   and [24] and considering the fact that larger embedding window causes the computation time to grow exponentially, the 2-D-SSA scale sets u l × v l = {5 × 5, 10 × 10, 20 × 20, 30 × 30, 40 × 40} are adopted for the first three data sets, while u l × v l = {3 × 3, 5 × 5, 10 × 10, 20 × 20, 30 × 30} for the DFC2018 data set. The reason is that the large image with a big embedding window will cause the trajectory matrix to be huge and exceed the memory limit, which is also the defect of 2-D-SSA itself. In this experiment, the classification results of raw data, SF l features in five scales above, and stacked SF l features are compared by the SVM classifier, which is given in Table I. As we can see, the single scale of 2-D-SSA has an improvement in classification to some degree compared to the raw HSI data set. The combination of different scales can further improve the accuracy by about 2%-13%, achieving the best classification performance. The stacked spatial features could fully characterize different land-cover features, which demonstrates the advantages of multiscale analysis. For effective dimension reduction, the combination of the number of subsets K and the number of reduced-dimension L is critical, where L < K . In this section, another set of experiments is designed to select the optimal K and L values on the four data sets. Fig. 3 shows the obtained overall accuracy (OA) of classification with different parameters. According to the experience in [29] and our experiments, the parameters K and L are set to vary in a range from 5 to 15. The training samples are selected randomly, and the numbers used for training are 2%, 2%, 0.2%, and 0.1% of the four data sets, respectively. As can be seen in Fig. 3, the superior accuracy (highlighted in red) is mainly concentrated in the second half of the K and the middle part of the L value on all data sets. With the increase in K , more subsets containing local spectral features can be retained, while the OA is greatly improved. The classification accuracy is optimal at K = 13 or K = 14 on average, where a larger K will bring a significant increase in running time but limited gain in OA. In addition, preserving most of the subsets, i.e., L is slightly lower than K , usually achieves the best OA except on the PaviaC data set. On the PaviaC data set, a small L usually has better OA, mainly because it concentrates the main spectral discrimination features of the data, removing redundant and useless information. In order to obtain satisfactory classification results and ensure processing efficiency, the parameters K and L are fixed as 11 and 9, respectively, on all the four data sets in the following experiments.
In addition, according to the abovementioned optimal dimension-reduced parameters, we further evaluate the effectiveness of the SPCA in (7) and PCA in (9) (see Table II). In the spectral dimension reduction step in (7), three methods, including PCA, dominant set extraction-based band selector (DSEBS) [49], and FPCA [18], are adopted, in which the number of bands in the dimension-reduced data is fixed as K = 11 for a fair comparison. In the feature dimension reduction step in (9), another three methods, LDA [14], locally linear embedding (LLE) [50], and ICA [13], are used for comparison with PCA, and their dimension-reduced bands are all L = 7. As shown in Table II, the combination of SPCA and PCA achieves the highest accuracies, which demonstrates that, first, SPCA is effective for the spectral dimension reduction than others, mainly because it concentrates the spectral characteristics of adjacent bands; then, PCA not only further reduces the dimension of spatial features but also highlights the distinction of pixels.
Finally, the parameter P determines the number of PCs in spectral features obtained by PCA. An appropriate number of PCs can provide useful spectral characteristics while avoiding abundant information. Noisy information and high dimension problem would not be introduced into the finally derived spectral-spatial features. The parameter P is fixed as 3 in

C. Experimental Setup
In order to evaluate the performance of our proposed method, seven state-of-the-art HSI feature extraction approaches are adopted as compared methods. These include IID [28], robust local manifold representation (RLMR) [15], SPCA-GPs [29], MSTV [30], local adaptive joint sparse representation (LAJSR) [51], invariant attribute profiles (IAPs) [31], and GSLS [32] while the raw HSI data set as a baseline method (abbreviated as "Raw"). Note that these methods use the optimal parameters according to different data sets. For our proposed method, the parameters are the same as described in Section IV-B. The optimal parameters of all involved methods are summarized in Table III. The SVM classifier is used in the stage of classification, which is implemented by the LIBSVM library [52] using an RBF kernel with fivefold cross validation.
For avoiding systematic errors and reducing random discrepancies, all experiments were independently carried out ten times. The training and testing samples sets were sampling randomly without any overlapping each time, using an equal sample rate or sample quantity for each class. In addition, different numbers of training samples are used in experiments to fully investigate the performance of the involved methods. In experiments, we mainly concern the case of small sample size, and the number of selected training samples varies within {1%, 1.5%, 2%, 2.5%, 3%} for Indian Pines, {3, 5, 7, 9, 11, 13} for both KSC and PaviaC data sets per class, and {0.05%, 0.1%, 0.15%, 0.2%, 0.25%, 0.3%} for the DFC2018 data set. In addition, five objective quality indexes, i.e., the OA, the average accuracy (AA), the kappa coefficient, class-by-class accuracy, and running time, are utilized to evaluate the performance of the image classification. All the experiments are performed using MATLAB 2017a on a computer with a 3.5-GHz CPU and 8-GB memory.
V. EXPERIMENTAL RESULTS AND ANALYSIS In this section, the effectiveness of the fused features is validated using four HSI data sets discussed in Section IV, in a number of experiments, presenting results that demonstrate the related benefits in HSI classification.

A. Classification Results and Analysis
In this section, the qualitative and quantitative results of the proposed method and the other eight compared methods are presented, using four different HSI data sets. The details are given as follows.
First, the classification results with a different number of training samples on three data sets are compared and shown in Fig. 4. It can be seen that the increase in the labeled sample amount for training has a positive promotion on the classification performance. In almost all cases, our proposed method has achieved the highest classification accuracies even in the small number of samples, demonstrating the robustness of the proposed method.
In order to quantitatively evaluate the superiority of the proposed method, the classification results from the four data sets are given in Tables IV-VII. In addition, the classification maps yielded by all compared methods and the proposed method on four data sets are shown in Figs. 5-8, respectively. 1) Indian Pines: As shown in Table IV, the proposed method obtains the highest values in terms of three metrics and most of the classes, including 100% accuracy in two classes and over 89% accuracy in all classes. Compared with the raw HSI data, the proposed MSF-PCs features can significantly improve the OA from 64.86% to 96.11%, which is about 3% higher than the other two multiscale feature extraction methods: SPCA-GPs and MSTV. The results also demonstrate our features are superior to the intrinsic image features in IID and spatial invariant features in IAPs. Similar to GSLS, our method also considers the global and local spectral and spatial information in HSI, while our method obtains better classification results in almost all classes.
As shown in Fig. 5, the proposed method shows satisfactory performances from classification maps. The raw data plus SVM method appear obvious misclassification inside the land covers, which is manifested as classification noise. The RLMR and LAJSR methods reduce this spot-like misclassification by using neighborhood spatial information while still exist small lump or strip misclassification. These misclassifications are also exist in the IID method due to its single spatial feature. In contrast, the multiscale methods, including SPCA-GPs, MSTV, and IAPs, can effectively reduce those misclassifications, but they still suffer from incorrectly distinguishing the boundaries of land covers. As can be seen, our method has effectively solved the abovementioned problems because  of the reasonable utilization of spectral and multiscale spatial features.
2) KSC and PaviaC: For the rest two data sets, there are some small and irregular land covers, such as trees, marshes, and roads, which bring difficulties for classification. From Tables V and VI, it can be seen that the classification accuracy of our method is still the highest among all compared methods in terms of OA, AA, and the kappa coefficient, with decent processing efficiency. The OA of our method improves about 14% and 13% compared with the baseline method on KSC and PaviaC, respectively. All the compared methods have different degrees of improvement on accuracies than baseline. However, the IAPs method has difficulty classifying hommock, oak, and slash pine, leading to lower classification accuracy than raw data.
Figs. 6 and 7 show the classification maps yielded by the compared methods and our method, in which our proposed method has the best performance on both two data sets. IID cannot distinguish meadows and trees well. The RLMR and LAJSR methods have spot and blocking artifacts in the resulting maps because of the limited utilization of spatial information. MSTV ignores some small or detailed land covers, resulting in a too smooth classification map. IAPs yielded  poor classification results containing irregular blocking artifacts on KSC, mainly because its extracted semantic features from complex land covers cannot correctly identify by the simple spectral-based classifier. GSLS does not distinguish well between oak and slash pine. In addition, most of these methods have edges misclassification problems between land covers. In contrast, the proposed method not only effectively eliminates the misclassification within land covers but also successfully preserves the boundary of different areas.
3) DFC2018: The quantitative results and corresponding classification maps of different compared methods on the DFC2018 data set are given in Table VII and Fig. 8, respectively. As observed in Table VII, the same conclusions can be drawn that the proposed method has good robustness and achieves the highest classification accuracies in challenging scenes. More specifically, the baseline method only holds a 69.93% result with the raw data due to the lack of spatial information. On the basis of spectral features, the RLMR and LAJSR combine local spatial similarity information and improve the OA to 71.40% and 73.69%, respectively. However, considering the local and ignoring global spatial characteristics can only bring finite accuracy improvement while increasing the processing time. IID fully excavates the intrinsic spatial features of complex land covers but takes an enormous amount of processing time. The accuracy of SPCA-GPs on this data is poor since it does not consider the global spectral and local spatial characteristics. The GSLS method takes into account the features of SPCA-GPs that are ignored and achieves 77.95% accuracy, but it is still limited by processing efficiency. MSTV utilizes multiscale spatial features, as well as global and local spectral features, and obtains good classification results (80.42%), but it is still lower than our method (81.41%). Fig. 8 highlights the superiority of the proposed method by means of classification maps. Generally speaking, our method effectively eliminates the effects of salt-and-pepper and blocking noise from classification maps and preserves the shape and semantic structure of different objects, especially In terms of processing efficiency, our method is not the most efficient but still faster than most of the compared methods. The efficiency of our method is only lower than SPCA-GPS and IAPs on three data sets, while, on the KSC data set, the time cost of RLMR and LAJSR is less than our method because they only execute on the labeled pixels, while the latter processes all pixels in HSI. In addition, the computational complexity of the proposed method is also analyzed, and the details are given in Section S.IV in the Supplementary Material.

B. Evaluation With Different Classifiers
Although the SVM classifier is utilized in the proposed method, the relatively weak classifier can verify the efficacy of the extracted features as hidden in the results [23]. To this end, another three classifiers, KNN [8], MLR [7], and SRC [9], are adopted to verify the efficacy of MSF-PCs features from the proposed model. With default parameter settings as suggested in [7]- [9], the classification results of these three classifiers and SVM performed on the raw data and MSF-PCs features are produced, which are shown in Table VIII for comparison.
As seen in Table VIII, it is obvious that the classification results obtained by SVM are superior to those obtained by the other three classifiers, and the extracted MSF-PCs features achieve a significant improvement in the classification accuracy than the raw data of HSI under all four classifiers. Take the kappa metric as an example, under the SVM classifier, the accuracies of extracted features are improved by 0.36 for Indian Pines, 0.16 for KSC, 0.13 for PaviaC, and 0.18 for DFC2018. While the accuracy improvement are 0.4, 0.27, 0.12, and 0.21 under the KNN, 0.2, 0.39, 0.24, and 0.09 under the MLR, and 0.24, 0.17, 0.07, and 0.07 under the SRC, for the four data sets, respectively. It can be seen that, under weak classifiers, the improvement of kappa from raw data to the MSF-PCs features is higher than the SVM classifier on some data sets. As the generalization capacity of the weak classifier is limited, the significant accuracy improvement can be mainly attributed to the superior discrimination ability of the MSF-PCs features. These results again highlighted the effectiveness of the extracted features for classification.

C. Comparison With Deep Learning Approaches
In this section, several CNN-based spectral-spatial feature extraction methods are chosen for comparison, which includes 3-D-CNN [37], ResNet [38], SSRN [39], MPRN [40], DFFN [41], and MSSN [42]. This experiment is carried out on the Indian Pines, where 10% of samples per class are selected for full training of the deep network. The classification accuracies of all methods are listed in Table IX. As we can see, compared with conventional 3-D-CNN and ResNet that simply use limited spectral and spatial information, DFFN and MSSN further utilize multiscale and multilevel feature extraction layers, obtaining higher classification accuracies. Our proposed method excavates the multiscale features of HSI with a shallow structure, and it has achieved the best results in terms of three metrics. It can be concluded that the MSF-PCs features using fewer parameters are comparable and superior to those mined by deep networks. Section S.II in the Supplementary Material gives more details.

D. Classification With Spatially Disjoint Samples
Considering the fact that some works [53]- [55] have indicated that traditional random sampling strategy usually leads to an improper assessment of different methods, in this section, we use spatially disjoint samples to acquire more realistic classification results and more accurate assessment of the models. The classification results on two sampling strategies of all compared methods in Section IV-C are summarized in Table X. As can be seen, the proposed method still achieves superior accuracies with region samples, which is especially significant in small training samples. Besides, the accuracy differences between random sampling and region sampling of our method are relatively small among all spectral-spatial

E. Discussion
In HSI, the multiscale spectral-spatial features' extraction of complex scenes can improve the feature representation capability, which affects the classification performance. Traditional methods only extract the superficial features of HSI, which constrains the classification performance. In contrast, our proposed method fully mines representative features in multi scales of various land covers.
First, 2-D-SSA extracts spatial trend features, which considers the correlation information of local and global spatial contexts. Multiscale 2-D-SSA features provide complementary yet correlated information for classification. In addition, combining SPCA and PCA postprocessing, the spatial features are further merged on the concentrated spectral features. The local spectral-spatial and global spatial features at different scales are obtained to characterize the land-cover differences at different scales. Finally, the main spectral discrimination feature obtained by PCA further improves the feature classification performance.
The fused MSF-PCs feature contains abundant spectral-spatial information and mines the intrinsic correlation of the HSI data. It can be viewed as the representative features in HSI because of its excellent feature representation capability.

VI. CONCLUSION
In this article, a novel multiscale feature fusion method based on PCA and 2-D-SSA is proposed to extract abundant spectral-spatial features from HSI for data classification. SPCA and multiscale 2-D-SSA are integrated to extract local spectral-spatial features, where multiscale 2-D-SSA helps to derive spatial features at different scales. These features are further fused with the global spectral discrimination features obtained by PCA, aiming for improved classification performance.
Compared with existing spectral-spatial methods, results on three real HSI data sets have demonstrated that the proposed method provides better performance when only a small number of training samples are available. It shows that the extracted abundant features are effective for distinguishing different land-cover classes. Furthermore, compared with state-of-the-art CNN-based deep learning approaches, the proposed method can still achieve higher classification accuracy, which has validated again its superiority in feature extraction.
Although multiscale 2-D-SSA has improved the classification accuracy, a large number of embedded windows have inevitably increased the computational complexity. In future work, superpixel-alike segmentation may be applied to further improve the efficiency of the proposed approach.