Multiscale Superpixelwise Prophet Model for Noise-Robust Feature Extraction in Hyperspectral Images

Despite various approaches proposed to smooth hyperspectral images (HSIs) before feature extraction, the efficacy is still affected by noise, even on the corrected dataset with the noisy and water absorption bands discarded. In this study, a novel spectral–spatial feature mining framework, the multiscale superpixelwise prophet model (MSPM), is proposed for noise-robust feature extraction and effective classification of the HSI. The prophet model is highly noise-robust in mining complex structured features, thus enlarging interclass diversity and improving intraclass similarity. First, a superpixelwise segmentation is produced from the first three principal components of an HSI to group pixels into regions with adaptively determined sizes and shapes. A multiscale prophet model is then utilized to extract multiscale informative trend components from the average spectrum of each superpixel. Taking the multiscale trend signal as the input feature, the HSI data are classified superpixelwisely, which is further refined by a majority vote-based decision fusion. Comprehensive experiments on three publicly available datasets have fully validated the efficacy and robustness of our MSPM model when benchmarked against 11 state-of-the-art algorithms, including six spectral–spatial methods and five deep learning ones. Besides, MSPM also shows superiority under limited training samples, owing to the combined strategies of superpixelwise fusion and multiscale fusion. Our model provides a useful solution for noise-robust feature extraction, as it achieves superior HSI classification even on the uncorrected dataset, without prefiltering the water absorption and noisy bands.


I. INTRODUCTION
Nowadays, with enhanced sensor and communication techniques, increasing spectral ranges and numbers of bands can be obtained in hyperspectral images (HSIs) [1], [2]. The rich spectral information in an HSI allows for more accurate discrimination of materials with subtle spectral differences than panchromatic and multispectral images [3]. As a result, the HSI has been used in a wide range of applications, such as salient object detection [4], peated barley malt analysis [5], beef quality evaluation [6], corneal epithelium injuries classification [7], and fingerprint anthocyanins extraction [8].
Pixel-level mapping is widely used for classification of the HSI [9], where a certain number of pixels in each class must be labeled in advance to train a supervised classifier [10], such as the popularly used support vector machine (SVM) [11]. Due to the limitations of the sensors, environmental factors, and even atmospheric effects, there inevitably exist bands with a low signal-to-noise ratio (SNR) [12]. Even after removal of these water absorption bands and noisy bands, the remaining bands still contain severe noise [13]. This unavoidably degrades the subsequent feature extraction and classification, especially under limited training samples and severely unbalanced distributions across classes.
To tackle these challenges, the feature mining of an HSI has gained increasing attention recently in both the spectral and spatial domains [1]. Typical techniques for spectral feature mining include the principal component analysis (PCA) [14], nonnegative matrix factorization (NMF) [15], empirical mode decomposition (EMD) [16], and singular spectrum analysis (SSA) [13], which have shown good performance in preserving the dominant spectral information while removing the noise. However, most of these techniques only utilize the spectral information, which fails to deal with the objects with low interclass heterogeneity and low intraclass homogeneity [17], [18].
There are also many spatial feature mining methods, such as the Gabor filter [19], morphological profile [20], sparse representation [21], 2-D-EMD [16], and 2-D-SSA [3]. In addition, a supervised spectral-spatial classification approach was proposed in [10], which features a spectral data fidelity term derived from the sparse multinomial logistic regression (SMLR) and a spatially adaptive Markov random field (MRF) prior model constructed via spatially adaptive total variation regularization (denoted as SMLR-SpATV). This method can spatially smooth the images and achieve robust classification results even with very limited training samples.
Besides, deep learning methods, which can extract latent structure features in spatial and spectral domains, have recently drawn increasing attention in classifying HSIs. Typical methods include the diverse region-based convolutional neural network (DR-CNN) [22], the spectral-spatial cascaded recurrent neural network (SSCasRNN) [23], the Gaussian-Bernoulli restricted Boltzmann machine (GBRBM) in parallel [24], the fully convolutional segmentation network (FCSN) [2], and the segmented autoencoders using only the spectral features [25]. However, their results depend heavily on the network structure and the training strategy, where the results can be quite poor when there are insufficient training samples [26].
In most of the aforementioned approaches, spatial-spectral classification produces much improved results, where often the spatial region adopted has a fixed size and shape. This actually differs from real cases where the regions of interest are usually irregularly shaped and inconsistently sized [20]. Therefore, the spatial regions used for feature mining should be adaptive to the specific structures of the image. To this end, the superpixel has been employed to adaptively extract the spatial features for the object-based HSI classification [9], [20], [26]. The superpixels in an image are local homogeneous regions with different sizes and shapes, where smooth areas tend to form larger superpixels, while superpixels from heterogeneous areas tend to be much smaller. In fact, superpixel is found effective to represent the spatial-contextual information and improve the classification [12]. In [9], multiscale superpixel (MSP) segmentation and subspace-based SVM, SVMsub, are combined for spectral-spatial classification of HSIs. By using the simple linear iterative clustering (SLIC) for extracting superpixels, multiscale spatial features are combined and fused for improved classification of HSIs. In [26], the superpixel was combined with the conventional SSA and 2-D SSA for the HSI classification, where the local spatial features are adaptively extracted based on the size of the extracted superpixels.
However, most of these spectral/spatial feature mining techniques fail to deal with the noisy bands, as they are often tested on the corrected datasets with the water absorption and noisy bands removed. In practical applications, this causes information loss and an extra burden of filtering these noise-affected bands. To tackle the noisy bands rather than simply removing them, a few advanced denoising approaches have been developed in recent years, such as low-rank representation (LRR) and superpixel segmentation-based denoising (denoted as SS-LRR) [12]. In [27], a low-rank matrix approximation and an iterative regularization framework are proposed to denoise the low-SNR bands while preserving the high-SNR bands. In [28], the subspace LRR was used to decompose an HSI into two low-rank submatrices, using superpixel-based segmentation to exploit the low rank in local spatial regions.
However, these approaches focus more on denoising than on extracting discriminative features for data classification in an HSI. In [29], we proposed a multiscale 2-D-SSA with PCA (2-D-MSSP). By fusing the spatial features of 2-D-SSA at different scales, it shows promising performance in noise-robust feature extraction even on the uncorrected HSI datasets.
As a forecasting tool recently developed by Facebook, the Prophet model can deeply analyze data by decomposing it into different components while well preserving the trend component, the most informative part of the signal [30]. The derived trend component is jointly decided by the information of the former data in the series, making it robust to missing or highly noisy data. An HSI also contains sequential data, though in the spectral domain [28], with highly nonlinear scattering noise and even missing data, which motivated us to apply the Prophet model to analyzing the HSI.
In this article, a novel multiscale superpixelwise prophet model (MSPM) is proposed for noise-robust feature extraction under small training size. With the exceptional capability of the Prophet model in noise-robust trend extraction and data prediction for enhancing features with improved interclass diversity and intraclass similarity, we aim to realize improved classification even on the uncorrected HSIs. The major contributions of this article can be summarized as follows.
1) The proposed MSPM model can simultaneously denoise the data and extract noise-robust features of an HSI, even from the uncorrected dataset without removing or denoising the water absorption bands and severely noisy bands.

2) It is the first time the Prophet model is applied to HSIs, where the proposed MSPM can adaptively extract the trend components at different scales based on the characteristics of the input data. The fusion of multiscale trend components can extract effective features under unknown noise levels, which benefits HSI classification.

3) With both multiscale superpixelwise spatial-spectral feature-level fusion and decision-level fusion, the proposed framework is found to be effective for HSI classification even under limited training samples, outperforming a number of state-of-the-art approaches, including conventional spectral-spatial classifiers and several deep learning models, as evaluated on three publicly available datasets.

The remainder of this article is organized as follows. Section II introduces the principles of the Prophet model. Section III presents the proposed MSPM framework in detail. The design of experiments, including the testing datasets and parameter settings, is discussed in Section IV. Section V presents the experimental results and analysis, followed by concluding remarks in Section VI.

II. PRINCIPLES OF THE PROPHET MODEL

A. Concept and Algorithm of the Prophet Model
As a recently developed forecasting tool by Facebook for analyzing business time-series data [30], the Prophet model blends the advantages of judgmental and statistical modeling to convert the forecasting problem into a curve-fitting task. As a nonlinear and decomposable model, the Prophet model can deeply analyze data by decomposing a given time series y(t) into different components

y(t) = g(t) + p(t) + h(t) + ε_t    (1)

where g(t) is the trend component modeling the nonperiodic changes; p(t) is the periodic seasonality component, i.e., daily, weekly, or annual seasonality; h(t) denotes the holiday effects, which usually appear on irregular schedules over one or more days; and ε_t is the error term, assumed to follow a normal distribution.
The trend component represents the most informative part of the signal, which is widely defined by a logistic function [30]

g(t) = C / (1 + exp(−r(t − m)))    (2)

where C is the carrying capacity determining the maximum value of g(t), r denotes the growth rate, and m is an offset parameter. In g(t), the growth rate r is defined to characterize the changes in the trend component. More specifically, S changepoints are set over the whole time t. At each changepoint s (s = 1, . . . , S), the growth rate r is allowed to change for better curve fitting. The rate change is defined as δ ∈ R^S, where δ_s denotes the rate change at the sth changepoint. Therefore, the growth rate r_s at any changepoint is calculated by adding the rate adjustment to the former growth rate

r_s = r_{s−1} + δ_s    (3)

where the base rate r_1 is determined by Stan's limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [32], and the rate adjustment vector δ is given a sparse prior via a Laplace distribution: δ_s ∼ Laplace(τ). More explanations are given in Section IV-B. Based on the growth rate, the parameter m is correspondingly updated to connect the trend at different changepoints. The adjustment on m at the sth changepoint is defined as ∅_s. Thus, the offset value m_s at the sth changepoint is given by

m_s = m_{s−1} + ∅_s    (4)

where the base offset m_1 is also obtained using Stan's L-BFGS. To keep the trend continuous across changepoints, ∅_s is calculated as

∅_s = (t_s − m_1 − Σ_{j<s} ∅_j)(1 − r_{s−1}/r_s)    (5)

where ∅_1 is set to 0 and t_s denotes the position of the sth changepoint. Business time series usually possess periodic phenomena due to the associated repeated seasonal behaviors [31]. To capture this, a periodic function of t is employed as the seasonality in the model, using a standard Fourier series to flexibly approximate the seasonal effects

p(t) = Σ_{n=1}^{N} (a_n cos(2πnt/M) + b_n sin(2πnt/M)),  β = [a_1, b_1, . . . , a_N, b_N]^T    (6)

where M represents the expected regular period of the data (e.g., M = 7 for weekly series when using days as the time scale).
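The cumulative rate and offset updates above can be sketched as follows. This is a minimal NumPy illustration, not the Prophet source; the function name `changepoint_params` is ours, and the continuity-based offset adjustment follows the original Prophet formulation as an assumption.

```python
import numpy as np

def changepoint_params(t_s, delta, r1, m1):
    """Accumulate growth-rate and offset adjustments over S changepoints.

    t_s   : positions of the S changepoints
    delta : rate change delta_s at each changepoint
    r1, m1: base rate and base offset (found via L-BFGS in Prophet itself)

    Returns per-changepoint rates r_s and offsets m_s; the offset
    adjustment keeps the logistic trend continuous across changepoints.
    """
    S = len(delta)
    r = np.empty(S)
    m = np.empty(S)
    r_prev, m_prev, phi_sum = r1, m1, 0.0
    for s in range(S):
        r_new = r_prev + delta[s]                      # rate update, Eq.-style r_s = r_{s-1} + delta_s
        # offset adjustment chosen so g(t) stays continuous at t_s
        phi = (t_s[s] - m1 - phi_sum) * (1.0 - r_prev / r_new)
        phi_sum += phi
        r[s] = r_new
        m[s] = m_prev + phi                            # m_s = m_{s-1} + phi_s
        r_prev, m_prev = r_new, m[s]
    return r, m
```

With all rate changes set to zero, the rate and offset stay at their base values, as expected.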
The parameter vector β is set to satisfy β ∼ Normal(0, σ²). The factor N affects the fitting performance on the seasonality, where a large N results in more decomposed terms in p(t).
This allows seasonal patterns to change more quickly, with an increased risk of overfitting [30], [31].
In the forecasting of business time series, the holiday effects are incorporated in a straightforward way based on prior knowledge of the events as follows:

h(t) = Z(t)κ    (7)

where Z(t) = [1(t ∈ Day_1), . . . , 1(t ∈ Day_L)] is an indicator signifying whether time t falls within a holiday event, and Day_l is the predefined set of holiday days. The parameter κ denotes the magnitude of the change caused by the corresponding holiday events in the forecast, which is defined as κ ∼ Normal(0, ν²). Further details about the Prophet model can be found in [30] and [32].
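The seasonality and holiday terms can be illustrated by building the Fourier features and holiday indicators directly, as in the sketch below. The function names are hypothetical, and β and κ are supplied rather than fitted, which Prophet itself would do.

```python
import numpy as np

def fourier_seasonality(t, beta, M):
    """Standard Fourier-series seasonality p(t) with period M.

    beta holds the 2N coefficients [a_1, b_1, ..., a_N, b_N], which the
    Prophet model draws from Normal(0, sigma^2).
    """
    N = len(beta) // 2
    # columns ordered cos_1, sin_1, cos_2, sin_2, ... to match beta
    X = np.column_stack(
        [f(2.0 * np.pi * n * t / M)
         for n in range(1, N + 1) for f in (np.cos, np.sin)]
    )
    return X @ beta

def holiday_effect(t, holiday_days, kappa):
    """h(t) = Z(t) kappa, with Z(t) indicating membership in each holiday set."""
    Z = np.column_stack([np.isin(t, days).astype(float) for days in holiday_days])
    return Z @ kappa
```

For a weekly series with daily samples (M = 7), a single cosine term repeats every seven steps.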

B. Enhancing Features of HSI With the Prophet Model
From (3) to (5), it is clear that the derived trend component g(t) is decided not only by the input data at time t but also by all information of the former data jointly in the time series. Therefore, when the input data at time t are corrupted by noise, the obtained g(t) can avoid the negative effect of noisy data and still derive the correct information. As such, the Prophet model has great potential to analyze the complex uncorrected HSI with noisy and water absorption bands. By taking the spectral data in hundreds of bands as a time series, we can gain insights into an HSI, including decomposing the spectral profiles into different components to derive the informative parts, i.e., the trend, and mitigate the noise in an HSI.
For effective feature extraction, it is necessary to deal with the subtle interclass difference and large intraclass variations of an HSI [33]. In order to evaluate the feature characterization ability of Prophet on HSIs, the interclass and intraclass variations are calculated and assessed in this section. Here, some classic feature extraction approaches, such as 1-D-SSA [13] and mean filtering, are compared with the Prophet model. The raw spectral profiles are taken as the baseline for comparison, too. Besides, the Prophet model with and without seasonality settings is also evaluated to test the effects of seasonality components in feature extraction of an HSI. For 1-D-SSA, the window size is set to 5 × 5. Only the first component is used for eigenvalue grouping according to the recommended configurations [3], [13]. The size for the mean filtering is set to 5 × 5. In the Prophet model, the values of parameters τ , σ , and N are set to 25, 30, and 3 empirically in this test, respectively.
To evaluate how the Prophet model may affect the interclass diversity and intraclass variance, the Kullback-Leibler divergence [34] is employed, which builds on information entropy theory to quantify the dissimilarity (or difference, distance, and discrimination) between different data. Here, the Salinas dataset [35] is taken as an example. For each compared approach, ten samples are randomly selected from each of the 16 labeled classes. The Kullback-Leibler divergences are calculated among the samples from every two classes (averaged, for each class, against all other classes) and within each class. The results of the interclass and intraclass Kullback-Leibler divergences are shown in Fig. 1.
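The divergence computation can be reproduced in outline as below. In this sketch, spectra are normalized to probability distributions before computing the divergence, and the symmetrised pairing/averaging scheme is our simplification of the paper's protocol, not its exact procedure.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two spectra treated as distributions."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mean_pairwise_kl(samples_a, samples_b):
    """Average symmetrised KL over all pairs from two sample sets.

    Passing the same list twice gives an intraclass score (identical
    objects are skipped); two different lists give an interclass score.
    """
    vals = [kl_divergence(a, b) + kl_divergence(b, a)
            for a in samples_a for b in samples_b if a is not b]
    return float(np.mean(vals))
```

A feature extractor that enlarges interclass divergence while shrinking the intraclass one, as reported for Prophet, would raise the first score relative to the second.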
As shown in Fig. 1, the Prophet model achieves the highest interclass Kullback-Leibler divergence and the lowest intraclass Kullback-Leibler divergence in most classes; that is, the Prophet model can effectively improve the disparity between classes and the similarity within each class, which would help enhance further data classification [36]. This validates the potential of Prophet in HSI feature extraction. The raw spectral profiles (baseline) contain much noisy information; as shown in Fig. 1, compared with mean filtering and 1-D-SSA, the baseline exhibits both higher interclass and higher intraclass Kullback-Leibler divergence. As for mean filtering, it performs better in classes 1, 3, and 15 on interclass disparity and classes 2-4, 7-9, and 11 on intraclass similarity. The 1-D-SSA shows higher interclass Kullback-Leibler divergence on class 6 and better intraclass Kullback-Leibler divergence on classes 1, 10, 12, and 14-16. The main reason for the inferior performance of 1-D-SSA and mean filtering is that they filter out the noisy contents using only the local information within each sliding window. On the contrary, the Prophet model can suppress the noise and reconstruct the data based on the contextual spectral information in the former bands. In addition, from Fig. 1, we can find that the Prophet model without seasonality shows better results, i.e., higher interclass Kullback-Leibler divergence and lower intraclass Kullback-Leibler divergence in most cases. This may be because it is difficult for the periodic seasonality settings to characterize the complicated nonperiodic changes in HSI spectra. The results from the Indian Pines and PaviaU datasets are given in Figs. S1 and S2 of the Supplementary Material, which are consistent with the findings here. Thus, in this article, we use the Prophet model without seasonality for further feature extraction and data classification.

III. PROPOSED APPROACH
The flowchart of the proposed MSPM approach is shown in Fig. 2, which has three main stages, i.e., multiscale superpixelwise spatial constrained smoothing, multiscale Prophet model-based feature extraction, and superpixelwise HSI classification and fusion, as detailed in the following.

A. Superpixelwise Spatial Segmentation and Smoothing
In general, adjacent pixels in an HSI often share similar spectral profiles because of the same surface material. Through superpixelwise segmentation, these pixels can be grouped into one region (superpixel), while the whole image is segmented into homogeneous regions with adaptively determined sizes and shapes. To ease the computational cost, PCA is applied to the original HSI dataset to extract the first three principal components and form a pseudocolor image. The SLIC [37] method is then applied to this pseudocolor image for superpixelwise segmentation of the HSI, as SLIC produces accurate boundary adherence with low computational complexity [9].
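The dimensionality reduction step can be sketched as follows: a minimal eigen-decomposition PCA, assuming the cube is H × W × D. The helper name `first_three_pcs` is illustrative, not the authors' code.

```python
import numpy as np

def first_three_pcs(cube):
    """Project an H x W x D hyperspectral cube onto its first three
    principal components to form a pseudocolor image for SLIC."""
    H, W, D = cube.shape
    X = cube.reshape(-1, D).astype(float)
    X -= X.mean(axis=0)                     # center each band
    # eigen-decomposition of the band covariance; keep top-3 eigenvectors
    cov = X.T @ X / (X.shape[0] - 1)
    _, V = np.linalg.eigh(cov)              # eigenvalues in ascending order
    pcs = X @ V[:, ::-1][:, :3]
    # rescale each component to [0, 1] for display/segmentation
    pcs -= pcs.min(axis=0)
    span = pcs.max(axis=0)
    pcs /= np.where(span > 0, span, 1.0)
    return pcs.reshape(H, W, 3)
```

The resulting three-channel image can then be fed to any SLIC implementation to generate the superpixel maps.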
Specifically, the SLIC starts with k initial cluster centers distributed on an evenly spaced grid, where k is the desired number of superpixels. It then performs clustering within a square region of size 2R × 2R around each center, where R = √(n/k) denotes the segmentation scale and n is the total number of pixels of the HSI. For measuring the homogeneity of a superpixel, a distance between two pixels, Dist(i, j), is defined, which combines a spectral distance measuring the spectral similarity and a spatial distance estimating the regularity, proximity, and compactness of the two pixels

Dist(i, j) = √(‖x_i − x_j‖² + (G/R)² ‖p_i − p_j‖²)
where the spatial coordinates of the two pixels x_i and x_j are p_i and p_j, and G is a geometric factor to balance the relative importance between the spectral and spatial features. For each superpixel in the generated superpixel maps, the location indices of all its pixels are recorded and mapped to all the spectral bands, resulting in nonoverlapping 3-D superpixels [20]. In this way, each superpixel contains a group of adjacent pixels in D dimensions. As shown in Fig. 2, superpixelwise spatially constrained image smoothing is then employed to remove the noise. Specifically, for each spectral band, the mean value of all pixels within the superpixel is taken as the value for that superpixel. In other words, superpixelwise mean spectral vectors are calculated to replace all pixels within each superpixel. In this article, rather than setting a fixed segmentation scale for each dataset, we adopt an MSP strategy, in which a set of segmentation scales is applied to each dataset to generate multiple superpixel maps for improved feature extraction.
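The superpixelwise spatially constrained smoothing amounts to replacing each pixel by the mean spectrum of its superpixel, e.g. (a minimal sketch; `superpixel_mean_smooth` is an illustrative helper name):

```python
import numpy as np

def superpixel_mean_smooth(cube, labels):
    """Replace every pixel by the mean spectrum of its superpixel.

    cube   : H x W x D hyperspectral data
    labels : H x W integer superpixel map (e.g., from SLIC)
    """
    H, W, D = cube.shape
    flat = cube.reshape(-1, D).astype(float)
    lab = labels.reshape(-1)
    out = np.empty_like(flat)
    for k in np.unique(lab):
        idx = lab == k
        out[idx] = flat[idx].mean(axis=0)   # superpixelwise mean spectral vector
    return out.reshape(H, W, D)
```

After smoothing, every pixel inside a superpixel carries the same mean spectral vector, so the later Prophet step only needs one vector per superpixel.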
The superpixelwise image smoothing can remove the noise within each superpixel to some extent. Besides, the subsequent Prophet operation only needs to be applied to the mean spectral vector of each superpixel, which significantly improves the efficiency of the follow-on analysis.

B. Superpixelwise Multiscale Prophet Model
The Prophet model without the seasonality is used for the further image processing. Specifically, for the kth superpixel in one superpixel map, let SP_{k,d} denote its superpixelwise mean spectrum at the dth band obtained in Section III-A; the spectral data from all bands then form a vector SP_k. The trend component of SP_k is extracted via the logistic trend function

g_k(d) = C_k / (1 + exp(−r(d − m)))

where C_k is the carrying capacity of the kth superpixel, determined from the input spectra as the maximum value among the spectra of all bands, C_k = max_{d∈{1,...,D}} SP_{k,d}. For better capturing the statistical spectral-spatial structures of an HSI throughout the whole set of bands, we define each band as a changepoint. This helps to extract the local shifts of the superpixelwise profile among consecutive bands. Based on (3)-(5), r_d and m_d for the HSI data can be calculated. During this process, as explained in (3), the factor τ significantly affects the value of r in each pair of consecutive bands, which further determines the magnitude of the output trend components; that is, a large τ leads the derived trend components to overfit, while a small τ causes the trend components to underfit the input superpixelwise spectral profile. In other words, the choice of τ should match the informativeness and noise level of the input data. Generally, the noise level varies throughout the hyperspectral bands and is also correlated with sensor characteristics, image contents, and the weather conditions when the image was acquired [38].
Taking one pixel from the Indian Pines dataset [3] as an example, Fig. 3(a) shows the extracted trend signals under different values of τ, indicating the effect of various noise levels on the derived features. The trend function with a larger τ generates estimates more similar to the original data, although it risks overfitting. In essence, for HSI data with a low SNR, smaller τ values can remove most of the noisy content. However, if τ is too small, important characteristics of the spectral profile may be lost; thus, applying a small τ to a dataset with a high SNR can lead to the loss of discriminating information in the samples. Besides, Fig. 3(a) also shows that the Prophet model can derive the informative trend from the HSI data, which is robust to the noisy content. This will be further validated in Section V.
In addition, the effectiveness of the superpixelwise Prophet model on the water absorption bands is also analyzed, as shown in Fig. 3(b)-(d). Although water absorption bands are severely affected by noise, they still contain some useful information, as shown in the following. We take the band image at a wavelength of 2499 nm as an example, which is severely affected by water absorption. The image reconstructed from Fig. 3(b) using our proposed superpixelwise Prophet model, with a superpixel scale of 100 and τ of 20, is shown in Fig. 3(c). Although Fig. 3(b) seems to contain no useful information, the denoised version from our approach shown in Fig. 3(c) has dramatically improved the usefulness of the features. In other words, due to the strong prediction capability of the Prophet model, the severely noise-affected water absorption band becomes reusable, in particular the well-preserved and enhanced local spatial structures compared with the ground truth (GT) in Fig. 3(d). Overall, this demonstrates the great value and noise robustness of the proposed model in extracting useful information from the water absorption bands and further improving the classification of the HSI.
For noisy HSI data in real cases, the SNR varies from band to band [38]. Therefore, it is challenging or even impossible to estimate a single optimal τ for the trend components of all the bands. To tackle this challenge, a multiscale approach is adopted in our MSPM framework, where a set of τ values, τ = {τ_1, τ_2, . . . , τ_T}, is used to represent different scales, with T being the number of scales. With the various values of τ, in total T trend components can be extracted from the smoothed mean spectral vector of each superpixel. These are then taken as multiscale features of the HSI for data classification in the next stage, where the Prophet R package is used for implementation in our experiments [39].
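Prophet's Laplace-prior changepoint fitting is not reproduced here; as an illustrative proxy (our assumption, not the authors' implementation), the sketch below fits a piecewise-linear trend with every band acting as a changepoint and a ridge penalty 1/τ on the rate changes, so that a larger τ tracks the spectrum more closely, and then stacks the trends from all scales as the multiscale feature.

```python
import numpy as np

def trend_at_scale(y, tau):
    """Fit a piecewise-linear trend to a mean spectrum y, with every band a
    changepoint and a penalty 1/tau on the rate changes (a ridge proxy for
    Prophet's sparse Laplace prior on the delta_s)."""
    D = len(y)
    t = np.arange(D, dtype=float)
    # basis: intercept, base slope, and a hinge function at each interior band
    hinges = np.maximum(t[:, None] - t[None, 1:-1], 0.0)
    A = np.column_stack([np.ones(D), t, hinges])
    # penalize only the rate-change coefficients, scaled by 1/tau
    pen = np.zeros(A.shape[1])
    pen[2:] = 1.0 / tau
    coef = np.linalg.solve(A.T @ A + np.diag(pen), A.T @ y)
    return A @ coef

def multiscale_trends(y, taus=(5, 10, 20, 25, 30)):
    """Stack the trend components from all scales into one feature vector."""
    return np.concatenate([trend_at_scale(y, tau) for tau in taus])
```

With a smaller penalty (larger τ) the fitted trend follows the input spectrum at least as closely, mirroring the overfitting/underfitting trade-off described above.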

C. Superpixelwise HSI Classification and Decision Fusion
At a given scale, the trend components extracted from the Prophet model are taken as the feature of the corresponding superpixel to produce a superpixelwise classification map. For T scales, we obtain T classification maps, denoted as Map_1, . . . , Map_T. These maps are fused by a majority vote: for each pixel x, the final label is

L(x) = arg max_l Σ_{t=1}^{T} F(Map_t(x), l)

where l represents the possible class labels in the HSI and F is an indicator function given by F(i, j) = 1 if i = j and F(i, j) = 0 otherwise.
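The majority vote-based decision fusion can be sketched as follows (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def majority_vote_fusion(maps):
    """Fuse T superpixelwise classification maps by per-pixel majority vote.

    maps : T x H x W array of integer class labels.
    Implements argmax over l of sum_t F(map_t(x), l), with F the indicator.
    """
    maps = np.asarray(maps)
    T, H, W = maps.shape
    n_cls = maps.max() + 1
    votes = np.zeros((n_cls, H, W), dtype=int)
    for t in range(T):
        for l in range(n_cls):
            votes[l] += (maps[t] == l)   # F(map_t(x), l)
    return votes.argmax(axis=0)
```

Ties are broken toward the smallest label by `argmax`; any other deterministic rule would serve equally for the sketch.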

IV. EXPERIMENTAL SETTINGS AND DATASETS DESCRIPTION
For performance assessment, three publicly available HSI datasets are used in this article. The descriptions of the datasets and the experimental settings as well as some ablation studies are presented in detail as follows.

A. Description of the Datasets
The first dataset is the Indian Pines, which was collected using the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor [35] over the Indian Pines study site in Northwest Indiana, USA. This dataset contains 145 × 145 pixels with a low spatial resolution of 20 m/pixel and 220 spectral bands in the wavelength range of 0.4-2.5 µm after removing four invalid bands. In addition, 20 water absorption bands (104-108, 150-163, and 220) were often discarded [40] before data classification, leaving 200 spectral reflectance bands for analysis and testing. There are 16 manually defined land-cover classes, which are mostly different kinds of crops.
The second dataset is Salinas, which was also acquired via AVIRIS [35] in the Salinas Valley in California, USA. It has 512 × 217 pixels and 224 spectral bands with a spatial resolution of 3.7 m. Similar to the first dataset, 20 water absorption bands are usually discarded. In addition, the corresponding GT map also has 16 classes.
The third is the Pavia University dataset, namely, PaviaU, which was acquired by the Reflective Optics System Imaging Spectrometer sensor over Pavia, Northern Italy [41]. It contains 114 spectral bands in a spectral range of 0.43-0.86 µm and 610 × 340 pixels with a spatial resolution of 1.3 m. Similarly, to avoid the effect of water absorption, the available number of bands was reduced from 114 to 103. The GT has nine classes of land covers. Further details, including the number of samples in each class within these datasets, can be found in [20].

B. Experimental Settings
For performance evaluation, two groups of experiments are designed, using the uncorrected HSI dataset (without removing any bands) and the corrected dataset (with the noisy and water absorption bands removed). This is to validate the robustness of our approach to the noisy and water absorption bands. First, a comparative analysis was conducted and benchmarked with six state-of-the-art spectral-spatial HSI classification methods, using both the corrected and uncorrected HSI images. These include two signal decomposition-based methods, 2-D-EMD [16] and 2-D-MSSP [29]; the MSP-based method MSP-SVMsub [9]; an MRF-weighted spatial-spectral classifier, SMLR-SpATV [10]; and two superpixelwise LRR-based denoising techniques, SS-LRR [12] and the fast superpixel-based subspace low-rank learning method (FS2LRL) [28], which have exhibited superiority in directly classifying the uncorrected HSI data. Our MSPM is also compared with several deep learning approaches. Besides, applying the SVM classifier to the raw spectral profiles of the HSI, denoted as SVM-spe, is taken as the baseline. The ablation experiments comparing the effectiveness of different components of the proposed approach are given in Table S1 of the Supplementary Material.
The parameter settings of the relevant methods are summarized in Table I. For our MSPM, the superpixel segmentation scales were set to {50, 100, 200, 400, 600} for all HSI datasets according to the parameter analysis in Section IV-C. The scale τ of the Prophet model varies within the range {1, 5, 10, 15, 20, 25, 30, 35} in different permutations and combinations; to balance accuracy and efficiency, the optimal set {5, 10, 20, 25, 30} was employed based on the parameter analysis in Section IV-C. The base rate r_1 and the base offset m_1 are set by Stan's L-BFGS in the Prophet R package by determining a maximum a posteriori estimate [32]. For the SVM, the LIBSVM library was utilized for multiclass classification [42], [43]. According to [16] and [44], the Gaussian radial basis function (RBF) kernel was employed, where the kernel factor and penalty parameter were optimally determined via grid search as 0.125 and 1024, respectively. To ensure a fair comparison, we keep the SVM parameters consistent across all experiments. The other parameters of the benchmarking approaches are set to their recommended default values [13], as shown in Table I.
To reduce random discrepancies and avoid systematic errors, all experiments were independently run ten times. In each run, the training and testing sets were randomly divided with no overlapping between them. To fully evaluate the performance of these methods, experiments were also conducted with different numbers of training samples. In particular, we are mainly concerned with the problem of a small training set. Hence, the number of selected samples m was set to 5, 10, and 20 for each land-cover class in the training set via stratified sampling. If the total number of samples within a land-cover class is less than 30, such as grass/pasture-mowed and oats in Indian Pines, 50% of the samples in that class are used for training [45]. In addition, the overall accuracy (OA), Kappa coefficient (κ), and average accuracy (AA) are used for quantitative evaluation of the classification results.
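The three evaluation metrics can be computed from a confusion matrix as follows; this is a standard sketch and `classification_scores` is an illustrative helper, not code from the paper.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Overall accuracy (OA), average accuracy (AA), and Cohen's kappa
    computed from predicted vs. ground-truth labels."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))               # mean per-class recall
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total**2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa
```

AA averages the per-class recalls, so it penalizes methods that ignore small classes even when OA stays high, which matters under the unbalanced class distributions discussed above.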
All involved approaches were implemented on the MATLAB 2018a platform on a computer with an Intel Core i7-8700 CPU (3.20 GHz) and 16.0 GB of memory. The Prophet model was implemented in R 3.5.3.
C. Parameter Analysis

1) Segmentation Scale: The scale of superpixel segmentation is a key parameter that affects the generated superpixels and the classification results [17]. Here, the candidate scale values for the three datasets were set to {10, 50, 100, 200, 400, 600, 800}. Note that the numbers of training and test samples were set as aforementioned in Section IV-B. Fig. 4 shows the mean OA over ten independent runs of the proposed MSPM using different segmentation scales on the three corrected datasets, under various numbers of training samples per class. In general, more training samples lead to better classification results in terms of higher OA values. For a given training size, the segmentation scale in MSPM determines the homogeneity within a local region and thus affects the classification accuracy. An appropriate scale value can generate more accurate object boundaries. The land covers in the three datasets have different characteristics and complexity. As shown in Fig. 4, the best segmentation scales for Indian Pines, Salinas, and PaviaU are found to be 100, 600, and 200, respectively, as they help to generate the highest OA for a given number of training samples. More broadly, the proper ranges of segmentation scales for Indian Pines, Salinas, and PaviaU are 50-400, 50-800, and 50-600, respectively. Therefore, in the MSP segmentation, the set {50, 100, 200, 400, 600} is adopted for the three datasets in the remaining experiments.
2) Parameter τ: To determine the optimal set of values for parameter τ on the three datasets, the class-based accuracy and OA achieved by the proposed method with τ varying within {1, 5, 10, 15, 20, 25, 30, 35} are analyzed in this section. The results of class-based accuracy are given in Fig. S3 of the Supplementary Material. The number of training samples is set to 5 for each land cover in all tests.
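For illustration only, the multiscale trend extraction from a superpixel's average spectrum can be sketched with a simple centered moving average standing in for the Prophet trend component. This is a hypothetical simplification: the window set mirrors the τ values adopted later, under the assumption (ours, not the paper's) that a larger τ corresponds to a smoother trend:

```python
import numpy as np

def multiscale_trend(spectrum, windows=(5, 10, 20, 25, 30)):
    """Stack trend components of one average spectrum at several scales.
    A centered moving average is used here as a stand-in for the Prophet
    trend; each window yields a smoothed copy of the full spectrum."""
    trends = []
    for w in windows:
        kernel = np.ones(w) / w
        # Edge padding keeps the smoothed output the same length as the input.
        padded = np.pad(spectrum, (w // 2, w - w // 2 - 1), mode="edge")
        trends.append(np.convolve(padded, kernel, mode="valid"))
    return np.concatenate(trends)  # multiscale feature vector
```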
From Fig. 5, it is clear that values of parameter τ in the range [5, 30] enable the MSPM to achieve better classification results in terms of higher OA values. As for the class-based classification performance shown in Fig. S3, on the Indian Pines dataset, the values of 5, 10, 20, and 25 for parameter τ lead to the best accuracy in three classes (classes 3, 5, and 16), two classes (classes 6 and 15), two classes (classes 10 and 14), and two classes (classes 2 and 11), respectively. On the corrected Salinas dataset, apart from classes 1, 9, and 12, where MSPM reaches 100% accuracy in all cases, the best results are generated when parameter τ equals 20 or 25, with MSPM achieving the highest accuracy in three classes for each value. The proposed MSPM also performs well when parameter τ is set to 5, 15, and 30. The best values of parameter τ for PaviaU are found to be 1 and 5, as they help to generate the highest accuracy on classes 1 and 4, and classes 3 and 5, respectively. The values of 10, 20, 25, 30, and 35 for parameter τ also lead to the best classification performance on classes 9, 2, 6, 8, and 7, respectively. To sum up, considering the OA and class-based performance of MSPM on the three datasets, we employ the set {5, 10, 20, 25, 30} for parameter τ in this article.

In the following, qualitative and quantitative results from the three HSI datasets are presented for performance assessment of the proposed MSPM approach. The relevant results from the uncorrected and corrected HSI datasets, the efficacy of the different strategies in MSPM, and further comparison with several deep learning approaches are detailed as follows.
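Before turning to the results, the majority vote-based decision fusion that combines the per-scale superpixelwise label maps can be sketched as follows; a minimal numpy stand-in, not the authors' implementation:

```python
import numpy as np

def majority_vote_fusion(label_maps):
    """Fuse classification maps obtained at different segmentation scales
    by a pixelwise majority vote; ties go to the smallest class label."""
    stacked = np.stack(label_maps)          # shape (n_scales, H, W)
    n_scales, H, W = stacked.shape
    flat = stacked.reshape(n_scales, -1)
    fused = np.empty(flat.shape[1], dtype=flat.dtype)
    for i in range(flat.shape[1]):
        vals, counts = np.unique(flat[:, i], return_counts=True)
        fused[i] = vals[np.argmax(counts)]  # most frequent label wins
    return fused.reshape(H, W)
```

With the adopted scale set, five label maps per dataset would be fused this way into the final classification map.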

A. Results From the Indian Pines Dataset
The Indian Pines dataset is heavily corrupted by noise and water absorption [28]. Here, the corrected and uncorrected Indian Pines datasets, where the noisy and water absorption bands are removed or kept, respectively, are used for quantitative and qualitative performance evaluation as detailed next.
1) Quantitative Evaluation: The quantitative results are summarized in Table II. Note that the best results in each row are highlighted in bold for comparison.
First, all the results in Table II are positively correlated with the increasing number of training samples. With the introduced spatial features, 2-D-EMD, 2-D-MSSP, and SMLR-SpATV all yield better classification results than SVM-spe, in which no spatial features are used. With superpixel-based smoothing, more effective spatial features can be extracted; thus, MSP-SVMsub achieves further improved results. This superior performance is mainly due to the utilization of adaptive local structures in superpixelwise segmentation and smoothing, which enhances the consistency of the spectral features better than using a fixed-size window. Besides, SS-LRR and FS2LRL also improve the classification accuracy due to their denoising operations. Overall, the proposed MSPM remains the best and consistently produces the highest accuracy in terms of OA and κ. It is worth noting that the 2-D-MSSP method also works effectively for HSI feature extraction, showing results competitive with MSPM on both the corrected and uncorrected Indian Pines datasets and ranking second among all. Compared with 2-D-MSSP, our MSPM further improves the spectral feature extraction due to the introduced Prophet model, and achieves better performance especially when the training size is small. In the case of five training samples, the classification accuracy of MSPM is about 5% higher than that of 2-D-MSSP. This has validated the efficacy of our proposed MSPM approach in addressing the Hughes phenomenon under very limited training samples.
When comparing the performance of each method on the corrected and uncorrected Indian Pines datasets, we can find that, in most cases, SVM-spe, 2-D-EMD, SMLR-SpATV, MSP-SVMsub, and FS2LRL achieve better results on the corrected dataset. The 2-D-MSSP and the advanced noise-reduction method SS-LRR can effectively remove the effect of noise while preserving the spectral features; as a result, they achieve higher accuracy on the uncorrected HSI in most cases. For our proposed MSPM approach, in almost all cases, the results on the uncorrected dataset outperform those from the corrected one. Due to our multiscale superpixelwise smoothing and multiscale Prophet modeling, the proposed MSPM model can successfully suppress the noise and produce satisfactory results, even without removing the low-SNR and water absorption bands. This provides additional benefits for efficient and effective interpretation of the HSI in an automatic way, as no extra work is needed to filter out those unwanted bands.
2) Visual and Class-Based Performance Assessment: Although the superior performance of our MSPM has been validated by the quantitative assessment, here, visual and class-based performance evaluation is used to show exactly how the classification errors are reduced on the uncorrected dataset. Three methods in Table II with promising performance, i.e., SMLR-SpATV, MSP-SVMsub, and 2-D-MSSP, are selected for further comparison. For visual assessment, the classification maps from our approach and the three others on both the uncorrected and corrected datasets under 20 training samples per class are compared in Fig. S4 of the Supplementary Material, where black and magenta circles mark the correct and incorrect classifications, respectively. In Table III, class-by-class classification results are also compared in terms of the OA, AA, and Kappa coefficient.
First, as shown in Table III, MSPM has achieved the best classification accuracy in 13 and 11 out of 16 classes on the uncorrected and corrected datasets, respectively, especially in classes 3, 10, and 11 (Corn-min till, Soybeans-no till, and Soybeans-min till). Overall, MSPM ranks first in terms of OA, AA, and Kappa coefficient, and shows better accuracy on the uncorrected dataset. This has demonstrated again the efficacy of MSPM on classification of the Indian Pines dataset with limited and unbalanced samples. Note that the recently proposed 2-D-MSSP also shows high efficiency in noise-robust feature extraction with promising class-based performance.

B. Results From the Salinas Dataset
Similarly, the Salinas dataset with or without the noisy and water absorption bands removed, denoted as corrected and uncorrected Salinas, respectively, is used for evaluation.
1) Quantitative Evaluation: As shown in Table IV, conclusions similar to those from the Indian Pines dataset can be drawn. The difference here is that the addition of various spatial features contributes less than on the Indian Pines dataset. The possible reason is the high intraclass homogeneity caused by the low-level noise and the geometrically simple, structured agricultural objects in the image. Overall, the proposed MSPM remains the best and consistently produces the highest OA and κ, except on one occasion on the corrected dataset with five training samples per class. In most cases, MSPM yields better results on the uncorrected dataset than on the corrected one. These results clearly show the efficacy of our MSPM for HSI classification, which is again mainly due to combining multiscale superpixelwise segmentation and smoothing with the multiscale Prophet model-based noise-robust feature mining. In the classification of the Salinas dataset, both 2-D-MSSP and MSPM achieve sufficiently high accuracy when the training size is 20, and 2-D-MSSP shows competitive results. When comparing the OA and κ of the other methods on the corrected and uncorrected Salinas datasets, in all cases, SVM-spe and 2-D-EMD generate better results on the corrected Salinas than on the uncorrected one. SMLR-SpATV and 2-D-MSSP show mixed behavior, achieving better results from the uncorrected dataset in more cases. The two denoising methods, SS-LRR and FS2LRL, together with MSP-SVMsub, produce better classification on the uncorrected Salinas in almost all cases.
2) Visual and Class-Based Performance Assessment: In addition, the class-by-class classification results from SMLR-SpATV, MSP-SVMsub, 2-D-MSSP, and MSPM with 20 training samples per class are compared in Table V, with the classification maps given in Fig. S5 of the Supplementary Material for visual comparison. In Table V, the proposed MSPM produces the best results in 14 and 13 out of 16 land-cover classes on the uncorrected and corrected datasets, respectively. The 2-D-MSSP ranks second, achieving the best results in 10 out of 16 land-cover classes on both the uncorrected and corrected datasets. Apart from SMLR-SpATV, all the methods in this section achieve better OA on the uncorrected Salinas dataset. This has validated that useful information is retained in the water absorption bands.

C. Results From the PaviaU Dataset
In this section, the performance of our proposed MSPM is further evaluated using the PaviaU dataset. Here, the performances of all compared approaches on the corrected PaviaU dataset are analyzed. As an uncorrected PaviaU dataset with water absorption bands is unavailable, a simulated noisy dataset is produced, with results given in Table S2 of the Supplementary Material.
1) Quantitative Evaluation: In Table VI, our proposed MSPM is compared with SVM-spe, 2-D-EMD, 2-D-MSSP, SMLR-SpATV, MSP-SVMsub, SS-LRR, and FS2LRL, where class-based classification results are also shown. Similarly, the best quantitative results are labeled in bold for clarity. As shown in Table VI, the proposed MSPM produces the best results in terms of the highest OA, κ, and AA, followed by 2-D-MSSP; both are significantly better than the other compared methods. For the class-based performance, MSPM produces the highest accuracy in six out of nine classes and correctly classifies all testing data in classes 6 and 7. Table VII shows the classification results from the corrected PaviaU dataset using different numbers of training samples for all the involved algorithms. As seen, MSPM still achieves the highest OA and κ in all the experiments. With a lower number of spectral bands than the other two datasets, the Hughes effect in the PaviaU dataset is not apparent [9]. As a result, spectral-spatial classification strategies show smaller contributions than on the other two datasets. Nevertheless, the superpixel-based approach, MSP-SVMsub, still exhibits superior performance due to the preserved local homogeneous regions, especially when the number of training samples is above 5. Besides, the proposed MSPM and 2-D-MSSP show significantly better results than the other methods, which validates their effectiveness in feature extraction.
In general, the proposed MSPM has again produced effective and reliable results on the PaviaU dataset.
2) Visual Comparison: Visual comparison of the classification maps from our MSPM and the other compared methods on the corrected PaviaU dataset is shown in Fig. S6, where black and magenta circles denote correctly and incorrectly classified pixels, respectively. The comparison confirms that the best results are achieved by the proposed MSPM method.

D. Running Time Comparison
The running time of different methods is compared in this section, as shown in Table VIII for all the datasets. The proposed MSPM unfortunately requires the longest processing time, followed by 2-D-EMD, which, without using the multiscale strategy, yields quite poor results. In this article, the improvement of classification accuracy under a small number of training samples is emphasized first; a fast Prophet model will be explored in future studies.

E. Comparing With Other Methods Including Deep Learning
We further compare our MSPM with five state-of-the-art approaches, including four deep learning ones, i.e., GLapLRR [46], SSCasRNN [23], GBRBM in parallel [24], FSCN [2], and DR-CNN [22], on the corrected Indian Pines, Salinas, and PaviaU datasets, with the results given in Table IX. Note that, for the three datasets, we only use 20 samples per class for training, in comparison to the 10% training samples used in GLapLRR, SSCasRNN, and GBRBM in parallel, which equals on average 65, 338, and 475 samples per class, respectively.
The FSCN adopts 64 training samples per class on the corrected Indian Pines dataset and 238 samples per class on the corrected PaviaU dataset. For DR-CNN, 50 training samples per class are used in each dataset. In most cases, MSPM produces quite comparable or even slightly better results than these methods, including the four deep learning models. In summary, this has again validated the superiority of our proposed MSPM approach in noise-robust spectral-spatial feature extraction and data classification of the HSI.

VI. CONCLUSION
In this article, we have proposed MSPM, a novel spectral-spatial feature mining framework for noise-robust feature extraction and effective data classification of the HSI. First, we demonstrate that the Prophet model is able to enhance the HSI features in terms of reduced intraclass variance and enlarged interclass diversity. Second, superpixelwise image segmentation has been found particularly useful for grouping locally similar spectral pixels and reducing the high intraclass heterogeneity and interclass homogeneity of different land-cover classes in the HSI. Third, our MSPM model has successfully exploited spectral data at different noise levels, and the Prophet model has contributed noticeably to the classification, especially on the uncorrected datasets. The superpixelwise segmentation and the Prophet model complement each other in enhancing the features in the spatial and spectral domains, respectively. As a result, the joint spectral-spatial features can more effectively characterize both the corrected and uncorrected HSI datasets, especially with very limited training samples. Our MSPM model has significantly outperformed a number of state-of-the-art approaches, including several deep learning models trained with many more samples. The improved classification results from the uncorrected datasets have potentially enabled a new and fully automatic roadmap for interpreting the HSI, where the conventional wisdom of prefiltering unwanted bands can be skipped.
Future work will focus on the analysis of nonperiodic seasonality effects in the spectral profiles of the HSI. Besides, a fast and adaptive Prophet model will be investigated, where an adaptive scale will replace the multiscale strategy for improved computational efficiency, along with new applications in missing-data restoration and HSI reconstruction. Since the proposed method performs well with a limited training sample size, an unsupervised Prophet model for HSI data classification will also be explored in future work.