Bayesian Gravitation-Based Classification for Hyperspectral Images

Integration of spectral and spatial (SS) information is extremely important for the classification of high-resolution hyperspectral images (HSIs). Gravitation describes the interaction among celestial bodies and can be applied to measure the similarity between data for image classification. However, gravitation is hard to combine with spatial information and has rarely been applied in HSI classification. This article proposes a Bayesian gravitation-based classification (BGC) method to integrate the SS information of local neighbors and training samples. In the BGC method, each testing pixel is first assumed to be a massive object with unit volume and a particular density, where the density is taken as the data mass. Specifically, the data mass is formulated as an exponential function of the spectral distribution of the pixel's neighbors and the spatial prior distribution of its surrounding training samples based on the Bayesian theorem. Then, a joint data gravitation model is developed as the classification measure, in which the data mass is used to weigh the contributions of different neighbors in a local region. Four benchmark HSI datasets, i.e., Indian Pines, Pavia University, Salinas, and Grss_dfc_2014, are used to verify the BGC method. The experimental results are compared with those of several well-known HSI classification methods, including support vector machines (SVMs), sparse representation, and eight other state-of-the-art HSI classification methods. The BGC shows apparent superiority in the classification of high-resolution HSIs and also flexibility for HSIs with limited samples.


I. INTRODUCTION
Recent advances in hyperspectral remote sensing technology make it possible to collect hyperspectral images (HSIs) with hundreds of spectral bands and reasonable spatial resolution simultaneously [1]. This detailed spectral and spatial (SS) information increases the possibility of discriminating materials of interest more accurately. Thus, hyperspectral remote sensing technology has played an increasingly important role in many remote sensing applications, such as plant assessment [2], mineral exploration [3], disaster monitoring [4], and astronomy [5]. In these applications, classification is one of the most fundamental yet challenging tasks.
To obtain high-quality classification results, full utilization of SS information is extremely important [6], [7]. In the past decades, researchers have developed numerous methods to jointly utilize the SS information of HSIs [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24]. Some studies first extract spatial features by applying texture filters to each spectral band, such as the gray-level co-occurrence matrix (GLCM) [8], Gabor filters [9], local binary patterns (LBPs) [10], and Markov models [11]. The extracted SS features are then stacked together as the input of commonly used classifiers, such as support vector machines (SVMs) [23], random forest (RF) [25], decision trees [26], artificial neural networks [27], and so on. Unfortunately, these methods often produce very high-dimensional features and result in a deterioration of classification accuracy. Meanwhile, data projection-based methods, such as sparse representation [14], [28], manifold learning [15], [29], and graph-based methods [30], have also been developed to mine the physically meaningful consensus in a low-dimensional feature space in order to predict the classification labels [16]. Another way to deal with the high-dimensional and computational challenges in the HSI field is cloud computing architectures [31]. In these algorithms, the spatial information is often incorporated through joint neighborhoods, as in joint sparse representation classification (JSRC) [17]. Nevertheless, the use of spatial information is limited to that between the central pixel and its surrounding neighbors, and the spatial information of the training samples is not incorporated.
In addition to the aforementioned methods, some other studies combine the SS information by integrating different classifiers [12], [19], [20], [21], [22], [23], [24]. One typical approach is the fusion of pixel-based classification and superpixel-based segmentation results by majority voting [12], [32]. In these studies, the spectral information contributes mainly to the classification, while the spatial adjacency relationships between pixels are used to refine the results. The commonly used classifiers are machine learning methods, such as SVM and RF [25], while the superpixel-based segmentation methods include mean shift [33], simple linear iterative clustering (SLIC) [34], and the fractal net evolution approach (FNEA) [35]. The classification results often vary owing to the difficulty of determining the optimal segmentation scale. Another type of method designs multiple classifiers to model different types of features, and their combination produces the final classification [19]. One widely used example is the multikernel SVM classifier [20]. For example, in [21], two Gaussian radial basis function (RBF)-based kernels, i.e., the spectral RBF kernel and the texture RBF kernel, are combined by a weighting method. In [23], a series of kernel functions with different scale parameters is constructed to model spatial information at different scales. However, the selection of kernel functions, the training of kernel parameters, and the determination of the weighting parameters remain challenging [24], [36]. Although some intelligent optimization algorithms (IOAs), such as the genetic algorithm (GA) [37], particle swarm optimization (PSO) [38], and differential evolution (DE) [23], have been introduced to optimize the parameters, preventing an IOA from falling into a local optimum remains a major issue [24].
Most of the aforementioned classification methods are built on eager learning algorithms (ELAs) [39]. ELAs put significant effort into abstracting from the training samples by creating condensed representations during the learning phase [40]. These representations are considered approximately global and are applied directly to new testing pixels. However, ELAs are generally unable to provide good local approximations and show poor generalization ability when dealing with complex problems. Even deep learning algorithms, which have recently become a hot topic and superior methods in HSI classification, face a series of issues [5], [13], [27], [41], [42], [43], [44], [45]. For example, although 2-D and 3-D convolutional neural networks (CNNs) [41], [46], [47] can simultaneously extract and abstract the SS information, they usually require significantly more training samples, computational resources, and time [48].
In contrast, lazy learning algorithms (LLAs), such as k-nearest neighbor (KNN) [14] and case-based reasoning [49], are nonabstracting local learning methods. LLAs classify a testing pixel by finding the most similar samples in its local neighborhood and assigning the majority class label of the neighborhood [40]. The most prominent advantage of LLAs is their innate incremental learning ability, which makes them very effective in complex decision spaces [50]. In recent years, LLAs have been combined with sparse representation methods to improve the classification accuracy of HSIs [51], [52], [53].
More recently, several LLAs inspired by Newton's law of gravity have been proposed [54], [55], [56], [57], [58], [59], [60], [61]. These methods simulate the attractive gravitational force between data points, denoted as data gravitation, as the classification measure. Following Newton's law of gravity, if a testing sample falls into a certain category, the distance between it and the corresponding training samples is small, and thus the data gravitation between them is large. A data point is assigned the class label whose training samples exert the maximum resultant data gravitation on it. Obviously, different from general distance-based models such as KNN and K-means, the data gravitation-based methods can comprehensively utilize the information of all samples, and these methods have proven their superiority in data classification [54], [57], [60]. For example, Peng et al. [54] first proposed a data gravitation classification (DGC) method for the classification of UCI datasets. In [60], to handle the imbalanced classification problem, an amplified gravitation coefficient (AGC) is introduced to develop an imbalanced DGC (IDGC) method. The effectiveness of IDGC has been further verified on an imbalanced traffic identification task [61]. To deal with classification problems involving noisy data, Wen et al. [57] proposed a cognitive gravitation model (CGM) based on both the law of gravitation and the law of cognition. Besides, the superiority of data gravitation in classification has motivated its application to edge detection of HSIs [62].
Motivated by these applications, we have tested the applicability of data gravitation to HSI classification by developing a joint neighborhood learning-based DGC method (JDGC) [63]. Even so, the integration of SS information is still a challenging problem for data gravitation. In particular, the spatial prior information of the training samples is omitted, whereas the spatial prior of the training samples can be a beneficial constraint on the classifier [64]. Furthermore, similar to the traditional joint KNN method, JDGC relies on the Euclidean distance as the classification measure and assumes that the weight of each neighbor in a local region is identical. This is not reasonable because each neighbor has a different importance and distribution. How to obtain different weights for each neighbor in a joint region is another major issue.
To address the above issues, we present a Bayesian gravitation-based classification (BGC) method in this article. In this method, each testing pixel is assigned to the class whose training samples exert the largest data gravitation in a local joint region. The main contributions of BGC are as follows.
1) A Bayesian-based data mass calculation method is proposed. Here, a testing pixel with more densely distributed neighboring pixels and denser training samples is assigned a larger mass. This process effectively promotes the information mining of the limited training samples and successfully integrates the SS information of HSIs and training samples.
2) A novel joint data gravitation model is designed as the classification measure, in which the joint Euclidean distance is weighted by the data mass. Thus, each neighboring pixel in the joint region contributes a different weight to the central one based on their spectral and structural similarity.
The remainder of this article is organized as follows. Section II reviews the principle of data gravitation. Section III presents the details of the proposed BGC method. Section IV describes the experimental datasets, and the corresponding comparison results are discussed in Section V. Finally, Section VI concludes this study.

II. DATA GRAVITATION
In the data gravitation model, each data point is assumed to be a massive object. Following Newton's law of gravity, any two massive objects in the universe exert an attractive gravitational force on each other. The strength of the gravitational force (F) between two objects is directly proportional to the product of their masses (m_1 and m_2) and inversely proportional to the square of the distance (r) between them [65]

F = G m_1 m_2 / r^2   (1)

where G is the gravitational constant, which is often set to 1 in data gravitation-based methods. Obviously, a larger mass and a shorter distance produce a bigger gravitational force. This is used to define the data gravitation, a scalar that expresses the similarity between data points [54]. For a dataset, each data point is attracted by all the other data points. Thus, a larger accumulated data gravitation indicates a greater similarity. As shown in Fig. 1(a), for a data point P with unit mass, when the distance between P and the geometric center of class 1 (2r) is twice that of class 2 (r), then, following (1), the data gravitation exerted by class 2 (4F) is four times that of class 1 (F). That is, P more likely belongs to class 2. In Fig. 1(b), although the distances between P and classes 1 and 3 are the same, the data gravitation exerted by class 3 (2F) is twice that of class 1 because the overall mass of class 3 is twice that of class 1. In this situation, P more likely belongs to class 3.
Owing to its simple principle, data gravitation theory has been utilized to solve data clustering and classification problems [54], [57], [60], [61], [62], [63]. These studies found that the definitions of the data mass and the distance metric are crucial since they form the similarity measure. The basic DGC method sets the mass of each data point to a unit value, and the classification measure relies on the Euclidean distance between data points [54]. In [57], self-information is used to represent the data mass. In [59], the reciprocal of the accumulated distance between a data point and its KNNs is used as the data mass, so that densely distributed data show larger data gravitation. This accords with the observation from Fig. 1(b) that the data density affects the classification result.
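As a concrete illustration of this principle, the following minimal sketch (not the paper's implementation) classifies a point by accumulating the gravitational pull of (1) from every training sample, with all masses set to 1 as in the basic DGC [54]; the small constant `eps` guarding against zero distance is an assumption of this sketch:

```python
import numpy as np

def dgc_predict(X_train, y_train, x_test, G=1.0, eps=1e-6):
    """Assign x_test to the class whose training samples exert the
    largest accumulated data gravitation (unit masses assumed)."""
    labels = np.unique(y_train)
    forces = []
    for c in labels:
        Xc = X_train[y_train == c]               # samples of class c
        r2 = np.sum((Xc - x_test) ** 2, axis=1)  # squared distances
        forces.append(np.sum(G / (r2 + eps)))    # F = G*m1*m2/r^2 with m = 1
    return labels[int(np.argmax(forces))]

# Toy example: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
print(dgc_predict(X, y, np.array([4.9, 5.1])))  # -> 1
```

Note that, unlike KNN, every training sample contributes to the decision, but the inverse-square law makes distant samples nearly negligible.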
Nevertheless, until now, data gravitation has rarely been applied in HSI classification because gravitation is hard to combine with spatial information. This article aims to design a novel data mass calculation method and thus develop a more effective data gravitation-based classification measure by integrating the SS information of testing pixels and prior training samples.

III. BAYESIAN GRAVITATION-BASED CLASSIFICATION

The proposed BGC method consists of two main steps.
Step 1: Bayesian-based data mass calculation, in which the mass of each pixel is calculated based on the spectral distribution of its neighbors and the spatial distribution of its surrounding training samples.
Step 2: Joint data gravitation model, where the data mass is taken as the weight coefficient of the joint Euclidean distance in the local joint region.
Details of the two main steps are presented in Sections III-A and III-B.

A. Bayesian-Based Data Mass Calculation
Assuming an HSI with N pixels belonging to C classes, the first step of the BGC method is to calculate the data mass of each testing pixel y_i (i ∈ [1, 2, . . . , N]). In BGC, each testing pixel is assumed to be a massive object with unit volume and a particular density. As is well known, the mass (M) of an object is the product of its density (ρ) and volume (V), i.e., M = ρV [66]. As the volume of a pixel is assumed to be a unit value, the data mass of y_i is determined only by its density, i.e., M_{y_i} = ρ_{y_i} (2). Generally, the density of a data point is determined by its distribution in the attribute space. Given this, for each testing pixel y_i, we define a local spectral density parameter λ_{y_i} and a local spatial prior density parameter P^j_{y_i} to jointly describe its distribution in the spectral-spatial space.
For the local spectral density, λ_{y_i} is defined as an exponential function [see (3)] of the spectral Euclidean distances d_{y_i,k} between y_i and its kth neighbor (k ∈ [1, 2, . . . , W_spe × W_spe]), as shown in Fig. 2. Obviously, if the spectral features of the central pixel and its neighbors are strongly similar, the corresponding density value is high. For the local spatial prior density, P^j_{y_i} is defined as the spatial prior distribution of the training samples in a W_spa × W_spa local neighborhood, as shown in Fig. 2. In the local neighborhood, the denser the training samples of a certain class, the higher the possibility that the pixel belongs to this class. Hence, the possibility that pixel y_i is classified to the jth class is set as P^j_{y_i}. To calculate P^j_{y_i}, we first define two events. Event A^j_{y_i}: the pixel y_i is a training sample of the jth class. Event B^j_{y_i}: the pixel y_i is classified to the jth class, i.e., P^j_{y_i} = P(B^j_{y_i}). Obviously, events A^j_{y_i} and B^j_{y_i} are correlated, and the prior knowledge of event A^j_{y_i} can bring the probability of event B^j_{y_i} closer to the real probability. This follows the Bayesian theorem [67], [68]. The details for calculating P(B^j_{y_i}) are given in the following. Suppose the neighborhood of y_i contains N_{S_i} training samples, of which n_j belong to the jth class; then the probability of event A^j_{y_i} can be written as in (4). As events A^j_{y_i} and B^j_{y_i} are correlated, following the Bayesian theorem, the probability of event B^j_{y_i} can be obtained by (5). In a classification model, a pixel will be classified to a specific class if it belongs to that class in the training data. We thus can conclude that when event A^j_{y_i} occurs, event B^j_{y_i} will be true.
Therefore, their joint probability can be expressed as in (6). For the conditional probability P(A^j_{y_i} | B^j_{y_i}), when we classify the pixel y_i to the jth class, the probability of event A^j_{y_i} occurring is the proportion of jth-class training samples in the total number of training samples in the local neighborhood, i.e.,

P(A^j_{y_i} | B^j_{y_i}) = n_j / N_{S_i}.   (7)
Then, we can substitute (6) and (7) into (5) and obtain (8). Thus, the local spatial prior density of the training samples associated with the jth class can be defined as in (9). In this way, the category with more training samples in the local spatial neighborhood is regarded as having a larger local spatial prior density.
Consequently, integrating the local spectral density and the local spatial prior density of (3) and (9) makes comprehensive use of the SS information of HSIs and the prior information of the training samples. In this article, the integrated density of y_i associated with the jth category (j ∈ [1, 2, . . . , C]) is defined by (10). Accordingly, the data mass of y_i can be rewritten as in (11). As shown, the data mass is a vector associated with the spatial prior distribution of the training samples. In this way, the more similar the center pixel is to its surrounding pixels and the more training samples of the jth class there are in the neighborhood, the heavier the center pixel will be.
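The data mass calculation can be sketched as follows. Since the displayed equations (3), (10), and (11) are not reproduced in this text, the exact exponential form of λ and its combination with the spatial prior are illustrative assumptions that merely satisfy the stated constraints (λ larger than 1, mass increasing in the prior); only the prior P_j = n_j / N_S follows directly from the Bayesian derivation above:

```python
import numpy as np

def data_mass(center, spec_neighbors, train_labels_in_window, n_classes):
    """Illustrative per-class data mass for one test pixel.

    spec_neighbors: (K, B) spectra of the W_spe x W_spe neighbors.
    train_labels_in_window: class indices of the training samples found
    inside the W_spa x W_spa window (may be empty).
    """
    # Local spectral density: an exponential function of the neighbor
    # distances, kept > 1 so the mass grows with the spatial prior
    # (assumed form; the paper's eq. (3) is not reproduced here).
    d = np.linalg.norm(spec_neighbors - center, axis=1)
    lam = 1.0 + np.exp(-d.mean())            # lam in (1, 2]

    # Local spatial prior: P_j = n_j / N_S, from the Bayesian derivation.
    counts = np.bincount(train_labels_in_window, minlength=n_classes)
    prior = counts / max(counts.sum(), 1)

    # Integrated per-class mass: lam raised to the prior (assumed
    # combination; increasing in the prior because lam > 1).
    return lam ** prior

mass = data_mass(np.zeros(4), np.zeros((8, 4)),
                 np.array([1, 1, 2]), n_classes=3)
print(mass)  # class 1 (prior 2/3) receives the largest mass
```

With identical center and neighbors, λ reaches its maximum of 2, and the class with the most nearby training samples gets the heaviest mass, matching the qualitative behavior described above.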

B. Joint Data Gravitation Model
HSIs with high resolution usually have large homogeneous regions, where the neighboring pixels within a region are likely to belong to the same type of target [14]. That is, pixels located in a small neighborhood often share similar spectral characteristics and spatial structure. This understanding has been utilized to define the joint Euclidean distance in joint KNN, which holds that the importance of each testing sample in a local region is equal [53]. Apparently, this is not reasonable because it omits the different SS distributions of testing pixels in a local region. Therefore, in this article, we design a joint data gravitation model to calculate the data gravitation exerted on each testing pixel y_i (i ∈ [1, 2, . . . , N]), in which the data mass is used to weigh the joint Euclidean distances of different neighboring pixels. The basic principle of the joint data gravitation model versus the basic data gravitation model is illustrated in Fig. 3.
As seen in Fig. 3(a), in the conventional data gravitation model, a testing sample is affected by all the training samples and is therefore assigned to the class that exerts the maximum accumulated gravitational force. Note that using all the samples directly is not always better. In the proposed joint data gravitation model, on the contrary, as shown in Fig. 3(b), the category of the center pixel is determined only by its close neighboring pixels in the local joint region and their nearest training samples. In particular, the data mass, M^j_{y_i}, is used to weigh the contributions of different neighbors in the local joint region. Hence, the proposed BGC method can effectively improve the efficacy and robustness of HSI classification.
Details of the proposed joint data gravitation model are given in the following.
Step 1: For each central pixel y_i, set a W_J × W_J square neighborhood as its joint region, y_i,joint, as shown in the bottom left of Fig. 2.
Step 2: Calculate the data mass of pixel y_i using the Bayesian-based data mass calculation strategy [see (11)]. Since the mass is directly related to the spatial distribution of every class of training samples, the mass is a vector rather than a scalar, which can be written as

M_{y_i} = [M^1_{y_i}, M^2_{y_i}, . . . , M^C_{y_i}].   (12)

Step 3: For each testing pixel y_i, calculate the Euclidean distance between every neighboring pixel k (k ∈ [1, 2, . . . , W_J × W_J]) in its joint region and all the training samples, denoted as d_{y_i,joint(k),S_j}, where j ∈ [1, 2, . . . , C].
Step 4: For each neighboring pixel, select the nearest training sample of every class, denoted as S^j_{y_i,joint(k)}.
Step 5: Calculate the data gravitation exerted on the kth neighbor, F^j_{y_i,joint(k)}, by the nearest training sample of the jth class

F^j_{y_i,joint(k)} = M^j_{y_i} m_S / (d^2_{y_i,joint(k),S^j} + ε)   (13)

where m_S is the mass of the training sample and ε is a small constant; in this article, they are set to 1 and 10^-6, respectively. Accordingly, for each testing pixel y_i, we can obtain the data gravitation between all its neighbors and the nearest training sample of the jth class, denoted by F^j_{y_i,joint}, where j ∈ [1, 2, . . . , C]. Here, the data mass serves as the weighting coefficient of the joint Euclidean distances of the different neighboring testing pixels y_i,joint(k).
Step 6: The average gravitation of the joint neighbors exerted by the jth-class samples, F^j_i, can therefore be calculated by

F^j_i = (1 / (W_J × W_J)) Σ_{k=1}^{W_J × W_J} F^j_{y_i,joint(k)}.   (14)

In this way, the SS information of the joint neighbors is combined to produce the final data gravitation.
Step 7: The label of the testing pixel y_i can be determined by

class(y_i) = arg max_{j ∈ [1, 2, . . . , C]} F^j_i.   (15)

The pseudocode of the proposed BGC method is given in Algorithm 1.
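Steps 3-7 can be sketched as follows, a minimal illustration assuming Step 5's gravitation takes the form F = M^j · m_S / (d^2 + ε), consistent with (1) and the mass weighting described above (the paper's displayed equation is not reproduced in this text):

```python
import numpy as np

def bgc_label(joint_pixels, mass, train_X, train_y, n_classes,
              m_s=1.0, eps=1e-6):
    """Steps 3-7: classify one test pixel from its joint region.

    joint_pixels: (K, B) spectra of the W_J x W_J joint neighbors.
    mass: (C,) per-class data mass of the center pixel (Step 2).
    """
    F = np.zeros(n_classes)
    for j in range(n_classes):
        Xj = train_X[train_y == j]
        for x in joint_pixels:
            # Steps 3-4: squared distance to the nearest j-th class sample.
            d2 = np.min(np.sum((Xj - x) ** 2, axis=1))
            # Step 5: mass-weighted gravitation from that sample.
            F[j] += mass[j] * m_s / (d2 + eps)
    F /= len(joint_pixels)          # Step 6: average over the joint region
    return int(np.argmax(F))        # Step 7: assign the strongest class

# Toy usage: a joint region lying near the class-1 training sample.
train_X = np.array([[0.0, 0.0], [5.0, 5.0]])
train_y = np.array([0, 1])
joint = np.array([[4.9, 5.0], [5.1, 5.0], [5.0, 4.9], [5.0, 5.1]])
print(bgc_label(joint, np.ones(2), train_X, train_y, n_classes=2))  # -> 1
```

With a uniform mass vector this reduces to joint-neighborhood gravitation; the Bayesian mass of Section III-A is what biases the decision toward classes that are dense around the center pixel.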

IV. EXPERIMENTAL SETUP
In this section, we describe the details of the four benchmark HSI datasets and the experimental settings. The sensitivity of the parameters in the BGC method is also analyzed.

A. Datasets
The four benchmark datasets are Indian Pines, Salinas, Pavia University, and Grss_dfc_2014. All the datasets were downloaded from publicly available websites. Details of these datasets are given in the following.

1) Indian Pines (20-m Resolution, 145 × 145):
The dataset was acquired by the AVIRIS instrument over an agricultural area of Northwestern Indiana in 1992. The data cover a wavelength range from 0.4 to 2.5 μm with 224 wavebands. Considering atmospheric and water absorption, 24 bands are discarded, as in previous works [14], [69], and the remaining 200 bands are preserved for the classification test. This agricultural area mainly contains 16 different types of crops, as the ground truth (GT) image in Fig. 4(a) shows.

2) Salinas (3.7-m Resolution, 512 × 217):
The dataset was acquired by the AVIRIS sensor over Salinas Valley, California. Compared to the Indian Pines data, the spatial resolution of these data is significantly improved. After removing 20 water absorption bands, 204 bands are retained for classification. Sixteen classes of crops are considered in this scene, as the GT image in Fig. 4(b) shows.

3) Pavia University (1.3-m Resolution, 610 × 340):
The dataset was acquired by the Reflective Optics Spectrographic Imaging System-03 (ROSIS-03) sensor over the University of Pavia, Pavia, Italy. This sensor provides 115 spectral channels with a much higher spatial resolution than the first two datasets. After removing 12 noisy bands, 103 channels are kept for further processing. Nine classes of different materials are considered in these data, as the GT image in Fig. 4(c) shows.

4) Grss_dfc_2014 (0.2-m Resolution, 759 × 564):
The data were originally used in the 2014 IEEE GRSS Data Fusion Contest [70] and were collected over an urban area near Thetford Mines in Quebec, Canada. They contain 84 bands in the long-wave infrared range from 7.8 to 11.5 μm. Moreover, the spatial resolution is as fine as 0.2 m; thus, this is a very challenging dataset for HSI classification. The area contains seven different classes of land cover, as shown in Fig. 4(d).

B. Parameters Sensitivity Analysis
In the BGC method, the spatial contextual information is utilized in both the Bayesian-based data mass calculation and the joint data gravitation model by setting three square neighborhoods. Given that too small a neighborhood cannot fully describe the structural information and too large a neighborhood will increase the heterogeneity and computational time, we determine the values of W_spe, W_spa, and W_J for each dataset based on its spatial resolution and scene content.
W_spe decides the value of the local spectral density λ_{y_i}, whose formulation is an exponential function, as shown in (3). It is important to maintain the homogeneity of the local neighbors and thus keep the spectral density larger than 1; only in this way can the mass be an increasing function of the spatial prior density of the training samples, as illustrated in (9). Thus, W_spe should be small and is finally set to 5 for all the tested datasets.
For W_spa and W_J, because they jointly affect the value of the data gravitation, we conducted a combined test to select the parameters for each dataset by trial and error. The results are listed in Table I. Note that the settings of the training and testing samples are described in Section IV-C. For the Indian Pines dataset, W_spa and W_J were both varied over {3, 5, 7, 9, 11, 13, 15, 17, 19}. Fig. 5(a) presents the resulting overall classification accuracy (OA, in percent) for each pair of parameters. As shown, when W_spa equals {7, 9, 11} and W_J is set to {3, 5}, the BGC method produces the best OA. To reduce the computational time, W_spa and W_J are set to 7 and 3 for the Indian Pines, as listed in Table I. For the other three datasets, because of their much higher spatial resolution, we tested larger values of W_spa from {11, 13, 15, 17, 19, 21, 23, 25, 27, 29}. The corresponding experimental results are displayed in Fig. 5(b)-(d). Following the principle that a larger OA with a smaller neighborhood is better, W_spa and W_J for the Salinas, Pavia University, and Grss_dfc_2014 are set to {5, 21}, {5, 23}, and {9, 19}, respectively, as listed in Table I.
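The trial-and-error selection above amounts to a grid search over the two window sizes; a generic sketch, where `score_fn` is a hypothetical stand-in for a full train-and-evaluate run returning the OA (not shown here):

```python
import itertools

def grid_search(w_spa_values, w_j_values, score_fn):
    """Evaluate every (W_spa, W_J) pair and keep the best OA, preferring
    smaller neighborhoods when scores tie (larger OA, smaller window)."""
    best = None
    for w_spa, w_j in itertools.product(w_spa_values, w_j_values):
        oa = score_fn(w_spa, w_j)
        key = (oa, -(w_spa + w_j))      # maximize OA, then minimize windows
        if best is None or key > best[0]:
            best = (key, (w_spa, w_j))
    return best[1]

# Synthetic score surface peaking at (7, 3), mimicking Fig. 5(a).
best = grid_search(range(3, 20, 2), range(3, 8, 2),
                   lambda ws, wj: -abs(ws - 7) - abs(wj - 3))
print(best)  # -> (7, 3)
```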

C. Experimental Settings
In this article, to fully evaluate the performance of the proposed BGC method, two classical HSI classification algorithms, SVM and JSRC [14], and two improved variants of data gravitation-based classification, IDGC [60] and JDGC [63], are adopted for comparison. SVM and IDGC use only the original spectral bands, while JSRC and JDGC can jointly utilize the SS information through joint neighborhood learning. To keep the comparison fair, in the following experiments, the parameter settings of all four compared methods were also determined by trial and error, as summarized in Table I. For SVM, the RBF was selected as the kernel function, and the penalty coefficient and gamma parameter were estimated by fivefold cross validation. For JSRC, on all the datasets, the sparsity level was set to 3 and the regularization parameter was empirically set to 10^-2; the size of the joint neighborhood was set to 7, 7, 9, and 17, respectively. IDGC does not involve any parameters. For JDGC, the values of W_J were set to 7, 13, 15, and 15. In addition, for all five algorithms, the same settings of training samples were used, as listed in Tables II-V. As shown, 10% of the samples were randomly selected as training samples and the remaining 90% were used as testing samples for the Indian Pines image. For the Salinas, Pavia University, and Grss_dfc_2014 datasets, the percentages of training and testing samples were set to 1% and 99%, respectively.
For the quantitative evaluation, the average classification accuracy (AA) over all classes, the OA, and the kappa coefficient (κ) were calculated to compare the performance of the five classification methods. In addition, the running time was used to compare the computational efficiency of the methods. All the experiments were performed using MATLAB 2016b on a computer with a 3.0-GHz CPU and 8-GB memory.
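The three metrics can all be derived from the confusion matrix; a small self-contained sketch (in Python rather than the MATLAB used in the experiments):

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Overall accuracy (OA), average accuracy (AA), and the kappa
    coefficient, computed from the confusion matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    oa = np.trace(cm) / n
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # per-class accuracy, averaged
    # Chance agreement from the row (true) and column (predicted) marginals.
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

oa, aa, kappa = classification_scores([0, 0, 1, 1], [0, 0, 1, 0], n_classes=2)
print(oa, aa, kappa)  # 0.75 0.75 0.5
```

OA weighs every pixel equally, AA weighs every class equally (so rare classes such as classes 7 and 9 of the Indian Pines count fully), and κ discounts agreement expected by chance.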

A. Classification Results and Analysis
This section presents the qualitative and quantitative classification results as well as the running times on the four datasets. The classification maps are shown in Figs. 6-9, and the classification accuracies are summarized in Tables VI-IX. Table X summarizes the running times. For each test, the best results are marked in bold. Details and discussions of the experiments are given in the following.
1) Indian Pines: As shown in Fig. 6, the classification results of SVM and IDGC exhibit considerable salt-and-pepper noise. In comparison, the results of the JSRC, JDGC, and BGC methods are much better owing to the combination of SS information. Moreover, the joint neighborhood learning-based data gravitation methods, i.e., JDGC and BGC, show better classification capability than JSRC, as highlighted by the red circles in Fig. 6. This is because the neighbor-averaging mechanism of the joint model can alleviate the effects of noise, a property widely used in image denoising applications. Furthermore, the Bayesian-based model helps to improve the BGC method, as shown in Fig. 6(e). In short, BGC shows the best qualitative results for the Indian Pines. This is also confirmed by the quantitative results listed in Table VI, from which we can conclude that BGC generated the best classification accuracy for 15 out of the 16 crop classes and obtained the best OA, AA, and κ. The utilization of the data density distribution in the spectral space and the spatial distribution of the training samples may help to further exploit the SS information. Besides, for classes 7 and 9, which have only 3 and 2 training samples, respectively, the BGC method also achieved high classification accuracy. We therefore infer that BGC may also suit classification problems with limited samples.
2) Salinas: Fig. 7 shows the classification maps of the Salinas, and the corresponding classification accuracies are summarized in Table VII. Compared to the Indian Pines, the spatial resolution of the Salinas is much higher; thus, the spectral heterogeneity is more serious. In particular, the Celery and Vineyard classes, highlighted by the red circles in Fig. 7, are easily confused. As shown, SVM and IDGC can hardly distinguish these two classes, and the improvement of JSRC is limited. In comparison, the combination of SS information in JDGC and BGC effectively improves their classification results, and BGC again obtained the best classification accuracy, as illustrated in Table VII. This experiment demonstrates that BGC can better process HSIs with high spatial resolution.
3) Pavia University: Fig. 8 displays the classification maps of the Pavia University, and the corresponding classification accuracies are listed in Table VIII. Like the Salinas, the spatial resolution of the Pavia University is also very high; thus, abundant details result in significant spectral differences within classes. As shown in Fig. 8(a) and (b), the SVM, which uses only the spectral features, shows much misclassification, which is also confirmed by the quantitative evaluation results listed in Table VIII. Besides, although the SS feature-based JSRC and JDGC produced much better results, there is still considerable confusion between the Trees and the Meadows, as highlighted by the red circles. Moreover, compared to the results on the Indian Pines and Salinas, the difference between JDGC and BGC is enlarged. This may be because BGC considers the local density of pixels to calculate their masses, which can be seen as a weighting process for JDGC. This overcomes the problem that JDGC is greatly affected by its neighborhood heterogeneity [63]. Overall, the BGC method achieved the best classification results, with an OA up to 0.97, as shown in Table VIII.
4) Grss_dfc_2014: Fig. 9 and Table IX show the qualitative and quantitative results on the Grss_dfc_2014 generated by the five methods, respectively. Due to its very high spatial resolution and long-wavelength property, the classification difficulty of the Grss_dfc_2014 is much higher than that of the other three datasets. As shown in Table IX, the OA values of the different algorithms are relatively low. Nevertheless, the integration of SS information is still meaningful. As illustrated, the OA values of both SVM and IDGC are smaller than 0.7, while those of JSRC, JDGC, and BGC are all larger than 0.8. The improvement brought by the SS information can also be confirmed by the maps shown in Fig. 9(b), (c), and (e).
Especially in the circled area, although the very high spatial resolution prevents the image details from being fully displayed, the speckle problem of SVM and IDGC is obviously more serious. This further confirms that the proposed BGC is feasible for the classification of high-resolution HSIs.

5) Running Time: Table X lists the running time of the five compared methods on the four datasets. Although IDGC is the most efficient, its classification performance cannot meet the requirement, as shown in Figs. 6-9 and Tables VI-IX. Among the other four algorithms, the DGC-based methods, i.e., JDGC and BGC, are much more efficient than SVM and JSRC. This is mainly because SVM and JSRC are ELAs that require longer time for classification model training.
In particular, the Indian Pines dataset has the largest number of training samples (1031); thus, SVM takes the longest running time in this case. By contrast, the DGC-based methods are LLAs that only need to calculate the data similarity between testing pixels and training samples. In addition, although BGC requires longer running time than JDGC, it obtains the best performance on all four datasets, and its running time is much lower than that of SVM and JSRC.
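The lazy-learning decision described above can be sketched as follows. This is a minimal illustration of a gravitation-style rule (unit masses, inverse-square Euclidean distance, label chosen by the largest average gravitation per class), not the exact BGC formulation, which additionally weights distances by the Bayesian data mass:

```python
import numpy as np

def gravitation_classify(test_pixel, train_pixels, train_labels, eps=1e-12):
    """Label a test pixel by the class exerting the largest average
    data gravitation g = m / d^2 (unit mass m = 1 in this sketch)."""
    d2 = np.sum((train_pixels - test_pixel) ** 2, axis=1) + eps
    g = 1.0 / d2  # gravitation exerted by each training sample
    classes = np.unique(train_labels)
    avg_g = [g[train_labels == c].mean() for c in classes]
    return classes[int(np.argmax(avg_g))]

# toy spectra: two classes well separated in feature space
train = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
labels = np.array([0, 0, 1, 1])
print(gravitation_classify(np.array([0.85, 0.85]), train, labels))  # -> 1
```

Because there is no model fitting, the only cost per test pixel is the distance computation against the training set, which is why the DGC-based methods avoid the training overhead of SVM and JSRC.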

B. Comparison With the State-of-the-Art Approaches
To further test the performance of the proposed algorithm, we compare the BGC algorithm with eight state-of-the-art approaches, including generalized tensor regression (GTR) [71], invariant attribute profiles (IAPs) [72], SS manifold reconstruction preserving embedding (SSMRPE) [29], superpixelwise adaptive SSA (SpaSSA) [73], mini-batch graph convolutional networks (mini-GCNs) [74], 2DCNN [41], end-to-end fusion network (FuNet-C) [74], and CNN-enhanced graph convolutional network (CEGCN) [75]. These algorithms were all recently proposed for high-precision HSI classification. As to the parameter settings, the parameters of the first four algorithms are set by trial and error, while for the other four algorithms, we adopt the settings from the corresponding literature [41], [74], [75]. The parameter settings for BGC are kept the same as those in Table I. The comparison results are listed in Table XI. As shown in Table XI, the CEGCN with pixel- and superpixel-level feature fusion obtains results close to those of BGC on the first three datasets. The GTR, SpaSSA, IAP, and SSMRPE obtain relatively better classification performance than the other three deep learning-based algorithms in most cases. This is mainly because these algorithms deeply excavate the spectral features or comprehensively utilize the SS features, which further confirms the necessity of SS combination in HSI classification. In comparison, the BGC method does not involve any feature extraction process but can still effectively mine the SS information, achieving better OA and kappa coefficient, especially on the complex scene of Grss_dfc_2014. Besides, the computational efficiency of the BGC method also has an obvious advantage.

C. Analysis of Training Sample Size
In view of the good performance of BGC on the small-sample problem in the previous experiments, we further test its flexibility under different small-sample sizes. Specifically, we randomly select training samples by number and by percentage. For the number-based sampling (num/class), 3, 5, 10, 12, and 15 samples are selected per class. For the percentage-based sampling (per/class), the value is set to 1%, 3%, 5%, 8%, and 10% per class. The Indian Pines dataset is selected as the testing data because it contains 16 kinds of targets and several of its classes have only a few training samples. The parameter settings of BGC are consistent with those listed in Table I.
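The per-class sampling protocol above can be sketched as follows; the function name, the Laplace-style minimum of one sample, and the capping at class size are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def sample_per_class(labels, num=None, frac=None, rng=None):
    """Randomly pick training indices per class, either a fixed number
    (num/class) or a percentage of each class (per/class, e.g. frac=0.05)."""
    rng = np.random.default_rng(rng)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        k = num if num is not None else max(1, int(round(frac * idx.size)))
        k = min(k, idx.size)  # small classes contribute all they have
        picked.append(rng.choice(idx, size=k, replace=False))
    return np.concatenate(picked)

# toy ground truth with one very small class, as in the Indian Pines
labels = np.array([0] * 100 + [1] * 40 + [2] * 6)
train_idx = sample_per_class(labels, num=5, rng=0)
print(train_idx.size)  # -> 15 (5 per class)
```

Sampling by a fixed number keeps the classes balanced, whereas percentage-based sampling preserves the class proportions of the scene; the two settings stress the classifier in different ways.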
We compare the experimental results with a recently presented model, denoted as the 3D-Gabor method, which focuses on HSI classification with small samples [69]. The 3D-Gabor method also emphasizes the simultaneous utilization of the SS information. The results of BGC and of the 3D-Gabor method are summarized in Table XII; note that the results of the 3D-Gabor method are taken from its original paper. As can be seen, in all cases, the BGC algorithm remains superior to the 3D-Gabor method. This is because the 3D-Gabor method is essentially a handcrafted feature-based method that requires predefined image structures. In contrast, BGC does not need to design any feature extraction strategy based on predetermined image structures and only explores the simple SS information from the local area. This further verifies that BGC has better data mining capability than many other classification methods.

D. Discussions
The integration of SS information is essential for improving HSI classification. Although the simplicity of the data gravitation principle has motivated its application to HSIs, gravitation is hard to combine with spatial information. In contrast, our proposed BGC method can fully mine the SS information of the HSI and the training samples. The BGC thus possesses three specific properties.
First, the BGC method is an LLA with an innate incremental learning ability. Hence, it obtains outstanding classification performance on the Grss_dfc_2014 dataset, as shown in Fig. 9 and Table XI. Moreover, BGC spends less running time, as illustrated in Tables X and XI, since it does not require a training process. Second, the Bayesian-based data mass calculation combines the local spectral information of neighbors with the local spatial prior information of training samples; it thus effectively promotes the information mining of the limited training samples and improves the combination of the SS information of testing pixels and training samples. In addition, the proposed joint data gravitation model makes good use of the neighborhood pixels and alleviates the effects of heterogeneity in high-resolution HSIs.
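How a Bayesian data mass might combine the two sources of information can be sketched schematically. This is not the paper's exact formulation: the Gaussian-style spectral term, the Laplace-smoothed spatial prior, and the scale parameter `alpha` are all illustrative assumptions, chosen only to mirror the stated structure (an exponential of the local spectral distribution, modulated by a prior from surrounding training samples):

```python
import numpy as np

def data_mass(center, neighbors, n_train_c, n_train_total, alpha=1.0):
    """Schematic data mass of one testing pixel for one class c:
    large when the spectral neighborhood clusters tightly around the
    pixel and when class-c training samples dominate the local window."""
    # spectral term: mean squared distance of the neighbors to the pixel
    spec = np.mean(np.sum((neighbors - center) ** 2, axis=1))
    # spatial prior: Laplace-smoothed fraction of class-c training samples
    prior = (n_train_c + 1) / (n_train_total + 1)
    return np.exp(-alpha * spec) * prior
```

Under this sketch, a pixel in a spectrally homogeneous region surrounded by class-c training samples receives a large mass for class c, so its gravitation dominates the joint model; a pixel in a heterogeneous region is down-weighted, which matches the weighting interpretation discussed for the Pavia University results.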
Lastly, the BGC method does not involve any feature extraction process but can effectively mine the spectral and structural similarity of pixels. Its classification accuracy surpasses that of many state-of-the-art classification methods, as reported in Sections V-B and V-C. This also shows its high potential for the small-sample classification problem.

VI. CONCLUSION
This article presents a BGC method to further explore the SS information of HSIs. In BGC, the SS information of the neighboring pixels and training samples is first combined to calculate the mass of each pixel. Then, a joint gravitation model is designed based on the joint neighborhood learning strategy, in which the joint Euclidean distance is weighted by the data mass. In this way, the SS similarity of each neighboring testing pixel can be measured. Finally, the testing pixel is assigned to the class that exerts the largest average data gravitation on its neighbors. To verify the proposed method, four benchmark hyperspectral datasets, i.e., the Indian Pines, Salinas, Pavia University, and Grss_dfc_2014, are tested. Comparisons with SVM, JSRC, IDGC, JDGC, and eight other state-of-the-art classification methods confirm the superiority of the proposed BGC. The BGC also shows feasibility for HSI classification with limited samples. Nevertheless, its feasibility for extremely small training sample sizes needs further research.