1 Introduction

Hyperspectral image classification (HSIC) aims to assign a unique category label to each pixel in the image. It is a key technology for the intelligent interpretation of hyperspectral images (HSIs) and has been widely used in many fields, such as urban development planning [1, 2], agricultural land use [3,4,5], military target detection [6, 7] and medical pathological diagnosis [8, 9]. However, the low spatial resolution, high spectral dimensionality and scarcity of labelled samples in HSI pose great challenges to the classification task [10,11,12]. Early on, researchers proposed a series of feature extraction methods, such as principal component analysis [13, 14], independent component analysis [15,16,17] and linear discriminant analysis [17, 18], and combined them with machine learning classifiers such as support vector machines [19, 20], random forests [21, 22] and Gaussian mixture models [23, 24] to classify the HSI. These methods can effectively alleviate the Hughes phenomenon [25], whereby classification accuracy decreases as the spectral dimension increases, but because they consider only spectral features and rely on hand-crafted design, their classification accuracy and applicability are limited. It is therefore important to design an HSIC method with both high accuracy and broad applicability.

Recently, with the growth of computing power, deep learning models represented by the convolutional neural network (CNN) have been widely used in HSIC [26, 27]. Yue et al. [28] used a two-dimensional convolutional neural network (2D-CNN) to automatically extract deep features. However, an HSI is a three-dimensional cube, and 2D convolution can only extract spatial features, leaving the spectral features underexploited. Chen et al. [29] designed an HSIC model based on a three-dimensional convolutional neural network (3D-CNN), which introduced 3D convolution to directly extract spatial-spectral features and achieved better classification results. Zhong et al. [30] employed a 3D residual network, using the idea of residual learning to effectively alleviate the model degradation caused by deepening the network. Sun et al. [31] utilized spatial and spectral attention mechanisms to enhance the expression of HSI spatial-spectral features and improved the classification accuracy. Although 3D convolution allows both rich spatial and spectral features to be extracted, it also increases the number of network parameters. To address this problem, Roy et al. [32] developed a hybrid model (HybridSN) of 2D and 3D convolution, which obtains higher classification accuracy with fewer parameters. The above CNN-based methods have made some progress, but they perform convolutional operations only on regular image regions and cannot fully express the local features of irregular regions, leading to more misclassification in the small-sample case.

The Graph Convolutional Network (GCN) is a deep learning model that performs convolutional operations on graph-structured data based on graph theory. It can make full use of the spatial structure and neighbourhood information among nodes to effectively learn features of irregular data, and has been widely used in HSIC. Qin et al. [33] used a semi-supervised spectral-spatial graph convolution network, which effectively exploits the spatial neighbourhood information of images. Ding et al. [34] proposed a globally consistent graph convolutional HSIC model (GCGCN), which achieves global feature smoothing over all in-class samples by combining an adaptive global high-order graph structure with a two-layer network. Wan et al. [35] introduced a multi-scale dynamic graph convolution network, which fully exploits spatial information through multi-scale graph convolution and obtains a better feature representation. Shi et al. [36] used convolution to extract pixel-level features, combined them with superpixel segmentation results to obtain a more expressive graph structure, and then coupled a GCN with a Transformer encoder to refine the features, obtaining good classification results.

Graph-convolution-based HSIC models can effectively improve classification accuracy, but the weights between nodes are fixed once the graph is built, which limits the representation capability of the model. Veličković et al. [37] constructed the graph attention network, which improves the representational power of the model by introducing an attention mechanism that assigns weights to different nodes in the neighbourhood. Dong et al. [38] constructed a graph structure based on superpixels and extracted superpixel-level features of the image through a graph attention network, then achieved classification by weighted fusion with pixel-level features extracted by a CNN with spectral and spatial attention. Sha et al. [39] applied a graph attention network to HSIC and represented the relationships between neighbouring nodes adaptively, but did not fully use the deep spatial-spectral features of the HSI when constructing the graph structure. An HSIC model (S2RGANet) using a spatial-spectral residual graph attention mechanism was proposed by Xu et al. [40]. This model effectively improves classification accuracy by constructing a deep convolutional residual module and introducing a graph attention network to capture the more important spatial information, but its training time is long. Therefore, how to combine the advantages of CNNs and graph attention networks to design an end-to-end model that reduces training time while improving accuracy remains an open problem.

To address the above problems, this paper proposes an HSIC model (3D-2D-GAT) based on a combination of 3D-2D hybrid convolution and a graph attention mechanism network. The main contributions can be summarized as follows.

  • Combining the respective advantages of 2D and 3D convolution, a feature extraction module based on hybrid convolution is constructed, which can quickly extract discriminative deep spatial-spectral features of various ground objects in the HSI.

  • Samples are randomly selected across the whole image and the adjacency matrix is constructed from their deep spatial-spectral features, so that the graph structure better expresses the spectral and spatial connections between different nodes.

  • GAT is used to adaptively adjust the weights between different nodes and learn long-range spatial relationships in the hyperspectral data, which improves the model's ability to handle intra-class variation and inter-class similarity among samples.

  • A new end-to-end HSIC model is proposed, in which the hybrid convolutional feature extraction module and the GAT module work collaboratively to improve classification accuracy. Experiments on three HSI datasets confirm the superiority of our method.

2 Proposed Method

2.1 Model Overview

The overall flow of our proposed method is shown in Fig. 1. The method consists of three main parts: feature extraction based on hybrid convolution, graph construction using K-Nearest Neighbours (KNN), and classification with a Graph Attention Network (GAT). Firstly, spatial neighbourhood blocks are extracted from the original HSI and passed through the hybrid convolutional network to obtain deep spatial-spectral features. Then, KNN is adopted to construct a graph structure over the extracted deep spatial-spectral features. Finally, the features of the graph nodes are extracted using GAT and classified by a softmax classifier to obtain the classification result of the image.

Fig. 1 Overall flow of the proposed method

2.2 Hybrid Convolutional Feature Extraction Network

2D-CNNs use two-dimensional convolutional kernels to extract local features of the HSI via window sliding, but this approach extracts only spatial features and cannot capture the spectral features of the image. 3D-CNNs use the original image directly as network input without complicated preprocessing and can extract both spatial and spectral features, but suffer from high computational complexity. To combine the advantages of the two, a hybrid convolutional feature extraction network is proposed in this paper, as shown in Fig. 1. It consists of three 3D convolutional layers, two 2D convolutional layers and one fully connected layer.

The input data blocks first undergo 3D convolution to jointly learn the spatial-spectral features of the HSI. The parameters of the three 3D convolutional layers are set to (8, (3, 3, 3)), (16, (3, 3, 3)) and (32, (3, 3, 3)), respectively, and the stride of all three layers is (1, 1, 1). The features extracted by the 3D convolutions are then flattened along the spectral axis with a Reshape operation. However, as the network deepens, detailed features are gradually lost and the amount of computation grows. Therefore, to further enhance the local spatial features of the HSI, 2D convolution is applied on top of the 3D convolutions. The parameters of the two 2D convolutional layers are set to (64, (3, 3)) and (128, (3, 3)), respectively, and their stride is (1, 1). Finally, the features are fed into the fully connected layer to obtain a feature vector of 256 channels. A sketch of this extractor is given below.
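The following Keras sketch mirrors the layer settings above. The input window size, the number of retained bands, the valid padding and the ReLU activations are illustrative assumptions; the actual window size is dataset-dependent (see Sect. 4.1).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Sketch of the hybrid convolutional feature extractor; window and bands
# are assumed values for illustration, not fixed by the paper here.
def build_hybrid_extractor(window=15, bands=30):
    inputs = layers.Input(shape=(window, window, bands, 1))
    # Three 3D convolutional layers: (8,(3,3,3)), (16,(3,3,3)), (32,(3,3,3))
    x = layers.Conv3D(8, (3, 3, 3), strides=(1, 1, 1), activation='relu')(inputs)
    x = layers.Conv3D(16, (3, 3, 3), strides=(1, 1, 1), activation='relu')(x)
    x = layers.Conv3D(32, (3, 3, 3), strides=(1, 1, 1), activation='relu')(x)
    # Reshape: merge the spectral and channel axes so 2D convolution can follow
    s = x.shape
    x = layers.Reshape((s[1], s[2], s[3] * s[4]))(x)
    # Two 2D convolutional layers: (64,(3,3)) and (128,(3,3))
    x = layers.Conv2D(64, (3, 3), strides=(1, 1), activation='relu')(x)
    x = layers.Conv2D(128, (3, 3), strides=(1, 1), activation='relu')(x)
    x = layers.Flatten()(x)
    # Fully connected layer producing the 256-channel node feature
    outputs = layers.Dense(256, activation='relu')(x)
    return models.Model(inputs, outputs)
```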

2.3 Construction of Graph Structure Based on KNN

GAT is a multilayer neural network that operates only on graph-structured data. Let the feature map produced by the hybrid convolutional network be \(X\). Each pixel location on \(X\) is defined as a node, and its feature vector along the channel direction is the initial feature of the node, so that \(X \in R^{N \times C}\), where \(N\) is the number of nodes and \(C\) is the feature dimension.

The Euclidean distance \(d_{ij}\) between any two nodes can be expressed as:

$$ d_{ij} = \left( {\sum\limits_{c = 1}^{C} {\left| {x_{ic} - x_{jc} } \right|^{2} } } \right)^{\frac{1}{2}} $$
(1)

The distance matrix \(D = \left[ {d_{ij} } \right] \in R^{N \times N}\) is obtained according to the above equation. Then, each row of \(D\) is sorted in ascending order, and the K nodes nearest to each node are selected as its neighbours. Edges are established between these nodes, so the adjacency matrix \(A \in R^{N \times N}\) of the graph can be expressed as:

$$ A_{ij} = \begin{cases} 1, & X_{i} \in {\text{KNN}}\left( {X_{j} } \right){\text{ or }}X_{j} \in {\text{KNN}}\left( {X_{i} } \right) \\ 0, & {\text{otherwise}} \end{cases} $$
(2)
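A minimal NumPy sketch of this construction, assuming the node features \(X\) are the deep spatial-spectral outputs of the extractor above and that self-loops are excluded from the neighbour ranking (the dense pairwise-distance matrix is fine for the randomly selected sample sets used here):

```python
import numpy as np

# KNN graph construction, Eqs. (1)-(2). X: (N, C) deep spatial-spectral
# node features; K: number of neighbours (tuned per dataset in Sect. 4.1.1).
def knn_adjacency(X, K):
    N = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]          # pairwise differences
    D = np.sqrt((diff ** 2).sum(axis=-1))         # Euclidean distances, Eq. (1)
    np.fill_diagonal(D, np.inf)                   # exclude self from the ranking
    nn = np.argsort(D, axis=1)[:, :K]             # K nearest nodes per row
    A = np.zeros((N, N), dtype=np.float32)
    A[np.repeat(np.arange(N), K), nn.ravel()] = 1.0
    return np.maximum(A, A.T)                     # symmetric "or" of Eq. (2)
```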

2.4 Graph Attention Network

GAT [36,37,38] is a variant of the Graph Neural Network (GNN) whose core building block is the graph attention layer (GAL). GAT iteratively updates the representation of each node by aggregating the representations of its neighbours with a multi-head attention mechanism, thus adaptively assigning weights to different neighbouring nodes.

The input to a GAL is a set of vectors \(V = (v_{1} ,v_{2} ,...,v_{N} )\), \(v_{i} \in {\mathbb{R}}^{F}\), where \(N\) is the number of nodes and \(F\) is the feature dimension of each vector. The output is a new set of vectors \(V^{\prime} = (v^{\prime}_{1} ,v^{\prime}_{2} ,...,v^{\prime}_{N} )\), \(v^{\prime}_{i} \in {\mathbb{R}}^{{F^{\prime}}}\), where the feature dimension \(F^{\prime}\) may differ from \(F\).

To convert the input features into higher-level features with sufficient expressiveness, the features of each node are parameterized using a weight matrix. The computation of the attention coefficient between node \(i\) and node \(j\) is expressed as [41]:

$$ e_{ij} {\text{ = LeakyReLU}}\left[ {\left( {{\vec{\text{a}}}} \right)^{T} \left( {Wv_{i} \parallel Wv_{j} } \right)} \right] $$
(3)

where \(W \in {\mathbb{R}}^{{F^{\prime} \times F}}\) is a weight matrix and \({\vec{\text{a}}} \in {\mathbb{R}}^{{2F^{\prime}}}\) is the weight vector of a single-layer feed-forward neural network. \((\cdot)^{T}\) denotes transposition and \(\parallel\) denotes concatenation. LeakyReLU is the nonlinear activation applied to the raw attention coefficient between node \(i\) and node \(j\). The softmax function then normalizes \(e_{ij}\) over the neighbourhood \({\mathbb{N}}_{i}\) of node \(i\), giving the final attention coefficient \(\alpha_{ij}\):

$$ \alpha_{ij} = {\text{softmax}}_{j} \left( {e_{ij} } \right) = \frac{{\exp \left( {e_{ij} } \right)}}{{\sum\limits_{{k \in {\mathbb{N}}_{i} }} {\exp \left( {e_{ik} } \right)} }} $$
(4)

Next, the aggregated feature \(v^{\prime}_{i}\) of node \(i\) is calculated from the parameterized features and the attention coefficients. A multi-head attention mechanism allows the model to learn more stable features: the computation is repeated independently \(H\) times and the results are concatenated to form the final aggregated feature of node \(i\):

$$ v_{i}^{\prime } = \mathop \parallel \limits_{h = 1}^{H} \sigma \left( {\sum\limits_{{j \in {\mathbb{N}}_{i} }} {\alpha_{ij}^{h} W^{h} v_{j} } } \right) $$
(5)

where \(\sigma (\cdot)\) represents the ReLU activation function [42, 43], \(\alpha_{ij}^{h}\) is the normalized attention coefficient produced by the hth attention head and \(W^{h}\) is the corresponding weight matrix.
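A NumPy sketch of one graph attention layer realizing Eqs. (3)-(5); the TensorFlow implementation used in the experiments would be analogous. Splitting \({\vec{\text{a}}}\) into source and destination halves is the standard GAT decomposition, and the shapes here are assumptions.

```python
import numpy as np

# One graph attention layer, Eqs. (3)-(5). V: (N, F) node features;
# A: (N, N) adjacency from Eq. (2); W: (F_out, F); a: (2 * F_out,).
def graph_attention_layer(V, A, W, a, slope=0.2):
    H = V @ W.T                                   # Wv_i for all nodes, (N, F_out)
    f = H.shape[1]
    src = (H @ a[:f])[:, None]                    # a_1^T Wv_i, column vector
    dst = (H @ a[f:])[None, :]                    # a_2^T Wv_j, row vector
    e = src + dst                                 # a^T [Wv_i || Wv_j]
    e = np.where(e > 0, e, slope * e)             # LeakyReLU, Eq. (3)
    e = np.where(A > 0, e, -np.inf)               # restrict to neighbours
    w = np.exp(e - e.max(axis=1, keepdims=True))  # KNN guarantees >= K neighbours
    attn = w / w.sum(axis=1, keepdims=True)       # softmax, Eq. (4)
    return np.maximum(attn @ H, 0.0)              # ReLU aggregation, one head

# Multi-head version of Eq. (5): H independent heads, concatenated.
def gat_layer(V, A, Ws, a_vecs):
    return np.concatenate(
        [graph_attention_layer(V, A, W, a) for W, a in zip(Ws, a_vecs)], axis=1)
```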

To maintain high computational efficiency, the number of graph attention layers is set to 3. The overall computational procedure for GAT in this paper is:

$$ V_{out} = GAL\left( {GAL\left( {GAL\left( {V_{in} ,A} \right),A} \right),A} \right) $$
(6)

where \(V_{in}\) and \(V_{out}\) denote the input and output of GAT, respectively, and \(A\) is the adjacency matrix of the graph constructed using KNN.
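With the layer sketched above, Eq. (6) amounts to pushing the node features through three stacked layers that share the adjacency matrix; the per-layer head counts and widths are illustrative assumptions (Sect. 4.1.1 tunes H per dataset).

```python
# Eq. (6): V_out = GAL(GAL(GAL(V_in, A), A), A). params is a list of three
# (Ws, a_vecs) pairs, one per layer, reusing gat_layer from the sketch above.
def gat_forward(V_in, A, params):
    V = V_in
    for Ws, a_vecs in params:
        V = gat_layer(V, A, Ws, a_vecs)
    return V
```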

2.5 The Classification Module

The new node features extracted by GAT are input to the fully connected layer to obtain global features \(F_{i}\). Then, they are fed into softmax for classification, which can be represented as:

$$ y = {\text{softmax}}\left( {W_{y} F_{i} + b_{y} } \right) $$
(7)

where \(W_{y}\) is the learnable weight matrix and \(b_{y}\) is the bias term. We use the cross-entropy loss between the model prediction and the label, which can be expressed as:

$$ Loss = - \frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {y_{i} \log \left( {y_{i}^{\prime } } \right) + \left( {1 - y_{i} } \right)\log \left( {1 - y_{i}^{\prime } } \right)} \right)} $$
(8)

where \(N\) denotes the number of samples, \(y_{i}\) represents the true label and \(y^{\prime}_{i}\) represents the predicted value.
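A minimal Keras sketch of this classification module; the 256-dimensional GAT output and the class count (16 for IP) are assumptions carried over from the earlier sketches.

```python
import tensorflow as tf

# Softmax classification head, Eq. (7), on top of the GAT node features.
num_classes = 16                      # e.g. the IP dataset has 16 classes
head = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(256,)),
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
# Cross-entropy between one-hot labels and predictions realizes Eq. (8);
# Adam and the 1e-4 learning rate follow Sect. 4.
head.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
             loss=tf.keras.losses.CategoricalCrossentropy(),
             metrics=['accuracy'])
```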

Fig. 2 Classification results of different spatial window sizes on the IP, PU and SV datasets

Fig. 3 Classification results of different K and H

The overall process of GAT in HSI classification is shown in detail in Algorithm 1.

Algorithm 1 GAT in HSI classification

3 Experiment and Result Analysis

3.1 Dataset Description

We used three HSI datasets of Indian Pines (IP), Pavia University (PU) and Salinas Valley (SV) for performance evaluation.

  • The IP dataset was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over Indiana, USA. The image size is 145 × 145 pixels, the spectral coverage is 400 to 2500 nm, and the spatial resolution is 20 m. After removing the bands affected by noise, 200 bands remain for classification. The IP dataset has 10,249 labelled samples in 16 classes. The false-colour composite image (bands 30, 60 and 90) and the corresponding ground-truth map are shown in Fig. 4a and b.

  • The PU dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over the University of Pavia, Italy. The image size is 610 × 340 pixels, the spectral coverage is 430 to 860 nm, and the spatial resolution is 1.3 m. After removing the bands affected by noise, 103 bands remain for classification. The PU dataset has 42,776 labelled samples in 9 classes. The false-colour composite image (bands 20, 60 and 80) and the corresponding ground-truth map are shown in Fig. 5a and b.

  • The SV dataset was acquired by the AVIRIS sensor over the Salinas Valley, California. The image size is 512 × 217 pixels, the spectral coverage is 400 to 2500 nm, and the spatial resolution is 3.7 m. After removing the bands affected by noise, 204 bands remain for classification. The SV dataset has 54,129 labelled samples in 16 classes. The false-colour composite image (bands 30, 60 and 120) and the corresponding ground-truth map are shown in Fig. 6a and b.

Fig. 4 Classification maps for the IP dataset. a False-colour composite image (30, 60, 90). b Ground truth. c 2D-CNN. d 3D-CNN. e HybridSN. f GCGCN. g G2T. h WFCG. i S2RGANet. j 3D-2D-GAT

Fig. 5 Classification maps for the PU dataset. a False-colour composite image (20, 60, 80). b Ground truth. c 2D-CNN. d 3D-CNN. e HybridSN. f GCGCN. g G2T. h WFCG. i S2RGANet. j 3D-2D-GAT

Fig. 6 Classification maps for the SV dataset. a False-colour composite image (30, 60, 120). b Ground truth. c 2D-CNN. d 3D-CNN. e HybridSN. f GCGCN. g G2T. h WFCG. i S2RGANet. j 3D-2D-GAT

Our experimental environment is built on Windows 10 using Python and the open-source deep learning framework TensorFlow. The hardware is an Intel Core i9-12900KF CPU, 32 GB of RAM and an Nvidia GeForce RTX 3060 Ti graphics card.

We use Overall Accuracy (OA), Average Accuracy (AA), and Kappa coefficient as classification accuracy evaluation metrics to measure the performance of the classification. The OA, AA and Kappa coefficient can be defined as [44,45,46]:

$$ {\text{OA = }}\frac{{\sum {p_{i} } }}{{\sum {t_{i} } }} $$
(9)
$$ {\text{AA = }}\frac{{\sum {\frac{{p_{i} }}{{t_{i} }}} }}{N} $$
(10)
$$ {\text{Kappa = }}\frac{{{\text{OA}} - \frac{{\sum {p_{i} \times t_{i} } }}{{n^{2} }}}}{{1 - \frac{{\sum {p_{i} \times t_{i} } }}{{n^{2} }}}} $$
(11)

where \(p_{i}\) is the number of correctly classified samples of the ith class, \(t_{i}\) is the number of samples of the ith class in the ground-truth data, \(N\) is the number of classes, and \(n = \sum {t_{i} }\) is the total number of samples.
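A small sketch of these metrics from the per-class counts, assuming \(n\) is the total sample count as defined above:

```python
import numpy as np

# OA, AA and Kappa from per-class counts, Eqs. (9)-(11). p[i]: correctly
# classified samples of class i; t[i]: ground-truth samples of class i.
def oa_aa_kappa(p, t):
    p, t = np.asarray(p, dtype=float), np.asarray(t, dtype=float)
    n = t.sum()                          # total number of samples
    oa = p.sum() / n                     # Eq. (9)
    aa = (p / t).mean()                  # Eq. (10)
    pe = (p * t).sum() / n ** 2          # chance-agreement term
    return oa, aa, (oa - pe) / (1 - pe)  # Eq. (11)

# Example: three classes with 90/80/70 correct out of 100 samples each
print(oa_aa_kappa([90, 80, 70], [100, 100, 100]))  # ~ (0.80, 0.80, 0.73)
```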

To verify the classification effectiveness of the proposed method, 5% of the samples of each ground-object class were randomly selected as the training set on the IP dataset, and 1% on the PU and SV datasets; the remaining samples were used as the test set. Samples were randomly selected 10 times for all experiments and the average results are reported. A sketch of this per-class split is given below.
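A minimal sketch of this per-class random sampling; the convention that label 0 marks unlabelled pixels is an assumption.

```python
import numpy as np

# Per-class random split. labels: (N,) ground-truth vector with 0 assumed
# to mark unlabelled pixels; ratio: 0.05 for IP, 0.01 for PU and SV.
def split_per_class(labels, ratio, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels[labels > 0]):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        n_train = max(1, int(round(ratio * idx.size)))
        train_idx.append(idx[:n_train])
    train_idx = np.concatenate(train_idx)
    test_idx = np.setdiff1d(np.flatnonzero(labels > 0), train_idx)
    return train_idx, test_idx
```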

4 Parameter Settings

The proposed model uses Adam [25] as the optimizer during training. The number of training epochs is set to 300, the batch size to 16, and the learning rate to 0.0001. The main hyperparameters of the model are the spatial neighbourhood window size, the number of neighbour nodes K in KNN and the number of heads H of the multi-head attention mechanism in GAT; their effects are analysed experimentally below.

4.1 Influence of Spatial Window Size

The spatial neighbourhood information contained in different window sizes affects the classification results of the model, so experiments are needed to find the optimal window size on the IP, PU and SV datasets. The results are shown in Fig. 2. As the window size increases, the spatial information available to the model grows and the classification accuracy rises. When the window is too large, however, the classification accuracy decreases because of the increase in redundant spatial information. Therefore, the input window size is set to 15 × 15 for the IP dataset, 19 × 19 for the PU dataset and 17 × 17 for the SV dataset.

4.1.1 Influence of Different K and H

During training, the number of neighbour nodes in the KNN algorithm and the number of heads of the multi-head attention mechanism in GAT strongly affect the learning of the model. Therefore, we investigated their effects on model performance on the three datasets.

The optimal values of the number of neighbour nodes K and the number of attention heads H are explored experimentally between 1 and 10 on the IP, PU and SV datasets, and the experimental results are shown in Fig. 3.

From Fig. 3, it can be seen that the results are poor when K is small and that performance improves as K increases, while too large a K causes the graph structure to no longer accurately express the spatial connections between samples, which hurts accuracy. For a fixed K, changing the number of attention heads H has no significant effect on performance, but with smaller H the model converges more easily to a higher accuracy within the limited number of training epochs. Based on these results, K is set to 4 and H to 2 for the IP dataset; K is set to 6 and H to 2 for the PU dataset; and K is set to 6 and H to 1 for the SV dataset.

4.2 Ablation Experiments and Analysis

To verify the effectiveness of the hybrid convolutional structure and GAT in the proposed model, the network is divided into three components, the 3D convolutional layers, the 2D convolutional layers and GAT, and an ablation study is conducted on the three datasets. The results are shown in Table 1. On the IP dataset, the OA obtained using only the 3D-CNN is 85.30%; it improves by 7.93% with the hybrid convolutional structure, and by 13.89% when both the hybrid convolutional structure and GAT are used. On the PU dataset, the OA is 89.93% with only the 3D-CNN, improving by 6.74% with the hybrid convolutional structure and by 9.80% with both the hybrid convolutional structure and GAT. On the SV dataset, the OA is 88.67% with only the 3D-CNN, improving by 6.68% with the hybrid convolutional structure and by 10.76% with both. These results show that the hybrid convolutional structure and GAT both contribute to the classification accuracy of the model.

Table 1 Ablation experiments results on IP, PU and SV dataset

4.3 Comparative Experiments and Analysis

To verify the effectiveness of our proposed method, 2D-CNN [28], 3D-CNN [30], HybridSN [32], GCGCN [34], G2T [36], WFCG [38] and S2RGANet [39] were selected for comparison. To ensure fairness, the comparison methods were configured with their optimal parameters, and all methods used the same training and testing sets. Tables 2, 3 and 4 list the classification results of the different methods on the IP, PU and SV datasets. From these tables, the following conclusions can be drawn:

  1. 3D-2D-GAT achieves the best classification results among all compared methods. Compared to 2D-CNN, 3D-CNN, HybridSN, GCGCN, G2T, WFCG and S2RGANet, on the IP dataset the OA of 3D-2D-GAT is improved by 19.3, 13.65, 5.87, 4.06, 3.01, 2.8 and 0.89%, respectively; AA by 19.08, 12.14, 6.75, 3.26, 1.92, 2.03 and 0.62%, respectively; and Kappa by 22.07, 15.57, 6.69, 4.63, 3.43, 3.19 and 1.01%, respectively. On the PU dataset, the OA improves by 9.04, 7.42, 3.87, 2.35, 1.91, 1.62 and 0.57%, respectively; AA by 12.03, 8.36, 5.51, 2.95, 2.83, 2.42 and 1.22%, respectively; and Kappa by 11.96, 9.89, 5.13, 3.13, 2.55, 2.16 and 0.78%, respectively. On the SV dataset, the OA improves by 13.09, 9.88, 3.93, 2.68, 2.17, 1.58 and 0.65%, respectively; AA by 8.1, 5.4, 2.54, 2.28, 1.34, 1.28 and 0.44%, respectively; and Kappa by 14.58, 11.05, 4.39, 2.99, 2.42, 1.76 and 0.73%, respectively.

  2. Compared with the methods using only 2D-CNN or 3D-CNN, HybridSN shows a clear improvement in classification accuracy on all datasets, indicating that employing both 3D and 2D convolution extracts the spatial-spectral features of the HSI more adequately.

  3. GCGCN, WFCG, G2T, S2RGANet and 3D-2D-GAT all use graph structures for feature analysis, and their classification accuracies are higher than those of the methods using ordinary convolution alone. This is because the graph structure can better capture the intra-class variation and inter-class similarity in hyperspectral data.

  4. Compared with GCGCN, the classification accuracies of S2RGANet and 3D-2D-GAT, which use the graph attention network, are significantly improved, indicating that dynamically adjusting the weights between nodes improves the expressiveness of the network.

  5. G2T adopts a new graph-guided Transformer structure, but its accuracy is still lower than that of S2RGANet and 3D-2D-GAT, which use the graph attention network. For graph structure learning, the graph attention network therefore remains a structure with strong learning ability.

  6. Unlike S2RGANet and 3D-2D-GAT, which construct the graph from pixels, GCGCN, WFCG and G2T all construct the graph from superpixels, which reduces the complexity of image processing but tends to lose detail, limiting the achievable accuracy; the classification results of these models are accordingly poorer.

  7. S2RGANet uses only spectral features to construct the graph structure, ignoring important spatial features, whereas the 3D-2D-GAT model employs the deep spatial-spectral features extracted by the hybrid convolutional network to construct the graph, making full use of the spatial information and effectively improving the classification accuracy.

Table 2 Comparison of classification results of different methods on IP dataset (%)
Table 3 Comparison of classification results of different methods on PU dataset (%)
Table 4 Comparison of classification results of different methods on SV dataset (%)

Figures 4, 5 and 6 show the classification maps of the different methods on the three datasets. The results of 2D-CNN and 3D-CNN exhibit serious salt-and-pepper noise; their overall classification quality is poor and deviates considerably from the ground-truth maps. HybridSN and GCGCN have limited feature learning ability, and the many misclassified samples in their maps degrade the overall result. WFCG and S2RGANet combine CNN and GAT, giving good feature learning ability and relatively smooth classification maps. 3D-2D-GAT produces the fewest misclassified pixels; its classification map is the smoothest, with only a few noisy points, and is closest to the ground truth.

4.4 Computational Efficiency and Time Consumption

To compare the computational efficiency between different methods, experiments were performed using IP, PU and SV datasets, and the results are shown in Table 5.

Table 5 Comparison of computational efficiency of different methods on IP, PU and SV datasets

The timing statistics show that 3D-CNN requires longer training than 2D-CNN because it has more parameters, whereas HybridSN, by combining 3D-CNN and 2D-CNN, effectively reduces the number of parameters and improves computational efficiency. Compared with the convolution-based models, GCGCN uses no convolutional layers and therefore has an advantage in running speed. S2RGANet uses only the spectral features extracted by the 3D-CNN and leaves the extraction of spatial features to GAT, which increases the computational burden of the model. Since WFCG and G2T build their graphs from superpixels, they are computationally more efficient than S2RGANet and 3D-2D-GAT, which build graphs from pixels. In contrast, 3D-2D-GAT first uses the 3D-CNN and 2D-CNN to extract deep spatial-spectral features and only then constructs and processes the graph, which reduces the computational cost of the graph attention module and speeds up the model. Although the running time of 3D-2D-GAT is slightly higher than that of some comparison methods, it remains competitive given its stable performance and fast test time.

4.5 Influence of Small Samples

To verify the effect of the number of training samples on the performance of the different classification methods under small-sample conditions, experiments were conducted on the IP, PU and SV datasets, with 5, 10, 15, 20 and 25 samples randomly selected from each ground-object class as training samples; the results are shown in Fig. 7. The trend of the OAs under the different training set sizes shows that increasing the number of training samples improves the OA of all methods. With limited training samples, the proposed 3D-2D-GAT is more robust and adaptable and achieves better classification accuracy than the seven compared methods.

Fig. 7 Classification performance of each method with different numbers of training samples on the IP, PU and SV datasets

5 Conclusion

In this paper, a new end-to-end HSIC method is proposed that improves classification performance through three key techniques: deep spatial-spectral feature extraction, KNN-based graph construction and a graph attention mechanism. The model uses a 3D-2D hybrid convolutional network to extract deep spatial-spectral features from the HSI. In the KNN-based graph construction, these deep spatial-spectral features improve the representation of the graph structure and bridge the two modules. GAT is then used to learn long-range spatial relationships in the data, and the resulting features are used for classification. Our proposed model shows clear advantages over existing HSIC methods.

In our future research work, we will introduce a semi-supervised method based on this model to further improve the classification accuracy of HSI under small sample conditions.