1 Introduction

Graphs are arguably the data representation closest to the real world: social relations, knowledge, chemical structures, and proteins are all naturally graph-structured. However, high-quality labels for such data are difficult to obtain for supervised learning, because labeling graph nodes, such as unknown molecules, non-trivial proteins, or social relations, requires extensive domain knowledge. Self-Supervised Graph Representation Learning (SSGRL) derives supervision signals from the data itself, enabling large-scale models to be trained on massive unlabeled data through carefully designed proxy tasks. SSGRL is therefore well suited to real-world graph data without pre-defined labels. Recently, SSGRL has been widely used in knowledge engineering [40] and bio-informatics [7], and it has become a research hotspot in intelligent applications, with excellent results in traffic prediction [46], protein prediction [48], and bioinformatics [49].

Contrastive learning (CL)-based [25, 33, 45] and generative learning (GL)-based [14, 18, 28] methods are now the two most popular families of SSGRL approaches. Contrastive methods drive self-supervised graph representation learning by maximizing the similarity between positive sample nodes and minimizing the similarity between negative sample nodes. Positive sample nodes are produced by graph augmentation techniques, and the remaining nodes serve as negative samples. These methods require expensive training procedures and powerful graph augmentation techniques, and it is difficult for current augmentation approaches to yield the best results across varied graph data. As a result, CL-based SSGRL over-depends on high-quality data augmentation (Fig. 1).

Fig. 1: In earlier studies, models based on generative learning corrupt the original graph structure by removing some edges; the damaged structure is then restored

Generative learning-based methods such as the self-supervised graph autoencoder (GAE) [18] avoid the above problems because they aim to reconstruct corrupted data without additional data augmentation. Early methods of this kind [18, 23, 28] drive self-supervised graph representation learning by reconstructing the adjacency matrix, so they focus only on graph structure information and ignore the rich node information. In the latest research [14, 16, 24], methods represented by GraphMAE [14] achieve state-of-the-art results. GraphMAE performs self-supervised graph learning by masking and restoring graph nodes, which takes full advantage of the node information and trains more powerful graph encoders. However, since such methods mainly emphasize node information and ignore graph structure information, they usually cannot achieve better results on graph classification, a downstream task that depends on structure information.

In general, most generative learning methods in current research fall into two categories: those that reconstruct the graph structure and those that reconstruct the graph node information. Both have their advantages, and this paper combines them so that the model can be applied to a wider range of tasks. The key challenge addressed in this paper is how to combine the two effectively so that the model does not conflict when reconstructing graph structure information and graph node information. To this end, this paper proposes a node and edge dual-masked self-supervised graph representation model that captures both the node information and the deep structure information of the graph. First, a dual masking model performs node masking and edge masking on the original graph simultaneously to generate two masked graphs \({\mathcal {G}}^{1}_{mask}\) and \({\mathcal {G}}^{2}_{mask}\). Second, a graph encoder \(G^{e}_{\theta }\) encodes the two masked graphs to obtain graph representations \(H_{1}\) and \(H_{2}\). Then, two reconstruction decoders \(G^{d1}_{\theta }\) and \(G^{d2}_{\theta }\) reconstruct the nodes and edges from the masked graphs. Finally, the reconstructed nodes and edges are compared with the original ones to compute the loss values without using any label information.

The method has been verified on a large number of graph node classification and graph classification tasks and achieves good performance; on graph classification in particular, it yields an improvement of 0.5–2% and clearly outperforms the previous state of the art. In summary, the work has the following highlights:

  • We are the first to propose masked learning that simultaneously masks and reconstructs both nodes and edges, and we achieve state-of-the-art results in self-supervised graph representation;

  • Compared to CL-based methods, our method does not rely on high-quality graph augmentation and thus can be used on real-world datasets;

  • Compared with the state-of-the-art methods, our method considers both the node information and the structure information of the graph;

2 Related work

2.1 Graph neural network

In early research [21, 22, 26], the primary strategy of graph algorithms was graph embedding. To apply machine learning to graphs, this type of method generates node sequences via random walks to obtain node embeddings. However, the representations obtained by most graph embedding techniques are ineffective in downstream tasks, since they focus only on graph structure and ignore the rich node information. These difficulties were resolved by the development of graph convolutional neural networks. The first study [5] fully exploits node feature information and graph structure information by fitting convolution kernels with Chebyshev polynomials. GCN [17] then significantly reduces the complexity of the former using a first-order approximation. Because the first two methods are difficult to apply to large graphs, GraphSAGE [8] obtains node representations by aggregating first-order neighbor information. To produce more accurate graph representations, GAT [34] builds on GraphSAGE by treating edge weights as learnable parameters. Graph neural networks have so far outperformed earlier graph embedding techniques and have become the dominant approach to graph representation. However, these supervised methods [6, 41] usually require high-quality graph labels, and obtaining such labels usually requires extensive domain expertise. Self-supervised graph representation learning is therefore the most effective way to address this problem.

2.2 Contrastive self-supervised graph learning

Contrastive learning first achieved great success in computer vision, producing many excellent methods [3, 13]. Inspired by this, many researchers began to study how to apply contrastive learning to self-supervised graph representation learning. DGI [35] first introduced the concept of mutual information into graph representation learning; it maximizes the mutual information between the augmented graph representation and the extracted graph summary. Subsequently, GMI [25] and InfoGraph [31] improved the method so that mutual information can be better used in the graph domain. MoCo [11] improves on the NCE loss, which is based on cross-entropy, and proposes the InfoNCE loss; GRACE [50] brings InfoNCE to graph representation learning. GraphCL [45] leverages the ideas of SimCLR [3] to study how different data augmentations affect self-supervised graph representation learning.

The above methods are usually extremely expensive to train, since they require a large number of negative samples to prevent model collapse. Inspired by SimSiam's [4] idea of using prediction instead of contrast and a momentum encoder to avoid collapse, BGRL [33] first proposed contrastive learning without negative samples for graph representations. LaGraph [37] then proposed a method for predicting latent graphs.

Data augmentation in contrastive learning is intuitive and well understood in computer vision. For graphs, however, augmentation still lacks a full theoretical explanation, so there is no guarantee that a given augmentation is optimal.

2.3 Generative self-supervised graph learning

Generative methods are important in self-supervised graph representation learning. In early methods [26], the graph structure was flattened into sequences by random walk sampling, and node embeddings were then produced with a powerful natural language model. Such methods [2] rely only on the adjacency matrix and do not exploit the initial feature information of nodes. Later graph neural networks such as GCN [17] and GAT [34] fully combine node features with the relationships between nodes, solving the problem that earlier methods could use only a single kind of information in the graph. Building on this, GAE [18] applies autoencoders to graph representation learning: it uses a GNN encoder to obtain node representations and reconstructs the adjacency matrix from the inner products of these representations in the decoder. However, real-world data often contain a large number of noisy edges, and this method pays too much attention to the structural information of the graph; reconstructing the noisy edges makes the model unstable and degrades performance. On this basis, GATE [29] also reconstructs node features to obtain a better graph representation.

Subsequently, Masked AutoEncoders [10] were proposed in computer vision. The method can restore images with 75% of their pixels occluded, showing strong model performance. Inspired by this idea, GraphMAE [14] applies masking to self-supervised graph representation learning. It improves on the original MSE loss and proposes a scaled cosine loss better suited to graph representation learning. The method uses a graph attention network (GAT) as the encoder and restores the node features in the decoder, reaching state-of-the-art performance in self-supervised graph representation learning. However, it ignores the important structure information of the graph. We therefore propose a new method, based on masked autoencoders, that makes full use of both graph node information and graph structure information.

3 Method

Fig. 2: The input to the model is the graph \({\mathcal {G}}\). Two masked graphs are obtained by masking graph nodes and masking graph edges. The two graphs pass through the same Graph Encoder and then through different Graph Decoders, which finally restore the nodes and edges

The method consists of three steps. First, mask operations are performed on the graph to obtain a node-masked graph and an edge-masked graph. Second, these are fed to the graph encoder to obtain graph representations. Finally, the masked nodes and edges are recovered by the decoders, which take the graph representations as input. Figure 2 shows a block diagram of the entire model.

3.1 Graph masking

In this section, we introduce how the graph is masked. In computer vision, it is common to mask image pixels at a very high rate, usually 75%. A high mask rate prevents the model from trivially restoring a pixel by copying the surrounding pixels. In recent work, GraphMAE achieved state-of-the-art results in self-supervised graph representation learning by masking graph nodes. In graph representation learning, however, the input is a graph, and treating nodes exactly like pixels is not feasible: the relationship between pixels is determined by their positions in the image, whereas the relationship between nodes is represented by edges. Edges therefore play an important role in the graph and cannot be ignored during masking and reconstruction. Our method generates two masked graphs, one with only nodes masked and the other with only edges masked. Both graphs are encoded by the same encoder to obtain their graph representations; in the decoding stage, however, different strategies are used, and we design two different decoders for reconstructing nodes and reconstructing edges.

The input graph consists of its node feature matrix \(X\) and adjacency matrix \(A\), denoted as \({\mathcal {G}}= (A, X)\), where \(A\in {\mathbb {R}}^{|V|\times |V|}\) is the adjacency matrix and \(X\in {\mathbb {R}}^{|V|\times d}\) is the node feature matrix. The first masked graph masks the node feature matrix \(X\): a fraction \(p_{1}\) of the node features is selected and set to 0, yielding \({\mathcal {G}}_{mask}^{1} =( A,X^{'})\). The adjacency matrix \(A\) is then masked by removing a fraction \(p_{2}\) of the edges in the input graph, yielding \({\mathcal {G}}_{mask}^{2} =(A^{'},X)\). The values of \(p_{1}\) and \(p_{2}\) are two important parameters of the model. In computer vision, a larger mask rate generally yields better model performance. If the masking rate is too small, the reconstruction task is too easy and the trained model performs poorly; if it is too large, too much information is missing, leading to model collapse and an impossible reconstruction task. Selecting appropriate \(p_{1}\) and \(p_{2}\) is therefore key to training the model. The experiments section shows the effects of different \(p_{1}\) and \(p_{2}\) on model performance.
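To make the masking step concrete, the following is a minimal PyTorch sketch of the dual masking operation, working on a dense adjacency matrix for readability. The function name and the handling of edge symmetry are our own illustrative choices, not the authors' released code.

```python
import torch

def dual_mask(x, adj, p1=0.7, p2=0.5):
    """Produce the two masked views G^1_mask = (A, X') and G^2_mask = (A', X).

    x:   [N, d] node feature matrix X
    adj: [N, N] dense (float) adjacency matrix A
    p1:  node masking ratio, p2: edge masking ratio
    """
    # View 1: zero out the feature vectors of a random p1 fraction of nodes.
    node_mask = torch.rand(x.size(0)) < p1
    x_masked = x.clone()
    x_masked[node_mask] = 0.0

    # View 2: remove a random p2 fraction of the existing edges.
    # Random scores are drawn on the upper triangle only and the drop
    # pattern is mirrored, so both directions of an undirected edge go.
    rand = torch.rand_like(adj).triu(diagonal=1)
    drop = (rand > 0) & (rand < p2) & (adj > 0)
    drop = drop | drop.T
    adj_masked = adj.clone()
    adj_masked[drop] = 0.0

    return (x_masked, adj), (x, adj_masked), node_mask
```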

3.2 The graph encoder

Nowadays, Graph Neural Networks (GNNs) such as GCN (Graph Convolutional Network) [17], GIN (Graph Isomorphism Network) [38], and GAT (Graph Attention Network) [34] are widely used as graph encoders and achieve similar representation quality. This paper selects the classic GCN model as the graph encoder.

During training, \({\mathcal {G}}_{mask}^{1}\) and \({\mathcal {G}}_{mask}^{2}\) are used as the encoder input and yield the graph representations \(H_{1}\) and \(H_{2}\) after passing through the encoder \(G_{\theta }^{e} \left( \cdot \right) \). After training, the original graph \({\mathcal {G}}=\left( A, X \right) \) is fed to the encoder to obtain the graph representation H, which can be used for downstream machine learning tasks such as graph node classification, graph classification, and link prediction. The formulas are defined as follows.

$$H = G_{\theta }^{e}({\mathcal {G}}) = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}} {\tilde{D}}^{-\frac{1}{2}} X W_{l})$$
(1)
$$H_{1} = G_{\theta }^{e}({\mathcal {G}}_{mask}^{1}) = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}} {\tilde{D}}^{-\frac{1}{2}} X^{'} W_{l})$$
(2)
$$H_{2} = G_{\theta }^{e}({\mathcal {G}}_{mask}^{2}) = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}}^{'} {\tilde{D}}^{-\frac{1}{2}} X W_{l})$$
(3)

where \({\tilde{A}} = A+I\), \({\tilde{D}}_{ii}= \sum _{j}{\tilde{A}}_{i,j}\), I is the identity matrix, and \(W_{l}\) is a learnable weight matrix. \(\sigma \) denotes an activation function; in this paper, the PReLU [12] activation function is used.
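As a concrete reference, the following is a minimal PyTorch sketch of one such GCN propagation step (Eqs. 1–3). It uses a dense adjacency matrix for clarity; a sparse implementation would be preferable on large graphs, and the class name is ours.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN step: sigma(D^{-1/2} (A + I) D^{-1/2} X W) with PReLU."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W_l
        self.act = nn.PReLU()

    def forward(self, adj, x):
        a_tilde = adj + torch.eye(adj.size(0), device=adj.device)  # A + I
        d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)                  # diag of D^{-1/2}
        a_norm = d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]
        return self.act(a_norm @ self.weight(x))
```

The same layer form can serve as the shared encoder \(G^{e}_{\theta }\) and, with separate parameters, as each decoder in Sect. 3.3.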

After training is complete, the original graph \({\mathcal {G}}\) is fed to the encoder, as shown in Eq. 1. The resulting graph representation H is used for the graph node classification task. H is further aggregated into a single graph-level vector \({\mathcal {R}}\) by a readout function, as shown in the following equation.

$${\mathcal {R}} = \sum _{i=1}^{|V|}H_{i}$$
(4)

Here, \(|V|\) is the number of graph nodes. H is used for the node classification task, and \({\mathcal {R}}\) is used for the graph classification task.

3.3 Dual decoder for reconstruction of graph nodes and edges

At this point the representations of the node-masked graph, \(H_{1}\), and of the edge-masked graph, \(H_{2}\), have been obtained. Feeding them to the two decoders yields \({\hat{H}}_{1}\) and \({\hat{H}}_{2}\). Here GCN is selected as the decoder. The formulas are defined as follows.

$${\hat{H}}_{1} = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}} {\tilde{D}}^{-\frac{1}{2}} H_{1} W_{l})$$
(5)
$${\hat{H}}_{2} = \sigma ({\tilde{D}}^{-\frac{1}{2}} {\tilde{A}}^{'} {\tilde{D}}^{-\frac{1}{2}} H_{2} W_{l})$$
(6)

The decoded representation \({\hat{H}}_{1}\) is compared against the original node features to compute the node reconstruction loss, using the scaled cosine error function [14] defined in Eq. 7. The representation \({\hat{H}}_{2}\) is then used to reconstruct the adjacency matrix as \(A_{mask} = {\hat{H}}_{2}{\hat{H}}_{2}^{T}\); this obtains the correlation matrix between nodes by directly computing the similarity between their features. The similarity between \(A_{mask}\) and the original graph's adjacency matrix A is then measured with the mean squared error function shown in Eq. 8.

$${\mathcal {L}}_{sce} = \frac{1}{|[MASK]|}\sum _{i\in [MASK]}\left(1-\frac{{\hat{H}}^{T}_{i} X_{i}}{||{\hat{H}}_{i}||\cdot ||X_{i}||}\right)$$
(7)
$${\mathcal {L}}_{mse} = \frac{1}{n^{2}} \sum _{i}^{n}\sum _{j}^{n}\left(A_{i,j}-(A_{mask})_{i,j}\right)^{2}$$
(8)

Here, [MASK] denotes the index set of the masked nodes, and n is the number of graph nodes. The final loss of the proposed method is a weighted sum of \({\mathcal {L}}_{sce}\) and \({\mathcal {L}}_{mse}\), as defined in Eq. 9.

$${\mathcal {L}} = \alpha \cdot {\mathcal {L}}_{sce} + (1-\alpha ) \cdot {\mathcal {L}}_{mse}$$
(9)

where \(0<\alpha <1\) is a weighting hyperparameter that balances the two losses. The whole procedure is presented in Algorithm 1.
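Before turning to the algorithm, the two losses and their combination (Eqs. 7–9) can be sketched as follows; the function names are ours, and node_mask is the boolean index of masked nodes from the masking step in Sect. 3.1.

```python
import torch
import torch.nn.functional as F

def sce_loss(h1_hat, x, node_mask):
    # Eq. (7): scaled cosine error between reconstructed and original
    # node features, averaged over the masked nodes only.
    cos = F.cosine_similarity(h1_hat[node_mask], x[node_mask], dim=-1)
    return (1.0 - cos).mean()

def edge_mse_loss(h2_hat, adj):
    # Eq. (8): reconstruct A_mask from pairwise feature similarities and
    # compare it with the original adjacency matrix by mean squared error.
    a_mask = h2_hat @ h2_hat.T
    return F.mse_loss(a_mask, adj)

def total_loss(h1_hat, x, node_mask, h2_hat, adj, alpha=0.5):
    # Eq. (9): weighted combination of the two reconstruction losses.
    return alpha * sce_loss(h1_hat, x, node_mask) + \
           (1.0 - alpha) * edge_mse_loss(h2_hat, adj)
```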

Algorithm 1: The Multi-view Graph Learning Process

As shown in Algorithm 1, the input of the algorithm is the original graph \({\mathcal {G}}\), and the outputs are the representations H and \({\mathcal {R}}\). H is used for downstream node classification tasks, and \({\mathcal {R}}\) is used for downstream graph classification tasks. Our method achieves state-of-the-art results on most downstream tasks; the experimental data are described in detail in the experiments section.
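Putting the pieces together, a condensed sketch of the training procedure follows, reusing the illustrative helpers defined above (dual_mask, GCNLayer, total_loss); hyperparameter values are examples, and none of this is the authors' released code.

```python
import torch

# x: [N, d] features, adj: [N, N] float adjacency matrix.
encoder = GCNLayer(x.size(1), 512)        # shared encoder G^e_theta
node_dec = GCNLayer(512, x.size(1))       # G^{d1}: restores node features
edge_dec = GCNLayer(512, 512)             # G^{d2}: features for A_mask
params = [*encoder.parameters(), *node_dec.parameters(), *edge_dec.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

for epoch in range(300):
    (x1, a1), (x2, a2), node_mask = dual_mask(x, adj, p1=0.7, p2=0.5)
    h1 = encoder(a1, x1)                  # H1 from the node-masked view
    h2 = encoder(a2, x2)                  # H2 from the edge-masked view
    h1_hat = node_dec(a1, h1)             # Eq. (5)
    h2_hat = edge_dec(a2, h2)             # Eq. (6)
    loss = total_loss(h1_hat, x, node_mask, h2_hat, adj, alpha=0.5)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():                     # after training, encode the
    H = encoder(adj, x)                   # original graph (Eq. 1)
    R = H.sum(dim=0)                      # graph-level readout (Eq. 4)
```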

3.4 Computational complexity analysis

Assume C, N, D, and M denote an architecture-dependent constant, the number of nodes, the node feature dimension, and the number of edges, respectively, and assume that the computational costs of backward and forward propagation are equivalent. Our decoder complexity is twice that of GraphMAE because we use two decoders. The computational costs of the encoders and the projection are \(2C(N+M)\) and \(4C(N+M)\), respectively. Model training additionally incurs a \(C_{\textrm{Ours}}(N^{2})\) cost because the loss function used to reconstruct the edges, \({\mathcal {L}}_{mse}\), compares full adjacency matrices. Table 1 summarizes the computational costs of representative SSGRL methods.

Table 1 The computational costs analysis of representative SSGRL methods

4 Experiments

In this section, we verify the effectiveness of the method on two downstream tasks: node classification and graph classification. An ablation study verifies the importance of reconstructing edges in self-supervised graph representation learning, and a hyperparameter study examines how different masking rates for nodes and edges affect the results.

Table 2 The results of node classification task

4.1 Node classification

Datasets: We select eight widely used graph datasets to verify the performance of our method on node classification: Cora, CiteSeer, PubMed [42], Photo, Computers [30], DBLP [1], CS [30], and WikiCS. Cora, CiteSeer, PubMed, and DBLP are citation networks whose node features are bag-of-words representations of documents. Photo and Computers are Amazon co-purchase graphs: nodes represent items, and edges indicate whether two items are frequently purchased together. CS is a co-author network whose nodes represent authors, whose node features are keywords from the authors' papers, and whose edges indicate whether two authors co-authored a paper. WikiCS consists of computer-science-related Wikipedia pages: nodes represent articles, and edges indicate hyperlinks between articles. Detailed statistics for these eight datasets are provided in Table 3.

Table 3 The statistics of the node classification datasets
Table 4 Graph classification results

Settings: For each experiment, we follow the linear evaluation scheme of DGI [35]. The encoder is a one-layer GCN [17]. It is first trained in a self-supervised way to obtain the node representation H; then H is used to train and test a simple l2-regularized logistic regression classifier for evaluation. For model tuning, we tune the mask rate within 0.1 to 0.9 and the learning rate within 1e−1 to 1e−5. We adopt the public splits for Cora, CiteSeer, PubMed, and WikiCS, and a 1:1:8 training/validation/testing split for the other four datasets [47]. We implement the model in PyTorch. All experiments are conducted on an NVIDIA RTX3090 GPU with 24GB VRAM.
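For reference, the linear evaluation step can be sketched as follows, assuming the frozen representation H and labels y are CPU tensors and train_idx/test_idx come from the splits above (the variable names are ours).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train an l2-regularized logistic regression on the frozen embeddings.
clf = LogisticRegression(max_iter=1000)   # l2 penalty is sklearn's default
clf.fit(H[train_idx].numpy(), y[train_idx].numpy())
pred = clf.predict(H[test_idx].numpy())
print("accuracy:", accuracy_score(y[test_idx].numpy(), pred))
```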

Competitors: Three categories of graph representation methods are compared: (1) generative methods, GAE [18], GPT-GNN [15], and GATE [29]; (2) contrastive learning methods, DGI [35], GMI [25], MVGRL [9], GRACE [50], CCA-SSG [47], BGRL [33], LaGraph [37], and AFGRL [20]; (3) a mask-based method, GraphMAE [14]. Two supervised baselines, GCN [17] and GAT [34], are also included for reference.

The results are shown in Table 2, where bold indicates the best record and underline the second best. Our method trails the state of the art by 1.3% on the Cora dataset and by 0.2% on CiteSeer, but achieves the best results on four datasets and the second best on one. Specifically, compared with the state-of-the-art method LaGraph, it achieves 1.8% and 0.3% improvements on the Computers and CS datasets, respectively, and compared with the state-of-the-art method GraphMAE it achieves a 0.4% improvement on PubMed. This shows that adding edge reconstruction can degrade performance on some specific datasets, but still yields better performance on most datasets.

4.2 Graph classification

Table 5 The statistics of the graph classification datasets

Datasets: We select seven widely used graph datasets to verify the performance of our method on graph classification: IMDB-BINARY [39], IMDB-MULTI, PROTEINS, COLLAB [39], MUTAG [19], REDDIT-BINARY, and NCI1 [36]. IMDB-BINARY and IMDB-MULTI are movie collaboration datasets consisting of the ego-networks of 1,000 actors who played roles in IMDB movies; in each graph, nodes represent actors and edges indicate that two actors collaborated on a movie, with graphs drawn from the Action and Romance genres. PROTEINS is a dataset of proteins classified as enzymes or non-enzymes; nodes represent amino acids, and an edge connects two nodes if they are less than 6 Angstroms apart. COLLAB is a scientific collaboration dataset in which each graph is a researcher's ego-network: the researcher and their collaborators are nodes, and an edge indicates collaboration between two researchers. MUTAG is a collection of nitroaromatic compounds, where nodes represent atoms and edges represent bonds between atoms. REDDIT-BINARY is a dataset of Reddit posts; nodes represent posts, and an edge indicates that the same user commented on both posts. NCI1 comes from the cheminformatics domain, where each graph represents a chemical compound: each vertex is an atom, and edges represent bonds between atoms. Detailed statistics for these seven datasets are provided in Table 5.

Table 6 Experimental results on graph node classification tasks without masking edges or masking nodes
Table 7 Experimental results on graph classification tasks without masking edges or masking nodes

Settings: We follow GraphCL's [45] linear evaluation protocol. In particular, we tune the number of encoder layers from 1 to 5 in increments of 1. We first train the encoder in a self-supervised manner, then freeze its parameters to obtain the corresponding representation H. Specifically, we sum the readout vectors R of each view to form the final representation. Finally, we train a linear classifier on the fixed representation R and report the mean tenfold cross-validation accuracy with standard deviation over 5 runs.
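A sketch of this evaluation protocol, assuming R_all stacks the frozen graph-level representations as a NumPy array and y holds the graph labels (the classifier choice here is our assumption):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

accs = []
for run in range(5):                      # 5 runs with different fold seeds
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=run)
    accs.append(cross_val_score(LinearSVC(max_iter=10000),
                                R_all, y, cv=cv).mean())
print(f"{np.mean(accs):.4f} +/- {np.std(accs):.4f}")
```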

Competitors: Three types of methods are compared: (1) supervised methods, GIN [38] and DiffPool [43]; (2) graph kernels, WL [36] and DGK [39]; (3) self-supervised methods, graph2vec [22], InfoGraph [31], GraphCL [45], JOAO [44], GCC [27], MVGRL [9], AD-GCL [32], LaGraph [37], and GraphMAE [14].

Result: We compare against state-of-the-art self-supervised models in Table 4, where the best record on each dataset is highlighted in bold and the second best is underlined. The results show that our method achieves state-of-the-art results on six datasets. It can also be observed that generative methods outperform contrastive methods based on negative samples or predictions. Our method outperforms the latest method LaGraph across the board, with performance gains of 0.2–4.9%, including a 4.88% improvement on the COLLAB dataset; this shows the great promise of generative methods. In particular, our method leads GraphMAE, the 2022 state of the art, by 0.2–2.5% on six datasets, which illustrates the advantage of considering edges in graph structures. This also aligns with our initial expectation: GraphMAE obtains good node representations when only node reconstruction is performed, but this is no longer advantageous for graph classification tasks that require information about the whole graph.

Fig. 3: The impact of the node masking ratio

Fig. 4: The impact of the edge masking ratio

4.3 Ablation experiment

The following ablation experiments verify the method's effectiveness. Base is a model that masks neither nodes nor edges; base(+n) masks nodes but not edges; base(+e) is the opposite, masking edges but not nodes; and all is our final method. The experimental results are shown in Tables 6 and 7: Table 6 reports the ablation on the node classification task, and Table 7 on the graph classification task.

In the node classification ablation, the model that only reconstructs edges improves over the base model, proving the effectiveness of edge reconstruction for model performance. The model that reconstructs both edges and nodes (all) achieves better results than the model that reconstructs only edges (base+e). This means that adding edge reconstruction does not hurt the model on the node classification task and can even bring some improvement.

The graph classification ablation shows that edge reconstruction can significantly improve model performance. The model that reconstructs only edges achieves better results on the REDDIT-B dataset, whereas the model that reconstructs only nodes performs better on COLLAB. In either case, both the node-reconstruction and the edge-reconstruction models outperform the base model. This differs from the node classification results and shows that both reconstructed nodes and reconstructed edges can yield a better graph representation.

The ablation results show that a better graph representation can be obtained by reconstructing both edges and nodes, although results occasionally degrade on specific datasets. In general, combining the two applies to a wider range of datasets; our method effectively combines the different advantages of node reconstruction and edge reconstruction.

Table 8 Optimal parameters for each model

4.4 Hyperparameter experiment

In this method, the node mask rate and the edge mask rate are two important parameters. To study the impact of different masking rates on model performance, we ran extensive experiments to find the most suitable rates, selecting one dataset each from the node classification and graph classification tasks. The results are shown in Figs. 3 and 4.
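The sweep itself amounts to a simple grid search over the two rates; the sketch below reuses the hypothetical helpers from Sect. 3, with train_model and evaluate standing in for self-supervised training and the downstream linear evaluation, respectively.

```python
best = (None, None, 0.0)
for p1 in [0.1, 0.3, 0.5, 0.7, 0.9]:        # node mask rate candidates
    for p2 in [0.1, 0.3, 0.5, 0.7, 0.9]:    # edge mask rate candidates
        model = train_model(x, adj, p1=p1, p2=p2)  # self-supervised training
        acc = evaluate(model, x, adj)              # downstream accuracy
        if acc > best[2]:
            best = (p1, p2, acc)
print("best (p1, p2):", best[:2], "accuracy:", best[2])
```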

The datasets selected for the hyperparameter experiments are Amazon Computers and IMDB-MULTI. In Fig. 3, the abscissa shows five node masking rates from 0.1 to 0.9, and the ordinate shows the accuracy on the dataset. As seen in the left panel, the Amazon Computers dataset reaches its highest accuracy at a node mask rate of 0.7 and its lowest at the extremes of 0.1 and 0.9. In the right panel, IMDB-MULTI is also most accurate at a node mask rate of 0.7, and accuracy drops significantly at 0.9. Analyzing these results, a low mask rate makes the reconstruction task too simple for the model to learn the deep information in the graph, while a high mask rate loses too much graph information, making the node reconstruction task impossible. In Fig. 4, the model performs best when the edge mask rate is 0.5–0.7; as with the node mask rate, rates that are too low or too high degrade the model's performance.

The final selected hyperparameters for all experiments in this paper are listed in Table 8.

4.5 Discussion

In this method, the dimension of the final graph representation is also very important for downstream tasks. A lower feature dimension greatly reduces the training cost of the model but also loses more information, while the available machine memory limits the maximum feature dimension. We therefore experiment with multiple feature dimensions to observe their impact on model performance.

Fig. 5: The impact of the embedding dimension d

We selected four datasets from the node classification task, Citeseer, PubMed, CS, and Computers, to observe the impact of the final representation dimension on downstream tasks. Hyperparameter experiments were performed with feature dimensions of 128, 256, 512, 1024, and 2048. The results are shown in Fig. 5: all four datasets achieve better results with higher-dimensional feature representations. The accuracy on the CS dataset is already around 93%, so the improvement in the figure is not obvious; nevertheless, a feature dimension of 2048 still improves accuracy by 0.4% over a dimension of 128. Notably, the Citeseer dataset changes the most: a feature dimension of 2048 improves accuracy by 10.8% over a dimension of 128. These experiments suggest that our method achieves better downstream performance with higher-dimensional node representations.

5 Conclusion

In this paper, we propose a node and edge dual-masked self-supervised graph representation method. For the first time, we propose using two different decoders to reconstruct nodes and edges. The method is validated on a total of 15 real-world datasets across graph node classification and graph classification tasks, and the experimental results demonstrate its effectiveness. In the future, we plan to explore better ways of making edge reconstruction and node reconstruction work together in the model, and to explore generative models that are better suited to self-supervised graph representation learning than existing methods.