1 Introduction

Graphs provide a way of representing a wide variety of complex data in real-world systems. Several graph analysis approaches have emerged to extract useful information hidden in graphs. Community detection, as an essential tool for graph analysis, has been applied to many real-world problems such as social networks [1], citation networks [2], brain networks [3] and protein–protein interaction (PPI) networks [4]. Many community detection algorithms have been proposed, from shallow approaches [5, 6] to deep ones [7, 8]. Several recent deep methods utilize graph convolutional networks (GCNs) [9] to extract features from graphs [10, 11]. Many of these methods rely on local graph information (e.g., adjacency matrix reconstruction), which is not well suited to graph clustering [8].

Recently, contrastive approaches have demonstrated significant results in graph analysis tasks. These self-supervised methods mainly discriminate between positive and negative sample pairs at the node or graph level [12,13,14,15]. Despite the high performance of contrastive methods in node representation learning, they have not received sufficient attention for community detection. Several contrastive approaches for node clustering have been proposed in the literature, but most of them isolate the node embedding step from the clustering task [16]. When the community structure of the graph is ignored during node representation learning, the resulting embedding space is suboptimal for clustering, because the representation learning step is unaware of the downstream clustering task and is performed independently of it. Therefore, it is beneficial to define a clustering-oriented contrastive objective function to achieve better clustering performance.

To address this, we propose a novel clustering-friendly node embedding framework which utilizes a contrastive method for node representation learning while taking the community structure of the graph into account when optimizing node representations. The proposed method therefore not only exploits the great potential of contrastive node embedding, but also explores the cluster information of the embedding space. As the embedding method, we utilize a contrastive approach which relies on mutual information maximization to learn representations. To obtain a clustering-friendly node embedding space, we impose a Gaussian mixture distribution on the representation space. Combining Gaussian mixture modeling with deep models is not a straightforward task, and several approaches have been introduced to do so. Jiang et al. [17] and Uğur et al. [18] use variational approaches to obtain a mixture model in their latent space, while Makhzani et al. [19] utilize an adversarial training procedure to make the latent space of their model follow a mixture of Gaussians distribution. Our approach is to assume that the learnt node embedding space follows a mixture of Gaussians (MoG) distribution and to learn the parameters of this mixture distribution along with the parameters of the contrastive model in a unified framework by taking iterative single (or limited) steps of expectation–maximization (EM) and gradient descent. Moreover, since many message passing algorithms are restricted to local messages, it is beneficial to employ a method which goes beyond direct neighbors to capture higher-order information in the graph. To this end, we employ graph diffusion convolution (GDC) [20], which may help the clustering task by providing a global view of the graph. However, since GDC does not perform well on some complicated graphs, we employ a clustering quality measure, the modularity [21], to decide whether or not to utilize GDC for clustering a given graph. Our code is available at https://github.com/MaedehAhmadi/GMIM.

Contributions of our method are summarized as follows:

  1. We propose a clustering-oriented contrastive learning-based method, Gaussian mixture information maximization (GMIM), for learning node embeddings. Different from other simple contrastive methods, which ignore the community structure of the graph during node embedding, GMIM learns a clustering-friendly node representation that takes the downstream clustering task into account.

  2. We utilize graph diffusion to benefit from a global view of the graph in the clustering task whenever it is beneficial. Diffusion makes it possible to go beyond the limited information of direct neighbors in the message passing process and provides the subsequent contrastive learning algorithm with a global view of the graph.

  3. Extensive experiments on six real-world datasets demonstrate the effectiveness of our method in comparison with state-of-the-art deep graph clustering methods.

The rest of the paper is organized as follows. We review the related work of graph embedding and graph clustering in Sect. 2. Section 3 introduces a detailed description of the proposed method. Experimental results on six real-world datasets are presented in Sect. 4. The conclusions are given in Sect. 5.

2 Related works

2.1 Graph embedding

Recently, approaches based on deep learning have made great progress in many fields of graph learning, especially graph embedding. Early deep learning-based studies mostly relied on random walk objectives [22, 23]. These methods take random walks over nodes and utilize neural language models (such as SkipGram [24]) for node embedding. They assume that nodes which are close in the input graph, i.e., co-occur in the same random sequence, should also be close in the embedding space.

Graph neural networks (GNNs) [25,26,27] have demonstrated strong representation power for attributed graph learning tasks. GNN-based methods follow a message passing mechanism to capture the structural information of graph data. For unsupervised graph embedding, graph autoencoder-based methods [7, 9, 10] mainly try to reconstruct the adjacency matrix, thereby imposing closeness of first-order neighbor nodes in the embedding space. Both random walk- and graph autoencoder-based methods over-emphasize the local proximity information [13].

Recently, contrastive approaches have achieved state-of-the-art results in graph data analysis [13, 14, 28]. They contrast samples from a desired distribution with samples from an undesired one. Motivated by the excellent results of contrastive learning in visual representation learning [29, 30], graph contrastive algorithms propose to retain local and global structural information of graphs [13, 14].

2.2 Community detection

Many methods for detecting communities have been proposed. Early methods employ shallow approaches to community detection, mostly focusing on network topology. Non-negative matrix factorization (NMF) [5, 6] and Laplacian eigenmaps [31] are two widely used approaches in this area. Stochastic block model-based methods [32] are also well explored. Modularity maximization is a popular objective for extracting communities [33]. To exploit both content and structural information, several extended algorithms based on topic models [34] and NMF [35, 36] have been proposed.

As graph analysis problems and graph data become more complicated, deep learning-based methods have demonstrated great performance in graph analysis tasks, including community detection. As a deep baseline, applying well-known clustering algorithms to the embeddings produced by GAE and VGAE [9] already outperforms many shallow algorithms. Some works present enhanced graph autoencoder-based methods with boosted results in graph clustering [7, 10, 11].

While some methods perform graph embedding and clustering in two independent stages, others try to combine the clustering and graph embedding objectives. Wang et al. [8] co-optimize a graph attention-based reconstruction loss and the clustering loss of [37]. Tsitsulin et al. [38] maximize modularity in the embedding space of a GCN. A probabilistic generative model which learns node embedding and community assignment jointly is proposed in [39]. In [40], GCN is integrated with the Bernoulli–Poisson probabilistic model [6] for overlapping community detection. Zhang et al. [41] train a graph autoencoder to find an appropriate embedding space for relaxed Kmeans. A variational framework for jointly learning clustering and node embedding is introduced in [42].

There exist some contrastive approaches for graph clustering in the literature. SCGC [16] presents a contrastive approach for node clustering but, unlike our method, it performs node embedding and clustering in two independent steps, so the second step is totally unaware of the first. Moreover, since it defines a neighbor-oriented contrastive objective function, it over-emphasizes the limited local information of direct neighbors. CCGC [43] is a node-by-node contrastive learning method which utilizes the clustering result of each iteration for selecting positive and negative pairs of nodes in the next iteration. It defines a loss function to minimize the similarity of different high-confidence cluster centers across views. CONVERT [44] is also a contrastive approach with a label-matching module which aligns the pseudo-labels selected via clustering with the semantic labels obtained by applying softmax to the node embeddings.

3 Method

3.1 Problem formalization and method overview

We consider community detection in attributed networks in this paper. The input is a graph \(G=(V,E,X)\), where \(V=(v_1,v_2,\ldots ,v_N)\) is the set of N nodes and \(E=\lbrace e_{ij} \rbrace \) is the edge set. \( X=\lbrace x_1; x_2; \ldots ; x_N \rbrace \) are the attribute values where \(x_i\in {\mathbb {R}}^F\) is the feature vector of node \(v_i\). An adjacency matrix \(A\in {\mathbb {R}}^{N\times N}\) encodes the structural connectivity of nodes where \(A_{i,j}=1\) if \((v_i, v_j )\in E\); otherwise, \(A_{i,j}=0\).

The purpose of attributed community detection is to divide the nodes into K communities (or clusters) based on the attributes and structural information.

Our proposed method considers the clustering and node embedding tasks jointly. To achieve this, we assume that the node embedding space follows a mixture of Gaussians distribution. We learn the parameters of the MoG and of the contrastive method jointly. This results in a more clustering-friendly representation space, which is more suitable for the Kmeans clustering algorithm to be applied to.

The proposed method includes two main parts: (1) node embedding part which utilizes contrastive learning for extracting embedding vectors of the nodes and (2) clustering part that tries to impose a Gaussian mixture distribution on the learned latent representation. The overall framework of our proposed method is shown in Fig. 1.

Fig. 1
figure 1

Our \({\textrm{GMIM}}\) framework. Given X and A as the input, we select our input graph structure according to the clustering quality of two initialized embedding spaces. The resulting graph structure and its corrupted version are fed into a shared encoder. The output features form the information maximization embedding objective (bottom). The clustering module (top) aims to enforce this representation to follow a \({\textrm{MoG}}\) distribution. The embedding and clustering modules are trained jointly

3.2 Node embedding

Our proposed framework for clustering-friendly node representation learning is not limited to a specific type of contrastive node embedding approach. Contrasting representation vectors can be done at the node versus node level [45] or the node versus graph level [13, 14]. In this paper, we use the contrastive framework of [13] for learning node embeddings on attributed networks. We maximize the mutual information between node representation vectors and a global graph summary vector. The objective is to train an encoder \({\mathcal {E}}\) such that \({\mathcal {E}}(X,A)=H=\lbrace h_1, h_2, \ldots , h_N \rbrace \in {\mathbb {R}}^{N\times F^{'}}\) contains a representation \(h_i\in {\mathbb {R}}^{F^{'}}\) for each node i. We generate a negative graph \({\tilde{G}}\) by a corruption function \({\tilde{G}}=C(G)\) that shuffles the rows of X. The same encoder \({\mathcal {E}}\) is applied to the positive and negative graphs to obtain the representation matrices H and \({\tilde{H}}\). The summary vector s is obtained by the readout function \(s={\mathcal {R}}(H)=\sigma (1/N \sum _{i=1}^N h_i)\), with logistic sigmoid nonlinearity \(\sigma \). Given a representation vector h and the summary s, the following discriminator \({\mathcal {D}}\) distinguishes between representations from the positive and negative graphs by assigning higher probabilities to representations contained in the summary:

$$\begin{aligned} {\mathcal {D}}(h, s)=\sigma (h^T Ws), \end{aligned}$$
(1)

where W is a learnable scoring matrix. To maximize the mutual information between \(h_i\) and the summary vector s, the following cross-entropy loss is minimized:

$$\begin{aligned} \begin{aligned} {\mathcal {L}}_{MI}&=-\frac{1}{2N}\Bigg (\sum _{i=1}^N {\mathbb {E}}_{(X, A)} [\log {\mathcal {D}}(h_i, s)]\\&\quad +\sum _{j=1}^N{\mathbb {E}}_{({\tilde{X}}, {\tilde{A}})} [\log ( 1-{\mathcal {D}}({\tilde{h}}_j, s))]\Bigg ). \end{aligned} \end{aligned}$$
(2)

The encoder is the following single-layer \({\textrm{GCN}}\):

$$\begin{aligned} {\mathcal {E}}_A(X, A)={\textrm{PReLU}}({\widehat{D}}^{-\frac{1}{2}} {\hat{A}}{\widehat{D}}^{-\frac{1}{2}}X\Phi ), \end{aligned}$$
(3)

where \({\hat{A}}=A+I_N\) is the adjacency matrix with self-connections and \({\widehat{D}}_{ii}=\sum _j {\hat{A}}_{ij}\) is the corresponding degree matrix. \(\Phi \) is a learnable transformation matrix and \({\textrm{PReLU}}\) represents parametric rectified linear unit function.
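To make this module concrete, the following is a minimal PyTorch sketch of Eqs. (1)–(3): a single-layer GCN encoder with a learnable transformation \(\Phi \) and PReLU, a row-shuffling corruption, the sigmoid-of-mean readout, and the bilinear discriminator trained with the cross-entropy loss of Eq. (2). The class and variable names (e.g., ContrastiveEncoder, A_norm) are ours for illustration and do not correspond to the released implementation.

```python
# A minimal sketch of the node-embedding module of Eqs. (1)-(3);
# names are illustrative, not the authors' released code.
import torch
import torch.nn as nn

class ContrastiveEncoder(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.phi = nn.Linear(in_dim, out_dim, bias=False)      # transformation matrix Phi
        self.act = nn.PReLU()
        self.W = nn.Parameter(torch.empty(out_dim, out_dim))    # scoring matrix of Eq. (1)
        nn.init.xavier_uniform_(self.W)

    def encode(self, X, A_norm):
        # Single-layer GCN of Eq. (3); A_norm is D^{-1/2}(A+I)D^{-1/2}.
        return self.act(A_norm @ self.phi(X))

    def forward(self, X, A_norm):
        H = self.encode(X, A_norm)                        # positive representations
        X_shuf = X[torch.randperm(X.size(0))]             # corruption: shuffle rows of X
        H_neg = self.encode(X_shuf, A_norm)               # negative representations
        s = torch.sigmoid(H.mean(dim=0))                  # readout / summary vector
        pos = torch.sigmoid(H @ self.W @ s)               # D(h_i, s)
        neg = torch.sigmoid(H_neg @ self.W @ s)           # D(h~_j, s)
        eps = 1e-8
        loss_mi = -0.5 * (torch.log(pos + eps).mean()
                          + torch.log(1 - neg + eps).mean())   # Eq. (2)
        return H, loss_mi
```

Here A_norm is the precomputed normalized adjacency as a dense tensor; a sparse tensor can equally be used via torch.sparse.mm.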

3.3 Graph diffusion

Message passing neural networks pass messages between immediate neighbors in the graph. Although they try to aggregate messages from higher-order neighbors in deeper layers, most of them achieve their best performance with 2-layer networks because of the over-smoothing phenomenon [46]. Limiting the messages of each layer to one-hop neighbors is restrictive, and some methods try to capture higher-order information in the graph. One successful method in this regard is graph diffusion convolution (GDC) [20]. It replaces the adjacency matrix with a diffusion matrix formulated as:

$$\begin{aligned} S=\sum _{k=0}^{\infty }\theta _k T^k, \end{aligned}$$
(4)

with generalized transition matrix T and weighting coefficients \(\theta _k\). One popular example of graph diffusion is Personalized PageRank \(({\textrm{PPR}})\) [47]. Given the adjacency matrix A and the corresponding degree matrix \({D}_{ii}=\sum _j {A}_{ij}\), \({\textrm{PPR}}\) chooses \(T=AD^{-1}\) and \(\theta _k=\alpha (1-\alpha )^k\) with teleport probability \(\alpha \in [0,1]\). The closed-form solution for \({\textrm{PPR}}\) diffusion is:

$$\begin{aligned} S^{{\textrm{PPR}}}=\alpha \left( I_n-(1-\alpha ) D^{-\frac{1}{2}} AD^{-\frac{1}{2}}\right) ^{-1}. \end{aligned}$$
(5)

This diffusion matrix provides a global view of the graph, acts as a low-pass filter and smooths out the neighborhood over the graph [20]. GDC can be integrated with any kind of graph-based model, and we can use it as the input of our model instead of the adjacency matrix. However, for complicated datasets, GDC may not perform well for clustering. To decide between the diffusion and adjacency matrices as our input graph structure, we utilize the modularity measure [21]. This score measures the clustering quality without using label information. To this end, we train two encoders using (2) as the loss function, feeding the adjacency matrix to one encoder and the diffusion matrix to the other, and then apply Kmeans clustering to the resulting representations of both encoders. A higher modularity value indicates better clustering quality. In the GMIM framework of Fig. 1, we utilize the winning matrix and the winning encoder as the matrix R and the encoder \({\mathcal {E}}\), respectively. Also, the selected trained encoder is utilized for initialization, as stated in Sect. 3.6. Note that we use the following GCN encoder when the diffusion matrix is the input structure:

$$\begin{aligned} {\mathcal {E}}_{PPR}(X, S^{PPR})={\textrm{PReLU}}(S^{PPR}X\Phi ), \end{aligned}$$
(6)

where \(\Phi \) is a learnable transformation matrix.
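As a concrete illustration of this step, the sketch below computes the closed-form \(S^{{\textrm{PPR}}}\) of Eq. (5) with NumPy and scores a candidate structure by the modularity of the Kmeans partition of its pre-trained embeddings, using networkx and scikit-learn. The helper names and the decision rule in the final comment are our simplified reading of the procedure, not the authors' released code.

```python
# Sketch: closed-form PPR diffusion of Eq. (5) and a modularity-based
# choice between adjacency and diffusion. H_adj / H_ppr are assumed to be
# embeddings from encoders pre-trained with the MI loss alone.
import numpy as np
import networkx as nx
from networkx.algorithms.community import modularity
from sklearn.cluster import KMeans

def ppr_diffusion(A, alpha=0.2):
    """S^PPR = alpha * (I - (1 - alpha) * D^{-1/2} A D^{-1/2})^{-1}."""
    N = A.shape[0]
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    A_sym = D_inv_sqrt @ A @ D_inv_sqrt
    return alpha * np.linalg.inv(np.eye(N) - (1 - alpha) * A_sym)

def structure_modularity(A, H, K):
    """Modularity of the Kmeans partition of embeddings H on graph A."""
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(H)
    G = nx.from_numpy_array(A)
    communities = [np.where(labels == k)[0].tolist() for k in range(K)]
    return modularity(G, communities)

# Choose the structure whose pre-trained embedding clusters better:
# use_diffusion = structure_modularity(A, H_ppr, K) > structure_modularity(A, H_adj, K)
```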

3.4 Gaussian mixture modeling for community detection

Assume we have calculated a node embedding \(h_i\) for every node \(v_i\) of the graph using a node embedding model with parameters \(\Psi \). We assume each node is generated from a multivariate Gaussian distribution. The likelihood of all the nodes of the graph is then a Gaussian mixture distribution:

$$\begin{aligned} p(V)=\prod _{i=1}^{\mid V \mid }\sum _{k=1}^K p(c_i=k)p(v_i{\mid }c_i=k;\Psi ,\mu _k, \Sigma _k ), \end{aligned}$$
(7)

where \(c_i\) denotes the soft community assignment of node i and \(p(c_i=k)\) indicates the probability of node i being assigned to community k. \(p(v_i{\mid }c_i=k; \Psi , \mu _k, \Sigma _k )\) is a multivariate Gaussian distribution:

$$\begin{aligned} p(v_i{\mid }c_i=k; \Psi , \mu _k, \Sigma _k )=N(h_i{\mid }\mu _k, \Sigma _k). \end{aligned}$$
(8)

For simplicity of notation, we denote \(p(c_i=k)\) by \(\pi _k\), where \(\sum _{k=1}^K \pi _k=1\). The parameters of the Gaussian mixture are thus \(\Theta =\lbrace \Pi , M, \Sigma \rbrace \), with \(\Pi =\lbrace \pi _k \rbrace \), \(M=\lbrace \mu _k \rbrace \) and \(\Sigma =\lbrace \Sigma _k \rbrace \) for \(k=1, \ldots , K\). We assume the covariance matrices \(\Sigma _k\) are diagonal.

3.5 Clustering-friendly node embedding

We propose a clustering-promoting objective which yields a latent space that is suitable for clustering. We assume that the learnt latent space follows a \({\textrm{MoG}}\) distribution. Our objective function has two parts: embedding and clustering. The embedding part utilizes the self-supervised objective \({\mathcal {L}}_{MI}\) for node representation learning, and the clustering module tries to enforce this representation to follow a \({\textrm{MoG}}\) distribution. The latter goal is achieved by minimizing the negative log-likelihood (\({\textrm{NLL}}\)) under the \({\textrm{MoG}}\) distribution:

$$\begin{aligned} {\mathcal {L}}_{NLL}=-\sum _{i=1}^{{\mid }V{\mid }}\log \sum _{k=1}^K\pi _k{\mathcal {N}}(h_i{\mid }\mu _k, \Sigma _k). \end{aligned}$$
(9)

Our total loss function is defined as:

$$\begin{aligned} {\mathcal {L}}=\omega {\mathcal {L}}_{MI}+\beta {\mathcal {L}}_{NLL}, \end{aligned}$$
(10)

where \({\mathcal {L}}_{MI}\) and \({\mathcal {L}}_{NLL}\) are the mutual information loss and the negative log-likelihood, respectively. The weights \(\omega \) and \(\beta \) balance the two terms of the objective function. After optimizing our objective, we have a Kmeans-friendly latent space, to which we apply the Kmeans algorithm to obtain the final clusters of nodes.
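For completeness, the NLL term of Eq. (9) with diagonal covariances can be written compactly with a log-sum-exp for numerical stability, as in the hedged PyTorch sketch below (variable names are ours); the total loss of Eq. (10) is then a weighted sum of this term and the MI loss.

```python
# Sketch of L_NLL in Eq. (9) for diagonal-covariance Gaussians.
import torch

def mog_nll(H, pi, mu, var, eps=1e-6):
    """H: (N, F'), pi: (K,), mu: (K, F'), var: (K, F') diagonal variances."""
    H = H.unsqueeze(1)                                   # (N, 1, F')
    var = var.clamp_min(eps)
    # per-component diagonal Gaussian log-density, summed over features -> (N, K)
    log_norm = -0.5 * (torch.log(2 * torch.pi * var) + (H - mu) ** 2 / var).sum(-1)
    # log of the mixture density via log-sum-exp over components -> (N,)
    log_mix = torch.logsumexp(torch.log(pi + eps) + log_norm, dim=1)
    return -log_mix.sum()

# Total loss of Eq. (10):
# loss = omega * loss_mi + beta * mog_nll(H, pi, mu, var)
```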

3.6 Inference

The total loss function of (10) involves two sets of parameters: the node embedding parameters \(\Psi \) and the \({\textrm{MoG}}\) parameters \(\Theta =\lbrace \Pi , M, \Sigma \rbrace \). To optimize these parameters, we use an iterative approach, fixing one set and optimizing the other. We initialize the \(\Psi \) parameters by training the model using (2) as the loss function. To initialize the \({\textrm{MoG}}\) parameters, we apply the Kmeans algorithm to the embedding obtained from the \(\Psi \) initialization and initialize \((\Pi , M, \Sigma )\) from its hard assignments. The details of this iterative approach are described below.

Fixing \(\varvec{\Psi }\) Parameters and Optimizing \(\varvec{\Theta =\lbrace \Pi , M, \Sigma \rbrace }\)

Fixing the deep network parameters, we use the expectation–maximization algorithm [48] to optimize \((\Pi , M, \Sigma )\). The following equations are used iteratively to update these parameters:

$$\begin{aligned} \pi _k= & {} \frac{N_k}{{\mid }V{\mid }},\end{aligned}$$
(11)
$$\begin{aligned} \mu _k= & {} \frac{1}{N_k} \sum _{i=1}^{{\mid }V{\mid }} {\mathcal {V}}_{ik} h_i,\end{aligned}$$
(12)
$$\begin{aligned} \Sigma _k= & {} \frac{1}{N_k} \sum _{i=1}^{{\mid }V{\mid }} {\mathcal {V}}_{ik} (h_i-\mu _k ) (h_i-\mu _k )^T, \end{aligned}$$
(13)

where

$$\begin{aligned} {\mathcal {V}}_{ik}=\frac{\pi _k {\mathcal {N}}(h_i{\mid }\mu _k, \Sigma _k )}{\sum _{k^{\prime }=1}^K \pi _{k^{\prime }} {\mathcal {N}}\left( h_i{\mid }\mu _{k^{\prime }}, \Sigma _{k^{\prime }}\right) }, \end{aligned}$$
(14)

and

$$\begin{aligned} N_k=\sum _{i=1}^{\mid V \mid }{\mathcal {V}}_{ik} \quad 1\le k \le K. \end{aligned}$$
(15)

More precisely, we update \({\mathcal {V}}_{ik}\) in the E-step and \((\Pi , M, \Sigma )\) in the M-step of the \({\textrm{EM}}\) algorithm. Note that in each iteration we perform only a limited number of E and M steps, not a full run of \({\textrm{EM}}\) until convergence. More details about the derivation of the update formulas are provided in Appendix A.
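The sketch below implements one such limited E/M pass for the diagonal-covariance case of Eqs. (11)–(15) in PyTorch; it is our illustrative code (the names em_step, var, etc. are not from the paper), and it is wrapped in torch.no_grad() since the EM update does not backpropagate into the encoder.

```python
# One E-step / M-step pass over Eqs. (11)-(15) with diagonal covariances.
import torch

@torch.no_grad()
def em_step(H, pi, mu, var, eps=1e-6):
    # E-step: responsibilities V_{ik} of Eq. (14).
    log_norm = -0.5 * (torch.log(2 * torch.pi * var.clamp_min(eps))
                       + (H.unsqueeze(1) - mu) ** 2 / var.clamp_min(eps)).sum(-1)  # (N, K)
    V = torch.softmax(torch.log(pi + eps) + log_norm, dim=1)                        # (N, K)

    # M-step: Eqs. (11)-(13), keeping only the diagonal of each covariance.
    Nk = V.sum(dim=0) + eps                                                          # (K,)
    pi_new = Nk / H.size(0)                                                          # Eq. (11)
    mu_new = (V.T @ H) / Nk.unsqueeze(1)                                             # Eq. (12)
    diff2 = (H.unsqueeze(1) - mu_new) ** 2                                           # (N, K, F')
    var_new = (V.unsqueeze(-1) * diff2).sum(dim=0) / Nk.unsqueeze(1)                 # Eq. (13)
    return pi_new, mu_new, var_new
```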

Fixing \(\varvec{\Theta =\lbrace \Pi , M, \Sigma \rbrace }\) and Updating \(\varvec{\Psi }\) Parameters

Fixing \({\textrm{MoG}}\) parameters, we optimize the total loss function of (10) with respect to \(\Psi \) parameters using gradient descent \(({\textrm{GD}})\):

$$\begin{aligned} \Psi = \Psi -\eta \left( \omega \frac{\partial {{\mathcal {L}}_{MI}(\Psi )}}{\partial \Psi }+\beta \frac{\partial {\mathcal {L}}_{NLL}(\Psi ,\Theta )}{\partial \Psi }\right) , \end{aligned}$$
(16)

where \(\eta \) is the learning rate. The parameters \(\Psi \) are updated via PyTorch auto-grad. Green arrows in Fig. 1 denote the backpropagation process. \(\Psi \) consists of the learnable scoring matrix W of (1), the encoder parameters \(\Phi \) and the \({\textrm{PReLU}}\) parameters of (3). Our proposed method is summarized in Algorithm 1.

Algorithm 1
figure d

Gaussian Mixture Information Maximization.
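Putting the pieces together, the following high-level sketch mirrors the alternating optimization of Algorithm 1, reusing the illustrative ContrastiveEncoder, mog_nll and em_step helpers sketched earlier; the \(\beta \) schedule, epoch counts and hidden size are placeholders rather than the reported settings.

```python
# High-level sketch of the alternating optimization (our code, not Algorithm 1 verbatim).
import torch
from sklearn.cluster import KMeans

def train_gmim(X, A_norm, K, omega, hidden=512, epochs=200, init_epochs=200, lr=1e-3):
    model = ContrastiveEncoder(X.size(1), hidden)
    opt = torch.optim.Adam(model.parameters(), lr=lr)

    # Initialize Psi: train with the MI loss alone (Eq. (2)).
    for _ in range(init_epochs):
        _, loss_mi = model(X, A_norm)
        opt.zero_grad()
        loss_mi.backward()
        opt.step()

    # Initialize Theta from Kmeans hard assignments on the learnt embedding.
    with torch.no_grad():
        H, _ = model(X, A_norm)
    labels = torch.as_tensor(KMeans(n_clusters=K, n_init=10).fit_predict(H.numpy()))
    pi = torch.bincount(labels, minlength=K).float() / H.size(0)
    mu = torch.stack([H[labels == k].mean(0) for k in range(K)])
    var = torch.stack([H[labels == k].var(0, unbiased=False) for k in range(K)]) + 1e-6

    for epoch in range(epochs):
        beta = epoch / epochs                      # annealed from 0 toward 1
        # Update Psi by gradient descent with Theta fixed (Eq. (16)).
        H, loss_mi = model(X, A_norm)
        loss = omega * loss_mi + beta * mog_nll(H, pi, mu, var)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Update Theta by a single EM pass with Psi fixed.
        with torch.no_grad():
            H, _ = model(X, A_norm)
            pi, mu, var = em_step(H, pi, mu, var)

    # Final clustering: Kmeans on the clustering-friendly embedding.
    with torch.no_grad():
        H, _ = model(X, A_norm)
    return KMeans(n_clusters=K, n_init=10).fit_predict(H.numpy())
```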

3.7 Computational complexity

In this section, we analyze the computational complexity of GMIM for a graph with N nodes and \({\mid }E{\mid }\) edges, K clusters, attribute and embedding dimensions F and \(F'\), and \(T_i\) and T epochs for initialization and the main optimization, respectively. Our algorithm consists of three main components: computing the diffusion, initialization and Gaussian mixture modeling. We calculate the diffusion using Eq. 5, whose time complexity is \({\mathcal {O}}(N^3)\). In the initialization phase, we optimize \({\mathcal {L}}_{MI}\) for \(T_i\) epochs. In each epoch, the encoder and the discriminator constitute the main parts of \({\mathcal {L}}_{MI}\). The encoder can be implemented through sparse or dense multiplication, with time complexities of \({\mathcal {O}}(NFF'+{\mid }E{\mid }F')\) and \({\mathcal {O}}(N^2F'+NFF')\), respectively. The complexity of the discriminator is \({\mathcal {O}}(NF'^2)\). The time complexity of calculating \({\mathcal {L}}_{MI}\) is the sum of the encoder and discriminator complexities. The Gaussian mixture modeling phase consists of T epochs. In each epoch, \({\mathcal {L}}_{MI}\) is calculated with the complexity given above, and the parameters of the MoG are updated once with complexity \({\mathcal {O}}(NKF')\). Since K is usually much smaller than F, we ignore this term, and since \(T_i\) and T are of the same order, we use a single value T for both.

To sum up, by combining these three components, the overall time complexity of our model is \({\mathcal {O}}(N^3+NFF'T+{\mid }E{\mid }F'T)\) for the sparse implementation and \({\mathcal {O}}(N^3+N^2F'T+NFF'T)\) for the dense one. It is worth noting that the diffusion can be approximated (using the Andersen algorithm [49]) in linear time \({\mathcal {O}}(N)\), in which case the overall time complexity reduces to \({\mathcal {O}}(NFF'T+{\mid }E{\mid }F'T)\). We leave this for future work.

4 Experiments

4.1 Benchmark datasets

We conduct attributed graph community detection experiments on six standard, widely used network datasets (Cora [50], PubMed [50], Wiki [51], ACM [52], Flickr [53] and Coauthor-Phys [54]). Cora and PubMed are two citation networks; their nodes represent papers and edges correspond to citations. Wiki is a webpage network in which nodes and edges correspond to webpages and the links between them, respectively. In all three cases, nodes are represented by bag-of-words vectors: the features of Cora are binary vectors, while PubMed and Wiki are represented by tf-idf weights. ACM is a paper network in which nodes represent papers and two papers are connected by an edge if they are written by the same author; node features are bag-of-words vectors of keywords. Flickr is a social network in which users act as nodes and edges indicate friendship connections between users. The node labels are user interest groups, and the features of each node are a list of tags specified by the user to indicate their interests. Coauthor-Phys is a co-authorship network in which nodes are authors, and two nodes are connected if the corresponding authors have co-authored a paper. Node features are defined based on the keywords of each author's papers, and class labels correspond to fields of research. Table 1 summarizes the detailed statistics of the datasets.

Table 1 Datasets statistics

4.2 Compared methods

We compare GMIM with the following methods. These approaches are categorized into three groups:

  1. Methods which use node features only: Kmeans and spectral clustering [55] are two common clustering methods. Spectral-F is a spectral clustering method which uses the cosine similarity between node features as the similarity matrix.

  2. Methods which use graph structure only: Spectral-G uses the adjacency matrix as the similarity matrix. DeepWalk [22] generates random walks over the graph and uses them to train the SkipGram language model to learn node embeddings. GraphEncoder [56] trains a stacked sparse autoencoder to obtain node embeddings. DNGR [57] uses stacked denoising autoencoders to learn node representations. Kmeans is applied to the learnt latent spaces of the latter three methods. vGraph [39] is a probabilistic generative model which performs graph clustering and node embedding jointly.

  3. Methods which use both node features and graph structure: TADW [58] adds node features to the DeepWalk framework. GAE and VGAE [9] integrate (variational) autoencoders and graph neural networks for node embedding. ARGA and ARVGA [7] use an adversarial training scheme to impose a prior distribution on the latent space of GAE and VGAE. DGVAE [59] presents a graph variational generative model which uses Dirichlet distributions as priors on the latent variables. AGC [11] designs a high-order graph convolution to obtain smooth node features for enhancing clustering results. CommDGI [60] incorporates contrastive learning to learn cluster assignments of the nodes. DAEGC [8] jointly optimizes a graph reconstruction loss and a clustering loss. SENet [61] uses a spectral clustering loss to learn node embeddings. GC-VGE [42] introduces a joint framework for clustering and representation learning by utilizing a variational graph embedding mechanism. DBGAN [62] introduces an adversarial framework to learn node embeddings. SCGC [16] is a contrastive method for node clustering which performs the node embedding and clustering tasks in two isolated steps; it defines an adjacency-oriented loss function to contrast between views. CCGC [43] proposes a cluster-guided contrastive learning method which uses high-confidence clustering information to select discriminative positive and reliable negative samples for contrastive learning. CONVERT [44] is a contrastive approach which guides the node embedding procedure by matching the pseudo-labels obtained from clustering with the semantic labels.

4.3 Evaluation metrics and experimental settings

We report three evaluation metrics to measure graph clustering performance: clustering accuracy (ACC), normalized mutual information (NMI) and adjusted Rand index (ARI). Higher values of all these metrics indicate better results. We run our algorithm 10 times on each dataset and report the average and standard deviation of the obtained metrics.
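For reproducibility, the metrics can be computed as in the sketch below, assuming the common convention of matching predicted cluster indices to ground-truth labels with the Hungarian algorithm before computing ACC; this is our own snippet, not the paper's evaluation script.

```python
# Illustrative computation of ACC (with Hungarian matching), NMI and ARI.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    D = max(y_pred.max(), y_true.max()) + 1
    cost = np.zeros((D, D), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                           # contingency counts
    row, col = linear_sum_assignment(-cost)        # maximize matched counts
    return cost[row, col].sum() / y_true.size

# acc = clustering_accuracy(labels, preds)
# nmi = normalized_mutual_info_score(labels, preds)
# ari = adjusted_rand_score(labels, preds)
```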

For the encoder, we set the size of the hidden dimension to 512 for Cora, Wiki and ACM, and to 256 for Flickr, PubMed and Coauthor-Phys. To balance the two terms of the objective function, the weight \(\omega \) is set to 25,000 for Cora, 15,000 for Wiki, 1000 for PubMed, 15,000 for ACM, 10 for Coauthor-Phys and 2000 for Flickr. At the start of training we set \(\beta \) to zero and, as training progresses, we gradually increase it to one. We use the Adam optimizer with a learning rate of 0.001 in both the initialization and training phases for all datasets except ACM, for which we use a learning rate of 0.0001 in the training phase. \(T_1\) and \(T_2\) are set to one for all datasets. We train the model for 200 epochs on Cora, Wiki, ACM and Coauthor-Phys, 400 epochs on Flickr and 1000 epochs on PubMed.

We set \(\alpha =0.2\) for \({\textrm{PPR}}\) diffusion on Cora, PubMed and ACM. We do not use diffusion for Wiki, Flickr and Coauthor-Phys, since the decision process selects the adjacency matrix as the appropriate input structure for these datasets.

4.4 Experimental results

Our experimental results are summarized in Tables 2, 3, 4, 5 and 6. F, G and F&G indicate the methods which use only node features, only graph structure, or both feature and structure information, respectively. Boldface indicates the best metric value in each column. From these tables, we obtain the following observations:

  1. Methods using both feature and structure information generally outperform the methods using only one source of information, which indicates the importance of both sources for the clustering task.

  2. Our method significantly outperforms the classic GNN-based methods GAE, VGAE, ARGA and ARVGA. These are two-stage methods that perform the node embedding and clustering stages independently; in addition, they essentially try to reconstruct the adjacency matrix.

  3. Some methods (such as SENet, CCGC and CONVERT) exploit clustering information of the nodes during node embedding, and some competitors (including DAEGC, DGVAE, GC-VGE and CommDGI) present unified frameworks for clustering and representation learning. Overall, however, they yield inferior results compared to ours. The reason differs across methods; for instance, DAEGC, DGVAE and GC-VGE mainly rely on adjacency matrix reconstruction, which focuses too much on local proximity information and is not well suited to the graph clustering goal.

  4. Compared with recent contrastive methods such as SCGC, CONVERT and CCGC, our method achieves significant performance improvements in all cases except ACC on Wiki, where the best competitor is comparable to GMIM. This confirms the positive effect of our clustering-oriented contrastive loss function.

  5. GMIM consistently surpasses all of its competitors w.r.t. all metrics on the Cora and ACM datasets. On the other datasets, in the few cases where a competitor has a higher value than GMIM for a specific metric, it is consistently outperformed by GMIM w.r.t. the other metrics on that dataset and w.r.t. all metrics on all other datasets. More precisely, on PubMed, GMIM outperforms all methods w.r.t. all metrics except CommDGI, which has a higher NMI than ours; CommDGI is outperformed by GMIM in terms of the other metrics on this dataset and w.r.t. all metrics on the other datasets. On Flickr, GMIM again consistently surpasses all methods w.r.t. all metrics except GC-VGE, which has a higher NMI value than ours; GC-VGE is outperformed by GMIM in terms of the other metrics on this dataset and w.r.t. all metrics on the other datasets. On Wiki, GMIM exceeds all of the baselines in terms of NMI (1.93% higher than the best baseline). It also outperforms all of the methods with respect to ACC, excluding SCGC, which is slightly better (by 0.25%) than ours on Wiki in terms of ACC; however, GMIM significantly surpasses SCGC across all datasets (on average by 7.57%, 8.98% and 9.28% in terms of ACC, NMI and ARI, respectively). In terms of ARI on Wiki, AGC and DAEGC have higher results than GMIM, while both of these methods are outperformed by GMIM with respect to ACC and NMI, and GMIM exceeds both methods on all other datasets w.r.t. all metrics.

Table 2 Clustering results on Cora dataset
Table 3 Clustering results on PubMed dataset
Table 4 Clustering results on Wiki dataset
Table 5 Clustering results on ACM dataset
Table 6 Clustering results on Flickr dataset

4.5 Ablation study

To evaluate the effectiveness of our unified framework for learning clustering-friendly node embeddings, we performed the ablation study shown in Table 7. We train the model using only the mutual information loss (\({\mathcal {L}}_{MI}\)) and then apply Kmeans and a Gaussian mixture model (GMM) to the learnt embeddings; the average result over 10 runs is reported in Table 7 as MI+Kmeans/GMM (see the sketch after the table). Best results are shown in bold. The fact that GMIM outperforms MI+Kmeans/GMM confirms the effectiveness of jointly optimizing the MI and NLL objectives.

Table 7 Ablation study
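A hedged sketch of this baseline is given below: embeddings are first learnt with \({\mathcal {L}}_{MI}\) alone (e.g., with the encoder sketched in Sect. 3.2) and then clustered post hoc with scikit-learn's KMeans and GaussianMixture; the function name is ours.

```python
# Sketch of the MI+Kmeans/GMM baseline of Table 7 (our code, hedged).
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def mi_then_cluster(H, K):
    # H: node embeddings learnt with L_MI alone, as a NumPy array.
    km_labels = KMeans(n_clusters=K, n_init=10).fit_predict(H)
    gmm_labels = GaussianMixture(n_components=K, covariance_type='diag').fit_predict(H)
    return km_labels, gmm_labels
```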

4.6 Effect of embedding dimension

In this section, we investigate the influence of the embedding dimension on clustering performance. Figure 2 shows the clustering results on five datasets for different embedding dimensions. Note that all hyper-parameters other than the embedding size are fixed as in Sect. 4.3. We evaluated embedding dimensions of [64, 128, 256, 512] for Cora, Wiki, ACM and Flickr and [16, 32, 64, 128, 256] for PubMed.

According to Fig. 2, the clustering results show only small variations across different embedding dimensions. In most cases, larger embedding dimensions yield higher performance.

Fig. 2
figure 2

Effect of different embedding dimensions on node clustering

4.7 Influence of hyper-parameter \(\omega \)

In this section, we focus on the impact of the weight \(\omega \), which balances the two terms of our total loss function. The value of this hyper-parameter is chosen in proportion to the ratio of \({\mathcal {L}}_{NLL}\) to \({\mathcal {L}}_{MI}\) for each dataset. To obtain the best results, it is selected such that both \({\mathcal {L}}_{MI}\) and \({\mathcal {L}}_{NLL}\) decrease by the end of training relative to their initial values.

Figure 3 depicts the clustering results for different values of \(\omega \) on five datasets. The size of the hidden dimension is set to 128 for PubMed and to the same values as stated in Sect. 4.3 for the other datasets. All other hyper-parameters except \(\omega \) are fixed. As shown in this figure, the clustering performance is robust within a wide range of \(\omega \). However, for very large or small values of \(\omega \), one term of the total loss function dominates the other and the performance degrades.

Fig. 3
figure 3

Effect of different values of \(\omega \) on node clustering

4.8 Influence of full covariance matrices

As stated in Sect. 3.4, we use diagonal covariance matrices for Gaussian mixture modeling. Theoretically, even if the elements of the feature vectors are not independent, a linear combination of diagonal-covariance Gaussians can equally describe the correlations among features modeled by at least one full-covariance Gaussian [63]. To verify the effect of using full-covariance Gaussians, we examined GMIM with full covariance matrices on the Cora dataset and obtained 75.67%, 60.36% and 54.75% in terms of ACC, NMI and ARI, respectively. Comparing these results with the diagonal-covariance case shows that full-covariance Gaussians do not enhance the performance significantly, whereas diagonal covariance matrices simplify the computations and require less memory.

4.9 Running time analysis

The running time of our model (the time required to train the model and generate the embeddings) on five datasets is shown in Fig. 4. To analyze the efficiency of GMIM, we performed clustering on a larger dataset, namely Coauthor-Phys; its statistics are presented in Table 1. We evaluated the clustering results of GMIM and several baseline methods (GAE, VGAE, ARGA and ARVGA, as well as SCGC, CCGC and CONVERT as three recent contrastive graph clustering methods) on this dataset. Running time versus clustering accuracy is depicted in Fig. 5, with the running time on a log scale.

Fig. 4
figure 4

Running time of GMIM on five datasets

Fig. 5
figure 5

Clustering accuracy versus running time on Coauthor-Phys dataset

Fig. 6
figure 6

2D visualization of node embeddings on five datasets. (Left) the raw features and (right) the learnt node representations. Different classes are shown by different colors

As shown in this figure, our method significantly outperforms all the competitors in terms of accuracy, while the running times of CCGC, SCGC and CONVERT are considerably greater than ours. Although our running time is greater than those of GAE, VGAE, ARGA and ARVGA, GMIM achieves more than 20–30% improvement in ACC over these methods.

4.10 Visualization

To show the effectiveness of the learnt node embeddings, we visualize the node representations of five datasets in two-dimensional space using t-SNE [64] in Fig. 6. As illustrated in this figure, comparing the raw features with the learnt node representations in 2D space shows that the learnt representations are clustering-friendly and well suited for Kmeans.
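A minimal sketch of how such a figure can be produced (our code, not the authors' plotting script) is given below, applying scikit-learn's t-SNE to the raw features and to the learnt embeddings and coloring points by ground-truth class.

```python
# 2D visualization of raw features vs. learnt embeddings with t-SNE.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(X_raw, H, labels):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, feats, title in zip(axes, [X_raw, H], ['raw features', 'learnt embedding']):
        z = TSNE(n_components=2, init='pca').fit_transform(feats)
        ax.scatter(z[:, 0], z[:, 1], c=labels, s=3, cmap='tab10')
        ax.set_title(title)
    plt.show()
```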

5 Conclusions

In this paper, we introduce a clustering-promoting objective for node embedding. Our proposed method utilizes contrastive learning to produce a clustering-friendly latent space by assuming that the learnt representation follows a mixture of Gaussians distribution. The embedding and clustering-related objectives are optimized in a unified framework to benefit each other. Our experiments show that incorporating the clustering-directed objective function can enhance the clustering ability of graph contrastive learning. We evaluated the proposed method on six real-world datasets. Empirical results demonstrate the effectiveness of our method compared with state-of-the-art methods.