
Less is More: Removing Redundancy of Graph Convolutional Networks for Recommendation

Published: 22 January 2024


Abstract

While Graph Convolutional Networks (GCNs) have shown great potential in recommender systems and collaborative filtering (CF), they suffer from expensive computational complexity and poor scalability. On top of that, recent works mostly combine GCNs with other advanced algorithms, which further sacrifices model efficiency and scalability. In this work, we unveil the redundancy of existing GCN-based methods in three aspects: (1) Feature redundancy. By reviewing GCNs from a spectral perspective, we show that most spectral graph features are noisy for recommendation, and that stacking graph convolution layers can suppress but not completely remove the noisy features; these findings are largely summarized from our previous work. (2) Structure redundancy. By providing a deep insight into how user/item representations are generated, we show that what makes them distinctive lies in the spectral graph features, and that the core idea of GCNs (i.e., neighborhood aggregation) is not what makes GCNs effective. (3) Distribution redundancy. Following the observations in (1), we further show that the number of required spectral features is closely related to the spectral distribution, where important information tends to be concentrated in more (fewer) spectral features on a flatter (sharper) distribution. To concentrate the important information in as few features as possible, we sharpen the spectral distribution by increasing the node similarity without changing the original data, thereby reducing the computational cost. To remove these three kinds of redundancy, we propose a Simplified Graph Denoising Encoder (SGDE) that exploits only the top-K singular vectors without explicitly aggregating the neighborhood, which significantly reduces the complexity of GCN-based methods. We further propose a scalable contrastive learning framework to alleviate data sparsity and to boost model robustness and generalization, leading to significant improvement. Extensive experiments on three real-world datasets show that our proposed SGDE not only achieves state-of-the-art performance but also shows higher scalability and efficiency than our previously proposed GDE as well as traditional and GCN-based CF methods.


1 INTRODUCTION

Recommender systems have been playing an indispensable role in people’s daily lives by suggesting items a user may be interested in based on the analysis of the user’s historical data, such as user-item interactions, user reviews, demographic information, and so on. We focus on Collaborative Filtering (CF), a fundamental task for recommendation which infers user preferences from past user-item interactions. A common paradigm for CF is to characterize users and items as learnable vectors in a latent space, which are optimized based on user-item interactions. Matrix factorization (MF) [23] is one of the most widely used CF methods, which simply estimates ratings as the inner product between user and item latent vectors. Subsequent works improve MF mostly by (1) exploiting advanced algorithms such as perceptrons [7, 15], recurrent neural networks [17, 45], memory networks [8], attention mechanisms [4], transformers [38], generative adversarial networks [41], and so on to model non-linear user-item relations; and (2) augmenting interactions with auxiliary information [1, 26]. However, due to the sparsity of interactions, which is common in practice, traditional CF methods often show poor performance.

Recently, Graph Convolutional Networks (GCNs) have attracted much attention in various fields including recommender systems [2, 9, 31, 42], as they can learn high-quality representations under data sparsity by exploiting the higher-order neighborhood. However, as shown in Figure 1, unlike conventional CF methods such as MF, the user/item embeddings are repeatedly updated by aggregating messages from the neighborhood, implemented as multiplication by an adjacency matrix followed by a feature transformation [42, 56], resulting in high computational cost and poor scalability. While recent advances mostly focus on combining GCNs with other advanced algorithms such as self-supervised learning [46], learning in hyperbolic space [53], and negative sampling [18], thereby further sacrificing model scalability and efficiency, we raise a simple yet crucial question: Can GCNs be both effective and efficient? Despite some incremental improvements that remove useless components of GCNs (e.g., activation functions and feature transformations) [5, 14], the complexity mainly comes from the core design: neighborhood aggregation. We argue that the key to answering the above question is to dissect GCNs by investigating the mechanism of how they work (i.e., why and how does updating user/item embeddings by aggregating messages from the neighborhood help?).

Fig. 1.

Fig. 1. A visualization of the difference between conventional (e.g., MF) and GCN-based methods. For GCN-based methods, the user/item embeddings are repeatedly updated by multiplication with an adjacency matrix, which is computationally expensive.

To this end, we review GCN-based recommendation methods and show three kinds of redundancy that significantly affect model efficiency and effectiveness: (1) Feature redundancy. Only very few smoothed and rough graph features contribute to recommendation accuracy, while most features are noisy and removable; stacking graph convolution layers can suppress but cannot completely remove the noisy features. (2) Structure redundancy. What makes user/item representations distinctive lies in the spectral graph features, while repeatedly aggregating messages from the neighborhood is not what makes GCNs effective. (3) Distribution redundancy. The number of spectral features contributing to recommendation is highly related to the spectral distribution, where important information tends to be concentrated in more (fewer) graph features on a flatter (sharper) distribution (i.e., when the spectral values drop more slowly (quickly)). To condense the important information into fewer spectral features and to reduce the complexity of retrieving them, we can sharpen the spectral distribution by increasing the node similarity without sabotaging the original data. To remove feature redundancy, we previously proposed a Graph Denoising Encoder (GDE) [30], which keeps only the important spectral features without stacking graph convolution layers for recommendation. By taking the other two redundancies into consideration, we can further simplify GCN-based methods. Based on the above analysis, we propose a new GCN formulation dubbed Simplified Graph Denoising Encoder (SGDE), which exploits only the top-K singular vectors for recommendation and is equipped with a lighter structure than traditional as well as GCN-based CF methods. By analyzing how graph contrastive learning (GCL) works for recommendation, we further propose a scalable contrastive learning framework to boost model generalization and robustness and to augment sparse supervisory signals with rich higher-order neighborhood signals. Finally, we comprehensively evaluate the proposed SGDE on three datasets with respect to efficiency and effectiveness. Extensive results show that our proposed methods outperform state-of-the-art baselines with 289\(\times\) and 21\(\times\) speed-ups over LightGCN and GDE on Gowalla, respectively. The main contributions of this work can be summarized as follows:

We provide a deep insight into GCN-based recommendation methods in terms of model scalability and efficiency, by showing three kinds of redundancies (i.e., feature, structure, and distribution redundancies) on existing methods.

We propose SGDE, which uses only the top-K singular vectors and is equipped with a lighter structure than MF.

To reduce the computational cost of retrieving spectral features, we concentrate the important information from interactions in fewer features by increasing node smoothness to sharpen the spectral distribution, which also results in significant improvement.

We further propose a scalable contrastive learning framework by performing augmentation on spectral features and augmenting sparse supervisory signals with abundant higher-order neighborhood signals, resulting in significant improvement.

Extensive experiments on three real-world datasets not only show that our proposed SGDE outperforms competitive baselines as well as GDE with superior scalability and efficiency but also demonstrate the effectiveness of our proposed designs.


2 PRELIMINARIES

2.1 GCNs for CF

The interaction matrix is defined as \(\mathbf {R}\in \lbrace 0, 1\rbrace ^{\left| \mathcal {U} \right| \times \left| \mathcal {I} \right|}\) with \(\left| \mathcal {U} \right|\) users and \(\left| \mathcal {I} \right|\) items. On top of conventional methods such as MF, each user u or item i is characterized not only as a low-rank embedding vector \(\mathbf {e}_u/\mathbf {e}_i\in \mathbb {R}^d\), but also as a node on the graph \(\mathcal {G}=(\mathcal {V}, \mathcal {E})\), where the nodes \(\mathcal {V}=\mathcal {U} \cup \mathcal {I}\) contain all users and items and the edges \(\mathcal {E}=\mathbf {R}^+\) are represented by observed interactions, with \(\mathbf {R}^+=\lbrace r_{ui}=1|u\in \mathcal {U}, i\in \mathcal {I}\rbrace\). Starting from the initial states \(\mathbf {h}^{(0)}_u=\mathbf {e}_u\) and \(\mathbf {h}^{(0)}_i=\mathbf {e}_i\), the embeddings are updated as follows: (1) \(\begin{equation} \begin{aligned}&\mathbf {h}^{(l+1)}_{u}=\sigma \left(\sum _{i\in \mathcal {N}_u} \frac{1}{\sqrt {d_u}\sqrt {d_i}}\mathbf {h}^{(l)}_{i} \mathbf {W}^{(l+1)}\right),\\ &\mathbf {h}^{(l+1)}_{i}=\sigma \left(\sum _{u\in \mathcal {N}_i} \frac{1}{\sqrt {d_u}\sqrt {d_i}}\mathbf {h}^{(l)}_{u} \mathbf {W}^{(l+1)}\right), \end{aligned} \end{equation}\) where \(\mathcal {N}_u\) and \(\mathcal {N}_i\) are the directly connected neighbors of u and i; \(d_u\) and \(d_i\) are u’s and i’s node degrees, respectively; \(\mathbf {W}^{(l+1)}\) is a feature transformation; and \(\sigma (\cdot)\) is an activation function. Let \(\mathbf {E}\in \mathbb {R}^{(\left|\mathcal {U}\right|+\left|\mathcal {I}\right|)\times d}\) be the stacked embedding vectors for users and items; the update rule can then be formulated in matrix form: (2) \(\begin{equation} \mathbf {H}^{(l+1)}=\sigma \left(\mathbf {\hat{A}} \mathbf {H}^{(l)} \mathbf {W}^{(l+1)} \right), \end{equation}\) where \(\mathbf {\hat{A}}\) is a symmetric normalized adjacency matrix defined as follows: (3) \(\begin{equation} \mathbf {\hat{A}}=\begin{bmatrix} \mathbf {0}& \mathbf {\hat{R}}\\ \mathbf {\hat{R}}^T &\mathbf {0} \end{bmatrix}, \end{equation}\) where \(\mathbf {\hat{R}}=\mathbf {D}^{-\frac{1}{2}}_U\mathbf {R}\mathbf {D}^{-\frac{1}{2}}_I\) can be considered a normalized interaction matrix, and \(\mathbf {D}_U\) and \(\mathbf {D}_I\) are diagonal matrices of user and item degrees, respectively. The final representations are generated by accumulating the embeddings from different layers. Since recent works [5, 14] show that activation functions and feature transformations do not help boost recommendation performance, the representations can be simplified as (4) \(\begin{equation} \mathbf {O}=\sum _{l=0}^L \frac{\mathbf {H}^{(l)}}{L+1}=\left(\sum _{l=0}^L\frac{\mathbf {\hat{A}}^l}{L+1} \right) \mathbf {E}. \end{equation}\) Removing the activation functions and feature transformations reduces the complexity by \(\mathcal {O}(d)\) and \(\mathcal {O}(|\mathcal {V}|d^2)\), respectively. Despite the improvements made by previous works to reduce the complexity of GCNs, storing or computing the powers of the adjacency matrix (\(\mathcal {O}(L|\mathcal {E}|d)\)) is still computationally expensive.
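To make the simplified propagation in Equation (4) concrete, the following is a minimal sketch, assuming a precomputed sparse normalized adjacency matrix \(\mathbf {\hat{A}}\) and an embedding table \(\mathbf {E}\); the variable names are illustrative and not taken from the released code.

```python
import torch

def propagate(A_hat: torch.Tensor, E: torch.Tensor, L: int = 3) -> torch.Tensor:
    """Equation (4): O = (sum_{l=0}^{L} A_hat^l / (L+1)) E."""
    out = E / (L + 1)                      # l = 0 term
    H = E
    for _ in range(L):
        H = torch.sparse.mm(A_hat, H)      # one neighborhood-aggregation step
        out = out + H / (L + 1)
    return out
```

Each call to `torch.sparse.mm` touches every edge once, which is exactly the \(\mathcal {O}(L|\mathcal {E}|d)\) cost discussed above.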

2.2 Graph Signal Processing

According to the spectral decomposition \(\mathbf {\hat{A}}=\mathbf {V}diag(\lambda _k)\mathbf {V}^T=\sum _k \lambda _k \mathbf {v}_k \mathbf {v}_k^T\), the eigenvectors are important (spectral) features constituting the graph.

Definition 1.

Given a graph signal \(\mathbf {s}\), the total variation of \(\mathbf {s}\) on a graph \(\mathcal {G}\) \({\rm TV}_{\mathcal {G}}(\mathbf {s})\) is defined as (5) \(\begin{equation} {\rm TV}_{\mathcal {G}}(\mathbf {s})=\Vert \mathbf {s}-\mathbf {\hat{A}}\mathbf {s} \Vert . \end{equation}\)

The total variation measures the difference between the signal samples at each node and at its neighbors on the graph [34]. On the other hand, applying the eigenvector \(\mathbf {v}_k\) to Definition 1 yields the following relation: (6) \(\begin{equation} \Vert \mathbf {v}_k-\mathbf {\hat{A}}\mathbf {v}_k\Vert =1-\lambda _k\in [0, 2), \end{equation}\) where \(\lambda _k\in (-1,1]\) [24]; we use the Euclidean norm here. Equation (6) shows that the variation of an eigenvector is determined by its eigenvalue: the one with a larger (smaller) eigenvalue has a smaller (larger) variation. Intuitively, a feature with a small variation implies smoothness (nodes are similar to their neighbors), while one with a large variation emphasizes node differences and is rough. Thus, it is natural to raise a question: Do different spectral features (i.e., \(\mathbf {v}_k\)) contribute differently to recommendation accuracy?
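Equation (6) is easy to verify numerically, since \(\mathbf {\hat{A}}\mathbf {v}_k=\lambda _k\mathbf {v}_k\) and eigenvectors have unit norm. Below is a toy check on a small hand-built normalized adjacency matrix (the matrix itself is only an illustrative assumption):

```python
import torch

# normalized adjacency of a 3-node triangle graph (each node has degree 2)
A_hat = torch.tensor([[0.0, 0.5, 0.5],
                      [0.5, 0.0, 0.5],
                      [0.5, 0.5, 0.0]])
lam, V = torch.linalg.eigh(A_hat)   # eigenvalues (ascending) and unit-norm eigenvectors
for k in range(3):
    variation = torch.norm(V[:, k] - A_hat @ V[:, k])   # total variation of the k-th feature
    print(f"lambda_k={lam[k]:+.3f}  variation={variation:.3f}  1-lambda_k={1 - lam[k]:.3f}")
```

The printed variation matches \(1-\lambda _k\) for every feature: the smoothest feature (\(\lambda _k=1\)) has zero variation, while the roughest ones have the largest.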


3 REDUNDANCIES IN GCNS

3.1 Feature Redundancy and GDE Brief

To evaluate how distinct spectral features affect accuracy, we define a cropped adjacency matrix as (7) \(\begin{equation} \mathbf {\hat{A}}^{\prime }=\sum _k \mathcal {D}(\lambda _k) \mathbf {v}_k \mathbf {v}_k^T, \end{equation}\) where \(\mathcal {D}(\lambda _k)\in \lbrace 0, \lambda _k\rbrace\) is a binary-valued function that returns 0 for untested features and \(\lambda _k\) for tested ones. We replace \(\mathbf {\hat{A}}\) with Equation (7) and evaluate two extensively used baselines, vanilla GCN [22] and LightGCN [14], on two datasets: CiteULike and ML-1M. Figure 2 shows the results, and we observe the following:

Fig. 2.

Fig. 2. In (a) and (b), we divide the spectral features into different groups based on the variation; the blue lines and green bars represent the accuracies and the ratios(%) of the features in different groups to all features, respectively; ’Random’ stands for the accuracy using a randomly initialized adjacency matrix below which the features can be considered useless. In (c) and (d), we test how the features with variation \(\le x\) contribute to recommendation on LightGCN, the blue lines and green bars represent the accuracies and ratios(%) of the tested features to all features, respectively.

In (a) and (b), only the features with rather small variations (i.e., smoothed features) or large variations (i.e., rough features) tend to have a positive effect on recommendation accuracy, and they account for only a very small percentage of all features; most features (83% on CiteULike and 76% on ML-1M) contribute as little to accuracy as a randomly created adjacency matrix.

In (c) and (d), we observe slight and significant drops in accuracy when removing the features with small and large variations, respectively, while removing other features even leads to an improvement.

We can summarize the above observations as follows:

Observation 1.

Recommendation accuracy is contributed mainly by the spectral features with small variations (i.e., smoothed features) and only slightly by those with large variations (i.e., rough features), which together account for only a very small percentage of all features, while most features can be considered noise that instead reduces performance.
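Equation (7) can be implemented directly by zeroing out the eigenvalues of the features that are not under test. The sketch below assumes a dense symmetric \(\mathbf {\hat{A}}\) and a boolean mask over its eigenvalues; it is meant only to illustrate the cropping procedure behind Figure 2.

```python
import torch

def cropped_adjacency(A_hat: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    """Build A_hat' of Equation (7): D(lambda_k) = lambda_k if keep[k] else 0."""
    lam, V = torch.linalg.eigh(A_hat)                     # A_hat = V diag(lam) V^T
    lam = torch.where(keep, lam, torch.zeros_like(lam))   # crop the untested spectral features
    return V @ torch.diag(lam) @ V.T

# e.g., test only the smoothest 5% of features (the largest eigenvalues):
# lam, _ = torch.linalg.eigh(A_hat)
# A_cropped = cropped_adjacency(A_hat, keep=lam >= torch.quantile(lam, 0.95))
```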

Now we take another look at Equation (4) and rewrite it as (8) \(\begin{equation} \mathbf {O}=\sum _{l=0}^{L}\frac{\mathbf {\hat{A}}^l}{L+1}\mathbf {E}=\left(\sum _k \left(\sum _{l=0}^L \frac{\lambda _k^l }{L+1 }\right) \mathbf {v}_k \mathbf {v}_k^T\right) \mathbf {E}, \end{equation}\) where \(\sum _{l=0}^L \frac{\lambda _k^l }{L+1 }\) is the weight of the spectral features. Figure 3(a) illustrates the normalized weights (i.e., scaled to [0,1]) of distinct spectral features on LightGCN with different numbers of layers, and we make the following observation:

Fig. 3.

Fig. 3. The normalized weights for distinct spectral features.

Observation 2.

As more graph convolution layers are stacked, the model places more emphasis on the smoother features and tends to filter out the rougher ones.

Observation 2 explains why stacking more layers tends to yield better performance. However, the noisy features cannot be completely removed even when stacking many layers (e.g., 100 layers); Figure 3(b) illustrates an ideal design, and GDE can be considered an instantiation of it. We adopt a hypergraph representation as it is more informative and powerful: (9) \(\begin{equation} \begin{aligned}&\mathbf {A}_{U}=\mathbf {D}_u^{-\frac{1}{2}}\mathbf {R}\mathbf {D}_i^{-1}\mathbf {R}^T\mathbf {D}_u^{-\frac{1}{2}} ,\\ &\mathbf {A}_{I}=\mathbf {D}_i^{-\frac{1}{2}}\mathbf {R}^T\mathbf {D}_u^{-1}\mathbf {R}\mathbf {D}_i^{-\frac{1}{2}}. \end{aligned} \end{equation}\) We generate embeddings on smoothed and rough graphs, which are composed of the top smoothed and rough features, respectively. For instance, the embeddings on the smoothed graph are generated as (10) \(\begin{equation} \begin{aligned}&\mathbf {H}_{U}^{(s)}=\left(\mathbf {P}^{(s)} diag\left(\Delta \left(\pi ^{(s)}_k\right)\right){\mathbf {P}^{(s)}}^T\right) \mathbf {E}_U,\\ &\mathbf {H}_{I}^{(s)}=\left(\mathbf {Q}^{(s)}diag\left(\Delta \left(\pi ^{(s)}_k\right)\right){\mathbf {Q}^{(s)}}^T\right)\mathbf {E}_I, \end{aligned} \end{equation}\) where \(\mathbf {P}^{(s)}\) and \(\mathbf {Q}^{(s)}\) are the top smoothed eigenvectors of \(\mathbf {A}_{U}\) and \(\mathbf {A}_{I}\), respectively; \(\pi ^{(s)}_k\) is the corresponding eigenvalue; \(\mathbf {E}_U\) and \(\mathbf {E}_I\) are the embedding matrices for users and items, respectively; \(diag(\cdot)\) is a diagonalization operator; and \(\Delta (\cdot)\) is a function learning the importance of different spectral features, since stacking layers essentially adjusts the weights of spectral features. Similarly, we can generate the embeddings on the rough graph, and the final representations are generated from the embeddings on the two graphs. Overall, GDE makes GCNs simpler as it does not need to stack layers to achieve superior performance, which significantly reduces the complexity of GCNs. However, there are still two directions in which GDE can be improved: (1) since the sizes of the adjacency matrix and the embedding matrices increase linearly with the dataset size, which can still result in heavy computation, can the model structure be made even simpler? (2) GDE additionally requires preprocessing to retrieve spectral features; however, the number of required spectral features depends on the dataset, making the efficiency unstable and unpredictable (i.e., what if the preprocessing costs too much time on some datasets?). In addition, we notice that the rough features barely bring improvement and tend to be less effective on sparse data, whereas the smoothed features always contribute to accuracy; thus we focus only on smoothed features in this work. It is worth noting that the findings in this subsection hold only for CF, because the importance of different spectral features varies across tasks according to the data characteristics [3].
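The following is a minimal sketch of how GDE generates the user embeddings on the smoothed graph (Equation (10)), assuming the top smoothed eigenvectors and eigenvalues of \(\mathbf {A}_U\) have been precomputed; the exponential form of \(\Delta (\cdot)\) is one admissible choice rather than the only one.

```python
import torch

def smoothed_user_embeddings(P_s: torch.Tensor,    # |U| x K_s top smoothed eigenvectors of A_U
                             pi_s: torch.Tensor,   # K_s corresponding eigenvalues
                             E_U: torch.Tensor,    # |U| x d user embedding table
                             beta: float = 2.0) -> torch.Tensor:
    delta = torch.exp(beta * pi_s)                 # Delta(pi_k): importance of each spectral feature
    # (P diag(Delta(pi)) P^T) E_U, computed right-to-left to avoid a |U| x |U| matrix
    return P_s @ (delta.unsqueeze(1) * (P_s.T @ E_U))
```

Computing the product right-to-left keeps the cost at \(\mathcal {O}(|\mathcal {U}|K_sd)\) instead of materializing a dense \(|\mathcal {U}|\times |\mathcal {U}|\) matrix.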

3.2 Structure Redundancy

The node form of Equation (4) can be formulated as follows: (11) \(\begin{equation} \mathbf {o}_k=\sum _{z\in {\mathcal {V}}}\alpha _{kz}\mathbf {e}_z, \end{equation}\) where k and z are arbitrary nodes including both users and items, and \(\alpha _{kz}\) is the contribution from node z. The term \(\mathbf {e}_z\) is shared by all nodes, so the difference between the representations of distinct nodes lies in \(\alpha _{kz}\); it is easy to obtain the following relation: (12) \(\begin{equation} \alpha _{kz}=\frac{\sum _{l=0}^L\mathbf {\hat{A}}^l_{kz}}{L+1}=\left(\mathbf {V}_{k*}\odot \frac{\sum _{l=0}^L\lambda ^l}{L+1}\right)\mathbf {V}_{z*}^T, \end{equation}\) and we can informally define the similarity of the representations of a user u and an item i as (13) \(\begin{equation} \begin{aligned}{\rm sim}(\mathbf {o}_u, \mathbf {o}_i)&=vec(\alpha _{uz})vec(\alpha _{iz})^T\\ &=\left(\mathbf {V}_{u*}\odot \frac{\sum _{l=0}^L\lambda ^l}{L+1}\right)\left(\mathbf {V}_{i*}\odot \frac{\sum _{l=0}^L\lambda ^l}{L+1}\right)^T, \end{aligned} \end{equation}\) where \(\mathbf {V}_{k*}\) is the kth row of \(\mathbf {V}\) and can be considered k’s feature vector, \(\lambda\) is a row vector containing all eigenvalues, \(\odot\) is the element-wise multiplication operator, \(vec(\cdot)\) is a vectorization operator, and \(vec(\alpha _{uz})\) and \(vec(\alpha _{iz})\) are row vectors containing the contributions from all nodes to u and i, respectively. We can see that \(\alpha _{kz}\) can be considered a weighted cosine similarity between two nodes’ feature vectors. Then, Equation (11) can be rewritten as (14) \(\begin{equation} \begin{aligned}&\mathbf {o}_u=\left(\mathbf {V}_{u*}\odot \frac{\sum _{l=0}^L\lambda ^l}{L+1}\right)\mathbf {V}^T\mathbf {E},\\ &\mathbf {o}_i=\left(\mathbf {V}_{i*}\odot \frac{\sum _{l=0}^L\lambda ^l}{L+1}\right)\mathbf {V}^T\mathbf {E}. \end{aligned} \end{equation}\) We observe that \(\mathbf {V}^T\mathbf {E}\) is a common term shared by all users and items. Furthermore, comparing Equation (14) with (13), \(\mathbf {V}^T\mathbf {E}\) appears to be redundant for the model. To verify this observation, we evaluate two variants: (1) the original model of Equation (14), and (2) a variant in which we replace \(\mathbf {V}^T\mathbf {E}\) with a weight matrix \(\mathbf {W}\), assuming it is redundant. We evaluate them on two datasets, CiteULike and ML-100K, and report the results in Figure 4(a) and (b). One might ask why we do not simply remove \(\mathbf {V}^T\mathbf {E}\) if we claim it is redundant. Although the similarity between two nodes can be defined as in Equation (13), the spectral features contain information from the high-dimensional sparse interaction matrix, which is always noisy, and simply removing \(\mathbf {V}^T\mathbf {E}\) deprives the model of its denoising ability. Existing works denoise by mapping the interaction matrix to a low-dimensional space [23]; thus, we add a linear transformation here. We can see that the accuracies of these two models are very close, which supports our observation. On the other hand, the simplified model does not explicitly aggregate the neighborhood, in contrast to Equation (4): (15) \(\begin{equation} \begin{aligned}&\mathbf {o}_u=\left(\mathbf {V}_{u*}\odot \frac{\sum _{l=0}^L\lambda ^l}{L+1}\right)\mathbf {W},\\ &\mathbf {o}_i=\left(\mathbf {V}_{i*}\odot \frac{\sum _{l=0}^L\lambda ^l}{L+1}\right)\mathbf {W}, \end{aligned} \end{equation}\) yet it matches GCN-based methods in terms of accuracy. Thus, we can make the following observation:

Fig. 4.

Fig. 4. In (a) and (b), we compare the original Equation (14) and the simplified Equation (15) models on CiteULike and ML-100K evaluated by nDCG@10, respectively. In (c) and (d), we compare S-LightGCN and S-LightGCN-T on CiteULike and ML-100K evaluated by nDCG@10, respectively.

Observation 3.

Although neighborhood aggregation is considered a core design of GCNs, it is not what makes GCNs effective for CF. The accuracy is mainly contributed by the weighted spectral graph features.

In our previous work [30], we adopt a hypergraph setting. We further make the following observation: (16) \(\begin{equation} \begin{aligned}&\mathbf {\hat{R}}=\mathbf {P}diag\left(\sigma _k\right)\mathbf {Q}^T,\\ &\mathbf {A}_{U}=\mathbf {\hat{R}}\mathbf {\hat{R}}^T=\mathbf {P}diag\left(\sigma _k^2\right)\mathbf {P}^T,\\ &\mathbf {A}_{I}=\mathbf {\hat{R}}^T\mathbf {\hat{R}}=\mathbf {Q}diag\left(\sigma _k^2\right)\mathbf {Q}^T. \end{aligned} \end{equation}\)

Observation 4.

Given \(\mathbf {\hat{R}}\)’s left and right singular vectors \(\mathbf {p}_k\) and \(\mathbf {q}_k\) (matrix forms \(\mathbf {P}\) and \(\mathbf {Q}\)) and singular values \(\sigma _k\) (vector form \(\sigma\)), \(\lbrace \mathbf {P}, \mathbf {Q}\rbrace\) and \(\sigma _k^2\) are the hypergraphs’ (\(\mathbf {A}_{U}\)’s and \(\mathbf {A}_{I}\)’s) eigenvectors and eigenvalues, respectively.

Decomposing \(\mathbf {\hat{R}}\) with SVD has a lower computational cost than decomposing \(\mathbf {\hat{A}}\) (see results in Section 5.2.2) and also saves the space for storing \(\mathbf {\hat{A}}\). According to Observation 4, the previous analysis of GCNs applies to Equation (16) as well: we can simply replace \(\mathbf {V}\) and \(\lambda\) in Equation (15) with \(\mathbf {P}\), \(\mathbf {Q}\), and \(\sigma\); we name the resulting model S-LightGCN. In addition, we propose S-LightGCN-T, which exploits only the top-x% smoothest features; Figure 4(c) and (d) show the comparison between these two variants. We can see that the model with only the top 1% smoothed features already outperforms the model with the full set of spectral features (i.e., all singular vectors), and the performance peaks at 4% and 6% on CiteULike and ML-100K, respectively, which further demonstrates that our previous observations on conventional GCNs still hold for S-LightGCN.
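A sketch of S-LightGCN-T under the hypergraph view of Equation (16): we take only the top-K (smoothest) singular triplets of \(\mathbf {\hat{R}}\) and weight them as in Equation (15), with \(\mathbf {V}\) and \(\lambda\) replaced by \(\lbrace \mathbf {P},\mathbf {Q}\rbrace\) and \(\sigma\). The dense `R_hat`, the kept ratio, and the variable names are assumptions of this illustration.

```python
import torch

def s_lightgcn_t(R_hat: torch.Tensor, d: int = 64, ratio: float = 0.04, L: int = 3):
    K = max(1, int(ratio * min(R_hat.shape)))          # keep only the top-x% smoothest features
    P, sigma, Q = torch.svd_lowrank(R_hat, q=K)        # top-K singular triplets of R_hat
    weight = sum(sigma ** l for l in range(L + 1)) / (L + 1)   # polynomial weight of Equation (15)
    W = torch.nn.Parameter(torch.empty(K, d))
    torch.nn.init.xavier_uniform_(W)
    O_U = (P * weight) @ W                             # user representations
    O_I = (Q * weight) @ W                             # item representations
    return O_U, O_I, W
```

With `ratio=0.04`, this roughly corresponds to the top 4% smoothed features at which performance peaks on CiteULike in Figure 4(c).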

3.3 Distribution Redundancy

We notice that the number of required spectral features (i.e., K) varies across datasets. Figure 5(a) and (b) illustrate the spectral distributions of CiteULike and ML-1M, on which our previously proposed GDE [30] reaches its best performance with the top 30% and 5% smoothed features, respectively. On CiteULike, over 10% of the components correspond to the largest eigenvalue and the spectral values drop slowly, while the spectral distribution on ML-1M is sharper and the spectral values drop more quickly. Recall from our previous analysis that a feature with a larger spectral value (i.e., a smoother feature) tends to be more important for recommendation. Thus, we can make the following observation:

Fig. 5.

Fig. 5. Experimental results for tackling distribution redundancy.

Observation 5.

A flatter (sharper) spectral distribution (i.e., the spectral value drops more slowly (quickly)) implies the graph is composed of more (fewer) smoothed components and thus requires more (fewer) features.

Apparently, retrieving spectral features on a flatter spectral distribution requires more computational cost, while we hope the important information can be concentrated in as few features as possible to reduce the cost. We then raise a simple question: Can we reduce the number of required features K? This question is equivalent to: How can we sharpen the distribution? Without considering a specific recommendation algorithm, a user or an item can be represented by the corresponding row of the adjacency matrix (e.g., \(\hat{\mathbf {A}}_{u*}\) for u and \(\hat{\mathbf {A}}_{i*}\) for i), and the similarity can simply be measured by their dot product or cosine similarity. We can see that only homogeneous nodes (i.e., user-user and item-item pairs) connected to common nodes (e.g., users that interacted with common items, and items interacted with by common users) have non-zero similarities. Consider the average similarity between a node u and other nodes: (17) \(\begin{equation} \begin{aligned}&\mathbf {\hat{A}}_{u*}\sum _{z\in \mathcal {N}_u^2}\mathbf {\hat{A}}_{z*}^T=\left(\mathbf {V}_{u*}\odot \lambda \right)\mathbf {V}^T \left(\left(\sum _z\mathbf {V}_{z*}\odot \lambda \right)\mathbf {V}^T\right)^T,\\ &=\left(\mathbf {V}_{u*}\odot \lambda \right) \left(\sum _z\mathbf {V}_{z*}\odot \lambda \right)^T, \end{aligned} \end{equation}\) where \(\mathcal {N}_u^2\) is the set of u’s second-order neighbors, which have non-zero similarities with u.

Definition 2 (Variation on Second-order Graphs).

The variation of the eigenvectors on the second-order graph can be defined as (18) \(\begin{equation} \left\Vert \mathbf {v}_k-\mathbf {\hat{A}}^2\mathbf {v}_k\right\Vert =1-\lambda _k^2\in [0,1] . \end{equation}\)

Interpretation of Equation (17). Definition 2 measures the difference between the signal samples of the eigenvectors at each node (\(\mathbf {V}_{uk}\)) and at its second-order neighbors (\(\sum _z\mathbf {V}_{zk}\)). Intuitively, \(\mathbf {v}_k\) with \(|\lambda _k|\rightarrow 1\) implies that the nodes are similar to their second-order neighbors: \(|\mathbf {V}_{uk}-\sum _z\mathbf {V}_{zk}|\rightarrow 0\), while \(\mathbf {v}_k\) with \(|\lambda _k|\rightarrow 0\) emphasizes the difference between \(\mathbf {V}_{uk}\) and \(\sum _z\mathbf {V}_{zk}\). Considering \(\lambda\) as a band-pass filter, as the spectral distribution becomes sharper, the components with \(|\lambda _k|\rightarrow 1\) and \(|\lambda _k|\rightarrow 0\) are emphasized and suppressed, respectively, leading to a higher similarity between the nodes and the second-order neighbors that have non-zero similarities with them. In other words, the sharpness of the spectral distribution is closely related to the average node similarity defined on the normalized adjacency matrix. On the other hand, the obvious difference between ML-1M and CiteULike is the data density: the users/items of ML-1M have more interactions and are thus more likely to have non-zero similarities with other users/items, resulting in a sharper spectral distribution. Then, the original question can be transformed into: How do we increase the average node similarity (defined on the interaction matrix) to sharpen the spectral distribution?

Without changing the interactions, the key to increasing the average similarity lies in the weight of the adjacency relation: \(\mathbf {\hat{A}}_{ui}=\frac{1}{\sqrt {d_u}\sqrt {d_i}}\). To this end, we define a modified normalized adjacency matrix \(\mathbf {\bar{A}}_{ui}=w(d_u)w(d_i)\) and investigate which node weights lead to a higher average similarity. Consider a node with degree z: any two nodes connected to this node have a similarity, and there are \(z(z-1)\) such pairs. Let \(p(z)\) be the probability that two nodes are connected through this node rather than through other nodes, so \(p(z)\propto z^2\); the weight of a node with degree z is represented by \(w(z)\). It is reasonable to assume that the average similarity is contributed by nodes with different degrees, so we can measure the contribution of a node to the average similarity with \(p(z)w(z)\), and measure the average similarity with: (19) \(\begin{equation} \begin{aligned}&\int ^{d_{max}}_{d_{min}}p(z)w(z)dz,\\ s.t. &\int ^{d_{max}}_{d_{min}}p(z)dz=1,\quad \int ^{d_{max}}_{d_{min}}w(z)dz=1, \end{aligned} \end{equation}\) where \(d_{min}\) and \(d_{max}\) are the minimum and maximum node degrees, respectively. Note that Equation (19) does not reflect the exact value of the average similarity; we use it to investigate which \(w(z)\) leads to a higher average similarity, which helps sharpen the spectral distribution and thus reduce the number of required spectral graph features. We do not need to calculate this integral if we set \(w(z)=\frac{1}{z^\alpha }\) (\(\alpha =0.5\) in the setting of \(\mathbf {\hat{A}}\)). Then, \(p(z)w(z)\propto z^{2-\alpha }\), and the integral monotonically decreases with \(\alpha\), implying that a higher weight on the high-degree nodes (i.e., a smaller \(\alpha\)) leads to a higher average similarity. Thus, theoretically, a \(w(z)\) placing a higher weight on high-degree nodes than the original setting \(w(z)=\frac{1}{\sqrt {z}}\) results in a higher average similarity. In this work, we set \(\mathbf {\bar{A}}_{ui}=\frac{1}{\sqrt {d_u+\alpha } \sqrt {d_i+\alpha }}\), where \(\alpha \in \mathbb {R}^+\). To avoid introducing too many notations, we reuse \(\alpha\) here for simplicity; its range is different from the \(\alpha\) in \(w(z)=\frac{1}{z^\alpha }\). Note that since \((\frac{1}{\sqrt {z}})^{\prime }\) is monotonically increasing, the relative weights of \(\frac{1}{\sqrt {z+\alpha }}\) on high-degree nodes are higher than those of \(\frac{1}{\sqrt {z}}\), and they are emphasized more as \(\alpha\) is set larger. Figure 5(c) and (d) show that the average user and item similarities constantly increase as \(\alpha\) increases; in Figure 5(e) and (f), we observe a sharper distribution as \(\alpha\) increases, so the number of required features (i.e., K) is expected to be reduced. Note that there is a tradeoff between the reduction of K and the integrity of the interactions: an overly large \(\alpha\) would sabotage the original interactions.
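A toy sketch of the \(\alpha\)-adjusted normalization: build \(\mathbf {\bar{R}}=(\mathbf {D}_U+\alpha \mathbf {I})^{-\frac{1}{2}}\mathbf {R}(\mathbf {D}_I+\alpha \mathbf {I})^{-\frac{1}{2}}\) and inspect how the leading singular values decay as \(\alpha\) grows. A dense 0/1 matrix `R` (with every user and item having at least one interaction) is assumed purely for illustration.

```python
import torch

def normalized_interactions(R: torch.Tensor, alpha: float = 0.0) -> torch.Tensor:
    d_u = R.sum(dim=1)          # user degrees
    d_i = R.sum(dim=0)          # item degrees
    return R / torch.sqrt(d_u + alpha).unsqueeze(1) / torch.sqrt(d_i + alpha).unsqueeze(0)

# for alpha in (0.0, 1.0, 3.0):
#     sigma = torch.linalg.svdvals(normalized_interactions(R, alpha))[:20]
#     print(alpha, sigma / sigma[0])   # a larger alpha -> faster decay, i.e., a sharper spectrum
```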


4 METHODOLOGY

4.1 Simplified Graph Denoising Encoder (SGDE)

Based on the analysis and observations in Section 3, the user and item representations can be formulated as follows: (20) \(\begin{equation} \begin{aligned}&\mathbf {O}_U=\mathbf {P}^{(K)}diag\left(\Delta \left(\sigma _k\right)\right)\mathbf {W},\\ &\mathbf {O}_I=\mathbf {Q}^{(K)}diag\left(\Delta \left(\sigma _k\right)\right)\mathbf {W}. \end{aligned} \end{equation}\) SGDE has the following three components:

Stacked top-K smoothest left and right singular vectors of \(\mathbf {\hat{R}}\): \(\mathbf {P}^{(K)}\in \mathbb {R}^{\left|\mathcal {U}\right|\times K}\) and \(\mathbf {Q}^{(K)}\in \mathbb {R}^{\left|\mathcal {I}\right|\times K}\).

GCNs use a polynomial to weight the spectral features; here, we abstract it as a nonparametric function \(\Delta (\cdot)\), since a dynamic choice has been shown to be ineffective [30].

A feature transformation \(\mathbf {W}\in \mathbb {R}^{K\times d}\), where \(K\ll {\rm min}(\left|\mathcal {U}\right|, \left|\mathcal {I}\right|)\).

On conventional GCNs, the embeddings can only be updated in matrix form, resulting in high space complexity, whereas SGDE can be optimized node-wise: (21) \(\begin{equation} \begin{aligned}&\mathbf {o}_u=\mathbf {P}^{(K)}_{u*}\odot \Delta \left(\sigma \right)\mathbf {W},\\ &\mathbf {o}_i=\mathbf {Q}^{(K)}_{i*}\odot \Delta \left(\sigma \right)\mathbf {W}, \end{aligned} \end{equation}\) where \(\sigma\) is a row vector containing the top-K singular values. Note that the element-wise multiplication is preprocessed since \(\Delta (\cdot)\) is a static function. Finally, SGDE is optimized with the BPR loss [33]: (22) \(\begin{equation} \mathcal {L}_{main}=-\sum _{u\in \mathcal {U}}\sum _{(u,i^+)\in \mathbf {R}^+,(u,i^{-})\notin \mathbf {R}^+}\ln \sigma \left(\mathbf {o}_u^T\mathbf {o}_{i^+} - \mathbf {o}_u^T\mathbf {o}_{i^{-}} \right). \end{equation}\)
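A minimal PyTorch sketch of SGDE under Equations (20)-(22), assuming the top-K singular triplets of \(\mathbf {\hat{R}}\) are precomputed and that \(\Delta (\cdot)\) is the exponential weighting used later in the experiments; all names are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGDE(nn.Module):
    def __init__(self, P, Q, sigma, d: int = 64, beta: float = 2.0):
        super().__init__()
        weight = torch.exp(beta * sigma)            # Delta(sigma_k), preprocessed once
        self.register_buffer("PU", P * weight)      # |U| x K weighted user spectral features
        self.register_buffer("QI", Q * weight)      # |I| x K weighted item spectral features
        self.W = nn.Parameter(torch.empty(P.shape[1], d))
        nn.init.xavier_uniform_(self.W)             # only K x d trainable parameters

    def bpr_loss(self, users, pos_items, neg_items):
        o_u = self.PU[users] @ self.W               # Equation (21), node-wise
        o_pos = self.QI[pos_items] @ self.W
        o_neg = self.QI[neg_items] @ self.W
        return -F.logsigmoid((o_u * o_pos).sum(-1) - (o_u * o_neg).sum(-1)).mean()
```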

4.2 Discussion

4.2.1 Weighting Function.

By considering \(\Delta (\cdot)\) as a continuous function of \(\sigma _k\), we can expand it according to the Maclaurin expansion: (23) \(\begin{equation} \Delta (\sigma _k)=\sum _{l=0}^L \alpha _l \sigma _k^l, \end{equation}\) where \(\alpha _l=\frac{\Delta ^{(l)}(0)}{l!}\). By applying \(\Delta (\cdot)\) to conventional GCNs such as Equation (4), we can rewrite it as (24) \(\begin{equation} \mathbf {O}=\mathbf {V}diag\left(\Delta \left(\lambda _k \right)\right)\mathbf {V}^T\mathbf {E}=\left(\sum _{l=0}^L \alpha _l \mathbf {\hat{A}}^l \right) \mathbf {E}. \end{equation}\) Here, \(\alpha _l\) is also the contribution from the lth-order neighborhood. In addition to the previous observation that \(\Delta (\cdot)\) should be monotonically increasing to emphasize smoother features, we also hope the model can capture neighborhood signals from as far away as possible with positive contributions, implying \(\alpha _l \ge 0\) and \(L\rightarrow \infty\) (i.e., \(\Delta (\cdot)\) is infinitely differentiable with all derivatives non-negative).
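As a tiny sanity check of this expansion, the exponential weighting \(\Delta (\sigma _k)=e^{\beta \sigma _k}\) (used later in the experiments) has Maclaurin coefficients \(\alpha _l=\beta ^l/l!\), which are all non-negative for \(\beta >0\), i.e., every neighborhood order contributes positively:

```python
import math

beta = 2.0
coeffs = [beta ** l / math.factorial(l) for l in range(6)]
print(coeffs)   # [1.0, 2.0, 2.0, 1.333..., 0.666..., 0.266...] -- all non-negative
```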

4.2.2 Comparison with Truncated SVD.

Truncated SVD is an extensively used CF method, which exploits only the top-K largest singular values and the corresponding singular vectors for recommendation: (25) \(\begin{equation} \begin{aligned}&\mathbf {O}_U=\mathbf {P}^{(K)}diag\left(\sqrt {\sigma _k}\right),\\ &\mathbf {O}_I=\mathbf {Q}^{(K)}diag\left(\sqrt {\sigma _k}\right). \end{aligned} \end{equation}\) We can see that truncated SVD can be considered a special case of SGDE with \(\mathbf {W}=\mathbf {I}\) and \(\Delta (\sigma _k)=\sqrt {\sigma _k}\). In other words, the mechanism of how GCNs work is closely related to conventional CF methods; the main difference lies in the weighting function \(\Delta (\cdot)\), where we can adjust the order (i.e., stack more layers) to learn smoother embeddings.
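For reference, the special case of Equation (25) can be written in a few lines; `R_hat` and `K` are again illustrative assumptions.

```python
import torch

def truncated_svd_cf(R_hat: torch.Tensor, K: int = 64) -> torch.Tensor:
    P, sigma, Q = torch.svd_lowrank(R_hat, q=K)
    O_U = P * torch.sqrt(sigma)      # Delta(sigma_k) = sqrt(sigma_k), W = I
    O_I = Q * torch.sqrt(sigma)
    return O_U @ O_I.T               # predicted preference scores
```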

4.2.3 Comparison with GCN-based Methods.

Compared with most GCN-based methods extended from the vanilla GCN [22], we slim them down by removing the core step of GCNs (i.e., neighborhood aggregation) and significantly reducing the number of model parameters (only \(\frac{K}{{\rm min}(\left|\mathcal {U}\right|, \left|\mathcal {I}\right|)}\) times that of LightGCN or MF). Compared with GDE, SGDE has lower space and time complexity since we only need to store the top-K singular vectors, and user/item embedding matrices are no longer required. In addition, we significantly reduce the preprocessing time for retrieving spectral features by replacing eigendecomposition with SVD and by sharpening the spectral distribution to concentrate important graph information in fewer spectral features. Overall, SGDE is equipped with a light and simple structure. A recent work, UltraGCN [27], draws our attention as it also simplifies GCNs by removing neighborhood aggregation: (26) \(\begin{equation} {\rm max}\sum _{u\in \mathcal {U}, i\in \mathcal {N}_u} \beta _{u,i}\mathbf {e}_u^T\mathbf {e}_i, \end{equation}\) where \(\beta _{u,i}=\frac{\sqrt {d_u+1}}{d_u\sqrt {d_i+1}}\) is obtained from a single-layer LightGCN. UltraGCN is equivalent to a weighted MF, which can only capture direct neighbor signals and loses the power of leveraging the higher-order neighborhood. On the other hand, our analysis and observations are more general as they are based on GCNs with any number of layers; thus our proposed SGDE can maximize the power of GCNs in terms of effectiveness and efficiency.

4.2.4 Complexity.

We compare the complexity of several methods in Table 1. LightGCN has the simplest structure among conventional GCN-based methods, with the same number of model parameters as BPR (or MF): \(\left|\mathbf {E}\right|=(\left|\mathcal {U}\right|+\left|\mathcal {I}\right|)d\), as does our previously proposed GDE; SGDE has an even lighter structure than BPR, with only Kd parameters.

Table 1.
Complexity/Model | BPR | GDE | SGDE | LightGCN
Space | \((\left|\mathcal {U}\right|+\left|\mathcal {I}\right|)d\) | \((\left|\mathcal {U}\right|+\left|\mathcal {I}\right|)d\) | \(Kd\) | \((\left|\mathcal {U}\right|+\left|\mathcal {I}\right|)d\)
Time | \(\mathcal {O}(c\left|\mathbf {R}^+\right|d)\) | \(\mathcal {O}(K(\left|\mathbf {A}_U^+\right|+\left|\mathbf {A}_I^+\right|)+K^2(\left|\mathcal {U}\right|+\left|\mathcal {I} \right|))+\mathcal {O}(c\left|\mathbf {R}^+\right|(\left|\mathcal {U}\right|+\left|\mathcal {I}\right|)d)\) | \(\mathcal {O}(K\left|\mathbf {R}^+\right|+K^2(\left|\mathcal {U}\right|+\left|\mathcal {I} \right|))+\mathcal {O}(c\left|\mathbf {R}^+\right|Kd)\) | \(\mathcal {O}(cL\frac{\left|\mathbf {R}^+\right|^2}{B}d)\)

Table 1. Complexity Comparison

For the time complexity, \(\left|\mathbf {A}_U^+\right|\) and \(\left|\mathbf {A}_I^+\right|\) are the numbers of edges (i.e., non-zero elements) of \(\mathbf {A}_U\) and \(\mathbf {A}_I\), respectively; B is the batch size, and c is the number of training epochs. Since conventional GCN-based methods such as LightGCN are updated in matrix form, their running time is inversely proportional to the batch size. The time complexities of SGDE and GDE consist of two parts: preprocessing for retrieving spectral features and training. Compared with GDE, SGDE has a lower training complexity, and although the preprocessing complexities look very close, we will show in Section 5.2.2 that the preprocessing of SGDE runs much faster than that of GDE.

4.3 Contrastive Simplified Graph Denoising Encoder (CSGDE)

Recently, GCL has received much attention in various fields, including recommender systems [46, 49]. The core idea is to maximize and minimize the agreement of two views from the same and different node(s), respectively, where the views are generated by perturbing the original graph, such as by randomly dropping edges or nodes. However, the existing GCL learning paradigm is computationally expensive, as generating multiple node views basically requires a multiple of the complexity of GCNs. In this section, we aim to incorporate GCL into our method without introducing too much complexity. To this end, we first analyze how GCL works for recommendation. We focus on edge dropping as it gains more improvement than other augmentations [46]. As shown in Figure 6, the edge-dropping noise tends to attack the noisy components in the middle area, while the smoothed and rough components tend to be preserved. By maximizing the agreement between embeddings from perturbed graphs, the dissimilar components (i.e., noisy features) tend to be filtered out. However, there is no guarantee that the edge-dropping noise always attacks the noisy spectral features, as the spectral features in the middle area on CiteULike show higher relevance than those on ML-1M. To summarize, the limitations of GCL are that (1) the noise added to the graph is uncontrollable; (2) it is computationally expensive; and (3) GCL ignores the latent relations, namely that indirectly connected users/items might also be closely related to the target user/item, as it only maximizes the agreement of views from the same node. To this end, we propose a feature augmentation that adds random noise to the weighted spectral features: (27) \(\begin{equation} \begin{aligned}&\mathbf {O}_U=\mathcal {N}\left(\mathbf {P}^{(K)}diag\left(\Delta \left(\sigma _k\right)\right), \mu \right)\mathbf {W},\\ &\mathbf {O}_I=\mathcal {N}\left(\mathbf {Q}^{(K)}diag\left(\Delta \left(\sigma _k\right)\right), \mu \right)\mathbf {W}, \end{aligned} \end{equation}\) where the noise is generated from a normal distribution \(\mathcal {N}(0, \mu)\) with \(\mu\) as the standard deviation. Since \(\Delta (\cdot)\) outputs the feature weights according to their importance to recommendation, the more (less) important features have larger (smaller) weights and are thus less (more) perturbed by the noise. Then, Equation (27) emphasizes the important features and tends to filter out the noisy ones. Instead of only maximizing the agreement of views from the same node, we incorporate higher-order neighbor signals: (28) \(\begin{equation} \mathcal {L}_{user}=-\sum _{u\in \mathcal {U}}\sum _{(u, u^+)\in \mathcal {E}_{\mathcal {A}_U}^L,\, (u, u^{-})\notin \mathcal {E}_{\mathcal {A}_U}^L}\ln \sigma \left(\mathbf {o}_u^T\mathbf {o}_{u^+} - \mathbf {o}_u^T\mathbf {o}_{u^{-}} \right), \end{equation}\) where \(\mathcal {E}_{\mathcal {A}_U}^L\) is the edge set considering the \(\lbrace 1, \ldots , L\rbrace\)-hop neighbors of \(\mathbf {A}_U\). Although two users may not be directly connected, they might still show similar interests and should be close in the embedding space if they are close on the graph. Since the InfoNCE loss brings additional complexity, we stick to the BPR loss here.
Similarly, the contrastive loss for higher-order item signals is generated as (29) \(\begin{equation} \mathcal {L}_{item}=-\sum _{i\in \mathcal {I}}\sum _{(i, i^+)\in \mathcal {E}_{\mathcal {A}_I}^L, \, (i, i^{-})\notin \mathcal {E}_{\mathcal {A}_I}^L}\ln \sigma \left(\mathbf {o}_i^T\mathbf {o}_{i^+} - \mathbf {o}_i^T\mathbf {o}_{i^{-}} \right), \end{equation}\) where \(\mathcal {E}_{\mathcal {A}_I}^L\) is the edge set considering the \(\lbrace 1, \ldots , L\rbrace\)-hop neighbors of \(\mathbf {A}_I\). Finally, the model is optimized with the following loss: (30) \(\begin{equation} \mathcal {L}=\mathcal {L}_{main}+\delta \mathcal {L}_{user}+\zeta \mathcal {L}_{item}+\gamma \left\Vert \Theta \right\Vert ^2_2, \end{equation}\) where \(\Theta\) denotes the model parameters, and \(\delta\) and \(\zeta\) are hyperparameters controlling the effect of higher-order neighbors. We additionally propose a robust SGDE (RSGDE) that applies only Equation (27) to SGDE, to investigate how the feature augmentation alone affects performance.
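A hedged sketch of the CSGDE objective (Equations (27)-(30)): Gaussian noise is added to the weighted spectral features, and BPR-style contrastive terms are computed over sampled L-hop user-user and item-item neighbor pairs. The tensor names and the way the positive/negative pairs are sampled are assumptions of this sketch, not the released code.

```python
import torch
import torch.nn.functional as F

def csgde_loss(PU_w, QI_w, W, batch, mu=0.1, delta=0.5, zeta=0.5, gamma=0.01):
    # Equation (27): perturb the weighted spectral features, then project with W
    O_U = (PU_w + mu * torch.randn_like(PU_w)) @ W
    O_I = (QI_w + mu * torch.randn_like(QI_w)) @ W

    # batch holds index tensors: (u, i+, i-) for the main loss,
    # (u, u+, u-) over L-hop user neighbors, (i, i+, i-) over L-hop item neighbors
    u, i_pos, i_neg, u_anc, u_pos, u_neg, j_anc, j_pos, j_neg = batch
    bpr = lambda a, p, n: -F.logsigmoid((a * p).sum(-1) - (a * n).sum(-1)).mean()

    loss_main = bpr(O_U[u], O_I[i_pos], O_I[i_neg])      # Equation (22)
    loss_user = bpr(O_U[u_anc], O_U[u_pos], O_U[u_neg])  # Equation (28)
    loss_item = bpr(O_I[j_anc], O_I[j_pos], O_I[j_neg])  # Equation (29)
    return loss_main + delta * loss_user + zeta * loss_item + gamma * W.pow(2).sum()
```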

Fig. 6.

Fig. 6. Arranging the spectral features in rough \(\rightarrow\) smooth order, we evenly divide them into 10 groups and calculate the average relevance (the absolute value of cosine similarity) between the eigenvectors of the original and the perturbed graphs (edges are randomly dropped with probability 0.1). A smaller (larger) relevance implies that the noise is more intense (weaker) on that group of features.
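A rough sketch of the measurement behind Figure 6, assuming a small dense \(\mathbf {\hat{A}}\); pairing the eigenvectors of the two graphs by index is a simplification used only for illustration.

```python
import torch

def spectral_relevance(A_hat: torch.Tensor, drop_p: float = 0.1) -> torch.Tensor:
    keep = (torch.rand_like(A_hat) > drop_p).float()
    keep = torch.triu(keep, diagonal=1)
    keep = keep + keep.T                         # drop each undirected edge with probability drop_p
    _, V = torch.linalg.eigh(A_hat)
    _, V_pert = torch.linalg.eigh(A_hat * keep)  # eigenvectors of the perturbed graph
    return (V * V_pert).sum(dim=0).abs()         # |cosine| per feature (columns have unit norm)
```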


5 EXPERIMENTS

In this section, we comprehensively evaluate our proposed methods in terms of effectiveness and efficiency. Particularly, we aim to answer the following research questions:

Do our proposed methods outperform other competitive baselines as well as our previously proposed GDE?

How efficient are our proposed methods, especially compared with GCN-based methods?

How do hyperparameters affect model performance? Do the proposed designs positively affect the model performance?

5.1 Experimental Settings

5.1.1 Datasets and Evaluation Metrics.

We list the statistics of the datasets in Table 2. The two MovieLens datasets, ML-1M and ML-100K, are collected by GroupLens and have been widely used to evaluate CF algorithms. CiteULike is collected from CiteULike, which allows users to create their own collections of articles. Gowalla [42] is a check-in dataset which records the locations users have visited. Yelp [16] is the Yelp Challenge data of user ratings on businesses. Since we focus on implicit feedback, we remove auxiliary information such as ratings and reviews and keep only user/item IDs. To further verify our previous observations and to generalize to other datasets, we evaluate on ML-1M, Yelp, and Gowalla.

Table 2.
Datasets | #User | #Item | #Interactions | Density%
CiteULike | 5,551 | 16,981 | 210,537 | 0.223
ML-100K | 943 | 1,682 | 100,000 | 6.305
ML-1M | 6,040 | 3,952 | 1,000,209 | 4.190
Yelp | 25,677 | 25,815 | 731,672 | 0.109
Gowalla | 29,858 | 40,981 | 1,027,370 | 0.084

Table 2. Statistics of Datasets

We adopt two widely used evaluation metrics, Recall and nDCG [19], to evaluate model performance. Recall measures the ratio of the relevant items in the recommended list to all relevant items in the test set, while nDCG takes the rank into consideration by assigning higher scores to relevant items ranked higher. The recommendation list is generated by ranking the unobserved items and truncating at position k. Since the advantage of GCN-based methods over traditional CF methods is their ability to leverage higher-order neighbor signals to augment the training data and thereby alleviate data sparsity, we use only 20% of the interactions for training and leave the rest for testing, so as to evaluate model robustness and stability; we randomly select 5% of the training set as a validation set for hyperparameter tuning and report the average accuracy on the test sets.
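For reference, minimal implementations of the two metrics with binary relevance (assumed definitions, not the authors' evaluation code):

```python
import math

def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked: list, relevant: set, k: int) -> float:
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / idcg if idcg > 0 else 0.0
```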

5.1.2 Baselines.

We compare our proposed methods with the following competing baselines, where the hyperparameter settings are based on the results of the original articles:

BPR [33]: This is a stable and classic MF-based method, exploiting a Bayesian personalized ranking loss for personalized rankings.

EASE [37]: This is a neighborhood-based method with a closed-form solution and shows superior performance to many traditional CF methods.

LightGCN [14]: This method removes the activation functions and feature transformations and leaves only neighborhood aggregation for recommendation. We use a three-layer architecture as the baseline.

LCFN [52]: To remove the noise from interactions for recommendation, this method replaces the spectral graph convolution with a low-pass graph convolution. We set \(F=0.005\) and use a single-layer architecture.

SGL-ED [46]: This model explores self-supervised learning on top of LightGCN [14] by maximizing the agreement of multiple views from the same node, where the node views are generated by adding noise to the original graph, such as randomly removing edges or nodes. We set \(\tau =0.2\), \(\lambda _1=0.1\), \(p=0.1\), and use a three-layer architecture.

UltraGCN [27]: This model simplifies LightGCN by replacing neighborhood aggregation with a weighted MF, which shows faster convergence and less complexity.

GDE [30]: This method only uses a very few graph features for recommendation without stacking layers, showing less complexity and higher efficiency than conventional GCN-based methods.

We omit some popular GCN-based methods such as PinSage [50], NGCF [42], and SpectralCF [56], as the aforementioned baselines have already been shown to outperform them.

5.1.3 Implementation Details.

We implement the proposed model in PyTorch and release the code on GitHub. For all models, we use SGD as the optimizer; the embedding size d is set to 64; the regularization rate \(\gamma\) is set to 0.01 on all datasets; the learning rate is tuned amongst \(\lbrace 0.001,0.005,0.01,\ldots \rbrace\); unless otherwise specified, the model parameters are initialized with Xavier initialization [10]; and the batch size is set to 256. We report the other hyperparameter settings in the next subsection.

5.2 Comparison

5.2.1 Performance.

We report the accuracy of our proposed SGDE variants and other baselines in Table 3, and make the following observations:

Table 3.
Datasets | Methods | nDCG@10 | nDCG@20 | Recall@10 | Recall@20
Yelp | BPR | 0.0388 | 0.0374 | 0.0371 | 0.0370
Yelp | Ease | 0.0360 | 0.0362 | 0.0346 | 0.0368
Yelp | LCFN | 0.0617 | 0.0627 | 0.0613 | 0.0653
Yelp | UltraGCN | 0.0417 | 0.0403 | 0.0404 | 0.0403
Yelp | LightGCN | 0.0751 | 0.0710 | 0.0725 | 0.0698
Yelp | SGL-ED | 0.0817 | 0.0794 | 0.0784 | 0.0792
Yelp | GDE | 0.0866 | 0.0850 | 0.0839 | 0.0860
Yelp | SGDE | 0.0900 | 0.0877 | 0.0870 | 0.0880
Yelp | RSGDE | 0.0947 | 0.0919 | 0.0917 | 0.0924
Yelp | CSGDE | 0.0966 | 0.0938 | 0.0933 | 0.0939
Yelp | Improv.% | +11.55 | +10.35 | +11.20 | +9.19
ML-1M | BPR | 0.5521 | 0.4849 | 0.5491 | 0.4578
ML-1M | Ease | 0.3773 | 0.3249 | 0.3682 | 0.3000
ML-1M | LCFN | 0.5927 | 0.5197 | 0.5887 | 0.4898
ML-1M | UltraGCN | 0.5326 | 0.4688 | 0.5302 | 0.4434
ML-1M | LightGCN | 0.5917 | 0.5261 | 0.5941 | 0.5031
ML-1M | SGL-ED | 0.6029 | 0.5314 | 0.6010 | 0.5035
ML-1M | GDE | 0.6482 | 0.5681 | 0.6471 | 0.5376
ML-1M | SGDE | 0.6491 | 0.5730 | 0.6496 | 0.5445
ML-1M | RSGDE | 0.6559 | 0.5771 | 0.6554 | 0.5468
ML-1M | CSGDE | 0.6581 | 0.5798 | 0.6583 | 0.5502
ML-1M | Improv.% | +1.52 | +2.06 | +1.73 | +2.34
Gowalla | BPR | 0.1086 | 0.0907 | 0.0917 | 0.0743
Gowalla | Ease | 0.0722 | 0.0670 | 0.0680 | 0.0642
Gowalla | LCFN | 0.1305 | 0.1132 | 0.1144 | 0.0980
Gowalla | UltraGCN | 0.0977 | 0.0815 | 0.0841 | 0.0681
Gowalla | LightGCN | 0.1477 | 0.1327 | 0.1368 | 0.1224
Gowalla | SGL-ED | 0.1789 | 0.1561 | 0.1563 | 0.1353
Gowalla | GDE | 0.1857 | 0.1632 | 0.1657 | 0.1449
Gowalla | SGDE | 0.1820 | 0.1607 | 0.1628 | 0.1428
Gowalla | RSGDE | 0.1917 | 0.1690 | 0.1691 | 0.1485
Gowalla | CSGDE | 0.1950 | 0.1712 | 0.1714 | 0.1496
Gowalla | Improv.% | +5.01 | +4.90 | +3.44 | +3.24

Table 3. Overall Performance Comparison

  • Improv.% denotes the improvements over the best baselines.

Overall, GCN-based methods tend to show better performance over traditional CF methods, demonstrating the superiority of GCNs for CF especially when the data is extremely sparse, as GCNs can augment interactions with rich higher-order neighbor signals.

SGL-ED performs the best among the GCN-based baselines, which demonstrates the effectiveness of self-supervised learning for CF. UltraGCN shows relatively poor performance, despite the superior performance reported in the original article. According to our analysis in Section 4.2.3, UltraGCN is basically a weighted MF that loses the ability to leverage the higher-order neighborhood, which explains why it performs poorly when data is sparse (only 20% of interactions are used for training).

By comparing SGDE, RSGDE, and CSGDE, we can see that the improvement from feature augmentation is more significant than leveraging higher-order neighbor signals, indicating that a well-designed data augmentation helps learn robust and generalizable representations.

SGDE achieves similar performance to GDE on ML-1M, outperforms GDE on Yelp, and slightly underperforms GDE on Gowalla. Note that GDE is trained with an adaptive loss, which gains improvement over GDE trained with the BPR loss; SGDE still outperforms GDE on Gowalla when both models are trained with the same BPR loss. RSGDE and CSGDE show consistent improvements over GDE, demonstrating the effectiveness of our proposed contrastive learning framework. For instance, the improvements of CSGDE over GDE on Yelp, ML-1M, and Gowalla are 11.6%, 1.5%, and 5.0%, respectively, in terms of nDCG@10.

Compared with conventional GCN-based methods such as LightGCN, CSGDE outperforms it by 28.6%, 11.2%, and 32.0%, in terms of nDCG@10, on Yelp, ML-1M, and Gowalla, respectively, demonstrating the superiority and effectiveness of our proposed designs.

5.2.2 Efficiency.

The results shown in this subsection are obtained on a machine equipped with an AMD Ryzen 9 5950X, a GeForce RTX 3090, 32GB\(\times 4\) RAM (DDR4 3200 MT/s), and an ST6000DM003 (6TB) hard disk. We report how the preprocessing time changes with K in Figure 7, where SOTA refers to SGL-ED, the best-performing baseline excluding GDE. The accuracy first increases and then drops as the number of spectral features K increases; the best performance is reached at K = 60, 60, and 90 on ML-1M, Yelp, and Gowalla, respectively, requiring only a few seconds of preprocessing. We also compare the processing time of eigendecomposition and SVD as implemented in PyTorch in Figure 8, where the processing time of eigendecomposition increases faster than that of SVD as the number of spectral features increases. Thus, the preprocessing of SGDE has a lower computational cost than that of GDE.

Fig. 7.

Fig. 7. How the preprocessing time and accuracy of SGDE change with K.

Fig. 8.

Fig. 8. Comparison between SVD and Eigendecomposition on the preprocessing time (seconds) for calculating top-K spectral features.

We compare the running time of several methods in Table 4. LightGCN is the most efficient conventional GCN-based method as it uses only the core design, neighborhood aggregation, for recommendation. UltraGCN runs and converges faster than LightGCN, as it is basically a weighted MF that does not leverage the higher-order neighborhood. SGDE runs even faster than BPR and requires the fewest epochs, as it is essentially a truncated SVD with far fewer model parameters than MF. Overall, SGDE variants tend to show higher efficiency on larger and sparser datasets. For instance, SGDE shows 289\(\times\) and 21\(\times\) speed-ups over LightGCN and GDE on Gowalla, respectively, with only 0.13% of their parameters. Although CSGDE is slower than UltraGCN and GDE in terms of training time per epoch, it still runs much faster than them overall owing to its fast convergence.

Table 4.
Dataset | Model | Time/Epoch | Epochs | Running Time | Parameters
Yelp | LightGCN | 3.66s | 180 | 658.8s | 3.3m
Yelp | UltraGCN | 1.40s | 60 | 84.00s | 3.3m
Yelp | BPR | 0.77s | 330 | 254.10s | 3.3m
Yelp | GDE | 0.97s | 150 | 213.5s | 3.3m
Yelp | SGDE | 0.74s | 8 | 7.660s | 4.1k
Yelp | RSGDE | 0.78s | 8 | 7.980s | 4.1k
Yelp | CSGDE | 1.83s | 8 | 16.38s | 4.1k
ML-1M | LightGCN | 4.76s | 270 | 1285.2s | 0.64m
ML-1M | UltraGCN | 1.50s | 25 | 37.50s | 0.64m
ML-1M | BPR | 0.80s | 120 | 96.00s | 0.64m
ML-1M | GDE | 1.00s | 40 | 40.30s | 0.64m
ML-1M | SGDE | 0.60s | 10 | 6.820s | 4.1k
ML-1M | RSGDE | 0.87s | 10 | 9.520s | 4.1k
ML-1M | CSGDE | 1.80s | 10 | 18.82s | 4.1k
Gowalla | LightGCN | 6.43s | 600 | 3,858s | 4.5m
Gowalla | UltraGCN | 2.55s | 90 | 229.5s | 4.5m
Gowalla | BPR | 1.48s | 250 | 370.0s | 4.5m
Gowalla | GDE | 1.95s | 120 | 281.0s | 4.5m
Gowalla | SGDE | 1.28s | 8 | 13.31s | 5.7k
Gowalla | RSGDE | 1.32s | 8 | 13.63s | 5.7k
Gowalla | CSGDE | 3.05s | 8 | 27.47s | 5.7k

Table 4. Running Time Comparison

5.3 Model Analysis

5.3.1 Distribution Redundancy.

We report how the number of required features K and the accuracy change with \(\alpha\) in Figure 9. We can see that K = 120, 2,000, and 3,000 on ML-1M, Yelp, and Gowalla at \(\alpha =0\), and that K constantly decreases as \(\alpha\) increases; the best accuracy is achieved at \(\alpha =\) 2, 3, and 3, with K reduced to 60, 60, and 90, respectively. For instance, we observe a 13\(\times\) speed-up on Yelp when comparing the preprocessing times for \(K=60\) and \(K=2000\), which are 1.74 s and 24.94 s, respectively, and the best accuracy at \(\alpha =3\) outperforms that at \(\alpha =0\) by 10.0%, indicating that the important information can be concentrated in fewer spectral features by increasing \(\alpha\). Overall, we can boost both efficiency and effectiveness by setting \(\alpha\) in a reasonable way (i.e., there is a tradeoff between sharpening the distribution and keeping the original data uncontaminated).

Fig. 9.

Fig. 9. How the accuracy of SGDE and the number of required features change with \(\alpha\) .

5.3.2 Structure Redundancy.

Table 5 demonstrates how GCNs become more efficient and effective by removing the structure redundancy, where SGDE+NA refers to SGDE with neighborhood aggregation. We can see that SGDE+NA shows inferior performance to SGDE, indicating the uselessness of neighborhood aggregation. SGDE+NA requires more training epochs to reach its best performance since it has more model parameters (the same number as LightGCN and BPR). Overall, our proposed SGDE has a more efficient and effective architecture than conventional GCNs.

Table 5.
Datasets | Models | nDCG@10 | Epochs | Parameters
Yelp | SGDE+NA | 0.0854 | 40 | 3.3m
Yelp | SGDE | 0.0900 | 8 | 4.1k
ML-1M | SGDE+NA | 0.6432 | 20 | 0.64m
ML-1M | SGDE | 0.6491 | 8 | 4.1k
Gowalla | SGDE+NA | 0.1787 | 35 | 4.5m
Gowalla | SGDE | 0.1820 | 8 | 5.7k

Table 5. The Effectiveness and Efficiency of Structure Redundancy

5.3.3 Weighting Function.

As the poor performance of a dynamic strategy has been reported in our previous work [30], here we only compare static functions in Table 6. According to the analysis in Section 4.2.1, we summarize three properties that might affect accuracy: whether \(\Delta (\cdot)\) is monotonically increasing, has non-negative derivatives, and is infinitely differentiable. By comparing the exponential function (i.e., \(e^{\beta \sigma _k}\)) with positive (increasing) and negative (decreasing) \(\beta\), we can see that "monotonically increasing" matters the most to recommendation accuracy, implying the importance of the components corresponding to large singular values (i.e., smoothed features). The other two properties do not significantly affect performance, as the functions satisfying only two of the properties still show superior performance (e.g., the log and polynomial functions), which differs from the results reported for GDE. We analyze the reasons behind this phenomenon as follows. "Non-negative derivatives" implies that \(\Delta (\cdot)\) is a convex increasing function whose weights grow faster with feature smoothness than those of functions such as the log function, while "infinitely differentiable" makes the model equivalent to a GCN with infinitely many layers. Since stacking more layers also places more emphasis on the smoothed features, both properties further increase the importance of smoothed features. Different from GDE, we concentrate the important information on the graph in as few spectral features as possible, so the differences in importance among the retained spectral features are reduced. As a result, emphasizing these differences has less impact on accuracy.

Table 6. Accuracy of SGDE with Different Weighting Functions on Yelp

Function                                 nDCG@10   Increasing      Pos. Coef.      Infinite
\(\log (\beta \sigma _k)\)               0.0897    \(\checkmark\)   \(\times\)       \(\checkmark\)
\(\sum _{l=0}^L\sigma _k^l\)             0.0899    \(\checkmark\)   \(\checkmark\)   \(\times\)
\(\frac{1}{1-\beta \sigma _k}\)          0.0899    \(\checkmark\)   \(\checkmark\)   \(\checkmark\)
\(e^{\beta \sigma _k} (\beta \gt 0)\)    0.0900    \(\checkmark\)   \(\checkmark\)   \(\checkmark\)
\(e^{\beta \sigma _k} (\beta \lt 0)\)    0.0609    \(\times\)       \(\times\)       \(\checkmark\)

We choose the exponential function (i.e., \(e^{\beta \sigma _k}\)) in this work, as it shows better performance with a simpler form, and report how the accuracy changes with \(\beta\) in Figure 10. The accuracy increases as \(\beta\) grows and reaches its best at \(\beta =2\). There is a significant drop in accuracy when \(\beta \le 0\), especially on Yelp and Gowalla; we speculate that the smoothed features are more important on sparse datasets, on which the accuracy is therefore more sensitive to the weighting change. Overall, SGDE is not as sensitive to hyperparameter changes as other CF methods, including GDE, so it can be adapted to different datasets without much effort on hyperparameter tuning.

Fig. 10. How the accuracy changes with \(\beta\).
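As an illustration of the weighting step, the following is a minimal sketch that applies \(\Delta (\sigma _k)=e^{\beta \sigma _k}\) to the top-K singular triplets of a normalized interaction matrix and scores user-item pairs by the inner product of the weighted spectral features. The helper name and the exact way scores are formed are simplifying assumptions for illustration, not the precise SGDE formulation given earlier in the paper.

```python
import torch


def exp_weighted_scores(r_norm: torch.Tensor, k: int = 60, beta: float = 2.0) -> torch.Tensor:
    """Score all user-item pairs from exponentially weighted top-k spectral features."""
    u, s, v = torch.svd_lowrank(r_norm, q=k)   # r_norm ≈ u @ diag(s) @ v.T
    weights = torch.exp(beta * s)              # Δ(σ_k) = e^{βσ_k}; β > 0 emphasizes smoothed features
    user_repr = u * weights                    # (num_users, k) weighted user spectral features
    item_repr = v * weights                    # (num_items, k) weighted item spectral features
    return user_repr @ item_repr.t()           # (num_users, num_items) preference scores


# Toy usage: recommend 10 unseen items for user 0.
r = (torch.rand(500, 800) < 0.05).float()
d_u = r.sum(1, keepdim=True).clamp(min=1).rsqrt()
d_i = r.sum(0, keepdim=True).clamp(min=1).rsqrt()
scores = exp_weighted_scores(d_u * r * d_i, k=60, beta=2.0)
scores[r > 0] = -float("inf")                  # mask items the user has already interacted with
top10 = scores[0].topk(10).indices
```

Under this weighting, setting \(\beta \le 0\) down-weights the components with large singular values, which is consistent with the accuracy drop observed in Figure 10.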

5.3.4 Contrastive Loss.

We report the effect of the contrastive loss in Figure 11. In Figure 11(a)–(c), we show how the accuracy changes with \(\delta\) and \(\zeta\); "Both" represents the accuracy of the model with the best settings of both \(\delta\) and \(\zeta\). We observe the following:

Fig. 11. Effect of the contrastive loss.

(1) The accuracy first increases and then drops as the hyperparameters grow, and is mostly maximized at 0.5 on the three datasets (except \(\zeta =0.3\) on Yelp). The contrastive loss tends to achieve a larger improvement on sparser data: the improvement is 0.3% on the denser ML-1M and nearly 2% on the other two datasets.

(2) Incorporating both user and item homogeneous relations does not lead to an improvement over incorporating either of them. A reasonable explanation is that incorporating either relation helps optimize the other as well. For instance, consider a target user and a neighbor user \({(u, u^+)\in \mathcal {E}_{\mathcal {A}_U}^l}, l\in \lbrace 1,\ldots ,L\rbrace\), and let i and \(i^+\) be items that u and \(u^+\) have interacted with, respectively; then the possible distance between i and \(i^+\) is \(l-1\) (for \(l\gt 1\)), l, or \(l+1\). When u and \(u^+\) are optimized to be close, since i and \(i^+\) are optimized to be near u and \(u^+\), respectively, i and \(i^+\) are pulled close as well. Thus, the item higher-order relations are also optimized to some extent when considering the user higher-order relations. Similarly, the same conclusion holds if we consider the relations between neighbor items and target items. As a result, incorporating both relations leads to overfitting, showing worse performance on the data used in this work.

(3) The model performs better with user relations on ML-1M, and with item relations on Gowalla and Yelp. If we define the scale of relations as the number of all possible relations (i.e., \(\left| \mathcal {U} \right|^2\) for user relations and \(\left| \mathcal {I} \right|^2\) for item relations), then this observation suggests that optimizing the relations with a larger scale might lead to a larger improvement.

Figure 11(d)–(f) shows how the accuracy changes with the standard deviation \(\mu\), where a larger (smaller) \(\mu\) implies stronger (weaker) noise. The maximum accuracy is reached at 0.015, 0.02, and 0.03 on Yelp, ML-1M, and Gowalla, respectively; the noise tends to be stronger on sparser datasets when the best accuracy is reached (a minimal sketch of this noise-based feature augmentation is given after Table 7). Table 7 shows how the accuracy changes with L. Since almost all users and items are connected when \(L\gt 1\) on ML-1M (i.e., all users and items can be sampled as positives), we only report the accuracy with \(L=1\). The accuracy gradually increases with L on Gowalla, while the best accuracy is achieved at \(L=1\) on Yelp, which might be due to data density, since Gowalla is sparser than Yelp. Overall, higher-order relations provide auxiliary information that helps extract user preference.

Table 7. How the Accuracy Changes with L

Dataset   L = 1    L = 2    L = 3    L = 4
Yelp      0.0965   0.0963   0.0962   0.0964
ML-1M     0.6581   -        -        -
Gowalla   0.1924   0.1925   0.1949   0.1950
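To make the feature augmentation concrete, below is a minimal sketch of an InfoNCE-style contrastive term computed over two noise-perturbed views of the spectral representations, assuming Gaussian noise with standard deviation \(\mu\). The function name, temperature, and batch construction are illustrative assumptions; the actual loss used by CSGDE additionally scales the noise according to feature importance and treats higher-order neighbors within L hops as extra positives, as described above.

```python
import torch
import torch.nn.functional as F


def noise_contrastive_loss(spectral_repr: torch.Tensor,
                           mu: float = 0.02,
                           temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE-style loss over two Gaussian-noise-perturbed views of spectral features.

    spectral_repr: (batch, k) spectral representations of the nodes in a batch.
    mu: standard deviation of the Gaussian noise used for feature augmentation.
    """
    view1 = F.normalize(spectral_repr + mu * torch.randn_like(spectral_repr), dim=1)
    view2 = F.normalize(spectral_repr + mu * torch.randn_like(spectral_repr), dim=1)
    logits = view1 @ view2.t() / temperature      # (batch, batch) cosine similarities
    labels = torch.arange(spectral_repr.size(0))  # the two views of node i form the positive pair
    return F.cross_entropy(logits, labels)


# Toy usage: a batch of 256 nodes with 60 spectral features each.
batch_repr = torch.randn(256, 60, requires_grad=True)
loss = noise_contrastive_loss(batch_repr, mu=0.02)
loss.backward()
```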


6 RELATED WORK

Recommender systems have become ubiquitous in people's daily lives, as they are extensively applied in platforms such as E-commerce websites, social network services, and video streaming apps, and bring tremendous economic benefits. In this section, we briefly review traditional and advanced recommendation methods.

Early memory-based approaches provide recommendations by computing the similarity between users or items [35]. Model-based methods have since become prevalent, as they represent users and items in a systematic way. MF [23], one of the most extensively used model-based methods and the cornerstone of many advanced recommendation algorithms, predicts ratings by calculating the dot product between user and item latent vectors. With the development of web services and computing hardware, subsequent works mostly focus on (1) incorporating side information to better understand user taste and (2) developing advanced algorithms to infer user preference in a more fine-grained way. For instance, Ma et al. [26] incorporate social relations to understand user taste from friends; Lian et al. [25] augment MF by additionally modelling user activity and item influence areas; Bao et al. [1] exploit rich textual reviews to better handle the cold-start problem. On another line, advanced algorithms such as neural networks [15, 48], autoencoders [36], attention mechanisms [21], transformers [38], and reinforcement learning [55] are also applied to infer user preference in more sophisticated ways and have achieved tremendous success.

Graph Neural Networks (GNNs) have achieved tremendous success in various fields owing to their powerful ability to handle graph data. Broadly, there are two kinds of GNNs: spatial-based GNNs, which focus on the vertex domain [11, 40], and spectral-based GNNs (i.e., GCNs) [22, 28], which put more emphasis on the spectral properties of graphs. The vanilla GCN [22] is the most extensively used GCN architecture and has been applied to many fields, including recommender systems. Most existing GCN-based recommendation methods build on the vanilla GCN [42, 56] and achieve better performance than traditional methods. Due to the high complexity of GCNs, some works [5, 14] show the redundancy of activation functions and feature transformations and simplify GCN architectures. Instead of using a simple graph representation, efforts have also been made to exploit hypergraphs for more powerful representations [20]. On another line, recent works combine GCNs with other advanced tools and algorithms, such as self-supervised learning [46, 47], learning in hyperbolic space [39], federated learning [44], and disentangled representation learning [43]. In addition, some works focus on tackling issues of recommender systems or GCNs, such as over-smoothing [29], cold start [12], and popularity bias [54]. However, most existing GCN-based methods are still based on the vanilla GCN and show high computational complexity and limited scalability, despite the efforts made to simplify the model architecture. Thus, in this work, we showed the redundancy of existing works in three aspects and proposed a simplified formulation of GCNs, which significantly boosts training efficiency and reduces complexity.

Finally, we briefly introduce GCL and its applications to CF. The success of contrastive learning and self-supervised learning in computer vision [6, 13] has influenced various research fields, including graph learning [32, 51]. The primary goal of GCL is to maximize the agreement between different views of the same node against the agreement between views of different nodes, where the views are generated by perturbing the graph or node features. Most existing works for recommendation generate views by randomly dropping edges or nodes [46, 49]; we showed that the noise thereby added on the spectral features might help denoise graph information but is uncontrollable (i.e., important features might be irrecoverably polluted, which reduces recommendation performance). In addition, GCL brings additional complexity to GCNs, further sacrificing model efficiency, and it ignores the latent relations between indirectly connected nodes that are close on the graph, as GCL only maximizes the agreement between views of the same node. To this end, we proposed a scalable contrastive framework that brings significant improvement to SGDE without heavily increasing the model complexity.


7 CONCLUSION

In this work, we unveiled the redundancy of existing GCN-based recommendation methods from three aspects: feature, structure, and distribution redundancy. In particular, we reviewed GCNs from a spectral perspective and showed that only very few spectral features contribute to recommendation accuracy; GCNs can suppress most redundant and noisy features but cannot completely remove them. By providing a deep insight into how user/item representations are generated, we showed that what makes the representations distinctive lies in the spectral features, while neighborhood aggregation is not the key to GCNs' effectiveness. We further observed that the number of required spectral features is related to the spectral distribution: a dataset with a flatter distribution tends to require more spectral features to reach its best performance, resulting in higher computational cost. We reduced the number of required spectral features by increasing the node similarity to sharpen the spectral distribution, so that the important information on the graph is concentrated in fewer features. Finally, we proposed SGDE, which only exploits the K largest singular vectors for recommendation. By further analyzing how GCL works for recommendation, we also proposed a scalable contrastive framework: we performed feature augmentation by adding noise on the spectral features with intensity according to their importance, and augmented sparse supervisory signals with higher-order neighbors. Experimental results on three datasets demonstrated the effectiveness and efficiency of our proposed SGDE variants over our previously proposed GDE as well as competitive baselines, including both GCN-based and traditional CF methods. In the future, we will continue our efforts to boost the training efficiency and model scalability of GCN-based recommendation methods.


REFERENCES

[1] Bao Yang, Fang Hui, and Zhang Jie. 2014. TopicMF: Simultaneously exploiting ratings and reviews for recommendation. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI '14). 28.
[2] Berg Rianne van den, Kipf Thomas N., and Welling Max. 2017. Graph convolutional matrix completion. arXiv:1706.02263. Retrieved from https://arxiv.org/abs/1706.02263
[3] Bo Deyu, Wang Xiao, Shi Chuan, and Shen Huawei. 2021. Beyond low-frequency information in graph convolutional networks. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI '21). 3950–3957.
[4] Chen Jingyuan, Zhang Hanwang, He Xiangnan, Nie Liqiang, Liu Wei, and Chua Tat-Seng. 2017. Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'17). 335–344.
[5] Chen Lei, Wu Le, Hong Richang, Zhang Kun, and Wang Meng. 2020. Revisiting graph based collaborative filtering: A linear residual graph convolutional network approach. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI '20). 27–34.
[6] Chen Ting, Kornblith Simon, Norouzi Mohammad, and Hinton Geoffrey. 2020. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML '20). 1597–1607.
[7] Covington Paul, Adams Jay, and Sargin Emre. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys'16). 191–198.
[8] Ebesu Travis, Shen Bin, and Fang Yi. 2018. Collaborative memory network for recommendation systems. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'18). 515–524.
[9] Fan Wenqi, Ma Yao, Li Qing, He Yuan, Zhao Eric, Tang Jiliang, and Yin Dawei. 2019. Graph neural networks for social recommendation. In Proceedings of the 28th International Conference on World Wide Web (WWW'19). 417–426.
[10] Glorot Xavier and Bengio Yoshua. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS'10). 249–256.
[11] Hamilton Will, Ying Zhitao, and Leskovec Jure. 2017. Inductive representation learning on large graphs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS'17). 1024–1034.
[12] Hao Bowen, Zhang Jing, Yin Hongzhi, Li Cuiping, and Chen Hong. 2021. Pre-training graph neural networks for cold-start users and items representation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM'21). 265–273.
[13] He Kaiming, Fan Haoqi, Wu Yuxin, Xie Saining, and Girshick Ross. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR'20). 9729–9738.
[14] He Xiangnan, Deng Kuan, Wang Xiang, Li Yan, Zhang Yongdong, and Wang Meng. 2020. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'20). 639–648.
[15] He Xiangnan, Liao Lizi, Zhang Hanwang, Nie Liqiang, Hu Xia, and Chua Tat-Seng. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web (WWW'17). 173–182.
[16] He Xiangnan, Zhang Hanwang, Kan Min-Yen, and Chua Tat-Seng. 2016. Fast matrix factorization for online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'16). 549–558.
[17] Hidasi Balázs, Karatzoglou Alexandros, Baltrunas Linas, and Tikk Domonkos. 2016. Session-based recommendations with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR'16).
[18] Huang Tinglin, Dong Yuxiao, Ding Ming, Yang Zhen, Feng Wenzheng, Wang Xinyu, and Tang Jie. 2021. MixGCF: An improved training method for graph neural network-based recommender systems. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'21). 665–674.
[19] Järvelin Kalervo and Kekäläinen Jaana. 2002. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20, 4 (2002), 422–446.
[20] Ji Shuyi, Feng Yifan, Ji Rongrong, Zhao Xibin, Tang Wanwan, and Gao Yue. 2020. Dual channel hypergraph collaborative filtering. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'20). 2020–2029.
[21] Kang Wang-Cheng and McAuley Julian. 2018. Self-attentive sequential recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining (ICDM'18). 197–206.
[22] Kipf Thomas N. and Welling Max. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations (ICLR'17).
[23] Koren Yehuda, Bell Robert, and Volinsky Chris. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[24] Li Qimai, Han Zhichao, and Wu Xiao-Ming. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI '18). 3538–3545.
[25] Lian Defu, Zhao Cong, Xie Xing, Sun Guangzhong, Chen Enhong, and Rui Yong. 2014. GeoMF: Joint geographical modeling and matrix factorization for point-of-interest recommendation. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'14). 831–840.
[26] Ma Hao, Yang Haixuan, Lyu Michael R., and King Irwin. 2008. SoRec: Social recommendation using probabilistic matrix factorization. In Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM'08). 931–940.
[27] Mao Kelong, Zhu Jieming, Xiao Xi, Lu Biao, Wang Zhaowei, and He Xiuqiang. 2021. UltraGCN: Ultra simplification of graph convolutional networks for recommendation. In Proceedings of the 30th ACM International Conference on Information and Knowledge Management (CIKM'21). 1253–1262.
[28] NT Hoang and Maehara Takanori. 2019. Revisiting graph neural networks: All we have is low-pass filters. arXiv:1905.09550. Retrieved from https://arxiv.org/abs/1905.09550
[29] Peng Shaowen and Mine Tsunenori. 2020. A robust hierarchical graph convolutional network model for collaborative filtering. arXiv:2004.14734. Retrieved from https://arxiv.org/abs/2004.14734
[30] Peng Shaowen, Sugiyama Kazunari, and Mine Tsunenori. 2022. Less is more: Reweighting important spectral graph features for recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'22). 1273–1282.
[31] Peng Shaowen, Sugiyama Kazunari, and Mine Tsunenori. 2022. SVD-GCN: A simplified graph convolution paradigm for recommendation. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM'22). 1625–1634.
[32] Qiu Jiezhong, Chen Qibin, Dong Yuxiao, Zhang Jing, Yang Hongxia, Ding Ming, Wang Kuansan, and Tang Jie. 2020. GCC: Graph contrastive coding for graph neural network pre-training. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'20). 1150–1160.
[33] Rendle Steffen, Freudenthaler Christoph, Gantner Zeno, and Schmidt-Thieme Lars. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI'09). 452–461.
[34] Sandryhaila Aliaksei and Moura Jose M. F. 2014. Discrete signal processing on graphs: Frequency analysis. IEEE Transactions on Signal Processing 62, 12 (2014), 3042–3054.
[35] Sarwar Badrul, Karypis George, Konstan Joseph, and Riedl John. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (WWW'01). 285–295.
[36] Sedhain Suvash, Menon Aditya Krishna, Sanner Scott, and Xie Lexing. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web (WWW'15). 111–112.
[37] Steck Harald. 2019. Embarrassingly shallow autoencoders for sparse data. In Proceedings of the 28th International Conference on World Wide Web (WWW'19). 3251–3257.
[38] Sun Fei, Liu Jun, Wu Jian, Pei Changhua, Lin Xiao, Ou Wenwu, and Jiang Peng. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM'19). 1441–1450.
[39] Sun Jianing, Cheng Zhaoyue, Zuberi Saba, Pérez Felipe, and Volkovs Maksims. 2021. HGCF: Hyperbolic graph convolution networks for collaborative filtering. In Proceedings of the 30th International Conference on World Wide Web (WWW '21). 593–601.
[40] Veličković Petar, Cucurull Guillem, Casanova Arantxa, Romero Adriana, Lio Pietro, and Bengio Yoshua. 2018. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR'18).
[41] Wang Jun, Yu Lantao, Zhang Weinan, Gong Yu, Xu Yinghui, Wang Benyou, Zhang Peng, and Zhang Dell. 2017. IRGAN: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'17). 515–524.
[42] Wang Xiang, He Xiangnan, Wang Meng, Feng Fuli, and Chua Tat-Seng. 2019. Neural graph collaborative filtering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'19). 165–174.
[43] Wang Yifan, Tang Suyao, Lei Yuntong, Song Weiping, Wang Sheng, and Zhang Ming. 2020. DisenHAN: Disentangled heterogeneous graph attention network for recommendation. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM'20). 1605–1614.
[44] Wu Chuhan, Wu Fangzhao, Cao Yang, Huang Yongfeng, and Xie Xing. 2021. FedGNN: Federated graph neural network for privacy-preserving recommendation. arXiv:2102.04925. Retrieved from https://arxiv.org/abs/2102.04925
[45] Wu Chao-Yuan, Ahmed Amr, Beutel Alex, Smola Alexander J., and Jing How. 2017. Recurrent recommender networks. In Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM'17). 495–503.
[46] Wu Jiancan, Wang Xiang, Feng Fuli, He Xiangnan, Chen Liang, Lian Jianxun, and Xie Xing. 2021. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'21). 726–735.
[47] Xia Lianghao, Huang Chao, Xu Yong, Zhao Jiashu, Yin Dawei, and Huang Jimmy. 2022. Hypergraph contrastive collaborative filtering. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'22). 70–79.
[48] Xue Hong-Jian, Dai Xinyu, Zhang Jianbing, Huang Shujian, and Chen Jiajun. 2017. Deep matrix factorization models for recommender systems. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI-17). 3203–3209.
[49] Yang Yuhao, Huang Chao, Xia Lianghao, and Li Chenliang. 2022. Knowledge graph contrastive learning for recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'22). 1434–1443.
[50] Ying Rex, He Ruining, Chen Kaifeng, Eksombatchai Pong, Hamilton William L., and Leskovec Jure. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18). 974–983.
[51] You Yuning, Chen Tianlong, Sui Yongduo, Chen Ting, Wang Zhangyang, and Shen Yang. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems (NeurIPS'20) 33 (2020), 5812–5823.
[52] Yu Wenhui and Qin Zheng. 2020. Graph convolutional network for recommendation with low-pass collaborative filters. In Proceedings of the 37th International Conference on Machine Learning (ICML'20). 10936–10945.
[53] Zhang Sixiao, Chen Hongxu, Ming Xiao, Cui Lizhen, Yin Hongzhi, and Xu Guandong. 2021. Where are we in embedding spaces? In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (KDD'21). 2223–2231.
[54] Zhao Minghao, Wu Le, Liang Yile, Chen Lei, Zhang Jian, Deng Qilin, Wang Kai, Shen Xudong, Lv Tangjie, and Wu Runze. 2022. Investigating accuracy-novelty performance for graph-based collaborative filtering. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'22). 50–59.
[55] Zheng Guanjie, Zhang Fuzheng, Zheng Zihan, Xiang Yang, Yuan Nicholas Jing, Xie Xing, and Li Zhenhui. 2018. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 27th International Conference on World Wide Web (WWW'18). 167–176.
[56] Zheng Lei, Lu Chun-Ta, Jiang Fei, Zhang Jiawei, and Yu Philip S. 2018. Spectral collaborative filtering. In Proceedings of the 12th ACM Conference on Recommender Systems (RecSys'18). 311–319.
