Abstract
Random walk with restart (RWR) is a widely used measure of node similarity in graphs, and it has proved useful for ranking, community detection, link prediction, and anomaly detection, among other tasks. Since RWR typically needs to be computed separately for a large number of query nodes, or even for all nodes, its fast computation is indispensable. For hypergraphs, however, the fast computation of RWR has remained unexplored despite its great potential. In this paper, we propose ARCHER, a fast computation framework for RWR on hypergraphs. Specifically, we first formally define RWR on hypergraphs, and then we propose two computation methods that compose ARCHER. Since the two methods are complementary (i.e., each offers relative advantages on different hypergraphs), we also develop a method for automatically selecting between them, which takes a very short time compared to the total running time. Through extensive experiments on 18 real-world hypergraphs, we demonstrate (a) the speed and space efficiency of ARCHER, (b) the complementary nature of the two computation methods composing ARCHER, (c) the accuracy of its automatic selection method, and (d) its successful application to anomaly detection on hypergraphs.
Notes
Let k and n denote the hub selection ratio and the number of nodes, respectively. SlashBurn removes \(\lceil kn \rceil\) high-degree nodes (called hubs) from a graph so that it is split into the giant connected component (GCC) and remaining disconnected components (called spokes), and it recursively repeats this process on the GCC. The hubs and spokes are then utilized to construct its reordering permutation (refer to its paper for details). It is used in both BEAR and BePI.
As its iterative solver, BePI employs GMRES (Trefethen and Bau 2022), a Krylov subspace method, with a preconditioner such as incomplete LU decomposition. The solver is considered to have converged once its residual falls below the error tolerance \(\epsilon\).
References
Amburg I, Veldt N, Benson A (2020) Clustering in graphs and hypergraphs with categorical edge labels. In: Proceedings of the web conference 2020 (WWW), pp 706–717. https://doi.org/10.1145/3366423.3380152
Benson AR, Abebe R, Schaub MT et al (2018) Simplicial closure and higher-order link prediction. Proceed Natl Academy Sci. https://doi.org/10.1073/pnas.1800683115
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511804441
Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1–7):107–117. https://doi.org/10.1016/s0169-7552(98)00110-x
Chitra U, Raphael B (2019) Random walks on hypergraphs with edge-dependent vertex weights. In: Proceedings of the 36th international conference on machine learning (ICML), pp 1172–1181. arXiv:1905.08287
Chodrow PS, Veldt N, Benson AR (2021) Generative hypergraph clustering: from blockmodels to modularity. Sci Adv 7(28):eabh1303. https://doi.org/10.1126/sciadv.abh1303
Cohen MB, Kelner J, Peebles J, et al (2016) Faster algorithms for computing the stationary distribution, simulating random walks, and more. In: 2016 IEEE 57th annual symposium on foundations of computer science (FOCS), pp 583–592. https://doi.org/10.1109/FOCS.2016.69
Cohen MB, Kelner J, Kyng R, et al (2018) Solving directed laplacian systems in nearly-linear time through sparse lu factorizations. In: 2018 IEEE 59th annual symposium on foundations of computer science (FOCS), pp 898–909. https://doi.org/10.1109/FOCS.2018.00089
Comrie C, Kleinberg J (2021) Hypergraph ego-networks and their temporal evolution. In: 2021 IEEE international conference on data mining (ICDM), pp 91–100. https://doi.org/10.1109/icdm51629.2021.00019
Do MT, Yoon Se, Hooi B, et al (2020) Structural patterns and generative models of real-world hypergraphs. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining (KDD). ACM, pp 176–186. https://doi.org/10.1145/3394486.3403060
Fowler JH (2006) Connecting the congress: a study of cosponsorship networks. Polit Anal 14(4):456–487. https://doi.org/10.1093/pan/mpl002
Fowler JH (2006) Legislative cosponsorship networks in the US house and senate. Soc Netw 28(4):454–465. https://doi.org/10.1016/j.socnet.2005.11.003
Fujiwara Y, Nakatsuji M, Onizuka M, et al (2012) Fast and exact top-k search for random walk with restart. Proceed VLDB Endow 5(5):442–453. https://doi.org/10.14778/2140436.2140441
Gasteiger J, Bojchevski A, Günnemann S (2019a) Predict then propagate: Graph neural networks meet personalized pagerank. In: International conference on learning representations (ICLR). arXiv:1810.05997
Gasteiger J, Weißenberger S, Günnemann S (2019b) Diffusion improves graph learning. In: Advances in neural information processing systems (NeurIPS). arXiv:1911.05485
Harper FM, Konstan JA (2015) The MovieLens datasets. ACM Trans Interact Intell Syst 5(4):1–19. https://doi.org/10.1145/2827872
Hayashi K, Aksoy SG, Park CH, et al (2020) Hypergraph random walks, laplacians, and clustering. In: Proceedings of the 29th ACM international conference on information & knowledge management (CIKM), pp 495–504. https://doi.org/10.1145/3340531.3412034
Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press. https://doi.org/10.1017/CBO9780511810817
Hou G, Chen X, Wang S, et al (2021) Massively parallel algorithms for personalized pagerank. Proceed VLDB Endow 14(9):1668–1680. https://doi.org/10.14778/3461535.3461554
Jung J, Jin W, Sael L, et al (2016) Personalized ranking in signed networks using signed random walk with restart. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 973–978. https://doi.org/10.1109/icdm.2016.0122
Jung J, Park N, Lee S, et al. (2017) BePI. In: Proceedings of the 2017 ACM international conference on management of data (SIGMOD), pp 789–804. https://doi.org/10.1145/3035918.3035950
Jung J, Jin W, Kang U (2019) Random walk-based ranking in signed social networks: model and algorithms. Knowl Inf Syst 62(2):571–610. https://doi.org/10.1007/s10115-019-01364-z
Kang U, Faloutsos C (2011) Beyond ’caveman communities’: Hubs and spokes for graph compression and mining. In: 2011 IEEE 11th international conference on data mining (ICDM), pp 300–309. https://doi.org/10.1109/ICDM.2011.26
Langville AN, Meyer CD (2006) Google’s PageRank and beyond: the science of search engine rankings. Princeton University Press, Princeton. https://doi.org/10.1515/9781400830329
Lee G, Choe M, Shin K (2021) How do hyperedges overlap in real-world hypergraphs?—patterns, measures, and generators. In: Proceedings of the web conference 2021 (WWW), pp 3396–3407. https://doi.org/10.1145/3442381.3450010
Lee G, Choe M, Shin K (2022) HashNWalk: Hash and random walk based anomaly detection in hyperedge streams. In: Proceedings of the thirty-first international joint conference on artificial intelligence (IJCAI), pp 2129–2137. https://doi.org/10.24963/ijcai.2022/296
Lee G, Yoo J, Shin K (2023) Mining of real-world hypergraphs: Patterns, tools, and generators. In: Proceedings of the 29th ACM SIGKDD international conference on knowledge discovery & data mining (KDD). ACM, pp 5811–5812. https://doi.org/10.1145/3580305.3599567
Lee J, Jung J (2023) Time-aware random walk diffusion to improve dynamic graph learning. In: Proceedings of the AAAI conference on artificial intelligence (AAAI). https://doi.org/10.1609/aaai.v37i7.26021
Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution. ACM Trans Knowl Discovery Data 1(1):2. https://doi.org/10.1145/1217299.1217301
Li J, He J, Zhu Y (2018) E-tail product return prediction via hypergraph-based local graph cut. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (KDD), pp 519–527. https://doi.org/10.1145/3219819.3219829
Lin D, Wong RCW, Xie M, et al (2020) Index-free approach with theoretical guarantee for efficient random walk with restart query. In: IEEE 36th international conference on data engineering (ICDE), pp 913–924. https://doi.org/10.1109/icde48307.2020.00084
McAuley J, Leskovec J (2013) Discovering social circles in ego networks. arXiv:1210.8182
Nassar H, Kloster K, Gleich DF (2015) Strong localization in personalized PageRank vectors. In: Algorithms and models for the web graph (WAW), pp 190–202. https://doi.org/10.1007/978-3-319-26784-5_15
Ni J, Li J, McAuley J (2019) Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 188–197. https://doi.org/10.18653/v1/d19-1018
Page L, Brin S, Motwani R, et al (1999) The pagerank citation ranking: Bringing order to the web. Tech. rep., Stanford InfoLab
Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press. https://doi.org/10.1017/CBO9781139924801
Ranshous S, Chaudhary M, Samatova NF (2017) Efficient outlier detection in hyperedge streams using MinHash and locality-sensitive hashing. In: Complex networks & their applications VI, pp 105–116. https://doi.org/10.1007/978-3-319-72150-7_9
Shin K, Jung J, Lee S, et al (2015) Bear: Block elimination approach for random walk with restart on large graphs. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD), pp 1571–1585. https://doi.org/10.1145/2723372.2723716
Sinha A, Shen Z, Song Y, et al (2015) An overview of microsoft academic service (MAS) and applications. In: Proceedings of the 24th international conference on world wide web (WWW), pp 519–527. https://doi.org/10.1145/2740908.2742839
Sun J, Qu H, Chakrabarti D, et al (2005) Neighborhood formation and anomaly detection in bipartite graphs. In: Proceedings of the fifth IEEE international conference on data mining (ICDM), pp 418–425. https://doi.org/10.1109/ICDM.2005.103
Sun L, Ji S, Ye J (2008) Hypergraph spectral learning for multi-label classification. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pp 668–676. https://doi.org/10.1145/1401890.1401971
Tong H, Faloutsos C, Gallagher B, et al (2007a) Fast best-effort pattern matching in large attributed graphs. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pp 737–746. https://doi.org/10.1145/1281192.1281271
Tong H, Faloutsos C, Pan JY (2007) Random walk with restart: fast solutions and applications. Knowl Inf Syst 14(3):327–346. https://doi.org/10.1007/s10115-007-0094-2
Trefethen LN, Bau D (2022) Numerical linear algebra, vol 181. SIAM. https://doi.org/10.1137/1.9780898719574
Wang R, Wang S, Zhou X (2019) Parallelizing approximate single-source personalized PageRank queries on shared memory. VLDB J 28(6):923–940. https://doi.org/10.1007/s00778-019-00576-7
Wang S, Yang R, Xiao X, et al (2017) Fora: simple and effective approximate single-source personalized pagerank. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 505–514. https://doi.org/10.1145/3097983.3098072
Wang S, Yang R, Wang R et al (2019) Efficient algorithms for approximate single-source personalized PageRank queries. ACM Trans Database Syst 44(4):1–37. https://doi.org/10.1145/3360902
Wei Z, He X, Xiao X, et al (2018) Topppr: Top-k personalized pagerank queries with precision guarantees on large graphs. In: Proceedings of the 2018 international conference on management of data (SIGMOD), pp 441–456. https://doi.org/10.1145/3183713.3196920
Wu H, Gan J, Wei Z, et al (2021) Unifying the global and local approaches: An efficient power iteration with forward push. In: Proceedings of the 2021 international conference on management of data (SIGMOD), pp 1996–2008. https://doi.org/10.1145/3448016.3457298
Yin H, Benson AR, Leskovec J, et al (2017) Local higher-order graph clustering. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 555–564. https://doi.org/10.1145/3097983.3098069
Zhang Y, Zhao Z, Feng Z (2018a) A unified approach to scalable spectral sparsification of directed graphs. arXiv:1812.04165
Zhang Z, Lin H, Gao Y (2018b) Dynamic hypergraph structure learning. In: Proceedings of the twenty-seventh international joint conference on artificial intelligence (IJCAI), pp 3162–3169. https://doi.org/10.24963/ijcai.2018/439
Zhou D, Huang J, Schölkopf B (2006) Learning with hypergraphs: clustering, classification, and embedding. In: Proceedings of the 19th international conference on neural information processing systems (NIPS), pp 1601–1608. https://doi.org/10.7551/mitpress/7503.003.0205
Zhu S, Zou L, Fang B (2013) Content based image retrieval via a transductive model. J Intell Inf Syst 42(1):95–109. https://doi.org/10.1007/s10844-013-0257-4
Zien J, Schlag M, Chan P (1999) Multi-level spectral hypergraph partitioning with arbitrary vertex sizes. ITCSDI 18(9):1389–1399. https://doi.org/10.1109/iccad.1996.569592
Funding
This work was supported by National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2020R1C1C1008296) (No. NRF-2021R1C1C1008526) and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) (No. 2021-0-02068, Artificial Intelligence Innovation Hub).
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Responsible editor: Charalampos Tsourakakis.
Appendices
Appendix A: Experimental datasets
We provide a brief description of the datasets used in this paper.
- EEN and EEU.\(^{6}\) These hypergraphs represent sets of email addresses on emails in Enron (EEN) and a European research institution (EEU), where users are nodes and the group of the sender and all receivers of each email is a hyperedge.
- HB and SB.\(^{6}\) These represent co-sponsorships of bills in the House of Representatives (HB) and the Senate (SB), where US Congresspersons are nodes, and groups of sponsors and co-sponsors of bills are hyperedges.
- WAL.\(^{6}\) This is a hypergraph where products are nodes and hyperedges are sets of co-purchased products at Walmart.
- TRI.\(^{6}\) This is a hypergraph where nodes are accommodations (mostly hotels) and hyperedges are sets of accommodations on which a user performed a “click-out” during the same browsing session at Trivago.
- COD, COG, and COH.\(^{6}\) These are co-authorship hypergraphs where authors are nodes and each hyperedge represents the authors of a publication recorded on DBLP (COD), Geology (COG), and History (COH).
- THS, THM, and THU.\(^{6}\) These are hypergraphs where users are nodes and each hyperedge represents the group of users associated with a thread at StackOverflow (THS), MathStackOverflow (THM), and AskUbuntu (THU).
- AM.\(^{7}\) This is a hypergraph of Amazon (AM) product reviews (specifically, those categorized as Movies & TV) where users are nodes and a group of products reviewed by the same user is a hyperedge. Each user has at least 5 reviews.
- YP.\(^{8}\) This is a hypergraph of user ratings on locations (e.g., hotels and restaurants) at Yelp (YP) where users are nodes and a group of locations a user rated is a hyperedge. Ratings higher than 3 are considered.
- TW.\(^{9}\) This is a hypergraph of social relationships on Twitter (TW) where users are nodes and each hyperedge represents a group of users that compose a ‘circle’ (or ‘list’) together on Twitter.
- ML1, ML10, and ML20.\(^{10}\) These hypergraphs represent interactions of movies at MovieLens with different sizes of 1 M (ML1), 10 M (ML10), and 20 M (ML20) movie ratings, where nodes are movies and a group of movies a user rated is a hyperedge. Ratings higher than 3 are considered.
Appendix B: Application to node retrieval
In this section, we introduce an application of RWR scores on hypergraphs for the task of node retrieval. Additionally, we evaluate the empirical effectiveness of this approach.
Similar node retrieval. Given a hypergraph and a query node s, the task is to search for nodes structurally similar to the query node. Specifically, we measure node-to-node proximities with respect to s and use them as ranking scores to sort all nodes except s in descending order of score. If nodes of the same class as the query node are ranked high, structurally similar nodes have been successfully retrieved; this can be evaluated by ranking metrics such as AUROC and MAP. For this task, we compute the hypergraph RWR scores \(\varvec{r}\) w.r.t. query node s through ARCHER, and use the scores for ranking.
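The evaluation protocol described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names and the toy scores are hypothetical, and the scores are made-up numbers rather than RWR values produced by ARCHER. MAP is the mean of the per-query average precision computed here.

```python
def rank_nodes(scores, query):
    """Sort all nodes except the query in descending order of score."""
    return sorted((v for v in scores if v != query), key=lambda v: -scores[v])


def average_precision(ranking, relevant):
    """Average precision of one ranking; MAP is its mean over query nodes."""
    hits, total = 0, 0.0
    for rank, v in enumerate(ranking, start=1):
        if v in relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def auroc(scores, query, relevant):
    """Probability that a relevant node outscores an irrelevant one (ties count 0.5)."""
    pos = [scores[v] for v in scores if v != query and v in relevant]
    neg = [scores[v] for v in scores if v != query and v not in relevant]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values): node 0 is the query, and nodes 1 and 3
# are assumed to share its class label.
toy_scores = {0: 0.50, 1: 0.30, 2: 0.10, 3: 0.05, 4: 0.05}
same_class = {1, 3}
ranking = rank_nodes(toy_scores, query=0)
ap = average_precision(ranking, same_class)
auc = auroc(toy_scores, 0, same_class)
```

Excluding the query node before ranking matters because its own RWR score is typically the largest and would otherwise occupy the top position in every ranking.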
Settings. We conduct this experiment on the SB (senate-bills) and HB (house-bills) datasets, which contain binary node labels. The RWR models used in Sect. 5.6 are also used for this task. To introduce edge-dependent node weights (EDNW), we set \(\gamma _e(v) = \bar{d}(v)^{-\beta }\), where \(\bar{d}(v)\) represents the unweighted degree of node v. We then set \(\beta =1.0\) and \(\omega (e) = 1\) (refer to Appendix C for the selection of \(\beta =1.0\)). For EINW, we set \(\gamma _e(v) = 1\) and \(\omega (e) = 1\). We vary the restart probability c from 0.1 to 0.9 in increments of 0.1.
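As a small illustration of the weight setting \(\gamma _e(v) = \bar{d}(v)^{-\beta }\), the sketch below computes the weights for a toy hypergraph. It assumes that the unweighted degree \(\bar{d}(v)\) is the number of hyperedges containing v; the function name and the toy data are hypothetical.

```python
from collections import Counter


def ednw_weights(hyperedges, beta=1.0):
    """Return, for each hyperedge e, a dict mapping v in e to d(v) ** (-beta).

    Assumption: d(v) is the number of hyperedges that contain node v.
    """
    degree = Counter(v for e in hyperedges for v in set(e))
    return [{v: degree[v] ** (-beta) for v in set(e)} for e in hyperedges]


# Toy hypergraph with 4 nodes and 3 hyperedges (made-up data).
toy_edges = [[0, 1, 2], [1, 2], [2, 3]]
gamma = ednw_weights(toy_edges, beta=1.0)
# Node 2 lies in three hyperedges, so its weight in each of them is 1/3.
```

With a positive \(\beta\), nodes that appear in many hyperedges receive smaller weights, and a larger \(\beta\) strengthens that penalty.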
Results. Figure 8 shows the experimental results on the node retrieval task in terms of AUROC and MAP. As shown in the figure, the RWR using EDNW performs best among all tested methods, especially at high values of the restart probability c. Note that the RWR using EDNW outperforms both the RWR using EINW and naive RWR, indicating that edge-dependent node weights are also useful for this task.
Appendix C: Experiments on edge-dependent node weights for applications
In this section, we provide the experimental results regarding the effectiveness of the edge-dependent node weights for applications.
Anomaly detection. For anomaly detection in Sect. 5.6, we set edge-dependent node weights \(\gamma _e(v) = \bar{d}(v)^{-\beta }\) for hypergraph RWR. We assess the performance of RWR by varying two parameters: \(\beta\) and the restart probability c. Specifically, we test \(\beta \in \{0.5, 1.0, 2.0\}\) and vary c from 0.1 to 0.9 in increments of 0.1. As shown in Fig. 9, \(\beta = 0.5\) generally yields the best performance across the tested datasets.
Similar node retrieval. For node retrieval in Appendix B, we also set \(\gamma _e(v) = \bar{d}(v)^{-\beta }\) for hypergraph RWR. We test the node-retrieval performance of RWR by varying \(\beta\) and c. Specifically, we test \(\beta \in \{0.5, 1.0, 2.0\}\) and vary c from 0.1 to 0.9. As shown in Fig. 10, \(\beta = 1.0\) leads to the best performance in most cases.
Appendix D: Counting of the number of non-zero entries
In this section, we discuss how to compute \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) rapidly and space-efficiently. ARCHER uses them in Eq. (11) to select between the clique- and star-expansion-based methods.
Calculation of \({\texttt {nnz}({\textbf {H}}_{\mathcal {C}})}\). While it is possible to naively count the number of non-zeros in \({\textbf {H}}_{\mathcal {C}}={\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top }\), materializing \({\textbf {H}}_{\mathcal {C}}\) typically requires more space than the input data due to its relatively high density. Hence, we suggest a more efficient way based on the following property regarding \({\textbf {H}}_{\mathcal {C}}\):
\[\texttt {nnz}({\textbf {H}}_{\mathcal {C}}) = \texttt {nnz}({\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top }) = \texttt {nnz}(\tilde{\textbf{P}})\]
where \(\tilde{\textbf{P}} = \tilde{\textbf{W}}\tilde{\textbf{R}}\). The last equality holds because the diagonal entries of \(\tilde{\textbf{P}}\) are non-zero: \(\tilde{\textbf{P}}\) includes the transition probability of moving from each node v to one of its hyperedges and then back to v. Note that the sparsity pattern of \(\tilde{\textbf{P}}\) is the same as that of the adjacency matrix of the clique-expanded graph \(G_{\mathcal {C}}\) (with additional self-loops on every node) of the hypergraph \(G_{H}\). Thus, we can calculate \(\texttt {nnz}(\tilde{\textbf{P}})\) without materializing \(\tilde{\textbf{P}}\), by directly counting the edges that are clique-expanded from each hyperedge.
Algorithm 4 summarizes the procedure for computing \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\). For each node v (line 2), we find every node u that appears together with v in at least one hyperedge (line 5). Whenever we find such u, it is equivalent to finding an edge (v, u), and thus we increment the count accordingly (line 7). Note that we maintain a set C of such nodes to prevent duplicated counting (lines 3 and 8). Regardless of the input, Algorithm 4 requires \(O(|C |)=O(n)\) extra space to maintain the set C. The time complexity is \(O(\sum _{v \in \mathcal {V}}\sum _{e \in E(v)}|e |) = O(\sum _{e \in \mathcal {E}}|e |^{2})\) because it requires \(|e |\) operations for each node in e.
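The counting procedure described above can be sketched in Python as follows. This is an illustrative reading of the prose, not a transcription of Algorithm 4: the function name, the list-of-lists hypergraph representation, and the toy data are assumptions, and every node is assumed to belong to at least one hyperedge so that its self-loop is counted.

```python
def nnz_clique(hyperedges, n):
    """Count the non-zeros of the clique-expansion sparsity pattern (with
    self-loops) without materializing the n x n matrix.

    hyperedges: list of node-id lists; node ids are assumed to be 0..n-1.
    """
    # E(v): indices of the hyperedges containing v. This incidence index is
    # treated as part of the input representation.
    edges_of = [[] for _ in range(n)]
    for idx, e in enumerate(hyperedges):
        for v in set(e):
            edges_of[v].append(idx)

    count = 0
    for v in range(n):
        seen = set()  # the set C: co-members of v already counted, O(n) space
        for idx in edges_of[v]:
            for u in hyperedges[idx]:
                if u not in seen:  # first time the pair (v, u) is found
                    seen.add(u)
                    count += 1
    return count


# Toy hypergraph with 4 nodes and 3 hyperedges (made-up data).
toy_edges = [[0, 1, 2], [1, 2], [2, 3]]
toy_count = nnz_clique(toy_edges, n=4)  # rows have 3, 3, 4, and 2 non-zeros
```

Resetting the set for each node keeps the extra memory bounded by the number of nodes, at the price of visiting every member of every hyperedge incident to v, which matches the \(O(\sum _{e \in \mathcal {E}}|e |^{2})\) time bound stated above.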
Calculation of \({\texttt {nnz}({\textbf {H}}_{\star })}\). Similarly, \(\texttt {nnz}({\textbf {H}}_{\star })\) can also be efficiently calculated based on the following equalities:
\[\texttt {nnz}({\textbf {H}}_{\star }) = \texttt {nnz}({\textbf {I}}_{N}) + \texttt {nnz}(\tilde{\textbf{S}}) = (n + m) + \texttt {nnz}(\tilde{\textbf{W}}) + \texttt {nnz}(\tilde{\textbf{R}}) = (n + m) + \texttt {nnz}({\textbf {W}}) + \texttt {nnz}({\textbf {R}})\]
where \({\textbf {H}}_{\star }={\textbf {I}}_{N}-(1-c)\tilde{\textbf{S}}^{\top }\). Note that \({\textbf {I}}_{N}\) is the identity matrix of size \(N = n + m\), occupying \(n + m\) non-zeros in \({\textbf {H}}_{\star }\). The matrix \(\tilde{\textbf{S}}\) consists of \(\tilde{\textbf{W}}\) and \(\tilde{\textbf{R}}\) as shown in Eq. (5), and their sparsity patterns are the same as those of \({\textbf {W}}\) and \({\textbf {R}}\). The time complexity of this approach is dominated by that of counting the non-zero entries in \({\textbf {W}}\) and \({\textbf {R}}\). If \({\textbf {W}}\) and \({\textbf {R}}\) are stored in a sparse matrix format, the number of their non-zero entries can be computed in \(O(\texttt {nnz}({\textbf {W}})+\texttt {nnz}({\textbf {R}}))=O(\sum _{v \in \mathcal {V}}\bar{d}(v))=O(\sum _{e \in \mathcal {E}}|e |)\) time, and even in O(1) time in some formats (e.g., compressed sparse row). Apart from the inputs (i.e., \({\textbf {W}}\) and \({\textbf {R}}\)), this approach requires only a constant amount of additional space.
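A minimal sketch of this count is given below. It reads the paragraph above as \(\texttt {nnz}({\textbf {H}}_{\star }) = (n + m) + \texttt {nnz}({\textbf {W}}) + \texttt {nnz}({\textbf {R}})\) and further assumes that all weights are positive, so that \({\textbf {W}}\) and \({\textbf {R}}\) each have one non-zero per (node, hyperedge) incidence. The function name and the toy data are hypothetical.

```python
def nnz_star(hyperedges, n):
    """Count the non-zeros of the star-expansion matrix from incidences only.

    Assumptions: node ids are 0..n-1, every weight is positive, and W and R
    each contain exactly one non-zero per (node, hyperedge) incidence.
    """
    m = len(hyperedges)
    incidences = sum(len(set(e)) for e in hyperedges)  # sum of |e| over e
    identity_part = n + m            # diagonal of I_N with N = n + m
    off_diagonal = 2 * incidences    # nnz(W) + nnz(R)
    return identity_part + off_diagonal


# Toy hypergraph with 4 nodes and 3 hyperedges (made-up data).
toy_edges = [[0, 1, 2], [1, 2], [2, 3]]
toy_star = nnz_star(toy_edges, n=4)  # 4 + 3 + 2 * 7 = 21
```

Because the off-diagonal blocks connect nodes only to hyperedges, they never overlap the diagonal, which is why the two parts can simply be added.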
Appendix E: Correlation between data statistics and costs of BePI
In this section, we empirically investigate the correlations between basic data statistics and the costs of BePI, which ARCHER employs for RWR computation. As the data statistics, we use \(\texttt {nnz}({\textbf {H}})\) (i.e., \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) in clique- and star-expansion-based computations, respectively), density, overlapness, and average hyperedge size. As the costs of the clique- and star-expansion-based computation of BePI, we consider preprocessing time, space cost, and query time. The results obtained across all the datasets (refer to Appendix A) for both clique- and star-expansion-based computations are presented in Fig. 11. As shown in Fig. 11a, there exists a strong positive correlation between \(\texttt {nnz}({\textbf {H}})\) and the costs for the calculation of RWR. For other statistics (see Figs. 11b, 11c, and 11d), there is no noticeable correlation between the statistics and the costs.
About this article
Cite this article
Chun, J., Lee, G., Shin, K. et al. Random walk with restart on hypergraphs: fast computation and an application to anomaly detection. Data Min Knowl Disc 38, 1222–1257 (2024). https://doi.org/10.1007/s10618-023-00995-9
DOI: https://doi.org/10.1007/s10618-023-00995-9