
Random walk with restart on hypergraphs: fast computation and an application to anomaly detection


Abstract

Random walk with restart (RWR) is a widely used measure of node similarity in graphs, and it has proved useful for ranking, community detection, link prediction, anomaly detection, etc. Since RWR typically needs to be computed separately for a large number of query nodes, or even for all nodes, its fast computation is indispensable. However, for hypergraphs, the fast computation of RWR has been unexplored, despite its great potential. In this paper, we propose ARCHER, a fast computation framework for RWR on hypergraphs. Specifically, we first formally define RWR on hypergraphs, and then we propose the two computation methods that compose ARCHER. Since the two methods are complementary (i.e., offering relative advantages on different hypergraphs), we also develop a method for automatically selecting between them, which takes a very short time compared to the total running time. Through extensive experiments on 18 real-world hypergraphs, we demonstrate (a) the speed and space efficiency of ARCHER, (b) the complementary nature of the two computation methods composing ARCHER, (c) the accuracy of its automatic selection method, and (d) its successful application to anomaly detection on hypergraphs.


Notes

  1. Let k and n denote the hub selection ratio and the number of nodes, respectively. SlashBurn removes \(\lceil kn \rceil\) high-degree nodes (called hubs) from a graph so that the graph is split into the giant connected component (GCC) and the remaining disconnected components (called spokes), and it recursively repeats this process on the GCC. The hubs and spokes are then used to construct a node-reordering permutation (refer to the original paper for details); a minimal sketch of this splitting is given after these notes. It is used in both BEAR and BePI.

  2. As an iterative solver, BePI employs GMRES (Trefethen and Bau 2022), a Krylov subspace method, with a preconditioner such as an incomplete LU decomposition; the solver is considered converged when its residual drops below the error tolerance \(\epsilon\) (see the SciPy sketch after these notes).

  3. https://datalab.snu.ac.kr/bear

  4. https://datalab.snu.ac.kr/bepi

  5. https://github.com/geon0325/HashNWalk

  6. https://www.cs.cornell.edu/~arb/data/

  7. https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2

  8. https://www.yelp.com/dataset

  9. https://snap.stanford.edu/data/ego-Twitter.html

  10. https://grouplens.org/datasets/movielens/
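The hub-and-spoke splitting described in note 1 can be sketched as follows. This is a minimal, illustrative rendering that assumes an undirected networkx graph; the helper name `slashburn_order`, the default hub ratio, and the exact placement of hubs and spokes in the permutation are our own simplifications (refer to Kang and Faloutsos (2011) for the precise ordering used by BEAR and BePI).

```python
import math
import networkx as nx

def slashburn_order(G, k_ratio=0.005):
    """Minimal sketch of SlashBurn-style node reordering (illustrative only).

    Repeatedly removes the ceil(k*n) highest-degree nodes (hubs) from the
    current giant connected component (GCC).  Hubs are pushed to the front
    of the ordering, nodes outside the GCC (spokes) to the back, and the
    process recurses on the GCC.
    """
    G = G.copy()
    k = max(1, math.ceil(k_ratio * G.number_of_nodes()))
    front, back = [], []
    while G.number_of_nodes() > k:
        hubs = sorted(G.nodes, key=lambda v: G.degree(v), reverse=True)[:k]
        G.remove_nodes_from(hubs)
        front.extend(hubs)
        if G.number_of_nodes() == 0:
            break
        comps = sorted(nx.connected_components(G), key=len, reverse=True)
        back = [v for comp in comps[1:] for v in comp] + back   # spokes
        G = G.subgraph(comps[0]).copy()                         # recurse on the GCC
    return front + list(G.nodes) + back                         # reordering permutation

# Example usage on a small built-in graph.
order = slashburn_order(nx.karate_club_graph())
```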
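Likewise, the preconditioned iterative solve mentioned in note 2 can be illustrated with SciPy. The matrix below is random and only stands in for the linear system that BePI actually solves, so treat this as a sketch of the GMRES-plus-incomplete-LU pattern rather than of BePI itself; all variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Illustrative system: H = I - (1 - c) * P^T for a random row-stochastic P.
n, c = 1000, 0.15
P = sp.random(n, n, density=0.01, format="csr")
P = P.multiply(1.0 / P.sum(axis=1).clip(min=1e-12))   # row-normalize
H = (sp.eye(n) - (1 - c) * P.T).tocsc()
q = np.zeros(n)
q[0] = 1.0                                            # restart vector of one query node

ilu = spla.spilu(H)                                   # incomplete LU preconditioner
M = spla.LinearOperator((n, n), matvec=ilu.solve)     # apply M ~= H^{-1} via the ILU factors
x, info = spla.gmres(H, q, M=M, rtol=1e-9)            # error tolerance (use `tol=` on older SciPy);
                                                      # info == 0 means the residual met the tolerance
```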

References

  • Amburg I, Veldt N, Benson A (2020) Clustering in graphs and hypergraphs with categorical edge labels. In: Proceedings of the web conference 2020 (WWW), pp 706–717. https://doi.org/10.1145/3366423.3380152

  • Benson AR, Abebe R, Schaub MT et al (2018) Simplicial closure and higher-order link prediction. Proc Natl Acad Sci. https://doi.org/10.1073/pnas.1800683115


  • Boyd S, Boyd SP, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge. https://doi.org/10.1017/CBO9780511804441


  • Brin S, Page L (1998) The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst 30(1–7):107–117. https://doi.org/10.1016/s0169-7552(98)00110-x


  • Chitra U, Raphael B (2019) Random walks on hypergraphs with edge-dependent vertex weights. In: Proceedings of the 36th international conference on machine learning (ICML), pp 1172–1181, arXiv:1905.08287

  • Chodrow PS, Veldt N, Benson AR (2021) Generative hypergraph clustering: from blockmodels to modularity. Sci Adv 7(28):eabh1303. https://doi.org/10.1126/sciadv.abh1303


  • Cohen MB, Kelner J, Peebles J, et al (2016) Faster algorithms for computing the stationary distribution, simulating random walks, and more. In: 2016 IEEE 57th annual symposium on foundations of computer science (FOCS), pp 583–592. https://doi.org/10.1109/FOCS.2016.69

  • Cohen MB, Kelner J, Kyng R, et al (2018) Solving directed Laplacian systems in nearly-linear time through sparse LU factorizations. In: 2018 IEEE 59th annual symposium on foundations of computer science (FOCS), pp 898–909. https://doi.org/10.1109/FOCS.2018.00089

  • Comrie C, Kleinberg J (2021) Hypergraph ego-networks and their temporal evolution. In: 2021 IEEE international conference on data mining (ICDM), pp 91–100. https://doi.org/10.1109/icdm51629.2021.00019

  • Do MT, Yoon Se, Hooi B, et al (2020) Structural patterns and generative models of real-world hypergraphs. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining (KDD). ACM, pp 176–186. https://doi.org/10.1145/3394486.3403060

  • Fowler JH (2006) Connecting the congress: a study of cosponsorship networks. Polit Anal 14(4):456–487. https://doi.org/10.1093/pan/mpl002


  • Fowler JH (2006) Legislative cosponsorship networks in the US house and senate. Soc Netw 28(4):454–465. https://doi.org/10.1016/j.socnet.2005.11.003


  • Fujiwara Y, Nakatsuji M, Onizuka M, et al (2012) Fast and exact top-k search for random walk with restart. Proc VLDB Endow 5(5):442–453. https://doi.org/10.14778/2140436.2140441

  • Gasteiger J, Bojchevski A, Günnemann S (2019a) Predict then propagate: Graph neural networks meet personalized pagerank. In: International conference on learning representations (ICLR). arXiv:1810.05997

  • Gasteiger J, Weißenberger S, Günnemann S (2019b) Diffusion improves graph learning. In: Advances in neural information processing systems (NeurIPS). arXiv:1911.05485

  • Harper FM, Konstan JA (2015) The MovieLens datasets. ACM Trans Interact Intell Syst 5(4):1–19. https://doi.org/10.1145/2827872


  • Hayashi K, Aksoy SG, Park CH, et al (2020) Hypergraph random walks, Laplacians, and clustering. In: Proceedings of the 29th ACM international conference on information & knowledge management (CIKM), pp 495–504. https://doi.org/10.1145/3340531.3412034

  • Horn RA, Johnson CR (2012) Matrix analysis. Cambridge University Press. https://doi.org/10.1017/CBO9780511810817

  • Hou G, Chen X, Wang S, et al (2021) Massively parallel algorithms for personalized pagerank. Proc VLDB Endow 14(9):1668–1680. https://doi.org/10.14778/3461535.3461554

  • Jung J, Jin W, Sael L, et al (2016) Personalized ranking in signed networks using signed random walk with restart. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 973–978. https://doi.org/10.1109/icdm.2016.0122

  • Jung J, Park N, Lee S, et al (2017) BePI: fast and memory-efficient method for billion-scale random walk with restart. In: Proceedings of the 2017 ACM international conference on management of data (SIGMOD), pp 789–804. https://doi.org/10.1145/3035918.3035950

  • Jung J, Jin W, Kang U (2019) Random walk-based ranking in signed social networks: model and algorithms. Knowl Inf Syst 62(2):571–610. https://doi.org/10.1007/s10115-019-01364-z


  • Kang U, Faloutsos C (2011) Beyond ’caveman communities’: Hubs and spokes for graph compression and mining. In: 2011 IEEE 11th international conference on data mining (ICDM), pp 300–309, https://doi.org/10.1109/ICDM.2011.26

  • Langville AN, Meyer CD (2006) Google’s PageRank and beyond: the science of search engine rankings. Princeton University Press, Princeton. https://doi.org/10.1515/9781400830329

  • Lee G, Choe M, Shin K (2021) How do hyperedges overlap in real-world hypergraphs?—patterns, measures, and generators. In: Proceedings of the web conference 2021 (WWW), pp 3396–3407. https://doi.org/10.1145/3442381.3450010

  • Lee G, Choe M, Shin K (2022) HashNWalk: Hash and random walk based anomaly detection in hyperedge streams. In: Proceedings of the thirty-first international joint conference on artificial intelligence (IJCAI), pp 2129–2137. https://doi.org/10.24963/ijcai.2022/296

  • Lee G, Yoo J, Shin K (2023) Mining of real-world hypergraphs: Patterns, tools, and generators. In: Proceedings of the 29th ACM SIGKDD international conference on knowledge discovery & data mining (KDD). ACM, pp 5811–5812. https://doi.org/10.1145/3580305.3599567

  • Lee J, Jung J (2023) Time-aware random walk diffusion to improve dynamic graph learning. In: Proceedings of the AAAI conference on artificial intelligence (AAAI). https://doi.org/10.1609/aaai.v37i7.26021

  • Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: densification and shrinking diameters. ACM Trans Knowl Discov Data 1(1):2. https://doi.org/10.1145/1217299.1217301


  • Li J, He J, Zhu Y (2018) E-tail product return prediction via hypergraph-based local graph cut. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining (KDD), pp 519–527. https://doi.org/10.1145/3219819.3219829

  • Lin D, Wong RCW, Xie M, et al (2020) Index-free approach with theoretical guarantee for efficient random walk with restart query. In: IEEE 36th international conference on data engineering (ICDE), pp 913–924. https://doi.org/10.1109/icde48307.2020.00084

  • McAuley J, Leskovec J (2013) Discovering social circles in ego networks. arXiv:1210.8182

  • Nassar H, Kloster K, Gleich DF (2015) Strong localization in personalized PageRank vectors. In: Algorithms and models for the web graph (WAW), pp 190–202. https://doi.org/10.1007/978-3-319-26784-5_15

  • Ni J, Li J, McAuley J (2019) Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pp 188–197. https://doi.org/10.18653/v1/d19-1018

  • Page L, Brin S, Motwani R, et al (1999) The pagerank citation ranking: Bringing order to the web. Tech. rep., Stanford InfoLab

  • Rajaraman A, Ullman JD (2011) Mining of massive datasets. Cambridge University Press. https://doi.org/10.1017/CBO9781139924801

  • Ranshous S, Chaudhary M, Samatova NF (2017) Efficient outlier detection in hyperedge streams using MinHash and locality-sensitive hashing. In: Complex networks & their applications VI, pp 105–116. https://doi.org/10.1007/978-3-319-72150-7_9

  • Shin K, Jung J, Lee S, et al (2015) Bear: Block elimination approach for random walk with restart on large graphs. In: Proceedings of the 2015 ACM SIGMOD international conference on management of data (SIGMOD), pp 1571–1585. https://doi.org/10.1145/2723372.2723716

  • Sinha A, Shen Z, Song Y, et al (2015) An overview of microsoft academic service (MAS) and applications. In: Proceedings of the 24th international conference on world wide web (WWW), pp 519–527. https://doi.org/10.1145/2740908.2742839

  • Sun J, Qu H, Chakrabarti D, et al (2005) Neighborhood formation and anomaly detection in bipartite graphs. In: Proceedings of the fifth IEEE international conference on data mining (ICDM), pp 418–425. https://doi.org/10.1109/ICDM.2005.103

  • Sun L, Ji S, Ye J (2008) Hypergraph spectral learning for multi-label classification. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pp 668–676. https://doi.org/10.1145/1401890.1401971

  • Tong H, Faloutsos C, Gallagher B, et al (2007a) Fast best-effort pattern matching in large attributed graphs. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), pp 737–746. https://doi.org/10.1145/1281192.1281271

  • Tong H, Faloutsos C, Pan JY (2007) Random walk with restart: fast solutions and applications. Knowl Inf Syst 14(3):327–346. https://doi.org/10.1007/s10115-007-0094-2


  • Trefethen LN, Bau D (2022) Numerical linear algebra, vol 181. SIAM. https://doi.org/10.1137/1.9780898719574

  • Wang R, Wang S, Zhou X (2019) Parallelizing approximate single-source personalized PageRank queries on shared memory. VLDB J 28(6):923–940. https://doi.org/10.1007/s00778-019-00576-7


  • Wang S, Yang R, Xiao X, et al (2017) Fora: simple and effective approximate single-source personalized pagerank. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 505–514. https://doi.org/10.1145/3097983.3098072

  • Wang S, Yang R, Wang R et al (2019) Efficient algorithms for approximate single-source personalized PageRank queries. ACM Trans Database Syst 44(4):1–37. https://doi.org/10.1145/3360902

  • Wei Z, He X, Xiao X, et al (2018) Topppr: Top-k personalized pagerank queries with precision guarantees on large graphs. In: Proceedings of the 2018 international conference on management of data (SIGMOD), pp 441–456. https://doi.org/10.1145/3183713.3196920

  • Wu H, Gan J, Wei Z, et al (2021) Unifying the global and local approaches: An efficient power iteration with forward push. In: Proceedings of the 2021 international conference on management of data (SIGMOD), pp 1996–2008. https://doi.org/10.1145/3448016.3457298

  • Yin H, Benson AR, Leskovec J, et al (2017) Local higher-order graph clustering. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 555–564. https://doi.org/10.1145/3097983.3098069

  • Zhang Y, Zhao Z, Feng Z (2018a) A unified approach to scalable spectral sparsification of directed graphs. arXiv:1812.04165

  • Zhang Z, Lin H, Gao Y (2018b) Dynamic hypergraph structure learning. In: Proceedings of the twenty-seventh international joint conference on artificial intelligence (IJCAI), pp 3162–3169. https://doi.org/10.24963/ijcai.2018/439

  • Zhou D, Huang J, Schölkopf B (2006) Learning with hypergraphs: clustering, classification, and embedding. In: Proceedings of the 19th international conference on neural information processing systems (NIPS), pp 1601–1608. https://doi.org/10.7551/mitpress/7503.003.0205

  • Zhu S, Zou L, Fang B (2013) Content based image retrieval via a transductive model. J Intell Inf Syst 42(1):95–109. https://doi.org/10.1007/s10844-013-0257-4


  • Zien J, Schlag M, Chan P (1999) Multi-level spectral hypergraph partitioning with arbitrary vertex sizes. IEEE Trans Comput Aided Des Integr Circuits Syst 18(9):1389–1399. https://doi.org/10.1109/iccad.1996.569592



Funding

This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (No. NRF-2020R1C1C1008296 and No. NRF-2021R1C1C1008526) and by Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), and No. 2021-0-02068, Artificial Intelligence Innovation Hub).

Author information

Corresponding authors

Correspondence to Kijung Shin or Jinhong Jung.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Responsible editor: Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Experimental datasets

We provide a brief description of the datasets used in this paper.

  • EEN and EEU.\(^{6}\) These hypergraphs are built from emails at Enron (EEN) and at a European research institution (EEU), where users are nodes and the sender together with all receivers of each email forms a hyperedge.

  • HB and SB.\(^{6}\) These represent co-sponsorships of bills in the House of Representatives (HB) and the Senate (SB) where US Congresspersons are nodes, and groups of sponsors and co-sponsors of bills are hyperedges.

  • WAL.\(^{6}\) This is a hypergraph where products are nodes and hyperedges are sets of co-purchased products at Walmart.

  • TRI.\(^{6}\) This is a hypergraph where nodes are accommodations (mostly hotels) and hyperedges are sets of accommodations on which a user performed a “click-out” during the same browsing session at Trivago.

  • COD, COG, and COH.\(^{6}\) These are co-authorship hypergraphs where authors are nodes and each hyperedge represents the authors of a publication recorded on DBLP (COD), Geology (COG), and History (COH).

  • THS, THM, and THU.\(^{6}\) These are hypergraphs where users are nodes and each hyperedge represents the group of users associated with a thread at StackOverflow (THS), MathStackOverflow (THM), and AskUbuntu (THU).

  • AM.\(^{7}\) This is a hypergraph of Amazon (AM) product reviews (specifically, those categorized as Movies & TV) where products are nodes and the group of products reviewed by the same user is a hyperedge. Each user has at least 5 reviews.

  • YP.\(^{8}\) This is a hypergraph of user ratings on locations (e.g., hotels and restaurants) at Yelp (YP) where locations are nodes and the group of locations rated by a user is a hyperedge. Ratings higher than 3 are considered.

  • TW.\(^{9}\) This is a hypergraph of social relationships on Twitter (TW) where users are nodes and each hyperedge represents a group of users that compose a ‘circle’ (or ‘list’) together on Twitter.

  • ML1, ML10, and ML20.\(^{10}\) These hypergraphs are built from MovieLens rating data of three sizes, with 1 M (ML1), 10 M (ML10), and 20 M (ML20) movie ratings, where movies are nodes and the group of movies rated by a user is a hyperedge. Ratings higher than 3 are considered.

Appendix B: Application to node retrieval

In this section, we introduce an application of RWR scores on hypergraphs for the task of node retrieval. Additionally, we evaluate the empirical effectiveness of this approach.

Fig. 8

Similar-node-retrieval performance on hypergraphs in terms of AUROC and MAP. The hypergraph RWR using EDNW provides the best accuracy among all the baselines in the SB and HB datasets

Similar node retrieval Given a hypergraph and a query node s, this task is to search for nodes structurally similar to the query node. Specifically, we measure node-to-node proximities with respect to s and use them as ranking scores to sort all nodes except s in descending order of the scores. If nodes with the same class as the query node are ranked high, structurally similar nodes are successfully retrieved, which can be evaluated by ranking metrics such as AUROC and MAP. For this task, we compute the hypergraph RWR scores \(\varvec{r}\) w.r.t. the query node s through ARCHER and use the scores for ranking.
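To make the evaluation protocol concrete, here is a minimal sketch that ranks nodes by a given RWR score vector and scores the ranking with AUROC and MAP (for a single query, MAP reduces to average precision). The score vector, label array, query index, and helper name are placeholders, not part of ARCHER.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_retrieval(r, labels, s):
    """Rank all nodes except the query s by their RWR scores r, and evaluate
    the ranking against binary relevance (1 = same class as the query node)."""
    mask = np.ones(len(r), dtype=bool)
    mask[s] = False                                   # exclude the query node itself
    y_true = (labels[mask] == labels[s]).astype(int)  # relevant = same class as s
    return (roc_auc_score(y_true, r[mask]),
            average_precision_score(y_true, r[mask]))

# Toy example: 5 nodes, query node 0; in practice r would come from ARCHER.
r = np.array([0.90, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1, 1, 0, 1, 0])
auroc, ap = evaluate_retrieval(r, labels, s=0)
```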

Settings We conduct this experiment on the SB (senate-bills) and HB (house-bills) datasets, which contain binary node labels. The RWR models used in Sect. 5.6 are also used for this task. To introduce edge-dependent node weights (EDNW), we set \(\gamma _e(v) = \bar{d}(v)^{-\beta }\), where \(\bar{d}(v)\) denotes the unweighted degree of node v, with \(\beta =1.0\) and \(\omega (e) = 1\) (refer to Appendix C for the selection of \(\beta =1.0\)). For EINW, we set \(\gamma _e(v) = 1\) and \(\omega (e) = 1\). We vary the restart probability c from 0.1 to 0.9 in increments of 0.1.
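As a concrete illustration of these settings, the sketch below builds the edge-dependent node weights \(\gamma _e(v) = \bar{d}(v)^{-\beta }\) from a hypergraph given as a list of hyperedges. The nodes-by-hyperedges matrix layout and the function name are our own simplifications, not ARCHER's internal representation; setting \(\beta = 0\) recovers the EINW case (\(\gamma _e(v) = 1\)).

```python
import numpy as np
import scipy.sparse as sp

def ednw_weights(hyperedges, n, beta=1.0):
    """Return a sparse |V| x |E| matrix whose (v, e) entry is
    gamma_e(v) = dbar(v) ** (-beta), where dbar(v) is the unweighted degree
    of node v (the number of hyperedges containing v); omega(e) is fixed to 1."""
    dbar = np.zeros(n)
    for e in hyperedges:
        for v in e:
            dbar[v] += 1.0
    rows, cols, vals = [], [], []
    for j, e in enumerate(hyperedges):
        for v in e:
            rows.append(v)
            cols.append(j)
            vals.append(dbar[v] ** (-beta))
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, len(hyperedges)))

# Example: 4 nodes and 3 hyperedges; beta = 1.0 as in the node-retrieval setting.
Gamma = ednw_weights([{0, 1, 2}, {1, 2}, {2, 3}], n=4, beta=1.0)
```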

Results Figure 8 shows the experimental results on the node retrieval task in terms of AUROC and MAP. As shown in the figure, the RWR using EDNW achieves the best performance among all tested methods, especially for high values of the restart probability c. Note that the RWR using EDNW outperforms both the RWR using EINW and the naive RWR, indicating that edge-dependent node weights are also useful for this task.

Appendix C: Experiments on edge-dependent node weights for applications

In this section, we provide the experimental results regarding the effectiveness of the edge-dependent node weights for applications.

Anomaly detection For anomaly detection in Sect. 5.6, we set edge-dependent node weights \(\gamma _e(v) = \bar{d}(v)^{-\beta }\) for hypergraph RWR. For this experiment, we assess the performance of RWR while varying two parameters: \(\beta\) and the restart probability c. Specifically, we explore values of \(\beta\) in \(\{0.5, 1.0, 2.0\}\), and we vary c between 0.1 and 0.9 in increments of 0.1. Figure 9 shows the results; \(\beta = 0.5\) generally yields the best performance across the tested datasets.

Fig. 9

Anomaly detection performance on hypergraphs in terms of AUROC and MAP with different edge-dependent node weights, i.e., \(\gamma _e(v) = \bar{d}(v)^{-\beta }\). When \(\beta =0.5\), it provides the best accuracy

Similar node retrieval For node retrieval in Appendix B, we also set \(\gamma _e(v) = \bar{d}(v)^{-\beta }\) for hypergraph RWR. We test the node-retrieval performance of RWR by varying the values of \(\beta\) and c. Specifically, the list of values tested for \(\beta\) is \(\{0.5, 1.0, 2.0\}\), while the range for c spans from 0.1 to 0.9. Figure 10 shows the results, and \(\beta = 1.0\) leads to the best performance in most cases.

Fig. 10

Similar-node-retrieval performance in terms of AUROC and MAP with different edge-dependent node weights, i.e., \(\gamma _e(v) = \bar{d}(v)^{-\beta }\). When \(\beta = 1.0\), it provides better performance in most of the cases

Appendix D: Counting the number of non-zero entries

In this section, we discuss how to compute \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) quickly and space-efficiently. These quantities are used in Eq. (11) by ARCHER to select between the clique- and star-expansion-based methods.

Calculation of \({\texttt {nnz}({\textbf {H}}_{\mathcal {C}})}\) While it is possible to naively count the number of non-zeros in \({\textbf {H}}_{\mathcal {C}}={\textbf {I}}_{n} - (1-c)\tilde{\textbf{P}}^{\top }\), materializing \({\textbf {H}}_{\mathcal {C}}\) typically requires more space than the input data due to its relatively high density. Hence, we suggest a more efficient way based on the following property regarding \({\textbf {H}}_{\mathcal {C}}\):

$$\begin{aligned} \texttt {nnz}({\textbf {H}}_{\mathcal {C}}) = \texttt {nnz}(\tilde{\textbf{P}}) \end{aligned}$$
(D1)

where \(\tilde{\textbf{P}} = \tilde{\textbf{W}}\tilde{\textbf{R}}\). The equality holds because every diagonal entry of \(\tilde{\textbf{P}}\) is non-zero: \(\tilde{\textbf{P}}\) includes, for each node v, the probability of moving from v to one of its hyperedges and back to v. Note that the sparsity pattern of \(\tilde{\textbf{P}}\) is the same as that of the adjacency matrix of the clique-expanded graph \(G_{\mathcal {C}}\) of the hypergraph \(G_{H}\) (with an additional self-loop on every node). Thus, we can calculate \(\texttt {nnz}(\tilde{\textbf{P}})\) without materializing \(\tilde{\textbf{P}}\), by directly counting the edges that are clique-expanded from each hyperedge.

Algorithm 4 summarizes the procedure for computing \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\). For each node v (line 2), we find every node u that appears together with v in at least one hyperedge (line 5). Finding such a node u is equivalent to finding an edge (v, u), and thus we increment the count accordingly (line 7). Note that we maintain a set C of such nodes to prevent duplicated counting (lines 3 and 8). Regardless of the input, Algorithm 4 requires \(O(|C|)=O(n)\) extra space to maintain the set C. The time complexity is \(O(\sum _{v \in \mathcal {V}}\sum _{e \in E(v)}|e|) = O(\sum _{e \in \mathcal {E}}|e|^{2})\) because \(|e|\) operations are required for each node in e.

Algorithm 4

Counting the number of non-zeros of \({\textbf {H}}_{\mathcal {C}}\)
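A minimal Python rendering of Algorithm 4 follows; the dictionary `E_of` (mapping each node to the hyperedges, given as node-index sets, that contain it) and the function name are illustrative, not the paper's implementation.

```python
def count_nnz_clique(E_of):
    """Count nnz(H_C) = nnz(P~) without materializing P~.

    E_of[v] lists the hyperedges (sets of node indices) containing node v.
    For each node v, every node u that co-appears with v in some hyperedge
    (including v itself, which gives the diagonal entry) contributes one
    non-zero; the set C prevents counting the same pair (v, u) twice.
    """
    count = 0
    for v, hyperedges in E_of.items():
        C = set()
        for e in hyperedges:
            for u in e:
                if u not in C:
                    count += 1        # one non-zero entry at position (v, u)
                    C.add(u)
    return count

# Example: 4 nodes with hyperedges {0, 1, 2} and {2, 3}.
e1, e2 = {0, 1, 2}, {2, 3}
nnz_HC = count_nnz_clique({0: [e1], 1: [e1], 2: [e1, e2], 3: [e2]})   # 12
```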

Calculation of \({\texttt {nnz}({\textbf {H}}_{\star })}\) Similarly, \(\texttt {nnz}({\textbf {H}}_{\star })\) can also be efficiently calculated based on the following equalities:

$$\begin{aligned} \texttt {nnz}({\textbf {H}}_{\star })&= n + m + \texttt {nnz}(\tilde{\textbf{S}}) \nonumber \\&= n + m + \texttt {nnz}(\tilde{\textbf{W}}) + \texttt {nnz}(\tilde{\textbf{R}}) \nonumber \\&= n + m + \texttt {nnz}({\textbf {W}}) + \texttt {nnz}({\textbf {R}}), \end{aligned}$$
(D2)

where \({\textbf {H}}_{\star }={\textbf {I}}_{N}-(1-c)\tilde{\textbf{S}}^{\top }\). Note that \({\textbf {I}}_{N}\) is the identity matrix of size \(N = n + m\), occupying \(n + m\) non-zeros in \({\textbf {H}}_{\star }\). The matrix \(\tilde{\textbf{S}}\) consists of \(\tilde{\textbf{W}}\) and \(\tilde{\textbf{R}}\), as shown in Eq. (5), and their sparsity patterns are the same as those of \({\textbf {W}}\) and \({\textbf {R}}\). The time complexity of this approach is dominated by counting the numbers of non-zero entries in \({\textbf {W}}\) and \({\textbf {R}}\). If \({\textbf {W}}\) and \({\textbf {R}}\) are stored in a sparse matrix format, their numbers of non-zero entries can be computed in \(O(\texttt {nnz}({\textbf {W}})+\texttt {nnz}({\textbf {R}}))=O(\sum _{v \in \mathcal {V}}\bar{d}(v))=O(\sum _{e \in \mathcal {E}}|e|)\) time, and even in O(1) time in some formats (e.g., compressed sparse row). Apart from the inputs (i.e., \({\textbf {W}}\) and \({\textbf {R}}\)), this approach requires only a constant amount of additional space.
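Assuming \({\textbf {W}}\) and \({\textbf {R}}\) are held as SciPy CSR matrices, the count reduces to a few stored-size lookups. The matrices below encode the same two hyperedges as in the previous sketch, and all names are illustrative.

```python
import scipy.sparse as sp

def count_nnz_star(W, R):
    """nnz(H_star) = n + m + nnz(W) + nnz(R) for a |V| x |E| matrix W and an
    |E| x |V| matrix R; for CSR/CSC matrices, .nnz is read off in O(1)."""
    n, m = W.shape
    return n + m + W.nnz + R.nnz

# Same toy hypergraph as above: hyperedges {0, 1, 2} and {2, 3}, unit weights.
W = sp.csr_matrix([[1, 0], [1, 0], [1, 1], [0, 1]])   # node-to-hyperedge weights
R = W.T.tocsr()                                        # hyperedge-to-node weights
nnz_Hstar = count_nnz_star(W, R)                       # 4 + 2 + 5 + 5 = 16
```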

Appendix E: Correlation between data statistics and costs of BePI

In this section, we empirically investigate the correlations between basic data statistics and the costs of BePI, which ARCHER employs for RWR computation. As the data statistics, we use \(\texttt {nnz}({\textbf {H}})\) (i.e., \(\texttt {nnz}({\textbf {H}}_{\mathcal {C}})\) and \(\texttt {nnz}({\textbf {H}}_{\star })\) in clique- and star-expansion-based computations, respectively), density, overlapness, and average hyperedge size. As the costs of the clique- and star-expansion-based computation of BePI, we consider preprocessing time, space cost, and query time. The results obtained across all the datasets (refer to Appendix A) for both clique- and star-expansion-based computations are presented in Fig. 11. As shown in Fig. 11a, there exists a strong positive correlation between \(\texttt {nnz}({\textbf {H}})\) and the costs for the calculation of RWR. For other statistics (see Figs. 11b, 11c, and 11d), there is no noticeable correlation between the statistics and the costs.

Fig. 11

Correlations between basic data statistics and the costs of RWR computation on hypergraphs in terms of preprocessing time, space cost, and query time. BePI is used for RWR computation. We report the Pearson correlation coefficient for each scatter plot. Note that there is a strong positive correlation between \(\texttt {nnz}({\textbf {H}})\) and the costs, whereas the other statistics do not exhibit such a correlation

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chun, J., Lee, G., Shin, K. et al. Random walk with restart on hypergraphs: fast computation and an application to anomaly detection. Data Min Knowl Disc 38, 1222–1257 (2024). https://doi.org/10.1007/s10618-023-00995-9
