Research Article, Open Access

Parallel Simulation of Quantum Networks with Distributed Quantum State Management

Published: 08 April 2024


Abstract

Quantum network simulators offer the opportunity to cost-efficiently investigate potential avenues for building networks that scale with the number of users, communication distance, and application demands by simulating alternative hardware designs and control protocols. Several quantum network simulators have been recently developed with these goals in mind. As the size of the simulated networks increases, however, sequential execution becomes time-consuming. Parallel execution presents a suitable method for scalable simulations of large-scale quantum networks, but the unique attributes of quantum information create unexpected challenges. In this work, we identify requirements for parallel simulation of quantum networks and develop the first parallel discrete-event quantum network simulator by modifying the existing serial simulator SeQUeNCe. Our contributions include the design and development of a quantum state manager (QSM) that maintains shared quantum information distributed across multiple processes. We also optimize our parallel code by minimizing the overhead of the QSM and decreasing the amount of synchronization needed among processes. Using these techniques, we observe a speedup of 2 to 25 times when simulating a 1,024-node linear network topology using 2 to 128 processes. We also observe an efficiency greater than 0.5 for up to 32 processes in a linear network topology of the same size and with the same workload. We repeat this evaluation with a randomized workload on a caveman network. We also introduce several methods for partitioning networks by mapping them to different parallel simulation processes. We have released the parallel SeQUeNCe simulator as an open source tool alongside the existing sequential version.


1 INTRODUCTION

Since the early 2010s, the development of quantum networking testbeds has generated considerable interest in the scientific community [9, 12, 14, 20, 29, 45]. However, the scale of these demonstrators is limited to a handful of nodes and communication distances that typically do not exceed the metropolitan scale. To advance the state of the art, simulations of quantum networks can help researchers develop and test new control protocols, plan and validate experiments, and compare alternative network architecture designs. They provide a cost-effective way to complement experimental research that develops real-world testbeds and prototypes.

A quantum network simulator must offer realism (accurate timing and representation of quantum states), flexibility (support of new types of hardware and protocols), and scalability (large number of nodes and events) for experimenters to evaluate the design of network devices, protocols, and applications. Existing quantum network simulators such as Simulator of QUantum Network Communication (SeQUeNCe) [54], Network Simulator for Quantum Information using Discrete events (NetSquid) [8], and Quantum Internet Simulation Package (QuISP) [31] achieve realism and flexibility by modeling quantum networks using sequential discrete-event simulation (DES). However, sequential execution limits the scalability of a single simulation since the simulation speed is bounded by the performance of a single core. Simulating large-scale quantum networks through sequential execution takes a long time because quantum networks are more complex than their classical counterparts. The complexity of quantum network simulations comes from the need to model the behavior of several new types of hardware and consider the processing and maintenance of quantum states, thus typically generating a massive number of events. For example, the SeQUeNCe quantum network simulator tracks millions of events per simulated second, and one example of a simple metropolitan network simulation generated approximately 2 billion events [54]. Nevertheless, the high performance of a quantum network simulator is necessary to allow researchers to design quantum networks and uncover potential risks before deployment in hardware testbeds.

Short-range networks or simulations involving a limited timescale require performance evaluation over multiple trial runs, which may be trivially parallelized. However, hardware-specific network evaluation or simulations involving a long timescale may require fewer trials for evaluation, and the execution time for a single trial remains the performance bottleneck. Parallel discrete-event simulation (PDES) [17] improves the scalability of DES and significantly reduces the execution time by utilizing multiple processors to simultaneously execute events that ultimately advance the simulation of experiments faster than sequential simulation does. A quantum network (that is being simulated) can be partitioned into multiple subnetworks, each handled by a separate processor. The PDES synchronizes the states of the subnetworks across different processes and advances the global simulation timeline. This technique has been extensively applied in PDES for classical networks [36, 37]. The key to successfully parallelizing simulations of quantum networks is the efficient synchronization of events and states among processes.

In this work, we parallelize SeQUeNCe using multiprocessing with an effective synchronization algorithm by considering specific quantum network characteristics. First, the Message Passing Interface (MPI) [48] is used to synchronize cross-process quantum channels and classical channels. Although multithreading can help speed up the execution on a single node, too many threads can lead to diminishing returns (e.g., when the number of threads exceeds the number of cores). MPI-based multiprocessing supports the execution of the simulation on a cluster of MPI nodes in both shared- and distributed-memory systems for scalability enhancement. Second, we design a hierarchical quantum state manager (QSM) to store and synchronize the quantum state of simulated qubits. The QSM consists of one local QSM per process and a global QSM for handling cross-process entangled qubits. Processes use TCP-based socket communication to access the quantum states on the global QSM. Third, we develop a conservative synchronization algorithm to orchestrate MPI processes and QSM for parallel simulation. The algorithm enables MPI processes to exchange cross-process events at a dynamically computed synchronization point.

The design of our parallel quantum network simulator carefully considers requirements for scalability, accuracy, and usability. The scalability of the parallel simulator is determined by the performance of state synchronization among processes. We improve the simulation scalability in two ways. First, we save communication time between the local QSMs and the global QSM by grouping multiple requests into one packet. The global QSM offloads a portion of the computational work to the local QSMs. The local QSMs also batch outgoing global QSM requests to reduce global QSM processing overhead. Second, we reduce the overhead caused by an unbalanced workload by reducing the number of synchronization events. To address this issue, we exploit lookahead to obtain good performance. We define the simulation lookahead as the duration of future time within which a logical process will not send events that affect the states of other logical processes. In this work, we maximize lookahead values (within the constraints of conservative synchronization) based on the quantum network characteristics.

The accuracy of a parallel simulator requires that the parallel execution of events does not affect the simulation results; in other words, the parallel execution results should be identical to those of the sequential execution. We design the synchronization algorithm by considering the communication between the local QSMs and the global QSM. The optimized lookahead value guarantees that the parallel event execution does not affect the causality of events. Additionally, we provide a random number generation framework at the node level. Without this addition, events in PDES may trigger different actions because of different random numbers. This framework ensures that the sequence of random numbers used in PDES is identical to that in the sequential simulation. As a result, the parallelized simulator can correctly reproduce sequential results.

To improve usability, the kernel of the parallel simulator enables systematic model parallelization and hides functions such as QSM and random number generation from the simulation design. Notably, users can develop and use the same simulation setup for both parallel and sequential simulations.

We perform an extensive evaluation of the parallelized SeQUeNCe with three network topologies: linear topology, caveman graph [49], and Internet autonomous system [15]. We simulate quantum networks consisting of 1,024 quantum routers with 1 to 128 processes using up to four nodes on the Bebop supercomputer [27], a high-performance computing cluster located at Argonne National Laboratory. We use speedup and execution time as metrics to evaluate the scalability of our simulator, where speedup describes the ratio of sequential execution time to execution time using parallel processes. The results show that PDES can speed up the simulation of networks with linear topology up to 25 times. We also analyze the sources of parallelization overhead. The overhead from the unbalanced workloads increases with the growing number of processes. Moreover, we study the performance improvement by reducing the QSM communication time (i.e., up to 11.5 times speedup over communication without optimization) and by maximizing the lookahead value (i.e., more than 50% synchronization time reduction).
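The two metrics reduce to simple ratios. As a quick reference (the timings below are illustrative numbers, not measurements reported in this article):

```python
# Speedup and parallel efficiency as used in the evaluation.
def speedup(t_sequential, t_parallel):
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, n_processes):
    return speedup(t_sequential, t_parallel) / n_processes

# e.g., a run that takes 100 s sequentially and 4 s on 32 processes:
s = speedup(100.0, 4.0)            # 25.0
e = efficiency(100.0, 4.0, 32)     # 0.78125
```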

We summarize the contributions of this work as follows.

(1)

We analyze the requirements for parallel simulation, and we design a PDES-based simulation framework of quantum networks based on SeQUeNCe that allows users to keep using the models and setup they already defined for the sequential version.

(2)

We develop a quantum state manager that stores and synchronizes the state of simulated qubits across multiple processes.

(3)

We conduct an extensive performance evaluation of our parallel quantum network simulation with three network topologies on a high-performance computing (HPC) cluster, showing that our approach provides up to \(25\times\) speedup.

The remainder of the article is organized as follows. In Section 2, we provide background on quantum information and quantum networks as well as the SeQUeNCe simulator and parallel simulation of classical networks. In Section 3, we analyze the requirements for parallel simulation of quantum networks and present the design architecture of our simulator. In Section 4, we describe the design of QSM that maintains quantum states in parallel simulations. In Section 5, we discuss lookahead optimization for reducing the simulation overhead. In Section 6, we evaluate the performance of the parallel simulator. In Section 7, we describe related work. In Section 8, we summarize our work and briefly outline future work.


2 BACKGROUND

In this section, we introduce a few elementary concepts in quantum information science. We also introduce the SeQUeNCe quantum network simulator and discuss the role of parallelization in classical network simulations.

2.1 Quantum Information and Quantum Network Protocols

Superposition allows a quantum system to be simultaneously in two different states. Such a state is described by complex amplitudes that determine the probability distribution of the possible measurement outcomes on the quantum system. For example, the measurement of a particle in the superposition state \(| \psi \rangle := \alpha | 0 \rangle + \beta | 1 \rangle\) yields the measurement outcome \(|0 \rangle\) with probability \(|\alpha |^2\) and \(|1 \rangle\) with probability \(|\beta |^2\), where \(\alpha\) and \(\beta\) are two complex numbers satisfying \(|\alpha |^2 + |\beta |^2 = 1\). Upon measurement, the quantum state \(| \psi \rangle\) collapses to either \(|0\rangle\) or \(|1\rangle\). Optical quantum networks use photons to carry quantum states.
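The Born-rule sampling described above can be illustrated directly. The following sketch (plain NumPy, not SeQUeNCe code) draws measurement outcomes for the equal superposition \(\alpha = \beta = 1/\sqrt{2}\):

```python
import numpy as np

# Illustrative sketch (not SeQUeNCe code): sampling computational-basis
# measurements of |psi> = alpha|0> + beta|1> according to the Born rule.
def measure(alpha, beta, rng):
    """Return outcome 0 with probability |alpha|^2, else outcome 1."""
    return 0 if rng.random() < abs(alpha) ** 2 else 1

alpha = beta = 1 / np.sqrt(2)      # equal superposition
rng = np.random.default_rng(7)
samples = [measure(alpha, beta, rng) for _ in range(100_000)]
freq0 = samples.count(0) / len(samples)
# freq0 approaches |alpha|^2 = 0.5 as the sample count grows
```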

Multiparticle quantum systems can exhibit entanglement, where the quantum state of each particle in the group cannot be described independently of the other particles. The simplest example is the EPR pair named after Einstein, Podolsky, and Rosen [13]: \(| \phi ^+ \rangle := \frac{1}{\sqrt {2}} (| 00 \rangle + | 11 \rangle)\). Here the measurement outcomes of the two particles are perfectly correlated: with 50% probability both are \(|0\rangle\), and with 50% probability both are \(|1\rangle\). The measurement of the first particle immediately collapses the state of the other particle to the same value, even if the particles are spatially separated. We note that with the increasing number of particles, the space required to represent quantum states grows exponentially, presenting a significant challenge for computer simulations.
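The exponential growth is easy to quantify: a dense state vector for \(n\) qubits stores \(2^n\) complex amplitudes. A back-of-the-envelope helper, assuming 16-byte complex128 amplitudes:

```python
# Dense n-qubit state vector: 2**n complex amplitudes at 16 bytes each
# (complex128). Illustrative helper, not part of the simulator.
def statevector_bytes(n_qubits):
    return (2 ** n_qubits) * 16

small = statevector_bytes(10)   # 16,384 bytes: trivial to store
large = statevector_bytes(50)   # 2**54 bytes, roughly 18 petabytes
```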

Fig. 1.

Fig. 1. (a) Quantum circuit for quantum teleportation. (b) Two short-distance entangled pairs used to create a long-distance entanglement with the swapping protocol.

One can use entanglement to teleport an arbitrary quantum state over a long distance without worrying about state loss due to channel attenuation [4]. The teleportation protocol is depicted in Figure 1(a). Solid lines represent qubits (two-state quantum systems), double lines classical bits, circles quantum states, and boxes quantum operations (quantum gates). Initially, the sender and the receiver share an entangled qubit pair (green circles) that is distributed between them using the quantum network as described below. The arbitrary quantum state \(|\psi \rangle\) we want to transmit is prepared in another qubit (blue circle). The sender performs the indicated elementary operations on its two qubits (CNOT and Hadamard gates [35]) and measures them. One of four possible measurement outcomes can be realized: 00, 01, 10, or 11. The measurement outcome is then sent to the receiver by using a classical channel, and the receiver uses it to perform controlled quantum operations on its qubit in a process called Pauli frame correction [26]. After the correction, the receiver’s qubit (red circle) will be in state \(|\psi \rangle\). Teleportation is one example of a quantum network protocol, and it allows quantum state transmission by distributing only EPR pairs and classical information.
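The protocol can be checked numerically. The following sketch (plain NumPy; the qubit ordering and function names are our own, not the SeQUeNCe API) simulates the circuit of Figure 1(a) on a 3-qubit state vector:

```python
import numpy as np

# Sketch of the Figure 1(a) teleportation circuit on a 3-qubit state vector.
# Qubit q0 holds |psi>; q1 and q2 hold the shared EPR pair.
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
# CNOT with control q0 and target q1, tensored with identity on q2.
CNOT01 = np.kron(np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                           [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex), I)

def teleport(alpha, beta, rng):
    epr = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
    state = np.kron(np.array([alpha, beta], dtype=complex), epr)
    state = np.kron(np.kron(H, I), I) @ (CNOT01 @ state)  # CNOT, then H on q0
    # Sample the sender's joint measurement outcome (m0, m1).
    probs = np.abs(state) ** 2
    outcomes = [(m0, m1) for m0 in (0, 1) for m1 in (0, 1)]
    weights = [probs[4*m0 + 2*m1] + probs[4*m0 + 2*m1 + 1]
               for m0, m1 in outcomes]
    m0, m1 = outcomes[rng.choice(4, p=weights)]
    # Collapse to the receiver's qubit and normalize.
    q2 = state[[4*m0 + 2*m1, 4*m0 + 2*m1 + 1]]
    q2 = q2 / np.linalg.norm(q2)
    # Pauli frame correction based on the classically transmitted (m0, m1).
    if m1:
        q2 = X @ q2
    if m0:
        q2 = Z @ q2
    return q2

received = teleport(0.6, 0.8, np.random.default_rng(1))
# received equals [0.6, 0.8] (up to floating-point error) for every outcome
```

For any of the four measurement outcomes, the correction restores the receiver's qubit to \(|\psi\rangle\) exactly, which is what makes the protocol deterministic despite the random measurement.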

Reliable, long-distance entanglement distribution is an important service provided by quantum networks; and many other protocols depend on it, including teleportation. The success probability of entanglement distribution decreases exponentially with distance. Fortunately, the entanglement swapping protocol [56] can be used to “stitch together” multiple shorter-distance entanglements for a single longer-distance entanglement. The swapping protocol is illustrated in Figure 1(b). In the first step, entanglement is generated between the adjacent nodes A-B and B-C. Entangled photons may be stored in quantum memories to ensure that they are available precisely when needed [23]. To create entanglement between nodes A-C, node B performs a Bell state measurement (BSM) on its two qubits. The measurement outcome is sent from node B to either node A or node C to perform Pauli frame correction. After the correction, the qubits on nodes A and C become entangled in the EPR state.

In this work, we simulate entanglement distribution and entanglement swapping in networks with up to 1,024 nodes. Since entanglement link generation and swapping involve the measurement of qubits, the final state of the remaining two qubits has dimension at most four. The space required for state representation thus grows only linearly with the size of our network. Despite this state simplification, such large-scale simulation still involves unique challenges as discussed in Section 3 and remains a computationally challenging problem.

2.2 SeQUeNCe: A Customizable Quantum Network Simulator

SeQUeNCe is an open source quantum network simulator [54] that allows simulations with photon-level details. The SeQUeNCe project website [41] provides the complete source code, documentation, and tutorials that support ease of use and extensibility. The many use cases of the simulator include understanding the tradeoffs of alternative quantum network architectures, optimizing quantum hardware, and developing a robust control plane. SeQUeNCe uses a modularized design, as shown in Figure 2. This enables researchers to customize the simulated network and reuse existing models in a flexible way. The Application module represents quantum network applications and their service requests. The resource management and network management modules use classical control messages to manage the allocation of network resources. The entanglement management and hardware modules encompass protocols and hardware operations that act on qubits that are located in quantum memories or in photons at telecommunication wavelengths in optical network links. The SeQUeNCe simulator stores these states in the quantum state manager. For sequential simulation, a local QSM is sufficient to trace the quantum states. However, parallel simulation requires the simulator to synchronize multiple QSMs, a requirement that is addressed in this work. This work also significantly redesigns the simulation kernel to allow parallel simulations. Further details about the design and use cases of the sequential version of the SeQUeNCe simulator can be found in our earlier work [52–54].

Fig. 2.

Fig. 2. Modularized design of SeQUeNCe.

2.3 Parallel Simulation of Classical Networks

Researchers have built time-parallel and space-parallel network simulators to improve scalability in terms of both execution speed and network size. Time-parallel simulation methods [1, 24, 51] parallelize a fixed-size problem by partitioning the simulation time axis into intervals and assigning a processor to simulate the system over its assigned time interval. Parallel (and distributed) discrete-event methods use a space-parallel approach that divides a single network simulation topology into distinct pieces and executes them in parallel on multiple interconnected processors. The efforts to parallelize simulations of classical networks started more than 20 years ago with Parallel/Distributed ns [42] and the Georgia Tech Network Simulator [43]. Distributed simulation support was added to ns-3 [38] in 2011 [40].

Fig. 3.

Fig. 3. Causality constraint—an essential part of parallel and distributed discrete-event simulations to ensure that events are processed in time-stamped order. Panel (a) shows the sequential simulation of a classical network with a global virtual time, while (b) shows the same network topology simulated in parallel. The topology and events are partitioned and assigned to three processes, each with its own corresponding local virtual time that has to be synchronized with the global virtual time.

Ensuring that events are processed in time-stamped order is a fundamental part of distributed simulation—the so-called causality constraint. Figure 3 illustrates the virtual time for both sequential and parallel simulation of a classical network. For the sequential case (panel (a)), the simulation uses a global virtual clock, while the parallel case (panel (b)) has three processes, each with its own corresponding local virtual time that has to be synchronized with the global virtual time. Almost all parallel network simulators use a conservative synchronization algorithm that takes precautions to avoid violating the causality constraint. The synchronization overhead significantly limits the speedup that can be achieved with parallel and distributed simulations. In fact, synchronization overheads in conservative approaches can cause parallel execution to be slower than sequential execution [25]. In contrast, optimistic synchronization algorithms allow violations to occur but are able to detect and recover from them. Optimistic synchronization has been applied successfully in classical network contexts such as HPC, storage, and HPC interconnects [6, 7].


3 PARALLEL QUANTUM NETWORK SIMULATOR DESIGN

The parallel simulation splits the simulated quantum network into multiple non-overlapping subnetworks. Each subnetwork consists of a subset of the network nodes, including components of the node and associated qubits, that are executed by one process. The design of the parallel quantum network simulator should satisfy three requirements.

(1)

Accuracy: The results of the parallel simulation should faithfully reproduce results from the sequential simulation. Parallelization should not sacrifice realism.

(2)

Scalability: The parallelized simulator should have good scalability. The design must carefully address the parallelization overhead; otherwise, the parallel simulator performance could be suboptimal and in the extreme case even worse than that of its sequential counterpart.

(3)

Usability: The parallel simulator should be compatible with models used for sequential simulation. This requirement aims to reduce errors as well as allow switching between parallel and sequential simulation, thereby increasing the usability and flexibility of the simulator.

In this work, we extend SeQUeNCe [54] to operate as a parallel simulator, and we demonstrate how the design achieves all three requirements. We also note that these requirements are influenced by the implementation of a given parallel simulation, including the lookahead interval for synchronization and the partitioning of the simulated network among processes. We explore an optimization for the lookahead time in Section 5 and possible methods for network partitioning in Section 6.3.

3.1 Requirement Analysis

The accuracy of the simulation depends on the correct order of event execution and the accurate modeling of the network state. During a parallel simulation, a large network is partitioned into multiple non-overlapping subnetworks. Events in a subnetwork are executed by one assigned process. Event execution on one process may affect the order of events and the state of the network on the other processes. The accuracy of the simulation requires the simulator to synchronize these effects among processes. We identify three mechanisms by which event execution can affect the state of nodes in other subnetworks.

(1)

One node can send qubits to another node over a quantum channel. This action generates an event on the receiver node to receive qubits. This is typically used for entanglement generation between adjacent quantum routers.

(2)

One node can send classical messages to another node on a classical channel. This action creates an event to receive messages. Control protocols on nodes process these classical messages.

(3)

Entangled qubits that belong to different processes can change the shared entanglement state. For example, in quantum teleportation, the measurement on the sender node changes the quantum state of the qubit on the receiver node without any explicit quantum or classical message transmission.

The first two mechanisms affect states on other processes by generating events in the future. Processes can synchronize their states by exchanging cross-process events, and the synchronization algorithm can successfully avoid out-of-order event execution. The third mechanism affects states on other processes by changing a shared state. To ensure consistency of shared states, we have entangled qubits share a single entanglement state instance instead of synchronizing copies of the state across processes. A read/write lock is applied to avoid conflict when accessing those shared states.
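The shared-state approach can be sketched as follows (class and method names are illustrative, not the SeQUeNCe API): entangled qubits hold references to one state instance, and a lock serializes concurrent reads and writes.

```python
import threading

# Illustrative sketch of the shared entanglement state: entangled qubits
# reference a single state instance rather than holding synchronized copies.
class SharedEntangledState:
    def __init__(self, amplitudes):
        self._amplitudes = list(amplitudes)
        self._lock = threading.Lock()   # read/write serialization

    def read(self):
        with self._lock:
            return list(self._amplitudes)

    def write(self, amplitudes):
        with self._lock:
            self._amplitudes = list(amplitudes)

class Qubit:
    def __init__(self, shared_state, index):
        self.state = shared_state       # reference to the shared instance
        self.index = index              # this qubit's position in the state

# Two qubits held by different simulated processes share one instance, so a
# write through either qubit is immediately visible through the other.
epr = SharedEntangledState([2 ** -0.5, 0, 0, 2 ** -0.5])
q_a, q_b = Qubit(epr, 0), Qubit(epr, 1)
q_a.state.write([0, 2 ** -0.5, 2 ** -0.5, 0])
```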

Random number generation is another factor affecting reproducibility. Given the randomness in measuring qubits, the parallel simulator should generate random numbers identical to those of the sequential simulation. Therefore, we assign a random number generator (RNG) to every network node in both the sequential and parallel simulation. Events executed on a node or its components use the RNG on the node to generate a random number. Since components connected to a single network node are never split between processes, identical results will be produced as long as events on the node are executed in the correct sequential order.
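A minimal sketch of this node-level RNG scheme follows; the seeding rule (derived from the node name) is our own illustrative assumption, and the key property is that a node's random sequence is independent of the process it is assigned to.

```python
import numpy as np

# Sketch of node-level random number generation (illustrative, not the
# SeQUeNCe API). Each node owns a generator seeded deterministically, so
# sequential and parallel runs consume identical random sequences.
class Node:
    def __init__(self, name):
        self.name = name
        # Assumed seeding rule: derive the seed from the node name only.
        seed = int.from_bytes(name.encode(), "big") % (2 ** 32)
        self.rng = np.random.default_rng(seed)

    def random_bit(self):
        return int(self.rng.random() < 0.5)

a = Node("router_7")
b = Node("router_7")   # same node, e.g., simulated on a different process
seq_a = [a.random_bit() for _ in range(8)]
seq_b = [b.random_bit() for _ in range(8)]
# seq_a == seq_b: the node reproduces the same sequence in any layout
```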

The scalability of the simulator depends on the reduction of parallelization overhead. In a parallel quantum network simulator, the overhead results from synchronizing processes, exchanging data, and accessing the shared quantum states. Our design aims to reduce the execution time and complexity of these functions to improve the simulation scalability.

The usability of the parallel simulator depends on its ease of use. Models developed for parallel simulation ought to be compatible with the sequential simulation (one can configure the parallel simulator to run experiments with one processor). Therefore, our design should allow efficient and error-free simulation design; in other words, the parallelization should be systematically handled by the simulation kernel, with the process being completely transparent to users.

3.2 Parallel Simulator Architecture

Figure 4 depicts the architecture of our parallel simulator for quantum networks. We parallelize SeQUeNCe by leveraging multiprocessing, a technique in which separate processes handle separate subnetworks. We use MPI [48] for interprocess communication including cross-process events, produced by cross-process quantum and classical channels, for both shared- and distributed-memory systems. These two techniques allow users to run a parallel simulation on an MPI cluster. Although multiprocessing may incur a higher communication cost than multithreading does, HPC cluster optimizations have reduced the MPI communication overhead [10]. Our experiments in Section 6 also demonstrate that MPI communication has a limited impact on the parallelization overhead.

Fig. 4.

Fig. 4. Architecture of the parallel simulator. Processes communicate and synchronize using MPI. Inside each process, models use the simulation kernel to schedule events and access quantum states. Local QSMs in each kernel forward cross-process distributed entanglement states to the global QSM.

To track simulated quantum states, we design a QSM that provides an interface for manipulating and reading quantum states. For shared entanglement, we also design a global QSM as a separate server process. During the simulation, all cross-process entanglement states are stored in the global QSM, while non-entangled qubits or those entangled with only local qubits are stored in a local QSM. Entangled qubits hold the key for accessing their shared state only within a local or global QSM. An individual process may then communicate with the global QSM to retrieve or update the state of an entangled qubit, either directly or through a quantum circuit. Quantum circuit operations, represented as unitary matrices, are not generally commutative. These operations therefore cannot be reordered, implying a requirement for proper ordering with respect to the simulation time for global QSM requests. Since quantum circuits may operate only on qubits stored in local hardware, however, circuits on separate processes will not overlap. Thus, the execution order of requests within one synchronization window does not affect the final results. The global QSM may be implemented as a single-process program, a multithreaded program, or a cluster according to the scale of the parallel simulation. Thus, the global QSM can handle requests from different processes more efficiently and reduce potential resource competition. While interprocess events are exchanged by using MPI, the simpler point-to-point communication between the local QSMs and the global QSM uses TCP/IP, avoiding the additional overhead of MPI for QSM communication. Further QSM implementation details are described in Section 4.
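The division of labor between local and global QSMs, including the batching of outgoing requests, can be sketched as follows. All names are illustrative, and a direct method call stands in for the TCP socket used by the real implementation:

```python
# Hedged sketch of the hierarchical QSM (not the SeQUeNCe API).
class GlobalQSM:
    def __init__(self):
        self.states = {}                  # cross-process entanglement states

    def serve_batch(self, requests):
        # One packet of batched requests saves per-request round trips.
        results = {}
        for op, key, payload in requests:
            if op == "set":
                self.states[key] = payload
            results[key] = self.states.get(key)
        return results

class LocalQSM:
    def __init__(self, global_qsm):
        self.states = {}                  # states of purely local qubits
        self.pending = []                 # outgoing batch for the global QSM
        self.global_qsm = global_qsm

    def set_state(self, key, state, cross_process=False):
        if cross_process:
            self.pending.append(("set", key, state))   # defer and batch
        else:
            self.states[key] = state

    def flush(self):
        # Invoked at each synchronization point to serve pending requests.
        results = self.global_qsm.serve_batch(self.pending)
        self.pending = []
        return results

g = GlobalQSM()
local = LocalQSM(g)
local.set_state("q1", "local_state")
local.set_state("q2", "shared_state", cross_process=True)
responses = local.flush()                 # one round trip for the whole batch
```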

Fig. 5.

Fig. 5. Conservative synchronization algorithm.

To ensure proper execution of events, we adjust the conservative synchronization algorithm [33], not only to exchange cross-process events but also to synchronize states on the global QSM. The existing conservative algorithm has been demonstrated to work for networks without entangled states. Figure 5 depicts the operation of the conservative algorithm. Events on separate processes execute in parallel during the Computing phase. If an event execution schedules a future event to be run on a different process, then the future event is stored in the output queue OutQ. When a process executes all local events up to \(sync\_time\), it enters the Waiting phase until all processes are finished. This \(sync\_time\) is chosen to guarantee that all events in the OutQ will have future timestamps not yet reached by the simulation engine (as discussed below). The processes then exchange events in the OutQ during the Communicating phase. Additionally, a future \(sync\_time\) is calculated by using a lookahead value (lookahead).

We describe our implementation of the conservative algorithm with the addition of a global QSM in Algorithm 1. In this algorithm, we introduce the following data structures:

Queue: a queue storing events to be executed locally

OutQ: a queue of events that are scheduled locally but should be executed on a different process

InQ: a queue of events that are scheduled by a different process but should be executed locally

We also include the key configuration parameters lookahead (as defined above) and \(end\_time\) (defined below).

At each iteration of the algorithm loop, we check to see whether \(local\_time\) (the simulation time of event execution on the process) is less than \(end\_time\) (the simulation time at which event execution should halt). If the condition is met, then the algorithm moves into the Communicating phase; otherwise, the loop exits because the simulation has finished.

In the Communicating phase, MPI processes first calculate the minimum timestamp of events stored locally (\(local\_min\_time\)) by comparing the minimum timestamp of the events in the queue (Queue) and output queue (OutQ). Next, MPI processes move events in the local output queue to the corresponding process and broadcast their \(local\_min\_time\). The function “MPI_EXCHANGE_DATA” returns events from other processes to the input queue (InQ) and a list of \(local\_min\_time\) from other processes in the \(remote\_min\_times\) list. By calculating the minimum value of the \(remote\_min\_times\) list and \(local\_min\_time\), the algorithm determines \(global\_min\_time\), which is the minimum timestamp of all events in the simulation. If all processes have an empty event queue, then \(global\_min\_time\) will be infinity, and execution is halted. Otherwise, we can use \(global\_min\_time\) and the lookahead value to get the time window [34] bounded by a future \(sync\_time\) (in simulation time). If the time window extends beyond \(end\_time\), then we instead set \(sync\_time\) to be \(end\_time\).

As shown in Figure 5(a), the lookahead value is set so that all cross-process events may be scheduled for simulation times only after \(sync\_time\). We subsequently move into the Computing phase, where events in the local Queue are executed and \(local\_time\) is updated accordingly. After all the events before \(sync\_time\) are executed, we update \(local\_time\) to be \(sync\_time\). Additionally, the process sends a message to the global QSM to serve all pending requests. The process is unblocked upon receiving a response from the global QSM to ensure that all earlier requests on the global QSM are completely handled.
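The main loop of Algorithm 1 can be summarized in the following sketch, where mpi_exchange_data stands in for the MPI communication and the global QSM flush at the end of each window is elided. The names mirror the text, but the code is illustrative rather than the SeQUeNCe implementation:

```python
import heapq

INF = float("inf")

# Single-process sketch of the conservative synchronization loop. Events are
# (timestamp, payload) tuples kept in a min-heap ordered by timestamp.
def run(queue, lookahead, end_time, mpi_exchange_data, execute):
    heapq.heapify(queue)
    local_time, out_q = 0, []
    while local_time < end_time:
        # --- Communicating phase ---
        local_min = min([queue[0][0] if queue else INF] +
                        [event[0] for event in out_q])
        in_q, remote_min_times = mpi_exchange_data(out_q, local_min)
        out_q = []
        for event in in_q:
            heapq.heappush(queue, event)
        global_min = min([local_min] + remote_min_times)
        if global_min == INF:
            break                                  # all queues empty: halt
        sync_time = min(global_min + lookahead, end_time)
        # --- Computing phase ---
        while queue and queue[0][0] < sync_time:
            timestamp, payload = heapq.heappop(queue)
            local_time = timestamp
            execute(timestamp, payload, out_q)     # may append remote events
        local_time = sync_time
        # (Here the process would also flush pending global QSM requests.)
    return local_time
```

With a stub exchange function that returns no remote events, the loop reduces to sequential event execution with windowed barriers, which is the compatibility property discussed next.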

The parallel simulation kernel also supports sequential models for compatibility when needed. As shown in Figure 4, models can schedule events and manipulate quantum states using interfaces offered by the kernel. When the simulation is executed sequentially, the event scheduler does not need to communicate with other processes, and the local QSM does not need to connect with the global QSM. Therefore, the same model can be executed by using sequential or parallel simulation with only a simple setup configuration change.

3.3 Parallelization Overhead Sources

The proposed architecture introduces parallelization overhead. This overhead may be measured as an increase in execution time beyond that required for the execution of simulated events. Figure 5(b) shows the state of processes using the conservative synchronization algorithm. Processes have three states: Computing, Communicating, and Waiting. The Computing state denotes that the process is executing events whose timestamps are within the time window. The Communicating state denotes the MPI communication that creates a synchronization barrier among processes. The Waiting state denotes that a process has completed the execution of local events in the time window and is now waiting for all other processes to reach the communication barrier. We identify four sources of overhead. First, the barrier caused by the MPI communication makes one process (e.g., with a light workload) wait for other processes (e.g., with heavy workloads) in the Waiting state. The second source is due to the data exchange among processes. The number of exchanged events affects the communication overhead. The third source is the cost for barrier synchronization (e.g., the calculation of the local minimum timestamp for the next computing phase and the time window until the next synchronization). Large lookahead is essential for minimizing parallel synchronization overhead; this will be discussed in Section 5. Apart from the overhead for the synchronization algorithm, the global QSM introduces the fourth overhead source. In the parallel simulation, the global QSM handles operations on cross-process entangled states. As a result, the time spent on communication with the global QSM increases the overall execution time for manipulating entangled states. A parallel simulation with a larger amount of cross-process entanglement may thus introduce larger parallelization overhead due to higher global QSM communication. 
In Section 4 we introduce multiple techniques for reducing the overhead from these sources.

Skip 4QUANTUM STATE MANAGER Section

4 QUANTUM STATE MANAGER

This section describes the quantum state manager for our PDES-based simulator of quantum networks. Section 4.1 describes the essential functions provided by the QSM in the simulation kernel and the memoization technique used for accelerating matrix multiplication. Section 4.2 describes the global QSM, in which we explain the mechanism used for distinguishing cross-process entanglement and single-process entanglement. Additionally, we illustrate the local computing, traffic compression, and multithreading techniques that are used to reduce the overhead of communication between the local and global QSMs.

4.1 Local Quantum State Manager

The QSM within the simulation kernel is designed to provide methods for manipulating quantum states. Models of hardware and protocols may manipulate the quantum state of qubits with a qubit key that is produced by a universal unique identifier [28]. The QSM maintains the relation between keys and quantum states as a hash table. For entangled states, multiple keys may point to the same quantum state object. Here we introduce three important functions provided by the QSM:

\(set(keys, state)\): The QSM sets the quantum state to a given state for qubit(s) with given key(s).

\(get(key)\): The QSM returns the state of the qubit with a given key.

\(run(circuit, keys, prob\_sample)\): The QSM runs a quantum circuit on the qubit(s) with a given key (or keys). For the circuit with measurement operation, \(prob\_sample\), a random number in the range from 0 to 1, decides the measurement outcome. The function returns the outcome of the measurement.

Quantum states (including entangled states) are stored as amplitude vectors (Bra-ket notation [44]) or matrices (density matrices [16]), together with a list of keys that records the order of the entangled qubits. The get and set functions allow models to read and write the quantum state of one or many qubits, including any entanglement. Compared with the set function, hardware and protocol models mostly use the run function to change the quantum state of qubits. For example, the entanglement purification [11] and swapping [56] protocols are usually described as quantum circuits. A quantum circuit consists of quantum logic gates and/or measurements. The gates in the circuit may be represented by a unitary matrix. Unlike quantum logic gates, measurement is a stochastic, nonreversible operation. The simulator thus needs a random number to determine the measured state. To reproduce identical outcomes, the run function takes the random number \(prob\_sample\) from the caller if the circuit contains a measurement operation. As described in Section 3.1, \(prob\_sample\) is produced by the random number generator on the calling node to ensure the consistency of simulations.
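The key-to-state hash table described above can be sketched as follows. This is an illustrative mock, not the SeQUeNCe API: state objects hold an amplitude vector plus the ordered list of entangled-qubit keys, entangled keys share one state object, and `run_measure` (a stand-in for the measurement branch of `run`) handles only a single-qubit computational-basis measurement, using the caller-supplied \(prob\_sample\) for reproducibility.

```python
import numpy as np

class State:
    """Quantum state shared by one or more qubit keys."""
    def __init__(self, keys, amplitudes):
        self.keys = list(keys)                  # order of entangled qubits
        self.vec = np.asarray(amplitudes, dtype=complex)

class LocalQSM:
    def __init__(self):
        self.states = {}                        # qubit key -> shared State

    def set(self, keys, amplitudes):
        state = State(keys, amplitudes)
        for k in keys:
            self.states[k] = state              # entangled keys share one object

    def get(self, key):
        return self.states[key]

    def run_measure(self, key, prob_sample):
        """Measure one unentangled qubit; prob_sample in [0, 1) picks the
        outcome deterministically, so repeated runs reproduce results."""
        state = self.states[key]
        p0 = abs(state.vec[0]) ** 2
        outcome = 0 if prob_sample < p0 else 1
        self.set([key], [1, 0] if outcome == 0 else [0, 1])  # collapse
        return outcome
```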

In SeQUeNCe, we use the QuTiP [22] Python library to generate the unitary matrix of a given circuit. The output state is produced by matrix multiplication of the unitary matrix of the circuit gates with the input quantum state of the qubit(s), which is a CPU-intensive task. Furthermore, the simulation of a quantum network involves repeated computation of this matrix multiplication. Therefore, we utilize the memoization technique [32, 46] to speed up redundant functions, such as the run function. We implement least recently used caches with fixed sizes to cache the result of these multiplications.

4.2 Global Quantum State Manager

Although unentangled states or entangled states confined to one process may be managed by a single local QSM, cross-process entangled states may need to be accessed by multiple local QSMs. We thus designed a global QSM as a server process to ensure the consistency and accuracy of quantum states within the parallel simulation. While the quantum states of such qubits are stored in the centralized global QSM, the local QSMs on separate MPI processes provide transparent functions to manipulate the quantum state regardless of where the state is stored.

Since the global QSM is executed on an independent process, it must receive requests and send responses over the network, in contrast to a local QSM. Multiple MPI processes may communicate with the global QSM simultaneously, which can result in resource competition. Such competition may increase the processing time on the global QSM when performing large-scale parallel simulations. The communication time and resource competition constitute the parallelization overhead from the global QSM, as described in Section 3.3, and the global QSM may thus become a bottleneck for parallel simulation. To address this, we implement multiple techniques to efficiently reduce the global QSM overhead.

4.2.1 Mitigating Communication Overhead.

Fig. 6.

Fig. 6. Entanglement generation with the meet-in-the-middle protocol under two scenarios—quantum memories \(q_A\) and \(q_B\) emit photons \(q_a\) and \(q_b\) , respectively—to generate entanglement. \(S_A\) ( \(S_B\) ) denotes the quantum state of qubits \(q_A\) ( \(q_B\) ) and \(q_a\) ( \(q_b\) ). The optical fiber transmits photons to the BSM node located in the middle. Once detectors are triggered by photons, the BSM node sends messages to nodes containing memories. Protocols may determine the current quantum state of memories with these messages.

The total communication time depends on the number of transmissions and the latency of each transmission. In this work we reduced the number of transmissions between the local and global QSMs. These transmissions include three types of communication categorized by their goals:

Forward function call. The local QSM forwards function calls to the global QSM and waits for a response.

Synchronize quantum state. Operations on the local QSM may cause a state inconsistency, which requires the local QSM to update the state in the global QSM.

Acknowledge state of global QSM. This is for a local QSM to confirm that the global QSM has completed executing all the forwarded functions.

To reduce the number of transmissions, we first offload most of the computational work to the local QSM. We use entanglement generation with the meet-in-the-middle protocol [23] as an example to show interactions between the local and global QSMs in Figure 6. Qubits \(q_A\) and \(q_B\) stored in the quantum memories are entangled with photons \(q_a\) and \(q_b\) with the shared quantum state \(S_A\) (\(S_B\)). Photons are sent to the intermediate BSM node to generate entanglement between \(q_A\) and \(q_B\). In the parallel simulation, qubits either belong to one process, shown in Figure 6(a), or belong to two processes, shown in Figure 6(b). If the MPI process contains all required qubits, then the local QSM can produce the entangled state between \(q_A\) and \(q_B\) locally without forwarding the function to the global QSM. Otherwise, the local QSM must forward the function to the global QSM, since "local QSM 2," which executes events on the BSM node, does not hold the state \(S_A\). As a result, a simulation with fewer cross-process entangled states issues fewer communication requests to the global QSM and thus reduces the communication overhead.
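The forwarding decision above reduces to a residency check on the qubit keys a circuit touches. A minimal sketch, with `run_local` and `forward_to_global` as stand-ins for the local execution path and the RPC to the global QSM (neither name is from SeQUeNCe):

```python
def dispatch_run(local_keys, circuit_keys, run_local, forward_to_global):
    """Execute a circuit locally if every qubit it touches is resident in
    this process's local QSM; otherwise forward the call to the global QSM."""
    if all(k in local_keys for k in circuit_keys):
        return run_local(circuit_keys)      # no network round trip needed
    return forward_to_global(circuit_keys)  # cross-process entanglement
```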

The challenge of offloading work to the local QSMs is in ensuring the consistency of quantum states between the local and global QSMs. For example, in Figure 6(b) the quantum state \(S_A\) in the global QSM should be identical to the state \(S_A\) stored in “local QSM 1” before the global QSM executes the request from “local QSM 2.” The timing of synchronizing local states with the global QSM thus becomes an issue, since the parallel simulator must keep this consistency while avoiding redundant synchronization.

Investigation of existing entanglement protocols shows that the synchronization communication is triggered by two conditions: (1) transmitted qubits sharing an entangled state and (2) forwarded function calls requiring a local quantum state. The first condition may separate two entangled qubits from one process to two processes, so we synchronize the entanglement state to the global QSM. The management of this state is moved from the local QSM to the global QSM. To the best of our knowledge, all entanglement generation protocols rely on quantum channels to entangle qubits in two locations. The synchronization triggered by the first condition can avoid inconsistent states between the local and global QSMs. The second condition requires the global QSM to execute operations. The local QSM should push the local quantum state to the global QSM before forwarding the function call. The correctness of entanglement protocols, such as entanglement swapping, can be protected by these triggering conditions.

The synchronization requirement demands the management of qubits at the global QSM. The local QSM forwards the function calls related to either unknown qubits (i.e., qubits produced by other processes) or qubits managed by the global QSM. Once the qubits lose their entangled state, the management of qubits will be transferred back to the local QSM. In the parallel SeQUeNCe simulator, either the measurement on qubits or the set function can return the qubits back to the local QSM by destroying an entangled state.

In addition to offloading work from the global QSM as much as possible, we reduce communication time by grouping multiple requests into one packet. We observe that some requests to the global QSM do not expect responses, such as the run function without measurement and the set function. Therefore, the simulator caches requests with no responses. These requests are transmitted to the global QSM only if a new request needs results from the global QSM or the simulator forcefully transmits all requests in the cache using the notify_global_QSM_about_synchronization() function from Algorithm 1. This mechanism leads to efficient state synchronization since the synchronization does not require a response from the global QSM.
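The request batching described above can be sketched as follows. Requests that expect no response are buffered locally and flushed in one packet either when a response-bearing request arrives or when the kernel forces a flush at the synchronization barrier. The transport is stubbed, and all names here are illustrative rather than the simulator's actual interface.

```python
import json

class RequestBatcher:
    def __init__(self, send):
        self.send = send        # callable taking one serialized packet
        self.pending = []       # cached requests that expect no response

    def cast(self, func, args):
        """Queue a request needing no response (e.g., set, run w/o measure)."""
        self.pending.append({"func": func, "args": args})

    def call(self, func, args):
        """Send a response-bearing request, piggybacking the pending batch
        in the same packet."""
        self.pending.append({"func": func, "args": args})
        packet = json.dumps(self.pending)
        self.pending = []
        return self.send(packet)

    def flush(self):
        """Force-transmit the batch, e.g., at the synchronization barrier."""
        if self.pending:
            self.send(json.dumps(self.pending))
            self.pending = []
```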

We implement the communication between the local and global QSMs using the TCP/IP protocol. Function calls are serialized in JSON format. We execute the global QSM process on the same cluster where the MPI processes execute events in a parallel run, to reduce the communication latency.

4.2.2 Mitigating Resource Competition.

The resource competition from MPI processes reduces the performance of the global QSM. Since the global QSM needs to handle requests from all MPI processes, the performance of the global QSM will degrade with a growing number of processes and communication, and thus the global QSM may become the parallelization bottleneck.

To address this problem, we first design the global QSM as a multithreaded program. The global QSM uses independent threads to serve MPI processes concurrently. Compared with a single-threaded program, multithreading reduces the waiting time of requests. Compared with a multiprocess design, a multithreaded program provides a lighter-weight way to share resources among threads.

Using the multithreading technique requires thread safety for the shared quantum states in the global QSM. In the parallel simulation of a quantum network, multiple MPI processes may simultaneously operate on the same quantum state. For example, step 2 of Figure 1(b) shows that nodes A and C need to run a circuit to correct the quantum state of entanglement between A and C. If nodes A and C belong to two MPI processes, then the two threads in global QSM serving these MPI processes may simultaneously process requests that modify the entanglement state shared between A and C, a process that may cause a race condition. We use a set of locks to guarantee thread safety. The global QSM assigns each qubit a lock. When the global QSM receives a function call, the global QSM collects all qubit keys related to the function. These qubits include not only qubits listed in the argument of the function but also qubits entangled with listed qubits, since all these qubits share a single entangled state. Then, the thread acquires locks for all these qubits to avoid race conditions. After the execution of the function, the thread releases all locks.
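The per-qubit lock discipline above can be sketched in Python (the actual global QSM is implemented in C++). Acquiring the locks in a fixed sorted key order is our addition, a standard way to avoid deadlock when two threads need overlapping lock sets; the paper does not specify an acquisition order.

```python
import threading
from contextlib import contextmanager

qubit_locks = {}                 # qubit key -> lock
locks_guard = threading.Lock()   # protects the qubit_locks table itself

def _lock_for(key):
    with locks_guard:
        return qubit_locks.setdefault(key, threading.Lock())

@contextmanager
def hold_qubits(keys):
    """Acquire the locks of all given qubits in sorted key order; the
    caller must include the keys of qubits entangled with the listed
    qubits, since they share one state."""
    ordered = [_lock_for(k) for k in sorted(set(keys))]
    for lock in ordered:
        lock.acquire()
    try:
        yield                    # execute the forwarded function here
    finally:
        for lock in reversed(ordered):
            lock.release()
```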

In contrast to the rest of the SeQUeNCe source code, we implement the global QSM in C++. Since simulation models interact only with the local QSM, the global QSM is hidden from users and their models. Implementing the global QSM in C++ rather than Python results in higher performance while still maintaining the ease of use of the simulator. The C++ library Quantum++ [18] is used to manipulate quantum states on the global QSM. As an additional benefit, Quantum++ is reported to provide better performance than QuTiP [22]. The memoization mentioned in Section 4.1 is also implemented to reduce redundant computational work.

Skip 5LOOKAHEAD OPTIMIZATION Section

5 LOOKAHEAD OPTIMIZATION

The lookahead value determines the frequency of synchronization among processes. Parallel simulations with larger lookahead values synchronize the processes less often and thus exhibit better performance. Additionally, the overhead from the unbalanced workload can be mitigated by a larger lookahead value. To support the PDES of a quantum network, the minimum latency of cross-process channels can be selected as a baseline lookahead value. Although both quantum and classical channels introduce a propagation delay, the latency on the classical channels also includes processing delay, transmission delay, and queuing delay, which make the latency on a classical channel \(T_{cc}\) generally larger than the latency on a quantum channel \(T_{qc}\). Therefore, the lookahead value should be based on the latency of the shortest quantum channel.

Instead of using a lookahead value equal to the latency of the shortest quantum channel, we propose that the lookahead value \(\Delta t\) could be increased to \(\frac{T_{cc}}{2}\), where \(T_{cc}\) is the classical delay between routers and intermediate BSM nodes, if the simulated quantum network uses the meet-in-the-middle protocol to generate entanglement. Users could choose \(\frac{T_{cc}}{2}\) as the lookahead value if \(\frac{T_{cc}}{2} \gt T_{qc}\). This relies on the observation that in a general entanglement distribution network, only the entanglement generation protocol relies on the quantum channel; the remaining protocols manipulate entanglement using quantum circuit operations on a single node. This approach provides an effective delay of \(T_{cc}\) between entanglement generation operations on the BSM nodes and further processing on the quantum routers. We may thus use the lookahead value \(\frac{T_{cc}}{2}\), half that effective delay, to ensure proper ordering of events after entanglement generation, and have BSM nodes execute events asynchronously to ensure proper ordering of events operating on qubits received from routers. We define asynchronous execution as nodes executing events with a \(sync\_time\) (defined in Algorithm 1) that lags behind the \(sync\_time\) of other nodes by the lookahead value.
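The selection rule above can be sketched as a small helper; the function name and the boolean flag are ours, not SeQUeNCe identifiers.

```python
def choose_lookahead(t_qc, t_cc, meet_in_the_middle=True):
    """Pick the lookahead: T_cc / 2 when the meet-in-the-middle pattern
    makes it safe and it exceeds the quantum-channel latency; otherwise
    fall back to the shortest quantum-channel latency T_qc."""
    if meet_in_the_middle and t_cc / 2 > t_qc:
        return t_cc / 2
    return t_qc
```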

In greater detail, Figure 6 depicts the process of the meet-in-the-middle entanglement generation protocol. One BSM node is placed in the middle of two quantum routers. Routers first send qubit-encoded photons to the BSM node and then receive messages from the BSM node. In a parallel simulation, if two quantum routers belong to different processes, then the BSM node is the entity that connects with the two routers over the quantum channel. Routers using the meet-in-the-middle protocol send photons to the BSM node with \(T_{qc}\) delay. The message events generated by the BSM node occur after a further delay of \(T_{cc}\). We can use this transmission pattern to increase the lookahead value.

Fig. 7.

Fig. 7. Parallel simulation using half of the classical channel latency as the lookahead value \(\Delta t\) .

Figure 7(a) and (b) illustrate parallel simulation of the meet-in-the-middle protocol with \(\Delta t = \frac{T_{cc}}{2}\) and asynchronous timelines. The simulation starts from \(t_0\), and three nodes maintain their time and events separately. Quantum routers synchronize their time using Algorithm 1, and the BSM node exchanges cross-process events only at the time window. We label the order of execution on the figure as follows:

① Quantum routers execute events from \(t_0\) to \(t_0 +\Delta t\) and exchange events. Cross-process events are moved to the BSM node.

② The BSM node executes events from \(t_0\) to \(t_0 +\Delta t\), while routers execute events from \(t_0+\Delta t\) to \(t_0 + 2\Delta t\). The BSM node sends produced cross-process events to the routers and receives cross-process events from \(t_0 +\Delta t\) to \(t_0 + 2\Delta t\).

③ The BSM node executes events from \(t_0 + \Delta t\) to \(t_0 + 2\Delta t\), and the routers execute events from \(t_0 + 2\Delta t\) to \(t_0 + 3 \Delta t\). Then, the nodes exchange data as described in ②.

The BSM node receives events at the barrier \(t_0 + \Delta t\) and executes events from simulation time \(t_0\) to \(t_0 + \Delta t\). Since the simulation time of the BSM node lags behind that of the routers by at most \(2\Delta t\), cross-process events produced by the BSM node must have a timestamp larger than \(t_0 + 2\Delta t\). The second synchronization barrier at \(t_0 + 2\Delta t\) can then move cross-process events produced by the BSM node to the routers. In this way, all cross-process events are exchanged and executed in the correct order, and PDES can use the \(\Delta t = \frac{T_{cc}}{2}\) lookahead to reduce the parallelization overhead if \(\frac{T_{cc}}{2} \gt T_{qc}\).

To apply this mechanism, the MPI process picks out BSM nodes that connect with routers on other processes before running the simulation. The MPI process maintains two timelines for the selected BSM nodes and other nodes, respectively, with the BSM timeline being asynchronous. MPI processes then execute events as shown in Algorithm 1 with the mechanism described previously to run the parallel simulation. We evaluate the improvement from this optimization in Section 6.

In this work, we select and optimize the lookahead value for networks with fixed distances between nodes, like those based on optical fiber links. If the network links have variable distance (as in the case of wireless networks), then the simulator has to dynamically adjust the lookahead value or even dynamically migrate simulated nodes between processes. Based on experience from parallelization of wireless networks [30], researchers could optimize the lookahead value or algorithm by studying the characteristics of those networks. Then, researchers could run parallel simulations simply by replacing our lookahead algorithm with their own. The rest of the system does not depend on any assumption related to the lookahead.

Skip 6PERFORMANCE EVALUATION Section

6 PERFORMANCE EVALUATION

We conduct extensive evaluation experiments for the parallelized SeQUeNCe on three quantum network topologies. We report four components of the wall-clock execution time of the simulations: MPI communication, socket communication, event execution, and barrier synchronization. The MPI communication includes the overhead from the data exchange and the unbalanced workload. The socket communication denotes the aggregated pending time of a process caused by the communication between the local and global QSMs. The execution time speedup (strong scaling) and efficiency (weak scaling) [50], both computed as \(T_s / T_p\), are used as the metrics to quantify the scalability of the simulator, where \(T_s\) and \(T_p\) denote the execution time for sequential simulation and parallel simulation consisting of p processes, respectively. The barrier synchronization algorithm overhead is very low (less than 1 s), and thus we do not plot it in the following figures.

The simulations of the three quantum networks utilize identical models of software and hardware. Simulations execute up to 30 million events by using 1–128 processors. Considering the scale of the simulated network, we limit the maximum number of processors to 128, but our parallel simulator supports parallel simulations using more processors. In Table 2 of the supplementary material, we list values of parameters shared among simulations; the detailed model design can be found in our prior work [54]. The simulated networks all contain equal-length quantum channels, so that the minimum delay used for the lookahead value is equal to the delay of each channel. We discuss other possible channel configurations in Section 8. For classical communication, the nodes are connected with the channels in a fully connected mesh topology.

The three simulated quantum networks have different network topologies and flows. We define a flow as a long-distance entanglement request between two endpoints in the network. In Section 6.1 the first quantum network uses a linear topology to distribute entanglement between the end nodes. Such a chain of quantum repeaters is useful for the long-distance distribution of entanglement. The network uses only one traffic flow to distribute entanglement between the two routers on the endpoints.

In Section 6.2 the second quantum network uses the topology of a connected caveman graph [49] that consists of 128 connected cliques, each having 8 quantum routers. This topology is more complicated than the linear topology. We simulate multiple random flows in the network. The random flows are produced by the network nodes, and each node selects a random destination.

The third quantum network uses the topology of an Internet autonomous system network [15] generated by NetworkX [19]. As in the second network, every network node produces a flow with a random destination. The challenge of parallel simulation for this kind of network results from its irregular topology. We introduce three heuristic methods to optimize the network partitioning in Section 6.3.

All the simulations were executed on the Broadwell partition of Bebop [27], the high-performance computing cluster operated by the Laboratory Computing Resource Center at Argonne National Laboratory. Each of the Broadwell nodes has 36 Intel Xeon E5-2695v4 cores and 128 GB of DDR4 running CentOS 7. All programs are interpreted using Python 3.8.5 or compiled using g++ 4.8.5. We utilize 1 compute node for simulations up to 32 processes, while we use 2 and 4 compute nodes for 64- and 128-process simulations, respectively. The number of processes per compute node is thus equal to the total number of processes up to 32 and then remains constant at 32 for larger simulations. This process number refers to the number of processes for network nodes; we additionally add 1 process for the global QSM in each simulation, increasing the total process number by 1. Each process maintains 2 timelines for lookahead-optimized code (the usual use case) and 1 timeline otherwise.3

Bebop nodes use an Intel Omni-Path interconnect for communication, with a local MPI implementation. The other libraries used for simulation were NumPy (version 1.18.5), pandas (version 0.24.2), QuTiP (version 4.6.0.dev) [22], tqdm (version 4.47.0), NetworkX (version 2.4), mpi4py (version 3.0.3), and Quantum++ (version 2.6) [18]. Python packages were managed using Anaconda version 4.8.3, and the C++ QSM server was built using CMake version 3.18.4.


6.1 Chain of Quantum Routers

Fig. 8.

Fig. 8. Simulations of a linear network.

The first kind of network has a quantum router chain that serves only one flow that distributes entanglement between the two end routers. The quantum network uses a chain of routers to mitigate the exponentially increased loss rate of photons caused by the increased distance [5]. Figure 8(a) illustrates the topology of a quantum network with five routers. In this type of network, adjacent nodes first use the meet-in-the-middle generation scheme [23] to distribute entanglement over a short distance. Then, the entanglement swapping [56] and purification protocols [11] are used to distribute entanglement over a longer distance with high fidelity. For the parallel simulation experiments, we evenly divide the number of routers by the number of processes (\(n = 1024/p\)) and assign them sequentially to each process (i.e., routers 1 to n will be assigned process 1, routers n+1 to 2n to process 2, etc.). For simulations with the increased lookahead value introduced in Section 5, the BSM nodes that connect two different processes are placed in a separate process where the simulation time is allowed to lag.
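The contiguous block partitioning described above can be sketched as follows; the function name is ours, and the mapping assumes the router count is divisible by the process count, as in the \(n = 1024/p\) setup.

```python
def partition_linear(num_routers, num_procs):
    """Assign routers to processes in contiguous runs of
    n = num_routers / num_procs, as in the linear-topology experiments."""
    n = num_routers // num_procs
    return {router: router // n for router in range(num_routers)}
```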

All our linear topology simulations use a 1 km distance between all adjacent routers. The BSM node is in the middle of each link; in other words, the length of the optical fiber is 0.5 km. All routers contain 100 quantum memories.

We first evaluate the strong scaling of the simulator in a quantum network with 1,024 routers and a total of 102,400 quantum memories. Strong scaling requires simulations with a fixed total workload and an increasing number of processes. We simulate the network with 1 to 128 processes. For this and future simulations using 1 process, the sequential version of SeQUeNCe was used (rather than the parallel version running with 1 MPI process). Since one computing node contains only 36 cores, parallel simulations with 64 and 128 processes are executed on 2 and 4 computing nodes, respectively. Figure 8(b) illustrates the average execution time of each simulation process for 10 separate simulations and the speedup ratio achieved by the parallel simulation. Since the standard deviation of the measured execution time ranges from 7.4 to 189.6 s for strong scaling (with 32 and 1 processes, respectively) and from 0.8 to 13.48 s for weak scaling (with 16 and 128 processes, respectively), we do not include error bars in the plot. We observe that the execution time decreases with the increasing number of processes. The parallel simulation can speed up the simulation by a factor of up to 25. When we double the number of processes, the computation time is halved (indicated by the blue part). The synchronization time (indicated by the red part) is roughly constant, which leads to a growing portion of the total execution time with the growing number of processes (from 0.07% to 0.67% of all execution time) as the computation time drops. We also notice that the time for the socket communication between the local and global QSMs is trivial in this case: it increases from 2.0 s to 18.6 s as the number of processes increases from 2 to 128.

We also evaluate the performance of the simulator with weak scaling using a scaled workload. We fix the number of routers on one process to 8 and increase the size of the network with the number of processes. Figure 8(c) illustrates the execution time and efficiency of the simulations. We observe that the execution time increases as the model gets larger and runs on more processes. The majority of overhead results from the synchronization algorithm; little overhead is generated from the socket communication, as local QSMs process more than 90% of requests without socket communication. We explore the cause of this increased synchronization overhead below. Note that we scale the workload by increasing the size of the network; the workload therefore does not increase exactly linearly, which causes the increased time (from 40.9 to 60.7 s) of event execution. In addition, the parallelization efficiency decreases with the number of processes as a result of the increased overhead. The efficiency of the 2-process simulation is 0.96, whereas the efficiency of the 128-process simulation is 0.19.

Both strong-scaling and weak-scaling evaluations show an increasing overhead. We observe that the performance in both cases drops significantly when the number of processes increases from 32 to 64. There are two reasons for this drop. First, the performance of the multithreaded global QSM drops when the number of processes is larger than 32. Since we execute the global QSM on one computer that serves all MPI processes, the capability of the global QSM is limited by the number of processors and is affected by the MPI processes on the same machine. When the number of processes is less than or equal to 36, the number of processors on one Broadwell node, the global QSM threads do not compete with each other. However, when the number of processes is greater than 36, threads of the global QSM compete for computation resources and thus show worse performance. We observe that the global QSM achieves a similar processing speed (defined as the time to complete a communication request with a response from the global QSM) for parallel simulations with up to 32 processes. The processing time of the global QSM is doubled for simulations with 64 processes and tripled for 128 processes. The degraded performance is also demonstrated by the growing socket communication time as the number of processes increases from 32 to 128 (see Figure 8(c)).

The second source of overhead is the unbalanced workload. The red bars in Figure 8 indicate the synchronization overhead, which includes both the unbalanced workload and the transmission of data. We therefore design further experiments to distinguish and quantify which factor actually causes the increased synchronization time. The experiments duplicate the cross-process data before exchanging events. When the processes receive data from other processes, the duplicated data is discarded. These experiments increase the amount of cross-process transmission during synchronization without affecting the distribution of the workload. Figure 8(d) shows the execution time of simulations that duplicate data from 0 (i.e., no duplication) to 8 times. The four simulations show a similar time in terms of not only event execution and socket communication but also synchronization. Therefore, we conclude that the unbalanced workload during the time window causes a large overhead even if the aggregated workload is balanced. Since the workload balance depends on the simulation setup, addressing this kind of overhead is out of the scope of the design of a parallel simulator.

We also evaluate the lookahead value improvement technique with both strong scaling and weak scaling. Figure 8(e) and (f) compare the performance of the simulator with and without this technique under strong and weak scaling, respectively. For parallel simulation, the left bar represents the larger lookahead value (\(T_{cc}/2\)), and the right bar represents the smaller lookahead value (\(T_{qc}\)); the technique thus increases the lookahead value from 2.5 \({\mu }\)s to 150 \({\mu }\)s. The proposed technique reduces the synchronization overhead in all cases and thus improves the scalability of the simulator. We also verify that the event traces, including the number of executed simulation events, timestamps, object and function names, and function arguments, are identical between the parallel and sequential simulations in all cases.4
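
The benefit of the larger lookahead follows from a back-of-the-envelope count of synchronization barriers. In a conservative parallel DES, processes may safely execute only events within one lookahead window before synchronizing again; the lookahead values below come from the text, while the one-second simulated duration is an assumption for illustration.

```python
import math

def sync_rounds(sim_time_us, lookahead_us):
    """Conservative PDES: events are safe to execute only within one
    lookahead window, so processes must synchronize once per window."""
    return math.ceil(sim_time_us / lookahead_us)

# Lookahead rises from T_qc = 2.5 us to T_cc / 2 = 150 us (values from
# the text); one simulated second is assumed for the illustration.
SIM_TIME_US = 1_000_000
rounds_small = sync_rounds(SIM_TIME_US, 2.5)   # lookahead = T_qc
rounds_large = sync_rounds(SIM_TIME_US, 150)   # lookahead = T_cc / 2
# The larger lookahead cuts the number of barriers by roughly 60x.
```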

6.2 Quantum Network with Caveman Graph Topology

The second network uses the caveman graph topology [49]. The caveman graph \(Caveman(n,k)\) is formed from a set of n isolated k-node cliques (caves) by removing one edge from each cave and using it to connect the cave to its neighbor along a cycle. The community structure of the caveman graph allows us to easily group nodes for parallelization: nodes in the same cave are assigned to the same process, and neighboring caves are placed together. Compared with the networks in Section 6.1, this network serves multiple flows with a smaller number of hops. Figure 9(a) illustrates the network using the caveman graph \(Caveman(4,4)\) as the topology, in which both the size of each cave and the number of caves are 4.
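
The cave-based grouping described above can be sketched with NetworkX's `connected_caveman_graph` generator; the helper function and its partition rule are illustrative, not SeQUeNCe's actual partitioning code.

```python
import networkx as nx

def cave_partition(n_caves, cave_size, n_procs):
    """Assign each cave's routers to one process, keeping neighboring
    caves on the same process (n_procs must divide n_caves)."""
    assert n_caves % n_procs == 0
    g = nx.connected_caveman_graph(n_caves, cave_size)
    mapping = {}
    for node in g.nodes:
        cave = node // cave_size                    # caves occupy contiguous label ranges
        mapping[node] = cave * n_procs // n_caves   # adjacent caves share a process
    return g, mapping

# Caveman(4, 4) on 2 processes, as in Figure 9(a).
g, part = cave_partition(n_caves=4, cave_size=4, n_procs=2)
# Count quantum channels that cross a process boundary.
cross = sum(1 for u, v in g.edges if part[u] != part[v])
```

Because only the rewired cycle edges can leave a cave, grouping whole caves keeps the number of cross-process channels at the unavoidable minimum of the cave-connecting cycle.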

All simulations use the \(Caveman(128,8)\) network topology that serves 1,024 random flows. Every node in the caveman graph represents one quantum router, and every edge in the graph represents optical fibers and the BSM node used for the meet-in-the-middle protocol. The distance between adjacent routers is 10 km. A sufficient number of quantum memories is allocated on each node to support all flows that pass through it; in our setup, 81,500 quantum memories are used in total. At the beginning of the simulation, every router produces a flow that occupies 25 memories in the source and destination routers of the flow and 50 memories in each intermediate router. The destination routers are selected randomly based on the number of hops in the flow, which follows an exponential distribution with \(\lambda = 1\). We assume that the number of short-distance flows is greater than the number of long-distance flows in a realistic quantum network.
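
The flow generation and memory accounting above can be sketched as follows. The hop counts follow the stated exponential distribution; the destination selection is abstracted here as a ring offset, an illustrative simplification rather than the paper's exact rule.

```python
import random

def make_flows(n_routers, n_flows, seed=0):
    """Generate random flows. Hop counts follow Exp(lambda=1), rounded up
    to at least one hop; destination selection is abstracted as a ring
    offset (a simplification for illustration)."""
    rng = random.Random(seed)
    flows = []
    for i in range(n_flows):
        src = i % n_routers
        hops = max(1, round(rng.expovariate(1.0)))
        dst = (src + hops) % n_routers
        flows.append((src, dst, hops))
    return flows

def total_memories(flows):
    """25 memories at each endpoint plus 50 per intermediate router,
    i.e., 50 * hops memories per flow."""
    return sum(2 * 25 + 50 * (hops - 1) for _, _, hops in flows)

flows = make_flows(n_routers=1024, n_flows=1024)
assert all(h >= 1 for _, _, h in flows)
```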

Fig. 9. Simulations of a caveman network.

Figure 9(b) illustrates the performance of simulations executed on 1 to 128 processes, with execution time shown as an average over five runs. Again, we note that the standard deviation in execution time ranges from only 6.0 to 164.0 s, and thus error bars are not plotted. We evaluate the scalability of the parallel simulation by inspecting the execution time and the speedup with strong scaling. We observe that the execution time decreases as the number of processes grows, and the parallel simulation reduces the execution time by a factor of up to 20.5. Although the speedup of simulating this network is similar to the speedup shown in Section 6.1 for fewer than 64 processes, the speedup decreases by around 23% for parallel simulations with 64 and 128 processes. This scalability degradation results from an unbalanced workload. Here we use the number of quantum memories utilized on each process to quantify load imbalance across MPI processes, calculating the coefficient of variation (CV), the standard deviation divided by the mean. We observe that the CV increases from 0.01 (2 processes) to 0.17 (128 processes). The unbalanced distribution of quantum memories causes an unbalanced workload, which increases the overhead and therefore reduces the scalability of parallelization. However, placing routers of the same cluster within one process significantly reduces the amount of cross-process entanglement: less than 10% of the work is forwarded to the global QSM. Even though the 128-process simulation has the most cross-process entanglement states, the simulation incurs only a 30-s wait time (of a 240-s total runtime) for responses from the socket communication. The partition scheme used for caveman networks balances the number of nodes across processes and reduces the number of cross-process entanglement states, but the unbalanced distribution of quantum memories still limits scalability. Therefore, we use the next simulation to investigate how partition schemes affect the scalability of parallel simulation on a more complicated topology.
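
The CV metric used above to quantify load imbalance is straightforward to compute; a small sketch:

```python
def coefficient_of_variation(values):
    """CV = standard deviation / mean; 0 means a perfectly balanced load."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return (var ** 0.5) / mean

# Memories held by each process: balanced vs. skewed hypothetical assignments.
assert coefficient_of_variation([100, 100, 100, 100]) == 0.0
skewed = coefficient_of_variation([40, 80, 120, 160])   # roughly 0.45
```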

6.3 Quantum Network with Autonomous System Network Topology

Fig. 10. Simulations of an autonomous system network graph.

We choose a graph resembling the Internet autonomous system (AS) network as the topology of the third simulated network. As in the previous simulations, nodes of the graph represent quantum routers, and edges represent optical fibers with BSM nodes used for the meet-in-the-middle protocol. These routers follow the same method introduced in Section 6.2 to produce random flows. We simulate the quantum network with 1,024 routers in the AS network topology; 81,300 quantum memories are used (again, based on flow usage) to serve 1,024 random flows. We investigate the scalability of the parallel simulator under this scenario. We use NetworkX [19] to generate the graph \(AS(1024,0)\), which denotes a random 1,024-node AS graph generated with seed 0.

Figure 10(a) illustrates a smaller network, \(AS(32,1)\). Unlike the previous two network topologies, this topology includes many edge routers that are connected through core routers. These core routers have a higher degree than the edge routers do, which results in an unbalanced distribution of workload among the routers. This unbalanced workload makes the topology a challenging scenario for network partition algorithms for parallel simulation.
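
The degree imbalance of core versus edge routers is easy to see on any scale-free graph. Here we use NetworkX's Barabási–Albert generator as a small stand-in for the AS-like topology (an assumption for illustration, not the paper's exact generator):

```python
import networkx as nx

# A 32-node scale-free graph: a few hub ("core") nodes attract most edges.
g = nx.barabasi_albert_graph(32, 2, seed=1)
degrees = [d for _, d in g.degree()]
mean = sum(degrees) / len(degrees)
# Hub routers far exceed the average degree, so an even split of nodes
# across processes alone cannot balance the workload.
max_over_mean = max(degrees) / mean
```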

We propose three heuristic solutions for network partitioning: \(Partition 1\), \(Partition 2\), and \(Partition 3\). All three use simulated annealing [47] to optimize their energy functions. Simulated annealing requires three components: a state, a neighboring state, and an energy function. A state defines the mapping between nodes and processes; the initial state evenly assigns routers to processes. A neighboring state is generated by swapping two routers assigned to different processes. Together, the initial state and this neighborhood definition ensure that every candidate partition distributes nodes evenly across processes. The simulated annealing method starts from the initial state and randomly moves between neighboring states; after a number of iterations, the lowest-energy state visited is taken as the final partition scheme. Given different energy functions, simulated annealing may produce different partition schemes. We design three energy functions using different assumptions.

(1) \(Partition 1\) assumes that users know all flows in the simulation. Its energy function counts the number of cross-process flows.

(2) \(Partition 2\) assumes that users know only the topology of the network. Its energy function counts the number of cross-process quantum channels.

(3) \(Partition 3\) assumes that users know how many memories are used in each router. Its energy function measures the imbalance of the memory distribution.

The energy functions of \(Partition 1\) and \(Partition 2\) are designed to reduce the overhead of the socket communication, while the energy function of \(Partition 3\) is designed to reduce the overhead of unbalanced workloads. We conduct parallel simulations with these three partition solutions on 2 to 128 processes. These simulations help us explore how various network partitions affect the parallelization performance.
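The annealing procedure above can be sketched as follows. The energy shown is that of \(Partition 2\) (cross-process quantum channels); all function names, the cooling schedule, and the toy two-clique graph are illustrative assumptions, not the paper's exact implementation.

```python
import math
import random

def cross_process_edges(edges, assign):
    """Energy for Partition 2: number of cross-process quantum channels."""
    return sum(1 for u, v in edges if assign[u] != assign[v])

def anneal_partition(n_nodes, edges, n_procs, iters=2000, t0=2.0, seed=1):
    """Simulated-annealing partitioner sketch.

    State: node -> process map, initially an even block assignment.
    Neighbor: swap two nodes on different processes, so every visited
    state keeps the per-process node counts exactly equal.
    """
    rng = random.Random(seed)
    assign = [i * n_procs // n_nodes for i in range(n_nodes)]  # even split
    energy = cross_process_edges(edges, assign)
    best, best_energy = list(assign), energy
    for step in range(iters):
        temp = t0 * (1 - step / iters) + 1e-9     # linear cooling schedule
        a, b = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if assign[a] == assign[b]:
            continue                              # swap must cross processes
        assign[a], assign[b] = assign[b], assign[a]
        new_energy = cross_process_edges(edges, assign)
        if new_energy <= energy or rng.random() < math.exp((energy - new_energy) / temp):
            energy = new_energy                   # accept (always, if not worse)
            if energy < best_energy:
                best, best_energy = list(assign), energy
        else:
            assign[a], assign[b] = assign[b], assign[a]   # reject the swap
    return best, best_energy

# Toy instance: two 4-node cliques {0,2,4,6} and {1,3,5,7} joined by one
# edge; the optimal balanced 2-way partition cuts only that bridge.
even, odd = [0, 2, 4, 6], [1, 3, 5, 7]
edges = [(u, v) for grp in (even, odd) for i, u in enumerate(grp) for v in grp[i + 1:]]
edges.append((6, 1))
part, cut = anneal_partition(8, edges, n_procs=2)
```

Swapping in a different energy function, for example the flow-count or memory-imbalance objectives of \(Partition 1\) and \(Partition 3\), changes only the `cross_process_edges` callable, not the annealing loop.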

Figure 10(b) illustrates the execution time of the simulations. We note again that the maximum standard deviation in simulation time is 220.8 s, so error bars are not plotted. For parallel simulations, the left, middle, and right bars represent partitioning using \(Partition 1\), \(Partition 2\), and \(Partition 3\), respectively. We observe that parallel simulations using \(Partition 3\) consume more time for socket communication than the other two partition methods because of the different objective functions. Figure 10(c) demonstrates that the increased socket communication time results from the additional requests processed by the global QSM: compared with \(Partition 1\), \(Partition 3\) requires the local QSM to forward 30% more requests to the global QSM. Among the three methods, \(Partition 1\) performs best when the number of processes is 2, 4, 64, and 128; \(Partition 2\) performs worst in most cases; and \(Partition 3\) performs best when the number of processes is 8, 16, and 32. When the parallel simulation uses fewer processes, \(Partition 1\) finds a partition with low socket communication overhead, and the workload imbalance is suppressed by the limited number of processes as well as the equal number of nodes per process. As the number of processes increases, the overhead from the unbalanced workload dominates the execution time, while the socket communication time is mitigated by the multithreaded global QSM. Thus, \(Partition 3\) shows the best parallel performance, exemplified by its measured memory CV of 0.10 for 32 processes, compared with 0.37 and 0.43 for \(Partition 1\) and \(Partition 2\), respectively.5 When the number of processes is too large, however, equally distributing nodes to processes becomes a bottleneck for \(Partition 3\). For example, with a parallel simulation of 128 processes, we expect every process to control \(81300 / 128 \approx 635\) memories on average, but the heaviest router alone controls 2,675 memories. Seven routers must still be placed in the same process as the heaviest router, which deteriorates the workload balance. As a result, aside from the decreased socket communication overhead, the three partition solutions show similar performance when the number of processes exceeds 32.

\(Partition 3\) exhibits the largest socket communication time; those scenarios are therefore good candidates for evaluating the techniques proposed in Section 4.2 to reduce communication time. We compare the performance of the parallel simulation with work offloading and with request grouping. Figure 10(d) illustrates the normalized execution time of the simulation, where a normalized time of 1 corresponds to no communication time optimization and lower values are desirable. We observe that both optimization techniques always speed up simulations. Offloading work from the global QSM to the local QSMs has a bigger impact on simulations with fewer processes: since fewer processes lead to less cross-process entanglement, more manipulations of quantum states can be handled locally instead of by the global QSM. Compared with work offloading, grouping multiple requests into one message achieves a more significant performance improvement, especially for simulations with more processes. By grouping requests, parallel simulations save more than 75% of the communication time, and the benefit of this optimization is less affected by the increasing number of processes. After combining both optimization methods, we speed up the simulation using 128 processes by 4.5 times and the simulation using 2 processes by 11.5 times, which significantly improves the scalability of our simulator.
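
The request-grouping optimization amortizes per-message overhead by batching all pending local-QSM requests into one socket message. A minimal sketch (the class, wire format, and method names are illustrative assumptions, not SeQUeNCe's actual protocol):

```python
import json

class BatchingClient:
    """Group many local-QSM requests into one message per sync window."""

    def __init__(self, send):
        self.send = send          # callable taking one serialized message
        self.pending = []

    def request(self, op, key):
        """Queue a request instead of sending it immediately."""
        self.pending.append({"op": op, "key": key})

    def flush(self):
        """One message carries all pending requests, amortizing the
        per-message socket and serialization overhead."""
        if self.pending:
            self.send(json.dumps(self.pending))
            self.pending.clear()

messages = []
client = BatchingClient(messages.append)
for i in range(100):
    client.request("get", f"memo_{i}")
client.flush()
assert len(messages) == 1                 # 100 requests, one message
assert len(json.loads(messages[0])) == 100
```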

7 RELATED WORK

Recently, several discrete-event quantum network simulators have been developed, including QuISP [31], NetSquid [8], SQUANCH [3], and SeQUeNCe [52–54]. These simulators offer a flexible and reproducible testing environment for studying the performance of quantum networks, but they differ in their supported features, hardware and protocol models, and simulation use cases. With the exception of SQUANCH, the performance of these simulators has been limited by their sequential execution of simulation events. We believe that the techniques that allowed us to parallelize quantum state management in the SeQUeNCe simulator can be transferred to the other simulators.

Thus far, large-scale quantum network simulations have used sequential processing and achieved improved scalability by relying on simplified qubit representations. For example, QuISP [31] aims to simulate a full Quantum Internet consisting of up to 100 networks of up to 100 nodes each; it achieves scalability by replacing full quantum-state tracking with an error model of qubits in large networks. NetSquid allows the user to choose among several quantum state representations that vary in their scalability, namely, ket vector, density matrix, stabilizer, and graph state representations. The NetSquid simulator was successfully used in a recent simulation of a chain of up to 1,000 nodes [8]. SQUANCH [3] is notable for allowing parallelized simulations that use multiprocessing, where each process manages one simulated network node. However, that simulator does not attempt to model interactions of protocols or details of the physical layer, and it does not guarantee the order of event executions.

Several classical network simulators support parallelization, such as ns-3 [38] and OMNeT++ [39]. Parallelizing classical network simulators is simpler, however, because they involve neither exponentially large quantum state representations nor the nonlocal effects of quantum state measurements. Kazer et al. [25] used the built-in MPI-based PDES framework of OMNeT++ to simulate a highly interconnected data center network and found that synchronization can actually cause parallel DES to perform worse than a single-threaded implementation. Nikolaev et al. [37] evaluated the performance of the distributed ns-3 network simulator and concluded that it scales only to about 100–200 cores/processes; beyond that limit, performance suffers because of increased interprocess synchronization and communication costs. Barnes et al. [2, 36, 37] simulated networks with 1 billion nodes (routers and hosts) using ns-3 on a supercomputer. Their scalability metric and their implementation of a null message synchronization algorithm [17] may be useful in future work.

Our previous work [55] demonstrated that parallelization can greatly speed up the simulation of quantum key distribution networks [21]. In that work, a network model was divided into multiple components, each executed on a different logical process, with a synchronization mechanism ensuring the correct execution order of cross-process events. The simulated QKD network was composed of QKD terminals and channels transmitting both photons and classical packets. As in our current work, we applied the conservative synchronization algorithm [34]. However, the earlier work was limited to tracking quantum superposition states and did not consider the distribution of entanglement, which is necessary in general-purpose quantum communication networks. Here we address this shortcoming by designing and implementing simulation techniques with distributed quantum state management that allow efficient parallel simulation of entanglement distribution.

8 CONCLUSION

In this article we presented a parallelized version of the SeQUeNCe simulator for quantum communication networks. The simulator uses a conservative synchronization algorithm for maintaining event execution order between parallel processes. A QSM was developed to track shared quantum states among processes, with a centralized global QSM servicing requests from local QSMs on individual processes. We additionally investigated multiple methods for optimizing the performance of a quantum network simulator.

Furthermore, by working within the modularized architecture of the SeQUeNCe simulator, we have provided a parallel simulator that requires minimal adaptation of existing sequential simulations. The kernel module, which processes events and stores quantum states, was fully parallelized, while the higher-level hardware and protocol layers required few changes. In conjunction with the well-ordered event execution guaranteed by the conservative synchronization algorithm, our design ensures identical results for sequential and parallel simulations.

This initial work on parallel quantum network simulation sets the foundation for further exploration in the field. In future work, we plan to investigate the dynamic assignment of lookahead values for non-homogeneous optical links that change over time, improve the efficiency of global QSM communication through the study of multilayer QSM hierarchies, and compare our partitioning scheme with traditional methods such as METIS. Additionally, we aim to explore the use of optimistic synchronization either in place of or in combination with the current conservative synchronization approach as a means to enhance the performance of parallel quantum network simulations. We will also extend our scalability study in terms of (a) the number of processors used for the parallel simulation, (b) the size of the networks simulated, and (c) the amount of workload. These future research efforts will deepen our understanding of this field and provide valuable insights for practitioners.

We are releasing the parallelized SeQUeNCe simulator as open source software with the goal of providing an easy avenue for constructing parallel simulations whose performance will enable the development of future large-scale quantum networks. Our plans for future development of the SeQUeNCe simulator include the introduction of new optical hardware models, including new types of quantum memories, and the release of an interactive GUI for the intuitive construction of new simulations.

ACKNOWLEDGMENTS

We thank Siu Man Chan for contributions to the SeQUeNCe software package and Gail Pieper for help with editing. We also thank Tekin Bicer for providing the illustration for Figure 3. We thank Rex Brown for discussions that helped improve our design, and we thank the anonymous reviewers whose comments helped improve this article.

Footnotes

1. GitHub Repository: https://github.com/sequence-toolbox/SeQUeNCe/
2. The "parallel" branch specifically includes the code used in this work.
3. More configuration details can be found in the supplementary material.
4. Validation details can be found in the supplementary material.
5. See the respective CV values in the supplementary material.

Supplemental Material

REFERENCES

[1] Sigrún Andradóttir and Teunis J. Ott. 1995. Time-segmentation parallel simulation of networks of queues with loss or communication blocking. ACM Trans. Model. Comput. Simul. 5, 4 (1995), 269–305.
[2] P. D. Barnes. 2011. A Benchmark Model for Parallel ns3. Technical Report. Lawrence Livermore National Laboratory, Livermore, CA.
[3] Ben Bartlett. 2018. A distributed simulation framework for quantum networks and channels. arXiv:1808.07047. Retrieved from https://arxiv.org/abs/1808.07047
[4] Charles H. Bennett, Gilles Brassard, Claude Crépeau, Richard Jozsa, Asher Peres, and William K. Wootters. 1993. Teleporting an unknown quantum state via dual classical and Einstein–Podolsky–Rosen channels. Phys. Rev. Lett. 70, 13 (Mar. 1993), 1895–1899.
[5] H.-J. Briegel, Wolfgang Dür, Juan I. Cirac, and Peter Zoller. 1998. Quantum repeaters: The role of imperfect local operations in quantum communication. Phys. Rev. Lett. 81, 26 (1998), 5932.
[6] Christopher D. Carothers, David Bauer, and Shawn Pearce. 2002. ROSS: A high-performance, low-memory, modular Time Warp system. J. Parallel Distrib. Comput. 62, 11 (2002), 1648–1669.
[7] CODES-org. 2021. CODES Discrete-event Simulation Framework. Retrieved from https://github.com/codes-org/codes
[8] Tim Coopmans, Robert Knegjens, Axel Dahlberg, David Maier, Loek Nijsten, Julio de Oliveira Filho, Martijn Papendrecht, Julian Rabbie, Filip Rozpędek, Matthew Skrzypczyk, Leon Wubben, Walter de Jong, Damian Podareanu, Ariana Torres-Knoop, David Elkouss, and Stephanie Wehner. 2021. NetSquid, a NETwork simulator for QUantum information using discrete events. Commun. Phys. 4, 1 (2021), 164.
[9] Rachel Courtland. 2016. China's 2,000-km quantum link is almost complete [News]. IEEE Spectr. 53, 11 (2016), 11–12.
[10] Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, Duncan Roweth, and Torsten Hoefler. 2020. An in-depth analysis of the Slingshot interconnect. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'20). 1–14.
[11] Wolfgang Dür and Hans J. Briegel. 2007. Entanglement purification and quantum error correction. Rep. Progr. Phys. 70, 8 (2007), 1381.
[12] J. F. Dynes, Adrian Wonfor, W. W.-S. Tam, A. W. Sharpe, R. Takahashi, M. Lucamarini, A. Plews, Z. L. Yuan, A. R. Dixon, J. Cho, et al. 2019. Cambridge quantum network. npj Quant. Inf. 5, 1 (2019), 1–8.
[13] A. Einstein, B. Podolsky, and N. Rosen. 1935. Can quantum-mechanical description of physical reality be considered complete? Phys. Rev. 47, 10 (May 1935), 777–780.
[14] Chip Elliott, Alexander Colvin, David Pearson, Oleksiy Pikalo, John Schlafer, and Henry Yeh. 2005. Current status of the DARPA quantum network. In Quantum Information and Computation III, Vol. 5815. International Society for Optics and Photonics, 138–149.
[15] Ahmed Elmokashfi, Amund Kvalbein, and Constantine Dovrolis. 2010. On the scalability of BGP: The role of topology growth. IEEE J. Select. Areas Commun. 28, 8 (2010), 1250–1261.
[16] Ugo Fano. 1957. Description of states in quantum mechanics by density matrix and operator techniques. Rev. Mod. Phys. 29, 1 (1957), 74.
[17] Richard M. Fujimoto. 1990. Parallel discrete event simulation. Commun. ACM 33, 10 (Oct. 1990), 30–53.
[18] Vlad Gheorghiu. 2018. Quantum++: A modern C++ quantum computing library. PLoS One 13, 12 (2018).
[19] Aric Hagberg, Pieter Swart, and Daniel S. Chult. 2008. Exploring Network Structure, Dynamics, and Function Using NetworkX. Technical Report. Los Alamos National Laboratory, Los Alamos, NM.
[20] INQNET. 2021. Fermilab Quantum Network (FQNET). Retrieved from http://inqnet.caltech.edu/fqnet/
[21] Jeongwan Jin, Jean-Philippe Bourgoin, Ramy Tannous, Sascha Agne, Christopher J. Pugh, Katanya B. Kuntz, Brendon L. Higgins, and Thomas Jennewein. 2019. Genuine time-bin-encoded quantum key distribution over a turbulent depolarizing free-space channel. Opt. Express 27, 26 (2019), 37214–37223.
[22] J. Robert Johansson, Paul D. Nation, and Franco Nori. 2012. QuTiP: An open-source Python framework for the dynamics of open quantum systems. Comput. Phys. Commun. 183, 8 (2012), 1760–1772.
[23] Cody Jones, Danny Kim, Matthew T. Rakher, Paul G. Kwiat, and Thaddeus D. Ladd. 2016. Design and analysis of communication protocols for quantum repeater networks. New J. Phys. 18, 8 (2016), 083015.
[24] K. G. Jones and S. R. Das. 2001. Time-parallel algorithms for simulation of multiple access protocols. In Proceedings of the 9th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'01). 49–58.
[25] Charles W. Kazer, João Sedoc, Kelvin K. W. Ng, Vincent Liu, and Lyle H. Ungar. 2018. Fast network simulation through approximation or: How blind men can describe elephants. In Proceedings of the 17th ACM Workshop on Hot Topics in Networks. 141–147.
[26] Emanuel Knill. 2005. Quantum computing with realistically noisy devices. Nature 434, 7029 (2005), 39–44.
[27] Argonne National Laboratory. 2021. Bebop. Retrieved from https://www.lcrc.anl.gov/systems/resources/bebop/
[28] Paul Leach, Michael Mealling, and Rich Salz. 2005. A universally unique identifier (UUID) URN namespace. RFC 4122 (Proposed Standard).
[29] Louise Lerner. 2018. Quantum Network to Test Unhackable Communications. Retrieved from https://www.anl.gov/article/quantum-network-to-test-unhackable-communications
[30] Jason Liu and David M. Nicol. 2002. Lookahead revisited in wireless network simulations. In Proceedings of the 16th Workshop on Parallel and Distributed Simulation. IEEE Computer Society, 79–79.
[31] Takaaki Matsuo, Clément Durand, and Rodney Van Meter. 2019. Quantum link bootstrapping using a RuleSet-based communication protocol. Phys. Rev. A 100, 5 (2019), 052320.
[32] Donald Michie. 1968. "Memo" functions and machine learning. Nature 218, 5136 (1968), 19–22.
[33] David M. Nicol. 1993. The cost of conservative synchronization in parallel discrete event simulations. J. ACM 40, 2 (1993), 304–333.
[34] David M. Nicol. 1996. Principles of conservative parallel simulation. In Proceedings of the 28th Conference on Winter Simulation. 128–135.
[35] Michael A. Nielsen and Isaac L. Chuang. 2000. Quantum Computation and Quantum Information. Cambridge University Press.
[36] Sergei Nikolaev, Eddy Banks, Peter D. Barnes, David R. Jefferson, and Steven Smith. 2015. Pushing the envelope in distributed ns-3 simulations: One billion nodes. In Proceedings of the Workshop on ns-3 (WNS3'15). Association for Computing Machinery, New York, NY, 67–74.
[37] Sergei Nikolaev, Peter D. Barnes, James M. Brase, Thomas W. Canales, David R. Jefferson, Steve Smith, Ron A. Soltz, and Peter J. Scheibel. 2013. Performance of distributed ns-3 network simulator. In Proceedings of the 6th International ICST Conference on Simulation Tools and Techniques (SimuTools'13). ICST, Brussels, 17–23.
[38] ns-3. 2021. ns-3: Network Simulator. Retrieved from https://www.nsnam.org/
[39] OMNeT++. 2021. OMNeT++: Discrete Event Simulator. Retrieved from https://omnetpp.org/
[40] Joshua Pelkey and George Riley. 2011. Distributed simulation with MPI in ns-3. In Proceedings of the 4th International ICST Conference on Simulation Tools and Techniques (SIMUTools'11). ICST, Brussels, 410–414.
[41] GitHub repository. 2020. SeQUeNCe: Simulator of QUantum Network Communication. Retrieved from https://github.com/sequence-toolbox/SeQUeNCe/tree/parallel
[42] George Riley. 1999. PDNS—Parallel/Distributed NS. Retrieved from https://www.cc.gatech.edu/computing/pads/pdns.html
[43] George F. Riley. 2003. The Georgia Tech network simulator. In Proceedings of the ACM SIGCOMM Workshop on Models, Methods and Tools for Reproducible Network Research (MoMeTools'03). Association for Computing Machinery, New York, NY, 5–12.
[44] Ramamurti Shankar. 2012. Principles of Quantum Mechanics. Springer Science & Business Media.
[45] Christoph Simon. 2017. Towards a global quantum network. Nat. Photon. 11, 11 (2017), 678–680.
[46] Mirko Stoffers, Daniel Schemmel, Oscar Soria Dustmann, and Klaus Wehrle. 2018. On automated memoization in the field of simulation parameter studies. ACM Trans. Model. Comput. Simul. 28, 4 (2018), 1–25.
[47] Peter J. M. van Laarhoven and Emile H. L. Aarts. 1987. Simulated annealing. In Simulated Annealing: Theory and Applications. Springer, 7–15.
[48] J. J. Dongarra and D. W. Walker. 1996. MPI: A standard Message Passing Interface. Supercomputer 12, 1 (1996), 56–68.
[49] Duncan J. Watts. 1999. Networks, dynamics, and the small-world phenomenon. Am. J. Sociol. 105, 2 (1999), 493–527.
[50] HPC Wiki. 2020. Scaling Tutorial. Retrieved from https://hpc-wiki.info/hpc/index.php?title=Scaling_tutorial&oldid=3609
[51] H. Wu, R. M. Fujimoto, and M. Ammar. 2003. Time-parallel trace-driven simulation of CSMA/CD. In Proceedings of the 17th Workshop on Parallel and Distributed Simulation (PADS'03). 105–112.
[52] Xiaoliang Wu, Joaquin Chung, Alexander Kolar, Eugene Wang, Tian Zhong, Rajkumar Kettimuthu, and Martin Suchara. 2019. Photon-level simulation of quantum key distribution with picosecond accuracy. In Single Photon Workshop.
[53] Xiaoliang Wu, Joaquin Chung, Alexander Kolar, Eugene Wang, Tian Zhong, Rajkumar Kettimuthu, and Martin Suchara. 2019. Simulations of photonic quantum networks for performance analysis and experiment design. In Proceedings of the IEEE/ACM Workshop on Photonics-Optics Technology Oriented Networking, Information and Computing Systems (PHOTONICS'19). IEEE, 28–35.
[54] Xiaoliang Wu, Alexander Kolar, Joaquin Chung, Dong Jin, Tian Zhong, Rajkumar Kettimuthu, and Martin Suchara. 2021. SeQUeNCe: A customizable discrete-event simulator of quantum networks. Quant. Sci. Technol. 6, 4 (2021), 045027.
[55] Xiaoliang Wu, Bo Zhang, and Dong Jin. 2020. Parallel simulation of quantum key distribution networks. In Proceedings of the ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. 187–196.
[56] Marek Żukowski, Anton Zeilinger, Michael A. Horne, and Artur K. Ekert. 1993. Event-ready-detectors Bell experiment via entanglement swapping. Phys. Rev. Lett. 71, 26 (1993), 4287.


• Published in

  ACM Transactions on Modeling and Computer Simulation, Volume 34, Issue 2 (April 2024), 178 pages. ISSN: 1049-3301. EISSN: 1558-1195. DOI: 10.1145/3613554. Editor: Wentong Cai.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 8 April 2024
            • Online AM: 31 January 2024
            • Accepted: 22 August 2023
            • Revised: 10 July 2023
            • Received: 2 November 2021