Abstract
Memory disaggregation is a promising architecture for modern datacenters that separates compute and memory resources into independent pools connected by ultra-fast networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing memory disaggregation solutions based on remote direct memory access (RDMA) suffer from high latency and additional overheads, including page faults and code refactoring. Emerging cache-coherent interconnects such as CXL offer opportunities to reconstruct high-performance memory disaggregation. However, existing CXL-based approaches are subject to physical distance limitations and cannot be deployed across racks.
In this article, we propose Rcmp, a novel low-latency and highly scalable memory disaggregation system based on RDMA and CXL. The significant feature is that Rcmp improves the performance of RDMA-based systems via CXL, and leverages RDMA to overcome CXL’s distance limitation. To address the challenges of the mismatch between RDMA and CXL in terms of granularity, communication, and performance, Rcmp (1) provides a global page-based memory space management and enables fine-grained data access, (2) designs an efficient communication mechanism to avoid communication blocking issues, (3) proposes a hot-page identification and swapping strategy to reduce RDMA communications, and (4) designs an RDMA-optimized RPC framework to accelerate RDMA transfers. We implement a prototype of Rcmp and evaluate its performance by using micro-benchmarks and running a key-value store with YCSB benchmarks. The results show that Rcmp can achieve 5.2× lower latency and 3.8× higher throughput than RDMA-based systems. We also demonstrate that Rcmp can scale well with the increasing number of nodes without compromising performance.
1 INTRODUCTION
Memory disaggregation is increasingly favored in datacenters (e.g., RSA [48], WSC [5], and dReDBox [27]), cloud servers (e.g., Pond [30] and Amazon Aurora [53]), in-memory databases (e.g., PolarDB [11] and LegoBase [65]), and High-Performance Computing (HPC) systems [37, 55], among others, for higher resource utilization, flexible hardware scalability, and lower costs. This architecture (Figure 1) decouples compute and memory resources from traditional monolithic servers to form independent resource pools. The compute pool contains rich CPU resources but minimal memory resources, whereas the memory pool contains large amounts of memory but near-zero computation power. Memory disaggregation can provide a global shared memory pool and allow different resources to scale independently, which offers opportunities to build cost-effective and elastic datacenters.
Remote Direct Memory Access (RDMA) networks are generally adopted in memory disaggregation systems [3, 17, 29, 41, 43, 51, 63, 66] to connect the compute and memory pools (Figure 2(a)). However, existing RDMA-based memory disaggregation solutions have significant shortcomings. One is high latency. Current RDMA can support a single-digit-microsecond-level latency (1.5\(\sim\)3 \(\mu\)s) [17, 64] but is still several orders of magnitude away from DRAM memory latency (80\(\sim\)140 ns). RDMA communication becomes the performance bottleneck for accessing the memory pool. Another is additional overhead. Since memory semantics are not natively supported, RDMA incurs intrusive code modifications and interruption overheads on the original system. Specifically, current RDMA-based memory disaggregation includes page-based and object-based approaches, differentiated by data exchange granularity. However, page-based approaches involve additional overhead of page-fault handling and read/write amplifications [10, 41], whereas object-based approaches require custom interface changes and source-level modifications that sacrifice transparency [17, 56].
CXL (Compute Express Link) is a PCIe-based cache-coherent interconnect protocol, which enables direct and coherent access to remote memory devices without CPU intervention [45, 52]. CXL natively supports memory semantics and has similar multi-socket NUMA access latency (about 90\(\sim\)150 ns [21, 45]), which exhibits great potential to overcome the drawbacks of RDMA and realize low-cost, high-performance memory disaggregation. Recently, CXL-based memory disaggregation technology has received significant attention in both academia and industry [10, 21, 30, 56].
Reconstructing a CXL-based memory disaggregation architecture (see Figure 2(b)) to replace RDMA is a promising research direction, but the immaturity of CXL technology and the lack of industrial-grade products make it difficult in practice. First, there are physical limitations. Existing CXL-based memory disaggregation faces restrictions on long-distance deployment, typically limited to the rack level in a datacenter, even for the latest CXL 3.0 specification [14, 45, 56]. This physical distance limitation makes it impossible to deploy memory pools across racks, sacrificing scalability. Second, the cost is high. The cost of replacing all RDMA hardware in a datacenter with CXL hardware is prohibitive, especially for large-scale clusters. Furthermore, due to the lack of commercially available mass-produced CXL hardware and supporting infrastructure, current research on CXL memory relies on custom FPGA prototypes [21, 49] or emulation using a CPU-less NUMA node [30, 32].
In this article, we probe a hybrid memory disaggregation architecture combining CXL and RDMA, which retains and leverages RDMA to enable CXL to break the distance constraint. In such an architecture (see Figure 2(c)), a small CXL-based memory pool is built in each rack, and RDMA is used to connect the racks, forming a larger memory pool. This approach uses CXL to improve the performance of RDMA-based memory disaggregation and circumvents the physical distance limitation of CXL. However, it faces substantial implementation challenges, including the granularity, communication, and performance mismatches between RDMA and CXL (Section 3.3). In particular, due to the latency gap between RDMA and CXL, RDMA communications between racks become the major performance bottleneck. Some research proposes an RDMA-driven acceleration framework [61] using a cache-coherent accelerator to connect to CXL-like cache-coherence memory, but this approach requires customized hardware.
To address these issues, we propose Rcmp, a novel memory disaggregation system based on RDMA and CXL. Rcmp provides global page-based memory management with fine-grained data access, an efficient intra- and inter-rack communication mechanism, a hot-page identification and swapping strategy, and an RDMA-optimized RPC framework.
We implement Rcmp as a user-level architecture with 6,483 lines of C++ code. Rcmp provides simple APIs for memory pool services, which are easy for applications to use. In addition, Rcmp provides simple high-capacity in-memory file system interfaces by integrating with FUSE [1]. We evaluate Rcmp with micro-benchmarks and run a key-value store (hashtable) under YCSB workloads. The evaluation results indicate that Rcmp achieves high and stable performance in all workloads. Specifically, Rcmp reduces latency by 3 to 8× under micro-benchmarks and improves throughput by 2 to 4× under YCSB workloads compared to RDMA-based memory disaggregation systems. In addition, Rcmp scales well with an increasing number of nodes or racks. The open source code of Rcmp and the experimental datasets in this article are available at https://github.com/PDS-Lab/Rcmp.
In summary, we make the following contributions:
– We analyze the shortcomings of current memory disaggregation systems and show that RDMA-based systems suffer from high latency, additional overhead, and sub-optimal communication, whereas CXL-based systems suffer from physical distance limitations and a lack of available products.
– We design and implement Rcmp, a novel memory pool system, which achieves high performance and scalability by combining the advantages of RDMA and CXL. To the best of our knowledge, this is the first work to use both RDMA and CXL techniques to construct a memory disaggregation architecture.
– We propose several optimizations to overcome the performance challenges encountered when combining RDMA and CXL, including global memory management, an efficient communication mechanism, a hot-page swapping strategy, and a high-performance RPC framework.
– We conduct a comprehensive evaluation of Rcmp’s performance and compare it with state-of-the-art memory disaggregation systems. The results demonstrate that Rcmp significantly outperforms these systems in terms of performance and scalability.
The rest of the article is organized as follows. Sections 2 and 3 explain the background and motivations. Sections 4 and 5 present the design ideas and system architecture details of Rcmp. Section 6 presents comprehensive evaluations. Section 7 summarizes the related work. Section 8 concludes the article.
2 BACKGROUND
2.1 Memory Disaggregation
Emerging applications such as big data [31, 39], deep learning [4, 28], HPC [37, 55], and large language models (e.g., ChatGPT [7] and GPT-3 [19]) are increasingly prevalent in modern datacenters, which leads to a huge demand for memory [2, 3, 44, 56]. However, datacenters today mostly use monolithic server architectures in which CPU and memory are tightly coupled, and these architectures face significant challenges as memory requirements grow:
– Low memory utilization: In monolithic servers, since the memory occupied by a single instance cannot be allocated across server boundaries, it is difficult to fully utilize memory resources. Table 1 shows that the memory utilization in typical datacenters, cloud platforms, and HPC systems is generally below 50%. In addition, real-world applications often request a large amount of memory that is not fully used in practice. For example, in Microsoft Azure’s and Google’s clusters [30, 33, 56], about 30% to 61% of allocated memory remains idle for extended periods of time.
– Lack of elasticity: It is difficult to scale the memory or CPU resources down or up after they have been installed in a monolithic server. As a result, server configurations must be planned in advance, and dynamic adjustments often waste existing server hardware [44, 65]. In addition, it is difficult to flexibly scale the memory capacity of a single server to the required size due to the fixed CPU-to-memory ratio [44, 56].
– High costs: Large amounts of unused memory lead to high operating costs and wasted energy [11, 65]. In addition, device failures are frequent in modern datacenters, occurring almost every day [13, 40, 58]. With the monolithic architecture, when any one hardware component within a server fails, the whole server is often unusable. Such coarse-grained fault management leads to high costs [44].
Table 1. Memory Utilization in Typical Datacenters, Cloud Platforms, and HPC Systems

| | Examples | Memory Utilization |
|---|---|---|
| Datacenters | Google’s production cluster [40, 44] | 20%\(\sim\)40% |
| | Alibaba’s co-located datacenter [13] | 5%\(\sim\)60% |
| Cloud Platforms | Snowflake [33, 54] | \(\sim\)19% |
| | Microsoft Azure [30, 56] | <50% |
| HPC Systems | The clusters at Lawrence Livermore National Laboratory [37, 55] | <15% |
| | Cori at the National Energy Research Scientific Computing Center [35] | 9%\(\sim\)15% |
In response, memory disaggregation is proposed to solve these problems and has received significant attention in both academia and industry [3, 17, 21, 43, 51, 56, 63, 66]. Memory disaggregation separates the memory resources from the compute resources in a datacenter, forming independent resource pools connected with fast networks. This allows different resources to be managed and expanded independently, enabling higher memory utilization, elastic scaling, and lower costs.
As shown in Figure 1, in such an architecture, Compute Nodes (CNs) in the compute pool contain a large number of CPU cores and small local DRAM, and Memory Nodes (MNs) in the memory pool host high-volume memory with near-zero computation power. Microsecond-latency networks (e.g., RDMA) or cache-coherent interconnect protocols (e.g., CXL) generally serve as the physical transmission path from CNs to MNs.
2.2 RDMA Technologies
RDMA is a family of protocols that allow one machine to directly access data in remote machines across the network. RDMA protocols are typically implemented directly in RDMA NICs (RNICs) and offer high bandwidth (>10 GB/s) and microsecond-level latency (\(\sim\)2 \(\mu\)s); they are widely supported by InfiniBand, RoCE, and OmniPath, among others [20, 47, 62]. RDMA provides data transfer services based on two types of operational primitives: one-sided verbs, including RDMA READ, WRITE, and ATOMIC operations, which bypass the remote CPU, and two-sided verbs (SEND/RECV), which require the remote CPU to handle messages.
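To make the verb interface concrete, the following is a minimal sketch of posting a one-sided RDMA WRITE with the standard libibverbs API; the queue pair, registered memory region, and remote address/rkey are assumed to have been exchanged during connection setup, and the snippet is illustrative rather than taken from any particular system.

```cpp
#include <infiniband/verbs.h>
#include <cstdint>

// Minimal sketch: post a one-sided RDMA WRITE that copies a local buffer into
// remote memory without involving the remote CPU. The queue pair (qp), the
// registered memory region (mr), and the remote address/rkey are assumed to
// have been exchanged out of band during connection establishment.
int post_rdma_write(ibv_qp *qp, ibv_mr *mr, void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t rkey) {
  ibv_sge sge{};
  sge.addr   = reinterpret_cast<uint64_t>(local_buf);
  sge.length = len;
  sge.lkey   = mr->lkey;

  ibv_send_wr wr{};
  wr.opcode              = IBV_WR_RDMA_WRITE;  // one-sided verb
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.send_flags          = IBV_SEND_SIGNALED;  // request a completion event
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey        = rkey;

  ibv_send_wr *bad_wr = nullptr;
  return ibv_post_send(qp, &wr, &bad_wr);      // returns 0 on success
}
```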
2.3 CXL Protocols
CXL is an open industry standard based on PCIe for high-speed communication between processors, accelerators, and memory in a cache-coherent way. It defines three sub-protocols: CXL.io for device discovery and configuration, CXL.cache for devices to coherently cache host memory, and CXL.mem for the host to access device memory with native load/store (ld/st) instructions.
3 EXISTING MEMORY DISAGGREGATION ARCHITECTURES AND LIMITATIONS
3.1 RDMA-Based Approaches
According to the way data is managed, RDMA-based memory disaggregation can be roughly divided into two approaches: page based and object based. The page-based approach (e.g., Infiniswap [22], LegoOS [44], and Fastswap [3]) uses virtual memory mechanisms to cache remote pages in the memory pool in a local DRAM cache. It achieves remote memory pool access by triggering page faults and swapping local memory pages and remote pages. Its advantages are simplicity, ease of use, and transparency to applications. The object-based approach (e.g., FaRM [17], FaRMv2 [43], AIFM [41], and Gengar [18]) achieves fine-grained memory management with custom object-based semantics, such as key-value interfaces. One-sided verbs enable the CNs to directly access the MNs without involving remote CPUs, which is more suitable for memory disaggregation due to the near-zero computation power of MNs. However, if only one-sided RDMA primitives are used for communication in memory disaggregation systems, a single data query may involve multiple read and write operations, resulting in high latency [25, 26]. Therefore, many studies propose high-performance RPC frameworks based on RDMA (e.g., FaSST [26] and FaRM [17]) or adopt general RPC libraries without RDMA primitives [24].
In general, the shortcomings of the RDMA-based approaches can be summarized as follows.
Problem 1: High Latency. There is a large latency gap between RDMA communication and memory access, more than 20× (Table 2). This makes RDMA networks a major performance bottleneck for RDMA-based memory disaggregation systems.
Problem 2: High Overhead. The page-based approach suffers from performance degradation due to page-fault overheads [21, 41, 56]. Taking Fastswap [3] as an example (Table 2), it has high remote access latency (the experiment details are presented in Section 6.2). In addition, for fine-grained accesses, read/write amplification occurs because data is always transferred at page granularity. The object-based approach can avoid page-fault overheads, but it requires intrusive code modifications and varies depending on the semantics of the application, leading to higher complexity.
Problem 3: Sub-Optimal Communication. The existing RDMA communication methods are not optimal and do not take full advantage of RDMA bandwidth. We test throughput with varying data sizes using mainstream communication frameworks, including (1) only-RPC (using eRPC [24]) and (2) a one-sided RDMA and RPC hybrid mode [17, 26], which first obtains the remote data address via RPC and then accesses the data via one-sided RDMA verbs. As shown in Figure 3, RPC communication is suitable for small data transmission, whereas the hybrid mode has higher throughput for large data. 512 bytes is the cut-off point, which inspires us to design a dynamic strategy. The RDMA-based solutions are summarized in Table 3.
Table 3. Comparison of Memory Disaggregation Solutions

| | RDMA based, page based (e.g., Fastswap [3]) | RDMA based, object based (e.g., FaRM [17]) | CXL based (DirectCXL [21]) | CXL based (CXL-over-Ethernet [56]) | Hybrid (Rcmp) |
|---|---|---|---|---|---|
| Physical Link | RDMA | RDMA | CXL | CXL + Ethernet | CXL + RDMA |
| Latency | High: \(\sim\)13 \(\mu\)s | Medium: \(\sim\)8 \(\mu\)s | Low: 700 ns\(\sim\)1 \(\mu\)s | Medium: \(\sim\)6 \(\mu\)s | Low: \(\sim\)3 \(\mu\)s |
| Software Overhead | High | Medium | Low | Low | Low |
| Network Efficiency | Low | Medium | High | Medium | High |
| Scalability | High | Medium | Medium: within rack level | Medium | High |
3.2 CXL-Based Approaches
Many studies have proposed memory disaggregation architectures using CXL [10, 21, 30, 56] to overcome the shortcomings of RDMA-based approaches and achieve lower access latency. CXL-based memory disaggregation can provide a shared cache-coherent memory pool and support cache-line-granularity access without invasive changes. In summary, based on the characteristics of CXL, the CXL-based approaches have the following advantages over RDMA-based approaches:
– Less software overhead: CXL maintains a unified, coherent memory space between the CPU (host processor) and any memory on the attached CXL device. CXL-based approaches reduce software stack complexity without page-fault overheads [21, 30].
– Fine-grained access: CXL allows CPUs, GPUs, and other processors to access the memory pool with native load/store (ld/st) instructions at cache-line granularity, avoiding the read/write amplification of page-based approaches.
– Lower latency: CXL provides near-memory latency, and CXL-based approaches alleviate network bottlenecks and memory over-provisioning issues [21, 46].
– Elasticity: CXL-based approaches promise excellent scalability, as more PCIe devices can be attached across switches, unlike the DIMMs (Dual Inline Memory Modules) used for DRAM.
However, the CXL-based approaches also suffer from the following shortcomings.
Problem 1: Physical Distance Limitation. Due to the limited length of the PCIe bus, the CXL-based approach is limited to the rack level [45, 56] (existing CXL products support a maximum distance of 2 m [14]) and cannot be used directly in large-scale datacenters. PCIe flexible extension cables can be used, but they still have a maximum length limitation (\(\le\)15 inches) [42]. An ongoing research effort converts the PCIe 5.0 electrical signal into an optical signal [16], but it is still in the testing phase and requires specialized hardware. This approach also has potential overheads, including signal loss, power consumption, and deployment costs. In addition, at a 3- to 4-m distance, the photon travel time alone exceeds the first-word access latency of modern memory. Therefore, if CXL-based memory disaggregation extends beyond rack boundaries, the added latency becomes noticeable for latency-sensitive applications [34].
Problem 2: High Cost. Worse, CXL products are immature, and most research is still in the emulation phase, which includes FPGA-based prototypes and simulation using NUMA nodes. Since early CXL products using FPGAs are not yet optimized for latency [38] and report higher latency (more than 250 ns) [49], NUMA-based simulation remains the more popular approach for CXL proofs of concept [30, 32, 55, 60]. In addition, the high price of current CXL products makes it impractical to replace all RDMA hardware in a datacenter with CXL hardware.
3.3 Hybrid Approaches and Challenges
A possible solution is to use the network to overcome the rack distance limitation of CXL. The state-of-the-art case is CXL-over-Ethernet [56]. It deploys the compute and memory pools in separate racks and uses CXL in the compute pool to provide a global coherent memory abstraction, so the CPU can access the disaggregated memory directly via load/store instructions. However, CXL-over-Ethernet still accesses the memory pool over the network and therefore does not fully exploit the low-latency benefits of CXL.
As many researchers believe, CXL and RDMA are complementary technologies, and combining the two is a promising research direction [14, 34]. In this article, we explore a new hybrid architecture combining CXL-based and RDMA-based approaches (i.e., building small memory pools via CXL within each rack and connecting these small memory pools via RDMA). This symmetrical architecture takes full advantage of CXL in each small memory pool and improves scalability with RDMA. However, this hybrid architecture faces the following challenges.
Challenge 1: Granularity Mismatch. CXL-based approaches support cache coherence with the cache line as the access granularity. The access granularity of RDMA-based approaches is the page or object, much larger than the cache line. The hybrid architecture therefore requires a redesigned memory management and access mechanism.
Challenge 2: Communication Mismatch. RDMA communication relies on the RNIC and message queues, whereas CXL is based on high-speed links and cache coherence protocols. A unified and efficient abstraction is needed for inter- and intra-rack communications.
Challenge 3: Performance Mismatch. The latency of RDMA is much higher than that of CXL (over 10×). This performance mismatch results in non-uniform access patterns (similar to the NUMA architecture); that is, accessing memory in the local rack (local-rack access for short) is much faster than accessing a remote rack (remote-rack access).
4 DESIGN IDEAS
To address these challenges, we present Rcmp, a novel hybrid memory pool system with RDMA and CXL. Rcmp achieves better performance and scalability, as shown in Table 3. The main design tradeoffs and ideas are described as follows.
4.1 Global Memory Management
Rcmp achieves global memory management via a page-based approach for two reasons. First, the page management method is easy to adopt and transparent to all user applications. Second, the page-based approach better fits the byte access feature of CXL than the object-based approach, which incurs additional indexing overhead. Each page is divided into many slabs for fine-grained management. In addition, Rcmp provides global address management for the memory pool and initially uses a centralized Metadata Server (MS) to manage the assignment and mapping of memory addresses (Section 5.1).
Rcmp accesses and moves data at cache-line granularity, decoupling from memory page size. Since CXL supports memory semantics, Rcmp can naturally enable access at cache-line granularity within the rack. For remote-rack access, Rcmp avoids performance degradation by using direct access mode (Direct-I/O) instead of page swapping triggered by page faults (Section 5.1).
4.2 Efficient Communication Mechanism
As shown in Figure 4, the hybrid architecture has three possible methods for remote-rack communication. In method (a), each CN accesses the memory pool in the remote rack through its own RNIC. This approach has obvious drawbacks: first, the high cost of excessive RNIC devices; second, each CN has both a CXL link and an RDMA interface, resulting in high consistency-maintenance overheads; and third, high contention for the limited RNIC memory causes frequent cache invalidation and higher communication latency [17, 63]. In method (b), one Daemon server (equipped with an RNIC) is used in each rack to manage access requests to remote racks. The Daemon server reduces cost and consistency overhead, but a single Daemon (with one RNIC) limits RDMA bandwidth. In method (c), CNs are grouped by hashing, with each group corresponding to a Daemon, to avoid a single Daemon becoming a performance bottleneck, as illustrated in the sketch below. All Daemons are built on the same CXL memory, and consistency is easily guaranteed. Rcmp supports the latter two methods, and method (b) is adopted by default for small-scale deployments.
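As a minimal illustration of method (c), CNs could be mapped to Daemons with a simple hash; the function and modulo scheme below are assumptions for illustration, not Rcmp's actual grouping code.

```cpp
#include <cstdint>
#include <functional>

// Illustrative sketch of method (c): each CN is assigned to one of several
// Daemons by hashing its id, so remote-rack requests are spread across
// Daemons instead of funneling through a single one. The modulo mapping is
// an assumed scheme, not Rcmp's actual implementation.
uint32_t daemon_for_cn(uint32_t cn_id, uint32_t num_daemons) {
  return static_cast<uint32_t>(std::hash<uint32_t>{}(cn_id) % num_daemons);
}
```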
As with the latest memory disaggregation solutions [17, 43, 61], Rcmp uses the lock-free ring buffer to achieve efficient intra- and inter-rack communications.
Intra-Rack Communication. After the Daemon is introduced, CNs need to first communicate with the Daemon to determine where the data is stored. The simple solution is to maintain a single ring buffer in CXL memory to manage the communication between a CN and the Daemon, but this may cause message blocking in the hybrid architecture. As shown in Figure 5, CNs add access requests to the ring buffer and wait for the Daemon to poll them. In this example, CN1 first sends Msg1, then CN2 sends Msg2. A message completes only after its data is fully written into the buffer, and only then can the next message be processed. If Msg1 is a remote-rack access request and Msg2 is a local-rack access request, then due to the performance gap between RDMA and CXL, Msg2 may be filled first. Since each message is of variable length, the Daemon cannot obtain Msg2’s head pointer to skip Msg1 and process Msg2 first. Msg2 must wait for Msg1 to complete, causing message blocking. To avoid communication blocking, Rcmp decouples local- and remote-rack accesses and uses different ring buffer structures, where a double-layer ring buffer is adopted for remote-rack access (Section 5.2).
Inter-Rack Communication. Daemon servers in different racks communicate with each other through ring buffers with one-sided RDMA writes/reads.
4.3 Remote-Rack Access Optimization
Due to the non-uniform access characteristics, remote-rack access will be the main performance bottleneck of the hybrid architecture. In addition, because of the direct I/O model, one RDMA communication is required for remote-rack data accessed with any granularity, incurring high latency, especially for frequent small data accesses. Rcmp optimizes this problem in two ways: reducing and accelerating remote-rack accesses.
Reducing Remote-Rack Accesses. Skewed accesses and hot spots exist widely in real-world datacenters [12, 59]. Accordingly, Rcmp proposes a page-based hotness identification and user-level hot-page swapping scheme to migrate frequently accessed pages (hot pages) to the local rack for less remote-rack accesses (Section 5.3).
To further leverage temporal and spatial locality, Rcmp caches fine-grained accesses of the remote rack in CXL memory and batches write requests to the remote rack (Section 5.4).
Accelerating RDMA Communications. Rcmp proposes a high-performance RDMA RPC (RRPC) framework with a hybrid transmission mode and other optimizations (e.g., doorbell batching) to take full advantage of the high bandwidth of RDMA networks (Section 5.5).
5 RCMP SYSTEM
In this section, we describe the Rcmp system and optimization strategies in detail.
5.1 System Overview
The Rcmp system overview is shown in Figure 6. Rcmp manages clusters in units of racks. All CNs and MNs in a rack are interconnected with CXL links, which is equivalent to a small CXL memory pool. Different racks are connected via RDMA to form a larger memory pool. Rcmp can achieve better performance than RDMA-based systems and higher scalability than CXL-based systems. The MS is responsible for global address assignment and metadata maintenance. In a rack, all CNs share a unified CXL memory. The CN Lib provides the APIs of the memory pool. The Daemon server is the central control node of the rack. It is responsible for handling access requests, including CXL requests (CXL Proxy) and RDMA requests (Message Manager), swapping hot pages (Swap Manager), managing the slab allocator, and maintaining the CXL memory space (Resource Manager). The Daemon runs on a server within each rack, just like the CNs. In addition, Rcmp is a user-level architecture, avoiding context-switching overhead between kernel and user space.
Global Memory Management. Rcmp provides global memory address management, as shown in Figure 7(a). The MS handles memory allocation at a coarse (page) granularity. The global address GAddr (page_id, page_offset) consists of the page id assigned by the MS and the page offset in CXL memory. Rcmp uses two hash tables to store address mappings. Specifically, the page directory (in the MS) records the mapping of page id to rack, and the page table (in the Daemon) records the mapping of page id to CXL memory. In addition, to support fine-grained data access, Rcmp uses a slab allocator (an object-caching kernel memory allocator) [8] to handle fine-grained memory allocations. A page is a collection of slabs whose sizes are powers of 2.
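The two-level translation described above can be pictured with the following sketch; the struct layout, map types, and function names are assumptions for illustration and simplify away the slab-level bookkeeping.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Illustrative sketch of the global addressing scheme (names and types assumed).
// GAddr = (page_id, page_offset). The MS's page directory maps a page id to the
// rack that owns it; each rack's Daemon keeps a page table mapping page ids to
// offsets in that rack's CXL memory.
struct GAddr {
  uint64_t page_id;
  uint64_t page_offset;   // offset inside the (2 MB) page
};

// Page directory kept by the Metadata Server: page_id -> rack_id.
std::unordered_map<uint64_t, uint32_t> page_directory;

// Page table kept by a rack's Daemon: page_id -> base offset in CXL memory.
std::unordered_map<uint64_t, uint64_t> page_table;

// Step 1 (MS): which rack holds this page?
std::optional<uint32_t> lookup_rack(const GAddr &ga) {
  auto it = page_directory.find(ga.page_id);
  if (it == page_directory.end()) return std::nullopt;
  return it->second;
}

// Step 2 (owning rack's Daemon): translate to an address in CXL memory.
std::optional<uint64_t> lookup_cxl_addr(const GAddr &ga) {
  auto it = page_table.find(ga.page_id);
  if (it == page_table.end()) return std::nullopt;
  return it->second + ga.page_offset;
}
```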
The memory space includes CXL memory and local DRAM of CNs and Daemon, as shown in Figure 7(b). In a rack, each CN has small local DRAM for caching the metadata of local-rack pages including the page table and the hotness information. The local DRAM of Daemon (1) stores the local-rack page table and page hotness metadata of remote accesses, and (2) caches the page directory and the remote-rack page table. CXL memory consists of two parts: a large shared coherent memory space and an owner memory space registered by each CN. The owner memory is used as a CXL cache of remote racks for write buffering and page caching.
Interface. As shown in Table 4, Rcmp provides the usual memory pool interfaces, including Open/Close, Alloc/Free, Read/Write, and Lock/UnLock operations.
Table 4. Rcmp APIs

| API | Description |
|---|---|
| PoolContext *Open (ClientOptions options) | Open the Rcmp memory pool |
| void Close (PoolContext *pool_ctx) | Close Rcmp |
| GAddr Alloc (size_t size) | Allocate memory from the memory pool |
| Status Free (GAddr gaddr, size_t size) | Free the memory |
| Status Read (GAddr gaddr, size_t size, void *buf) | Read data from gaddr and write it to buf |
| Status Write (GAddr gaddr, size_t size, void *buf) | Write data from buf to gaddr |
| Status Lock (GAddr gaddr) | Add a write/read lock on the address gaddr |
| Status UnLock (GAddr gaddr) | Unlock the address gaddr |
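A minimal usage sketch of the APIs in Table 4 is shown below; the header name, the contents of ClientOptions, and the omitted error handling are assumptions, and the call sequence is only illustrative.

```cpp
// #include "rcmp.hpp"   // Rcmp client header; exact name is assumed
#include <cstring>

// Minimal usage sketch of the Table 4 APIs: open the pool, allocate a global
// address, write and read it back under a lock, then release everything.
// Return values (Status) are ignored here for brevity.
int main() {
  ClientOptions options;                 // connection settings (fields assumed)
  PoolContext *pool = Open(options);     // attach to the Rcmp memory pool

  GAddr gaddr = Alloc(1024);             // allocate 1 KB from the global pool

  char msg[] = "hello rcmp";
  Lock(gaddr);                           // take the write/read lock on gaddr
  Write(gaddr, sizeof(msg), msg);        // copy the local buffer into the pool
  UnLock(gaddr);

  char out[sizeof(msg)];
  Read(gaddr, sizeof(out), out);         // read the data back locally

  Free(gaddr, 1024);                     // return the memory to the pool
  Close(pool);                           // detach from the pool
  return 0;
}
```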
Workflow. The access workflow of Rcmp is shown in Figure 9. When an application in a CN accesses the memory pool using the Read/Write APIs, the CN Lib first checks the page metadata cached in its local DRAM; on a miss, it communicates with the Daemon to determine whether the access targets the local or a remote rack. Local-rack accesses are served directly from CXL memory, whereas remote-rack accesses are forwarded to the Daemon, which performs them over RDMA in direct-I/O mode.
5.2 Intra-Rack Communication
A CN needs to communicate with the Daemon to determine whether an access targets the local or a remote rack, but there is a significant difference in access latency between the two cases. To prevent communication blocking, Rcmp uses two ring buffer structures for the different access scenarios, as shown in Figure 10.
For local-rack accesses, a normal ring buffer is used for communication. The green buffer in the figure is an example. In this case, since all accesses have ultra-low latency (via CXL), blocking does not occur even in high-conflict situations. In addition, the ring buffers (and the RDMA QPs) are shared across threads within a CN, based on Flock’s method [36], for high concurrency.
For remote-rack accesses, a double-layer ring buffer is used for efficient and concurrent communication, as shown in Figure 10. The first ring buffer (polling buffer) stores the message metadata (e.g., type, size) and a pointer ptr that points to the second buffer (data buffer), which stores the message data. The entries in the polling buffer are of fixed length, whereas the messages in the data buffer are of variable length. When a message in the data buffer is complete, a request is added to the polling buffer. The Daemon polls the polling buffer and processes the message that the current ptr points to. For example, in Figure 10, the later Msg2 in the data buffer is filled first, so its request is added to the polling buffer first. Therefore, Msg2 will be processed first without blocking. Additionally, different messages can be processed concurrently. In the implementation, we use a lock-free KFIFO queue [50] as the polling buffer, and the data buffer is a normal ring buffer.
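The sketch below captures the essence of the double-layer ring buffer: a fixed-size descriptor is published to the polling buffer only after its variable-length payload has been fully written into the data buffer, so a fast local-rack message is never stuck behind a slow remote-rack one. Names, sizes, and the single-producer/single-consumer simplification (no wrap-around or per-slot valid flags) are assumptions for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Simplified single-producer/single-consumer sketch of the double-layer ring
// buffer (sizes and field names assumed; wrap-around handling omitted).
struct MsgDesc {
  uint32_t type;     // request type
  uint32_t size;     // payload length
  uint64_t offset;   // payload location in the data buffer
};

constexpr size_t kPollSlots = 1024;
constexpr size_t kDataBytes = 1 << 20;

struct DoubleRing {
  MsgDesc polling[kPollSlots];                 // fixed-size descriptors
  std::atomic<uint64_t> poll_head{0}, poll_tail{0};
  char data[kDataBytes];                       // variable-length payloads
  uint64_t data_tail = 0;

  // Producer (CN): write the payload completely, then publish the descriptor.
  void post(uint32_t type, const void *payload, uint32_t size) {
    uint64_t off = data_tail;
    std::memcpy(data + off, payload, size);    // payload filled first
    data_tail += size;
    uint64_t slot = poll_tail.load() % kPollSlots;
    polling[slot] = MsgDesc{type, size, off};
    poll_tail.fetch_add(1);                    // descriptor becomes visible only now
  }

  // Consumer (Daemon): poll descriptors in completion order, not post order.
  bool poll(MsgDesc &out) {
    uint64_t head = poll_head.load();
    if (head == poll_tail.load()) return false;  // nothing ready yet
    out = polling[head % kPollSlots];
    poll_head.store(head + 1);
    return true;
  }
};
```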
5.3 Hot-Page Identification and Swapping
To reduce remote-rack accesses, Rcmp designs a hot-page identification and swapping policy. It aims to identify frequently accessed hot pages in remote racks and migrate them to the local rack.
Hot-Page Identification. An expiring policy is proposed to identify hot pages. Specifically, the hotness of a page is measured by its access frequency and the time since its last access. We maintain three variables, \(Cur_r\), \(Cur_w\), and \(lastTime\), to denote the number of read accesses, the number of write accesses, and the time of the most recent access of a page. When accessing the page and counting the hotness, we first compute \(\Delta t\), which is equal to the present time minus \(lastTime\). If \(\Delta t\) is greater than the valid lifetime threshold \(T_l\), the page is defined as "expired," and \(Cur_r\) and \(Cur_w\) are cleared to zero. The page hotness is equal to \(\alpha \times (Cur_r + Cur_w) + 1\), where \(\alpha\) is the exponential decay factor, \(\alpha = e^{- \lambda \Delta t}\), and \(\lambda\) is a "decay" constant. Then, \(Cur_r\) or \(Cur_w\) is incremented by 1 according to the access type. If the hotness is greater than the threshold \(H_p\), the page is "hot." In addition, if \(Cur_r/Cur_w\) of a hot page is greater than the threshold \(R_{rw}\), the page is "read hot." All thresholds are configurable and have default values. In a rack, all CNs (local DRAM) maintain the hotness values (or hotness metadata) of local-rack pages, and the hotness metadata of remote-rack pages is stored in the Daemon. The memory overhead is small because each page maintains only three variables, about 32 bytes. The time complexity of updating the hotness metadata of a page is also low, only O(1).
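The expiring policy above translates almost directly into code; the sketch below follows the formulas in the text and plugs in the default thresholds from Section 6.1, while the struct and function names are illustrative.

```cpp
#include <cmath>
#include <cstdint>

// Per-page hotness metadata: three variables per page, as described above.
struct PageHotness {
  uint64_t cur_r = 0;      // number of read accesses
  uint64_t cur_w = 0;      // number of write accesses
  double   last_time = 0;  // time of the most recent access (seconds)
};

// Default thresholds from Section 6.1: T_l = 100 s, H_p = 4, lambda = 0.04, R_rw = 0.9.
constexpr double kTl = 100.0, kHp = 4.0, kLambda = 0.04, kRrw = 0.9;

// Update the metadata on one access and report whether the page is "hot".
// hotness = alpha * (Cur_r + Cur_w) + 1 with alpha = exp(-lambda * dt);
// counters are reset if the page has expired (dt > T_l).
bool update_hotness(PageHotness &p, double now, bool is_write, bool *read_hot) {
  double dt = now - p.last_time;
  if (dt > kTl) { p.cur_r = 0; p.cur_w = 0; }          // expired: clear counters
  double alpha   = std::exp(-kLambda * dt);             // exponential decay factor
  double hotness = alpha * double(p.cur_r + p.cur_w) + 1.0;
  if (is_write) p.cur_w += 1; else p.cur_r += 1;        // count this access
  p.last_time = now;

  bool hot = hotness > kHp;
  if (read_hot)                                          // "read hot": reads dominate
    *read_hot = hot && p.cur_w > 0 && double(p.cur_r) / double(p.cur_w) > kRrw;
  return hot;
}
```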
Hot-Page Swapping and Caching. Rcmp proposes a user-level swap mechanism, unlike the swap mechanism of page-based systems (e.g., LegoOS, Infiniswap), which relies on the host's kernel swap daemon (kswapd) and thus incurs expensive page faults and kernel context switches.
5.4 CXL Cache and Synchronization Mechanism
Rcmp proposes a simple and efficient caching and synchronization mechanism based on the owner memory space in CXL memory, which serves as a CXL cache of remote racks for write buffering and page caching (Section 5.1), together with a lock-based synchronization mechanism to keep data consistent across racks.
CXL Write Buffer. A write buffer (64 MB by default) is allocated in the owner memory space of CXL memory to batch fine-grained write requests destined for remote racks, so that many small writes can be merged and sent with fewer RDMA communications.
CXL Page Cache. Similarly, read-hot pages of remote racks are cached in the owner memory space of CXL memory (an LRU cache that holds 50 pages by default), so that subsequent fine-grained reads can be served from the local rack.
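The following is a rough sketch of how fine-grained remote-rack writes could be staged in the CXL write buffer and flushed as a batch; the buffer layout, flush trigger, and the placeholder transfer call are assumptions rather than Rcmp's actual code.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Rough sketch of the CXL write buffer: small writes destined for a remote
// rack are staged in the owner memory space of CXL memory and later flushed
// as one batched transfer through the Daemon. Layout and flush policy assumed.
struct PendingWrite { uint64_t gaddr; uint32_t size; uint64_t buf_offset; };

struct CxlWriteBuffer {
  static constexpr size_t kCapacity = 64u << 20;  // 64 MB default (Section 6.1)
  char *base;                                     // region inside CXL owner memory
  size_t used = 0;
  std::vector<PendingWrite> pending;

  explicit CxlWriteBuffer(char *cxl_region) : base(cxl_region) {}

  // Stage one small remote-rack write; flush when the buffer would overflow.
  void write(uint64_t gaddr, const void *data, uint32_t size) {
    if (used + size > kCapacity) flush();
    std::memcpy(base + used, data, size);
    pending.push_back({gaddr, size, used});
    used += size;
  }

  // Flush all staged writes in one batched remote-rack transfer.
  void flush() {
    if (pending.empty()) return;
    // send_batched_remote_writes(pending, base);  // placeholder for the RDMA path
    pending.clear();
    used = 0;
  }
};
```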
5.5 RRPC Framework
Compared with the traditional RDMA and RPC frameworks, RRPC adopts a hybrid approach, which can adaptively choose RPC and one-sided RDMA communication for different data patterns. RRPC is inspired by the test results in Figure 3 and uses 512B as the threshold to dynamically select the communication modes. The main idea is to efficiently leverage the high bandwidth characteristics of RDMA to amortize communication latency. As shown in Figure 12, RRPC includes three communication modes.
Pure RPC mode is for communications with less than 512B of transmitted data, including scenarios such as locking during transactions, data index queries, and memory allocation.
RPC and one-sided mode is suitable for unstructured big data (more than 512B) and data of unknown size, such as object storage scenarios. In this case, it is difficult for the client to know the size of the object to be accessed before requesting the server. Therefore, it is necessary to obtain the remote address via RPC first, allocate a space of the specified size locally, and finally fetch the remote data via an RDMA one-sided READ.
RPC zero-copy mode is for structured big data (more than 512B) with a fixed size, such as SQL scenarios. Because the data has a fixed size, the client can carry the address of a local buffer when sending the RPC request, and the data is written directly into that buffer via an RDMA one-sided WRITE, avoiding extra data copies.
For the latter two modes, once the page address is acquired via RPC, Rcmp will cache it and only use one-sided RDMA reads/writes for subsequent accesses. In addition, RRPC adopts QP sharing and doorbell batching, among others, to optimize RDMA communications, drawing on the strengths of other works [17, 26, 63].
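A compact sketch of the mode selection is given below; the 512B threshold is the measured cut-off from Figure 3, whereas the enum and function names are assumptions for illustration.

```cpp
#include <cstddef>

// Sketch of RRPC's communication-mode selection (names assumed, threshold from Figure 3).
enum class RrpcMode {
  kPureRpc,          // < 512B: locking, index queries, memory allocation
  kRpcPlusOneSided,  // >= 512B, size unknown: RPC fetches the remote address,
                     // then a one-sided READ transfers the data
  kRpcZeroCopy       // >= 512B, size known: the RPC request carries a local
                     // buffer address and the data is written via one-sided WRITE
};

constexpr size_t kRpcThreshold = 512;  // cut-off point measured in Figure 3

RrpcMode choose_mode(bool size_known, size_t size) {
  if (!size_known) return RrpcMode::kRpcPlusOneSided;  // cannot pre-allocate exactly
  if (size < kRpcThreshold) return RrpcMode::kPureRpc;
  return RrpcMode::kRpcZeroCopy;
}
```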
6 EVALUATION
In this section, we evaluate Rcmp’s performance using different benchmarks. The implementation of Rcmp and experiment setup are introduced first (Section 6.1 and Section 6.2). Next, we compare Rcmp with three other remote memory systems using a micro-benchmark (Section 6.3). Then, we run a key-value store with YCSB benchmarks to show the performance benefits of Rcmp (Section 6.4). Finally, we evaluate the impact of key technologies in Rcmp (Section 6.5).
6.1 Implementation
Rcmp is a user-level system without kernel-space modifications, implemented in 6,483 lines of C++ code. In Rcmp, a page is 2 MB by default since this achieves a good balance between metadata size and latency; each write buffer is 64 MB, and the page cache is an LRU cache that holds 50 pages; the threshold \(T_l\) is 100 s, \(H_p\) is 4, \(\lambda\) is 0.04, and \(R_{rw}\) is 0.9 by default. The thresholds can be tuned for specific application scenarios. The RRPC framework is implemented on top of eRPC [24].
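For reference, the defaults listed above could be gathered into a single configuration struct such as the following; the struct itself is illustrative, and only the values reflect the stated defaults.

```cpp
#include <cstddef>

// Illustrative configuration struct collecting the defaults stated in Section 6.1
// (the struct is not Rcmp's actual code; only the values come from the text).
struct RcmpConfig {
  size_t page_size        = 2ull << 20;   // 2 MB pages
  size_t write_buffer_sz  = 64ull << 20;  // 64 MB per write buffer
  size_t page_cache_pages = 50;           // LRU page cache capacity (pages)
  double t_l    = 100.0;                  // valid lifetime threshold T_l (seconds)
  double h_p    = 4.0;                    // hotness threshold H_p
  double lambda = 0.04;                   // decay constant lambda
  double r_rw   = 0.9;                    // read/write ratio threshold R_rw
};
```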
CXL-enabled FPGA prototypes are now available for purchase, but we still choose NUMA-based emulation to implement CXL memory for two reasons. First, FPGA-based prototypes have higher latency in Intel measurements [49], more than 250 ns. As presented by CXL Consortium president Siamak Tavallaei [38], “These early CXL proof of concepts and products are yet not optimized for latency. With time, the access latency of CXL memory will be significantly improved.” Second, in addition to similar access latency, the NUMA architecture is cache coherent and accesses remote memory with native ld/st instructions, which matches the memory semantics of CXL.
6.2 Experiment Setup
All experiments are conducted on five servers, each equipped with two-socket Intel Xeon Gold 5218R CPUs @ 2.10 GHz, 128 GB of DRAM, and one 100-Gbps Mellanox ConnectX-5 RNIC. The operating system is Ubuntu 20.04 with Linux 5.4.0-144-generic. The cross-node access latencies between NUMA node 0 and node 1 are 138.5 ns and 141.1 ns, and the intra-node access latencies are 93 ns and 89.7 ns, respectively.
Rcmp is compared with four other state-of-the-art remote memory systems: (1) Fastswap [3], a page-based system; (2) FaRM [17], an object-based system; (3) GAM [9], a distributed memory system that provides a cache coherence protocol over RDMA; and (4) CXL-over-Ethernet, a CXL-based memory disaggregation system with Ethernet (see Section 7 for details). We run Fastswap and GAM using their open source code. Since FaRM is not publicly available, we use the code from the work of Cai et al. [9]. Note that FaRM and GAM are not truly “disaggregated” architectures; their CNs have local memory of the same size as the remote memory. We modify some configurations (reducing local memory) to port them to a disaggregated architecture. Due to the lack of FPGA devices and the unpublished source code of CXL-over-Ethernet, we implement a CXL-over-Ethernet prototype based on Rcmp’s code. For fairness, the RDMA network is also used in CXL-over-Ethernet.
System Deployment and Simulated Environment. Figure 13(a) shows the envisioned architecture of Rcmp. In a rack, low-latency CXL is used to connect CNs and MNs to form a small memory pool; RDMA is used to connect the racks (interconnected with RDMA-enabled ToR switches). The CXL link latency is 90 to 150 ns; the RDMA network latency is 1.5 to 3 \(\mu\)s. Our test environment is shown in Figure 13(b). Due to the limited availability of devices, we use a server to simulate a rack, including a small compute pool and memory pool (or CXL memory). For Rcmp and CXL-over-Ethernet, the compute pool runs on one CPU socket, and one CPU-less NUMA node is used as CXL memory. In the compute pool of Rcmp, different processes run different CN clients, and one process runs the Daemon. For the other systems, the memory pool is connected to the compute pool via RDMA. In addition, the memory pool or CXL memory in a rack has about 100 GB of DRAM, and the local DRAM of the compute pool is 1 GB. We use micro-benchmarks to evaluate the basic read/write performance of different systems and use the YCSB benchmarks [15] to evaluate their performance under different workloads, as shown in Table 5.
6.3 Micro-Benchmark Results
We first evaluate the overall performance and scalability of these systems by running a micro-benchmark with random read and write operations of different data sizes.
Overall Performance. As shown in Figure 14, we run a micro-benchmark 10 times in a two-rack environment under different data sizes and compare the average latency. The same number of memory pages are pre-allocated for each rack.
The results show that Rcmp has lower and more stable write/read latency (<3.5 \(\mu\)s and <3 \(\mu\)s). Specifically, the write latency is reduced by 2.3 to 8.9× and the read latency is reduced by 2.7 to 8.1× compared to other systems. This is achieved through Rcmp’s efficient utilization of CXL, which incorporates designs such as efficient communication and hot-page swapping to minimize system latency. Fastswap has over 12-\(\mu\)s access latency, which is \(\sim 5.2\times\) higher than Rcmp. When the accessed data is not in the local DRAM cache, Fastswap fetches the page from the remote memory pool based on expensive page faults, resulting in higher overhead. FaRM has lower read/write latency, around 8 \(\mu\)s, due to object-based data management and efficient messaging primitives that improve RDMA communications. GAM is also an object-based system and performs well (\(\sim 5 \mu\)s) when the data size is less than 512B, but latency increases dramatically when the data is larger. This is because GAM uses 512B as the default cache line size, and when data spans multiple cache lines, GAM needs to maintain the consistency state across all cache lines synchronously, resulting in performance degradation. Furthermore, write operations are asynchronous and pipelined in GAM, which leads to lower write latency (see Figure 14(a)). CXL-over-Ethernet also achieves low read and write latency (6–8 \(\mu\)s) through CXL. However, CXL-over-Ethernet deploys CXL in the compute pools and employs a cache strategy for the memory pools, which does not fully utilize the low-latency benefits of CXL. In addition, CXL-over-Ethernet is not optimized for the network, which is the main performance bottleneck of the hybrid architecture.
Scalability. We test the scalability of different systems by varying different clients and racks. Each client runs a micro-benchmark. We have five servers and can build up to five racks.
First, we compare the read/write throughput with multiple concurrent clients in a two-rack environment. As shown in Figure 15, the throughput of Rcmp scales roughly linearly with the number of clients when there are fewer than 16 clients. However, scalability is limited with more clients due to the single Daemon. Therefore, Rcmp will adopt multiple Daemon servers for larger-scale deployments (Section 4.2). Fastswap scales almost linearly with the number of clients because of its efficient page-fault-driven remote memory accesses. FaRM also has good scalability, especially for read operations, due to its efficient communication primitives. In contrast, GAM only exhibits linear scalability with up to four threads. When more clients are involved, the performance improvement of GAM is marginal or even negative due to the software overhead of its user-level library [29, 66]. To ensure consistency, GAM has to acquire locks to check the access permission for each memory access, which has a high overhead in dense access scenarios. CXL-over-Ethernet’s performance no longer improves beyond eight threads. In CXL-over-Ethernet, before accessing the memory pool, all threads need to communicate with the CXL Agent, which becomes the performance bottleneck.
Second, we increase the number of racks and run eight clients with each rack. The accessed data of each rack is uniformly distributed among the entire memory pool. As shown in Figure 16, the throughput of Fastswap is not affected by the number of racks and has excellent scalability. A slight performance loss occurs in Rcmp and FaRM due to competition from different accesses between racks. In Rcmp, there is also contention for hot-page swapping, which is mitigated by the hot-page identification mechanism. The cache coherence overhead of GAM becomes more pronounced in the multi-rack environment, resulting in significant performance degradation. For CXL-over-Ethernet, the agent in the compute pool limits the scalability.
In summary, Rcmp effectively leverages CXL through several innovative designs to reduce access latency and improve scalability, whereas other systems suffer from high latency or poor scalability.
6.4 Key-Value Store and YCSB Workloads
We run a general key-value store, implemented as a hashtable, on these systems. Next, we run the widely used YCSB benchmarks [15] (six workloads, as shown in Table 5) to evaluate performance. Since the hashtable does not support range queries, the YCSB E workload is not performed. All experiments are run in a two-rack environment. We pre-load 100M key-value pairs of 64B size and then perform different workloads under Uniform and Zipfian (skewness of 0.99 by default) distributions. Figure 17 shows the throughput of different systems, all normalized to Fastswap. Based on these results, the following conclusions can be drawn.
First, Rcmp outperforms RDMA-based systems by 2 to 4× on all the workloads by utilizing CXL efficiently. Specifically, for read-intensive workloads (YCSB B, C, D), Rcmp’s performance improves \(\sim 3\times\) over Fastswap by avoiding page-fault overheads and reducing data movement between racks with hot-page swapping. In addition, Rcmp’s efficient communication mechanisms and RRPC framework further contribute to its performance. FaRM, GAM, and CXL-over-Ethernet also perform better, with \(\sim 1.5\times\) improvement over Fastswap. This is because FaRM needs only a single one-sided lock-free read operation for remote access, and GAM and CXL-over-Ethernet provide a uniform caching policy in local memory or CXL memory. With the memory disaggregation architecture, however, the benefits of caching are constrained by the limited local DRAM. For write-intensive workloads (YCSB A and F), Rcmp has 1.5× higher throughput.
Second, Rcmp’s performance improvement is more pronounced in Zipfian workloads, which achieves up to 3.8× higher throughput. Since hot pages are accessed frequently under Zipfian workloads, Rcmp greatly reduces slow remote-rack accesses by migrating hot pages to the local rack. GAM and CXL-over-Ethernet also have significant performance improvements due to high cache hit rates under Zipfian workloads.
In summary, Rcmp achieves superior performance over other systems by effectively leveraging CXL and other optimizations. Other systems have obvious limitations in the memory disaggregation architecture, where most of the data is obtained by accessing the remote memory pool. For instance, kernel-based, page-granular Fastswap has expensive interruption overheads. GAM’s caching strategies have limited performance improvement in the scarce local memory. In addition, some operations of FaRM rely on bilateral collaboration, which is incompatible with this disaggregated architecture due to the near-zero computation power of the memory pool.
6.5 Impact of Key Technologies
In this section, we focus on the impact of four strategies on Rcmp performance, including the communication mechanism, swapping and caching strategies, and RRPC. These strategies aim to mitigate the performance mismatch problems (between RDMA and CXL) and maximize the performance benefits of CXL.
We first apply Rcmp’s key technologies one by one, and Figure 18 shows the results under a micro-benchmark in a two-rack environment. Base represents the basic version of Rcmp, including single-layer ring buffers, eRPC, and so on. +RB represents adopting double-layer ring buffers; +Swap and +WB indicate that hot-page swapping and the CXL write buffer are further applied, respectively; and +RRPC represents adopting the RRPC framework and shows the final performance of Rcmp. Rcmp-only-CXL indicates that all read/write operations are performed within the rack and do not involve RDMA networks. Theoretically, the pure CXL solution is the upper limit of Rcmp’s performance, but it can only be deployed within a rack. The results show that these techniques progressively narrow the performance gap between Rcmp and Rcmp-only-CXL, although Rcmp still has room for improvement in tail latency and read throughput. Among them, the double-layer ring buffers reduce latency, especially tail latency. The swapping strategy greatly reduces latency and improves throughput, which is more obvious for read operations. The write buffer and RRPC improve throughput significantly, and the write buffer mainly affects write operations.
Then, we analyze the benefits of each technology in detail. All experiments are run in a two-rack environment by default.
Intra-Rack Communication. As shown in Figure 19, we compare the latency (p50, p99, p999) under a micro-benchmark for two strategies: (1) using a single ring buffer (Baseline) and (2) using two ring buffers for different access modes (Rcmp). The results show that Rcmp reduces the 50th, 99th, and 99.9th percentile latencies by up to 21.7%, 30.9%, and 51.5%, respectively. Due to the latency gap between local- and remote-rack accesses, a single communication buffer may lead to blocking problems, triggering longer tail latency. Rcmp solves this with its efficient communication mechanism.
Hot-Page Swapping. We run a micro-benchmark under different distributions (Uniform, Zipfian) to evaluate the effect of hot-page swapping. The results are shown in Figure 20, and the following conclusions can be drawn. First, the hot-page swapping policy can significantly improve performance compared to the no-swapping policy, especially for skewed workloads. For example, the swapping policy (\(H_p\)=3) can improve throughput by 5% on Uniform workloads and 35% on Zipfian workloads. Second, frequent page swapping results in performance degradation. When the hotness threshold is set very low (e.g., \(H_p\)=1), the throughput plummets, because with a low threshold, each remote access may trigger a page swap (similar to page-based systems), resulting in high overhead.
Write Buffer. Assuming a scenario with frequent fine-grained writes to remote racks, the CXL write buffer batches these small writes and reduces the number of RDMA communications, which mainly benefits write operations (consistent with the +WB results in Figure 18).
RRPC Framework. We compare RRPC with FaRM’s RPC [17], the eRPC [24] framework, and a hybrid mode (eRPC + one-sided RDMA verbs) under different transfer data sizes. As shown in Figure 22, RRPC achieves 1.33 to 1.89× higher throughput than eRPC + one-sided RDMA verbs and 1.5 to 2× higher than eRPC when the transferred data is large. eRPC performs well when the data is less than 968B, but for larger data, eRPC suffers from performance degradation. This is because eRPC is based on UD (Unreliable Datagram) mode and each message has an MTU (maximum transmission unit) size, 1KB by default; when the data is larger than the MTU size, it is divided into multiple packets, resulting in worse performance. RRPC selects the eRPC method only when the data is smaller than 512B and the hybrid mode when the data is larger. In addition, RRPC adopts several strategies to improve RDMA communication performance.
6.6 Discussion
Supporting Decentralization. The centralized design makes the MS prone to becoming a performance bottleneck, which Rcmp mitigates by leveraging CN local DRAM. In future work, Rcmp will implement a decentralized architecture with consistent hashing, with cluster membership maintained reliably using ZooKeeper [23], similar to FaRM.
Supporting Cache I/O. Rcmp adopts a cache-less access mode, which avoids consistency-maintenance overheads across racks. With the decentralized architecture, Rcmp can design cache structures for remote racks in CXL memory and maintain cache consistency between racks using ZooKeeper.
Transparency. Although Rcmp provides very simple APIs and built-in implementations of standard data structures (e.g., a hashtable), many scenarios still require migrating legacy applications to Rcmp without modifying their source code. Following Gengar [18], we integrate Rcmp with FUSE [1] to implement a simple distributed file system, which can be used by most applications without source code modifications. Usage instructions can be found in Rcmp’s source code at https://github.com/PDS-Lab/Rcmp.
7 RELATED WORK
7.1 RDMA-Based Remote Memory
Page-Based Systems. Infiniswap [22] is a page-based remote memory system using RDMA networks, which performs decentralized slab placements and evictions based on one-sided RDMA operations. Additionally, Infiniswap adopts a block device in kernel space as the swap space and a daemon in user space to manage accessible remote memory. Similar page-based systems include LegoOS [44], a new resource-disaggregated OS that provides a global virtual memory space; Clover [51], an RDMA-based disaggregated persistent memory (pDPM) system, which separates the metadata/control plane and the data plane; and Fastswap [3], a fast swap system for disaggregated memory over RDMA with a far-memory-aware cluster scheduler. However, these page-based systems suffer from I/O amplification due to coarse-grained access and additional overhead due to page-fault handling and context switching.
Object-Based Systems. Object-based memory disaggregation designs its own object interfaces (e.g., key-value stores) to directly intervene in RDMA data transfers. FaRM [17] is an object-based remote memory system based on RDMA, which exposes the memory of all servers in a cluster as a shared address space and provides efficient APIs to simplify the use of remote memory. AIFM [41] is an application-integrated remote memory system that provides convenient APIs for the development of applications and a high-performance runtime designed for minimal overhead on object accesses. Xstore [57] adopts learned indexes to build a remote memory cache in RDMA-based key-value stores. However, these systems are not fully “disaggregated,” because each CN contains local memory of the same size as the remote memory. FUSEE is a fully memory-disaggregated key-value store that brings disaggregation to metadata management based on the RACE hashing index [67], a one-sided RDMA-conscious extendible hashing scheme. Gengar [18] is an object-based hybrid memory pool system that provides a global memory space (including remote NVM and DRAM) over RDMA.
Communication Optimization. Most RDMA-based systems propose optimized strategies to improve the efficiency of RDMA communication. FaRM proposes messaging primitives based on lock-free ring buffers to minimize the communication overhead for remote memory. In addition, FaRM reduces RNIC cache misses by sharing QPs. Clover improves the scalability of RDMA by registering memory regions with RNICs using huge memory pages (HugePages). Xstore uses doorbell batching to reduce network latency for multiple RDMA operations.
7.2 Supporting Cache Coherence
GAM [9] is a distributed memory system that provides cache-coherent memory over RDMA. GAM maintains coherence between local and remote memory via a directory-based cache coherence protocol. However, this approach has high maintenance overhead. New interconnect protocols, such as CXL [45] or CCIX [6], natively support cache coherence and lower latency compared to RDMA. Some researchers have tried to redesign RDMA-based memory disaggregation using these protocols. Kona [10] uses cache coherence instead of virtual memory for tracking applications’ memory accesses transparently, reducing read/write amplification in page-based systems. Rambda [61] is an RDMA-driven acceleration framework that uses a cache-coherent accelerator to connect to CXL-like cache-coherence memory, but it requires customized hardware.
7.3 CXL-Based Memory Disaggregation
DirectCXL [21] is a CXL-based memory disaggregation system that achieves directly accessible remote memory over CXL protocols. DirectCXL exhibits 6.2× lower latency than RDMA-based memory disaggregation. Pond [30] is a memory pooling system for cloud platforms, which significantly reduces DRAM costs based on CXL. However, these systems do not consider the distance limitation of CXL. CXL-over-Ethernet [56] is a novel FPGA-based memory disaggregation system that circumvents CXL’s distance limitation via Ethernet, but it does not fully exploit the performance benefits of CXL.
8 CONCLUSION AND FUTURE WORK
In this study, we developed Rcmp, a low-latency and highly scalable memory pooling system, which is the first to combine RDMA and CXL for memory disaggregation. Rcmp builds a CXL-based memory pool within each rack and uses RDMA to connect the racks, forming a global memory pool. Rcmp adopts several techniques to address the mismatch between RDMA and CXL. Rcmp provides global memory and address management to support access at cache-line granularity. In addition, Rcmp uses different buffer structures to handle intra- and inter-rack communications and avoid blocking problems. To reduce remote-rack accesses, Rcmp proposes a hot-page identification and migration strategy and buffers fine-grained accesses with a lock-based synchronization mechanism. To accelerate remote-rack accesses, Rcmp designs an optimized RRPC framework. Evaluations indicate that Rcmp significantly outperforms other RDMA-based solutions in all workloads without additional overheads.
In the future, we will experiment with real CXL devices for Rcmp and improve the design of decentralization and CXL caching strategies (presented in Section 6.6). In addition, we will support other storage devices (e.g., PM, SSD, and HDD) in Rcmp.
ACKNOWLEDGMENTS
We appreciate all reviewers and editors for their insightful comments and feedback. We thank Yixing Guo for his efforts in Rcmp’s codes.
REFERENCES
[1] 2023. FUSE (Filesystem in Userspace). Retrieved December 8, 2023 from http://libfuse.github.io/
[2] 2023. Memory disaggregation: Why now and what are the challenges. ACM SIGOPS Operating Systems Review 57, 1 (2023), 38–46.
[3] 2020. Can far memory improve job throughput? In Proceedings of the 15th European Conference on Computer Systems. 1–16.
[4] 2021. FlashNeuron: SSD-enabled large-batch training of very deep neural networks. In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST’21). 387–401.
[5] 2013. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (2nd ed.). Synthesis Lectures on Computer Architecture. Morgan & Claypool.
[6] 2017. CCIX, GEN-Z, OpenCAPI: Overview & comparison. In Proceedings of the OpenFabrics Workshop.
[7] 2023. Role of Chat GPT in public health. Annals of Biomedical Engineering 51, 5 (2023), 868–869.
[8] 1994. The slab allocator: An object-caching kernel memory allocator. In Proceedings of the USENIX Summer 1994 Technical Conference, Vol. 16. 1–12.
[9] 2018. Efficient distributed memory management with RDMA and caching. Proceedings of the VLDB Endowment 11, 11 (2018), 1604–1617.
[10] 2021. Rethinking software runtimes for disaggregated memory. In Proceedings of the 18th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 2–16.
[11] 2021. PolarDB Serverless: A cloud native database for disaggregated data centers. In Proceedings of the 2021 International Conference on Management of Data. 2477–2489.
[12] 2020. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST’20).
[13] 2018. Analyzing Alibaba’s co-located datacenter workloads. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data’18). IEEE, Los Alamitos, CA, 292–297.
[14] 2023. Supercomputing Predictions: Custom CPUs, CXL3.0, and Petalith Architectures. Retrieved December 8, 2023 from https://adrianco.medium.com/supercomputing-predictions-custom-cpus-cxl3-0-and-petalith-architectures-b67cc324588f/
[15] 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. 143–154.
[16] 2023. PCI Express® 5.0 Optical Signal Transmission Test. Retrieved December 8, 2023 from https://global.kyocera.com/newsroom/news/2023/000694.html
[17] 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI’14). 401–414.
[18] 2021. Gengar: An RDMA-based distributed hybrid memory pool. In Proceedings of the 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS’21). IEEE, Los Alamitos, CA, 92–103.
[19] 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines 30 (2020), 681–694.
[20] 2016. Network requirements for resource disaggregation. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 249–264. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/gao
[21] 2022. Direct access, high-performance memory disaggregation with DirectCXL. In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC’22). 287–294.
[22] 2017. Efficient memory disaggregation with INFINISWAP. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (NSDI’17). 649–667.
[23] 2010. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference, Vol. 8.
[24] 2019. Datacenter RPCs can be general and fast. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19). 1–16.
[25] 2014. Using RDMA efficiently for key-value services. In Proceedings of the 2014 ACM Conference on SIGCOMM. 295–306.
[26] 2016. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). 185–201.
[27] 2016. Rack-scale disaggregated cloud data centers: The dReDBox project vision. In Proceedings of the 2016 Design, Automation, and Test in Europe Conference and Exhibition (DATE’16). IEEE, Los Alamitos, CA, 690–695.
[28] 2018. Beyond the memory wall: A case for memory-centric HPC system for deep learning. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’18). IEEE, Los Alamitos, CA, 148–161.
[29] 2021. Mind: In-network memory management for disaggregated data centers. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 488–504.
[30] 2023. Pond: CXL-based memory pooling systems for cloud platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 2. 574–587.
[31] 2018. Main-memory requirements of big data applications on commodity server platform. In Proceedings of the 2018 18th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGRID’18). IEEE, Los Alamitos, CA, 653–660.
[32] 2023. TPP: Transparent page placement for CXL-enabled tiered-memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3. 742–755.
[33] 2021. Memtrade: A disaggregated-memory marketplace for public clouds. arXiv preprint arXiv:2108.06893 (2021).
[34] 2023. Myths and legends in high-performance computing. International Journal of High Performance Computing Applications 37, 3-4 (2023), 245–259.
[35] 2022. A case for intra-rack resource disaggregation in HPC. ACM Transactions on Architecture and Code Optimization 19, 2 (2022), 1–26.
[36] 2021. Birds of a feather flock together: Scaling RDMA RPCs with Flock. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 212–227.
[37] 2020. On the memory underutilization: Exploring disaggregated memory on HPC systems. In Proceedings of the 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’20). IEEE, Los Alamitos, CA, 183–190.
[38] 2022. Just How Bad Is CXL Memory Latency? Retrieved December 8, 2023 from https://www.nextplatform.com/2022/12/05/just-how-bad-is-cxl-memory-latency/
[39] 2021. HeMem: Scalable tiered memory management for big data applications and real NVM. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 392–407.
[40] 2012. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the 3rd ACM Symposium on Cloud Computing. 1–13.
[41] 2020. AIFM: High-performance, application-integrated far memory. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 315–332.
[42] 2019. PCIe Riser Extension Assembly. Technical Disclosure Commons (January 11, 2019). https://www.tdcommons.org/dpubs_series/1878
[43] 2019. Fast general distributed transactions with opacity. In Proceedings of the 2019 International Conference on Management of Data. 433–448.
[44] 2018. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 69–87.
[45] 2022. Compute Express Link. Retrieved December 8, 2023 from https://www.computeexpresslink.org/_files/ugd/0c1418_a8713008916044ae9604405d10a7773b.pdf/
[46] 2023. A Milestone in Moving Data. Retrieved December 8, 2023 from https://www.intel.com/content/www/us/en/newsroom/home.html
[47] 2019. Shoal: A network architecture for disaggregated racks. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19). 255–270. https://www.usenix.org/conference/nsdi19/presentation/shrivastav
[48] 2019. Intel® Rack Scale Design (Intel® RSD) Storage Services. API Specification. Intel.
[49] 2023. Demystifying CXL memory with genuine CXL-ready systems and devices. arXiv preprint arXiv:2303.15375 (2023).
[50] 2023. Linux Kernel Source Tree. Retrieved December 8, 2023 from https://github.com/torvalds/linux/blob/master/lib/kfifo.c
[51] 2020. Disaggregating persistent memory and controlling them remotely: An exploration of passive disaggregated key-value stores. In Proceedings of the 2020 USENIX Annual Technical Conference. 33–48.
[52] 2019. HOTI 2019: Compute express link. In Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects (HOTI’19). IEEE, Los Alamitos, CA, 18.
[53] 2017. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data. 1041–1052.
[54] 2020. Building an elastic query engine on disaggregated storage. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI’20). 449–462.
[55] 2022. Evaluating emerging CXL-enabled memory pooling for HPC systems. arXiv preprint arXiv:2211.02682 (2022).
[56] 2023. CXL over Ethernet: A novel FPGA-based memory disaggregation design in data centers. In Proceedings of the 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’23). IEEE, Los Alamitos, CA, 75–82.
[57] 2020. Fast RDMA-based ordered key-value store using remote learned cache. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 117–135.
[58] 2003. Reliability mechanisms for very large storage systems. In Proceedings of the 2003 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST’03). IEEE, Los Alamitos, CA, 146–156.
[59] 2020. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 191–208.
[60] 2022. Performance evaluation on CXL-enabled hybrid memory pool. In Proceedings of the 2022 IEEE International Conference on Networking, Architecture, and Storage (NAS’22). IEEE, Los Alamitos, CA, 1–5.
[61] 2023. RAMBDA: RDMA-driven acceleration framework for memory-intensive \(\mu\)s-scale datacenter applications. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA’23). IEEE, Los Alamitos, CA, 499–515.
[62] 2016. The end of a myth: Distributed transactions can scale. CoRR abs/1607.00655 (2016). http://arxiv.org/abs/1607.00655
[63] 2022. FORD: Fast one-sided RDMA-based distributed transactions for disaggregated persistent memory. In Proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST’22). 51–68.
[64] 2021. Sherman: A write-optimized distributed B+Tree index on disaggregated memory. arXiv preprint arXiv:2112.07320 (2021).
[65] 2021. Towards cost-effective and elastic cloud database deployment via memory disaggregation. Proceedings of the VLDB Endowment 14, 10 (2021), 1900–1912.
[66] 2022. ScaleStore: A fast and cost-efficient storage engine using DRAM, NVMe, and RDMA. In Proceedings of the 2022 International Conference on Management of Data. 685–699.
[67] 2021. One-sided RDMA-conscious extendible hashing for disaggregated memory. In Proceedings of the USENIX Annual Technical Conference. 15–29.