
Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL

Published: 19 January 2024


Abstract

Memory disaggregation is a promising architecture for modern datacenters that separates compute and memory resources into independent pools connected by ultra-fast networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing memory disaggregation solutions based on remote direct memory access (RDMA) suffer from high latency and additional overheads including page faults and code refactoring. Emerging cache-coherent interconnects such as CXL offer opportunities to reconstruct high-performance memory disaggregation. However, existing CXL-based approaches have a physical distance limitation and cannot be deployed across racks.

In this article, we propose Rcmp, a novel low-latency and highly scalable memory disaggregation system based on RDMA and CXL. The significant feature is that Rcmp improves the performance of RDMA-based systems via CXL, and leverages RDMA to overcome CXL’s distance limitation. To address the challenges of the mismatch between RDMA and CXL in terms of granularity, communication, and performance, Rcmp (1) provides a global page-based memory space management and enables fine-grained data access, (2) designs an efficient communication mechanism to avoid communication blocking issues, (3) proposes a hot-page identification and swapping strategy to reduce RDMA communications, and (4) designs an RDMA-optimized RPC framework to accelerate RDMA transfers. We implement a prototype of Rcmp and evaluate its performance by using micro-benchmarks and running a key-value store with YCSB benchmarks. The results show that Rcmp can achieve 5.2× lower latency and 3.8× higher throughput than RDMA-based systems. We also demonstrate that Rcmp can scale well with the increasing number of nodes without compromising performance.


1 INTRODUCTION

Memory disaggregation is increasingly favored in datacenters (e.g., RSA [48], WSC [5], and dReDBox [27]), cloud servers (e.g., Pond [30] and Amazon Aurora [53]), in-memory databases (e.g., PolarDB [11] and LegoBase [65]), and High-Performance Computing (HPC) systems [37, 55], among others, for higher resource utilization, flexible hardware scalability, and lower costs. This architecture (Figure 1) decouples compute and memory resources from traditional monolithic servers to form independent resource pools. The compute pool contains rich CPU resources but minimal memory resources, whereas the memory pool contains large amounts of memory but near-zero computation power. Memory disaggregation can provide a global shared memory pool and allow different resources to scale independently, which offers opportunities to build cost-effective and elastic datacenters.

Fig. 1. Memory disaggregation.

Remote Direct Memory Access (RDMA) networks are generally adopted in memory disaggregation systems [3, 17, 29, 41, 43, 51, 63, 66] to connect the compute and memory pools (Figure 2(a)). However, existing RDMA-based memory disaggregation solutions have significant shortcomings. One is high latency. Current RDMA can support single-digit-microsecond latency (1.5\(\sim\)3 \(\mu\)s) [17, 64], but this is still more than an order of magnitude higher than DRAM latency (80\(\sim\)140 ns). RDMA communication becomes the performance bottleneck for accessing the memory pool. Another is additional overhead. Since memory semantics are not natively supported, RDMA incurs intrusive code modifications and interruption overheads on the original system. Specifically, current RDMA-based memory disaggregation includes page-based and object-based approaches, differentiated by data exchange granularity. Page-based approaches involve the additional overhead of page-fault handling and read/write amplification [10, 41], whereas object-based approaches require custom interface changes and source-level modifications that sacrifice transparency [17, 56].

Fig. 2. Different memory disaggregation architectures.

CXL (Compute Express Link) is a PCIe-based cache-coherent interconnect protocol, which enables direct and coherent access to remote memory devices without CPU intervention [45, 52]. CXL natively supports memory semantics and has similar multi-socket NUMA access latency (about 90\(\sim\)150 ns [21, 45]), which exhibits great potential to overcome the drawbacks of RDMA and realize low-cost, high-performance memory disaggregation. Recently, CXL-based memory disaggregation technology has received significant attention in both academia and industry [10, 21, 30, 56].

Reconstructing a CXL-based memory disaggregation architecture (see Figure 2(b)) to replace RDMA is a promising research direction, but the immaturity of CXL technology and the lack of industrial-grade products make it difficult in practice. First, there are physical limitations. Existing CXL-based memory disaggregation is restricted to short-distance deployment, typically within the rack level of a datacenter, even for the latest CXL 3.0 specification [14, 45, 56]. This physical distance limitation makes it impossible to deploy memory pools across racks, sacrificing scalability. Second, the cost is high. The cost of replacing all RDMA hardware in a datacenter with CXL hardware is prohibitive, especially for large-scale clusters. Furthermore, due to the lack of commercially available mass-produced CXL hardware and supporting infrastructure, current research on CXL memory relies on custom FPGA prototypes [21, 49] or emulation using a CPU-less NUMA node [30, 32].

In this article, we probe a hybrid memory disaggregation architecture combining CXL and RDMA, which retains and leverages RDMA to let CXL break its distance constraint. In such an architecture (see Figure 2(c)), a small CXL-based memory pool is built within each rack, and RDMA is used to connect the racks, forming a larger memory pool. This approach uses CXL to improve the performance of RDMA-based memory disaggregation and sidesteps the physical distance limitation of CXL. However, it faces substantial implementation challenges, including the granularity, communication, and performance mismatches between RDMA and CXL (Section 3.3). In particular, due to the latency gap between RDMA and CXL, RDMA communication between racks becomes the major performance bottleneck. Some research proposes an RDMA-driven acceleration framework [61] using a cache-coherent accelerator to connect to CXL-like cache-coherent memory, but this approach requires customized hardware.

To address these issues, we propose Rcmp, a novel memory disaggregation system based on RDMA and CXL to provide low-latency and scalable memory pool services. As shown in Figure 2(c), the significant feature of Rcmp is to combine the RDMA-based (see Figure 2(a)) and CXL-based (see Figure 2(b)) approaches to overcome the drawbacks of both and maximize the performance benefits of CXL. Rcmp advocates several optimized designs to address the aforementioned challenges. Specifically, Rcmp has four key features. First, Rcmp provides global memory allocation and address management, which decouples data movement size (cache-line granularity) from the memory allocation size (page granularity). Fine-grained data access can avoid IO amplification [10, 56]. Second, Rcmp designs an efficient intra- and inter-rack communication mechanism to avoid communication blocking problems. Third, Rcmp proposes a hot-page identification and swapping strategy, and a CXL memory caching policy with a synchronization mechanism to reduce cross-rack access. Fourth, Rcmp designs a high-performance RDMA-aware RPC framework to accelerate cross-rack RDMA transfers.

We implement Rcmp as a user-level architecture with 6,483 lines of C++ code. Rcmp provides simple APIs for memory pool services, which are easy to use for applications. In addition, Rcmp provides simple high-capacity in-memory file system interfaces by integrating with FUSE [1]. We evaluate Rcmp with micro-benchmarks and run a key-value store (hashtable) under YCSB workloads. The evaluation results indicate that Rcmp achieves high and stable performance in all workloads. Specifically, Rcmp reduces latency by 3 to 8× under micro-benchmarks and improves throughput by 2 to 4× under YCSB workloads compared to RDMA-based memory disaggregation systems. In addition, Rcmp scales well with an increasing number of nodes or racks. The open source code of Rcmp and the experimental datasets used in this article are available at https://github.com/PDS-Lab/Rcmp.

In summary, we make the following contributions:

We analyze the shortcomings of current memory disaggregation systems and show that RDMA-based systems suffer from high latency, additional overhead, and sub-optimal communication, whereas CXL-based systems suffer from physical distance limitations and lack of available products.

We design and implement Rcmp, a novel memory pool system, which achieves high performance and scalability by combining the advantages of RDMA and CXL. To the best of our knowledge, this is the first work to use both RDMA and CXL techniques to construct a memory disaggregation architecture.

We propose many optimization designs to overcome the performance challenges encountered when combining RDMA and CXL, including global memory management, an efficient communication mechanism, a hot-page swapping strategy, and a high-performance RPC framework.

We conduct a comprehensive evaluation of Rcmp’s performance and compare it with state-of-the-art memory disaggregation systems. The results demonstrate that Rcmp significantly outperforms these systems in terms of performance and scalability.

The rest of the article is organized as follows. Sections 2 and 3 explain the background and motivations. Sections 4 and 5 present the design ideas and system architecture details of Rcmp. Section 6 presents comprehensive evaluations. Section 7 summarizes the related work. Section 8 concludes the article.


2 BACKGROUND

2.1 Memory Disaggregation

Emerging applications such as big data [31, 39], deep learning [4, 28], HPC [37, 55], and large language models (e.g., ChatGPT [7] and GPT-3 [19]) are increasingly prevalent in modern datacenters, which leads to a huge demand for memory [2, 3, 44, 56]. However, datacenters today mostly use monolithic server architectures in which CPU and memory are tightly coupled, and these architectures face significant challenges as memory requirements grow:

Low memory utilization: In monolithic servers, since the memory resource occupied by a single instance cannot be allocated across server boundaries, it is difficult to fully utilize memory resources. Table 1 shows that the memory utilization in typical datacenters, cloud platforms, and HPC systems is generally below 50%. In addition, real-world applications often request a large amount of memory, but the memory is not fully used in practice. For example, in Microsoft Azure’s and Google’s clusters [30, 33, 56], about 30% to 61% of allocated memory remains idle for extended periods of time.

Lack of elasticity: It is difficult to scale memory or CPU resources up or down after they have been installed in a monolithic server. As a result, server configurations must be planned in advance, and dynamic adjustment often wastes existing server hardware [44, 65]. In addition, it is difficult to flexibly scale the memory capacity of a single server to the required size due to the fixed CPU-to-memory ratio [44, 56].

High costs: Large amounts of unused memory lead to high operating costs and wasted energy [11, 65]. In addition, device failures are frequent in modern datacenters, occurring almost every day [13, 40, 58]. With the monolithic architecture, when any hardware component within a server fails, the whole server is often unusable. Such coarse-grained fault management leads to high costs [44].

Table 1. Memory Utilization in Typical Systems

Category        | Examples                                                               | Memory Utilization
Datacenters     | Google’s production cluster [40, 44]                                   | 20%\(\sim\)40%
Datacenters     | Alibaba’s co-located datacenter [13]                                   | 5%\(\sim\)60%
Cloud Platforms | Snowflake [33, 54]                                                     | \(\sim\)19%
Cloud Platforms | Microsoft Azure [30, 56]                                               | <50%
HPC Systems     | The clusters at Lawrence Livermore National Laboratory [37, 55]        | <15%
HPC Systems     | Cori at National Energy Research Scientific Computing Center [35]      | 9%\(\sim\)15%

In response, memory disaggregation is proposed to solve these problems and has received significant attention in both academia and industry [3, 17, 21, 43, 51, 56, 63, 66]. Memory disaggregation separates the memory resources from the compute resources in a datacenter, forming independent resource pools connected with fast networks. This allows different resources to be managed and expanded independently, enabling higher memory utilization, elastic scaling, and lower costs.

As shown in Figure 1, in such an architecture, Compute Nodes (CNs) in the compute pool contain a large number of CPU cores and small local DRAM, and Memory Nodes (MNs) in the memory pool host high-volume memory with near-zero computation power. Microsecond-latency networks (e.g., RDMA) or cache-coherent interconnect protocols (e.g., CXL) generally serve as the physical transmission path from CNs to MNs.

2.2 RDMA Technologies

RDMA is a family of protocols that allow one machine to directly access data on remote machines across the network. RDMA protocols are typically implemented directly in RDMA NICs (RNICs), offer high bandwidth (>10 GB/s) and microsecond-level latency (\(\sim\)2 \(\mu\)s), and are widely supported by InfiniBand, RoCE, and OmniPath, among others [20, 47, 62]. RDMA provides data transfer services based on two types of operational primitives: one-sided verbs including RDMA READ, WRITE, and ATOMIC (e.g., FAA, CAS), and two-sided verbs including RDMA SEND and RECV. RDMA communication is implemented through a message queue model built on the Queue Pair (QP) and the Completion Queue (CQ). A QP consists of a Send Queue (SQ) and a Receive Queue (RQ). A sender posts requests (one-sided or two-sided verbs) to the SQ, and the RQ queues RDMA RECV requests for two-sided verbs. A CQ is associated with a specific QP. Requests in the same SQ are executed sequentially. By using doorbell batching [47, 64], multiple RDMA operations can be merged into a single request. These requests are then read by the RNIC, which asynchronously writes or reads data from remote memory. When the sender’s request completes, the RNIC writes a completion entry to the CQ so that the sender can detect completion by polling the CQ.
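To make the queue-pair workflow concrete, the following is a minimal sketch (not from the Rcmp code base) that posts a one-sided RDMA READ with libibverbs and busy-polls the CQ for its completion; queue-pair setup, memory registration, and error handling are omitted, and the function name is illustrative.

```cpp
// Hedged sketch: issue a one-sided RDMA READ and wait for its completion entry.
// Assumes qp, cq, a registered local MR (mr, local_buf), and the remote
// address/rkey were exchanged during connection setup.
#include <infiniband/verbs.h>
#include <cstdint>

bool rdma_read_once(ibv_qp *qp, ibv_cq *cq, ibv_mr *mr, void *local_buf,
                    uint64_t remote_addr, uint32_t rkey, uint32_t len) {
  ibv_sge sge{};
  sge.addr   = reinterpret_cast<uintptr_t>(local_buf);
  sge.length = len;
  sge.lkey   = mr->lkey;

  ibv_send_wr wr{}, *bad = nullptr;
  wr.opcode              = IBV_WR_RDMA_READ;   // one-sided verb: no remote CPU involved
  wr.sg_list             = &sge;
  wr.num_sge             = 1;
  wr.send_flags          = IBV_SEND_SIGNALED;  // request a completion entry in the CQ
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey        = rkey;

  if (ibv_post_send(qp, &wr, &bad) != 0) return false;  // enqueue the request in the SQ

  ibv_wc wc{};
  while (ibv_poll_cq(cq, 1, &wc) == 0) { /* busy-poll until the RNIC reports completion */ }
  return wc.status == IBV_WC_SUCCESS;
}
```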

2.3 CXL Protocols

CXL is an open industry standard based on PCIe for high-speed communication between processors, accelerators, and memory in a cache-coherent way with Load/Store semantics. CXL contains three separate protocols: CXL.io, CXL.cache, and CXL.mem. Among them, CXL.mem allows the CPU to access the underlying memory directly via the PCIe bus (FlexBus) without involving page faults or DMAs. Therefore, CXL can provide byte-addressable memory (CXL memory) in the same physical address space and allows transparent memory allocation. With PCIe 5.0, CPU-to-CXL interconnect bandwidth is similar to that of cross-socket interconnects in the NUMA architecture. From a software perspective, CXL memory can be regarded as a CPU-less NUMA node, and its access latency is similar to NUMA access latency (about 90\(\sim\)150 ns [21, 45]). The CXL 3.0 specification [45] even reports that CXL.mem access latency is close to that of normal DRAM (about 40-ns read latency and 80-ns write latency). However, the CXL prototypes used in most current papers have significantly higher access latency, about 170 to 250 ns [30, 32, 49].


3 EXISTING MEMORY DISAGGREGATION ARCHITECTURES AND LIMITATIONS

3.1 RDMA-Based Approaches

According to the way data is managed, RDMA-based memory disaggregation can be roughly divided into two approaches: page based and object based. The page-based approach (e.g., Infiniswap [22], LegoOS [44], Fastswap [3]) uses virtual memory mechanisms to cache remote pages from the memory pool in a local DRAM cache. It achieves remote memory pool access by triggering page faults and swapping local and remote memory pages. Its advantages are simplicity, ease of use, and transparency to applications. The object-based approach (e.g., FaRM [17] and FaRMV2 [43], AIFM [41], and Gengar [18]) achieves fine-grained memory management with custom object-based semantics, such as key-value interfaces. One-sided verbs enable the CNs to directly access the MNs without involving remote CPUs, which is more suitable for memory disaggregation given the near-zero computation power of MNs. However, if only one-sided RDMA primitives are used for communication, a single data query may involve multiple read and write operations, resulting in high latency [25, 26]. Therefore, many studies propose high-performance RPC frameworks based on RDMA (e.g., FaSST [26] and FaRM [17]) or adopt general RPC libraries without RDMA primitives [24].

In general, the shortcomings of the RDMA-based approaches can be summarized as follows.

Problem 1: High Latency. There is a large latency gap between RDMA communication and memory access, more than 20× (Table 2). This makes RDMA networks a major performance bottleneck for RDMA-based memory disaggregation systems.

Table 2. Latency Comparison

                                | Latency
DRAM                            | \(\sim\)80 ns
CXL [21, 45]                    | 90\(\sim\)150 ns
RDMA [20, 47, 62]               | \(\sim\)2 \(\mu\)s
Page Based (e.g., Fastswap [3]) | \(\sim\)13 \(\mu\)s (remote)
Object Based (e.g., FaRM [17])  | \(\sim\)8 \(\mu\)s (remote)

Problem 2: High Overhead. The page-based approach suffers from performance degradation due to page-fault overheads [21, 41, 56]. Fastswap [3], for example, has high remote access latency, as shown in Table 2 (the experiment details are presented in Section 6.2). In addition, for fine-grained accesses, read/write amplification occurs because data is always transferred at page granularity. The object-based approach can avoid page-fault overheads, but it requires intrusive code modifications that vary with the semantics of the application, leading to higher complexity.

Problem 3: Sub-Optimal Communication. Existing RDMA communication methods are not optimal and do not take full advantage of RDMA bandwidth. We test throughput with different data sizes using mainstream communication frameworks: (1) RPC only (using eRPC [24]) and (2) a hybrid of one-sided RDMA and RPC [17, 26], which uses RPC to obtain the remote data address first and then accesses the data via one-sided RDMA verbs. As shown in Figure 3, RPC communication is better suited for small data transmission, whereas the hybrid mode has higher throughput for large data. The cut-off point is 512 bytes, which inspires us to design a dynamic strategy. The RDMA-based solutions are summarized in Table 3.

Table 3. Comparison of Memory Disaggregation Approaches

Category           | RDMA Based                      | RDMA Based                     | CXL Based                     | CXL Based                  | Hybrid
Systems            | Page based (e.g., Fastswap [3]) | Object based (e.g., FaRM [17]) | DirectCXL [21]                | CXL-over-Ethernet [56]     | Rcmp
Physical Link      | RDMA                            | RDMA                           | CXL                           | CXL+Ethernet               | CXL+RDMA
Latency            | High: \(\sim\)13 \(\mu\)s       | Medium: \(\sim\)8 \(\mu\)s     | Low: 700 ns\(\sim\)1 \(\mu\)s | Medium: \(\sim\)6 \(\mu\)s | Low: \(\sim\)3 \(\mu\)s
Software Overhead  | High                            | Medium                         | Low                           | Low                        | Low
Network Efficiency | Low                             | Medium                         | High                          | Medium                     | High
Scalability        | High                            | Medium                         | Medium: within rack level     | Medium                     | High

Fig. 3. Communication test.

3.2 CXL-Based Approaches

Many studies have proposed memory disaggregation architectures using CXL [10, 21, 30, 56] to overcome the shortcomings of RDMA-based approaches and achieve lower access latency. CXL-based memory disaggregation can provide a shared cache-coherent memory pool and support cache-line-granularity access without invasive changes. Based on the characteristics of CXL, the CXL-based approaches have the following advantages over RDMA-based approaches:

Less software overhead: CXL maintains a unified, coherent memory space between the CPU (host processor) and any memory on the attached CXL device. CXL-based approaches reduce software stack complexity without page-fault overheads [21, 30].

Fine-grained access: CXL allows CPUs, GPUs, and other processors to access the memory pool with native Load/Store instructions. CXL-based approaches support cache-line granularity, which avoids the read/write amplification problem of RDMA-based approaches.

Lower latency: CXL provides near-memory latency and CXL-based approaches alleviate network bottlenecks and memory over-provisioning issues [21, 46].

Elasticity: CXL-based approaches promise excellent scalability as more PCIe devices can be attached across switches unlike DIMM (Dual Inline Memory Module) used for DRAM.

However, the CXL-based approaches also suffer from the following shortcomings.

Problem 1: Physical Distance Limitation. Due to the limited length of the PCIe bus, the CXL-based approach is limited to within the rack level [45, 56] (existing CXL products support a maximum distance of 2 m [14]), so it cannot be used directly in large-scale datacenters. PCIe flexible extension cables can be used, but they still have a maximum length limitation (\(\le\)15 inches) [42]. An ongoing research effort is to convert the PCIe 5.0 electrical signal into an optical signal [16], but this is still in the testing phase and requires specialized hardware. This approach also has potential overheads including signal loss, power consumption, and deployment costs. In addition, at a 3- to 4-m distance, the photon travel time alone exceeds the first-word access latency of modern memory. Therefore, if CXL-based memory disaggregation extends beyond rack boundaries, the added latency becomes noticeable for latency-sensitive applications [34].

Problem 2: High Cost. Worse, CXL products are immature, and most research is still in the emulation phase, using FPGA-based prototypes or simulation with NUMA nodes. Since early CXL products using FPGAs are not yet optimized for latency [38] and report higher latency (more than 250 ns) [49], NUMA-based simulation remains the more popular approach for CXL proofs of concept [30, 32, 55, 60]. In addition, the high price of current CXL products makes it impractical to replace all RDMA hardware in a datacenter with CXL hardware.

3.3 Hybrid Approaches and Challenges

A possible solution is to use the network to overcome the rack distance limitation of CXL. The state-of-the-art example is CXL-over-Ethernet [56]. It deploys the compute and memory pools in separate racks and uses CXL in the compute pool to provide a globally coherent memory abstraction, so the CPU can access the disaggregated memory directly via Load/Store semantics. Ethernet is then used to transmit CXL memory access requests to the memory pool. This approach supports cache-line access granularity, but each remote access still traverses the network and cannot take advantage of CXL’s low latency. An ongoing optimization is to carefully design the caching strategy in CXL memory [32, 56]. Table 3 compares existing memory disaggregation methods, all of which have advantages and limitations.

As many researchers believe, CXL and RDMA are complementary technologies, and combining the two is a promising research direction [14, 34]. In this article, we explore a new hybrid architecture that combines the CXL-based and RDMA-based approaches (i.e., build small memory pools via CXL within each rack and connect these small memory pools via RDMA). This symmetrical architecture allows CXL to be fully exploited in each small memory pool and improves scalability with RDMA. However, this hybrid architecture faces the following challenges.

Challenge 1: Granularity Mismatch. CXL-based approaches support cache coherence with the cache line as the access granularity. The access granularity of RDMA-based approaches is a page or an object, much larger than a cache line. The hybrid architecture therefore requires redesigned memory management and access mechanisms.

Challenge 2: Communication Mismatch. RDMA communication relies on the RNIC and message queues, whereas CXL is based on high-speed links and cache coherence protocols. A unified and efficient abstraction is needed for inter- and intra-rack communication.

Challenge 3: Performance Mismatch. The latency of RDMA is much higher than that of CXL (over 10×). This performance mismatch results in non-uniform access patterns (similar to the NUMA architecture); that is, accessing memory in the local rack (local-rack access for short) is much faster than accessing a remote rack (remote-rack access).


4 DESIGN IDEAS

To address these challenges, we present Rcmp, a novel hybrid memory pool system with RDMA and CXL. Rcmp achieves better performance and scalability, as shown in Table 3. The main design tradeoffs and ideas are described as follows.

4.1 Global Memory Management

Rcmp achieves global memory management via a page-based approach for two reasons. First, the page management method is easy to adopt and transparent to all user applications. Second, the page-based approach better fits the byte access feature of CXL than the object-based approach, which incurs additional indexing overhead. Each page is divided into many slabs for fine-grained management. In addition, Rcmp provides global address management for the memory pool and initially uses a centralized Metadata Server (MS) to manage the assignment and mapping of memory addresses (Section 5.1).

Rcmp accesses and moves data at cache-line granularity, decoupling from memory page size. Since CXL supports memory semantics, Rcmp can naturally enable access at cache-line granularity within the rack. For remote-rack access, Rcmp avoids performance degradation by using direct access mode (Direct-I/O) instead of page swapping triggered by page faults (Section 5.1).

4.2 Efficient Communication Mechanism

As shown in Figure 4, the hybrid architecture has three optional methods for remote-rack communication. In method (a), each CN accesses the memory pool in the remote rack through its own RNIC. This approach has obvious drawbacks: first, the high cost of excessive RNIC devices; second, each CN has both a CXL link and an RDMA interface, resulting in high consistency-maintenance overheads; and third, contention for the limited RNIC memory causes frequent cache invalidation and higher communication latency [17, 63]. In method (b), one Daemon server (equipped with an RNIC) per rack manages access requests to remote racks. The Daemon server reduces cost and consistency overhead, but a single Daemon (with one RNIC) limits RDMA bandwidth. In method (c), CNs are grouped using hashing, with each group corresponding to a Daemon, so that a single Daemon does not become a performance bottleneck. All Daemons are built on the same CXL memory, so consistency is easily guaranteed. Rcmp supports the latter two methods, and method (b) is adopted by default for small-scale deployments.

Fig. 4. Inter-rack communication methods.

As with the latest memory disaggregation solutions [17, 43, 61], Rcmp uses the lock-free ring buffer to achieve efficient intra- and inter-rack communications.

Intra-Rack Communication. Once the Daemon is introduced, CNs first need to communicate with the Daemon to determine where the data is stored. A simple solution is to maintain a ring buffer in CXL memory for communication between a CN and the Daemon, but this may cause message blocking in the hybrid architecture. As shown in Figure 5, CNs add access requests to the ring buffer and wait for the Daemon to poll them. In this example, CN1 first sends Msg1, and then CN2 sends Msg2. When its data is filled in, the current message (Msg1) completes and the next message (Msg2) is processed. If Msg1 is a remote-rack access request and Msg2 is a local-rack access request, then due to the performance gap between RDMA and CXL, Msg2 may be filled first. Since each message is of variable length, the Daemon cannot obtain Msg2’s head pointer to skip Msg1 and process Msg2 first. Msg2 must wait for Msg1 to complete, causing message blocking. To avoid this, Rcmp decouples local- and remote-rack accesses and uses different ring buffer structures, adopting a double-layer ring buffer for remote-rack access (Section 5.2).

Fig. 5. Communication blocking.

Inter-Rack Communication. Daemon servers in different racks communicate with each other through ring buffers with one-sided RDMA writes/reads.

4.3 Remote-Rack Access Optimization

Due to the non-uniform access characteristics, remote-rack access is the main performance bottleneck of the hybrid architecture. In addition, because of the direct I/O model, one RDMA communication is required for every remote-rack data access regardless of granularity, incurring high latency, especially for frequent small accesses. Rcmp addresses this problem in two ways: reducing and accelerating remote-rack accesses.

Reducing Remote-Rack Accesses. Skewed accesses and hot spots exist widely in real-world datacenters [12, 59]. Accordingly, Rcmp proposes a page-based hotness identification and user-level hot-page swapping scheme to migrate frequently accessed pages (hot pages) to the local rack, reducing remote-rack accesses (Section 5.3).

To further leverage temporal and spatial locality, Rcmp caches fine-grained accesses of the remote rack in CXL memory and batches write requests to the remote rack (Section 5.4).

Accelerating RDMA Communications. Rcmp proposes a high-performance RDMA RPC (RRPC) framework with a hybrid transmission mode and other optimizations (e.g., doorbell batching) to take full advantage of the high bandwidth of RDMA networks (Section 5.5).


5 RCMP SYSTEM

In this section, we describe the Rcmp system and optimization strategies in detail.

5.1 System Overview

The Rcmp system overview is shown in Figure 6. Rcmp manages clusters in units of racks. All CNs and MNs in a rack are interconnected with CXL links, which is equivalent to a small CXL memory pool. Different racks are connected via RDMA to form a larger memory pool. Rcmp can achieve better performance than RDMA-based systems and higher scalability than CXL-based systems. The MS is used for global address assignment and metadata maintenance. Within a rack, all CNs share a unified CXL memory. The CN Lib provides the APIs of the memory pool. The Daemon server is the central control node of the rack. It is responsible for handling access requests, including CXL requests (CXL Proxy) and RDMA requests (Message Manager), swapping hot pages (Swap Manager), managing the slab allocator, and maintaining the CXL memory space (Resource Manager). The Daemon runs on a server within each rack, just like the CNs. In addition, Rcmp is a user-level architecture, avoiding context-switching overhead between kernel and user space.

Fig. 6. Rcmp system overview.

Global Memory Management. Rcmp provides global memory address management, as shown in Figure 7(a). The MS handles memory allocation at coarse granularity (pages). The global address GAddr (page_id, page_offset) consists of the page id assigned by the MS and the page offset in CXL memory. Rcmp uses two hash tables to store address mappings. Specifically, the page directory (in the MS) records the mapping of page id to rack, and the page table (in the Daemon) records the mapping of page id to CXL memory. In addition, to support fine-grained data access, Rcmp uses the slab allocator (an object-caching kernel memory allocator) [8] to handle fine-grained memory allocations. A page is a collection of slabs whose sizes are powers of 2.
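To make the address layout concrete, the following is a minimal sketch of the global address and the two mapping tables described above; the field widths, container choices, and helper function are illustrative assumptions rather than the actual Rcmp definitions.

```cpp
// Hedged sketch of GAddr and the page directory / page table mappings.
#include <cstdint>
#include <unordered_map>

struct GAddr {
  uint64_t page_id;      // assigned by the Metadata Server (MS)
  uint64_t page_offset;  // offset within the page in CXL memory
};

// Page directory (kept in the MS): page id -> rack that owns the page.
std::unordered_map<uint64_t, uint32_t> page_directory;

// Page table (kept in each rack's Daemon): page id -> start offset of the page
// in that rack's CXL memory.
std::unordered_map<uint64_t, uint64_t> page_table;

// Resolving a global address inside the local rack is a table lookup plus an
// offset addition; the resulting CXL address is then accessed via Load/Store.
inline uint64_t to_cxl_offset(const GAddr &ga) {
  return page_table.at(ga.page_id) + ga.page_offset;
}
```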

Fig. 7. Global memory and address management.

The memory space includes CXL memory and local DRAM of CNs and Daemon, as shown in Figure 7(b). In a rack, each CN has small local DRAM for caching the metadata of local-rack pages including the page table and the hotness information. The local DRAM of Daemon (1) stores the local-rack page table and page hotness metadata of remote accesses, and (2) caches the page directory and the remote-rack page table. CXL memory consists of two parts: a large shared coherent memory space and an owner memory space registered by each CN. The owner memory is used as a CXL cache of remote racks for write buffering and page caching.

Interface. As shown in Table 4, Rcmp provides the usual memory pool interfaces, including Open/Close, memory Alloc/Free, data Read/Write, and Lock/UnLock. The Open operation opens the memory pool according to the user configuration (ClientOptions) and returns a memory pool context (PoolContext) pointer upon success; otherwise, it returns nullptr. For the Alloc operation, the Daemon finds a free page in CXL memory for the application and updates the page table. If there are no free pages, the MS allocates the page in the local rack based on the proximity principle. If there is no free space in the rack, the page is allocated in a remote rack chosen randomly (e.g., via a hash function). The Read/Write operations read/write data of any size via CXL in the local rack or RDMA in a remote rack. The lock operations, including RLock and WLock, are for concurrency control. The lock address has to be initialized first. An example of using these APIs to program Rcmp is shown in Figure 8.

Table 4. Rcmp APIs

API                                              | Description
PoolContext *Open(ClientOptions options)         | Open the Rcmp memory pool
void Close(PoolContext *pool_ctx)                | Close Rcmp
GAddr Alloc(size_t size)                         | Allocate memory from the memory pool
Status Free(GAddr gaddr, size_t size)            | Free the memory
Status Read(GAddr gaddr, size_t size, void *buf) | Read data from gaddr and write it to buf
Status Write(GAddr gaddr, size_t size, void *buf)| Write data from buf to gaddr
Status Lock(GAddr gaddr)                         | Add a write/read lock on the address gaddr
Status UnLock(GAddr gaddr)                       | Unlock the address gaddr

Fig. 8. Sample code using Rcmp APIs.
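Since the Figure 8 listing is not reproduced here, the following is a hedged usage sketch built only from the Table 4 signatures; the header name and the ClientOptions fields are assumptions and may differ from the actual sample.

```cpp
// Hedged usage sketch of the Table 4 APIs; header name and option fields are assumed.
#include "rcmp.hpp"   // assumed header exposing the APIs listed in Table 4

int main() {
  ClientOptions options;               // e.g., rack id and Daemon address (assumed fields)
  PoolContext *pool = Open(options);   // open the Rcmp memory pool
  if (pool == nullptr) return -1;      // Open returns nullptr on failure

  GAddr gaddr = Alloc(64);             // allocate 64 bytes from the pool

  char in[64] = "hello rcmp";
  Lock(gaddr);                         // write/read lock for concurrency control
  Write(gaddr, sizeof(in), in);        // CXL Store in the local rack, RDMA for a remote rack
  char out[64];
  Read(gaddr, sizeof(out), out);       // read the data back into out
  UnLock(gaddr);

  Free(gaddr, 64);                     // release the memory and close the pool
  Close(pool);
  return 0;
}
```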

Workflow. The access workflow of Rcmp is shown in Figure 9. When an application in a CN accesses the memory pool with a Read or Write operation, the workflow is as follows. First, if the page is found in the page table cached in local DRAM, the CN accesses CXL memory directly via Load/Store operations. Second, otherwise, the rack where the page is located is initially obtained from the MS, and the page directory is cached in the local DRAM of the Daemon; after that, there is no need to contact the MS, and the CN communicates with the Daemon directly. Third, for local-rack accesses, the CN gets the location (page offset) of the data by searching the page table in the Daemon and then accesses CXL memory directly. Fourth, for remote-rack accesses, the Daemon obtains the page offset and caches the page table by communicating with the Daemon in the remote rack. The data in the remote rack is then accessed directly via one-sided RDMA READ/WRITE operations. In this process, the Daemon in the remote rack receives access requests through the CXL Proxy and accesses its CXL memory. Fifth, if a hot page is detected during remote-rack access, the page swap mechanism is triggered. Sixth, if a WLock or RLock exists, Rcmp enables write buffering or page caching to reduce remote-rack accesses.

Fig. 9. The workflows of Rcmp.

5.2 Intra-Rack Communication

A CN needs to communicate with the Daemon to determine whether an access is a local- or remote-rack access, but there is a significant difference in access latency between the two cases. To prevent communication blocking, Rcmp uses two ring buffer structures for the different access scenarios, as shown in Figure 10.

Fig. 10. Intra-rack communication mechanism.

For local-rack accesses, a normal ring buffer is used for communication; the green buffer in the figure is an example. In this case, since all accesses have ultra-low latency (via CXL), blocking does not occur even in high-conflict situations. In addition, the ring buffers (and the RDMA QPs) are shared across the threads of one CN based on Flock’s method [36] for high concurrency.

For remote-rack accesses, a double-layer ring buffer is used for efficient and concurrent communication, as shown in Figure 10. The first ring buffer (polling buffer) stores the message metadata (e.g., type, size) and a pointer ptr that points into the second buffer (data buffer), which stores the message data. Entries in the polling buffer are of fixed length, whereas messages in the data buffer are of variable length. When a message in the data buffer is complete, its request is added to the polling buffer. The Daemon polls the polling buffer and processes the message that the current ptr points to. For example, in Figure 10, the later Msg2 in the data buffer is filled first, so its request is added to the polling buffer first. Therefore, Msg2 is processed first without blocking. Additionally, different messages can be processed concurrently. In the implementation, we use a lock-free KFIFO queue [50] as the polling buffer, and the data buffer is a normal ring buffer.
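As a rough illustration of the publish-after-complete idea, the sketch below uses simple std::atomic counters in place of the lock-free KFIFO queue [50]; names, sizes, and the omission of wrap-around and full-buffer handling are assumptions for brevity, not the Rcmp implementation.

```cpp
// Hedged sketch of the double-layer ring buffer for remote-rack requests.
#include <atomic>
#include <cstdint>
#include <cstring>

constexpr size_t kPollSlots = 1024;
constexpr size_t kDataBytes = 1 << 20;

struct PollEntry {                 // fixed-length metadata polled by the Daemon
  std::atomic<bool> ready{false};
  uint32_t type{0};
  uint32_t size{0};
  uint64_t data_off{0};            // "ptr" into the data buffer holding the message body
};

struct DoubleRingBuffer {
  PollEntry poll[kPollSlots];              // polling buffer (fixed-length entries)
  std::atomic<uint64_t> poll_tail{0};
  uint64_t poll_head = 0;                  // only the Daemon advances the head
  uint8_t data[kDataBytes];                // data buffer (variable-length messages)
  std::atomic<uint64_t> data_tail{0};

  // CN side: copy the message body first, then publish its poll entry, so a slow
  // remote-rack message can never block a later, already-complete one.
  void Send(uint32_t type, const void *msg, uint32_t size) {
    uint64_t off  = data_tail.fetch_add(size) % kDataBytes;  // wrap/overflow handling elided
    std::memcpy(data + off, msg, size);
    uint64_t slot = poll_tail.fetch_add(1) % kPollSlots;
    poll[slot].type = type;
    poll[slot].size = size;
    poll[slot].data_off = off;
    poll[slot].ready.store(true, std::memory_order_release); // publish to the Daemon
  }

  // Daemon side: consume whichever entry is complete at the head of the polling buffer.
  bool Poll(PollEntry &out) {
    PollEntry &e = poll[poll_head % kPollSlots];
    if (!e.ready.load(std::memory_order_acquire)) return false;
    out.type = e.type; out.size = e.size; out.data_off = e.data_off;
    e.ready.store(false, std::memory_order_relaxed);
    ++poll_head;
    return true;
  }
};
```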

5.3 Hot-Page Identification and Swapping

To reduce remote-rack accesses, Rcmp designs a hot-page identification and swapping policy. It aims to identify frequently accessed hot pages in remote racks and migrate them to the local rack.

Hot-Page Identification. An expiring policy is proposed to identify hot pages. Specifically, the hotness of a page is measured by its access frequency and the time elapsed since its last access. We maintain three variables named \(Cur_r\), \(Cur_w\), and \(lastTime\) to denote the number of read accesses, the number of write accesses, and the time of the most recent access of a page. When accessing the page and computing its hotness, we first obtain \(\Delta t\), which equals the present time minus \(lastTime\). If \(\Delta t\) is greater than the valid lifetime threshold \(T_l\), the page is defined as “expired,” and \(Cur_r\) and \(Cur_w\) are cleared to zero. The page hotness equals \(\alpha \times (Cur_r + Cur_w) + 1\), where \(\alpha\) is the exponential decay factor, \(\alpha = e^{- \lambda \Delta t}\), and \(\lambda\) is a decay constant. Then, \(Cur_r\) or \(Cur_w\) is incremented by 1 according to the access type. If the hotness is greater than the threshold \(H_p\), the page is “hot.” In addition, if \(Cur_r/Cur_w\) of a hot page is greater than the threshold \(R_{rw}\), the page is “read hot.” All thresholds are configurable and have default values. In a rack, all CNs (local DRAM) maintain the hotness values (or hotness metadata) of local-rack pages, and the hotness metadata of remote-rack pages is stored in the Daemon. The memory overhead is small because each page maintains only three variables, about 32 bytes. The time complexity of updating the hotness metadata of a page is also low, only O(1).
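A minimal sketch of this expiring hotness policy follows; the variable and threshold names mirror the text, the default values come from Section 6.1, and the function signature is illustrative rather than the actual Rcmp interface.

```cpp
// Hedged sketch of per-page hotness bookkeeping.
#include <cmath>
#include <cstdint>

struct PageHotness {
  uint64_t cur_r = 0;        // number of read accesses (Cur_r)
  uint64_t cur_w = 0;        // number of write accesses (Cur_w)
  double   last_time = 0;    // time of the most recent access (lastTime), in seconds
};

constexpr double kTl     = 100.0;  // valid lifetime threshold T_l (s)
constexpr double kLambda = 0.04;   // decay constant lambda
constexpr double kHp     = 4.0;    // hot-page threshold H_p
constexpr double kRrw    = 0.9;    // read-hot ratio threshold R_rw

// Record one access and return the page hotness; the page is "hot" if the result > H_p.
double update_hotness(PageHotness &h, double now, bool is_write) {
  double dt = now - h.last_time;
  if (dt > kTl) { h.cur_r = 0; h.cur_w = 0; }          // page has "expired"
  double alpha   = std::exp(-kLambda * dt);            // exponential decay factor
  double hotness = alpha * static_cast<double>(h.cur_r + h.cur_w) + 1.0;
  if (is_write) ++h.cur_w; else ++h.cur_r;              // count the current access
  h.last_time = now;
  return hotness;
}

// A hot page is "read hot" if Cur_r / Cur_w exceeds R_rw.
bool is_read_hot(const PageHotness &h) {
  return h.cur_w == 0 || static_cast<double>(h.cur_r) / h.cur_w > kRrw;
}
```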

Hot-Page Swapping and Caching. Rcmp proposes a user-level swap mechanism, unlike the swap mechanism of page-based systems (e.g., LegoOS, Infiniswap), which relies on the host’s kernel swap daemon (kswapd) [3, 21, 22, 44]. As shown in Figure 11, taking a CN in the R1 rack that wants to swap the hot page in the R2 rack as an example, the swapping process is as follows. ① A swap request of R1 (Daemon) is sent to the MS and is added to the FIFO queue, which is used to avoid flooding the system with repeated requests for the same page. ② R1 selects the free page as the page to be swapped. If there is no free space, R1 collects the hotness metadata of all CNs and selects the page with the lowest hotness as the page to be swapped. If the page to be swapped is still “hot” (e.g., scan workloads), then stop the swap process and turn to ⑥. ③ R2 sums the hotness of the page (to be swapped with R1) in all CNs and compares the result with the hotness of R1. If R2 has higher hotness, reject the swap process and turn to ⑥, but if the page is “read hot” for R1, the page will be cached in R1’s CXL memory. The page cache is read-only and will be deleted when the page is to be written (see Section 5.4). This comparison-based approach avoids frequent migration of pages (page ping-pong). In addition, performance under read-intensive workloads can be improved by caching “read hot” data. ④ R2 disables hotness metadata about the page and updates the page table in all CNs. ⑤ Swap the hot page based on two one-sided RDMA operations. ⑥ Update the page table and R1’s request is dequeued.

Fig. 11. Hot-page swapping.

5.4 CXL Cache and Synchronization Mechanism

Rcmp proposes a simple and efficient caching and synchronization mechanism based on Lock/UnLock operations to reduce frequent remote-rack accesses under massive fine-grained workloads. The main idea is that the lock and the data are coupled together at cache-line granularity, which means that data in the same cache line shares the same lock [9]. Rcmp designs a write buffer and a page cache for each CN in CXL memory and achieves inter-rack consistency with the synchronization mechanism. Intra-rack consistency is guaranteed by CXL without additional policies.

CXL Write Buffer. A WLock operation makes the requesting node (CN) become an owner node (no other nodes can modify it). In this case, the CN can cache the fine-grained write requests (less than 256B by default) for remote racks in the CXL write buffer. When the buffer reaches a certain size or the WLock is unlocked, Rcmp batches writes to the corresponding remote racks asynchronously with background threads. Our current implementation uses two buffer structures, and when one buffer is full, all write requests go to the new buffer. The buffer structure is a high-concurrency SkipList by default, similar to the memtable structure in the LSM-KV stores [12].
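As a rough illustration of this buffering decision, the sketch below substitutes a std::map for the concurrent SkipList used in Rcmp; the flush callback, names, and the 256B/64-MB thresholds (taken from the text and Section 6.1) are stated assumptions, and in Rcmp the flush actually happens asynchronously on background threads.

```cpp
// Hedged sketch of write buffering under a WLock.
#include <cstdint>
#include <map>
#include <vector>

constexpr size_t kSmallWrite = 256;            // buffer only writes smaller than 256B
constexpr size_t kFlushBytes = 64ull << 20;    // flush when the 64-MB buffer fills

struct WriteBuffer {
  std::map<uint64_t, std::vector<uint8_t>> entries;  // GAddr -> pending data (SkipList in Rcmp)
  size_t bytes = 0;
};

void buffered_write(WriteBuffer &wb, uint64_t gaddr, const void *buf, size_t size,
                    void (*rdma_write)(uint64_t, const void *, size_t)) {
  if (size >= kSmallWrite) {                   // large writes bypass the buffer
    rdma_write(gaddr, buf, size);
    return;
  }
  auto &e = wb.entries[gaddr];
  e.assign(static_cast<const uint8_t *>(buf),
           static_cast<const uint8_t *>(buf) + size);
  wb.bytes += size;
  if (wb.bytes >= kFlushBytes) {               // batch the buffered writes to the remote rack
    for (auto &[addr, data] : wb.entries)
      rdma_write(addr, data.data(), data.size());
    wb.entries.clear();
    wb.bytes = 0;
  }
}
```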

CXL Page Cache. Similarly, when using a RLock operation, the CN becomes a shared node. The page can be cached in the CXL page cache at ③ of the page swapping process. When the page is to be written or RLock is unlocked, CN invalidates the page cache.

5.5 RRPC Framework

Compared with traditional RDMA and RPC frameworks, RRPC adopts a hybrid approach that adaptively chooses between RPC and one-sided RDMA communication for different data patterns. RRPC is inspired by the test results in Figure 3 and uses 512B as the threshold to dynamically select the communication mode. The main idea is to efficiently leverage the high bandwidth of RDMA to amortize communication latency. As shown in Figure 12, RRPC includes three communication modes.

Fig. 12. Different communication modes in RRPC.

Pure RPC mode is for communications with less than 512B of transmitted data, including scenarios such as locking during transactions, data index queries, and memory allocation.

RPC and one-sided mode is suitable for unstructured big data (more than 512B) and unknown data sizes, such as object storage scenarios. In this case, it is difficult for the client to know the size of the object to be accessed before requesting the server. Therefore, it is necessary to obtain the remote address via RPC first, allocate a local buffer of the specified size, and finally fetch the remote data via a one-sided RDMA READ operation.

RPC zero-copy mode is for structured big data (more than 512B) with a fixed size, such as SQL scenarios. Because the data has a fixed size, the client can carry the address of its local buffer in the RPC request, and the data is written back directly via a one-sided RDMA WRITE operation.

For the latter two modes, once the page address is acquired via RPC, Rcmp caches it and uses only one-sided RDMA reads/writes for subsequent accesses. In addition, RRPC adopts QP sharing, doorbell batching, and other techniques to optimize RDMA communication, drawing on the strengths of other works [17, 26, 63].
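A minimal sketch of the mode selection described above; the enum and function names are illustrative assumptions, not the actual RRPC interface.

```cpp
// Hedged sketch of RRPC's dynamic mode selection.
#include <cstddef>

enum class RrpcMode {
  PureRpc,       // < 512B: locks, index queries, memory allocation, ...
  RpcThenRead,   // >= 512B or size unknown in advance: RPC for the address, then RDMA READ
  RpcZeroCopy    // >= 512B with fixed size: RPC carries a local address, server RDMA WRITEs back
};

constexpr size_t kRpcThreshold = 512;  // cut-off point observed in Figure 3

RrpcMode choose_mode(size_t size_hint, bool size_known) {
  if (size_known && size_hint < kRpcThreshold) return RrpcMode::PureRpc;
  return size_known ? RrpcMode::RpcZeroCopy : RrpcMode::RpcThenRead;
}
```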


6 EVALUATION

In this section, we evaluate Rcmp’s performance using different benchmarks. The implementation of Rcmp and the experiment setup are introduced first (Sections 6.1 and 6.2). Next, we compare Rcmp with four other remote memory systems using a micro-benchmark (Section 6.3). Then, we run a key-value store with YCSB benchmarks to show the performance benefits of Rcmp (Section 6.4). Finally, we evaluate the impact of key technologies in Rcmp (Section 6.5).

6.1 Implementation

Rcmp is a user-level system without kernel-space modifications, implemented in 6,483 lines of C++ code. In Rcmp, a page is 2 MB by default, since this achieves a good balance between metadata size and latency; each write buffer is 64 MB, and the page cache is an LRU cache holding 50 pages; the threshold \(T_l\) is 100 s, \(H_p\) is 4, \(\lambda\) is 0.04, and \(R_{rw}\) is 0.9 by default. The thresholds can be tuned for different application scenarios. The RRPC framework is implemented on top of eRPC [24].
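For reference, the sketch below simply collects these defaults in one place; the struct and field names are illustrative, not the actual Rcmp configuration type.

```cpp
// Hedged sketch of the default parameters listed above.
#include <cstddef>

struct RcmpConfig {
  size_t page_size        = 2ull << 20;   // 2-MB pages
  size_t write_buf_size   = 64ull << 20;  // 64-MB write buffer (two buffers are rotated)
  size_t page_cache_pages = 50;           // LRU page cache capacity
  double T_l    = 100.0;                  // valid lifetime threshold (s)
  double H_p    = 4.0;                    // hot-page threshold
  double lambda = 0.04;                   // hotness decay constant
  double R_rw   = 0.9;                    // read-hot ratio threshold
};
```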

CXL-enabled FPGA prototypes are now available for purchase, but we still choose NUMA-based emulation to implement CXL memory, for two reasons. First, FPGA-based prototypes show higher latency in Intel measurements [49], more than 250 ns. As presented by CXL Consortium president Siamak Tavallaei [38], “These early CXL proof of concepts and products are yet not optimized for latency. With time, the access latency of CXL memory will be significantly improved.” Second, in addition to similar access latency, the NUMA architecture is cache coherent and uses Load/Store semantics, like CXL.

6.2 Experiment Setup

All experiments are conducted on five servers, each equipped with two-socket Intel Xeon Gold 5218R CPUs @ 2.10 GHz, 128 GB of DRAM, and one 100-Gbps Mellanox ConnectX-5 RNIC. The operating system is Ubuntu 20.04 with Linux 5.4.0-144-generic. The cross-node interconnect latencies to NUMA node 0 and node 1 are 138.5 ns and 141.1 ns, and the intra-node access latencies are 93 ns and 89.7 ns.

Rcmp is compared with four other state-of-the-art remote memory systems: (1) Fastswap [3], a page-based system; (2) FaRM [17], an object-based system; (3) GAM [9], a distributed memory system that provides a cache coherence protocol over RDMA; and (4) CXL-over-Ethernet [56], a CXL-based memory disaggregation system with Ethernet (see Section 7 for details). We run Fastswap and GAM using their open source code. Since FaRM is not publicly available, we use the code from the work of Cai et al. [9]. Note that FaRM and GAM are not truly “disaggregated” architectures; their CNs have local memory of the same size as the remote memory. We modify some configurations (reducing local memory) to port them to a disaggregated architecture. Due to the lack of FPGA devices and the unpublished source code of CXL-over-Ethernet, we implement a CXL-over-Ethernet prototype based on Rcmp’s code. To be fair, the RDMA network is also used for CXL-over-Ethernet.

System Deployment and Simulated Environment. Figure 13(a) shows the envisioned architecture of Rcmp. In a rack, low-latency CXL connects CNs and MNs to form a small memory pool; RDMA connects the racks (interconnection with RDMA-enabled ToR switches). The CXL link latency is 90 to 150 ns; the RDMA network latency is 1.5 to 3 \(\mu\)s. Our test environment is shown in Figure 13(b). Due to the limited availability of devices, we use one server to simulate a rack, including a small compute pool and memory pool (or CXL memory). For Rcmp and CXL-over-Ethernet, the compute pool runs on one CPU socket, and one CPU-less MN serves as CXL memory. In the compute pool of Rcmp, different processes run different CN clients, and one process runs the Daemon. For the other systems, the memory pool is connected to the compute pool via RDMA. In addition, the memory pool or CXL memory in a rack has about 100 GB of DRAM, and the local DRAM of the compute pool is 1 GB. We use micro-benchmarks to evaluate the basic read/write performance of different systems and use the YCSB benchmarks [15] to evaluate their performance under different workloads, as shown in Table 5.

Table 5. YCSB Workloads

Workload   | A              | B             | C       | D             | E             | F
Operations | R: 50%, U: 50% | R: 95%, U: 5% | R: 100% | R: 95%, I: 5% | S: 95%, I: 5% | R: 50%, M: 50%

R, Read; U, Update; I, Insert; S, Scan; M, Read-Modify-Write.

Fig. 13. Envisioned prototype and simulated environment.

6.3 Micro-Benchmark Results

We first evaluate the overall performance and scalability of these systems by running the micro-benchmark with random read/write operations. The data size is 64B, and 100M data items are used for the read/write operations by default.

Overall Performance. As shown in Figure 14, we run a micro-benchmark 10 times in a two-rack environment under different data sizes and compare the average latency. The same number of memory pages are pre-allocated for each rack.

Fig. 14. Access latency.

The results show that Rcmp has lower and more stable write/read latency (<3.5 \(\mu\)s for writes and <3 \(\mu\)s for reads). Specifically, the write latency is reduced by 2.3 to 8.9× and the read latency by 2.7 to 8.1× compared to the other systems. This is achieved through Rcmp’s efficient utilization of CXL, together with designs such as efficient communication and hot-page swapping that minimize system latency. Fastswap has over 12-\(\mu\)s access latency, \(\sim 5.2\times\) higher than Rcmp. When the accessed data is not in the local DRAM cache, Fastswap fetches the page from the remote memory pool via expensive page faults, resulting in higher overhead. FaRM has lower read/write latency, around 8 \(\mu\)s, due to object-based data management and efficient messaging primitives that improve RDMA communication. GAM is also an object-based system and performs well (\(\sim 5 \mu\)s) when the data size is less than 512B, but its latency increases dramatically for larger data. This is because GAM uses 512B as the default cache line size, and when data spans multiple cache lines, GAM must maintain the consistency state across all cache lines synchronously, resulting in performance degradation. Furthermore, write operations are asynchronous and pipelined in GAM, which yields lower write latency (see Figure 14(a)). CXL-over-Ethernet also achieves low read and write latency (6–8 \(\mu\)s) through CXL. However, CXL-over-Ethernet deploys CXL in the compute pool and employs a caching strategy for the memory pool, which does not fully utilize the low-latency benefits of CXL. In addition, CXL-over-Ethernet is not optimized for the network, which is the main performance bottleneck of such a hybrid architecture.

Scalability. We test the scalability of the different systems by varying the number of clients and racks. Each client runs the micro-benchmark. We have five servers and can build up to five racks.

First, we compare the read/write throughput with multiple concurrent clients in a two-rack environment. As shown in Figure 15, the throughput of Rcmp scales roughly linearly with the number of clients when there are fewer than 16 clients. With more clients, scalability is limited by the single Daemon; therefore, Rcmp will adopt multiple Daemon servers for larger-scale deployments (Section 4.2). Fastswap scales almost linearly with the number of clients because of its efficient page-fault-driven remote memory accesses. FaRM also has good scalability, especially for read operations, due to its efficient communication primitives. In contrast, GAM exhibits linear scalability only up to four threads. With more clients, the performance improvement of GAM is marginal or even negative due to the software overhead of its user-level library [29, 66]. To ensure consistency, GAM has to acquire locks to check access permissions for each memory access, which has high overhead in dense access scenarios. CXL-over-Ethernet’s performance no longer improves beyond eight threads: before accessing the memory pool, all threads need to communicate with the CXL Agent, which becomes the performance bottleneck.

Fig. 15. Total throughput under different clients.

Second, we increase the number of racks and run eight clients in each rack. The accessed data of each rack is uniformly distributed across the entire memory pool. As shown in Figure 16, the throughput of Fastswap is not affected by the number of racks, showing excellent scalability. A slight performance loss occurs in Rcmp and FaRM due to competition among accesses from different racks. In Rcmp, there is also contention for hot-page swapping, which is mitigated by the hot-page identification mechanism. The cache coherence overhead of GAM becomes more pronounced in the multi-rack environment, resulting in significant performance degradation. For CXL-over-Ethernet, the agent in the compute pool limits scalability.

Fig. 16. Per-rack throughput under different racks.

In summary, Rcmp effectively leverages CXL through several innovative designs to reduce access latency and improve scalability, whereas other systems suffer from high latency or poor scalability.

6.4 Key-Value Store and YCSB Workloads

We run a general key-value store interface that is implemented as a hashtable on these systems. Next, we run widely used YCSB benchmarks [15] (six workloads as shown in Table 5) to evaluate the performance. Since the hashtable does not support range queries, the YCSB E workload is not performed. All experiments are run in a two-rack environment. We pre-load 100M key-value pairs with 64B size and then perform different workloads under Uniform and Zipfian (skewness is 0.99 by default) distributions. Figure 17 shows the throughput of different systems, all normalized to Fastswap. Based on this result, the following conclusions can be drawn.

Fig. 17. Normalized throughput under YCSB workloads.

First, Rcmp outperforms RDMA-based systems by 2 to 4× on all the workloads by utilizing CXL efficiently. Specifically, for read-intensive workloads (YCSB B, C, D), Rcmp improves performance by \(\sim 3\times\) over Fastswap by avoiding page-fault overheads and reducing data movement between racks with hot-page swapping. In addition, Rcmp’s efficient communication mechanisms and RRPC framework help achieve optimal performance. FaRM, GAM, and CXL-over-Ethernet also perform better, with a \(\sim 1.5\times\) improvement over Fastswap. This is because FaRM needs only a single one-sided lock-free read operation for a remote access, and GAM and CXL-over-Ethernet provide a uniform caching policy in local memory or CXL memory. With the memory disaggregation architecture, however, the benefits of caching are constrained by the limited local DRAM. For write-intensive workloads (YCSB A and F), Rcmp has 1.5× higher throughput.

Second, Rcmp’s performance improvement is more pronounced in Zipfian workloads, which achieves up to 3.8× higher throughput. Since hot pages are accessed frequently under Zipfian workloads, Rcmp greatly reduces slow remote-rack accesses by migrating hot pages to the local rack. GAM and CXL-over-Ethernet also have significant performance improvements due to high cache hit rates under Zipfian workloads.

In summary, Rcmp achieves superior performance over other systems by effectively leveraging CXL and other optimizations. Other systems have obvious limitations in the memory disaggregation architecture, where most of the data is obtained by accessing the remote memory pool. For instance, kernel-based, page-granular Fastswap has expensive interruption overheads. GAM’s caching strategies have limited performance improvement in the scarce local memory. In addition, some operations of FaRM rely on bilateral collaboration, which is incompatible with this disaggregated architecture due to the near-zero computation power of the memory pool.

6.5 Impact of Key Technologies

In this section, we focus on the impact of four strategies on Rcmp performance, including the communication mechanism, swapping and caching strategies, and RRPC. These strategies aim to mitigate the performance mismatch problems (between RDMA and CXL) and maximize the performance benefits of CXL.

We first apply Rcmp’s key technologies one by one, and Figure 18 shows the results under a micro-benchmark in a two-rack environment. Base represents the basic version of Rcmp, including single-layer ring buffers, eRPC, and so on. +RB represents adopting double-layer ring buffers; +Swap and +WB indicate that hot-page swapping and the CXL write buffer are further applied, respectively; and +RRPC represents adopting the RRPC framework and shows the final performance of Rcmp. Rcmp-only-CXL indicates that all read/write operations are performed within the rack and do not involve RDMA networks. Theoretically, the pure CXL solution is the upper bound of Rcmp’s performance, but it can only be deployed within a rack. The results show that these techniques progressively narrow the performance gap between Rcmp and Rcmp-only-CXL, although Rcmp still has room for improvement in tail latency and read throughput. Among them, the double-layer ring buffers reduce latency, especially tail latency. The swapping strategy greatly reduces latency and improves throughput, which is more obvious for read operations. The write buffer and RRPC improve throughput significantly, and the write buffer mainly affects write operations.

Fig. 18. Contributions of techniques to performance.

Then, we analyze the benefits of each technology in detail. All experiments are run in a two-rack environment by default.

Intra-Rack Communication. As shown in Figure 19, we compare the latency (p50, p99, p999) under a micro-benchmark for two strategies: (1) using a single ring buffer (Baseline) and (2) using two ring buffers for different access modes (Rcmp). The results show that Rcmp reduces the 50th, 99th, and 99.9th percentile latencies by up to 21.7%, 30.9%, and 51.5%, respectively. Because of the latency gap between local- and remote-rack accesses, a single communication buffer can lead to blocking problems and thus longer tail latency. Rcmp avoids this problem with its efficient communication mechanism.

Fig. 19. Ring buffers.
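
To make the mechanism concrete, the following is a minimal sketch of the two-buffer idea, assuming a simple single-producer/single-consumer design; the RingBuffer and MsgQueues names are illustrative and do not reflect Rcmp’s actual implementation:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

// Single-producer/single-consumer ring buffer for fixed-size messages.
struct Msg { uint64_t page_id; uint32_t op; uint32_t len; };

template <size_t N>
class RingBuffer {
 public:
  bool push(const Msg& m) {
    size_t h = head_.load(std::memory_order_relaxed);
    size_t t = tail_.load(std::memory_order_acquire);
    if (h - t == N) return false;           // full: caller retries or backs off
    slots_[h % N] = m;
    head_.store(h + 1, std::memory_order_release);
    return true;
  }
  std::optional<Msg> pop() {
    size_t t = tail_.load(std::memory_order_relaxed);
    size_t h = head_.load(std::memory_order_acquire);
    if (t == h) return std::nullopt;        // empty
    Msg m = slots_[t % N];
    tail_.store(t + 1, std::memory_order_release);
    return m;
  }
 private:
  std::array<Msg, N> slots_{};
  std::atomic<size_t> head_{0}, tail_{0};
};

// Two independent buffers: local-rack requests are never queued behind
// remote-rack requests, which are orders of magnitude slower.
struct MsgQueues {
  RingBuffer<4096> local_rack;    // served via CXL inside the rack
  RingBuffer<4096> remote_rack;   // served via RDMA across racks
};
```

Because local- and remote-rack requests are enqueued separately, a burst of slow remote-rack requests cannot delay local-rack requests waiting behind them in the same queue, which is what inflates tail latency in the single-buffer Baseline.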

Hot-Page Swapping. We run a micro-benchmark under different distributions (Uniform, Zipfian) to evaluate the effect of hot-page swapping. The results are shown in Figure 20, and two conclusions can be drawn. First, the hot-page swapping policy significantly improves performance compared to the no-swapping policy, especially for skewed workloads. For example, the swapping policy (\(H_p=3\)) improves throughput by 5% on Uniform workloads and 35% on Zipfian workloads. Second, overly frequent page swapping degrades performance. When the hotness threshold is set very low (e.g., \(H_p=1\)), throughput plummets because nearly every remote access triggers a page swap (similar to page-based systems), resulting in high overhead.

Fig. 20. Hot-page swap.
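
As an illustration of how such a threshold-based policy can be realized, the sketch below counts remote accesses per page and signals a swap once the count reaches the hotness threshold \(H_p\); the class name and the simple per-page counter (with no decay or sliding window) are our assumptions, not Rcmp’s exact bookkeeping:

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of threshold-based hot-page identification: a remote page whose
// access count reaches the hotness threshold H_p is migrated (swapped)
// into the local rack's CXL memory.
class HotPageTracker {
 public:
  explicit HotPageTracker(uint32_t hp) : hp_(hp) {}

  // Called on every remote-rack page access; returns true if the page
  // should now be swapped into the local rack.
  bool RecordRemoteAccess(uint64_t page_id) {
    uint32_t& cnt = counts_[page_id];
    if (++cnt >= hp_) {
      cnt = 0;            // reset after triggering a swap
      return true;
    }
    return false;
  }

 private:
  uint32_t hp_;  // hotness threshold (e.g., H_p = 3 in Figure 20)
  std::unordered_map<uint64_t, uint32_t> counts_;
};
```

With \(H_p=1\), this logic would request a migration on nearly every remote access, which explains the throughput collapse observed above.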

Write Buffer. Assuming a scenario using WLock operations, we evaluate the throughput of using a write buffer (Rcmp) versus no buffer (Baseline) by running a micro-benchmark with different data sizes. As shown in Figure 21, Rcmp achieves up to 1.6× higher throughput than Baseline across all data sizes. The data is cached in the write buffer and flushed asynchronously in batches to the remote racks by background threads. Therefore, Rcmp removes writes from the critical execution path and reduces remote-rack accesses, enhancing write performance. However, when the data is larger than 256 B, the performance improvement is not significant, and the background threads incur more CPU overhead. Therefore, Rcmp does not use the write buffer when the data is larger than 256 B.

Fig. 21. Write buffer.
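
The sketch below illustrates the general pattern of taking small writes off the critical path: writes of at most 256 B are staged locally, and a background thread flushes them in batches (e.g., as one RDMA write per batch). The class and the 256 B cutoff mirror the description above, but the buffering and flushing details are simplified assumptions rather than Rcmp’s implementation:

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Small writes (<= 256 B) are staged locally and flushed to the remote rack
// in batches by a background thread; larger writes bypass the buffer.
class WriteBuffer {
 public:
  static constexpr size_t kMaxBuffered = 256;

  explicit WriteBuffer(std::function<void(const std::vector<std::byte>&)> flush_fn)
      : flush_fn_(std::move(flush_fn)), flusher_([this] { FlushLoop(); }) {}

  ~WriteBuffer() {
    { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
    cv_.notify_one();
    flusher_.join();
  }

  // Returns true if the write was absorbed by the buffer (fast path).
  bool Write(const void* data, size_t len) {
    if (len > kMaxBuffered) return false;   // caller issues a direct remote write
    std::lock_guard<std::mutex> lk(mu_);
    const auto* p = static_cast<const std::byte*>(data);
    staged_.insert(staged_.end(), p, p + len);
    cv_.notify_one();
    return true;
  }

 private:
  void FlushLoop() {
    std::unique_lock<std::mutex> lk(mu_);
    while (!stop_) {
      cv_.wait(lk, [this] { return stop_ || !staged_.empty(); });
      std::vector<std::byte> batch;
      batch.swap(staged_);
      lk.unlock();
      if (!batch.empty()) flush_fn_(batch);  // e.g., one RDMA write per batch
      lk.lock();
    }
  }

  std::function<void(const std::vector<std::byte>&)> flush_fn_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<std::byte> staged_;
  bool stop_ = false;
  std::thread flusher_;
};
```

Writes larger than 256 B return false from Write and would be issued directly to the remote rack, matching Rcmp’s decision to bypass the buffer for large data.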

RRPC Framework. We compare RRPC with FaRM’s RPC [17], the eRPC [24] framework, and a hybrid mode (eRPC + one-sided RDMA verbs) under different transfer data sizes. As shown in Figure 22, RRPC achieves 1.33–1.89× higher throughput than eRPC + one-sided RDMA verbs, and 1.5–2× higher than eRPC when the transferred data is large. eRPC performs well when the data is smaller than 968 B, but it suffers from performance degradation for larger data. This is because eRPC is based on UD (Unreliable Datagram) mode, in which each message is limited to an MTU (maximum transmission unit), 1 KB by default; when the data exceeds the MTU size, it is divided into multiple packets, resulting in worse performance. RRPC therefore selects the eRPC path only when the data is smaller than 512 B and selects the hybrid mode for larger data. In addition, RRPC adopts several strategies to improve RDMA communication performance.

Fig. 22. RRPC framework.
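
The size-based path selection described above can be summarized by the following sketch; SendViaErpc and SendViaHybrid are placeholders standing in for the eRPC call path and the eRPC-plus-one-sided-RDMA path (they are not RRPC’s real API), and the 512 B cutoff comes from the text:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Placeholders for the two transports wrapped by RRPC: small messages go
// through eRPC; large messages send a small eRPC control message and move
// the payload with one-sided RDMA verbs.
static void SendViaErpc(uint64_t dest, const void* buf, size_t len) {
  std::printf("eRPC path: dest=%llu len=%zu\n", (unsigned long long)dest, len);
  (void)buf;
}
static void SendViaHybrid(uint64_t dest, const void* buf, size_t len) {
  std::printf("hybrid path: dest=%llu len=%zu\n", (unsigned long long)dest, len);
  (void)buf;
}

// Size-based selection: payloads under 512 B fit comfortably in a single UD
// packet and favor eRPC; larger payloads avoid eRPC's multi-packet
// segmentation (1 KB MTU) by using one-sided RDMA for the data transfer.
inline void RrpcSend(uint64_t dest, const void* buf, size_t len) {
  constexpr size_t kSmallMsgThreshold = 512;  // bytes
  if (len < kSmallMsgThreshold) {
    SendViaErpc(dest, buf, len);
  } else {
    SendViaHybrid(dest, buf, len);
  }
}
```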

6.6 Discussion

Supporting Decentralization. The centralized design makes the MS prone to becoming a performance bottleneck, which Rcmp mitigates by leveraging CN local DRAM. In the future, Rcmp intends to implement a decentralized architecture based on consistent hashing, with cluster membership maintained reliably using ZooKeeper [23], similar to FaRM.
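
As a rough illustration of this planned decentralized design, the sketch below maps global page IDs onto racks with a consistent-hash ring (using virtual nodes), so that adding or removing a rack remaps only a small fraction of pages; the class and its hash function are our assumptions, not an implemented part of Rcmp:

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Consistent-hash ring mapping global page IDs to racks: adding or removing
// a rack only remaps the pages whose hash points fall on the affected arcs.
class ConsistentHashRing {
 public:
  void AddRack(const std::string& rack_id, int virtual_nodes = 128) {
    for (int i = 0; i < virtual_nodes; ++i)
      ring_[Hash(rack_id + "#" + std::to_string(i))] = rack_id;
  }

  const std::string& LocateRack(uint64_t page_id) const {
    if (ring_.empty()) throw std::runtime_error("no racks registered");
    auto it = ring_.lower_bound(Hash(std::to_string(page_id)));
    if (it == ring_.end()) it = ring_.begin();   // wrap around the ring
    return it->second;
  }

 private:
  static uint64_t Hash(const std::string& key) {
    return std::hash<std::string>{}(key);        // placeholder hash function
  }
  std::map<uint64_t, std::string> ring_;         // hash point -> rack id
};
```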

Supporting Cache I/O. Rcmp adopts a cache-less access mode, which avoids consistency-maintenance overhead across racks. With a decentralized architecture, Rcmp could place cache structures for remote racks in CXL memory and maintain cache consistency between racks using ZooKeeper.

Transparency. Although Rcmp provides very simple APIs and built-in implementations of standard data structures (e.g., a hashtable), there are still many scenarios in which legacy applications need to be migrated to Rcmp without modifying their source code. Following Gengar [18], we integrate Rcmp with FUSE [1] to implement a simple distributed file system, which can be used by most applications without source-code modifications. Usage instructions can be found in Rcmp’s source code at https://github.com/PDS-Lab/Rcmp.
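
To show what such a FUSE integration looks like at the interface level, here is a heavily simplified sketch of a read callback that forwards file reads to the disaggregated memory pool (written against libfuse 3). The rcmp_read and lookup_gaddr helpers are hypothetical stand-ins for the client API in the Rcmp repository, not its actual function names:

```cpp
#define FUSE_USE_VERSION 31
#include <fuse.h>

#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for the Rcmp client API; a real integration would
// call into the Rcmp library instead of these stubs.
static int rcmp_read(uint64_t gaddr, void* buf, size_t size) {
  (void)gaddr;
  std::memset(buf, 0, size);  // pretend the remote pages are zero-filled
  return 0;
}
static uint64_t lookup_gaddr(const char* path) {
  (void)path;                 // a real FS would map paths to global addresses
  return 0;
}

// FUSE read callback: forward the request to the memory pool instead of a disk.
static int rcmpfs_read(const char* path, char* buf, size_t size,
                       off_t offset, struct fuse_file_info* /*fi*/) {
  uint64_t gaddr = lookup_gaddr(path);
  if (rcmp_read(gaddr + offset, buf, size) != 0) return -EIO;
  return static_cast<int>(size);
}

int main(int argc, char* argv[]) {
  struct fuse_operations ops {};
  ops.read = rcmpfs_read;
  // getattr, write, readdir, and friends would be wired up the same way.
  return fuse_main(argc, argv, &ops, nullptr);
}
```

In such a setup, unmodified applications issue ordinary file I/O, and the FUSE callbacks translate it into accesses to the memory pool.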

7 RELATED WORK

7.1 RDMA-Based Remote Memory

Page-Based Systems. Infiniswap [22] is a page-based remote memory system over RDMA networks that performs decentralized slab placement and eviction based on one-sided RDMA operations. Infiniswap uses a block device in kernel space as the swap space and a daemon in user space to manage accessible remote memory. Similar page-based systems include LegoOS [44], a resource-disaggregated OS that provides a global virtual memory space; Clover [51], a passive disaggregated persistent memory (pDPM) system over RDMA that separates the metadata/control plane from the data plane; and Fastswap [3], a fast swapping system for disaggregated memory over RDMA with a far-memory-aware cluster scheduler. However, these page-based systems suffer from I/O amplification due to coarse-grained access and from additional overheads due to page-fault handling and context switching.

Object-Based Systems. Object-based memory disaggregation designs its own object interfaces (e.g., a key-value store) to directly manage RDMA data transfers. FaRM [17] is an object-based remote memory system over RDMA that exposes the memory of all servers in a cluster as a shared address space and provides efficient APIs to simplify the use of remote memory. AIFM [41] is an application-integrated remote memory system that provides convenient APIs for application development and a high-performance runtime designed for minimal overhead on object accesses. Xstore [57] adopts learned indexes to build a remote memory cache in RDMA-based key-value stores. However, these systems are not fully “disaggregated,” because each CN contains local memory of the same size as the remote memory. FUSEE is a fully memory-disaggregated key-value store that extends disaggregation to metadata management; it is based on the RACE hashing index [67], a one-sided RDMA-conscious extendible hash table. Gengar [18] is an object-based hybrid memory pool system that provides a global memory space (including remote NVM and DRAM) over RDMA.

Communication Optimization. Most RDMA-based systems propose optimizations to improve the efficiency of RDMA communication. FaRM proposes messaging primitives based on lock-free ring buffers to minimize the communication overhead of remote memory, and it reduces RNIC cache misses by sharing QPs. Clover improves the scalability of RDMA by registering memory regions with RNICs using huge memory pages (HugePage). Xstore uses doorbell batching to reduce network latency for multiple RDMA READ/WRITE operations. FaSST [26] proposes a fast RPC framework using two-sided unreliable RDMA instead of one-sided verbs, which is especially efficient for small messages; however, two-sided verbs are not suitable for disaggregated architectures, and FaSST is not efficient for large messages. Many remote memory systems adopt eRPC [24], a general-purpose RPC library that offers comparable performance without relying on one-sided RDMA primitives. However, we observe that RPC alone is not optimal for memory disaggregation systems.

7.2 Supporting Cache Coherence

GAM [9] is a distributed memory system that provides cache-coherent memory over RDMA. GAM maintains coherence between local and remote memory via a directory-based cache coherence protocol, but this approach has high maintenance overhead. New interconnect protocols, such as CXL [45] and CCIX [6], natively support cache coherence and offer lower latency than RDMA, and some researchers have tried to redesign RDMA-based memory disaggregation using these protocols. Kona [10] uses cache coherence instead of virtual memory to transparently track applications’ memory accesses, reducing read/write amplification in page-based systems. Rambda [61] is an RDMA-driven acceleration framework that uses a cache-coherent accelerator to connect to CXL-like cache-coherent memory and adopts a cpoll mechanism to reduce polling overhead.

7.3 CXL-Based Memory Disaggregation

DirectCXL [21] is a CXL-based memory disaggregation system that achieves directly accessible remote memory over CXL protocols; it exhibits 6.2× lower latency than RDMA-based memory disaggregation. Pond [30] is a memory pooling system for cloud platforms that significantly reduces DRAM costs based on CXL. However, these systems do not consider the distance limitation of CXL. CXL-over-Ethernet [56] is a novel FPGA-based memory disaggregation design that sidesteps CXL’s distance limitation via Ethernet, but it does not fully exploit the performance benefits of CXL.

8 CONCLUSION AND FUTURE WORK

In this study, we developed Rcmp, a low-latency and highly scalable memory pooling system that is the first to combine RDMA and CXL for memory disaggregation. Rcmp builds a CXL-based memory pool within each rack and uses RDMA to connect racks, forming a global memory pool. Rcmp adopts several techniques to address the mismatch between RDMA and CXL. It provides global memory and address management to support access at cache-line granularity, and it uses different buffer structures for intra- and inter-rack communication to avoid blocking problems. To reduce remote-rack accesses, Rcmp proposes a hot-page identification and migration strategy and fine-grained access buffering with a lock-based synchronization mechanism. To accelerate remote-rack accesses, Rcmp designs an optimized RRPC framework. Evaluations show that Rcmp significantly outperforms other RDMA-based solutions on all workloads without additional overheads.

In the future, we will experiment with real CXL devices for Rcmp and improve the designs for decentralization and CXL caching strategies (discussed in Section 6.6). In addition, we will support other storage devices (e.g., PM, SSD, and HDD) in Rcmp.

ACKNOWLEDGMENTS

We appreciate all reviewers and editors for their insightful comments and feedback. We thank Yixing Guo for his efforts on Rcmp’s code.

REFERENCES

  [1] GitHub. 2023. FUSE (Filesystem in Userspace). Retrieved December 8, 2023 from http://libfuse.github.io/
  [2] Aguilera Marcos K., Amaro Emmanuel, Amit Nadav, Hunhoff Erika, Yelam Anil, and Zellweger Gerd. 2023. Memory disaggregation: Why now and what are the challenges. ACM SIGOPS Operating Systems Review 57, 1 (2023), 38–46.
  [3] Amaro Emmanuel, Branner-Augmon Christopher, Luo Zhihong, Ousterhout Amy, Aguilera Marcos K., Panda Aurojit, Ratnasamy Sylvia, and Shenker Scott. 2020. Can far memory improve job throughput? In Proceedings of the 15th European Conference on Computer Systems. 1–16.
  [4] Bae Jonghyun, Lee Jongsung, Jin Yunho, Son Sam, Kim Shine, Jang Hakbeom, Ham Tae Jun, and Lee Jae W. 2021. FlashNeuron: SSD-enabled large-batch training of very deep neural networks. In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST’21). 387–401.
  [5] Barroso Luiz André, Clidaras Jimmy, and Hölzle Urs. 2013. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines (2nd ed.). Synthesis Lectures on Computer Architecture. Morgan & Claypool.
  [6] Benton Brad. 2017. CCIX, GEN-Z, OpenCAPI: Overview & comparison. In Proceedings of the OpenFabrics Workshop.
  [7] Biswas Som S. 2023. Role of Chat GPT in public health. Annals of Biomedical Engineering 51, 5 (2023), 868–869.
  [8] Bonwick Jeff. 1994. The slab allocator: An object-caching kernel memory allocator. In Proceedings of the USENIX Summer 1994 Technical Conference, Vol. 16. 1–12.
  [9] Cai Qingchao, Guo Wentian, Zhang Hao, Agrawal Divyakant, Chen Gang, Ooi Beng Chin, Tan Kian-Lee, Teo Yong Meng, and Wang Sheng. 2018. Efficient distributed memory management with RDMA and caching. Proceedings of the VLDB Endowment 11, 11 (2018), 1604–1617.
  [10] Calciu Irina, Imran M. Talha, Puddu Ivan, Kashyap Sanidhya, and Metreveli Zviad. 2021. Rethinking software runtimes for disaggregated memory. In Proceedings of the 18th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. 2–16.
  [11] Cao Wei, Zhang Yingqiang, Yang Xinjun, Li Feifei, Wang Sheng, Hu Qingda, Cheng Xuntao, Chen Zongzhi, Liu Zhenjun, Fang Jing, et al. 2021. PolarDB Serverless: A cloud native database for disaggregated data centers. In Proceedings of the 2021 International Conference on Management of Data. 2477–2489.
  [12] Cao Zhichao and Dong Siying. 2020. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST’20).
  [13] Cheng Yue, Anwar Ali, and Duan Xuejing. 2018. Analyzing Alibaba’s co-located datacenter workloads. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data’18). IEEE, Los Alamitos, CA, 292–297.
  [14] Cockcroft Adrian. 2023. Supercomputing Predictions: Custom CPUs, CXL3.0, and Petalith Architectures. Retrieved December 8, 2023 from https://adrianco.medium.com/supercomputing-predictions-custom-cpus-cxl3-0-and-petalith-architectures-b67cc324588f/
  [15] Cooper Brian F., Silberstein Adam, Tam Erwin, Ramakrishnan Raghu, and Sears Russell. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. 143–154.
  [16] Anritsu Corporation and KYOCERA Corporation. 2023. PCI Express® 5.0 Optical Signal Transmission Test. Retrieved December 8, 2023 from https://global.kyocera.com/newsroom/news/2023/000694.html
  [17] Dragojević Aleksandar, Narayanan Dushyanth, Castro Miguel, and Hodson Orion. 2014. FaRM: Fast remote memory. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI’14). 401–414.
  [18] Duan Zhuohui, Liu Haikun, Lu Haodi, Liao Xiaofei, Jin Hai, Zhang Yu, and He Bingsheng. 2021. Gengar: An RDMA-based distributed hybrid memory pool. In Proceedings of the 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS’21). IEEE, Los Alamitos, CA, 92–103.
  [19] Floridi Luciano and Chiriatti Massimo. 2020. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines 30 (2020), 681–694.
  [20] Gao Peter Xiang, Narayan Akshay, Karandikar Sagar, Carreira João, Han Sangjin, Agarwal Rachit, Ratnasamy Sylvia, and Shenker Scott. 2016. Network requirements for resource disaggregation. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 249–264. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/gao
  [21] Gouk Donghyun, Lee Sangwon, Kwon Miryeong, and Jung Myoungsoo. 2022. Direct access, high-performance memory disaggregation with DirectCXL. In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC’22). 287–294.
  [22] Gu Juncheng, Lee Youngmoon, Zhang Yiwen, Chowdhury Mosharaf, and Shin Kang G. 2017. Efficient memory disaggregation with INFINISWAP. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (NSDI’17). 649–667.
  [23] Hunt Patrick, Konar Mahadev, Junqueira Flavio Paiva, and Reed Benjamin. 2010. ZooKeeper: Wait-free coordination for internet-scale systems. In Proceedings of the USENIX Annual Technical Conference, Vol. 8.
  [24] Kalia Anuj, Kaminsky Michael, and Andersen David. 2019. Datacenter RPCs can be general and fast. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19). 1–16.
  [25] Kalia Anuj, Kaminsky Michael, and Andersen David G. 2014. Using RDMA efficiently for key-value services. In Proceedings of the 2014 ACM Conference on SIGCOMM. 295–306.
  [26] Kalia Anuj, Kaminsky Michael, and Andersen David G. 2016. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI’16). 185–201.
  [27] Katrinis Kostas, Syrivelis Dimitris, Pnevmatikatos Dionisios, Zervas Georgios, Theodoropoulos Dimitris, Koutsopoulos Iordanis, Hasharoni Kobi, Raho Daniel, Pinto Christian, Espina F., et al. 2016. Rack-scale disaggregated cloud data centers: The dReDBox project vision. In Proceedings of the 2016 Design, Automation, and Test in Europe Conference and Exhibition (DATE’16). IEEE, Los Alamitos, CA, 690–695.
  [28] Kwon Youngeun and Rhu Minsoo. 2018. Beyond the memory wall: A case for memory-centric HPC system for deep learning. In Proceedings of the 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’18). IEEE, Los Alamitos, CA, 148–161.
  [29] Lee Seung-Seob, Yu Yanpeng, Tang Yupeng, Khandelwal Anurag, Zhong Lin, and Bhattacharjee Abhishek. 2021. Mind: In-network memory management for disaggregated data centers. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 488–504.
  [30] Li Huaicheng, Berger Daniel S., Hsu Lisa, Ernst Daniel, Zardoshti Pantea, Novakovic Stanko, Shah Monish, Rajadnya Samir, Lee Scott, Agarwal Ishwar, et al. 2023. Pond: CXL-based memory pooling systems for cloud platforms. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 2. 574–587.
  [31] Makrani Hosein Mohammadi, Rafatirad Setareh, Houmansadr Amir, and Homayoun Houman. 2018. Main-memory requirements of big data applications on commodity server platform. In Proceedings of the 2018 18th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGRID’18). IEEE, Los Alamitos, CA, 653–660.
  [32] Maruf Hasan Al, Wang Hao, Dhanotia Abhishek, Weiner Johannes, Agarwal Niket, Bhattacharya Pallab, Petersen Chris, Chowdhury Mosharaf, Kanaujia Shobhit, and Chauhan Prakash. 2023. TPP: Transparent page placement for CXL-enabled tiered-memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vol. 3. 742–755.
  [33] Maruf Hasan Al, Zhong Yuhong, Wang Hongyi, Chowdhury Mosharaf, Cidon Asaf, and Waldspurger Carl. 2021. Memtrade: A disaggregated-memory marketplace for public clouds. arXiv preprint arXiv:2108.06893 (2021).
  [34] Matsuoka Satoshi, Domke Jens, Wahib Mohamed, Drozd Aleksandr, and Hoefler Torsten. 2023. Myths and legends in high-performance computing. International Journal of High Performance Computing Applications 37, 3-4 (2023), 245–259.
  [35] Michelogiannakis George, Klenk Benjamin, Cook Brandon, Teh Min Yee, Glick Madeleine, Dennison Larry, Bergman Keren, and Shalf John. 2022. A case for intra-rack resource disaggregation in HPC. ACM Transactions on Architecture and Code Optimization 19, 2 (2022), 1–26.
  [36] Monga Sumit Kumar, Kashyap Sanidhya, and Min Changwoo. 2021. Birds of a feather flock together: Scaling RDMA RPCs with Flock. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 212–227.
  [37] Peng Ivy, Pearce Roger, and Gokhale Maya. 2020. On the memory underutilization: Exploring disaggregated memory on HPC systems. In Proceedings of the 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD’20). IEEE, Los Alamitos, CA, 183–190.
  [38] The Next Platform. 2022. Just How Bad Is CXL Memory Latency? Retrieved December 8, 2023 from https://www.nextplatform.com/2022/12/05/just-how-bad-is-cxl-memory-latency/
  [39] Raybuck Amanda, Stamler Tim, Zhang Wei, Erez Mattan, and Peter Simon. 2021. HeMem: Scalable tiered memory management for big data applications and real NVM. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 392–407.
  [40] Reiss Charles, Tumanov Alexey, Ganger Gregory R., Katz Randy H., and Kozuch Michael A. 2012. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the 3rd ACM Symposium on Cloud Computing. 1–13.
  [41] Ruan Zhenyuan, Schwarzkopf Malte, Aguilera Marcos K., and Belay Adam. 2020. AIFM: High-performance, application-integrated far memory. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 315–332.
  [42] Salmonson Rick, Oxby Troy, Briski Larry, Normand Robert, Stacy Russell, and Glanzman Jeffrey. 2019. PCIe Riser Extension Assembly. Technical Disclosure Commons (January 11, 2019). https://www.tdcommons.org/dpubs_series/1878
  [43] Shamis Alex, Renzelmann Matthew, Novakovic Stanko, Chatzopoulos Georgios, Dragojević Aleksandar, Narayanan Dushyanth, and Castro Miguel. 2019. Fast general distributed transactions with opacity. In Proceedings of the 2019 International Conference on Management of Data. 433–448.
  [44] Shan Yizhou, Huang Yutong, Chen Yilun, and Zhang Yiying. 2018. LegoOS: A disseminated, distributed OS for hardware resource disaggregation. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 69–87.
  [45] Sharma Debendra Das and Agarwal Ishwar. 2022. Compute Express Link. Retrieved December 8, 2023 from https://www.computeexpresslink.org/_files/ugd/0c1418_a8713008916044ae9604405d10a7773b.pdf/
  [46] Shenoy Navin. 2023. A Milestone in Moving Data. Retrieved December 8, 2023 from https://www.intel.com/content/www/us/en/newsroom/home.html
  [47] Shrivastav Vishal, Valadarsky Asaf, Ballani Hitesh, Costa Paolo, Lee Ki Suh, Wang Han, Agarwal Rachit, and Weatherspoon Hakim. 2019. Shoal: A network architecture for disaggregated racks. In Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI’19). 255–270. https://www.usenix.org/conference/nsdi19/presentation/shrivastav
  [48] Intel. 2019. Intel® Rack Scale Design (Intel® RSD) Storage Services. API Specification. Intel.
  [49] Sun Yan, Yuan Yifan, Yu Zeduo, Kuper Reese, Jeong Ipoom, Wang Ren, and Kim Nam Sung. 2023. Demystifying CXL memory with genuine CXL-ready systems and devices. arXiv preprint arXiv:2303.15375 (2023).
  [50] Torvalds Linus. 2023. Linux Kernel Source Tree. Retrieved December 8, 2023 from https://github.com/torvalds/linux/blob/master/lib/kfifo.c
  [51] Tsai Shin-Yeh, Shan Yizhou, and Zhang Yiying. 2020. Disaggregating persistent memory and controlling them remotely: An exploration of passive disaggregated key-value stores. In Proceedings of the 2020 USENIX Annual Technical Conference. 33–48.
  [52] Van Doren Stephen. 2019. HOTI 2019: Compute Express Link. In Proceedings of the 2019 IEEE Symposium on High-Performance Interconnects (HOTI’19). IEEE, Los Alamitos, CA, 18.
  [53] Verbitski Alexandre, Gupta Anurag, Saha Debanjan, Brahmadesam Murali, Gupta Kamal, Mittal Raman, Krishnamurthy Sailesh, Maurice Sandor, Kharatishvili Tengiz, and Bao Xiaofeng. 2017. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In Proceedings of the 2017 ACM International Conference on Management of Data. 1041–1052.
  [54] Vuppalapati Midhul, Miron Justin, Agarwal Rachit, Truong Dan, Motivala Ashish, and Cruanes Thierry. 2020. Building an elastic query engine on disaggregated storage. In Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI’20). 449–462.
  [55] Wahlgren Jacob, Gokhale Maya, and Peng Ivy B. 2022. Evaluating emerging CXL-enabled memory pooling for HPC systems. arXiv preprint arXiv:2211.02682 (2022).
  [56] Wang Chenjiu, He Ke, Fan Ruiqi, Wang Xiaonan, Wang Wei, and Hao Qinfen. 2023. CXL over Ethernet: A novel FPGA-based memory disaggregation design in data centers. In Proceedings of the 2023 IEEE 31st Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM’23). IEEE, Los Alamitos, CA, 75–82.
  [57] Wei Xingda, Chen Rong, and Chen Haibo. 2020. Fast RDMA-based ordered key-value store using remote learned cache. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 117–135.
  [58] Xin Qin, Miller Ethan L., Schwarz Thomas, Long Darrell D. E., Brandt Scott A., and Litwin Witold. 2003. Reliability mechanisms for very large storage systems. In Proceedings of the 2003 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST’03). IEEE, Los Alamitos, CA, 146–156.
  [59] Yang Juncheng, Yue Yao, and Rashmi K. V. 2020. A large scale analysis of hundreds of in-memory cache clusters at Twitter. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. 191–208.
  [60] Yang Qirui, Jin Runyu, Davis Bridget, Inupakutika Devasena, and Zhao Ming. 2022. Performance evaluation on CXL-enabled hybrid memory pool. In Proceedings of the 2022 IEEE International Conference on Networking, Architecture, and Storage (NAS’22). IEEE, Los Alamitos, CA, 1–5.
  [61] Yuan Yifan, Huang Jinghan, Sun Yan, Wang Tianchen, Nelson Jacob, Ports Dan R. K., Wang Yipeng, Wang Ren, Tai Charlie, and Kim Nam Sung. 2023. RAMBDA: RDMA-driven acceleration framework for memory-intensive \(\mu\)s-scale datacenter applications. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA’23). IEEE, Los Alamitos, CA, 499–515.
  [62] Zamanian Erfan, Binnig Carsten, Kraska Tim, and Harris Tim. 2016. The end of a myth: Distributed transactions can scale. CoRR abs/1607.00655 (2016). http://arxiv.org/abs/1607.00655
  [63] Zhang Ming, Hua Yu, Zuo Pengfei, and Liu Lurong. 2022. FORD: Fast one-sided RDMA-based distributed transactions for disaggregated persistent memory. In Proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST’22). 51–68.
  [64] Zhang Yifan, Liang Zhihao, Wang Jianguo, and Idreos Stratos. 2021. Sherman: A write-optimized distributed B+Tree index on disaggregated memory. arXiv preprint arXiv:2112.07320 (2021).
  [65] Zhang Yingqiang, Ruan Chaoyi, Li Cheng, Yang Xinjun, Cao Wei, Li Feifei, Wang Bo, Fang Jing, Wang Yuhui, Huo Jingze, et al. 2021. Towards cost-effective and elastic cloud database deployment via memory disaggregation. Proceedings of the VLDB Endowment 14, 10 (2021), 1900–1912.
  [66] Ziegler Tobias, Binnig Carsten, and Leis Viktor. 2022. ScaleStore: A fast and cost-efficient storage engine using DRAM, NVMe, and RDMA. In Proceedings of the 2022 International Conference on Management of Data. 685–699.
  [67] Zuo Pengfei, Sun Jiazhao, Yang Liu, Zhang Shuangwu, and Hua Yu. 2021. One-sided RDMA-conscious extendible hashing for disaggregated memory. In Proceedings of the USENIX Annual Technical Conference. 15–29.
