
ISP Agent: A Generalized In-storage-processing Workload Offloading Framework by Providing Multiple Optimization Opportunities

Published: 19 January 2024


Abstract

As solid-state drives (SSDs) with sufficient computing power have recently become the dominant devices in modern computer systems, in-storage processing (ISP), which processes data within the storage without transferring it to the host memory, is being utilized in various emerging applications. The main challenge of ISP is to deliver storage data to the offloaded workload. This is difficult because of the information gap between the host and storage, the data consistency problem between the host and offloaded workloads, and SSD-specific hardware limitations. Moreover, because the offloaded workloads use internal SSD resources, host I/O performance might be degraded due to resource conflicts. Although several ISP frameworks have been proposed, existing ISP approaches that do not deeply consider the internal SSD behavior are often insufficient to support efficient ISP workload offloading with high programmability.

In this article, we propose an ISP agent, a lightweight ISP workload offloading framework for SSD devices. The ISP agent provides I/O and memory interfaces that allow users to run existing function codes on SSDs without major code modifications, and separates the resources for the offloaded workloads from the existing SSD firmware to minimize interference with host I/O processing. The ISP agent also provides further optimization opportunities for the offloaded workload by considering SSD architectures. We have implemented the ISP agent on the OpenSSD Cosmos+ board and evaluated its performance using synthetic benchmarks and a real-world ISP-assisted database checkpointing application. The experimental results demonstrate that the ISP agent enhances host application performance while increasing ISP programmability, and that the optimization opportunities provided by the ISP agent can significantly improve ISP-side performance without compromising host I/O processing.


1 INTRODUCTION

Recently, modern solid-state drives (SSDs) have increased their computational power by embedding powerful processors and large memories. This enables complex storage management operations such as the flash translation layer (FTL) and garbage collection (GC). As SSDs with sufficient computing power become popular, in-storage-processing (ISP) techniques have been proposed for various emerging applications [1, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 33, 34, 35, 36]. ISP is a technique in which storage devices utilize their own resources or additional computing capabilities to process workloads without transferring data to host processors, such as central processing units (CPUs) or graphics processing units (GPUs). This technique can compute workloads with short input/output (I/O) latency and effectively reduce the overall volume of I/O transfers between the host and storage.

However, delivering storage data to ISP-offloaded workloads, known as ISP kernels, presents several significant issues. First, SSDs require low-level addresses to access NAND flash data, as opposed to the abstract file information contained in user-level I/Os. Second, SSDs typically cache NAND flash data in their internal DRAM to expedite host I/O processing, so the ISP kernel must consider the cached data to access the most recent data. Third, the NAND flash module within the SSD imposes restrictions on target memory addresses, file locations, and I/O sizes, necessitating the use of an internal memory buffer for NAND flash access rather than direct access via the ISP kernel memory. Furthermore, we have observed that storage data requests from the ISP kernel can negatively impact host I/O performance due to interference.

Existing ISP frameworks solve these problems in three ways: domain-specific ISP [11, 12, 19, 22, 35, 36], FIFO-ISP [3, 13, 27], and FTL-based-ISP [5, 7, 8, 9, 32]. Domain-specific ISP offloads target-specific operations to SSDs by leveraging in-depth knowledge of both the target applications and the SSD architecture. However, this approach has limited applicability because it is designed to support only specific operations that are effective for a particular domain. FIFO-ISP defines, on the host, the storage data that the ISP kernel uses as input and output. FIFO-ISP techniques incorporate ISP offloading into the regular accelerator workload offloading model: preloading the input, executing the offloaded work, and storing or transferring the output to the host. However, offloading workloads that demand random file accesses, such as ISP-assisted database checkpointing [23], is not viable with FIFO-ISP. By contrast, FTL-based-ISP permits the ISP kernel to request file I/Os and process them using the pre-existing SSD firmware, that is, part of the FTL. FTL-based-ISP facilitates the implementation of an ISP kernel as a host-side function with I/Os by transferring OS filesystem information to the storage. However, if the ISP kernel I/O is managed in the storage firmware without hardware support, then the NAND flash module constraints incur data rearrangement overhead between the ISP kernel memory and the internal memory buffer, which can degrade host I/O performance.

In this study, we propose a general-purpose, low-overhead ISP workload offloading framework named the ISP agent. The ISP agent isolates the ISP subsystem, which exclusively handles ISP kernels, from the baseline storage subsystem that processes typical SSD operations (for FTL and GC). The ISP agent processes ISP kernel I/Os via the storage subsystem, as in FTL-based-ISP. The ISP agent utilizes a distinct memory buffer to access NAND flash for ISP kernel I/O, shifting the data rearrangement overhead from the storage subsystem to the ISP subsystem. Thus, the ISP agent facilitates the ISP kernel's access to random storage data with byte-level location and size, while minimizing host I/O interference.

The ISP agent also offers two additional optimization opportunities based on the SSD structure: I/O overlapping and direct I/O. I/O overlapping allows ISP kernels to carry out other activities while I/O processes are ongoing in the background. Direct I/O utilizes the target ISP kernel memory to access NAND flash data directly, bypassing memory buffers when the memory meets the NAND flash memory constraints.

We implemented the ISP agent on the Cosmos+ OpenSSD board [14]. To evaluate the ISP agent, we compared it with FIFO-ISP and FTL-based-ISP using synthetic benchmarks for blockNDP [3] and DB checkpointing [23]. Additionally, we conducted comparisons on both the host and ISP sides of a real-world application, Sysbench [31], on MariaDB [21] with ISP-assisted DB checkpointing. The experimental results demonstrate that the ISP agent achieves maximum efficiency in ISP offloading by minimizing interference with host I/Os. The application of proposed optimizations yields ISP performance equivalent to that of a FIFO-ISP, and even surpasses that of an FTL-based-ISP.

This article offers the following key contributions:

We show that ISP offloading might affect the host I/O performance and propose solutions to reduce the interference without hardware support while providing FTL-based-ISP level programmability.

We provide two optimization opportunities, I/O overlapping and direct I/O, which can improve ISP kernel performance based on a deep understanding of SSD systems.

We evaluate the ISP agent on both the host and ISP sides using a real-world application: ISP-assisted DB checkpointing.


2 BACKGROUND

2.1 Cosmos+ OpenSSD Architecture

Cosmos+ OpenSSD [14] is a customizable SSD platform with a field-programmable gate array (FPGA)-based flash device controller. OpenSSD connects to the host via PCIe, and the host requests I/Os via the NVMe protocol. Figure 1(a) shows an overview of the simple architecture of the Cosmos+ OpenSSD system.


Fig. 1. (a) Cosmos+ OpenSSD and (b) NVMe command (I/O request from the host) transformation in OpenSSD.

Figure 1(b) and Algorithm 1 show the OpenSSD I/O handling process. When a host application requests I/Os with the file information, such as file position and I/O size, the operating system (OS) first calculates the low-level storage address of the requested data, known as a logical block address (LBA). The OS generates Non-Volatile Memory Express (NVMe) commands, which include the LBA and the target host memory. These commands are then transmitted to the OpenSSD, which uses slice (SSD page) units to access the NAND flash. When OpenSSD receives I/O requests from the host, it partitions an NVMe command of any I/O size into fixed-size slice requests. For each slice request, the OpenSSD performs NAND flash access and direct memory access (DMA) with the host. After processing all the slice requests, OpenSSD completes the NVMe command.

Specifically, OpenSSD uses a NAND buffer, which is a cache for NAND flash data in the DRAM of the SSD. The NAND buffer comprises several entries, where each entry caches a single slice (SSD page) of NAND flash data. Algorithm 1 demonstrates how OpenSSD handles slice requests. OpenSSD first accesses the NAND buffer by using the logical address of the target SSD page (line 1). If there is no matching NAND buffer entry (miss), then OpenSSD retrieves the oldest NAND buffer entry from the least recently used (LRU) list (line 3) and evicts the SSD page data that the entry was holding to the NAND flash (lines 2–8). After obtaining the NAND buffer entry corresponding to the target SSD page, OpenSSD validates it. If the NAND buffer entry is invalid, then OpenSSD reads the data from the NAND flash module (lines 9–11). This NAND READ operation can be omitted if the slice request is a write and covers the entire SSD page. OpenSSD then performs a DMA operation using the valid NAND buffer entry (line 16) and updates the LRU list (line 17). As shown in Figure 1(b), every operation, including NAND flash accesses and DMAs, is queued to the job queue of its NAND buffer entry. The operations within each job queue are executed sequentially, while those in separate job queues are performed concurrently.
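To make this flow concrete, the following C-style sketch condenses Algorithm 1 as described above. It is an illustration only, not the actual OpenSSD firmware: the structure fields and helper routines (LookupNandBuf, EvictLruNandBufEntry, NandProgram, NandRead, DmaWithNandBufEntry, UpdateLru) are assumed stand-ins for the firmware operations named in the text.

```c
/* Illustrative sketch of Algorithm 1 (one slice request); helper routines are
 * stand-ins for the OpenSSD firmware functions described in the text. */
typedef struct { unsigned int lpa; int valid, dirty; } NandBufEntry;
typedef struct { unsigned int lpa; int is_write, covers_full_page; } SliceReq;

extern NandBufEntry *LookupNandBuf(unsigned int lpa);     /* line 1: NAND buffer lookup   */
extern NandBufEntry *EvictLruNandBufEntry(void);          /* line 3: take the LRU entry   */
extern void NandProgram(NandBufEntry *e);                 /* write the entry back to flash */
extern void NandRead(unsigned int lpa, NandBufEntry *e);  /* fill the entry from flash    */
extern void DmaWithNandBufEntry(SliceReq *r, NandBufEntry *e);
extern void UpdateLru(NandBufEntry *e);

void HandleSliceRequest(SliceReq *req)
{
    NandBufEntry *e = LookupNandBuf(req->lpa);
    if (e == NULL) {                                  /* miss (lines 2-8) */
        e = EvictLruNandBufEntry();
        if (e->valid && e->dirty)
            NandProgram(e);                           /* evict the page it was holding */
        e->lpa = req->lpa;
        e->valid = 0;
    }
    /* lines 9-11: read from flash unless a write covers the whole SSD page */
    if (!e->valid && !(req->is_write && req->covers_full_page))
        NandRead(req->lpa, e);
    e->valid = 1;

    DmaWithNandBufEntry(req, e);                      /* line 16: DMA with the host   */
    if (req->is_write)
        e->dirty = 1;
    UpdateLru(e);                                     /* line 17: update the LRU list */
}
```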

2.2 In-storage Processing and ISP-assisted Database Checkpointing

ISP is the method of processing data within a storage device instead of transferring it to off-storage computation units, such as the CPU or GPU. Because most SSDs have a higher internal data bandwidth (SSD to NAND flash module) than external data bandwidth (host to SSD), ISP can be highly effective by exploiting the internal bandwidth instead. Furthermore, workload offloading through ISP has the potential to enhance overall system performance by reducing host-storage bandwidth, host memory space, and host computation unit usage.

ISP-assisted database (DB) checkpointing [23] is a representative example of ISP. DB checkpointing is a periodically executed operation that stores dirty DB pages on a physical device to reduce recovery time; it can generate heavy I/Os and degrade foreground transaction performance. To alleviate the DB checkpointing overhead, ISP-assisted DB checkpointing has been proposed. ISP-assisted DB checkpointing gathers only the changes (deltas) in the dirty pages, transforms them into a list of delta-update workloads, and offloads these workloads to the ISP device. It can greatly reduce I/O overheads by transmitting only small deltas to the SSD instead of entire dirty DB pages. ISP-assisted DB checkpointing does not impact foreground transactions because it updates dirty pages solely in the ISP device, which operates in the host's background. As a result, ISP-assisted DB checkpointing can notably enhance DB performance.

Figure 2 shows the storage-side process of ISP-assisted DB checkpointing. The DB collects the changes in one DB page, packs them in the format of page ID, size, position (location), and the actual data of each delta, and writes them to the delta file. We call the packed data for one DB page a single-page delta (SPD). The delta file contains multiple consecutive SPDs. When the tasks of the DB checkpointing are transferred to the SSD, the ISP kernel reads and parses the delta file into SPDs. Based on each parsed SPD, the ISP kernel reads the target DB page, applies the deltas to it, and writes the updated DB page back. After repeating these steps (read DB page, update deltas, and write DB page) for all parsed SPDs, the ISP kernel reads and parses the next portion of the delta file. This process continues until all offloaded deltas are updated.
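The storage-side loop just described can be illustrated with a short sketch. The pread_ik/pwrite_ik interfaces follow the ISP agent programming model of Section 4.1, and the SPD header layout, file numbers, and page size used here are assumptions for illustration rather than the exact format used in [23].

```c
#include <stdint.h>

#define DB_PAGE_SIZE 16384UL      /* DB page size used in this sketch (assumed) */
#define DELTA_FILE   0            /* user-defined file numbers (illustrative)   */
#define DB_FILE      1

/* ISP-agent-style I/O interfaces (see Section 4.1); signatures are assumed. */
extern long pread_ik(int file_no, void *buf, unsigned long size, unsigned long offset);
extern long pwrite_ik(int file_no, const void *buf, unsigned long size, unsigned long offset);

typedef struct { uint32_t page_id; uint32_t n_deltas; } SpdHeader;   /* assumed SPD layout */
typedef struct { uint32_t size; uint32_t position; } DeltaHeader;    /* followed by data   */

/* Apply one single-page delta (SPD) located at spd_offset in the delta file;
 * returns the file offset of the next SPD. */
unsigned long apply_spd(unsigned long spd_offset)
{
    static uint8_t page[DB_PAGE_SIZE];
    SpdHeader hdr;
    unsigned long off = spd_offset;

    /* Parse the SPD header and read the target DB page. */
    pread_ik(DELTA_FILE, &hdr, sizeof hdr, off);
    off += sizeof hdr;
    pread_ik(DB_FILE, page, DB_PAGE_SIZE, (unsigned long)hdr.page_id * DB_PAGE_SIZE);

    /* Apply each delta: read its bytes straight into the right position of the page. */
    for (uint32_t i = 0; i < hdr.n_deltas; i++) {
        DeltaHeader d;
        pread_ik(DELTA_FILE, &d, sizeof d, off);
        off += sizeof d;
        pread_ik(DELTA_FILE, page + d.position, d.size, off);
        off += d.size;
    }

    /* Write the updated DB page back. */
    pwrite_ik(DB_FILE, page, DB_PAGE_SIZE, (unsigned long)hdr.page_id * DB_PAGE_SIZE);
    return off;
}
```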


Fig. 2. ISP-assisted database checkpointing.


3 MOTIVATION

3.1 ISP Requirements

Based on the ISP-assisted DB checkpointing described in Section 2.2, the following requirements are necessary for ISP offloading:

The ISP kernel should access storage data with high-level information, such as the DB page number, rather than a low-level storage address.

The ISP kernel should access storage data at the byte level, since the DB checkpointing kernel reads deltas of arbitrary length and location (byte addressability).

The I/O from the host and the storage access from the ISP kernel must not conflict or interfere with each other.

Meeting these requirements raises several challenges in exchanging data between the ISP kernel and the storage for ISP offloading.

3.2 Challenges of ISP

3.2.1 Information Gap between Host and Storage.

The first challenge is obtaining the low-level storage address (LBA) for the ISP kernel. As shown in Figure 3(a), when the host application requests file I/Os, the OS transforms them into NVMe commands and transmits them to the SSD. Each NVMe command includes the LBA of the intended data, which the OS computes based on its file-to-LBA mapping. Generally, the SSD cannot access this OS file-to-LBA mapping information. As a result, it cannot retrieve the LBAs for the storage data requested by the ISP kernel, whose requests carry only file-level information such as the file descriptor, offset, and size.


Fig. 3. (a) Information gap and (b) coherence problems in ISP offloading.

3.2.2 Coherence Problem.

As described in Section 2.1, OpenSSD utilizes the NAND buffer to cache NAND flash data. Because the cached data is not updated to the actual NAND flash until evicted, the ISP kernel should consider both the NAND buffer and the NAND flash to obtain the most up-to-date storage data.

However, unintended conflicts may arise between host I/Os and the ISP kernel's storage data accesses. This can happen when a single OpenSSD slice carries data from multiple files, because the host file system's block size is smaller than the SSD page (slice) size. Figure 3(b) illustrates an instance where an ISP kernel and a host program simultaneously write to different files in the same slice, resulting in conflicts even though both perform read-modify-write operations to avoid altering data from other files. This issue is prevalent when integrating an ISP feature into SSD systems because of the block size difference between the filesystem and the SSD page.

3.2.3 NAND Flash Module Access Requirements.

Accessing NAND flash modules is subject to certain limitations. The access has to be performed one slice (SSD page) at a time, with address-aligned memory employed as the source or destination. Address-aligned memory refers to an SSD DRAM area whose address is a multiple of the slice size; access with unaligned memory (e.g., 0x4123 instead of 0x4000) does not behave correctly. To satisfy these requirements for host I/O, OpenSSD partitions NVMe commands of arbitrary sizes into slice-sized requests and employs the NAND buffer, which consists of address-aligned memory. However, as shown in Figure 4(a), the ISP kernel requires arbitrary-sized data stored in random file locations and typically operates with unaligned ISP kernel memory. Therefore, the ISP kernel necessitates buffering NAND flash access with address-aligned, slice-sized memory and rearranging data between the address-aligned buffer and the ISP kernel memory.


Fig. 4. (a) Memory alignment and size requirements for NAND flash data access. (b–e) Internal and external bandwidth utilization of the SSD in concurrent execution of fio [2] (randread) and the ISP kernel. The ISP kernel repeats the 8 MB read operation.

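A minimal sketch of the buffering and rearrangement path for a byte-level read follows, assuming a 16 KB slice and hypothetical helper names; the actual firmware routines differ.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SLICE_SIZE 16384   /* SSD page (slice) size assumed in this sketch */

/* Stand-in for a NAND page read into an address-aligned DRAM buffer. */
extern void nand_read_slice(uint32_t lpa, void *aligned_buf);

/* Serve an arbitrary byte-level read (offset/len inside one slice) to
 * unaligned ISP-kernel memory: NAND access must go through the aligned,
 * slice-sized buffer, followed by a rearrangement copy. */
void read_bytes_via_buffer(uint32_t lpa, size_t offset_in_slice, size_t len,
                           void *kernel_dst, void *aligned_buf /* SLICE_SIZE bytes, aligned */)
{
    nand_read_slice(lpa, aligned_buf);                                  /* whole-slice NAND access */
    memcpy(kernel_dst, (uint8_t *)aligned_buf + offset_in_slice, len);  /* rearrangement           */
}
```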

3.2.4 Interference with Host I/O Performance.

Since the internal bandwidth of an SSD typically exceeds its external bandwidth, offloading ISP tasks can enhance overall system utilization by shifting external bandwidth usage to the internal bandwidth. Nonetheless, accessing storage data through the ISP kernel may degrade the host's I/O performance because these accesses consume SSD resources beyond the internal bandwidth. Data rearrangement, that is, the transfer of data between address-aligned buffers and unaligned ISP kernel memory, occupies the SSD computing cores that handle host I/Os. Figures 4(b)–4(e) show the utilization of internal and external bandwidth when the host and the ISP kernel make concurrent requests to the storage. We have taken measurements in four different scenarios:

(b) The ISP kernel does not request any storage data access.

(c) All the ISP kernel I/O only accesses NAND flash directly.

(d) The SSD handles ISP kernel I/O as host I/O without DMA. The SSD first accesses the NAND buffer and fetches NAND flash data using the NAND buffer entry when it misses.

(e) The SSD handles ISP kernel I/O as host I/O and rearranges data between the NAND buffer and ISP kernel memory instead of DMA.

As shown in the figure, concurrent access by the ISP kernel and host I/O elevates the utilization of the internal bandwidth of the SSD. However, sharing the NAND buffer and rearranging data can decrease internal and external bandwidth utilization, leading to degraded host I/O performance.

3.2.5 Optimization Opportunities in ISP Kernel.

When OpenSSD requests data access to the NAND flash module, the module prepares and transmits/receives the data from/to the SSD DRAM. While the NAND flash module is processing, the SSD's cores can carry out other tasks such as DMA and NVMe command transformations. Likewise, the ISP kernel can request more storage data or perform computational jobs concurrently with the NAND flash module's processing.

Moreover, if the ISP kernel already has address-aligned memory and uses it to request page-sized storage data at an appropriate (slice-aligned) file location, then NAND flash access can be carried out directly without memory buffering. This optimization can enhance ISP kernel performance by eliminating rearrangement.

3.3 Conventional ISP Approaches

3.3.1 Domain-specific ISP.

Numerous studies [11, 12, 19, 22, 35, 36] have focused on offloading core operations of a particular domain to SSDs. By leveraging a deep understanding of SSD architecture and core operation characteristics, domain-specific ISPs effectively handle specific workloads. While these domain-specific approaches are highly effective, their applicability is restricted to specific domains as they only support limited operations.

3.3.2 FIFO-ISP.

Many general-purpose ISP studies, such as Summarizer [13] and INSIDER [27], have proposed FIFO-ISP approaches to address this limited applicability. FIFO-ISP delivers storage data to the ISP kernel based on the host I/O process. When the host program launches the ISP kernel, it specifies the input and output data of the kernel. The specified data is prepared in the storage memory following the existing host I/O handling process, as shown in Figure 5(a), and the storage executes the ISP kernel with the prepared data. By managing the input and output data of the ISP kernel on the host side, FIFO-ISP can deliver storage data to the ISP kernel while avoiding the problems discussed in Section 3.2. FIFO-ISP typically integrates ISP offloading with host I/O to enhance overall performance by decreasing the I/O size through the ISP kernel. Furthermore, FIFO-ISP typically pipelines data transfers (DMA) and ISP kernels and dynamically allocates workloads to both ISPs and CPUs, resulting in improved overall performance.


Fig. 5. Accessing the storage data for the ISP kernel using file-level information in (a) FIFO-ISP and (b) FTL-based-ISP.

However, the FIFO-ISP restricts the access of the ISP kernel to arbitrary storage data, thus limiting the types of workloads that can be offloaded. Workloads that demand I/O to an arbitrary file area or contain both input and output data in the SSD, such as ISP-assisted DB checkpointing, are not suitable for offloading by FIFO-ISP.

3.3.3 FTL-based-ISP.

FTL-based-ISP techniques, such as Biscuit [7] and CSD [5, 8, 9, 32], have also been suggested, in which the ISP kernel accesses storage data via pre-existing SSD firmware or a NAND flash driver that fulfills a comparable role. For the ISP kernel to perform I/O using file-level information, FTL-based-ISP shares the pertinent information between the host and the SSD, as shown in Figure 5(b).

The backend storage firmware in the FTL-based-ISP manages ISP kernel I/O as host I/O, and caches NAND flash data through the NAND buffer. However, as shown in Figure 4(b)–4(e), the ISP kernel may interfere with the performance of the host I/Os due to its I/O resource occupancy.

3.4 Insights for Generalized ISP Offloading: ISP Agent

We propose an ISP agent, a framework for general-purpose ISP workload offloading. The ISP agent, similar to FTL-based-ISP, transfers file information from the host application to the SSD before offloading the workloads. It builds a file-to-LBA mapping table named the F2LBA table on the SSD. This table converts file-level ISP kernel I/Os into low-level, storage-address-based I/Os. The ISP agent manages and launches the ISP kernel in a separate subsystem called the ISP subsystem. The existing firmware, called the storage subsystem, performs FTL, GC, and NAND scheduling. Each subsystem has its own memory and core. When the ISP kernel requests I/Os, these are transferred to the storage subsystem by the ISP subsystem instead of directly accessing the NAND flash module. The storage subsystem handles ISP kernel I/Os in the same manner as host I/Os but uses a separate, pre-reserved buffer to access the NAND flash data instead of the NAND buffer. This ISP buffer, a pre-reserved address-aligned buffer, enables the ISP subsystem to perform the rearrangement process rather than relying on the storage subsystem, which would otherwise result in significant degradation of host I/O performance. Hence, the ISP agent can process ISP kernel I/Os effectively, even if the I/Os are of arbitrary sizes and are directed to any file location with unaligned memory as the source or destination. Furthermore, the ISP agent provides additional opportunities to optimize the ISP kernel by providing a non-blocking I/O capability and an aligned memory allocation interface.
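As a rough illustration of the idea, the F2LBA table can be pictured as a per-file array that maps each logical block of a registered file to its LBA. The layout and lookup below are assumptions; the paper describes the table only at a high level.

```c
#include <stdint.h>

#define BLOCK_SIZE 4096UL        /* filesystem logical block size (assumed) */

/* Hypothetical F2LBA table: for each registered file number, an array that
 * maps the n-th logical block of the file to its LBA on the device. */
typedef struct {
    uint64_t *lba;               /* lba[i] = LBA of the file's i-th logical block */
    uint64_t  n_blocks;
} F2lbaEntry;

extern F2lbaEntry f2lba_table[]; /* indexed by the user-defined file number */

/* Translate a byte-level (file number, offset) pair into an LBA plus an
 * in-block offset, as the ISP agent would before building a StgCmd request. */
static inline uint64_t f2lba_lookup(int file_no, uint64_t file_offset, uint32_t *in_block_off)
{
    F2lbaEntry *e = &f2lba_table[file_no];
    uint64_t block_idx = file_offset / BLOCK_SIZE;
    *in_block_off = (uint32_t)(file_offset % BLOCK_SIZE);
    return e->lba[block_idx];    /* consecutive file blocks may map to non-contiguous LBAs */
}
```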


4 ISP AGENT: GENERAL IN-STORAGE-PROCESSING FRAMEWORK

In this section, we describe the implementation of the ISP agent based on the insights discussed in Section 3.4. Figure 6(a) shows an overview of the ISP agent framework, which comprises two sub-programs on the SSD side, namely the storage subsystem and the ISP subsystem. The storage subsystem executes general SSD operations, including FTL, GC, and NAND access scheduling. The ISP subsystem launches and manages the ISP kernels. Both subsystems operate on independent cores and memories, but they share some memory as an inter-core queue for message and I/O transmission. The ISP agent provides memory and I/O interfaces to the ISP kernel, which is managed by the ISP subsystem. Moreover, the ISP agent offers application libraries for managing the F2LBA tables and the ISP kernels.


Fig. 6. ISP agent: (a) framework overview and (b) programming model.

4.1 Programming Model and Application Library

Figure 6(b) shows the programming model of the ISP agent. The primary objective of this model is to allow programmers to use pre-existing programs as ISP kernels without significant code changes. The ISP agent accomplishes this by providing interfaces to the ISP kernel that precisely match the host memory and I/O interfaces, such as malloc_ik, pread_ik, and pwrite_ik. The only difference between the host program and the ISP kernel is that I/O requests take a predefined file number instead of a file descriptor.

The host program registers the files to be used by the ISP kernel by invoking the register_file() function of the library. The registration requires only the same parameters as the open() system call, plus the user-defined file number. The library obtains the file's LBA information from the OS and constructs the F2LBA table based on it. Finally, the library transmits the table to the SSD through NVMe commands. After registering all files, the host application requests to launch the ISP kernel along with its arguments. Upon receipt of the request, the library initiates the ISP kernel through NVMe commands.
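The following sketch shows how this programming model might look in practice. register_file(), malloc_ik(), and pread_ik() are the interfaces named in the text, but their exact signatures, the launch call, and the file paths are illustrative assumptions.

```c
/* Host side: register the files the kernel will use, then launch the kernel.
 * register_file() is named in the text; its signature and launch_isp_kernel()
 * are assumptions made for this sketch. */
#include <fcntl.h>

extern int register_file(const char *path, int flags, int user_file_no);
extern int launch_isp_kernel(int kernel_id, const void *args, unsigned long arg_size);

void host_offload(void)
{
    register_file("/mnt/db/delta.log", O_RDONLY, /*file_no=*/0);
    register_file("/mnt/db/ibdata1",   O_RDWR,   /*file_no=*/1);

    unsigned long n_spds = 1024;                      /* example kernel argument */
    launch_isp_kernel(/*kernel_id=*/0, &n_spds, sizeof n_spds);
}

/* ISP kernel side: identical to host code except that I/O takes the
 * predefined file number instead of a file descriptor. */
extern void *malloc_ik(unsigned long size);
extern long  pread_ik(int file_no, void *buf, unsigned long size, unsigned long offset);

void isp_kernel_main(const void *args)
{
    (void)args;
    char *buf = malloc_ik(4096);
    pread_ik(/*file_no=*/0, buf, 4096, /*offset=*/0);  /* read the first 4 KB of file 0 */
    /* ... process buf ... */
}
```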

4.2 ISP Subsystem

The SSD part of the ISP agent comprises two subsystems: the storage subsystem and the ISP subsystem. The storage subsystem manages standard I/O requests from the host by converting the NVMe command, accessing the data buffer, scheduling NAND flash access, issuing DMAs, and performing garbage collection. Meanwhile, the ISP subsystem manages the ISP kernel and its I/Os. When the storage subsystem recognizes the ISP kernel's NVMe command and corresponding arguments, it transmits them to the ISP subsystem. Subsequently, the ISP subsystem initiates the kernel.

Figure 7(a) illustrates the ISP subsystem and the ISP kernel. When an I/O request is triggered by the ISP kernel, the ISP subsystem follows a process similar to that of the host OS. It initially divides the I/O into logical blocks by byte-level offset and size. Subsequently, rearrangement information is produced for each partitioned I/O. This information specifies the target destination memory (for write) or source memory (for read), along with the start and end offsets of data within each logical block of the selected file. The ISP subsystem collects the rearrangement information, file number, and block-unit file offset along with the I/O size to create a storage command. Afterward, this storage command is routed to the storage subsystem through the inter-core queue. In the storage subsystem, the command is reconstructed into StgCmd requests with LBA information using the F2LBA table. A StgCmd request is an SSD page-level request for accessing the NAND flash module, similar to the slice request described in Section 2.1. The storage subsystem assigns the pre-reserved, address-aligned memory, known as the ISP buffer, to each StgCmd request to access the NAND flash module. The structure of the ISP buffer and the detailed process are explained in Section 4.3. The storage subsystem updates the rearrangement information using the ISP buffer. Then, the ISP subsystem rearranges the target data between the assigned ISP buffer and the target memory before (for write) or after (for read) accessing the NAND flash module.
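The storage command and its per-block rearrangement information can be pictured with the following illustrative structures; the field names and exact layout are assumptions, since the paper describes their contents only in prose.

```c
#include <stdint.h>

/* Hypothetical layout of the per-logical-block rearrangement information and
 * the storage command that the ISP subsystem sends over the inter-core queue. */
typedef struct {
    void    *kernel_mem;     /* ISP kernel memory involved in the rearrangement for this block */
    uint32_t start_off;      /* first valid byte inside this logical block */
    uint32_t end_off;        /* last valid byte inside this logical block  */
    void    *buf_mem;        /* filled in later by the storage subsystem: assigned ISP buffer slot */
} RearrangeInfo;

typedef struct {
    int            file_no;       /* user-defined file number (F2LBA key)     */
    uint64_t       block_offset;  /* file offset in logical-block units       */
    uint32_t       n_blocks;      /* I/O size in logical blocks               */
    int            is_write;
    RearrangeInfo *rearr;         /* one entry per partitioned logical block  */
} StorageCmd;
```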


Fig. 7. (a) ISP subsystem and (b) storage subsystem of the ISP agent.

4.3 Storage Subsystem

Figure 7(b) demonstrates how the storage subsystem converts storage commands to StgCmd requests, which differs from the transformation of NVMe commands into slice requests. NVMe commands are designed for contiguous logical blocks; therefore, they carry logical block-level information such as the start LBA and the number of logical blocks (NLB). Thus, the storage subsystem can generate slice requests by simply carving NVMe commands using this information. The storage command's file-level information may appear contiguous, but it can become fragmented when translated into logical block addresses. As such, a gathering process is required to collect the logical blocks that belong to the same SSD page based on their LBAs. The storage subsystem accomplishes this process with a hash table using LBAs as keys.

As discussed in Section 3.4, the ISP agent introduces the ISP buffer, a pre-reserved address-aligned memory that buffers NAND flash module accesses requested by the ISP kernel. The ISP buffer is similar to the NAND buffer in that it comprises multiple entries corresponding to SSD pages, each entry containing valid and dirty bits, and the buffer is managed by an LRU list. The primary difference is that each entry in the ISP buffer has an additional lock flag. This flag indicates that the ISP buffer entry is currently being rearranged. The storage subsystem verifies that the targeted SSD page data is unlocked in the ISP buffer before handling StgCmd and slice requests. If the data is locked, then the processing of the request is postponed until the rearrangement process has been completed. This guarantees data integrity by preventing conflicts or inconsistencies that may arise during the rearrangement.

Algorithm 2 demonstrates how the storage subsystem handles StgCmd requests. To begin, the subsystem accesses the NAND buffer based on the logical address of the targeted SSD page (line 1). If a dirty NAND buffer entry exists, then the subsystem writes the associated data back to the NAND flash (lines 3 and 4). If there is a clean NAND buffer entry and the StgCmd request is a write, then the storage subsystem invalidates the NAND buffer entry (lines 5–8). After lines 1–9 are executed, the current data of the SSD page is stored either in the NAND flash module or in the ISP buffer. The storage subsystem then handles the StgCmd request similarly to the slice requests explained in Algorithm 1, differing only in the use of the ISP buffer rather than the NAND buffer. The storage subsystem initially accesses the ISP buffer using the target SSD page's logical address (line 10). If necessary, then it evicts the oldest entry in the ISP buffer (lines 11–17) and accesses the NAND flash module (lines 18–20). Note that, unlike slice requests, StgCmd requests require the verification of byte-level start and end offsets in the rearrangement information to determine complete SSD page coverage. After completing all operations, the storage subsystem updates the rearrangement information using the ISP buffer entry (line 26) and locks the ISP buffer entry (line 27) until the rearrangement is finished.
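For clarity, the StgCmd handling of Algorithm 2 can be condensed into the following C-style sketch. As with the earlier sketch of Algorithm 1, the types and helper routines are assumed stand-ins for the firmware operations described in the text.

```c
/* Illustrative sketch of Algorithm 2 (one StgCmd request). */
typedef struct { unsigned int lpa; int valid, dirty; } NandBufEntry;
typedef struct { unsigned int lpa; int valid, dirty, locked; } IspBufEntry;
typedef struct { unsigned int lpa; int is_write, covers_full_page; } StgCmdReq;

extern NandBufEntry *LookupNandBuf(unsigned int lpa);
extern void WritebackNandBufEntry(NandBufEntry *e);
extern void InvalidateNandBufEntry(NandBufEntry *e);
extern IspBufEntry *LookupIspBuf(unsigned int lpa);
extern IspBufEntry *EvictLruIspBufEntry(void);
extern void WritebackIspBufEntry(IspBufEntry *e);
extern void NandReadToIspBuf(unsigned int lpa, IspBufEntry *e);
extern void UpdateRearrangeInfo(StgCmdReq *r, IspBufEntry *e);  /* point rearrangement at e */
extern void UpdateIspBufLru(IspBufEntry *e);

void HandleStgCmdRequest(StgCmdReq *req)
{
    /* Lines 1-9: ensure the newest copy lives in NAND flash or the ISP buffer. */
    NandBufEntry *nb = LookupNandBuf(req->lpa);
    if (nb != NULL) {
        if (nb->dirty)
            WritebackNandBufEntry(nb);          /* flush host-cached dirty data      */
        else if (req->is_write)
            InvalidateNandBufEntry(nb);         /* a clean copy would become stale   */
    }

    /* Lines 10-20: obtain an ISP buffer entry, evicting the LRU entry if needed. */
    IspBufEntry *ib = LookupIspBuf(req->lpa);
    if (ib == NULL) {
        ib = EvictLruIspBufEntry();
        if (ib->valid && ib->dirty)
            WritebackIspBufEntry(ib);
        ib->lpa = req->lpa;
        ib->valid = 0;
    }
    if (!ib->valid && !(req->is_write && req->covers_full_page))  /* byte-level offsets checked */
        NandReadToIspBuf(req->lpa, ib);
    ib->valid = 1;
    if (req->is_write)
        ib->dirty = 1;

    /* Lines 26-27: hand the entry to the ISP subsystem for rearrangement. */
    UpdateRearrangeInfo(req, ib);
    ib->locked = 1;                             /* unlocked after rearrangement completes */
    UpdateIspBufLru(ib);
}
```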

4.4 DMA-enabled Memory

Several applications require retrieving result values to the host instead of writing them to NAND flash. To facilitate this, the ISP agent introduces DMA-enabled memory (DEM), a fixed-size memory region within the ISP subsystem that the ISP kernel can access. The host application can request DMA using DEM through the application library. DEM also uses a mask to block DMA at the logical block level. The storage subsystem checks the mask when the host initiates DMA with the DEM, and DMA is not executed for blocks whose mask is set. The DEM mask can be configured by both the host application and the ISP kernel to enable efficient filtering of ISP kernel results.
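A minimal sketch of how an ISP kernel might use the DEM and its per-block mask to return only useful results follows; the DEM layout, block size, and accessor functions shown are assumptions, as the paper does not specify them.

```c
#include <stdint.h>
#include <string.h>

#define DEM_BLOCK_SIZE 4096u                                  /* logical block size (assumed) */
#define DEM_BLOCKS     (16u * 1024 * 1024 / DEM_BLOCK_SIZE)   /* 16 MB DEM (Section 5.1)      */

/* Hypothetical view of the DEM and its per-logical-block DMA mask. */
extern uint8_t dem[DEM_BLOCKS * DEM_BLOCK_SIZE];  /* DMA-enabled memory visible to the kernel */
extern uint8_t dem_mask[DEM_BLOCKS];              /* nonzero = DMA of this block is skipped   */

/* The ISP kernel copies a result into a DEM block and unmasks it, so a later
 * host-initiated DMA transfers only blocks that actually hold results. */
void publish_result(unsigned block_idx, const void *result, unsigned long len)
{
    memcpy(&dem[block_idx * DEM_BLOCK_SIZE], result, len);
    dem_mask[block_idx] = 0;
}

void drop_block(unsigned block_idx)
{
    dem_mask[block_idx] = 1;      /* filtered out: the storage subsystem will not DMA it */
}
```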

4.5 ISP Kernel Optimization: I/O Overlapping and Direct I/O

Figure 8(a) demonstrates that the ISP subsystem remains idle while storage commands are transmitted and processed. Storage command operations handled in the storage subsystem, such as LBA calculations and NAND flash access, are background tasks from the ISP subsystem's perspective. Therefore, if the ISP core performs additional computations or I/O operations during this idle period, then it can enhance the overall performance of the ISP core. Based on this observation, the ISP agent offers a non-blocking I/O interface, enabling the ISP kernel to perform other tasks while its I/O is processed in the background. We refer to this optimization as I/O overlapping.
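The sketch below shows a double-buffered read loop that exploits I/O overlapping. The paper names the non-blocking capability but not its API, so pread_ik_nb() and wait_ik() are hypothetical interfaces used only for illustration.

```c
#define CHUNK 16384UL

/* Hypothetical non-blocking variants of the ISP-agent I/O interface. */
extern int  pread_ik_nb(int file_no, void *buf, unsigned long size,
                        unsigned long offset);     /* returns an I/O handle (assumed API) */
extern void wait_ik(int handle);                   /* completes that I/O    (assumed API) */
extern void process(const char *buf);              /* the kernel's computation            */

void overlapped_read(int file_no, unsigned long n_chunks)
{
    static char bufs[2][CHUNK];

    int h = pread_ik_nb(file_no, bufs[0], CHUNK, 0);
    for (unsigned long i = 1; i <= n_chunks; i++) {
        int h_next = -1;
        if (i < n_chunks)      /* issue the next read; it proceeds in the storage subsystem */
            h_next = pread_ik_nb(file_no, bufs[i & 1], CHUNK, i * CHUNK);
        wait_ik(h);            /* the previous chunk is now in place */
        process(bufs[(i - 1) & 1]);   /* compute while the background read is in flight */
        h = h_next;
    }
}
```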


Fig. 8. ISP kernel optimization opportunities provided by the ISP agent: (a) I/O overlapping and (b) direct I/O.

As explained in Section 3.2.3, the ISP agent buffers NAND flash access because of the memory address alignment constraints. If the ISP kernel issues an I/O that satisfies the NAND flash access requirements, then it can bypass memory buffering. Specifically, the ISP kernel I/O must satisfy three conditions: (1) the I/O is of the slice-unit size, (2) the request targets an appropriate (slice-aligned) file location, and (3) address-aligned ISP kernel memory is used. Figure 8(b) demonstrates direct NAND flash access for ISP kernel I/O. This optimization can reduce I/O response time by sidestepping memory buffering and the associated rearrangement. We refer to this optimization as direct I/O. The ISP agent provides a matching memory allocation interface to the ISP kernel for this purpose.
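A small example of an I/O that satisfies all three direct I/O conditions is sketched below. The aligned allocation call malloc_ik_aligned() is a hypothetical name for the matching allocation interface mentioned in the text.

```c
#define SLICE_SIZE 16384UL    /* SSD page (slice) size assumed in this sketch */

/* Hypothetical aligned-allocation interface; the exact name is not given in the text. */
extern void *malloc_ik_aligned(unsigned long size);
extern long  pread_ik(int file_no, void *buf, unsigned long size, unsigned long offset);

/* A read that satisfies the three direct I/O conditions: slice-sized,
 * slice-aligned file offset, and address-aligned kernel memory.  Such a
 * request can bypass the ISP buffer and go straight to the NAND flash module. */
void read_one_slice_direct(int file_no, unsigned long slice_idx)
{
    void *buf = malloc_ik_aligned(SLICE_SIZE);            /* condition (3) */
    pread_ik(file_no, buf, SLICE_SIZE,                    /* condition (1) */
             slice_idx * SLICE_SIZE);                     /* condition (2) */
    /* ... process buf; a matching free interface is assumed ... */
}
```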

Algorithm 3 describes how the storage subsystem handles StgCmd requests with direct I/O. The storage subsystem first assesses whether the StgCmd request matches the direct I/O criteria (line 1). If it matches, then the subsystem verifies whether the latest version of the requested data is present in either the NAND or ISP buffer (lines 2 and 11). Once the existence of the data is confirmed, the subsystem writes it back (lines 4 and 13) and invalidates the corresponding buffer entry (lines 5–9, 14–18). As a result, the newest version of the data is stored in the NAND flash module. The storage subsystem accesses the NAND flash module with the target memory directly without any buffering (lines 20 and 21). After all operations are finished, the storage subsystem updates the rearrangement information to prevent rearrangement in the ISP subsystem.


5 EXPERIMENTAL SETUPS

5.1 Environments

We synthesized a MicroBlaze [37] core on the FPGA of the Cosmos+ OpenSSD [14] to implement the ISP agent. We ran the storage subsystem on the Arm Cortex-A9 and the ISP subsystem on the MicroBlaze. While the total memory of the Cosmos+ OpenSSD board is 4 GB, the ISP subsystem used only up to 48 MB of memory. Specifically, we allocated 16 MB of memory for each of the ISP buffer, the DEM, and the ISP kernel. We used an Intel i7-4790K [10] as the host CPU with 16 GB of RAM.

5.2 Baselines

We compared the ISP agent with CPU execution, FIFO-ISP, and FTL-based-ISP, all without hardware support.

FIFO-ISP prevents the ISP kernel from requesting I/Os; instead, the host application indicates the storage data to be processed. In FIFO-ISP, the SSD loads the specified data into the NAND buffer and subsequently executes the ISP kernel with it. This technique offers the advantage of fast ISP kernel processing because there is no SSD-side LBA computation or data rearrangement. However, FIFO-ISP limits the amount of data that the ISP kernel can handle at once to the size of a NAND buffer entry. Furthermore, FIFO-ISP cannot offload workloads that need to access arbitrary positions of the storage data.

FTL-based-ISP allows the ISP kernel to request I/Os, which is then handled by the pre-existing firmware. The firmware caches the storage data requested by the ISP kernel using the NAND buffer, similar to how it handles I/Os and slice requests from the host application. Once the data has been prepared in the NAND buffer, the firmware rearranges the storage data in the buffer and transfers it to the ISP kernel memory. FTL-based-ISP enables the ISP kernel to fully utilize one core for execution, while the ISP agent uses the core for both ISP kernel execution and data rearrangement. Nonetheless, delegating the rearrangement process to the firmware-executing core could interfere with the host’s I/O performance.

We implemented these two approaches, FIFO-ISP and FTL-based-ISP, to accommodate delivering results of the ISP kernel to host memory (storage-to-host) or different addresses in the storage (storage-to-storage). All ISP kernel computations are executed on MicroBlaze, with variations in approach depending on access to storage data.

The storage-to-host FIFO-ISP can be implemented by executing the ISP kernel and performing DMA based on its results. To achieve this in Algorithm 1, the ISP kernel must be executed before line 15, and DmaWithNandBufEntry() should be called based on its outcome. When implementing storage-to-storage FIFO-ISP, it may be necessary to discard the results of the ISP kernel depending on its output and to recover the input data. Algorithm 4 demonstrates the implementation of storage-to-storage FIFO-ISP. The SSD first verifies whether a NAND buffer entry exists for the destination SSD page, removing it if it does (lines 1–7). Next, it prepares the NAND buffer entry for the source SSD page (lines 8–16), which is similar to how slice requests are processed, and performs a writeback to back up the latest data for the source SSD page (line 15). Finally, the SSD launches the ISP kernel with the source data loaded in the NAND buffer entry (line 17). Depending on the result, the corresponding NAND buffer entry is either assigned to the destination SSD page or invalidated (lines 18–21).

To implement FTL-based-ISP, we utilized the ISP agent mechanisms, including the F2LBA tables, StgCmd requests, and DEM. The ISP kernel can request I/Os through these mechanisms in FTL-based-ISP. The difference between FTL-based-ISP and the ISP agent lies in how the storage subsystem handles StgCmd requests. Algorithm 5 demonstrates how the storage subsystem handles StgCmd requests in FTL-based-ISP. The storage subsystem uses the NAND buffer to buffer the data requested by StgCmd requests. Just as when processing a slice request, the storage subsystem accesses the NAND buffer (line 1), evicts the oldest buffer entry (lines 2–8), and loads the most current data into the NAND buffer entry (lines 9–11). However, in contrast to line 16 of Algorithm 1, the storage subsystem rearranges the data in the NAND buffer entries and transfers it to the target ISP kernel memory (line 16), rather than performing DMA.

5.3 Benchmarks

The evaluation of the ISP agent used three ISP kernels: NDP Aggregation, NDP Filtering, and DB Checkpointing. Tables 1 and 2 list the parameters used for each benchmark.


Table 1. NDP Benchmarks and Their Parameters


Table 2. MariaDB [21] and Sysbench [31] Configurations

NDP Aggregation and NDP Filtering are both data processing workloads that divide file data into multiple units and perform iterative computations on each unit. NDP Aggregation summarizes the integer values in each data unit, using the size of the unit and the percentage of integers within it as input parameters. NDP Filtering, in contrast, examines specific data patterns within each data unit and, if they are detected, transfers the results to the host memory (storage-to-host) or records them to storage result files (storage-to-storage). For NDP Filtering, we used the data unit size and the ratio of the pattern present within the data unit as parameters. We evaluated the NDP Aggregation and NDP Filtering benchmarks in both storage-to-host and storage-to-storage scenarios using four methods: CPU execution, FIFO-ISP, FTL-based-ISP, and ISP agent.

As described in Section 2.2, DB Checkpointing updates dirty DB pages according to the information in the delta files. We used the size of the DB page and the number of delta entries (NDE) per SPD as parameters for DB Checkpointing. To demonstrate the impact and effectiveness of the ISP agent on host performance in a practical application, we concurrently measured the performance of the ISP kernels and the host I/O workload with Sysbench [31] and MariaDB [21]. Sysbench was executed on MariaDB with ISP-assisted DB checkpointing, and the OLTP Sysbench throughput was measured as host performance. The processing speed of the checkpointing offloaded to the SSD by MariaDB was measured as ISP kernel performance. We compared the performance of the ISP agent on both the host and ISP sides to FTL-based-ISP. In addition, we analyzed the host-side performance by comparing it to the OLTP Sysbench performance on a vanilla DB without ISP.


6 EVALUATION

6.1 ISP Kernel Performances

Figures 9 and 10 show the performance results for the storage-to-storage and storage-to-host benchmarks listed in Table 1. We measured the total time for all workloads that were offloaded to the SSD, including setup procedures such as the construction of the F2LBA table and DMA with DEM. Each bar denotes the ISP kernel performance of FIFO-ISP, FTL-based-ISP, and the ISP agent, including the improved performance of the ISP kernel with I/O overlapping and direct I/O in the ISP agent. The labels I and D refer to the I/O overlapping optimization and the direct I/O optimization, respectively. All performance measurements are normalized to the CPU execution of each benchmark.


Fig. 9. ISP kernel performance comparison of CPU execution, FIFO-ISP, FTL-based-ISP, and ISP agent using storage-to-storage benchmarks: (a) NDP Aggregation, (b) NDP Filtering, and (c) DB Checkpointing.


Fig. 10. ISP kernel performance comparison of CPU execution, FIFO-ISP, FTL-based-ISP, and ISP agent using storage-to-host benchmarks: (a) NDP Aggregation and (b) NDP Filtering.

The figures demonstrate that CPU execution outperforms ISP offloading except for DB checkpointing. Additionally, as the application becomes more compute intensive, the performance of ISP offloading decreases. This is due to the fundamental performance difference between the i7-4790K used for CPU execution and the MicroBlaze used for ISP offloading. However, ISP offloading achieves greater performance as the I/O intensity increases, especially in DB checkpointing, which has heavy I/O demands.

In most cases, FIFO-ISP demonstrates the most impressive performance among ISP techniques. As discussed earlier, FIFO-ISP directly computes data in the NAND buffer, avoiding the need to access the F2LBA table or rearrange data. Just as various entries in the NAND buffer can simultaneously access the NAND flash module or execute DMA for slice requests, separate FIFO-ISP kernels can run concurrently for different entries in the NAND buffer. This aspect of FIFO-ISP renders it well-suited for streaming data processing workloads such as NDP Aggregation and NDP Filtering. However, it cannot offload all types of workloads. In particular, FIFO-ISP cannot process data units larger than the 16 KB SSD page size, and it is unable to access any file position necessary for DB checkpointing. Additionally, Figure 9(b) demonstrates that FIFO-ISP was unable to offload storage-to-storage NDP filtering when using a 4 KB data unit size due to the NAND flash module’s inability to perform partial writes to the 16 KB SSD page.

The FTL-based-ISP and the ISP agent effectively offloaded all benchmarks. In general, the unoptimized ISP agent was slower than the FTL-based-ISP, because the FTL-based-ISP allows the cores to concentrate on executing the ISP kernel without data rearrangement. Nevertheless, the optimized ISP agent outperformed the FTL-based-ISP and was similar to the FIFO-ISP. This suggests that the ISP agent’s optimization techniques efficiently handle the structural trade-offs and enhance the effectiveness of ISP offloading.

6.2 I/O Overlapping and Direct I/O

As shown in Figures 9 and 10, the direct I/O optimization noticeably improved performance, whereas the I/O overlapping optimization alone had only minor effects. However, when these two optimizations were combined, a significant performance increase was observed. This is because the effectiveness of I/O overlapping depends on the ratio of computation, including data rearrangement, to I/O. Therefore, with a more powerful ISP core, performance could be improved through I/O overlapping alone.

Figure 11(a) demonstrates the impact on I/O overlapping performance when the number of simultaneously updated SPDs (NSU) changes. NSU indicates the maximum number of I/Os that can be overlapped simultaneously. For instance, when NSU is set to 16, the ISP kernel parses 16 SPDs and issues non-blocking reads in parallel for each DB page indicated by the SPDs. The ISP kernel performs the delta updates and writes for each DB page separately after the corresponding non-blocking read completes. Therefore, I/Os and delta updates from different SPDs can overlap. As shown in Figure 11(a), performance improves as more I/Os overlap, i.e., as NSU increases. This improvement is attributed to SSDs accessing multiple NAND flash modules concurrently through multiple flash channels, leading to faster overall I/O processing as more I/Os are issued concurrently. Figure 11(b) demonstrates the effectiveness of DB checkpointing with direct I/O optimization using two different approaches. The first approach, D-Possible, applies direct I/O optimization selectively, only when feasible; in particular, it does not apply direct I/O to DB page read/write operations when a 4 KB DB page size is used. The second approach, D-All, optimizes all I/O, including DB page reads/writes, by performing them on the entire SSD page where the desired DB page is located. However, D-All requires more memory and leads to additional redundancy for DB page I/O operations. As shown in Figure 11(b), D-All exhibits superior performance compared to D-Possible, even with the I/O redundancy trade-off.
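To make the NSU mechanism of Figure 11(a) concrete, the sketch below issues a batch of non-blocking DB page reads and then applies deltas as each read completes, using the same hypothetical non-blocking interface assumed in Section 4.5.

```c
#define NSU 16                 /* number of simultaneously updated SPDs (example) */

extern int  pread_ik_nb(int file_no, void *buf, unsigned long size, unsigned long offset);
extern void wait_ik(int handle);
extern long pwrite_ik(int file_no, const void *buf, unsigned long size, unsigned long offset);
extern void apply_deltas(unsigned idx, char *page);   /* delta update for the idx-th SPD */

void update_spd_batch(unsigned long page_offsets[NSU])
{
    static char pages[NSU][16384];
    int handles[NSU];

    /* Issue NSU non-blocking DB page reads; the flash channels serve them in parallel. */
    for (unsigned i = 0; i < NSU; i++)
        handles[i] = pread_ik_nb(/*DB file*/ 1, pages[i], sizeof pages[i], page_offsets[i]);

    /* As each read completes, apply its deltas and write the page back; the remaining
     * reads keep progressing in the background. */
    for (unsigned i = 0; i < NSU; i++) {
        wait_ik(handles[i]);
        apply_deltas(i, pages[i]);
        pwrite_ik(1, pages[i], sizeof pages[i], page_offsets[i]);
    }
}
```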


Fig. 11. Performance of the DB checkpointing (a) with I/O overlapping optimization based on the number of simultaneously updated SPDs (NSU) and (b) with direct I/O optimization.

6.3 Effectiveness of ISP Offloading and Interference with Host Application

Figure 12 shows the performance of ISP offloading in real-world scenarios. As discussed earlier, we evaluated the effectiveness of ISP offloading and the impact of the ISP agent on host performance using the real-world application, Sysbench on MariaDB. In this experiment, MariaDB generates standard I/O requests for Sysbench transactions while offloading the DB checkpointing using ISP. Offloading the ISP kernel to the SSD could potentially affect the I/O performance of the host, subsequently impacting Sysbench performance.


Fig. 12. Host DB performance and the ISP kernel performance comparison of vanilla DB, FTL-based-ISP, and ISP agent using a real-world application Sysbench on MariaDB.

The Sysbench OLTP performance reflects the effectiveness of the host application by indicating the number of Sysbench transactions processed per second. The processing speed of the offloaded deltas indicates the efficiency of the ISP kernel. Our experiments with MariaDB used three configurations: vanilla DB (without ISP offloading), FTL-based-ISP offloading, and ISP agent offloading. We evaluated the performance of each configuration and normalized the delta processing speeds to the results obtained with FTL-based-ISP.

As shown by the black solid line in Figure 12, the Sysbench OLTP performance is greatly enhanced by ISP offloading, resulting in \(1.32\times\) and \(1.65\times\) improvements using FTL-based-ISP and ISP agent, respectively. This demonstrates the significant effectiveness of ISP-assisted DB checkpointing in improving overall DB performance. The ISP agent proves especially effective in harnessing ISP offloading, as it mitigates host I/O interference through data rearrangement separation, a functionality not achievable by FTL-based-ISP.

However, as shown by the red dashed line in Figure 12, the unoptimized ISP agent achieved only 0.894\(\times\) the ISP-side performance of FTL-based-ISP. This demonstrates that there is a trade-off between host I/O performance and ISP kernel performance for the ISP agent and FTL-based-ISP. Nevertheless, as depicted on the right side of Figure 12, the ISP agent's optimization techniques significantly improved ISP kernel performance. The ISP agent increased the performance of the ISP kernel by 1.035\(\times\) through I/O overlapping, 5.044\(\times\) through direct I/O, and up to 15.998\(\times\) through the combination of both techniques, while maintaining a 1.65\(\times\) enhancement in host performance. These results demonstrate that the ISP agent's optimizations can considerably enhance the performance of the ISP kernel.

In summary, optimizing the ISP agent allows for significant improvements in ISP kernel performance while maintaining the host performance advantage, despite the trade-off between host I/O performance and ISP kernel performance.

6.4 Insights from Experimental Results

The experimental results indicate the following three conclusions:

First, offloading background tasks to the ISP enhances overall application performance, particularly when managed by the ISP agent. Since the DB checkpointing was previously executed in the background, the ISP-assisted DB currently employs internal SSD resources to operate such background tasks. Figure 12 demonstrates that offloading DB checkpointing to the ISP improves overall DB performance. Among various ISP techniques, the ISP agent provides the best option for offloading background tasks to the ISP and maximizing overall application performance while minimizing interference with host I/O operations.

Second, it is important to evaluate the performance of ISP offloading against CPU execution before offloading general foreground tasks to the ISP. In our experiments, CPU execution was generally faster than ISP offloading, with the exception of DB checkpointing. This is due to the difference in core performance: the ISP kernel was executed on a 200 MHz MicroBlaze, while CPU execution was performed on a 4.40 GHz Intel i7-4790K. Therefore, executing the ISP kernel on higher-performance computing cores will likely result in a substantial performance increase.

Third, the ISP agent outperforms most ISP techniques in various situations and is the most efficient solution when engineering costs are considered. Figures 9 and 10 show that FIFO-ISP outperforms the other ISP techniques in terms of ISP-side performance, while the optimized ISP agent achieves comparable or superior results. Optimizing ISP kernels with I/O overlapping or direct I/O does demand additional engineering effort; however, the ISP agent enables direct offloading of pre-existing host code to the ISP, eliminating the need to rewrite ISP kernels in the streaming programming model required by FIFO-ISP. Therefore, when engineering costs are considered, the ISP agent proves to be the most cost-effective ISP technology.


7 RELATED WORKS

In-storage computing techniques have been proposed for various data-intensive domains such as databases [6, 11, 23, 28, 29, 36], graph processing [15, 19, 22], deep learning [12, 15, 35], and string matching [1, 20, 24]. These studies have achieved substantial progress in performance and energy efficiency by migrating core functions of the host application to ISP devices. However, their use is limited to specific areas, whereas the ISP agent aims to serve general purposes.

The available general-purpose ISP frameworks can be classified as either FIFO-ISP or FTL-based-ISP. In FIFO-ISP, the host program controls the input and output data of the ISP kernels, which cannot request I/Os for random storage data. Therefore, the ISP kernel must be defined as a function for streaming data. Several FIFO-ISP frameworks integrate the ISP with host I/O [3, 13, 27], leading to the potential for improved overall performance by pipelining ISP computations, DMA, and NAND flash accesses. However, as discussed in Section 3.3, FIFO-ISP cannot handle workloads that involve complex storage data accesses, such as ISP-assisted DB checkpointing. To elaborate, Summarizer [13] executes user-defined ISP kernels with the data stored in the NAND buffer. INSIDER [27] features a POSIX file I/O-like interface that enables the host application to launch the ISP kernel. SmartSSD [16] provides an OpenCL API to deploy a kernel on FPGA inside the SSD. SmartSSD enables the ISP kernel to utilize NAND flash data as both input and output through peer-to-peer communication between the NAND flash and FPGA DRAM. Similar to Summarizer and INSIDER, blockNDP [3] also incorporates the host’s read/write operation with a per-block function. It also provides transform commands for storage-to-storage ISP execution.

FTL-based-ISP techniques, including Biscuit [7] and CSD [5, 8, 9, 32], have also been proposed. They enable the ISP kernel to perform I/Os for random storage data by transferring requests to the existing storage firmware. However, the absence of hardware support leads to ISP kernel I/O overheads that interfere with host I/O performance. The ISP agent mitigates this issue by separating the memory and the pre-/post-computation of ISP kernel I/O from the existing storage firmware. It also offers additional optimization techniques to enhance ISP kernel performance. In particular, Biscuit [7] provides a flow-based programming interface specifically designed for accessing file data in the SSD through ISP kernels. Its programming model allows the host program to designate the files utilized in the ISP kernel, similar to the ISP agent. Biscuit employs a hardware pattern matcher to expedite large read operations on its devices. In CSD [5, 8, 9, 32], the storage device runs a Linux OS, and the file systems are synchronized between the host OS and the storage OS via TCP/IP communication over PCIe; the ISP kernel I/Os are managed by the block device driver or backend firmware together with the host I/Os.


8 CONCLUSION

In this article, we present the ISP agent, an ISP workload offloading framework. The ISP agent enhances programmability by enabling the ISP kernel to request I/O with file-level information. Unlike the FTL-based-ISP, the ISP agent separates the internal resources needed to handle ISP kernel I/O from the current storage firmware to minimize interference with host I/O performance. Additionally, the ISP agent provides two optimization options for ISP kernel I/O: I/O overlapping and direct I/O. We evaluated the ISP agent with blockNDP and ISP-assisted DB checkpointing applications. As a result, the optimization options of the ISP agent considerably enhance the performance of the ISP kernel while simultaneously maximizing performance improvements on the host via ISP offloading.

REFERENCES

  [1] Adams Ian F., Keys John, and Mesnier Michael P. 2019. Respecting the block interface—Computational storage using virtual objects. In Proceedings of the 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage'19). USENIX Association. Retrieved from https://www.usenix.org/conference/hotstorage19/presentation/adams
  [2] Axboe Jens. 2022. Flexible I/O Tester. Retrieved from https://github.com/axboe/fio
  [3] Barbalace Antonio, Decky Martin, Picorel Javier, and Bhatotia Pramod. 2020. BlockNDP: Block-storage near data processing. In Proceedings of the 21st International Middleware Conference Industrial Track (Middleware'20). ACM, New York, NY, 8–15.
  [4] Cho Sangyeun, Park Chanik, Oh Hyunok, Kim Sungchan, Yi Youngmin, and Ganger Gregory R. 2013. Active disk meets flash: A case for intelligent SSDs. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing (ICS'13). ACM, New York, NY, 91–102.
  [5] Do Jaeyoung, Ferreira Victor C., Bobarshad Hossein, Torabzadehkashi Mahdi, Rezaei Siavash, Heydarigorji Ali, Souza Diego, Goldstein Brunno F., Santiago Leandro, Kim Min Soo, Lima Priscila M. V., França Felipe M. G., and Alves Vladimir. 2020. Cost-effective, energy-efficient, and scalable storage computing for large-scale AI applications. ACM Trans. Storage 16, 4, Article 21 (Oct. 2020), 37 pages.
  [6] Do Jaeyoung, Kee Yang-Suk, Patel Jignesh M., Park Chanik, Park Kwanghyun, and DeWitt David J. 2013. Query processing on smart SSDs: Opportunities and challenges. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'13). ACM, New York, NY, 1221–1230.
  [7] Gu Boncheol, Yoon Andre S., Bae Duck-Ho, Jo Insoon, Lee Jinyoung, Yoon Jonghyun, Kang Jeong-Uk, Kwon Moonsang, Yoon Chanho, Cho Sangyeun, Jeong Jaeheon, and Chang Duckhyun. 2016. Biscuit: A framework for near-data processing of big data workloads. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA'16). IEEE Press, 153–165.
  [8] HeydariGorji Ali, Torabzadehkashi Mahdi, Rezaei Siavash, Bobarshad Hossein, Alves Vladimir, and Chou Pai H. 2020. Stannis: Low-power acceleration of DNN training using computational storage devices. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC'20). 1–6.
  [9] HeydariGorji Ali, Torabzadehkashi Mahdi, Rezaei Siavash, Bobarshad Hossein, Alves Vladimir, and Chou Pai H. 2022. In-storage processing of I/O intensive applications on computational storage drives. In Proceedings of the 23rd International Symposium on Quality Electronic Design (ISQED'22). 1–6.
  [10] Intel. 2014. Intel Core i7-4790K Processor. Retrieved from https://www.intel.com/content/www/us/en/products/sku/80807/intel-core-i74790k-processor-8m-cache-up-to-4-40-ghz/specifications.html. Accessed: 2022-11-09.
  [11] Jo Insoon, Bae Duck-Ho, Yoon Andre S., Kang Jeong-Uk, Cho Sangyeun, Lee Daniel D. G., and Jeong Jaeheon. 2016. YourSQL: A high-performance database system leveraging in-storage computing. Proc. VLDB Endow. 9, 12 (Aug. 2016), 924–935.
  [12] Kim Junkyum, Kang Myeonggu, Han Yunki, Kim Yang-Gon, and Kim Lee-Sup. 2023. OptimStore: In-storage optimization of large scale DNNs with on-die processing. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA'23). 611–623.
  [13] Koo Gunjae, Matam Kiran Kumar, I Te, Narra H. V. Krishna Giri, Li Jing, Tseng Hung-Wei, Swanson Steven, and Annavaram Murali. 2017. Summarizer: Trading communication with computing near storage. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50'17). ACM, New York, NY, 219–231.
  [14] Kwak Jaewook, Lee Sangjin, Park Kibin, Jeong Jinwoo, and Song Yong Ho. 2020. Cosmos+ OpenSSD: Rapid prototype for flash storage systems. ACM Trans. Storage 16, 3, Article 15 (July 2020), 35 pages.
  [15] Kwon Miryeong, Gouk Donghyun, Lee Sangwon, and Jung Myoungsoo. 2022. Hardware/software co-programmable framework for computational SSDs to accelerate deep learning service on large-scale graphs. In Proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST'22). USENIX Association, 147–164. Retrieved from https://www.usenix.org/conference/fast22/presentation/kwon
  [16] Lee Joo Hwan, Zhang Hui, Lagrange Veronica, Krishnamoorthy Praveen, Zhao Xiaodong, and Ki Yang Seok. 2020. SmartSSD: FPGA accelerated near-storage data analytics on SSD. IEEE Comput. Arch. Lett. 19, 2 (2020), 110–113.
  [17] Lee Young-Sik, Quero Luis Cavazos, Kim Sang-Hoon, Kim Jin-Soo, and Maeng Seungryoul. 2016. ActiveSort: Efficient external sorting using active SSDs in the MapReduce framework. Future Gen. Comput. Syst. 65 (2016), 76–89.
  [18] Lee Young-Sik, Quero Luis Cavazos, Lee Youngjae, Kim Jin-Soo, and Maeng Seungryoul. 2014. Accelerating external sorting via on-the-fly data merge in active SSDs. In Proceedings of the 6th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage'14). USENIX Association, Philadelphia, PA. Retrieved from https://www.usenix.org/conference/hotstorage14/workshop-program/presentation/lee
  [19] Li Cangyuan, Wang Ying, Liu Cheng, Liang Shengwen, Li Huawei, and Li Xiaowei. 2021. GLIST: Towards in-storage graph learning. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC'21). USENIX Association, 225–238. Retrieved from https://www.usenix.org/conference/atc21/presentation/li-cangyuan
  [20] Ghiasi Nika Mansouri, Park Jisung, Mustafa Harun, Kim Jeremie, Olgun Ataberk, Gollwitzer Arvid, Cali Damla Senol, Firtina Can, Mao Haiyu, Alserr Nour Almadhoun, Ausavarungnirun Rachata, Vijaykumar Nandita, Alser Mohammed, and Mutlu Onur. 2022. GenStore: A high-performance in-storage processing system for genome sequence analysis. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'22). ACM, New York, NY, 635–654.
  [21] MariaDB. 2020. MariaDB Server: The open source relational database. Retrieved from https://mariadb.org/
  [22] Matam Kiran Kumar, Koo Gunjae, Zha Haipeng, Tseng Hung-Wei, and Annavaram Murali. 2019. GraphSSD: Graph semantics aware SSD. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA'19). ACM, New York, NY, 116–128.
  [23] Oh Hyeonseok, Jang Hyeongwon, Kim Jaeeun, Kim Jongbin, Han Hyuck, Kang Sooyong, and Jung Hyungsoo. 2020. DEMETER: Hardware-assisted database checkpointing. ACM, New York, NY, 394–403.
  [24] Pei Shuyi, Yang Jing, and Yang Qing. 2019. REGISTOR: A platform for unstructured data processing inside SSD storage. ACM Trans. Storage 15, 1, Article 7 (Mar. 2019), 24 pages.
  [25] Qiao Weikang, Oh Jihun, Guo Licheng, Chang Mau-Chung Frank, and Cong Jason. 2021. FANS: FPGA-accelerated near-storage sorting. In Proceedings of the IEEE 29th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM'21). 106–114.
  [26] Quero Luis Cavazos, Lee Young-Sik, and Kim Jin-Soo. 2015. Self-sorting SSD: Producing sorted data inside active SSDs. In Proceedings of the 31st Symposium on Mass Storage Systems and Technologies (MSST'15). 1–7.
  [27] Ruan Zhenyuan, He Tong, and Cong Jason. 2019. INSIDER: Designing in-storage computing system for emerging high-performance drive. In Proceedings of the USENIX Annual Technical Conference (USENIX ATC'19). USENIX Association, 379–394.
  [28] Salamat Sahand, Aboutalebi Armin Haj, Khaleghi Behnam, Lee Joo Hwan, Ki Yang Seok, and Rosing Tajana. 2021. NASCENT: Near-storage acceleration of database sort on SmartSSD. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'21). ACM, New York, NY, 262–272.
  [29] Schmid Robert, Plauth Max, Wenzel Lukas, Eberhardt Felix, and Polze Andreas. 2020. Accessible near-storage computing with FPGAs. In Proceedings of the 15th European Conference on Computer Systems (EuroSys'20). ACM, New York, NY, Article 28, 12 pages.
  [30] Seshadri Sudharsan, Gahagan Mark, Bhaskaran Sundaram, Bunker Trevor, De Arup, Jin Yanqin, Liu Yang, and Swanson Steven. 2014. Willow: A user-programmable SSD. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Broomfield, CO, 67–80. Retrieved from https://www.usenix.org/conference/osdi14/technical-sessions/presentation/seshadri
  [31] Sysbench. 2020. Sysbench, Scriptable Multi-threaded Benchmark Tool. Retrieved from https://github.com/akopytov/sysbench
  [32] Torabzadehkashi Mahdi, Rezaei Siavash, Heydarigorji Ali, Bobarshad Hosein, Alves Vladimir, and Bagherzadeh Nader. 2019. Catalina: In-storage processing acceleration for scalable big data analytics. In Proceedings of the 27th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP'19). 430–437.
  [33] Wang Jianguo, Lo Eric, Yiu Man Lung, Tong Jiancong, Wang Gang, and Liu Xiaoguang. 2014. Cache design of SSD-based search engine architectures: An experimental study. ACM Trans. Inf. Syst. 32, 4, Article 21 (Oct. 2014), 26 pages.
  [34] Wang Jianguo, Park Dongchul, Kee Yang-Suk, Papakonstantinou Yannis, and Swanson Steven. 2016. SSD in-storage computing for list intersection. In Proceedings of the 12th International Workshop on Data Management on New Hardware (DaMoN'16). ACM, New York, NY, Article 4, 7 pages.
  [35] Wilkening Mark, Gupta Udit, Hsia Samuel, Trippel Caroline, Wu Carole-Jean, Brooks David, and Wei Gu-Yeon. 2021. RecSSD: Near data processing for solid state drive based recommendation inference. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'21). ACM, New York, NY, 717–729.
  [36] Woods Louis, István Zsolt, and Alonso Gustavo. 2014. Ibex: An intelligent storage engine with support for advanced SQL offloading. Proc. VLDB Endow. 7, 11 (July 2014), 963–974.
  [37] Xilinx AMD. 2019. Microblaze Soft Processor Core. Retrieved from https://www.xilinx.com/products/design-tools/microblaze.html. Accessed: 2022-03-14.
