
Perseid: A Secondary Indexing Mechanism for LSM-Based Storage Systems

Published: 19 February 2024


Abstract

LSM-based storage systems are widely used for their superior write performance on block devices. However, they currently fail to efficiently support secondary indexing, since a secondary index query operation usually needs to retrieve multiple small values that are scattered across multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly competent for efficient secondary indexing. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems, which takes into account the characteristics of both PM and secondary indexing. Perseid consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM and hash-based validation approach that filters out obsolete values with negligible overhead, and (3) two adapted optimizations on primary table searching issued from secondary indexes to accelerate non-index-only queries. Our evaluation shows that Perseid outperforms existing PM-based indexes by 3–7× and achieves about two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques, even when they run on PM instead of disks.


1 INTRODUCTION

Log-Structured Merge trees (LSM-trees) feature outstanding write performance and thus have been widely adopted in modern key-value (KV) stores, such as RocksDB [28] and Cassandra [1].

Different from in-place update storage structures (e.g., B\(^+\)-Tree), LSM-trees buffer writes in memory and flush them to storage devices in batches periodically to avoid random writes, which enables high write performance and low device write amplification. Besides high write performance, many database applications also require high-performance queries on not only primary keys but also other specific values [13], thus necessitating secondary indexing techniques.

LSM-trees’ attributes make it challenging to design efficient secondary indexing. Modern LSM-based storage systems typically store a secondary index as another LSM-tree [54] (e.g., a column family in RocksDB [51]). However, designed for block devices and optimized for write performance, LSM-trees are not competent data structures for secondary indexes, which require high search performance. First, since secondary indexes usually only store primary keys instead of full records1 as values, KV pairs in secondary indexes are small. LSM-trees’ heavy lookup operations are inefficient for these small KV pairs. Second, secondary keys are not unique and can have multiple associated primary keys. LSM-trees’ out-of-place write pattern scatters these non-consecutively arriving values (i.e., associated primary keys) into multiple pieces at different levels. Consequently, query operations need to search all levels in the LSM-based secondary index to fetch these value pieces. Besides the device I/O overhead, LSM-trees also incur non-negligible CPU and memory overheads (i.e., indexing and Bloom filter checking) [21, 25, 40].

Moreover, the consistency of secondary indexes is another issue in LSM-based storage systems. As an LSM-based primary table adopts the blind-write pattern to insert, update, and delete records (appends new data without checking prior data, versus read-modify-write in B\(^+\)-Trees) for high write performance, it is unable to delete the obsolete entry in a secondary index without acquiring the old secondary key. Consequently, when querying a secondary index, the system should validate each entry by checking the primary table before returning the results to users, which introduces many unnecessary but expensive lookups on the primary table for obsolete entries. Some systems fetch old records when updating or deleting records to keep secondary indexes up-to-date synchronously [11, 51], whereas this method discards the blind-write attribute and thus degrades the write performance.

Though many efforts have been made to mitigate these problems [42, 47, 54, 59], they struggle to solve them well, sacrificing either the write performance of the LSM-based storage system or the query performance of the secondary index.

As secondary indexing demands low-latency queries and the KV pairs of secondary indexes are small, we argue that leveraging persistent memory (PM) to provide a new solution for secondary indexing is promising. PM has many attractive advantages, such as byte-addressability, DRAM-comparable access latency, and data persistence, which make it well suited to secondary indexing. Though there are many state-of-the-art PM-based indexes [17, 31, 37, 39, 43, 44, 52, 53, 72, 73], none of them are designed for secondary indexing. Without considering the non-unique nature of secondary keys and consistency in LSM-based KV stores, simply adopting existing general-purpose PM-based indexes as secondary indexes squanders their performance.

In this work, we propose Perseid [61], a new persistent-memory-based secondary indexing mechanism for LSM-based KV stores. Perseid contains PS-Tree, a specifically designed data structure on PM for secondary indexes. PS-Tree can leverage state-of-the-art PM-based indexes and enhance them with a specific value layer that considers the characteristics of both PM and secondary indexing. The value layer of PS-Tree blends a log-structured approach with B\(^+\)-Tree leaf nodes, which is both PM-friendly and secondary-index-friendly. Specifically, new values are appended to value pages for efficient insertion on PM. During a value page split, multiple values (i.e., associated primary keys) that belong to the same secondary key are reorganized to be stored contiguously for efficient querying.

Moreover, Perseid retains the blind-write attribute of LSM-based KV stores for high write performance without sacrificing secondary index query performance. This is achieved by a lightweight hybrid PM-DRAM and hash-based validation approach in Perseid. Perseid uses a hash table on PM to record the latest version of primary keys. However, multiple random accesses on PM still incur high latencies. Thus, Perseid adopts a small mirror of the validation hash table on DRAM, which only contains useful information for validation. During validation, the volatile hash table absorbs random accesses to PM, and thus reduces the validation overhead. The small volatile hash table not only saves DRAM memory space but also reduces cache pollution.

Perseid has a fairly low latency of index-only query.2 However, the overhead of non-index-only queries is still dominated by the LSM-based primary table. Therefore, we further propose two optimizations for non-index-only queries in Perseid. First, as querying the primary table issued by the secondary index is an internal operation, we can locate KV pairs much more efficiently with additional auxiliary information, reducing cumbersome indexing operations. When the primary table adopts the tiering compaction strategy [24, 48], we can further bypass Bloom filter checks. Second, as one secondary index query may need to search for multiple independent records in the primary table, we parallelize these searching operations with multiple threads. Since search latencies on the LSM-based primary table may vary widely, we apply a worker-active scheme to the parallel threads to avoid load imbalance among threads and improve utilization.

We implement Perseid and evaluate it against state-of-the-art PM-based indexes and LSM-based secondary indexing techniques on PM. The evaluation results show that Perseid outperforms existing PM-based indexes by 3–7\(\times\) for queries, and achieves about two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques even when they run on PM instead of disks, while maintaining the high write performance of LSM-based storage systems.

In summary, this article makes the following contributions:

Analysis of the inefficiencies of LSM-based secondary indexing techniques and existing PM-based indexes when adopted as secondary indexes for LSM-based KV stores.

Perseid, an efficient PM-based secondary indexing mechanism, which includes a secondary index-friendly structure, a lightweight validation approach, and two optimizations on primary table searching issued from secondary indexes.

Experiments that demonstrate the advantage of Perseid.


2 BACKGROUND

2.1 Log-Structured Merge Trees

The LSM-tree applies out-of-place updates and performs sequential writes, which achieves superior write performance compared with other in-place-update storage structures.

The LSM-tree has a multi-level structure on storage and each level comprises one or several sorted runs. The size of Level \(L_n\) is several times (e.g., 10) larger than that of Level \(L_{n-1}\). Each sorted run contains sorted KV pairs and is further partitioned into multiple small components called SSTables. In LSM-trees, new KV pairs are first buffered into a memory component called a MemTable. When the MemTable fills up, it turns into an immutable MemTable and gets flushed to storage as a sorted run. Since sorted runs have overlapping key ranges, a query operation needs to search multiple sorted runs. To limit the number of sorted runs and improve search efficiency, LSM-trees conduct compaction periodically to merge several components and remove obsolete KV pairs.

Two typical compaction strategies and their variants are commonly used in LSM-trees [24, 48]. The leveling strategy [28, 30] allows each level (besides \(L_0\)) to have only one sorted run. When a level (\(L_n\)) exceeds its size limit, one or more SSTables from level \(L_n\) and all overlapping SSTables from the higher level \(L_{n+1}\) are sort-merged to generate new SSTables at level \(L_{n+1}\). The tiering strategy [55, 64] allows each level (besides \(L_0\)) to have multiple sorted runs to reduce the write amplification. To compact SSTables at level \(L_n\), several SSTables in a range partition are merged into new SSTables written directly to level \(L_{n+1}\), without rewriting existing SSTables at level \(L_{n+1}\). Compared with the leveling strategy, the tiering strategy has a much smaller write amplification ratio and thus higher write performance. However, since query operations need to search multiple sorted runs in each level, LSM-trees with a tiering strategy have much lower read performance.

2.2 Secondary Index in LSM-Based Systems

Many applications require queries on specific values other than primary keys. Without an index on those values, database systems need to scan the whole table to find relevant data. Thus, secondary indexing is an indispensable technique in database systems. For example, in Facebook’s database service for social graphs, secondary keys are heavily used, such as finding the IDs of users who liked a specific photo [13, 51]. In this work, we mainly discuss stand-alone secondary indexes, which are separate index structures apart from the primary table and are commonly used in database systems [54]. A stand-alone secondary index maintains mappings from each secondary key to its associated primary keys. As secondary keys are not unique, a single secondary key can have multiple associated primary keys.

Consistency Strategy. Since LSM-based KV stores update or delete records by out-of-place blind-writes, maintaining consistency of secondary indexes becomes a challenge in LSM-based storage systems. There are two strategies to handle this issue, Synchronous and Validation.

With the Synchronous strategy, whenever a record is written to the primary table, the secondary index is maintained synchronously to reflect the latest and valid status (e.g., AsterixDB [11], MongoDB [5], MyRocks [51]). For example, as shown in Figure 1(a), when writing a new record {p2\(\rightarrow\)s1} (p denotes the primary key, s denotes the secondary key, and other fields are omitted for simplicity) into the primary table, the storage system also fetches the old record of p2 to get its old secondary key s2. Then, the storage system inserts not only a new entry {s1\(\rightarrow\)p2} but also a tombstone to delete the obsolete entry {s2\(\rightarrow\)p2} in the secondary index. Nevertheless, this strategy discards the blind-write attribute and thus degrades write performance, which is the main advantage of LSM-based KV stores.

Fig. 1.

Fig. 1. Stand-alone secondary indexing in LSM-based systems with Synchronous strategy and Validation strategy [54]. The shaded entries indicate that they are invisible in the index.

By contrast, as shown in Figure 1(b), the Validation strategy only inserts the new entry {s1\(\rightarrow\)p2} but does not maintain the consistency of obsolete entries in secondary indexes (e.g., Cassandra [1, 2], DELI [59], and the secondary indexing proposed by Luo et al. [47]). However, secondary index query operations need to validate all relevant entries by checking the primary table to filter out obsolete mappings. Though previous work proposed some approaches to reduce the validation overhead, their benefits are limited. For example, DELI [59] lazily repairs the secondary index along with compaction of the primary table. Luo et al. [47] propose to store an extra timestamp for each entry in the secondary index and to use a primary key index, which only stores primary keys and their latest timestamps, for validation. Queries then validate against the primary key index instead of the primary table. However, since the primary key index is also an LSM-tree, though it filters out unnecessary point lookups on the primary table, it still requires point lookups on itself.

Index Type. As a secondary key can have multiple associated primary keys, LSM-based secondary indexes come in two types for handling this: the composite index and the posting list [54]. The key in a composite index (i.e., the composite key) is a concatenation of a secondary key and a primary key. The composite index is easy to implement and is adopted by many systems [20, 51, 54]. However, it turns a secondary lookup operation into a prefix range search operation.
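To make the prefix-search behavior concrete, the following sketch (our own illustration, not code from any of the cited systems) encodes a composite key from 64-bit integer keys; a lookup on a secondary key then scans the contiguous range of composite keys sharing that prefix.

```cpp
#include <array>
#include <cstdint>

// Illustrative composite-key encoding for 64-bit keys (an assumption; real
// systems encode arbitrary byte strings). Big-endian encoding makes the
// lexicographic order of composite keys match the numeric order of the keys.
using CompositeKey = std::array<uint8_t, 16>;

static void PutBigEndian64(uint8_t* dst, uint64_t v) {
  for (int i = 0; i < 8; ++i) dst[i] = static_cast<uint8_t>(v >> (56 - 8 * i));
}

CompositeKey EncodeComposite(uint64_t skey, uint64_t pkey) {
  CompositeKey key;
  PutBigEndian64(key.data(), skey);      // prefix: secondary key
  PutBigEndian64(key.data() + 8, pkey);  // suffix: primary key
  return key;
}

// A secondary lookup on skey scans the range
// [EncodeComposite(skey, 0), EncodeComposite(skey, UINT64_MAX)].
```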

The posting list stores multiple associated primary keys in the value of a KV pair. Entries in each posting list can be sorted by primary key or recency. When a new record is inserted, there are two update strategies. The eager update strategy conducts a read-modify-write, fetching the old posting list and merging the new primary key into it. The lazy update strategy blindly inserts a new posting list that only includes the new primary key, leaving posting-list merging to compaction. However, a secondary lookup then needs to search all levels to fetch all relevant entries.

Limitations. Even though there are multiple strategies, types, and optimizations, LSM-based secondary indexes have to sacrifice either the write performance of storage systems or the secondary index query performance, which results from the incompatibility of inherent attributes of LSM-trees and characteristics of secondary indexes.

2.3 Persistent Memory

PM, also called Non-Volatile Memory (NVM) or Storage Class Memory (SCM), provides several attractive benefits for storage systems, such as byte-addressability, DRAM-comparable access latency, and data persistency. CPUs can access data on PM directly with load and store instructions. Besides, compared to DRAM, PM has a much larger capacity, lower cost, and lower power consumption. Therefore, both academia and industry have proposed plenty of work to harness PM’s benefits in storage systems [18, 26, 33, 46, 53, 56, 58, 66, 67, 73]. In addition to DDR bus-connected PM (e.g., Intel Optane DCPMM), the recent high-bandwidth and low-latency IO interconnection, Compute Express Link (CXL) [4, 35], brings a new form of SCM, CXL device-attached memory (e.g., Samsung’s CMM-H (CXL Memory Module - Hybrid) [9, 10]).

However, PM also has some performance idiosyncrasies. For example, the current commercial PM hardware (i.e., Intel Optane DCPMM) has a physical media access granularity of 256 bytes, leading to high random access latency (about 3\(\times\) that of DRAM for reads) and write amplification for small random writes, which needs to be considered when designing PM systems [18, 60, 63, 65, 68, 71]. These performance idiosyncrasies are likely to persist, or even be more pronounced, in other PM devices due to physical media characteristics (e.g., the flash page in CXL-SSD).

Though Intel Optane DCPMM is currently the only available commercial device, we believe it can represent other emerging PM devices to some extent. In this article, we mainly focus on PM’s general characteristics described above but not the specific numbers of Optane’s attributes.


3 MOTIVATION

Though recent work introduces some techniques to optimize secondary indexing in LSM-based systems, we find that the performance of LSM-based secondary indexing is still unsatisfactory due to the incompatibility between the inherent attributes of LSM-trees and the characteristics of secondary indexing. On the one hand, the LSM-tree is not a competent data structure for secondary indexes, since the characteristics of secondary indexes exacerbate the deficiency of LSM-tree read operations: (1) KV pairs are usually small in secondary indexes, for which LSM-trees’ cumbersome lookup operations are ill-suited; (2) secondary keys are not unique and can have multiple values, which LSM-trees’ out-of-place updates scatter, exacerbating query inefficiency. On the other hand, the blind-write attribute of LSM-based primary tables makes the consistency of secondary indexes troublesome.

Therefore, this motivates us to find a better solution for secondary indexes in LSM-based storage systems. As PM provides attractive features such as byte-addressability, DRAM-comparable access latency, and data persistency, we argue that it is promising to provide secondary indexing with PM.

Though there are many state-of-the-art PM-based index structures, they are not specifically designed for secondary indexing. To adopt them as secondary indexes (e.g., to support the multi-value feature), naive approaches include the composite index or using a conventional allocator to organize posting lists (Section 2.2). However, simply adopting these naive approaches to use existing PM-based indexes as secondary indexes squanders their advantages.

Why not use a PM-based composite index? Though this method is straightforward and easy to implement in LSM-based systems, it is not ideal for tree-based persistent indexes. First, when adding or removing a primary key for a secondary key, a value update operation turns into a new composite key insert or delete operation for composite indexes. Insert and delete operations are more expensive than update operations in a PM-based tree index because they may cause shift operations or structural modification operations (SMOs). Second, composite indexes store every pair of mappings as an individual KV pair, expanding the number of KV pairs, which increases the height of the tree index and thus degrades its query performance. Third, storing the same secondary keys repeatedly in multiple composite keys wastes PM space, which can be a dominant overhead for some real-world databases [70].

Why not use a conventional allocator for posting lists? One may use a conventional allocator, such as a slab-based allocator or a log-structured approach, to allocate space for values (posting lists) outside the index. Nevertheless, they are not suitable for the values of secondary keys. One way is to allocate space for a whole posting list for each secondary key with a general-purpose allocator such as a slab-based allocator. However, these general-purpose allocators usually have high overheads on PM, since they employ expensive mechanisms for crash consistency (e.g., logging) and perform many small writes on the metadata necessary for recovery [7]. Though some PM allocators relieve allocation overheads with techniques such as deferring garbage collection to post-failure [14, 15], slab-based allocators have low memory utilization due to memory fragmentation [57], which cannot be eliminated by restarting on PM [22]. Worse still, these issues are more severe for secondary indexes. In secondary indexes, the posting list of a secondary key is changed by inserting or removing primary keys, which means the size of the posting list (the total size of the associated primary keys) changes constantly. This characteristic requires frequent reallocations and copy-on-writes. Another way is to allocate space for each individual new value (primary key) of a posting list and use pointers to link them together. One can use a lightweight and PM-friendly allocator, such as a log-structured approach, for its sequential-write pattern. However, this scatters the primary keys associated with the same secondary key into multiple pieces and thus reduces query performance due to poor data locality.

Our experiments (Section 5.2) show that these naive approaches on PM-based indexes lead to several times performance degradation. It thus motivates us to explore a new PM-based secondary indexing mechanism for LSM-based KV stores. In addition, an efficient validation approach is required to retain the blind-write attribute of LSM-based KV stores.


4 PERSEID DESIGN

4.1 Overview

Motivated by the analysis above, we propose Perseid, a PM-based secondary indexing mechanism for LSM-based storage systems, which overcomes traditional LSM-based secondary indexes’ deficiencies. Figure 2 shows the overall architecture of an LSM-based storage system with Perseid.

Fig. 2.

Fig. 2. The overall architecture with Perseid.

Perseid contains a PM-based secondary index, PS-Tree, which is both PM-friendly and secondary-index-friendly: by adopting log-structured insertion, PS-Tree achieves fast insertion on PM; by storing primary keys associated with the same secondary key close together and further rearranging them to be adjacent, PS-Tree supports efficient query operations (Section 4.2).

Perseid retains the blind-write attribute of the LSM primary table for write performance (i.e., it takes the Validation strategy, Section 2.2) without sacrificing query performance, by introducing a lightweight hybrid PM-DRAM and hash-based validation approach. The validation approach contains a persistent hash table that records version information of primary keys, and a volatile, lightweight hash table that absorbs random accesses to PM (Section 4.3).

To accelerate non-index-only queries, Perseid adapts two optimizations on primary table searching issued from secondary indexes. Perseid filters out irrelevant component searching with sequence numbers and parallelizes primary table searching in an efficient way (Section 4.4).

4.2 PS-Tree Design

Perseid introduces PS-Tree, a PM-based secondary index, which is designed considering the multi-value feature and PM characteristics. We first present PS-Tree’s structure (Section 4.2.1), and then describe its operations (Section 4.2.2).

4.2.1 Structure.

The overall structure of PS-Tree is shown in Figure 3. PS-Tree consists of two layers: the SKey Layer for indexing secondary keys and the PKey Layer for storing values. Specifically, the SKey Layer resembles a normal in-memory index, which maintains mappings from secondary keys to posting lists in the PKey Layer. Thus, the SKey Layer can leverage an existing high-performance PM-based index (e.g., P-Masstree [39, 50] or FAST&FAIR [31]). The PKey Layer stores a variable number of values (i.e., primary keys and other user-specified values) for each secondary key in a manner that blends B\(^+\)-Tree leaf nodes with a log-structured approach, combining the advantages of the two. The value of a secondary key in the SKey Layer is a pointer, which points to the corresponding primary keys in the PKey Layer. Each pointer is a combination of the address of the PKey Page and an offset within the page.

Fig. 3.

Fig. 3. The structure of PS-Tree. KP: Key Pointer pair, GH: Group Header, and PE: PKey Entry.
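For illustration, the SKey Layer value can be viewed as a tagged 64-bit pointer; the 48-bit/16-bit split below is an assumption for the sketch, as the paper only states that the pointer combines the PKey Page address with an in-page offset.

```cpp
#include <cstdint>

// Sketch of the SKey Layer value: a tagged pointer into the PKey Layer.
// The 48/16 split is an assumption.
struct PKeyPointer {
  uint64_t page_addr : 48;   // PM address of the PKey Page (256-byte aligned)
  uint64_t offset    : 16;   // byte offset of the latest PKey Group within the page
};
static_assert(sizeof(PKeyPointer) == 8, "fits in one SKey Layer value slot");
```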

In the PKey Layer, primary key entries (PKey Entries) are stored in PKey Pages. Each PKey Entry has an 8-byte metadata header and a primary key. The header consists of a 2-byte size, a 1-bit obsolete flag, and a 47-bit sequence number (SQN) of the primary key. The SQN is internally used for multi-version concurrency control (MVCC) in LSM-based KV stores [28, 30]. Each new record (including updates and deletes) in the primary table gets a monotonically increased SQN. Perseid leverages the SQN mechanism to guarantee data consistency among the primary table and secondary indexes, and also for validation which will be described in Section 4.3. PKey Pages are aligned to PM physical media access granularity (e.g., 256 bytes of Intel Optane DCPMM [68]). PS-Tree inserts PKey Entries into PKey Pages in a log-structured manner to reduce the write overhead and ease crash consistency on PM.
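A minimal sketch of the 8-byte PKey Entry header follows; the field widths are taken from the description above, while the exact bit ordering is an assumption.

```cpp
#include <cstdint>

// Minimal sketch of the 8-byte PKey Entry header; the primary key bytes
// follow the header in PM.
struct PKeyEntryHeader {
  uint64_t size     : 16;   // 2-byte entry size
  uint64_t obsolete : 1;    // 1-bit obsolete flag, set lazily by validation
  uint64_t sqn      : 47;   // 47-bit sequence number from the LSM primary table
};
static_assert(sizeof(PKeyEntryHeader) == 8, "header must stay 8 bytes");
```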

Nevertheless, traditional log-structured approaches scatter different values of the same secondary key in the log, resulting in poor data locality and degraded query performance. To improve data locality, PS-Tree stores PKey Entries of contiguous SKeys in the same PKey Page, similar to the leaf node in a B\(^+\)-Tree. Furthermore, during the PKey Page split, PS-Tree rearranges PKey Entries that belong to the same secondary keys to store continuously as a PKey Group. Each PKey Group has an 8-byte Group Header and one or multiple PKey Entries. The lower 48 bits of a group header are the address of the previous PKey Group of the same secondary key or null if the current group is the last one. Thus, all PKey Groups belonging to one secondary key are linked as a list. The remaining 16 bits store the number of total entries and the number of obsolete entries in the group.
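The Group Header can be sketched similarly; the 8/8 split of the upper 16 bits between the total and obsolete counts is an assumption, as the paper only specifies that these 16 bits hold the two counts.

```cpp
#include <cstdint>

// Sketch of the 8-byte Group Header; followed in PM by one or more PKey Entries.
struct GroupHeader {
  uint64_t prev_group    : 48;  // PM address of the previous PKey Group of this SKey (0 = none)
  uint64_t total_entries : 8;   // entries in this group (8/8 split is an assumption)
  uint64_t obsolete_cnt  : 8;   // obsolete entries in this group
};
static_assert(sizeof(GroupHeader) == 8, "header must stay 8 bytes");
```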

4.2.2 Basic Operations.

PS-Tree considers the features of both secondary indexing and PM. Compared with DRAM, PM has limited write bandwidth and suffers from write amplification. Therefore, PS-Tree adopts log-structured insertion and copy-on-write splits for efficient writes and lightweight crash consistency. To avoid the high latency of multiple random PM accesses when fetching multiple values during query operations, PS-Tree reorganizes the values of the secondary index and conducts lazy garbage collection during the PKey Page split.

Log-Structured Insert. Algorithm 1 describes the process of the insert operation in PS-Tree. First, PS-Tree searches for the SKey and its pointer in the SKey Layer. From the pointer, PS-Tree locates the previous PKey Group and the corresponding PKey Page (Lines 1–3). If the SKey is not found, then the PKey Page is located from the pointer of the previous SKey, which is just smaller than this new SKey (Line 5).

Second, PS-Tree appends a new PKey Group in that PKey Page (Lines 11–12). The new PKey Group contains one entry with the new PKey and other values if specified, and its header points to the previous PKey Group if one exists.

Third, the new pointer of the SKey (i.e., the address of the new PKey Group) is updated or inserted in the SKey Layer (Line 13). Thus, the insert request usually performs an update operation in the SKey Layer. PKey Entries of a secondary key are always linked in order of recency to facilitate query operations, which usually require the latest entries [13, 54].
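Since Algorithm 1 is not reproduced here, the following hedged sketch summarizes the insert path; SKeyLayer, PageOf, and AppendGroup are assumed interfaces standing in for the PM index and page management.

```cpp
#include <cstdint>
#include <string>

// Minimal sketch of the log-structured insert (Algorithm 1); names are illustrative.
struct PKeyPage;
struct SKeyLayer {
  void* Lookup(const std::string& skey);              // latest PKey Group, or nullptr
  void* LookupPredecessor(const std::string& skey);   // group of the largest SKey <= skey
  void Upsert(const std::string& skey, void* group);  // commit point
};
PKeyPage* PageOf(void* group);                         // page containing a group
void* AppendGroup(PKeyPage* page, const std::string& pkey,
                  uint64_t sqn, void* prev_group);     // may trigger a page split

void Insert(SKeyLayer& skeys, const std::string& skey,
            const std::string& pkey, uint64_t sqn) {
  void* prev = skeys.Lookup(skey);                     // existing group chain, if any
  PKeyPage* page = PageOf(prev ? prev : skeys.LookupPredecessor(skey));
  void* group = AppendGroup(page, pkey, sqn, prev);    // log-structured append
  skeys.Upsert(skey, group);                           // publish the new head group
}
```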

Search. Algorithm 2 describes the process of the search operation in PS-Tree, which starts with searching for the secondary key and its pointer in the SKey Layer (Line 2). Then, from the latest PKey Group indicated by the pointer, primary keys and other user-specified values can be retrieved in the order of recency. Perseid adopts the Validation strategy (Section 2.2) for its high ingestion performance. Therefore, all primary keys are first validated before returning (Line 7). The validation process will identify and mark obsolete entries as deleted by setting their obsolete flags. The LSM-based primary table supports MVCC by attaching one snapshot and using reference counters to protect components from being deleted [28, 30]. In PS-Tree, we adopt an epoch-based approach: readers publish their snapshot numbers during query operations, and obsolete entries whose sequence number is larger than any reader’s snapshot number are guaranteed not to be removed physically.
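The query path (Algorithm 2) can be sketched as follows; PKeyGroup, LookupSKey, and Validate are illustrative stand-ins for the on-PM structures and the validation module of Section 4.3.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal sketch of an index-only query (Algorithm 2); names are illustrative.
struct PKeyEntry {
  std::string pkey;
  uint64_t sqn;
  bool obsolete;
};
struct PKeyGroup {                      // decoded view of an on-PM PKey Group
  std::vector<PKeyEntry*> entries;      // newest first
  PKeyGroup* prev;                      // previous group of the same SKey
};
PKeyGroup* LookupSKey(const std::string& skey);         // SKey Layer search
bool Validate(PKeyEntry* e, uint64_t reader_snapshot);  // Section 4.3

std::vector<std::string> Query(const std::string& skey, uint64_t snapshot,
                               std::size_t limit) {
  std::vector<std::string> result;
  for (PKeyGroup* g = LookupSKey(skey); g && result.size() < limit; g = g->prev) {
    for (PKeyEntry* e : g->entries) {
      if (e->sqn > snapshot || e->obsolete) continue;   // invisible or known stale
      if (!Validate(e, snapshot)) { e->obsolete = true; continue; }  // lazy marking
      result.push_back(e->pkey);
      if (result.size() == limit) break;
    }
  }
  return result;
}
```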

Update and Delete. PS-Tree has no update or delete operations (from the point of view of secondary indexes rather than the data structure). Since updating the primary key of a record in the primary table is commonly not supported in database systems, there is no requirement to update values (i.e., primary keys) in secondary indexes. With the Validation strategy, PS-Tree does not delete the obsolete entries synchronously with the primary table. PS-Tree leaves obsolete entry cleaning to garbage collection.

Locality-aware PKey Page Split with Garbage Collection. When a PKey Page does not have enough space for a new entry, it splits into two new PKey Pages in a copy-on-write manner. Algorithm 3 shows the process of the PKey Page split operation. Since insertions are performed in a log-structured manner, the PKey Entries associated with one SKey may be scattered discontinuously. Querying these entries may need multiple random accesses on PM. As PM has non-negligible read latencies compared with DRAM (e.g., about 300 ns with Intel Optane DCPMM [68]), query operations can have high overheads. Therefore, as shown in Figure 4, to improve locality, PS-Tree reorganizes PKey Entries when the PKey Page splits. Specifically, PS-Tree iterates over all secondary keys associated with the PKey Page (Line 3). For each secondary key, PS-Tree collects the scattered PKey Entries and puts them together into a new PKey Page (Lines 4–24). PS-Tree rearranges PKey Entries belonging to the same SKey into one PKey Group, so these entries are stored contiguously, and the storage overhead of the Group Header is reduced since multiple PKey Entries share one Group Header.

Fig. 4.

Fig. 4. An example of PKey Page split. PEs (PKey Entries) with the same color belong to the same secondary key; PE in gray are obsolete.

Besides, entries not marked as deleted in the current PKey Page are validated by a lightweight approach (described in Section 4.3), and obsolete entries are physically removed during reorganization to reduce space overhead (Lines 9–11 in Algorithm 3).

A skewed secondary key may have many primary keys that occupy more than one PKey Page. For those PKey Entries not in the current PKey Page, PS-Tree lazily garbage collects them when the number of obsolete entries exceeds half of the number of total entries in that PKey Group (Lines 15–17). The size of collected PKey Entries (i.e., new PKey Group) of a skewed secondary key may exceed a PKey Page. To keep page management simple, instead of using variable-sized PKey Pages, PS-Tree allocates multiple PKey Pages on demand in the append operation of PKey Page (Line 24).

To support MVCC, PS-Tree retains obsolete entries whose sequence number is larger than the minimum snapshot number of concurrent readers. Obsolete entries may be retained for a long time if there are long-running queries. Perseid can be enhanced with techniques similar to recent work [36, 41] to handle long-lived snapshots. After rearranging valid entries into the new PKey Pages, the pointers of the related SKeys are updated (Line 25) and the old PKey Page is freed (Line 26).
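The split procedure (Algorithm 3) is sketched below in simplified form; it omits concurrency, multi-page groups, and the lazy cross-page garbage-collection threshold, and all helper names are assumptions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified sketch of the copy-on-write PKey Page split with lazy GC (Algorithm 3).
struct PKeyEntry { std::string pkey; uint64_t sqn; bool obsolete; };
struct PKeyPage;
PKeyPage* AllocPage();
void FreePage(PKeyPage*);                                     // deferred until no reader needs it
std::vector<std::string> SKeysIn(PKeyPage*);                  // SKeys covered by the page
std::vector<PKeyEntry> EntriesOf(PKeyPage*, const std::string& skey);  // newest first
void* AppendGroup(PKeyPage*, const std::vector<PKeyEntry>&);  // contiguous new PKey Group
PKeyPage* TargetHalf(const std::string& skey, PKeyPage* left, PKeyPage* right);
void UpdateSKeyPointer(const std::string& skey, void* group); // commit in the SKey Layer
bool Validate(const PKeyEntry&);                              // Section 4.3
uint64_t MinReaderSnapshot();                                 // MVCC retention bound

void SplitPage(PKeyPage* old_page) {
  PKeyPage* left = AllocPage();
  PKeyPage* right = AllocPage();
  for (const std::string& skey : SKeysIn(old_page)) {
    std::vector<PKeyEntry> kept;
    for (const PKeyEntry& e : EntriesOf(old_page, skey)) {
      bool stale = e.obsolete || !Validate(e);
      if (stale && e.sqn <= MinReaderSnapshot()) continue;    // safe to drop physically
      kept.push_back(e);                                      // rearranged contiguously
    }
    void* group = AppendGroup(TargetHalf(skey, left, right), kept);
    UpdateSKeyPointer(skey, group);
  }
  FreePage(old_page);
}
```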

Crash Consistency. Perseid relies on the existing write-ahead log (WAL) of the LSM-based primary table to guarantee atomic durability between the primary table and secondary indexes. During recovery with the WAL, Perseid redoes uncompleted operations on the PS-Tree.

PS-Tree also handles its own crash consistency issues. Insert operations are committed only when the pointers in the SKey Layer are updated. If the system crashes before updating pointers but after allocating a new PKey Page, then the PKey Page is unreachable. After restart, a background thread scans the allocated pages and PS-Tree to find and reclaim unreachable pages. Besides, PS-Tree allows concurrent insertion into one PKey Page. A thread obtains a piece of space to write new entries by compare-and-swapping (CAS) the tail pointer of the PKey Page. Thus, the space may leak if a thread obtains it but does not update the pointer in the SKey Layer before the system crashes. PS-Tree tolerates this situation and leaves these leaks to page splitting, which naturally reclaims the leaked space.
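A sketch of the CAS-based space reservation follows; the layout and memory orders are illustrative assumptions.

```cpp
#include <atomic>
#include <cstdint>

// Sketch of concurrent space reservation inside a PKey Page: a writer claims a
// region by CAS-advancing the tail offset, then writes its PKey Group and
// finally publishes it via the SKey Layer. A crash in between merely leaks the
// reserved bytes, which the next page split reclaims.
struct PKeyPageHeader {
  std::atomic<uint32_t> tail{0};   // next free byte offset within the page payload
  uint32_t capacity = 0;           // payload bytes available in this page

  // Returns the reserved offset, or UINT32_MAX if the page must be split.
  uint32_t Reserve(uint32_t bytes) {
    uint32_t cur = tail.load(std::memory_order_relaxed);
    while (cur + bytes <= capacity) {
      if (tail.compare_exchange_weak(cur, cur + bytes, std::memory_order_acq_rel))
        return cur;                // success: caller writes entries at this offset
    }
    return UINT32_MAX;             // full: trigger a copy-on-write split
  }
};
```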

4.3 Hybrid PM-DRAM Validation

Perseid adopts the Validation strategy (see Section 2.2) for high write performance, which necessitates a lightweight validation approach. Since update-intensive workloads are quite common nowadays [13, 16], a heavy validation approach would spend huge overhead validating a large number of obsolete entries while yielding no results.

4.3.1 Structure.

Perseid introduces a lightweight validation approach based on the requirements of validation. Perseid adopts a hash table on PM that stores version information for primary keys. The hash table is indexed by the primary key and stores its latest sequence number (Section 4.2.1). Nevertheless, even though point lookups on a PM-based hash table are much faster than on a tree, the validation time is comparable to the query time of PS-Tree. This is because one secondary key has multiple primary keys to validate, and PM has non-negligible random access latency. Simply placing the hash table on DRAM would occupy a large memory footprint. However, as validation only needs to determine whether a version of a primary key is valid, rather than obtain the specific latest version number, Perseid builds another volatile hash table on DRAM that only stores versions for primary keys that have been updated or deleted. In this way, Perseid only needs to query the small volatile hash table, and thus the validation overhead is further reduced.

Figure 5 illustrates the hybrid PM-DRAM validation approach. The values in the hash tables consist of the sequence number of the record (6 bytes) and a 2-byte counter. The counter is used to determine whether a primary key has obsolete versions. There is a slight difference between the counters of the two hash tables. In the volatile hash table, each counter indicates the number of logically existing entries related to a primary key in the secondary index. By contrast, each counter in the persistent hash table indicates the number of physically existing entries in the secondary index.

Fig. 5.

Fig. 5. Hybrid PM-DRAM hash-based validation.
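The 8-byte hash-table value can be sketched as follows; the bit packing is an assumption consistent with the 6-byte sequence number and 2-byte counter described above.

```cpp
#include <cstdint>

// Sketch of the value stored in both validation hash tables. The counter's
// meaning differs: logically existing entries (volatile table) vs. physically
// existing entries (persistent table).
struct VersionEntry {
  uint64_t sqn     : 48;   // latest sequence number of the primary key (6 bytes)
  uint64_t counter : 16;   // number of secondary-index entries for this primary key
};
static_assert(sizeof(VersionEntry) == 8, "value must stay 8 bytes");
```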

4.3.2 Basic Operations.

Next, we describe the validation approach in detail according to operations.

Upsert. The process of the upsert operation on the validation hash tables is shown in Algorithm 4. When a new record (including updates and deletes) is inserted into the primary table, the primary key is inserted or updated with its sequence number in the persistent hash table. If the persistent hash table did not contain this primary key before, its counter is set to one (Line 9), which means this primary key has only one version and no obsolete entries of this primary key exist in the secondary index. For example, in Figure 5, at t2, key c is inserted for the first time, and it is inserted into the persistent hash table. Otherwise, the primary key’s counter in the persistent hash table is increased by one (Line 4); besides, the primary key is inserted or updated with its sequence number in the volatile hash table, and the counter in the volatile hash table is set to two if it is an insertion or increased by one if it is an update (Lines 5–7). For example, when key c is updated with a new version v2 at t3 in Figure 5, the entry in the persistent hash table is updated, and a new entry is inserted into the volatile hash table.
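A minimal sketch of the upsert path (Algorithm 4) follows, with std::unordered_map standing in for the persistent (CCEH) and volatile (CLHT) hash tables; persistence and concurrency control are omitted.

```cpp
#include <cstdint>
#include <unordered_map>

// Minimal sketch of Algorithm 4; ordinary maps stand in for PM-HT and DRAM-HT.
struct VersionEntry { uint64_t sqn; uint16_t counter; };
using HashTable = std::unordered_map<uint64_t, VersionEntry>;

void UpsertVersion(HashTable& pm_ht, HashTable& dram_ht, uint64_t pkey, uint64_t sqn) {
  auto it = pm_ht.find(pkey);
  if (it == pm_ht.end()) {                       // first version of this primary key
    pm_ht[pkey] = {sqn, 1};                      // one physical entry, nothing obsolete
    return;
  }
  it->second.sqn = sqn;                          // record the latest version
  it->second.counter += 1;                       // one more physical index entry
  auto vit = dram_ht.find(pkey);
  if (vit == dram_ht.end()) dram_ht[pkey] = {sqn, 2};   // old entry + new entry
  else { vit->second.sqn = sqn; vit->second.counter += 1; }
}
```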

Validate. The secondary index validates an entry by querying the volatile hash table, as shown in Algorithm 5. Specifically, the entry is valid if the sequence number of this entry matches the latest sequence number stored in the hash table, or if the hash table does not contain the primary key, which means there are no obsolete entries of this primary key (Line 2 in Algorithm 5). Otherwise, the entry may be obsolete. If the version of the hash table entry is smaller than the global minimum read snapshot number, which means all readers can see the newer version, Perseid further marks the entry as obsolete and decreases the counter of the entry in the volatile hash table by one (Lines 9–13). For example, when key a is checked with an obsolete version v1 at t2 in Figure 5, the result is false, and then the counter is decreased from 3 to 2. If the counter is decreased to 1, which means all obsolete entries have been marked, the entry is removed from the volatile hash table to restrict the hash table size (Lines 14–16). For example, when key a is checked with an obsolete version v2 at t3 in Figure 5, the counter is decreased to one, the validation returns false, and the entry is removed. We describe other corner cases regarding snapshots in Section 4.3.3.
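The validation path (Algorithm 5) can be sketched in the same style; CheckPrimaryTable and GlobalMinSnapshot are assumed helpers for the rare fallback (Section 4.3.3) and the MVCC bound, and VersionEntry/HashTable repeat the definitions from the previous sketch.

```cpp
#include <cstdint>
#include <unordered_map>

struct VersionEntry { uint64_t sqn; uint16_t counter; };
using HashTable = std::unordered_map<uint64_t, VersionEntry>;
bool CheckPrimaryTable(uint64_t pkey, uint64_t entry_sqn, uint64_t snapshot);
uint64_t GlobalMinSnapshot();                    // minimum snapshot of active readers

// Returns true if the PKey Entry (pkey, entry_sqn) is still valid; *mark_obsolete
// is set when the entry can also be lazily marked obsolete in PS-Tree.
bool Validate(HashTable& dram_ht, uint64_t pkey, uint64_t entry_sqn,
              uint64_t reader_snapshot, bool* mark_obsolete) {
  *mark_obsolete = false;
  auto it = dram_ht.find(pkey);
  if (it == dram_ht.end() || it->second.sqn == entry_sqn) return true;  // no newer version
  if (it->second.sqn > reader_snapshot)                                 // concurrent writer
    return CheckPrimaryTable(pkey, entry_sqn, reader_snapshot);         // rare fallback
  if (it->second.sqn < GlobalMinSnapshot()) {     // newer version visible to all readers
    *mark_obsolete = true;                        // caller sets the obsolete flag in PS-Tree
    if (--it->second.counter == 1) dram_ht.erase(it);   // all obsolete entries marked
  }
  return false;                                   // a newer version exists in this snapshot
}
```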

During validation for secondary index queries, Perseid only operates with the volatile validation hash table. Thus, the validation overhead is quite small.

Garbage Collection. During the PKey Page split, entries that are not marked as obsolete are also validated to remove obsolete entries (Section 4.2.2). Since this step physically removes obsolete entries, Perseid decreases the corresponding counters in the persistent hash table. If a counter is decreased to one, Perseid removes the corresponding hash pairs from the volatile hash table.

Recovery. When the system restarts from a crash or a normal shutdown, the volatile hash table needs to be recovered. Perseid iterates over the whole persistent hash table and inserts primary keys whose counter is greater than one into the volatile hash table. The counters in the volatile hash table then reflect the numbers of physically existing entries, which may be larger than the actual numbers of logically existing entries. Therefore, some false-positive primary keys may exist in the volatile hash table. However, this does not affect the validation accuracy, and these primary keys can be removed by garbage collection.
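The recovery step is small enough to sketch directly; again, std::unordered_map stands in for both hash tables.

```cpp
#include <cstdint>
#include <unordered_map>

// Sketch of rebuilding DRAM-HT from PM-HT after a restart: keep only primary
// keys with more than one physically existing index entry. The rebuilt counters
// may overestimate logical entries; garbage collection later removes false positives.
struct VersionEntry { uint64_t sqn; uint16_t counter; };
using HashTable = std::unordered_map<uint64_t, VersionEntry>;

void RecoverVolatileTable(const HashTable& pm_ht, HashTable& dram_ht) {
  dram_ht.clear();
  for (const auto& [pkey, entry] : pm_ht) {
    if (entry.counter > 1) dram_ht[pkey] = entry;   // possible false positives tolerated
  }
}
```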

4.3.3 Together with PS-Tree.

Each write operation in the LSM-based storage system starts by obtaining a monotonically increasing sequence number (SQN). After writing the write-ahead log (WAL) and inserting new records into the MemTable of the LSM primary table, Perseid inserts the PKey Entry into the PS-Tree and inserts or updates the version of the primary key (i.e., the SQN) in the validation hash tables. Thus, the record in the LSM primary table, the PKey Entry in PS-Tree, and the version in the validation hash tables are all tagged with the same SQN. After that, the write operation is committed.

Each query operation first gets the latest committed snapshot number. Then it searches the PS-Tree with the secondary key and gets corresponding PKey Entries visible in the current snapshot (i.e., SQN is not larger than the snapshot number). Perseid validates candidate primary key entries via the volatile hash table, and then returns valid entries.

A rare scenario is that the volatile hash table reports a new sequence number larger than the current reader’s snapshot number, which means a concurrent writer has updated this primary key. In this case, Perseid cannot directly confirm whether this entry is still valid in this snapshot, since there may exist a version that is newer than the entry yet still valid in the snapshot, so Perseid has to validate it against the primary table (Lines 5–7 in Algorithm 5). The opposite scenario, in which the volatile hash table reports an older version than the requested PKey Entry, cannot happen. Perseid commits a write operation only after it has inserted the new secondary entry into PS-Tree and updated the validation hash tables, so readers only see consistent snapshots and ignore entries in PS-Tree whose version is larger than their snapshot number.

4.4 Non-Index-Only Query Optimizations

Though Perseid significantly reduces the overhead of secondary indexing, the overhead of non-index-only queries (which require full records) is still dominated by the LSM-based primary table. Thus, Perseid further introduces two optimizations for non-index-only queries.

4.4.1 Locating Components with Sequence Number.

A secondary index query operation may need to search the primary LSM table multiple times for all its associated records. LSM-trees have mediocre read performance due to the multi-level structure. Besides device I/Os, if data is cached in memory or using fast storage devices, LSM-trees have non-negligible overheads on probing components (i.e., indexing and checking Bloom filters) [21, 25, 71]. Since LSM-based KV stores usually employ Bloom filters for each data block [28, 30], the indexing overhead includes indexing not only SSTables but also data blocks. Moreover, the read performance gets worse with the tiering compaction strategy since more components (SSTables) need to be checked and read.

Nevertheless, we find that many components are unnecessary to probe in searching processes issued from the secondary index. Previous work uses zone maps, which store the minimum and maximum values of an attribute, to skip irrelevant data blocks or components during searching [11, 12, 54]. We found that this technique can also be used by secondary indexes to search the primary table. Since we have already recorded the sequence numbers of primary keys in the secondary index, the sequence number can be used as an additional attribute to skip irrelevant components. Perseid builds a zone map that records a sequence number range (i.e., the minimum and maximum sequence numbers of records) for each component (including MemTables).

Moreover, as shown in Figure 6, since tiering compaction merges SSTables from the lower level (\(L_n\)) to generate new SSTables in the higher level (\(L_{n+1}\)) and does not rewrite other SSTables in the higher level (except for the last level), the sequence number ranges of different levels, and even of different sorted runs, are strictly divided for a range partition. For primary tables adopting the tiering strategy, with the primary key to search SSTables horizontally and the additional sequence number to search sorted runs vertically, Perseid can directly locate the exact component that contains the record. Besides, since Perseid has already validated the version, the record must exist in that component, so Perseid can further skip the Bloom filter check. Thus, the indexing overheads are greatly reduced and the overheads of checking Bloom filters are almost eliminated.

Fig. 6.

Fig. 6. Sequence number range of components after tiering compaction. The black text in components indicates the key range. The blue text below each component indicates the sequence number range.
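The component-pruning logic can be sketched as follows; Component is an assumed abstraction over MemTables and sorted runs, ordered from newest to oldest.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Sketch of primary-table lookups issued from the secondary index with a
// sequence-number zone map; names are illustrative.
struct Component {
  uint64_t min_sqn, max_sqn;                    // zone map over sequence numbers
  uint64_t min_key, max_key;                    // primary-key range of the component
  bool Get(uint64_t pkey, std::string* record) const;  // direct read, no Bloom filter probe
};

bool GetWithZoneMap(const std::vector<Component>& components,
                    uint64_t pkey, uint64_t sqn, std::string* record) {
  for (const Component& c : components) {
    if (sqn < c.min_sqn || sqn > c.max_sqn) continue;    // prune by sequence number
    if (pkey < c.min_key || pkey > c.max_key) continue;  // prune by key range
    // Under tiering, the SQN ranges of sorted runs are disjoint, so the record
    // must be in this component and the Bloom filter check can be skipped.
    return c.Get(pkey, record);
  }
  return false;   // unreachable if the entry passed validation beforehand
}
```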

This optimization is less effective with the leveling strategy. The sequence number ranges in different levels may overlap because compaction rewrites SSTables in higher levels with blended sequence numbers from lower levels. However, since most LSM-based KV stores adopt the tiering strategy at least on \(L_0\) [28, 30], this optimization is still effective to some extent.

4.4.2 Parallel Primary Table Searching.

A single secondary key usually has multiple associated primary keys, and queries on these primary keys are independent. Therefore, using multiple threads to accelerate primary table searching is a natural optimization. One naive approach is to assign primary keys to threads equally (e.g., in a round-robin fashion, as shown in Figure 7(a)). However, point lookups on LSM-trees may have widely varying latencies, since some KV pairs can be fetched directly from the MemTable or block cache while others reside at a relatively high level and need several disk I/Os due to Bloom filter false positives. It cannot be known in advance how much time each point lookup will take. Therefore, the naive approach may result in load imbalance among parallel threads, where some threads have finished their tasks and become idle while others are still busy and unfinished tasks remain.

Fig. 7.

Fig. 7. Parallel primary index searching. The example shows 3 workers (threads) processing 6 tasks.

To alleviate this issue, we apply a worker-active scheme, as shown in Figure 7(b). Perseid publishes primary keys as tasks into a lock-free shared queue, and each parallel worker thread fetches one task from the queue. An element in the shared queue is a required primary key and its corresponding sequence number. When a worker thread finishes its current task, it tries to fetch another task from the queue. In this way, though each thread may perform a different number of tasks, parallel threads are utilized more fully and the latencies of query requests are further reduced.
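A sketch of the worker-active scheme follows; a simple atomic cursor over a task vector stands in for the lock-free shared queue, and LookupPrimaryTable is an assumed helper.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

// Sketch of worker-active parallel primary-table searching: idle workers pull
// the next (primary key, sequence number) task instead of receiving a fixed
// round-robin share.
struct Task { uint64_t pkey; uint64_t sqn; };
std::string LookupPrimaryTable(uint64_t pkey, uint64_t sqn);  // assumed helper

void ParallelSearch(const std::vector<Task>& tasks, int num_workers,
                    std::vector<std::string>& records) {  // records pre-sized to tasks.size()
  std::atomic<std::size_t> next{0};
  std::vector<std::thread> workers;
  for (int w = 0; w < num_workers; ++w) {
    workers.emplace_back([&] {
      for (std::size_t i = next.fetch_add(1); i < tasks.size(); i = next.fetch_add(1))
        records[i] = LookupPrimaryTable(tasks[i].pkey, tasks[i].sqn);
    });
  }
  for (std::thread& t : workers) t.join();
}
```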


5 EVALUATION

In this section, we evaluate Perseid against existing PM-based indexes with naive approaches and state-of-the-art LSM-based secondary indexing techniques [47, 54]. After describing the experimental setup (Section 5.1), we evaluate these secondary indexing mechanisms with micro benchmarks to show their performance on basic operations (Section 5.2). Then, we evaluate these systems’ overall performance with mixed workloads (Section 5.3) and recovery time (Section 5.4).

5.1 Experimental Setup

Platform. Our experiments are conducted on a server with an 18-core Intel Xeon Gold 5220 CPU, which runs Ubuntu 20.04 LTS with Linux 5.4. The system is equipped with 64 GB DRAM, two 128 GB Intel Optane DC Persistent Memory in AppDirect mode, and a 480 GB Intel Optane 905P SSD.

Implementation. Perseid can leverage any existing state-of-the-art PM-based index as the SKey Layer of PS-Tree. In our implementation, we build PS-Tree based on two typical PM-based indexes, FAST&FAIR [31] and P-Masstree [39, 50]. FAST&FAIR is a B\(^+\)-Tree that leverages total store ordering (TSO) in the x86 architecture to tolerate transient inconsistency caused by incomplete write transactions, thus avoiding expensive copy-on-write or logging. P-Masstree is a version of Masstree [50], a trie-like concatenation of B\(^+\)-Trees, converted for PM [39]. The indexes use their original memory allocators, allocating space from memory-mapped files on PM.

For the hybrid PM-DRAM validation hash table, matching the different usages of the two hash tables, we deploy CLHT [23] as the volatile hash table and CCEH [52] as the persistent hash table. CLHT is a cache-friendly hash table providing high search performance. CCEH is an extendible hash table optimized for PM that achieves high insert performance by mitigating rehashing overhead.

Compared Systems. We compare Perseid against the two original PM-based indexes (FAST&FAIR and P-Masstree) and the LSM-based secondary indexing approaches with the validation strategy (denoted as LSMSI) from LevelDB++ [54]. The compared PM-based indexes are implemented as secondary indexes via the composite index approach and the log-structured approach (denoted as FAST&FAIR-composite, FAST&FAIR-log, P-Masstree-composite, and P-Masstree-log, respectively). For the log-structured approach, we simply provide enough space for allocation and disable garbage collection to avoid its influence and present the ideal performance [60]. We enhance the other PM-based indexes with Perseid’s hybrid PM-DRAM validation approach (Section 4.3) and LSMSI with the primary key index [47] (Section 2.2) for validation. For a fair comparison, we also implement the LSM-based secondary indexing approaches on PM (LSMSI-PM). In addition, an LSM-based secondary index with the synchronous strategy (Section 2.2, denoted as LSMSI-PM-sync) is evaluated for comparison. We use PebblesDB [55], a state-of-the-art tiering-based KV store, as the primary table.

Workloads. Since common benchmarks for KV stores such as YCSB [19] do not have operations on secondary indexes, as in previous work [42, 47, 54], we implemented a secondary index workload generator based on an open-source twitter-like workload generator [3] for evaluation. With this generator, we generate several microbenchmark workloads and mixed workloads. The primary key (e.g., ID) and secondary key (e.g., UserID) are randomly generated 64-bit integers. The key space of primary keys and secondary keys is 100 million and 4 million, respectively. Thus the average number of records per secondary key is about 25. The size of each record is 1KB.

KV Store Configurations. For the primary table, according to configuration tuning guide [29], MemTable size is set to 64 MB and the Bloom filters are set to 10 bits per key. As our workloads will generate a primary table larger than 100 GB, we set a 16-GB block cache for the primary table and a 1-GB block cache for the LSM-based secondary index. Compression is turned off to reduce other influencing factors.

5.2 Microbenchmarks

In this section, we evaluate the basic single-threaded performance and scalability of compared secondary indexing mechanisms.

5.2.1 Insert and Update.

The Insert workload (i.e., no updates) has 100 million unique records. Figure 8(a) shows the average latency of insert operations of each secondary index.

Fig. 8.

Fig. 8. Insert and update performance.

Perseid performs about 10–38% faster than the corresponding composite indexes, but 25% slower than the ideal log-structured approach without garbage collection due to the page split overhead in PS-Tree. The composite index approach results in inferior performance as we analyzed in Section 3. Other approaches have higher performance due to the sequential-write pattern.

The upsert workloads contain 100 million insert operations and 100 million update operations. Operations are shuffled to avoid all newer entries being valid in secondary indexes.

In the Uniform workload (Figure 8(b)), both primary keys and secondary keys follow a uniform distribution. In the Skewed-Pri workload (Figure 8(c)), primary keys follow a Zipfian distribution with the skewness parameter 0.99, and secondary keys are selected randomly. In the Skewed-Sec workload (Figure 8(d)), secondary keys follow a Zipfian distribution (parameter 0.99), and primary keys are uniform. Thus, hot secondary keys have lots of associated primary keys, which represent low-cardinality columns.

LSMSI-PM-sync has the largest upsert overhead due to its synchronous strategy, which needs to fetch old records from the LSM primary table and delete old secondary index entries (by inserting tombstones) synchronously. The skewness of primary keys has a large impact on the synchronous strategy. Hence LSMSI-PM-sync has about 50% higher upsert latencies in the Uniform and Skewed-Sec workloads than in other workloads.

Among other validation-based secondary indexes, composite indexes perform even worse in upsert workloads than other secondary indexes. This is because, with additional upsert operations, composite indexes have more KV pairs and larger tree heights. By contrast, PS-Tree and the log-structured approach do not increase the number of KV pairs in the index part.

Figure 9 shows the normalized memory usage of the persistent hash table (PM-HT) and the volatile hash table (DRAM-HT) of Perseid after each upsert workload. For a fair comparison, we evaluate the memory usage of PM-HT with the same hashing structure (CLHT) as DRAM-HT. The PM-HT stores all 100 million primary keys with their latest sequence numbers, so it contains about 46 million hashing buckets including linked collision buckets, which occupy about 2.7 GiB of memory. By contrast, since the DRAM-HT only stores versions for primary keys that have been updated (Section 4.3), it has a smaller memory footprint than the PM-HT. Specifically, the DRAM-HT is empty after the Insert workload because there are no updates. Besides, Perseid reduces the memory usage of the DRAM-HT to 37.8%, 10.4%, and 77.3% of the whole PM-HT in Uniform, Skewed-Pri, and Skewed-Sec, respectively. Though most primary keys have been updated in the Uniform and Skewed-Sec workloads, PS-Tree conducts garbage collection and validation during PKey Page splitting, so primary keys whose obsolete versions have been cleaned up are removed from the DRAM-HT.

Fig. 9.

Fig. 9. Normalized memory usage of validation hash table.

5.2.2 Query.

In this experiment, we evaluate the performance of index-only queries after loading the insert workload or the upsert workloads. An index-only query reflects the performance of the secondary index itself and is a common query technique (i.e., covering index [6, 8]) to avoid looking up the primary table. We show two different selectivities by specifying a limit N (10 and 200) on returned results. The most recent and valid N entries are returned. For a limit of 200, the actual average number of returned entries per query is 25 and 142 for Skewed-Pri and Skewed-Sec, respectively.

Figure 10 shows the results of index-only query performance. From the results, we have the following observations.

Fig. 10.

Fig. 10. Index-only query performance.

First, PM-based indexes have significantly lower latencies than LSM-based secondary indexes. Putting LSMSI on PM (LSMSI-PM) brings very limited improvement, because LSMSI already benefits from the block cache and OS page cache. Even so, LSMSI is still inefficient due to the high overhead of indexing and Bloom filter checking. Besides, LSMSI has a high overhead for validating against the primary key index. LSMSI-PM-sync has much higher query performance than LSMSI-PM, as it does not require validation. However, this comes at the cost of poor write performance (Section 5.2.1). From the gap between LSMSI-PM and LSMSI-PM-sync, we can see that validating against the primary key index has a huge overhead. This is because validating each primary key requires one heavy point search on the LSM-tree. Despite being exempted from validation, LSMSI-PM-sync still has higher query latencies than the PM-based indexes that conduct validation with Perseid’s approach.

Second, Perseid outperforms existing PM-based indexes with the composite index and the log-structured approach by up to 4.5\(\times\) and 4.3\(\times\), respectively. The log-structured approach has poor locality since relevant values are scattered across the whole log and require multiple random accesses to fetch them all. Composite indexes are inferior due to the larger number of KV pairs in the indexes and range-scan operations as we analyzed in Section 3. They are especially inefficient under the Skewed-Sec workload with a large limit (e.g., 200), where they fetch a large number of entries and fail to enjoy the cache effect. By contrast, the performance of Perseid is much more stable across different workloads, owing to the locality-aware design of PS-Tree. For a limit of 10, PM-based secondary indexes benefit from higher cache hit ratios under the Skewed-Sec workload, thus achieving better performance than other upsert workloads. Composite indexes also occupy about 4\(\times\) more PM space than PS-Tree, which is because they repeatedly store secondary keys and have more index nodes. In addition, P-Masstree-composite has higher latencies than FAST&FAIR-composite, because trie-based indexes are less efficient than B\(^+\)-Trees in range search since their leaf nodes do not have sibling pointers pointing to neighbor nodes.

Third, under upsert workloads, all systems need to validate more primary keys to exclude obsolete entries, which contributes to the higher overheads than under insert workloads. For LSMSI, since the primary key index needs multiple heavy point lookups on LSM-trees, validating the primary key index accounts for the lion’s share of the total cost of an index-only query. LSMSI has lower latencies under the Skewed-Pri workload than other upsert workloads since the primary key index benefits from the data locality on primary keys. By contrast, Perseid (and other PM-based indexes) validates on a volatile hash table, which takes up less than half of the total cost. The overhead on Perseid increased little owing to the locality-aware design of PS-Tree and the lightweight validation approach.

Figure 11 demonstrates the necessity and the benefit of the volatile hash table of Perseid. Directly validating multiple primary keys on persistent hash table (PM-HT) has a large overhead, since it requires multiple random accesses on PM. This prominent overhead can overshadow the advantage of PS-Tree. Thus, Perseid validates on the volatile hash table (DRAM-HT), which is 2.7-6.6\(\times\) faster than validating on PM-HT. As shown in Figure 9, DRAM-HT is much smaller under other workloads than the Skewed-Sec workload, and the improvement brought by DRAM-HT is more evident under other workloads. This result proves the benefit of DRAM-HT is not only due to the lower access latency of DRAM, but also because of the smaller size of DRAM-HT which is more cache-friendly and brings lower hash collision.

Fig. 11. Average validation latency per query.

5.2.3 Range Query.

In the following experiments, we show the results of the LSM-based secondary index on PM (LSMSI-PM) and of the PM-based secondary indexes built on P-Masstree as representatives. We evaluate the range queries of these secondary indexes. Each range query searches 20 secondary keys and retrieves the 5 latest associated primary keys of each secondary key.

The results are shown in Figure 12. Range queries search KV pairs of multiple secondary keys, so the difference between these secondary indexes is more pronounced than for low-limit query operations. Perseid outperforms LSMSI-PM, the composite P-Masstree, and the log-structured approach by up to 92\(\times\), 5.2\(\times\), and 1.6\(\times\), respectively, under the Skewed-Sec workload. Though LSMSI-PM benefits from PM’s low access latency and DRAM caching, it still has a fairly high latency, because a range operation in an LSM-tree needs to merge-sort iterators over multiple components. The composite index performs more range searching inside the index than Perseid, since Perseid groups primary keys outside of the index.
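The cost of that merge step can be seen in a generic, simplified sketch of a multi-component range scan. This is an illustration under our own assumptions, not code from Perseid or the evaluated systems: components are modeled as sorted in-memory vectors, and version resolution and tombstone handling are omitted.

#include <algorithm>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Each LSM component (MemTable or SSTable) exposes its entries in key order;
// here a component is simply a sorted vector of (key, value) pairs.
using Entry = std::pair<std::string, std::string>;
using Component = std::vector<Entry>;

// K-way merge across components: the root of the min-heap is always the
// globally smallest key, so every produced entry costs O(log k) heap work.
// A real LSM scan must additionally resolve duplicate keys by version and
// skip deletion markers, which is omitted here.
std::vector<Entry> RangeScan(const std::vector<Component>& components,
                             const std::string& start_key, size_t limit) {
  using Cursor = std::pair<size_t, size_t>;  // (component index, position)
  auto greater = [&](const Cursor& a, const Cursor& b) {
    return components[a.first][a.second].first >
           components[b.first][b.second].first;
  };
  std::priority_queue<Cursor, std::vector<Cursor>, decltype(greater)> heap(greater);

  // Seek each component iterator to the first key >= start_key.
  for (size_t i = 0; i < components.size(); ++i) {
    auto it = std::lower_bound(
        components[i].begin(), components[i].end(), start_key,
        [](const Entry& e, const std::string& k) { return e.first < k; });
    if (it != components[i].end())
      heap.emplace(i, static_cast<size_t>(it - components[i].begin()));
  }

  std::vector<Entry> out;
  while (!heap.empty() && out.size() < limit) {
    auto [ci, pos] = heap.top();
    heap.pop();
    out.push_back(components[ci][pos]);
    if (pos + 1 < components[ci].size()) heap.emplace(ci, pos + 1);
  }
  return out;
}

The number of live iterators grows with the number of components, which is why LSMSI-PM pays a high per-key cost even when all data is cached.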

Fig. 12. Index-only range query performance.

5.2.4 Multi-Threaded Performance.

Figure 13 shows the multi-threaded performance of the compared secondary indexes. We take the Skewed-Pri and Skewed-Sec workloads as representatives. For Skewed-Sec, we show the result with a limit of 200; the result with a limit of 10 is similar to that of Skewed-Pri. For upsert operations, Perseid scales up to 24 threads, achieving 2.8\(\times\) and 16\(\times\) the upsert throughput of the composite P-Masstree and LSMSI-PM, respectively, and is only slightly slower than the ideal log-structured approach. For query operations, Perseid scales well and achieves 7\(\times\) and 3\(\times\) the query throughput of P-Masstree-composite and P-Masstree-log under the Skewed-Sec workload, owing to the locality-aware design of PS-Tree. LSMSI has poor scalability due to its coarse-grained lock and non-concurrent logging mechanism. Although the composite index uses the same index structure (P-Masstree), it turns update operations into insert operations, so index operations limit its write scalability; it also inflates the number of KV pairs and thus the tree height, and the increased index overhead limits its query scalability. As for the log-structured approach, its poor data locality restricts its query performance, especially for large-range queries.

Fig. 13. Multi-threaded performance.

5.2.5 Non-Index-Only Query.

We next evaluate non-index-only query operations. Besides the basic compared secondary indexes, we also enhance them by sequentially applying the two optimizations (Section 4.4): the sequence number zone map (+SEQ) and parallel primary table searching, in both its naive (+PAR) and worker-active (+PAR-WA) forms. In this experiment, we use 4 threads for parallel primary table searching. Figure 14 shows the performance and time breakdown of non-index-only query operations. Note that the primary table time in the breakdown of +PAR only shows the time not covered by the secondary index and validation. Perseid brings considerable improvements over LSMSI-PM, even when the latter has the optimizations applied: Perseid outperforms LSMSI-PM by up to 62% without, and by up to 2.3\(\times\) with, the optimizations on primary table lookups (the sequence number zone map and parallel primary table searching).

Fig. 14. Non-index-only query performance. The primary table time on +PAR only shows the time not covered by other parts.

Though the primary key index does reduce unnecessary point lookups on the primary table for LSMSI-PM, it still incurs significant overhead even with advanced low-latency storage devices and sufficient DRAM caching. By contrast, Perseid’s hybrid PM-DRAM validation reduces primary table lookups with subtle extra overhead.

Perseid’s optimizations on primary table searching can also boost the other compared secondary indexes. The zone map improves the overall query performance of the KV store with Perseid by about 50%, and the worker-active parallel primary table searching improves it further by up to 3.1\(\times\). For Perseid, worker-active parallel searching exceeds naive parallel searching by up to 30%; the effect is more evident when the limit on returned results is small, as the load imbalance among parallel worker threads is then more prominent. For LSMSI-PM, however, the corresponding numbers are only 20-36% and up to 2.4\(\times\). This is because these optimizations only accelerate the primary table lookups, while LSMSI-PM itself still has huge overheads. In addition, LSMSI-PM has to finish the heavy validation before it can hand lookup tasks to parallel worker threads, so the parallel threads cannot be kept busy; for the same reason, worker-active parallel searching adds little over naive parallel searching.
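As a rough illustration of why dynamic work claiming mitigates load imbalance, the following is a minimal sketch under our own simplifying assumptions; it is not Perseid’s implementation, and PointLookup is a placeholder for an LSM point read. Workers claim validated primary keys from a shared counter instead of receiving a fixed static partition, so fast workers naturally take over work from slow ones.

#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Placeholder for a point lookup on the LSM primary table.
using PointLookup = std::function<std::string(const std::string& primary_key)>;

// Worker-active style parallel searching (simplified): every worker repeatedly
// claims the next unclaimed key via an atomic counter and fetches the record.
std::vector<std::string> ParallelFetch(const std::vector<std::string>& keys,
                                       const PointLookup& get,
                                       unsigned num_workers) {
  std::vector<std::string> records(keys.size());
  std::atomic<size_t> next{0};

  auto worker = [&]() {
    for (;;) {
      size_t i = next.fetch_add(1, std::memory_order_relaxed);
      if (i >= keys.size()) break;
      records[i] = get(keys[i]);  // one primary-table lookup per claimed key
    }
  };

  std::vector<std::thread> threads;
  for (unsigned t = 0; t < num_workers; ++t) threads.emplace_back(worker);
  for (auto& t : threads) t.join();
  return records;
}

In this simplified form the benefit is purely load balancing; in the evaluated systems it also depends on how early the validated keys become available to the workers, which is why the gain is small for LSMSI-PM.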

We also implement the secondary indexes on a leveling-based LSM primary table (LevelDB [30]) and repeat the experiments. Figure 15 shows the results of Skewed-Sec as an example. The main difference is that the sequence number zone map is less effective on leveling-based LSM primary tables. However, the zone map is still effective when the limit is small, since the latest few records stay in MemTables or in SSTables at low levels such as \(L_0\), and these components can be filtered out by sequence number with high probability.
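For reference, the zone-map check itself can be as simple as the sketch below. This is an assumption-based illustration, not the actual code: we assume each component records the minimum and maximum sequence numbers it contains and that the query already knows the sequence number of the record version it needs; the struct and function names are ours.

#include <cstdint>
#include <vector>

// Hypothetical per-component metadata: the range of sequence numbers stored
// in a MemTable or SSTable.
struct SeqZone {
  uint64_t min_seq;
  uint64_t max_seq;
};

// Returns the indices of the components that may contain a record with the
// given sequence number; all other components are skipped without any lookup.
std::vector<size_t> ComponentsToSearch(const std::vector<SeqZone>& zones,
                                       uint64_t target_seq) {
  std::vector<size_t> candidates;
  for (size_t i = 0; i < zones.size(); ++i) {
    if (target_seq >= zones[i].min_seq && target_seq <= zones[i].max_seq)
      candidates.push_back(i);
  }
  return candidates;
}

Under leveling, compaction merges wide sequence-number ranges into deeper levels, so fewer components can be ruled out; this is consistent with the zone map being most useful when the target records are recent.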

Fig. 15. Non-index-only query performance on leveling-based LSM table.

5.3 Mixed Workloads

In this section, we evaluate Perseid, the composite P-Masstree, and LSMSI-PM under mixed workloads. The mixed workloads consist of interleaved operations of various types, which are more representative of real-world workloads. Each workload has 40 million operations, containing both Skewed-Pri and Skewed-Sec operations; Table 1 describes their traits. Each system is prefilled with 80 million records before performing the workloads. We also enable Perseid’s optimizations on primary table searching (i.e., the sequence number zone map and worker-active parallel primary table searching) for all systems.

Figure 16 reports the average operation latency for every million operations. At the beginning of the Write-Heavy and Balanced workloads, the PM-based secondary indexes show a latency spike, which is mainly caused by seek-driven compaction in the LSM primary table. Perseid outperforms LSMSI-PM significantly under the different mixed workloads. Even though the primary table dominates the overall operation cost, Perseid still shows visible advantages over the other PM-based indexes. Note that PS-Tree has much lower space overhead than the composite index. Since we set the limit on returned results to 10 for query operations, the log-structured approach is not affected much by its poor data locality.

Fig. 16. Performance of mixed workloads.

Table 1. Mixed Workloads Description (operation ratios of each workload)

Workload       Upsert   Get   Index-Only Query   Non-Index-Only Query
Write-Heavy    70%      10%   10%                10%
Balanced       45%      10%   25%                20%
Read-Heavy     20%      20%   40%                20%

5.4 Recovery Time

We evaluate the recovery time of Perseid and LSMSI-PM after a Zipfian upsert workload of 200 million upsert operations with a single thread. For Perseid, only the volatile validation hash table needs to be recovered: scanning the persistent hash table and rebuilding the volatile hash table takes 2.7 seconds. Note that this rebuild can run in the background, and validation can be served from the persistent hash table until the volatile hash table is restored. For LSMSI-PM, it takes 2.3 seconds and 1.4 seconds to recover the LSM-based secondary index and the primary key index, respectively; their recovery time is mainly spent rebuilding MemTables from logs and varies with the size of the MemTables.
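Conceptually, the rebuild is a single sequential scan. The following is a minimal sketch under the assumption that PM-HT can be iterated as (primary key, latest sequence number) pairs; the interface names are ours, not Perseid’s.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed view of the persistent validation table (PM-HT): an iterable set of
// (primary key, latest sequence number) pairs that survived the restart.
using PersistentEntries = std::vector<std::pair<std::string, uint64_t>>;

// Recovery: one scan of PM-HT repopulates the volatile table (DRAM-HT);
// queries can keep validating against PM-HT until this finishes.
std::unordered_map<std::string, uint64_t>
RebuildVolatileTable(const PersistentEntries& pm_ht) {
  std::unordered_map<std::string, uint64_t> dram_ht;
  dram_ht.reserve(pm_ht.size());
  for (const auto& [key, seq] : pm_ht) dram_ht.emplace(key, seq);
  return dram_ht;
}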


6 RELATED WORK

Secondary Indexing in LSM-based KV stores. Qader et al. [54] conduct a comparative study of secondary indexing techniques in LSM-based systems. They summarize and evaluate several common techniques, including the filter-based embedded index, the composite index, and the posting list. DELI [59] proposes an index maintenance approach that defers expensive index repair to compaction of the primary table. Luo et al. [47] propose several techniques for LSM-based secondary indexes that improve data ingestion and query performance. However, their techniques mainly reduce random device I/Os on traditional disk devices, at the cost of more sequential reads. Based on KV separation [45], SineKV [42] keeps both the primary index and the secondary indexes pointing to the record values, so secondary index queries can get records directly without searching the primary index. However, SineKV has to discard the blind-write attribute and maintain index consistency synchronously. Cuckoo Index [38] enhances filter-based indexing with a cuckoo filter, but as a filter-based index, it does not support range queries.

Despite these proposed optimizations, LSM-based secondary indexing remains inefficient due to the nature of LSM-trees. In this work, we revisit the design of the secondary index with PM.

PM-based indexes. There has been plenty of research on high-performance PM indexes [17, 31, 37, 39, 49, 52, 53, 62, 72]. However, these general-purpose indexes are not directly competent for efficient secondary indexing.

Improving LSM-based KV stores with PM. A large body of work optimizes LSM-based KV stores with PM. NoveLSM [33] introduces a large mutable MemTable on PM to lower the compaction frequency and avoid logging. SLM-DB [32] uses a B\(^+\)-Tree on PM to index KV pairs on disks; SSTables on disks are organized in a single level, which reduces compaction requirements. MatrixKV [69] places level \(L_0\) on PM and adopts fine-grained, parallel column compaction to reduce write stalls in LSM-trees. Facebook redesigns the block cache on PM to reduce DRAM usage and thus the total cost of ownership (TCO) [27, 34]. Different from these efforts, this work revisits secondary indexing for LSM-based KV stores with PM.


7 CONCLUSION

In this article, we revisit secondary indexing in LSM-based storage systems with PM. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems. Perseid overcomes the deficiencies of traditional LSM-based secondary indexing and of naive adaptations of existing PM-based indexes. It achieves much higher query performance than state-of-the-art LSM-based secondary indexing techniques and existing PM-based indexes without sacrificing the write performance of LSM-based storage systems. The prototype of Perseid is open-source at https://github.com/thustorage/perseid.


ACKNOWLEDGMENTS

We sincerely thank all anonymous reviewers for their valuable comments.

Footnotes

1. For clarity, we use record to refer to a KV pair in the primary table, and entry to refer to a KV pair in a secondary index.

2. Index-only query is a common query technique: Users create a covering index that contains specific columns required by queries to avoid the cost of reading the primary table [6, 8, 51]. A non-index-only query searches the secondary index by secondary key to get primary keys and then retrieves full records from the primary table.

REFERENCES

[1] 2022. Apache Cassandra. Retrieved from https://cassandra.apache.org/
[2] 2022. Apache Cassandra: How are Indexes Stored and Updated. Retrieved from https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/dml/dmlIndexInternals.html
[3] 2022. Chirp: A Twitter-like Workload Generator. Retrieved from http://alumni.cs.ucr.edu/ameno002/benchmark/
[4] 2022. Compute Express Link: The Breakthrough CPU-to-Device Interconnect. Retrieved from https://www.computeexpresslink.org/
[5] 2022. MongoDB. Retrieved from https://www.mongodb.com
[6] 2022. MySQL Glossary for Covering Index. Retrieved from https://dev.mysql.com/doc/refman/8.0/en/glossary.html#glos_covering_index
[7] 2022. Persistent Memory Development Kit. Retrieved from https://pmem.io/pmdk/
[8] 2022. PostgreSQL: Documentation: Index-Only Scans and Covering Indexes. Retrieved from https://www.postgresql.org/docs/current/indexes-index-only-scans.html
[9] 2022. Samsung Electronics Unveils Far-Reaching, Next-Generation Memory Solutions at Flash Memory Summit 2022. Retrieved from https://news.samsung.com/global/samsung-electronics-unveils-far-reaching-next-generation-memory-solutions-at-flash-memory-summit-2022/
[10] 2023. MS-SSD—Samsung. Retrieved from https://samsungmsl.com/cmmh/
[11] Alsubaiee Sattam, Altowim Yasser, Altwaijry Hotham, Behm Alexander, Borkar Vinayak, Bu Yingyi, Carey Michael, Cetindil Inci, Cheelangi Madhusudan, Faraaz Khurram, Gabrielova Eugenia, Grover Raman, Heilbron Zachary, Kim Young-Seok, Li Chen, Li Guangqiang, Ok Ji Mahn, Onose Nicola, Pirzadeh Pouria, Tsotras Vassilis, Vernica Rares, Wen Jian, and Westmann Till. 2014. AsterixDB: A scalable, open source BDMS. Proc. VLDB Endow. 7, 14 (Oct. 2014), 1905-1916.
[12] Alsubaiee Sattam, Carey Michael J., and Li Chen. 2015. LSM-based storage and indexing: An old idea with timely benefits. In Proceedings of the 2nd International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data (GeoRich'15). Association for Computing Machinery, New York, NY, 1-6.
[13] Armstrong Timothy G., Ponnekanti Vamsi, Borthakur Dhruba, and Callaghan Mark. 2013. LinkBench: A database benchmark based on the Facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD'13). Association for Computing Machinery, New York, NY, 1185-1196.
[14] Bhandari Kumud, Chakrabarti Dhruva R., and Boehm Hans-J. 2016. Makalu: Fast recoverable allocation of non-volatile memory. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA 2016). Association for Computing Machinery, New York, NY, 677-694.
[15] Cai Wentao, Wen Haosen, Beadle H. Alan, Kjellqvist Chris, Hedayati Mohammad, and Scott Michael L. 2020. Understanding and optimizing persistent memory allocation. In Proceedings of the 2020 ACM SIGPLAN International Symposium on Memory Management (ISMM 2020). Association for Computing Machinery, New York, NY, 60-73.
[16] Cao Zhichao, Dong Siying, Vemuri Sagar, and Du David H. C. 2020. Characterizing, modeling, and benchmarking RocksDB key-value workloads at Facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST 20). USENIX Association, Santa Clara, CA, 209-223. Retrieved from https://www.usenix.org/conference/fast20/presentation/cao-zhichao
[17] Chen Youmin, Lu Youyou, Fang Kedong, Wang Qing, and Shu Jiwu. 2020. uTree: A persistent B+-tree with low tail latency. Proc. VLDB Endow. 13, 12 (July 2020), 2634-2648.
[18] Chen Youmin, Lu Youyou, Yang Fan, Wang Qing, Wang Yang, and Shu Jiwu. 2020. FlatStore: An efficient log-structured key-value storage engine for persistent memory. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'20). Association for Computing Machinery, New York, NY, 1077-1091.
[19] Cooper Brian F., Silberstein Adam, Tam Erwin, Ramakrishnan Raghu, and Sears Russell. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10). Association for Computing Machinery, New York, NY, 143-154.
[20] Corbett James C., Dean Jeffrey, Epstein Michael, Fikes Andrew, Frost Christopher, Furman J. J., Ghemawat Sanjay, Gubarev Andrey, Heiser Christopher, Hochschild Peter, Hsieh Wilson, Kanthak Sebastian, Kogan Eugene, Li Hongyi, Lloyd Alexander, Melnik Sergey, Mwaura David, Nagle David, Quinlan Sean, Rao Rajesh, Rolig Lindsay, Saito Yasushi, Szymaniak Michal, Taylor Christopher, Wang Ruth, and Woodford Dale. 2013. Spanner: Google's globally distributed database. ACM Trans. Comput. Syst. 31, 3, Article 8 (Aug. 2013), 22 pages.
[21] Dai Yifan, Xu Yien, Ganesan Aishwarya, Alagappan Ramnatthan, Kroth Brian, Arpaci-Dusseau Andrea, and Arpaci-Dusseau Remzi. 2020. From WiscKey to Bourbon: A learned index for log-structured merge trees. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 155-171. Retrieved from https://www.usenix.org/conference/osdi20/presentation/dai
[22] Dang Zheng, He Shuibing, Hong Peiyi, Li Zhenxin, Zhang Xuechen, Sun Xian-He, and Chen Gang. 2022. NVAlloc: Rethinking heap metadata management in persistent memory allocators. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'22). Association for Computing Machinery, New York, NY, 115-127.
[23] David Tudor, Guerraoui Rachid, and Trigonakis Vasileios. 2015. Asynchronized concurrency: The secret to scaling concurrent search data structures. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'15). Association for Computing Machinery, New York, NY, 631-644.
[24] Dayan Niv, Athanassoulis Manos, and Idreos Stratos. 2017. Monkey: Optimal navigable key-value store. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD'17). Association for Computing Machinery, New York, NY, 79-94.
[25] Dayan Niv and Twitto Moshe. 2021. Chucky: A succinct cuckoo filter for LSM-tree. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD'21). Association for Computing Machinery, New York, NY, 365-378.
[26] Eisenman Assaf, Gardner Darryl, AbdelRahman Islam, Axboe Jens, Dong Siying, Hazelwood Kim, Petersen Chris, Cidon Asaf, and Katti Sachin. 2018. Reducing DRAM footprint with NVM in Facebook. In Proceedings of the 13th EuroSys Conference (EuroSys'18). Association for Computing Machinery, New York, NY, Article 42, 13 pages.
[27] Eisenman Assaf, Gardner Darryl, AbdelRahman Islam, Axboe Jens, Dong Siying, Hazelwood Kim, Petersen Chris, Cidon Asaf, and Katti Sachin. 2018. Reducing DRAM footprint with NVM in Facebook. In Proceedings of the 13th EuroSys Conference (EuroSys'18). Association for Computing Machinery, New York, NY, Article 42, 13 pages.
[28] Facebook. 2022. RocksDB. Retrieved from https://rocksdb.org/
[29] Facebook. 2022. RocksDB Tuning Guide. Retrieved from https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
[30] Ghemawat Sanjay and Dean Jeff. 2022. LevelDB. Retrieved from https://github.com/google/leveldb
[31] Hwang Deukyeon, Kim Wook-Hee, Won Youjip, and Nam Beomseok. 2018. Endurable transient inconsistency in byte-addressable persistent B+-tree. In Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST 18). USENIX Association, Oakland, CA, 187-200. Retrieved from https://www.usenix.org/conference/fast18/presentation/hwang
[32] Kaiyrakhmet Olzhas, Lee Songyi, Nam Beomseok, Noh Sam H., and Choi Young-ri. 2019. SLM-DB: Single-level key-value store with persistent memory. In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST 19). USENIX Association, Boston, MA, 191-205. Retrieved from https://www.usenix.org/conference/fast19/presentation/kaiyrakhmet
[33] Kannan Sudarsun, Bhat Nitish, Gavrilovska Ada, Arpaci-Dusseau Andrea, and Arpaci-Dusseau Remzi. 2018. Redesigning LSMs for nonvolatile memory with NoveLSM. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 18). USENIX Association, Boston, MA, 993-1005. Retrieved from https://www.usenix.org/conference/atc18/presentation/kannan
[34] Kassa Hiwot Tadese, Akers Jason, Ghosh Mrinmoy, Cao Zhichao, Gogte Vaibhav, and Dreslinski Ronald. 2021. Improving performance of flash based key-value stores using storage class memory as a volatile memory extension. In Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 821-837. Retrieved from https://www.usenix.org/conference/atc21/presentation/kassa
[35] Kassa Hiwot Tadese, Akers Jason, Ghosh Mrinmoy, Cao Zhichao, Gogte Vaibhav, and Dreslinski Ronald. 2022. Power-optimized deployment of key-value stores using storage class memory. ACM Trans. Storage 18, 2, Article 13 (Mar. 2022), 26 pages.
[36] Kim Jongbin, Kim Kihwang, Cho Hyunsoo, Yu Jaeseon, Kang Sooyong, and Jung Hyungsoo. 2021. Rethink the scan in MVCC databases. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD'21). Association for Computing Machinery, New York, NY, 938-950.
[37] Kim Wook-Hee, Krishnan R. Madhava, Fu Xinwei, Kashyap Sanidhya, and Min Changwoo. 2021. PACTree: A high performance persistent range index using PAC guidelines. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP'21). Association for Computing Machinery, New York, NY, 424-439.
[38] Kipf Andreas, Chromejko Damian, Hall Alexander, Boncz Peter, and Andersen David G. 2020. Cuckoo index: A lightweight secondary index structure. Proc. VLDB Endow. 13, 13 (Sep. 2020), 3559-3572.
[39] Lee Se Kwon, Mohan Jayashree, Kashyap Sanidhya, Kim Taesoo, and Chidambaram Vijay. 2019. RECIPE: Converting concurrent DRAM indexes to persistent-memory indexes. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP'19). Association for Computing Machinery, New York, NY, 462-477.
[40] Lepers Baptiste, Balmau Oana, Gupta Karan, and Zwaenepoel Willy. 2019. KVell: The design and implementation of a fast persistent key-value store. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP'19). Association for Computing Machinery, New York, NY, 447-461.
[41] Lepers Baptiste, Balmau Oana, Gupta Karan, and Zwaenepoel Willy. 2020. KVell+: Snapshot isolation without snapshots. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 425-441. Retrieved from https://www.usenix.org/conference/osdi20/presentation/lepers
[42] Li Fei, Lu Youyou, Yang Zhe, and Shu Jiwu. 2020. SineKV: Decoupled secondary indexing for LSM-based key-value stores. In Proceedings of the 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). 1112-1122.
[43] Liu Jihang, Chen Shimin, and Wang Lujun. 2020. LB+Trees: Optimizing persistent index performance on 3DXPoint memory. Proc. VLDB Endow. 13, 7 (Mar. 2020), 1078-1090.
[44] Lu Baotong, Hao Xiangpeng, Wang Tianzheng, and Lo Eric. 2020. Dash: Scalable hashing on persistent memory. Proc. VLDB Endow. 13, 10 (April 2020), 1147-1161.
[45] Lu Lanyue, Pillai Thanumalayan Sankaranarayana, Arpaci-Dusseau Andrea C., and Arpaci-Dusseau Remzi H. 2016. WiscKey: Separating keys from values in SSD-conscious storage. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, Santa Clara, CA, 133-148. Retrieved from https://www.usenix.org/conference/fast16/technical-sessions/presentation/lu
[46] Lu Youyou, Shu Jiwu, Chen Youmin, and Li Tao. 2017. Octopus: An RDMA-enabled distributed persistent memory file system. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 773-785. Retrieved from https://www.usenix.org/conference/atc17/technical-sessions/presentation/lu
[47] Luo Chen and Carey Michael J. 2019. Efficient data ingestion and query processing for LSM-based storage systems. Proc. VLDB Endow. 12, 5 (Jan. 2019), 531-543.
[48] Luo Chen and Carey Michael J. 2020. LSM-based storage techniques: A survey. The VLDB Journal 29, 1 (Jan. 2020), 393-418.
[49] Ma Shaonan, Chen Kang, Chen Shimin, Liu Mengxing, Zhu Jianglang, Kang Hongbo, and Wu Yongwei. 2021. ROART: Range-query optimized persistent ART. In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, 1-16. Retrieved from https://www.usenix.org/conference/fast21/presentation/ma
[50] Mao Yandong, Kohler Eddie, and Morris Robert Tappan. 2012. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM European Conference on Computer Systems (EuroSys'12). Association for Computing Machinery, New York, NY, 183-196.
[51] Matsunobu Yoshinori, Dong Siying, and Lee Herman. 2020. MyRocks: LSM-tree database storage engine serving Facebook's social graph. Proc. VLDB Endow. 13, 12 (Aug. 2020), 3217-3230.
[52] Nam Moohyeon, Cha Hokeun, Choi Young-ri, Noh Sam H., and Nam Beomseok. 2019. Write-optimized dynamic hashing for persistent memory. In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST 19). USENIX Association, Boston, MA, 31-44. Retrieved from https://www.usenix.org/conference/fast19/presentation/nam
[53] Oukid Ismail, Lasperas Johan, Nica Anisoara, Willhalm Thomas, and Lehner Wolfgang. 2016. FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD'16). Association for Computing Machinery, New York, NY, 371-386.
[54] Qader Mohiuddin Abdul, Cheng Shiwen, and Hristidis Vagelis. 2018. A comparative study of secondary indexing techniques in LSM-based NoSQL databases. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD'18). Association for Computing Machinery, New York, NY, 551-566.
[55] Raju Pandian, Kadekodi Rohan, Chidambaram Vijay, and Abraham Ittai. 2017. PebblesDB: Building key-value stores using fragmented log-structured merge trees. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP'17). Association for Computing Machinery, New York, NY, 497-514.
[56] Ruan Chaoyi, Zhang Yingqiang, Bi Chao, Ma Xiaosong, Chen Hao, Li Feifei, Yang Xinjun, Li Cheng, Aboulnaga Ashraf, and Xu Yinlong. 2023. Persistent memory disaggregation for cloud-native relational databases. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS 2023). Association for Computing Machinery, New York, NY, 498-512.
[57] Rumble Stephen M., Kejriwal Ankita, and Ousterhout John. 2014. Log-structured memory for DRAM-based storage. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST'14). USENIX Association, 1-16.
[58] Shu Jiwu, Chen Youmin, Wang Qing, Zhu Bohong, Li Junru, and Lu Youyou. 2020. TH-DPMS: Design and implementation of an RDMA-enabled distributed persistent memory storage system. ACM Trans. Storage 16, 4, Article 24 (Oct. 2020), 31 pages.
[59] Tang Yuzhe, Iyengar Arun, Tan Wei, Fong Liana, Liu Ling, and Palanisamy Balaji. 2015. Deferred lightweight indexing for log-structured key-value stores. In Proceedings of the 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 11-20.
[60] Wang Jing, Lu Youyou, Wang Qing, Xie Minhui, Huang Keji, and Shu Jiwu. 2022. Pacman: An efficient compaction approach for log-structured key-value store on persistent memory. In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 773-788. Retrieved from https://www.usenix.org/conference/atc22/presentation/wang-jing
[61] Wang Jing, Lu Youyou, Wang Qing, Zhang Yuhao, and Shu Jiwu. 2023. Revisiting secondary indexing in LSM-based storage systems with persistent memory. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 817-832. Retrieved from https://www.usenix.org/conference/atc23/presentation/wang-jing
[62] Wang Qing, Lu Youyou, Li Junru, and Shu Jiwu. 2021. Nap: A black-box approach to NUMA-aware persistent memory indexes. In Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 93-111. Retrieved from https://www.usenix.org/conference/osdi21/presentation/wang-qing
[63] Wang Qing, Lu Youyou, Wang Jing, and Shu Jiwu. 2023. Replicating persistent memory key-value stores with efficient RDMA abstraction. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association. Retrieved from https://www.usenix.org/conference/osdi23/presentation/wang-qing
[64] Wu Xingbo, Xu Yuehai, Shao Zili, and Jiang Song. 2015. LSM-trie: An LSM-tree-based ultra-large key-value store for small data items. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, Santa Clara, CA, 71-82. Retrieved from https://www.usenix.org/conference/atc15/technical-session/presentation/wu
[65] Xiang Lingfeng, Zhao Xingsheng, Rao Jia, Jiang Song, and Jiang Hong. 2022. Characterizing the performance of Intel Optane persistent memory: A close look at its on-DIMM buffering. In Proceedings of the Seventeenth European Conference on Computer Systems (EuroSys'22). Association for Computing Machinery, New York, NY, 488-505.
[66] Xie Minhui, Lu Youyou, Wang Qing, Feng Yangyang, Liu Jiaqiang, Ren Kai, and Shu Jiwu. 2023. PetPS: Supporting huge embedding models with persistent memory. Proc. VLDB Endow. 16, 5 (Jan. 2023), 1013-1022.
[67] Yan Baoyue, Cheng Xuntao, Jiang Bo, Chen Shibin, Shang Canfang, Wang Jianying, Huang Gui, Yang Xinjun, Cao Wei, and Li Feifei. 2021. Revisiting the design of LSM-tree based OLTP storage engine with persistent memory. Proc. VLDB Endow. 14, 10 (Jun. 2021), 1872-1885.
[68] Yang Jian, Kim Juno, Hoseinzadeh Morteza, Izraelevitz Joseph, and Swanson Steve. 2020. An empirical guide to the behavior and use of scalable persistent memory. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST 20). USENIX Association, Santa Clara, CA, 169-182. Retrieved from https://www.usenix.org/conference/fast20/presentation/yang
[69] Yao Ting, Zhang Yiwen, Wan Jiguang, Cui Qiu, Tang Liu, Jiang Hong, Xie Changsheng, and He Xubin. 2020. MatrixKV: Reducing write stalls and write amplification in LSM-tree based KV stores with matrix container in NVM. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 17-31. Retrieved from https://www.usenix.org/conference/atc20/presentation/yao
[70] Zhang Huanchen, Liu Xiaoxuan, Andersen David G., Kaminsky Michael, Keeton Kimberly, and Pavlo Andrew. 2020. Order-preserving key compression for in-memory search trees. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD'20). Association for Computing Machinery, New York, NY, 1601-1615.
[71] Zhang Wenhui, Zhao Xingsheng, Jiang Song, and Jiang Hong. 2021. ChameleonDB: A key-value store for Optane persistent memory. In Proceedings of the 16th European Conference on Computer Systems (EuroSys'21). Association for Computing Machinery, New York, NY, 194-209.
[72] Zhou Xinjing, Shou Lidan, Chen Ke, Hu Wei, and Chen Gang. 2019. DPTree: Differential indexing for persistent memory. Proc. VLDB Endow. 13, 4 (Dec. 2019), 421-434.
[73] Zuo Pengfei, Hua Yu, and Wu Jie. 2018. Write-optimized and high-performance hashing index scheme for persistent memory. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 461-476. Retrieved from https://www.usenix.org/conference/osdi18/presentation/zuo
