Abstract
LSM-based storage systems are widely used for their superior write performance on block devices. However, they currently fail to efficiently support secondary indexing, since a secondary index query operation usually needs to retrieve multiple small values that are scattered across multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly competent for efficient secondary indexing. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems, which takes into account both the characteristics of PM and secondary indexing. Perseid consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM and hash-based validation approach to filter out obsolete values with low overhead, and (3) two adapted optimizations on primary table searching issued from secondary indexes to accelerate non-index-only queries. Our evaluation shows that Perseid outperforms existing PM-based indexes by 3–7\(\times\) and achieves about two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques even when they run on PM instead of disks.
1 INTRODUCTION
Log-Structured Merge trees (LSM-trees) feature outstanding write performance and thus have been widely adopted in modern key-value (KV) stores, such as RocksDB [28] and Cassandra [1].
Different from in-place update storage structures (e.g., B\(^+\)-Tree), LSM-trees buffer writes in memory and flush them to storage devices in batches periodically to avoid random writes, which enables high write performance and low device write amplification. Besides high write performance, many database applications also require high-performance queries on not only primary keys but also other specific values [13], thus necessitating secondary indexing techniques.
LSM-trees’ attributes make it challenging to design efficient secondary indexing. Modern LSM-based storage systems typically store a secondary index as another LSM-tree [54] (e.g., a column family in RocksDB [51]). However, designed for block devices and optimized for write performance, LSM-trees are not competent data structures for secondary indexes, which require high search performance. First, since secondary indexes usually only store primary keys instead of full records\(^1\) as values, KV pairs in secondary indexes are small. LSM-trees’ heavy lookup operations are inefficient for these small KV pairs. Second, secondary keys are not unique and can have multiple associated primary keys. LSM-trees’ out-of-place write pattern scatters these values (i.e., associated primary keys), which do not arrive consecutively, into multiple pieces at different levels. Consequently, query operations need to search all levels in the LSM-based secondary index to fetch these value pieces. Besides the device I/O overhead, LSM-trees have non-negligible CPU and memory overheads (i.e., indexing and Bloom filters) [21, 25, 40].
Moreover, the consistency of secondary indexes is another issue in LSM-based storage systems. As an LSM-based primary table adopts the blind-write pattern to insert, update, and delete records (appending new data without checking prior data, as opposed to the read-modify-write pattern of B\(^+\)-Trees) for high write performance, it is unable to delete the obsolete entry in a secondary index without acquiring the old secondary key. Consequently, when querying a secondary index, the system should validate each entry by checking the primary table before returning the results to users, which introduces many unnecessary but expensive lookups on the primary table for obsolete entries. Some systems fetch old records when updating or deleting records to keep secondary indexes up-to-date synchronously [11, 51], whereas this method discards the blind-write attribute and thus degrades the write performance.
Though many efforts have been made to alleviate these predicaments [42, 47, 54, 59], they struggle to solve the problems discussed above well, sacrificing either the write performance of the LSM-based storage system or the query performance of the secondary index.
As secondary indexing demands low-latency queries and the KV pairs of secondary indexes are small, we argue that leveraging persistent memory (PM) to provide a new solution for secondary indexing is promising. PM has many attractive advantages such as byte-addressability, DRAM-comparable access latency, and data persistency, which are well suited to secondary indexing. Though there are many state-of-the-art PM-based indexes [17, 31, 37, 39, 43, 44, 52, 53, 72, 73], none of them are designed for secondary indexing. Without considering the non-unique feature of secondary indexes and consistency in LSM-based KV stores, simply adopting existing general PM-based indexes as secondary indexes can overshadow their performance.
In this work, we propose Perseid [61], a new PM-based secondary indexing mechanism for LSM-based storage systems that takes both the characteristics of PM and of secondary indexing into account. Perseid contains PS-Tree, a secondary index structure specifically designed for PM that achieves high-performance insertion and query.
Moreover, Perseid retains the blind-write attribute of LSM-based KV stores for high write performance without sacrificing secondary index query performance. This is achieved by a lightweight hybrid PM-DRAM and hash-based validation approach in Perseid. Perseid uses a hash table on PM to record the latest version of primary keys. However, multiple random accesses on PM still incur high latencies. Thus, Perseid adopts a small mirror of the validation hash table on DRAM, which only contains useful information for validation. During validation, the volatile hash table absorbs random accesses to PM, and thus reduces the validation overhead. The small volatile hash table not only saves DRAM memory space but also reduces cache pollution.
Perseid achieves fairly low latency for index-only queries.\(^2\) However, the overhead of non-index-only queries is still dominated by the LSM-based primary table. Therefore, we further propose two optimizations for non-index-only queries in Perseid. First, as querying the primary table issued by the secondary index is an internal operation, we can locate KV pairs with additional auxiliary information much more efficiently, reducing cumbersome indexing operations. When the primary table adopts the tiering compaction strategy [24, 48], we can further bypass Bloom filter checking. Second, as one secondary index query may need to search for multiple independent records in the primary table, we parallelize these searching operations with multiple threads. Since search latencies on the LSM-based primary table may vary largely, we apply a worker-active scheme to the parallel threads to avoid load imbalance and improve utilization.
We implement Perseid and evaluate it against state-of-the-art PM-based indexes and LSM-based secondary indexing techniques on PM. The evaluation results show that Perseid outperforms existing PM-based indexes by 3–7\(\times\) for queries, and achieves about two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques even when they run on PM instead of disks, while maintaining the high write performance of LSM-based storage systems.
In summary, this article makes the following contributions:
— Analysis of the inefficiencies of LSM-based secondary indexing techniques and existing PM-based indexes when adopted as secondary indexes for LSM-based KV stores.
— Perseid, an efficient PM-based secondary indexing mechanism, which includes a secondary-index-friendly structure, a lightweight validation approach, and two optimizations on primary table searching issued from secondary indexes.
— Experiments that demonstrate the advantages of Perseid.
2 BACKGROUND
2.1 Log-Structured Merge Trees
The LSM-tree applies out-of-place updates and performs sequential writes, which achieves superior write performance compared with other in-place-update storage structures.
The LSM-tree has a multi-level structure on storage and each level comprises one or several sorted runs. The size of Level \(L_n\) is several times (e.g., 10\(\times\)) that of Level \(L_{n-1}\). Each sorted run contains sorted KV pairs and is further partitioned into multiple small components called SSTables. In LSM-trees, new KV pairs are first buffered into a memory component called a MemTable. When the MemTable fills up, it turns into an immutable MemTable and gets flushed to storage as a sorted run. Since sorted runs have overlapping key ranges, a query operation needs to search multiple sorted runs. To limit the number of sorted runs and improve search efficiency, LSM-trees conduct compaction periodically to merge several components and remove obsolete KV pairs.
Two typical compaction strategies and their variants are commonly used in LSM-trees [24, 48]. The leveling strategy [28, 30] allows each level (besides \(L_0\)) to have only one sorted run. When a level (\(L_n\)) exceeds its size limit, one or more SSTables from level \(L_n\) and all overlapped SSTables from the higher level \(L_{n+1}\) are sort-merged to generate new SSTables in level \(L_{n+1}\). The tiering strategy [55, 64] allows each level (besides \(L_0\)) to have multiple sorted runs to reduce the write amplification. To compact SSTables at level \(L_n\), several SSTables in a range partition are merged into new SSTables that are written directly to level \(L_{n+1}\), without rewriting existing SSTables at level \(L_{n+1}\). Compared with the leveling strategy, the tiering strategy has a much smaller write amplification ratio and thus higher write performance. However, since query operations need to search multiple sorted runs in each level, LSM-trees with a tiering strategy have much lower read performance.
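The practical consequence of choosing leveling or tiering is the number of sorted runs a point lookup may probe. The minimal C++ sketch below (illustrative only, not from the paper; the level and run counts are assumptions) makes this read-cost difference concrete.

```cpp
// A minimal sketch contrasting how many sorted runs a point lookup may probe
// under leveling vs. tiering. Each level holds one run under leveling and up
// to several runs under tiering, so the probe count (and thus read cost) grows
// with the number of runs per level.
#include <cstdio>

int runs_probed(int levels, int l0_runs, int runs_per_other_level) {
    // L0 is usually tiered in practice; deeper levels depend on the strategy.
    return l0_runs + (levels - 1) * runs_per_other_level;
}

int main() {
    int levels = 5, l0_runs = 4;
    printf("leveling: up to %d runs probed per lookup\n", runs_probed(levels, l0_runs, 1));
    printf("tiering : up to %d runs probed per lookup\n", runs_probed(levels, l0_runs, 4));
    return 0;
}
```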
2.2 Secondary Index in LSM-Based Systems
Many applications require queries on specific values other than primary keys. Without an index based on specific values, database systems need to scan the whole table to find relevant data. Thus, secondary indexing is an indispensable technique in database systems. For example, in Facebook’s database service for social graphs, secondary keys are heavily used, such as finding IDs who liked a specific photo [13, 51]. In this work, we mainly discuss stand-alone secondary indexes, which are separate index structures apart from the primary table and are commonly used in database systems [54]. A stand-alone secondary index maintains mappings from each secondary key to its associated primary keys. As secondary keys are not unique, a single secondary key can have multiple associated primary keys.
Consistency Strategy. Since LSM-based KV stores update or delete records by out-of-place blind-writes, maintaining consistency of secondary indexes becomes a challenge in LSM-based storage systems. There are two strategies to handle this issue, Synchronous and Validation.
For Synchronous strategy, whenever a record is written in the primary table, the secondary index is maintained synchronously to reflect the latest and valid status (e.g., AsterixDB [11], MongoDB [5], MyRocks [51]). For example, as shown in Figure 1(a), when writing a new record {p2\(\rightarrow\)s1} (p denotes the primary key, s denotes the secondary key, and other fields are omitted for simplicity) into the primary table, the storage system also fetches the old record of p2 to get its old secondary key s2. Then, the storage system inserts not only a new entry {s1\(\rightarrow\)p2} but also a tombstone to delete the obsolete entry {s2\(\rightarrow\)p2} in the secondary index. Nevertheless, this strategy discards the blind-write attribute and thus degrades the write performance which is the main advantage of LSM-based KV stores.
By contrast, as shown in Figure 1(b), Validation strategy only inserts the new entry {s1\(\rightarrow\)p2} but does not maintain the consistency of obsolete entries in secondary indexes (e.g., Cassandra [1, 2], DELI [59], and secondary indexing proposed by Luo et al. [47]). However, secondary index query operations need to validate all relevant entries by checking the primary table to filter out obsolete mappings. Though previous work proposed some approaches to reduce the validation overhead, their benefits are limited. For example, DELI [59] lazily repairs the secondary index along with compaction of the primary table. Luo et al. [47] propose to store an extra timestamp for each entry in the secondary index and use a primary key index that only stores primary keys and their latest timestamp for validation. The primary key index is validated instead of the primary table. However, since the primary key index is also an LSM-tree, though it filters out unnecessary point lookups on the primary table, it still requires point lookups on itself.
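To make the two strategies concrete, the following sketch contrasts them with volatile stand-in containers (std::map/std::set are placeholders for the LSM primary table and the secondary index, and all names are assumptions): the synchronous path pays an extra read of the old record to remove the obsolete entry, while the validation path writes blindly and defers cleanup to query-time validation.

```cpp
// A minimal sketch (assumed interfaces, not a real system) of the two
// consistency strategies from Figure 1.
#include <map>
#include <set>
#include <string>
#include <utility>

using PrimaryTable = std::map<std::string, std::string>;            // pkey -> skey (other fields omitted)
using SecondaryIdx = std::set<std::pair<std::string, std::string>>; // (skey, pkey) entries

void put_synchronous(PrimaryTable& pt, SecondaryIdx& si,
                     const std::string& pkey, const std::string& skey) {
    auto it = pt.find(pkey);                          // extra read: fetch the old record
    if (it != pt.end()) si.erase({it->second, pkey}); // delete obsolete entry {old_skey -> pkey}
    pt[pkey] = skey;
    si.insert({skey, pkey});                          // insert new entry {skey -> pkey}
}

void put_validation(PrimaryTable& pt, SecondaryIdx& si,
                    const std::string& pkey, const std::string& skey) {
    pt[pkey] = skey;                                  // blind write: no read of the old record
    si.insert({skey, pkey});                          // obsolete {old_skey -> pkey} stays until validated
}
```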
Index Type. As a secondary key can have multiple associated primary keys, LSM-based secondary indexes have two types surrounding this issue, including composite index and posting list [54]. The key in a composite index (i.e., composite key) is a concatenation of a secondary key and a primary key. The composite index is easy to implement and adopted by many systems [20, 51, 54]. However, it turns a secondary lookup operation into a prefix range search operation.
The posting list stores multiple associated primary keys in the value of a KV pair. Entries in each posting list can be sorted by primary keys or by recency. When a new record is inserted, there are two update strategies. The eager update strategy conducts a read-modify-write, fetching the old posting list and merging the new primary key into it. The lazy update strategy blindly inserts a new posting list that only includes the new primary key, leaving posting-list merging to compaction. However, a secondary lookup then needs to search all levels to fetch all relevant entries.
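As a rough illustration of the two index types (again with volatile stand-ins rather than LSM components, and an assumed 64-bit key encoding), a composite index answers a secondary lookup with a prefix range scan over (skey, pkey) keys, whereas a posting list returns all primary keys with a single point lookup:

```cpp
// A minimal sketch of the composite index vs. posting list layouts in Section 2.2.
#include <cstdint>
#include <map>
#include <vector>

// Composite index: key = (skey, pkey); the value is unused here.
using CompositeIndex = std::map<std::pair<uint64_t, uint64_t>, bool>;
// Posting list index: key = skey, value = list of associated pkeys.
using PostingIndex = std::map<uint64_t, std::vector<uint64_t>>;

std::vector<uint64_t> lookup_composite(const CompositeIndex& idx, uint64_t skey) {
    std::vector<uint64_t> pkeys;
    // A secondary lookup becomes a prefix range scan over all keys starting with skey.
    for (auto it = idx.lower_bound({skey, 0}); it != idx.end() && it->first.first == skey; ++it)
        pkeys.push_back(it->first.second);
    return pkeys;
}

std::vector<uint64_t> lookup_posting(const PostingIndex& idx, uint64_t skey) {
    auto it = idx.find(skey);          // a single point lookup returns the whole list
    return it == idx.end() ? std::vector<uint64_t>{} : it->second;
}
```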
Limitations. Even though there are multiple strategies, types, and optimizations, LSM-based secondary indexes have to sacrifice either the write performance of storage systems or the secondary index query performance, which results from the incompatibility of inherent attributes of LSM-trees and characteristics of secondary indexes.
2.3 Persistent Memory
PM, also called Non-Volatile Memory (NVM) or Storage Class Memory (SCM), provides several attractive benefits for storage systems, such as byte-addressability, DRAM-comparable access latency, and data persistency. CPUs can access data on PM directly with load and store instructions. Besides, compared to DRAM, PM has a much larger capacity, lower cost, and lower power consumption. Therefore, both academia and industry have proposed plenty of work to harness PM’s benefits in storage systems [18, 26, 33, 46, 53, 56, 58, 66, 67, 73]. In addition to DDR bus-connected PM (e.g., Intel Optane DCPMM), the recent high-bandwidth and low-latency IO interconnection, Compute Express Link (CXL) [4, 35], brings a new form of SCM, CXL device-attached memory (e.g., Samsung’s CMM-H (CXL Memory Module - Hybrid) [9, 10]).
However, PM also has some performance idiosyncrasies. For example, the current commercial PM hardware (i.e., Intel Optane DCPMM) has a physical media access granularity of 256 bytes, leading to high random access latency (about 3\(\times\) of DRAM in terms of reads) and write amplification for small random writes, which needs to be considered when designing PM systems [18, 60, 63, 65, 68, 71]. These performance idiosyncrasies are likely to be a general problem, or even more obvious, in other PM devices due to physical media characteristics (e.g., flash pages in CXL-SSD).
Though Intel Optane DCPMM is currently the only available commercial device, we believe it can represent other emerging PM devices to some extent. In this article, we mainly focus on PM’s general characteristics described above but not the specific numbers of Optane’s attributes.
3 MOTIVATION
Though recent work introduces some techniques to optimize secondary indexing in LSM-based systems, we find that the performance of LSM-based secondary indexing is still unsatisfactory due to the incompatibility between the inherent attributes of LSM-trees and the characteristics of secondary indexing. On the one hand, the LSM-tree is not a competent data structure for secondary indexes, since the characteristics of secondary indexes exacerbate the deficiency of LSM-trees’ read operations: (1) KV pairs in secondary indexes are usually small, for which LSM-trees’ cumbersome lookup operations are inefficient; (2) secondary keys are not unique and can have multiple values, and LSM-trees’ out-of-place updates further exacerbate the query inefficiency. On the other hand, the blind-write attribute of LSM-based primary tables makes maintaining the consistency of secondary indexes troublesome.
Therefore, this motivates us to find a better solution for secondary indexes in LSM-based storage systems. As PM provides attractive features such as byte-addressability, DRAM-comparable access latency, and data persistency, we argue that it is promising to provide secondary indexing with PM.
Though there are many state-of-the-art PM-based index structures, they are not specifically designed for secondary indexing. To adopt them as secondary indexes (e.g., support the multi-value feature), naive approaches include the composite index or using a conventional allocator to organize posting lists (Section 2.2). However, simply adopting these naive approaches to use existing PM-based indexes as secondary indexes will overshadow their superior advantages.
Why not use a PM-based composite index? Though this method is straightforward and easy to implement in LSM-based systems, it is not ideal for tree-based persistent indexes. First, when adding or removing a primary key for a secondary key, a value update operation turns into a new composite key insert or delete operation for composite indexes. Insert and delete operations are more expensive than update operations in a PM-based tree index because they may cause shift operations or structural modification operations (SMOs). Second, composite indexes store every pair of mappings as an individual KV pair, expanding the number of KV pairs, which increases the height of the tree index and thus degrades its query performance. Third, storing the same secondary keys repeatedly in multiple composite keys wastes PM space, which can be a dominant overhead for some real-world databases [70].
Why not use a conventional allocator for posting lists? One may use a conventional allocator, such as a slab-based allocator or a log-structured approach, to allocate space for values (posting lists) outside the index. Nevertheless, they are not suitable for values of secondary keys. One way is allocating space for a whole posting list for each secondary key with a general-purpose allocator such as a slab-based allocator. However, these general-purpose allocators usually have high overheads on PM since they conduct expensive mechanisms for crash consistency (e.g., logging) and perform many small writes on their metadata which is necessary for recovery [7]. Though some PM allocators relieve allocation overheads by techniques such as deferring garbage collection to post-failure [14, 15], slab-based allocators have low memory utilization due to the memory fragmentation issue [57], which cannot be eliminated by restarting on PM [22]. Worse still, these issues are more severe for secondary indexes. In secondary indexes, a posting list of a secondary key is changed by inserting or removing primary keys, which means the size of the posting list (the total size of associated primary keys) changes constantly. This characteristic requires frequent reallocations and copy-on-writes. Another way is allocating space for each individual new value (primary key) of a posting list and using pointers to link them together. One can use a lightweight and PM-friendly allocator, such as a log-structured approach, for its sequential-write pattern. However, it will scatter the primary keys associated with the same secondary key into multiple pieces and thus reduce query performance due to poor data locality.
Our experiments (Section 5.2) show that these naive approaches on PM-based indexes lead to several times performance degradation. It thus motivates us to explore a new PM-based secondary indexing mechanism for LSM-based KV stores. In addition, an efficient validation approach is required to retain the blind-write attribute of LSM-based KV stores.
4 PERSEID DESIGN
4.1 Overview
Motivated by the analysis above, we propose Perseid, a PM-based secondary indexing mechanism for LSM-based storage systems, which overcomes traditional LSM-based secondary indexes’ deficiencies. Figure 2 shows the overall architecture of an LSM-based storage system with Perseid.
— Perseid contains a PM-based secondary index, PS-Tree, a structure specifically designed for secondary indexing on PM that provides high-performance insertion and query (Section 4.2).
— Perseid retains the blind-write attribute of the LSM primary table for write performance (i.e., taking the validation strategy (Section 2.2)), without sacrificing query performance, by introducing a lightweight hybrid PM-DRAM and hash-based validation approach. The validation approach contains a persistent hash table to record version information of primary keys, and a volatile, lightweight hash table to absorb random accesses to PM (Section 4.3).
— To accelerate non-index-only queries, Perseid adapts two optimizations on primary table searching issued from secondary indexes. Perseid filters out irrelevant component searching with sequence numbers and parallelizes primary table searching in an efficient way (Section 4.4).
4.2 PS-Tree Design
Perseid introduces PS-Tree, a PM-based structure designed for secondary indexing, as its secondary index.
4.2.1 Structure.
The overall structure of PS-Tree comprises two layers: a SKey Layer that indexes secondary keys, and a PKey Layer that stores their associated primary keys.
In the PKey Layer, primary key entries (PKey Entries) are stored in PKey Pages. Each PKey Entry has an 8-byte metadata header and a primary key. The header consists of a 2-byte size, a 1-bit obsolete flag, and a 47-bit sequence number (SQN) of the primary key. The SQN is internally used for multi-version concurrency control (MVCC) in LSM-based KV stores [28, 30]. Each new record (including updates and deletes) in the primary table gets a monotonically increased SQN. Perseid leverages the SQN mechanism to guarantee data consistency among the primary table and secondary indexes, and also for validation which will be described in Section 4.3. PKey Pages are aligned to PM physical media access granularity (e.g., 256 bytes of Intel Optane DCPMM [68]).
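For concreteness, the following struct sketches the PKey Entry header with the field widths given above; the actual layout and field names in Perseid's implementation may differ.

```cpp
// A sketch of the 8-byte PKey Entry metadata header (widths from the text):
// a 2-byte size, a 1-bit obsolete flag, and a 47-bit sequence number, followed
// by the variable-length primary key.
#include <cstddef>
#include <cstdint>

struct PKeyEntryHeader {
    uint64_t size     : 16;  // length of the primary key that follows
    uint64_t obsolete : 1;   // set when the entry is known to be obsolete
    uint64_t sqn      : 47;  // sequence number (MVCC version) of the record
};
static_assert(sizeof(PKeyEntryHeader) == 8, "header must stay 8 bytes");

// PKey Pages are aligned to the PM media access granularity (e.g., 256 B on Optane DCPMM).
constexpr size_t kPKeyPageAlignment = 256;
```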
Nevertheless, traditional log-structured approaches scatter different values of the same secondary key in the log, resulting in poor data locality and degraded query performance. To improve data locality, PS-Tree clusters the PKey Entries of the same secondary key into PKey Groups and keeps them close together through locality-aware PKey Page splits (Section 4.2.2).
4.2.2 Basic Operations.
Log-Structured Insert. Algorithm 1 describes the process of the insert operation in PS-Tree. First, the new PKey Entry, carrying the primary key and its sequence number, is appended to the PKey Layer in a log-structured manner.
Second, PS-Tree links the new PKey Entry to the previous entries of the same secondary key, forming a new PKey Group headed by the newest entry.
Third, the new pointer of the SKey (i.e., the address of the new PKey Group) is updated or inserted in the SKey Layer (Line 13). Thus, the insert request usually performs an update operation in the SKey Layer. PKey Entries of a secondary key are always linked in the order of recency to facilitate query operations, which usually require the most recent entries [13, 54].
Search. Algorithm 2 describes the process of the search operation in PS-Tree.
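The sketch below condenses the insert and search paths described above into a small volatile model (assumed names; PKey Pages, persistence ordering, validation, and concurrency are all omitted): each insert appends a PKey Entry and re-points the SKey Layer at it, and a search walks the recency-ordered chain until the requested limit is reached.

```cpp
// A highly simplified, volatile sketch of PS-Tree's insert and search paths.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct PKeyEntry {
    uint64_t sqn;
    uint64_t pkey;
    int64_t  prev = -1;  // index of the previous (older) entry of the same skey
};

struct PSTreeSketch {
    std::vector<PKeyEntry> log;                            // stand-in for PKey Pages
    std::unordered_map<std::string, int64_t> skey_layer;   // skey -> newest entry

    void insert(const std::string& skey, uint64_t pkey, uint64_t sqn) {
        int64_t prev = -1;
        auto it = skey_layer.find(skey);
        if (it != skey_layer.end()) prev = it->second;
        log.push_back({sqn, pkey, prev});                  // 1) log-structured append
        skey_layer[skey] = (int64_t)log.size() - 1;        // 2) update/insert SKey pointer
    }

    // Most-recent-first traversal of a secondary key's primary keys.
    std::vector<uint64_t> search(const std::string& skey, size_t limit) const {
        std::vector<uint64_t> out;
        auto it = skey_layer.find(skey);
        int64_t cur = (it == skey_layer.end()) ? -1 : it->second;
        while (cur >= 0 && out.size() < limit) {
            out.push_back(log[cur].pkey);
            cur = log[cur].prev;
        }
        return out;
    }
};
```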
Update and Delete.
Locality-aware PKey Page Split with Garbage Collection. When a PKey Page does not have enough space for a new entry, it splits into two new PKey Pages in a copy-on-write manner. Algorithm 3 shows the process of the PKey Page split operation. Since insertions are performed in a log-structured manner, the PKey Entries associated with one SKey may be scattered discontinuously. Querying these entries may need multiple random accesses on PM. As PM has non-negligible read latencies compared with DRAM (e.g., about 300 ns with Intel Optane DCPMM [68]), query operations can have high overheads. Therefore, as shown in Figure 4, to improve locality, PS-Tree reorganizes entries during the split so that PKey Entries of the same secondary key are placed contiguously in the new PKey Pages.
Besides, entries not marked as deleted in the current PKey Page are validated by a lightweight approach (described in Section 4.3), and obsolete entries are physically removed during reorganization to reduce space overhead (Lines 9–11 in Algorithm 3).
A skewed secondary key may have many primary keys that occupy more than one PKey Page. For those PKey Entries not in the current PKey Page,
To support MVCC,
Crash Consistency. Perseid relies on the existing write-ahead-log (WAL) of the LSM-based primary table to guarantee atomic durability among the primary table and secondary indexes. During recovery with the WAL, Perseid redoes uncompleted operations to the PS-Tree.
4.3 Hybrid PM-DRAM Validation
Perseid adopts the Validation strategy (see Section 2.2) for high write performance, which necessitates a lightweight validation approach. Since update-intensive workloads are quite common nowadays [13, 16], a heavy validation approach would spend huge overhead validating a large number of obsolete entries while yielding no results.
4.3.1 Structure.
Perseid introduces a lightweight validation approach based on the requirements of validation. Perseid adopts a hash table on PM storing version information for primary keys. The hash table is indexed by the primary key and stores its latest sequence number (Section 4.2.1). Nevertheless, even though point lookups on a PM-based hash table are much faster than on a tree, the validation time is comparable to the query time of PS-Tree itself because it incurs multiple random accesses on PM. Therefore, Perseid further maintains a small mirror of the validation hash table on DRAM, which only contains the information necessary for validation and absorbs these random accesses.
Figure 5 illustrates the hybrid PM-DRAM validation approach. The values in the hash tables consist of the sequence number of the record (6-byte) and a 2-byte counter. The counter is used to determine whether a primary key has obsolete versions. There is a slight difference in the counters of the two hash tables. In the volatile hash table, each counter indicates the number of logically existing entries related to a primary key in the secondary index. By contrast, each counter in the persistent hash table indicates the number of physically existing entries in the secondary index.
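A sketch of the 8-byte hash-table value implied by the sizes above (the concrete encoding in Perseid may differ):

```cpp
// The value stored in the validation hash tables: a 6-byte sequence number
// plus a 2-byte counter tracking how many entries of the primary key still
// exist in the secondary index.
#include <cstdint>

struct ValidationValue {
    uint64_t sqn     : 48;  // latest sequence number of the primary key
    uint64_t counter : 16;  // volatile HT: logically existing entries;
                            // persistent HT: physically existing entries
};
static_assert(sizeof(ValidationValue) == 8, "value must stay 8 bytes");
```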
4.3.2 Basic Operations.
Next, we describe the validation approach in detail according to operations.
Upsert. The process of upsert operation on validation hash tables is shown in Algorithm 4. When a new record (including update and delete) is inserted into the primary table, the primary key is inserted or updated with its sequence number into the persistent hash table. If the persistent hash table does not contain this primary key before, its counter is set to one (Line 9), which means this primary key has only one version and no obsolete entries of this primary key exist in the secondary index. For example in Figure 5, at t2, key c is inserted for the first time, and it is inserted into the persistent hash table. Otherwise, the primary key’s counter in the persistent hash table is increased by one (Line 4); besides, the primary key is inserted or updated with its sequence number into the volatile hash table, and the counter in the volatile hash table is set to two if it’s an insertion or increased by one if it’s an update (Lines 5–7). For example, when key c is updated with a new version v2 at t3 in Figure 5, the entry in the persistent hash table is updated, and a new entry is inserted into the volatile hash table.
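The counter maintenance described above can be summarized by the following sketch (volatile stand-ins for both hash tables, assumed names):

```cpp
// A minimal sketch of the upsert logic in Algorithm 4: the persistent table
// always records the newest sequence number, and the volatile table only
// starts tracking a primary key once it has more than one version in the
// secondary index.
#include <cstdint>
#include <unordered_map>

struct VVal { uint64_t sqn; uint32_t counter; };
using HashTable = std::unordered_map<uint64_t, VVal>;  // pkey -> {sqn, counter}

void upsert(HashTable& pm_ht, HashTable& dram_ht, uint64_t pkey, uint64_t sqn) {
    auto it = pm_ht.find(pkey);
    if (it == pm_ht.end()) {
        pm_ht[pkey] = {sqn, 1};            // first version: no obsolete entries yet
        return;
    }
    it->second.sqn = sqn;                  // record the newest version
    it->second.counter += 1;               // one more physical entry in the index
    auto dit = dram_ht.find(pkey);
    if (dit == dram_ht.end()) {
        dram_ht[pkey] = {sqn, 2};          // the old entry plus the new one
    } else {
        dit->second.sqn = sqn;
        dit->second.counter += 1;
    }
}
```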
Validate. The secondary index validates an entry by querying the volatile hash table, which is shown in Algorithm 5. Specifically, the entry is valid if the sequence number of this entry matches the latest sequence number stored in the hash table, or the hash table does not contain the primary key which means there are no obsolete entries of this primary key (Line 2 in Algorithm 5). Otherwise, the entry may be obsolete. If the version of the hash table entry is smaller than the global minimum read snapshot number, which means all readers can see the newer version, Perseid further marks the entry as obsolete and decreases the counter of the entry in the volatile hash table by one (Lines 9–13). For example, when key a is checked with an obsolete version v1 at t2 in Figure 5, the result is false, and then the counter is decreased from 3 to 2. If the counter is decreased to 1, which means all obsolete entries have been marked, the entry is removed from the volatile hash table to restrict the hash table size (Lines 14–16). For example, when key a is checked with an obsolete version v2 at t3 in Figure 5, the counter is decreased to one, the validation returns false, and the entry is removed. We describe other corner cases regarding snapshots in Section 4.3.3.
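Putting these rules together, a simplified version of Algorithm 5 might look as follows (assumed names, a volatile stand-in, and a stubbed primary-table fallback; snapshot bookkeeping is reduced to two sequence numbers):

```cpp
// A minimal sketch of validation: returns whether a secondary-index entry with
// sequence number `entry_sqn` is still the latest version of `pkey` for a
// reader at `snapshot_sqn`.
#include <cstdint>
#include <unordered_map>

struct VVal { uint64_t sqn; uint32_t counter; };
using HashTable = std::unordered_map<uint64_t, VVal>;

bool validate_by_primary_table(uint64_t, uint64_t) { return false; }  // placeholder stub

bool validate(HashTable& dram_ht, uint64_t pkey, uint64_t entry_sqn,
              uint64_t snapshot_sqn, uint64_t min_read_snapshot, bool& mark_obsolete) {
    mark_obsolete = false;
    auto it = dram_ht.find(pkey);
    if (it == dram_ht.end() || it->second.sqn == entry_sqn)
        return true;                                    // no obsolete versions, or entry is the latest
    if (it->second.sqn > snapshot_sqn)                  // a concurrent writer raced ahead:
        return validate_by_primary_table(pkey, entry_sqn);  // fall back to the primary table
    if (it->second.sqn <= min_read_snapshot) {          // every reader already sees the newer version
        mark_obsolete = true;                           // caller marks the PKey Entry as obsolete
        if (--it->second.counter == 1)                  // only the latest entry remains unmarked
            dram_ht.erase(it);                          // shrink the volatile table
    }
    return false;                                       // the entry is obsolete for this reader
}
```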
During validation for secondary index queries, Perseid only operates with the volatile validation hash table. Thus, the validation overhead is quite small.
Garbage Collection. During the PKey Page split, entries that are not marked as obsolete are also validated to remove obsolete entries (Section 4.2.2). Since this step physically removes obsolete entries, Perseid decreases the corresponding counters in the persistent hash table. If a counter is decreased to one, Perseid removes the corresponding hash pairs from the volatile hash table.
Recovery. When the system restarts from a crash or a normal shutdown, the volatile hash table needs to be recovered. Perseid iterates the whole persistent hash table and inserts primary keys whose counter is greater than one into the volatile hash table. Now the counters in the volatile hash table are numbers of physically existing entries, which may be larger than the actual numbers of logically existing entries. Therefore, some false positive primary keys may exist in the volatile hash table. However, this does not affect the validation accuracy and these primary keys can be removed by garbage collection.
4.3.3 Together with PS-Tree.
Each write operation in the LSM-based storage system starts with getting a monotonically increased sequence number (SQN). After writing the write-ahead-log (WAL) and inserting new records to the MemTable of the LSM primary table, Perseid inserts the PKey Entry to the PS-Tree and upserts the validation hash tables before committing the write operation.
Each query operation first gets the latest committed snapshot number. Then it searches the PS-Tree with the secondary key and validates the returned PKey Entries against this snapshot.
A rare scenario is that the volatile hash table reports a new sequence number larger than the current reader’s snapshot number, which means a concurrent writer has updated this primary key. In this case, Perseid cannot directly confirm whether this entry is still valid in this snapshot, since there may exist a version newer than the entry and valid in the snapshot, so Perseid has to validate it by the primary table (Lines 5–7 in Algorithm 5). The opposite scenario, where the volatile hash table reports an older version than the requested PKey Entry, cannot happen: Perseid commits a write operation only after it has inserted the new secondary entry in PS-Tree and upserted the validation hash tables, so any PKey Entry visible to a reader is already reflected in the hash tables.
4.4 Non-Index-Only Query Optimizations
Though Perseid significantly reduces the overhead of secondary indexing, the overhead of non-index-only queries (requiring full records) is still dominated by the LSM-based primary table. Thus, Perseid further introduces two optimizations for non-index-only queries.
4.4.1 Locating Components with Sequence Number.
A secondary index query operation may need to search the primary LSM table multiple times for all its associated records. LSM-trees have mediocre read performance due to the multi-level structure. Besides device I/Os, even when data is cached in memory or resides on fast storage devices, LSM-trees have non-negligible overheads on probing components (i.e., indexing and checking Bloom filters) [21, 25, 71]. Since LSM-based KV stores usually employ Bloom filters for each data block [28, 30], the indexing overhead includes indexing not only SSTables but also data blocks. Moreover, the read performance gets worse with the tiering compaction strategy since more components (SSTables) need to be checked and read.
Nevertheless, we find that many components are unnecessary to probe in searching processes issued from the secondary index. Previous work uses zone maps, which store the minimum and maximum values of an attribute, to skip irrelevant data blocks or components during searching [11, 12, 54]. We found that this technique can also be used by secondary indexes to search the primary table. Since we have already recorded the sequence numbers of primary keys in the secondary index, the sequence number can be used as an additional attribute to skip irrelevant components. Perseid builds a zone map that records a sequence number range (i.e., the minimum and maximum sequence numbers of records) for each component (including MemTables).
Moreover, as shown in Figure 6, since tiering compaction merges SSTables from the lower level (\(L_n\)) to generate new SSTables in the higher level (\(L_{n+1}\)) and does not rewrite other SSTables in the higher level (except for the last level), for a range partition, the sequence number ranges of different levels and even different sorted runs are strictly divided. For primary tables adopting the tiering strategy, with the primary key to search SSTables horizontally and the additional sequence number to search sorted runs vertically, Perseid can locate the exact component that contains the record directly. Besides, since Perseid has already validated the version, the record must exist in the located component, so Perseid can further skip the Bloom filter checking. Thus, the indexing overheads are greatly reduced and overheads on checking Bloom filters are almost eliminated.
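A minimal sketch of the sequence-number zone map (assumed structures; SSTable handles and per-level organization are omitted) shows how a lookup issued from the secondary index narrows the components to probe:

```cpp
// Each component (MemTable or sorted run) records the min/max sequence numbers
// of its records; a lookup with an already-known sequence number only probes
// the components whose range covers it.
#include <cstdint>
#include <vector>

struct Component { uint64_t min_sqn, max_sqn; /* SSTable / MemTable handle omitted */ };

// Returns the indices of components that must be probed for a record whose
// sequence number is `sqn`; under tiering this typically narrows to one.
std::vector<size_t> components_to_probe(const std::vector<Component>& comps, uint64_t sqn) {
    std::vector<size_t> hits;
    for (size_t i = 0; i < comps.size(); ++i)
        if (sqn >= comps[i].min_sqn && sqn <= comps[i].max_sqn)
            hits.push_back(i);
    return hits;  // Bloom filter checks can be skipped when a single component remains
}
```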
This optimization is less effective with the leveling strategy. The sequence number ranges in different levels may overlap because compaction rewrites SSTables in higher levels with blended sequence numbers from lower levels. However, since most LSM-based KV stores adopt the tiering strategy on \(L_0\) at least [28, 30], this optimization is still effective to some extent.
4.4.2 Parallel Primary Table Searching.
A single secondary key usually has multiple associated primary keys, and queries on these primary keys are independent. Therefore, using multiple threads to accelerate primary table searching is a natural optimization. One naive approach is to assign primary keys to threads equally (e.g., in a round-robin fashion as shown in Figure 7(a)). However, point lookups on LSM-trees may have a large latency gap, since some KV pairs can be fetched from the MemTable or block cache directly while others reside at a relatively high level and need several disk I/Os due to Bloom filter false positives. It cannot be known in advance how much time each point lookup will take. Therefore, the naive approach may result in a load imbalance among parallel threads, where some threads finish their tasks and become idle while others are still stuck on slow lookups even though unfinished tasks remain.
To relieve this issue, we apply a worker-active scheme as shown in Figure 7(b). Perseid publishes primary keys into a lock-free shared queue as tasks, and each parallel worker thread fetches one task from the queue. An element in the shared queue is a required primary key and the corresponding sequence number. When a worker thread finishes the current task, it tries to fetch another task from the queue. In this way, though each thread may perform a different number of tasks, parallel threads are utilized more adequately and latencies of query requests are further reduced.
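The worker-active scheme can be sketched as follows (assumed task format; the real implementation uses a lock-free queue, which is approximated here by an atomic index over a pre-published task array):

```cpp
// Workers claim the next (pkey, sqn) task with an atomic counter instead of a
// fixed round-robin assignment, so slow lookups do not leave other workers idle.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct Task { uint64_t pkey, sqn; };

void parallel_search(const std::vector<Task>& tasks, int num_workers,
                     void (*lookup_primary_table)(const Task&)) {
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w)
        workers.emplace_back([&] {
            for (size_t i = next.fetch_add(1); i < tasks.size(); i = next.fetch_add(1))
                lookup_primary_table(tasks[i]);   // point lookup guided by the zone map
        });
    for (auto& t : workers) t.join();
}
```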
5 EVALUATION
In this section, we evaluate Perseid against existing PM-based indexes with naive approaches and state-of-the-art LSM-based secondary indexing techniques [47, 54]. After describing the experimental setup (Section 5.1), we evaluate these secondary indexing mechanisms with micro benchmarks to show their performance on basic operations (Section 5.2). Then, we evaluate these systems’ overall performance with mixed workloads (Section 5.3) and recovery time (Section 5.4).
5.1 Experimental Setup
Platform. Our experiments are conducted on a server with an 18-core Intel Xeon Gold 5220 CPU, which runs Ubuntu 20.04 LTS with Linux 5.4. The system is equipped with 64 GB DRAM, two 128 GB Intel Optane DC Persistent Memory in AppDirect mode, and a 480 GB Intel Optane 905P SSD.
Implementation.
Perseid can leverage any existing state-of-the-art PM-based index as the SKey Layer of PS-Tree.
For the hybrid PM-DRAM validation hash table, depending on the different usages of two hash tables, we deploy CLHT [23] as the volatile hash table, and CCEH [52] as the persistent hash table. CLHT is a cache-friendly hash table providing high search performance. CCEH is an extendible hash table optimized for PM that achieves high insert performance by mitigating rehashing overhead.
Compared Systems.
We compare Perseid against the two original PM-based indexes (FAST&FAIR and P-Masstree), and LSM-based secondary index with validation strategy (denoted as
Workloads. Since common benchmarks for KV stores such as YCSB [19] do not have operations on secondary indexes, as in previous work [42, 47, 54], we implemented a secondary index workload generator based on an open-source twitter-like workload generator [3] for evaluation. With this generator, we generate several microbenchmark workloads and mixed workloads. The primary key (e.g., ID) and secondary key (e.g., UserID) are randomly generated 64-bit integers. The key space of primary keys and secondary keys is 100 million and 4 million, respectively. Thus the average number of records per secondary key is about 25. The size of each record is 1KB.
KV Store Configurations. For the primary table, according to the configuration tuning guide [29], the MemTable size is set to 64 MB and the Bloom filters are set to 10 bits per key. As our workloads generate a primary table larger than 100 GB, we set a 16-GB block cache for the primary table and a 1-GB block cache for the LSM-based secondary index. Compression is turned off to reduce other influencing factors.
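For reference, such a configuration could be expressed with the RocksDB C++ API roughly as below; this is only a sketch of the described settings, not the authors' exact setup (the column-family setup for the LSM-based secondary index is omitted).

```cpp
// A sketch of the described primary-table configuration: 64 MB MemTables,
// 10-bits-per-key Bloom filters, a 16 GB block cache, and compression disabled.
#include <string>
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::DB* open_primary_table(const std::string& path) {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.write_buffer_size = 64ull << 20;               // 64 MB MemTable
    options.compression = rocksdb::kNoCompression;         // compression off

    rocksdb::BlockBasedTableOptions table_options;
    table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));  // 10 bits/key
    table_options.block_cache = rocksdb::NewLRUCache(16ull << 30);         // 16 GB block cache
    options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
    return s.ok() ? db : nullptr;
}
```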
5.2 Microbenchmarks
In this section, we evaluate the basic single-threaded performance and scalability of compared secondary indexing mechanisms.
5.2.1 Insert and Update.
The Insert workload (i.e., no updates) has 100 million unique records. Figure 8(a) shows the average latency of insert operations of each secondary index.
Perseid performs about 10–38% faster than the corresponding composite indexes, but 25% slower than the ideal log-structured approach without garbage collection, due to the page split overhead in PS-Tree.
The upsert workloads contain 100 million insert operations and 100 million update operations. Operations are shuffled to avoid all newer entries being valid in secondary indexes.
In the Uniform workload (Figure 8(b)), both primary keys and secondary keys follow a uniform distribution. In the Skewed-Pri workload (Figure 8(c)), primary keys follow a Zipfian distribution with the skewness parameter 0.99, and secondary keys are selected randomly. In the Skewed-Sec workload (Figure 8(d)), secondary keys follow a Zipfian distribution (parameter 0.99), and primary keys are uniform. Thus, hot secondary keys have lots of associated primary keys, which represent low-cardinality columns.
Among the validation-based secondary indexes, composite indexes perform even worse under upsert workloads. This is because, with additional upsert operations, composite indexes have more KV pairs and larger tree heights. By contrast,
Figure 9 shows the normalized memory usage of the persistent hash table (PM-HT) and the volatile hash table (DRAM-HT) of Perseid after each upsert workload. For a fair comparison, we evaluate the memory usage of PM-HT with the same hashing structure (CLHT) as DRAM-HT. The PM-HT stores all 100 million primary keys with their latest sequence numbers, so it contains about 46 million hashing buckets including linked collision buckets, which occupies about 2.7 GiB memory. By contrast, since the DRAM-HT only stores versions for primary keys that have been updated (Section 4.3), it has a smaller memory footprint than the PM-HT. Specifically, the DRAM-HT is empty because there are no updates in the Insert workload. Besides, Perseid reduces the memory usage of the DRAM-HT to 37.8%, 10.4%, and 77.3% of the whole PM-HT in Uniform, Skewed-Pri, and Skewed-Sec, respectively. Though in the Uniform and Skewed-Sec workloads, most primary keys have been updated,
5.2.2 Query.
In this experiment, we evaluate the performance of index-only queries after loading the insert workload or upsert workloads. Index-only query reflects the performance of a secondary index itself and is a common query technique (i.e., covering index [6, 8]) to avoid looking up the primary table. We show two different selectivities by specifying limit N (10 and 200) on return results. The most recent and valid N entries are returned. For limit of 200, the actual average number of returned entries per query is 25 and 142 for the Skewed-Pri and Skewed-Sec, respectively.
Figure 10 shows the results of index-only query performance. From the results, we have the following observations.
First, PM-based indexes have significantly lower latencies than LSM-based secondary indexes. Putting
Second, Perseid outperforms existing PM-based indexes with the composite index and the log-structured approach by up to 4.5\(\times\) and 4.3\(\times\), respectively. The log-structured approach has poor locality since relevant values are scattered across the whole log and require multiple random accesses to fetch them all. Composite indexes are inferior due to the larger number of KV pairs in the indexes and range-scan operations as we analyzed in Section 3. They are especially inefficient under the Skewed-Sec workload with a large limit (e.g., 200), where they fetch a large number of entries and fail to enjoy the cache effect. By contrast, the performance of Perseid is much more stable across different workloads, owing to the locality-aware design of
Third, under upsert workloads, all systems need to validate more primary keys to exclude obsolete entries, which contributes to the higher overheads than under insert workloads. For
Figure 11 demonstrates the necessity and the benefit of the volatile hash table of Perseid. Directly validating multiple primary keys on the persistent hash table (PM-HT) has a large overhead, since it requires multiple random accesses on PM. This prominent overhead can overshadow the advantage of PS-Tree.
5.2.3 Range Query.
In the following experiments, we show results of the LSM-based secondary index on PM (
The results are shown in Figure 12. Range queries need to search more KV pairs from ten different secondary keys, showing a more pronounced difference between these secondary indexes than low-limit query operations. Perseid outperforms
5.2.4 Multi-Threaded Performance.
Figure 13 shows the multi-threaded performance of compared secondary indexes. We take the results of Skewed-Pri and Skewed-Sec workloads as representatives. For Skewed-Sec, we show the result with the limit of 200, and the result with the limit of 10 is similar to that of Skewed-Pri. For upsert operations, Perseid scales up to 24 threads, achieving 2.8\(\times\) and 16\(\times\) the upsert throughput of the composite P-Masstree and
5.2.5 Non-Index-Only Query.
We next evaluate the non-index-only query operations. Besides the basic compared secondary indexes, we also enhance them by applying the two optimizations (Section 4.4), sequence number zone map (+SEQ), naive parallel primary table searching (+PAR), and worker-active parallel table searching (+PAR-WA) sequentially. In this experiment, we use 4 threads for parallel primary table searching. Figure 14 shows the performance and time breakdown of non-index-only query operations. Note that the breakdown of primary table time on +PAR only shows the time not covered by the secondary index and validation. Perseid brings considerable improvements against the
Though the primary key index indeed reduces unnecessary point lookup operations on the primary table for
Perseid’s optimizations on primary table searching can also boost the other compared secondary indexes. The zone map improves the overall query performance of the KV store with Perseid by about 50%, and the worker-active parallel primary table searching further improves it by up to 3.1\(\times\). The worker-active parallel searching exceeds naive parallel searching by up to 30% for Perseid. This effect is more evident when the limit of return results is small, as the load imbalance among multiple parallel worker threads is more prominent. However, the numbers are only 20–36% and up to 2.4\(\times\) for
We also implement secondary indexes and conduct the experiments on a leveling-based LSM primary table (LevelDB [30]). Figure 15 shows the results of Skewed-Sec as an example. The main difference is that the sequence number zone map is less effective on leveling-based LSM primary tables. However, the zone map is still effective when the limit is small, since the latest few records stay in MemTables or SSTables in lower levels like \(L_0\), and these components can be filtered by sequence number with a high probability.
5.3 Mixed Workloads
In this section, we evaluate Perseid, the composite P-Masstree, and
Figure 16 reports the average operation latencies every million operations. At the beginning of the Write-Heavy workload and the Balanced workload, PM-based secondary indexes have a spike in latency, which is mainly caused by seek-driven compaction in the LSM primary table. Perseid outperforms
5.4 Recovery Time
We evaluate the recovery time of Perseid and
6 RELATED WORK
Secondary Indexing in LSM-based KV stores. Qader et al. [54] conduct a comparative study on secondary indexing techniques in LSM-based systems. They summarize and evaluate several common secondary indexing techniques, including the filter-based embedded index, the composite index, and the posting list. DELI [59] proposes an index maintenance approach that defers expensive index repair to compaction of the primary table. Luo et al. [47] propose several techniques for LSM-based secondary indexes, improving data ingestion and query performance. However, their techniques mainly reduce random device I/Os for traditional disk devices, at the cost of more sequential reads. Based on KV separation [45], SineKV [42] keeps both the primary index and secondary indexes pointing to the record values. Thus, secondary index queries can get records directly without searching the primary index. However, SineKV has to discard the blind-write attribute and maintain index consistency synchronously. Cuckoo Index [38] enhances filter-based indexing with a cuckoo filter. However, as a filter-based index, Cuckoo Index does not support range queries.
Though there are many proposed optimizations, LSM-based secondary indexing is not efficient enough due to the nature of LSM-trees. In this work, we revisit the design of the secondary index with PM.
PM-based indexes. There has been plenty of research on high-performance PM indexes [17, 31, 37, 39, 49, 52, 53, 62, 72]. However, these general-purpose indexes are not directly competent for efficient secondary indexing.
Improving LSM-based KV stores with PM. There is a lot of work optimizing LSM-based KV stores with PM. NoveLSM [33] introduces a large mutable MemTable on PM to lower compaction frequency and avoid logging. SLM-DB [32] utilizes a B\(^+\)-Tree on PM to index KV pairs on disks; SSTables on disks are organized in a single level, which reduces the compaction requirements. MatrixKV [69] places level \(L_0\) on PM and adopts fine-granularity and parallel column compaction to reduce write stalls in LSM-trees. Facebook redesigns the block cache on PM to reduce the DRAM usage and thus reduce the total cost of ownership (TCO) [27, 34]. Different from these efforts, this work revisits the secondary indexing for LSM-based KV stores with PM.
7 CONCLUSION
In this article, we revisit secondary indexing in LSM-based storage systems with PM. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems. Perseid overcomes the deficiencies of traditional LSM-based secondary indexing and existing PM-based indexes with naive approaches. Perseid achieves much higher query performance than state-of-the-art LSM-based secondary indexing techniques and existing PM-based indexes without sacrificing the write performance of LSM-based storage systems. The prototype of Perseid is open-source at https://github.com/thustorage/perseid.
ACKNOWLEDGMENTS
We sincerely thank all anonymous reviewers for their valuable comments.
Footnotes
1 For clarity, we use record to refer to a KV pair in the primary table, and entry to refer to a KV pair in a secondary index.
2 Index-only query is a common query technique: Users create a covering index that contains specific columns required by queries to avoid the cost of reading the primary table [6, 8, 51]. A non-index-only query searches the secondary index by secondary key to get primary keys and then retrieves full records from the primary table.
- [1] 2022. Apache Cassandra. Retrieved from https://cassandra.apache.org/Google Scholar
- [2] 2022. Apache Cassandra: How are Indexes Stored And Updated. Retrieved from https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/dml/dmlIndexInternals.htmlGoogle Scholar
- [3] 2022. Chirp: A Twitter-like Workload Generator. Retrieved from http://alumni.cs.ucr.edu/ ameno002/benchmark/Google Scholar
- [4] 2022. Compute Express Link: The Breakthrough CPU-to-Device Interconnect. Retrieved from https://www. computeexpresslink.org/Google Scholar
- [5] 2022. MongoDB. Retrieved from https://www.mongodb.comGoogle Scholar
- [6] 2022. MySQL Glossary for Covering Index. Retrieved from https://dev.mysql.com/doc/refman/8.0/en/glossary.html# glos_covering_indexGoogle Scholar
- [7] 2022. Persistent Memory Development Kit. Retrieved from https://pmem.io/pmdk/Google Scholar
- [8] 2022. PostgreSQL: Documentation: Index-Only Scans and Covering Indexes. Retrieved from https://www. postgresql.org/docs/current/indexes-index-only-scans.htmlGoogle Scholar
- [9] 2022. Samsung Electronics Unveils Far-Reaching, Next-Generation Memory Solutions at Flash Memory Summit 2022. Retrieved from https://news.samsung.com/global/samsung-electronics-unveils-far-reaching-next-generation-memory-solutions-at-flash-memory-summit-2022/Google Scholar
- [10] 2023. MS-SSD—Samsung. Retrieved from https://samsungmsl.com/cmmh/Google Scholar
- [11] . 2014. AsterixDB: A scalable, open source BDMS. Proc. VLDB Endow. 7, 14 (
oct 2014), 1905–1916.DOI: Google ScholarDigital Library - [12] . 2015. LSM-based storage and indexing: An old idea with timely benefits. In Proceedings of the 2nd International ACM Workshop on Managing and Mining Enriched Geo-Spatial Data (Melbourne, VIC, Australia) (
GeoRich’15 ). Association for Computing Machinery, New York, NY, 1–6.DOI: Google ScholarDigital Library - [13] . 2013. LinkBench: A database benchmark based on the facebook social graph. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (New York, New York, USA) (
SIGMOD’13 ). Association for Computing Machinery, New York, NY, 1185–1196.DOI: Google ScholarDigital Library - [14] . 2016. Makalu: Fast recoverable allocation of non-volatile memory. In Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (Amsterdam, Netherlands) (
OOPSLA 2016 ). Association for Computing Machinery, New York, NY, 677–694.DOI: Google ScholarDigital Library - [15] . 2020. Understanding and optimizing persistent memory allocation. In Proceedings of the 2020 ACM SIGPLAN International Symposium on Memory Management (London, UK) (
ISMM 2020 ). Association for Computing Machinery, New York, NY, 60–73.DOI: Google ScholarDigital Library - [16] . 2020. Characterizing, modeling, and benchmarking RocksDB key-value workloads at facebook. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST 20). USENIX Association, Santa Clara, CA, 209–223. Retrieved from https://www.usenix.org/conference/fast20/presentation/cao-zhichaoGoogle ScholarDigital Library
- [17] . 2020. uTree: A persistent B+-tree with low tail latency. Proc. VLDB Endow. 13, 12 (
July 2020), 2634–2648.DOI: Google ScholarDigital Library - [18] . 2020. FlatStore: An efficient log-structured key-value storage engine for persistent memory. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (
ASPLOS’20 ). Association for Computing Machinery, New York, NY, 1077–1091.DOI: Google ScholarDigital Library - [19] . 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing (Indianapolis, Indiana, USA) (
SoCC’10 ). Association for Computing Machinery, New York, NY, 143–154.DOI: Google ScholarDigital Library - [20] . 2013. Spanner: Google’s globally distributed database. ACM Trans. Comput. Syst. 31, 3, Article
8 (aug 2013), 22 pages.DOI: Google ScholarDigital Library - [21] . 2020. From wisckey to bourbon: A learned index for log-structured merge trees. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 155–171. Retrieved from https://www.usenix.org/conference/osdi20/presentation/daiGoogle Scholar
- [22] . 2022. NVAlloc: Rethinking heap metadata management in persistent memory allocators. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Lausanne, Switzerland) (
ASPLOS’22 ). Association for Computing Machinery, New York, NY, 115–127.DOI: Google ScholarDigital Library - [23] . 2015. Asynchronized concurrency: The secret to scaling concurrent search data structures. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (Istanbul, Turkey) (
ASPLOS’15 ). Association for Computing Machinery, New York, NY, 631–644.DOI: Google ScholarDigital Library - [24] . 2017. Monkey: Optimal navigable key-value store. In Proceedings of the 2017 ACM International Conference on Management of Data (Chicago, Illinois, USA) (
SIGMOD’17 ). Association for Computing Machinery, New York, NY, 79–94.DOI: Google ScholarDigital Library - [25] . 2021. Chucky: A succinct cuckoo filter for LSM-tree. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (
SIGMOD’21 ). Association for Computing Machinery, New York, NY, 365–378.DOI: Google ScholarDigital Library - [26] . 2018. Reducing DRAM footprint with NVM in facebook. In Proceedings of the 30th EuroSys Conference (Porto, Portugal) (
EuroSys’18 ). Association for Computing Machinery, New York, NY, Article42 , 13 pages.DOI: Google ScholarDigital Library - [27] . 2018. Reducing DRAM footprint with NVM in facebook. In Proceedings of the 30th EuroSys Conference (Porto, Portugal) (
EuroSys’18 ). Association for Computing Machinery, New York, NY, Article42 , 13 pages.DOI: Google ScholarDigital Library - [28] . 2022. RocksDB. Retrieved from https://rocksdb.org/Google Scholar
- [29] . 2022. RocksDB Tuning Guide. Retrieved from https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-GuideGoogle Scholar
- [30] . 2022. LevelDB. Retrieved from https://github.com/google/leveldbGoogle Scholar
- [31] . 2018. Endurable transient inconsistency in byte-addressable persistent B+-tree. In Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST 18). USENIX Association, Oakland, CA, 187–200. Retrieved from https://www.usenix.org/conference/fast18/presentation/hwangGoogle Scholar
- [32] . 2019. SLM-DB: Single-level key-value store with persistent memory. In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST 19). USENIX Association, Boston, MA, 191–205. Retrieved from https://www.usenix.org/conference/fast19/presentation/kaiyrakhmetGoogle Scholar
- [33] . 2018. Redesigning LSMs for nonvolatile memory with NoveLSM. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC 18). USENIX Association, Boston, MA, 993–1005. Retrieved from https://www.usenix.org/conference/atc18/presentation/kannanGoogle Scholar
- [34] . 2021. Improving performance of flash based key-value stores using storage class memory as a volatile memory extension. In Proceedings of the 2021 USENIX Annual Technical Conference (USENIX ATC 21). USENIX Association, 821–837. Retrieved from https://www.usenix.org/conference/atc21/presentation/kassaGoogle Scholar
- [35] 2022. Power-optimized deployment of key-value stores using storage class memory. ACM Trans. Storage 18, 2, Article 13 (March 2022), 26 pages.
- [36] 2021. Rethink the scan in MVCC databases. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD'21). Association for Computing Machinery, New York, NY, 938–950.
- [37] 2021. PACTree: A high performance persistent range index using PAC guidelines. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles (Virtual Event, Germany) (SOSP'21). Association for Computing Machinery, New York, NY, 424–439.
- [38] 2020. Cuckoo index: A lightweight secondary index structure. Proc. VLDB Endow. 13, 13 (September 2020), 3559–3572.
- [39] 2019. RECIPE: Converting concurrent DRAM indexes to persistent-memory indexes. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP'19). Association for Computing Machinery, New York, NY, 462–477.
- [40] 2019. KVell: The design and implementation of a fast persistent key-value store. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (Huntsville, Ontario, Canada) (SOSP'19). Association for Computing Machinery, New York, NY, 447–461.
- [41] 2020. KVell+: Snapshot isolation without snapshots. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 425–441. Retrieved from https://www.usenix.org/conference/osdi20/presentation/lepers
- [42] 2020. SineKV: Decoupled secondary indexing for LSM-based key-value stores. In Proceedings of the 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). 1112–1122.
- [43] 2020. LB+Trees: Optimizing persistent index performance on 3DXPoint memory. Proc. VLDB Endow. 13, 7 (March 2020), 1078–1090.
- [44] 2020. Dash: Scalable hashing on persistent memory. Proc. VLDB Endow. 13, 10 (April 2020), 1147–1161.
- [45] 2016. WiscKey: Separating keys from values in SSD-conscious storage. In Proceedings of the 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, Santa Clara, CA, 133–148. Retrieved from https://www.usenix.org/conference/fast16/technical-sessions/presentation/lu
- [46] 2017. Octopus: An RDMA-enabled distributed persistent memory file system. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC 17). USENIX Association, Santa Clara, CA, 773–785. Retrieved from https://www.usenix.org/conference/atc17/technical-sessions/presentation/lu
- [47] 2019. Efficient data ingestion and query processing for LSM-based storage systems. Proc. VLDB Endow. 12, 5 (January 2019), 531–543.
- [48] 2020. LSM-based storage techniques: A survey. The VLDB Journal 29, 1 (January 2020), 393–418.
- [49] 2021. ROART: Range-query optimized persistent ART. In Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 21). USENIX Association, 1–16. Retrieved from https://www.usenix.org/conference/fast21/presentation/ma
- [50] 2012. Cache craftiness for fast multicore key-value storage. In Proceedings of the 7th ACM European Conference on Computer Systems (Bern, Switzerland) (EuroSys'12). Association for Computing Machinery, New York, NY, 183–196.
- [51] 2020. MyRocks: LSM-tree database storage engine serving Facebook's social graph. Proc. VLDB Endow. 13, 12 (August 2020), 3217–3230.
- [52] 2019. Write-optimized dynamic hashing for persistent memory. In Proceedings of the 17th USENIX Conference on File and Storage Technologies (FAST 19). USENIX Association, Boston, MA, 31–44. Retrieved from https://www.usenix.org/conference/fast19/presentation/nam
- [53] 2016. FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD'16). Association for Computing Machinery, New York, NY, 371–386.
- [54] 2018. A comparative study of secondary indexing techniques in LSM-based NoSQL databases. In Proceedings of the 2018 International Conference on Management of Data (Houston, TX, USA) (SIGMOD'18). Association for Computing Machinery, New York, NY, 551–566.
- [55] 2017. PebblesDB: Building key-value stores using fragmented log-structured merge trees. In Proceedings of the 26th Symposium on Operating Systems Principles (Shanghai, China) (SOSP'17). Association for Computing Machinery, New York, NY, 497–514.
- [56] 2023. Persistent memory disaggregation for cloud-native relational databases. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (Vancouver, BC, Canada) (ASPLOS 2023). Association for Computing Machinery, New York, NY, 498–512.
- [57] 2014. Log-structured memory for DRAM-based storage. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (Santa Clara, CA) (FAST'14). USENIX Association, 1–16.
- [58] 2020. TH-DPMS: Design and implementation of an RDMA-enabled distributed persistent memory storage system. ACM Trans. Storage 16, 4, Article 24 (October 2020), 31 pages.
- [59] 2015. Deferred lightweight indexing for log-structured key-value stores. In Proceedings of the 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 11–20.
- [60] 2022. Pacman: An efficient compaction approach for log-structured key-value store on persistent memory. In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 773–788. Retrieved from https://www.usenix.org/conference/atc22/presentation/wang-jing
- [61] 2023. Revisiting secondary indexing in LSM-based storage systems with persistent memory. In Proceedings of the 2023 USENIX Annual Technical Conference (USENIX ATC 23). USENIX Association, Boston, MA, 817–832. Retrieved from https://www.usenix.org/conference/atc23/presentation/wang-jing
- [62] 2021. Nap: A black-box approach to NUMA-aware persistent memory indexes. In Proceedings of the 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). USENIX Association, 93–111. Retrieved from https://www.usenix.org/conference/osdi21/presentation/wang-qing
- [63] 2023. Replicating persistent memory key-value stores with efficient RDMA abstraction. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association. Retrieved from https://www.usenix.org/conference/osdi23/presentation/wang-qing
- [64] 2015. LSM-trie: An LSM-tree-based ultra-large key-value store for small data items. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC 15). USENIX Association, Santa Clara, CA, 71–82. Retrieved from https://www.usenix.org/conference/atc15/technical-session/presentation/wu
- [65] 2022. Characterizing the performance of Intel Optane persistent memory: A close look at its on-DIMM buffering. In Proceedings of the Seventeenth European Conference on Computer Systems (Rennes, France) (EuroSys'22). Association for Computing Machinery, New York, NY, 488–505.
- [66] 2023. PetPS: Supporting huge embedding models with persistent memory. Proc. VLDB Endow. 16, 5 (January 2023), 1013–1022.
- [67] 2021. Revisiting the design of LSM-tree based OLTP storage engine with persistent memory. Proc. VLDB Endow. 14, 10 (June 2021), 1872–1885.
- [68] 2020. An empirical guide to the behavior and use of scalable persistent memory. In Proceedings of the 18th USENIX Conference on File and Storage Technologies (FAST 20). USENIX Association, Santa Clara, CA, 169–182. Retrieved from https://www.usenix.org/conference/fast20/presentation/yang
- [69] 2020. MatrixKV: Reducing write stalls and write amplification in LSM-tree based KV stores with matrix container in NVM. In Proceedings of the 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 17–31. Retrieved from https://www.usenix.org/conference/atc20/presentation/yao
- [70] 2020. Order-preserving key compression for in-memory search trees. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (Portland, OR, USA) (SIGMOD'20). Association for Computing Machinery, New York, NY, 1601–1615.
- [71] 2021. ChameleonDB: A key-value store for Optane persistent memory. In Proceedings of the 16th European Conference on Computer Systems (Online Event, United Kingdom) (EuroSys'21). Association for Computing Machinery, New York, NY, 194–209.
- [72] 2019. DPTree: Differential indexing for persistent memory. Proc. VLDB Endow. 13, 4 (December 2019), 421–434.
- [73] 2018. Write-optimized and high-performance hashing index scheme for persistent memory. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA, 461–476. Retrieved from https://www.usenix.org/conference/osdi18/presentation/zuo