
Boosting Cache Performance by Access Time Measurements

Published: 17 February 2023


Abstract

Most modern systems utilize caches to reduce the average data access time and optimize their performance. Recently proposed policies implicitly assume uniform access times, but variable access times naturally appear in domains such as storage, web search, and DNS resolution.

Our work measures the access times for various items and exploits variations in access times as an additional signal for caching algorithms. Using such a signal, we introduce adaptive access time-aware cache policies that consistently improve the average access time compared with the best alternative in diverse workloads. Our adaptive algorithm attains an average access time reduction of up to 46% in storage workloads, up to 16% in web searches, and 8.4% on average when considering all experiments in our study.


1 INTRODUCTION

Caching is a fundamental technique for boosting systems’ performance by exploiting variations in access times between different memory technologies. Caches store a small portion of the data in a faster-than-normal medium. For example, DRAM memory is orders of magnitude faster than HDDs. Accesses to cached items are called cache hits, and accesses to non-cached items are called cache misses. Since cache hits are served by a faster-than-normal medium, caches can reduce the average access time when there are enough cache hits. Luckily, caches are effective in many workloads as these often exhibit predictable patterns that allow for a considerable cache hit probability. Such patterns include recency, where recently accessed items are more likely to be reaccessed, and frequency, where the access distribution of items changes slowly [19, 29]. In practice, we do not know in advance which heuristic would be effective on the current workload, and many policies mix the two heuristics [34, 43] and even autonomously adapt their configuration according to the current workload characteristics [18, 43].

The optimization goal of most caches is to reduce the average access time. In practice, the vast majority of research in cache policies focuses on maximizing the cache hit ratio; that is, they try to maximize the number of cache hits. When the hit and miss times vary, maximizing the hit ratio is not the same as minimizing the average access time. Moreover, nonuniform access times appear naturally in many domains, such as search engines [8] and data storage [13, 21, 41, 44, 48]. In addition, nonuniform application load times appear in mobile apps [36], where large load times may be caused by problems such as low memory. Our work includes an original measurement study that shows that nonuniform access times also appear in domain name systems (DNS). We now provide some insight as to why nonuniform access times are so common. In data storage, variable access times appear when the system employs multiple storage media. For example, a system employing HDDs and SSDs (or even multiple types of HDDs) would experience a variation in the miss time depending on which medium is accessed to satisfy the cache miss, as Table 2 demonstrates. Further, even if we use DRAM memory, modern servers follow the Nonuniform Memory Access (NUMA) architecture, where access times vary between different processing units. Varying access times are also common in DNS, where the Internet topology and the current content of the network’s existing caches create variations in resolution times for different DNS queries. Finally, search engine queries vary in execution time due to many factors such as the location and technology of data sources and the complexity of the queries. Thus, a search engine cache experiences variations in the miss times between the different queries. In other contexts such as client cloud caching, some works balance multiple criteria such as monetary cost, access time, and object size [25].

The literature includes the cost-aware caching paradigm, which works under the assumption that items vary in cost [22, 37]. Such costs can reflect numerous things, including variations in access times, but to the best of our knowledge these algorithms are rarely used in practice. Our work argues that naturally occurring variations in access times justify revisiting such algorithms for practical scenarios. While such works address the right model, their performance leaves a lot to be desired, as most of the literature on caching algorithms ignores them. Thus, rather than inventing ad-hoc algorithms, our work focuses on extending existing algorithms to the cost-aware paradigm. Hopefully, such an endeavor would allow cost-aware algorithms to keep up with the state of the art.

Contribution: Our work introduces a practical framework for measuring access time variations between items without prior knowledge of the system’s configuration. We use the same framework for diverse domains such as DNS queries, search engines, and data storage. We then introduce a cost-aware adaptive algorithm that utilizes access time measurements to improve its decisions. Our evaluation shows that our algorithm provides a competitive access time across all tested workloads. Specifically, in most workloads, it is better than the alternatives, and in the others, its performance is similar to the best alternatives. In contrast, the best alternative varies between the workloads, and there is no other competitive algorithm across the board.

In principle, truly adaptive caches are the holy grail for system engineers as they greatly simplify the adaptation process. Normally, any static configuration would improve performance for some use cases and harm other use cases. Adaptive algorithms learn the workload and adjust their behavior accordingly, so the same configuration works for all use cases.

Roadmap: The rest of the paper is organized as follows: Section 2 positions our work with respect to the existing related work. Section 3 outlines our approach for access time measurements and then gradually redesigns the state-of-the-art approach to use them. Section 4 provides information about the datasets we use for evaluation and explains our scenarios and assumptions. Section 5 introduces our evaluation and gradually builds from evaluating our algorithms to positioning them within the context of the related work. Finally, Section 6 summarizes the results of this work and discusses future directions.


2 RELATED WORK

Cache algorithms follow heuristics to optimize the access time. The best algorithms vary between workloads as different workloads favor different heuristics. Belady’s Clairvoyant policy [9] looks into the future and evicts the entry whose next access is furthest away. The Clairvoyant is a useful upper bound for cache policies’ attainable hit-ratio, but we do not know the future in realistic cases.

The Recency heuristic assumes that recently accessed items are likely to be reaccessed. The Least Recently Used (LRU) [24] policy is the classic example of this heuristic. LRU evicts the least recently used item when admitting a new item to a full cache. The Segmented LRU (SLRU) policy has two fixed-sized LRU segments named Probationary and Protected. SLRU admits new items to the Probationary segment. If one of the items in the Probationary segment is accessed, SLRU moves the item to the Protected segment and evicts the LRU item of that segment.

Alternatively, the Frequency heuristic assumes that the recent access distribution is a good estimator for access likelihood. There are many interpretations of this heuristic [6, 7, 19, 29, 31]. Intuitively, knowing which are the frequent items requires statistics. TinyLFU [19] minimizes the space required to store such metadata by using an approximate sketch algorithm such as the Count Min Sketch [14] or a counting Bloom filter [20]. Each encountered item updates the sketch, and we periodically halve the counters to age them. Alternatively, the Reuse Distance heuristic uses the time between subsequent accesses to the same entry to predict future accesses. Notable examples include Hifi [1], LIRS [28, 35], and FRD [47].

Adaptive caches adjust their behavior to the workload, which allows them to remain competitive in diverse workloads. Such algorithms can potentially eliminate the need for manual optimization by system experts [18]. Hyperbolic caching [11] changes the eviction policy during runtime. Other approaches like Adaptive Replacement Cache (ARC) [43], its recent improvement SHARC [16], and Hill-Climber W-TinyLFU (HC-W-TinyLFU) [18] maintain two caches that reflect the recency and frequency heuristics and vary their size during runtime. The algorithms mentioned above optimize the cache hit ratio, which is the same as optimizing the access time only when the miss times are the same for all items.

Cost-aware cache algorithms [8, 37, 38, 46, 55] are the most similar to our work. Such algorithms attach a fixed cost [8, 27, 37, 38, 46, 55] to each cache entry that can reflect the access time or the bandwidth required to retrieve the item [5]. For example, the work of [27] shows that an L2 computer cache can be improved by incorporating access times into its considerations. The authors suggest a dynamic algorithm that alternates between LRU (which is the default policy of the cache) and a policy that selects the maximum-cost item, whose caching decisions may not coincide with the actual workload because both the frequency and recency heuristics are ignored. In comparison, our work uses a simpler LRU-like algorithm that combines recency and access time within the object rank and does not need to select either the highest-cost item or the LRU item.

For example, GD-Wheel [37] and CAMP [22] combine the popular recency heuristic with the miss access costs which they base on the access time. GD-Wheel targets the scenario where the miss access times are known beforehand, and CAMP uses access time measurements similar to ours but only considers the miss times. In contrast, our approach considers the difference between the hit and miss times. Hyperbolic caching [11] also supports cost-aware caches by multiplying the cost of an item by its computed priority in a similar manner to our approach; we denote this variant as Hyperbolic-CA.

Our estimation of access times on the fly is not unique, but it is likely the most flexible in the literature. For reference, the method of [44] estimates the access time for each disk model and object size, whereas our method uses no such meta-knowledge. Instead, we consider the difference between the hit and miss access times as the benefit of caching an item. Such an approach uses no prior knowledge of the system and makes no assumptions about its operation. For example, our method does not even assume that the cache is faster than the miss times of all items. Since we make no assumptions, our method would work even if the underlying infrastructure changes; e.g., there is no need to reconfigure our algorithms when one changes a datastore from disk to SSD or in-memory storage. While other approaches measure access time, they either measure only the miss time [22], which makes the implicit assumption that the hit access time is negligible, or they use prior knowledge of the distribution of access times. For example, the work of [44] uses meta-knowledge about HDD access times to improve its estimations. While their method may benefit us when the datastores are HDDs, it is tailored to that configuration and is unlikely to work well in other cases (e.g., web searches or DNS accesses). In contrast, our method strives to work reasonably well in all system configurations.

Finally, while almost all the works in the field try to minimize the total cost (or the average access time), the RobinHood [10] policy differs and tries to optimize the tail latency, as in some scenarios the tail latency is more indicative of the users’ experience. Thus, RobinHood caches the items whose miss penalty is the largest (regardless of frequency) and optimizes the tail latency accordingly. While RobinHood is not directly comparable to our approach due to the different optimization criteria, our evaluation shows that our approach does not hurt (and even improves) the tail latency. More interestingly, in some works [26] cache consistency implies a cost for evicting a modified item from the cache; thus, the expensive operation is not to miss or hit the item but to evict it.


3 OUR ACCESS TIME AWARE POLICIES

Our goal is to enhance the Adaptive algorithm of [18], which is implemented in the Caffeine [42] high-performance caching library and is used by numerous open-source projects such as Cassandra [3], Apache Solr [4], Redisson [49], Dropwizard [15], and more. The algorithm of [18] is composed of two caches: a TinyLFU [19] managed cache for frequent items and an LRU-managed cache for recent items. During run-time, the algorithm dynamically sizes these caches to search for a better configuration. The work of [18] shows that it is competitive on a wide variety of traces. Thus, we enhance an already successful algorithm with access time as a new source of information.

Our work starts by extending the TinyLFU [19] admission policy to include access time variations within its decisions, in a policy that we name Cost Aware TinyLFU (CATinyLFU). Next, we revisit the LRU policy used within [18] to exploit recency-biased traces so that it also includes access times within its decisions. Our policy, Cost and Recency Aware (CRA), retains most of the basic properties of LRU but can exploit variations in access times within its decisions. Equipped with these building blocks, we revisit the adaptation mechanisms of [18] to optimize the average access time (rather than the hit ratio). Since Caffeine implements numerous adaptation techniques, we end up with an extension for each adaptation technique. These are named HCA-W-CATinyLFU, HCN-W-CATinyLFU, and HCS-W-CATinyLFU, and each such policy varies in the specific adaptation mechanism used to dynamically change the sizes of the CRA and CATinyLFU caches.

While our work introduces numerous algorithms, they all use a common framework that allows the cache to time hit and miss accesses to infer the overheads associated with each item. Once that information is given, we use similar methods to factor this information into the decisions of each algorithm. We explain the measurement process in Section 3.1, while Section 3.2 explains our frequency-based admission policy. Next, Section 3.3 explains the challenges of forming a recency-based eviction policy and the design rationale of CRA. Section 3.4 ties everything together and explains the adaptations required by the adaptive policies.

3.1 Access Time Measurements

Our caches collect information about the hit and miss times and incorporate this information into their decision making. Here, we provide details about this process. We extend each cache entry e with two fields: \(e_{mt}\), denoting e’s miss time, and \(e_{ht}\), denoting e’s hit time. Updating \(e_{mt}\) is straightforward. We admit new items to the cache following a cache miss and thus set \(e_{mt}\) to the time it takes us to handle the first request. When we first admit an item to the cache, we do not yet know its hit time. Therefore, we use the average cache hit time (over all items) as the initial value of \(e_{ht}\). Upon a subsequent request to e, we update \(e_{ht}\) with the measured hit time of e. For ease of reference, the notations used throughout this work are summarized in Table 1.

Table 1.
Notation | Description
M | size of cache in entries
\(e_k\) | key of cache entry e
\(e_{mt}\) | miss access time of cache entry e
\(e_{ht}\) | hit access time of cache entry e
\(\Delta _e\) | access times delta of cache entry e
\(\Delta _{max}\) | running mean of maximum access times delta
\(T_{max}\) | normalization factor for CRA
\(T_{min}\) | normalization bias for CRA
\(fe_e\) | frequency estimation of cache entry e
\(re_e\) | numerical recency estimation of cache entry e
size | current size of the cache in entries
q | max number of LRU lists for CRA
\(s_{e}\) | score of entry e
k | control parameter for recency based benefit
\(rc_e\) | request count of cache entry e
RC | current request count of cache
\(RC_{max}\) | request count threshold for reset operation
L | list of LRU lists used by CRA
\(L_{Active}\) | set of indices for currently active lists used by CRA
S | current number of \(\Delta _{e}\) samples for \(\Delta _{max}\)

Table 1. Notations

Algorithm 1 provides pseudocode for this process. Specifically, we record the start and end times of each request, which allows us to record the hit time in case of a cache hit (Line 5) and the miss time in case of a miss (Line 9). When we first admit an item to the cache, we do not yet know its hit time. Thus, we use the function estimateHitAccessTime (Line 10), which returns the average time for finding an item in the cache. Knowing the hit time is crucial as caches with different storage technologies should behave differently. For example, a memory-based NoSQL database may find an item beneficial for caching as its miss time involves accessing an SSD drive. However, an HDD-based cache may choose never to cache that item as SSDs are faster than HDDs. Our approach requires storing two timestamps for each cached entry, amounting to 8 bytes per cached entry. For reference, any list-based policy (e.g., LRU, SLRU) uses at least two pointers per item (or 16 bytes). Thus, we conclude that the metadata requirement is not excessive.
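To make this measurement flow concrete, the following Java sketch wraps a cache lookup with timing and keeps the per-entry hit and miss times. It is a minimal illustration of the process described above, not Caffeine’s implementation; the class and method names (AccessTimer, estimateHitAccessTime) are ours, and the admission and eviction logic is intentionally omitted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch of the measurement process (Algorithm 1), not Caffeine's actual API.
final class AccessTimer<K, V> {
    static final class EntryTiming {
        long missTimeNanos;   // e_mt: time of the miss that admitted the entry
        long hitTimeNanos;    // e_ht: most recently measured hit time
    }

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Map<K, EntryTiming> timings = new ConcurrentHashMap<>();
    private long hitTimeSumNanos = 0;   // used by estimateHitAccessTime()
    private long hitCount = 0;

    V get(K key, Function<K, V> loader) {
        long start = System.nanoTime();
        V value = cache.get(key);
        if (value != null) {                       // cache hit: record the measured hit time
            long hitTime = System.nanoTime() - start;
            timings.get(key).hitTimeNanos = hitTime;
            hitTimeSumNanos += hitTime;
            hitCount++;
            return value;
        }
        value = loader.apply(key);                 // cache miss: fetch from the backing store
        long missTime = System.nanoTime() - start;
        EntryTiming t = new EntryTiming();
        t.missTimeNanos = missTime;
        t.hitTimeNanos = estimateHitAccessTime();  // hit time unknown yet; use the average hit time
        timings.put(key, t);
        cache.put(key, value);                     // admission/eviction decisions omitted here
        return value;
    }

    // Average hit time over all items, used as the initial e_ht of a new entry.
    private long estimateHitAccessTime() {
        return hitCount == 0 ? 0 : hitTimeSumNanos / hitCount;
    }
}
```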

3.2 Access Time Aware Admission Policy

We now present our Cost Aware TinyLFU (CATinyLFU) admission policy. Admission policies decide if an accessed item should enter the cache, while eviction policies decide which item will be evicted from the cache to accommodate the new item. Figure 1 visualizes how admission and eviction policies interact; upon a cache miss, the eviction policy selects a cache victim, and the admission policy decides if the cache would benefit from admitting the new item at the expense of the cache victim. The work of [19] provides the TinyLFU policy that estimates the recent frequency of the cache victim and the new item. The policy admits the new item if its frequency is higher than the cache victim’s frequency. We follow a similar design but also incorporate the access times in this decision.

Fig. 1.

Fig. 1. An illustration explaining how admission and eviction policies can be composed together to form cache policies.

Fig. 2.

Fig. 2. An example run of CRA, starting from a state (2(a)) and processing a queue of requests. We take \(q=10,k=1,M=4,RC_{max}=20\). Thus, we can have up to 10 active lists. In the current cache state we have \(T_{max}=1000\) and \(T_{min}=10\); note that these keep their values throughout the illustrated run. In (2(b)), a blue request is processed, and since it is already in the list, it moves from being LRU to MRU in \(L[0]\). Since it causes RC to reach 20, the next request processed will result in halving RC and all \(rc_e\). In (2(c)), a red request is processed, and since it is already in the correct list, as \(\lfloor \frac{\Delta _{red}-T_{min}}{T_{max}}\cdot q \rfloor =\lfloor \frac{100}{1000}\rfloor =0\), it only becomes the MRU of \(L[0]\). Since \(RC\ge RC_{max}\), we halve its value and add one, so \(RC=11\) and thus \(rc_e=11\). In (2(d)), the orange request is processed, and since it is not in the cache, we get a miss, resulting in the eviction of the green item. Since \(\Delta _{orange} = 135\), it goes in \(L[1]\) and \(L[5]\) is removed. In (2(e)), a yellow request is processed, and since it is already in the list, it moves from being LRU to MRU in \(L[1]\). In (2(f)), a purple request is processed, and since it is not in the cache, we get a miss, resulting in the eviction of the blue item. Since \(\Delta _{purple} = 750\), it fits in the list at index 7, so we add a new list \(L[7]\) to hold the purple item.

Fig. 3.

Fig. 3. W-CATinyLFU scheme: Items are first admitted to the CRA Window cache, then the victim of the Window cache is offered to the Main cache, which uses SCRA as its eviction policy and CATinyLFU as an admission policy.

Fig. 4.

Fig. 4. Segmented CRA (SCRA): New items first enter the protected segment, then the victim of the protected segment is moved to the probation segment, and the victim of that segment is evicted from the cache. When an item stored in the probation segment is accessed, we move that item to the protected segment.

Fig. 5.

Fig. 5. Access time aware hill climber W-CATinyLFU schemes: The collection stage (a) reports to the climber the hit/miss time for each request. Once per n requests, we reach the decision stage (b). The climber decides the step size and direction for the change in the window size. The climber picks an initial direction and computes a step size for the first decision. It compares the average access time for the current n samples against the previous average access time for every decision. Upon an improvement, it continues in the same direction with a step size computed in the current decision stage, and otherwise, it flips direction or holds in place, i.e., step size of 0. At the end of the decision stage (b), the current average access time becomes the previous average access time.

Our Cost Aware TinyLFU (CATinyLFU) policy uses TinyLFU for frequency estimation without a change. However, our goal is to reduce the average access time rather than to optimize the hit ratio. If we evict an entry e from the cache, future accesses to e will result in cache misses. Therefore, for each such access we lose \(\Delta _e = e_{mt}-e_{ht}\) time. Similarly, when we admit a new item \(e^{\prime }\) to the cache, subsequent accesses to that item will result in cache hits instead of cache misses. Therefore, the benefit of each future access is \(\Delta _{e^{\prime }}\). We denote TinyLFU’s frequency estimation by \(fe_e\) and determine the score as \(s_e := fe_e\cdot \Delta _e\). That is, CATinyLFU admits entry \(e^{\prime }\) to the cache and evicts entry e if \(s_{e^{\prime }}\gt s_{e}\). Intuitively, this means that we may admit a newly arriving item even if it is less frequent than the cache victim, if the difference in the per-hit benefit (\(\Delta _e\)) justifies it. Algorithm 2 provides pseudocode for CATinyLFU. The difference from TinyLFU manifests on Lines 3 and 4, where we multiply the frequency estimations of the candidate and the victim by the corresponding per-hit benefits.
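The admission rule itself is a one-line comparison once the frequency estimate and the per-hit benefit are available. The sketch below assumes a hypothetical FrequencySketch interface standing in for TinyLFU’s counters; it mirrors the comparison \(s_{e^{\prime }}\gt s_{e}\) rather than reproducing Algorithm 2 verbatim.

```java
// Sketch of the CATinyLFU admission rule: admit the candidate only if its
// frequency-weighted benefit exceeds that of the cache victim.
interface FrequencySketch<K> {
    int estimate(K key);   // hypothetical: recent frequency estimate (e.g., via a Count-Min Sketch)
}

final class CATinyLFU<K> {
    private final FrequencySketch<K> sketch;

    CATinyLFU(FrequencySketch<K> sketch) {
        this.sketch = sketch;
    }

    /**
     * @param candidate      newly arriving key
     * @param candidateDelta Delta_e' = missTime - hitTime of the candidate
     * @param victim         key chosen for eviction by the eviction policy
     * @param victimDelta    Delta_e = missTime - hitTime of the victim
     * @return true if the candidate should replace the victim
     */
    boolean admit(K candidate, double candidateDelta, K victim, double victimDelta) {
        double candidateScore = sketch.estimate(candidate) * candidateDelta; // s_e' = fe_e' * Delta_e'
        double victimScore = sketch.estimate(victim) * victimDelta;          // s_e  = fe_e  * Delta_e
        return candidateScore > victimScore;
    }
}
```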

3.3 Access Time Aware Recency based Eviction

Our next step is to introduce access time awareness to recency-based cache policies such as the Least Recently Used (LRU) policy. LRU is arguably the most utilized cache policy in practice, and it is a building block in numerous cache policies such as SLRU policy [30] and ARC [43] which are composed of multiple LRU caches, as well as in W-TinyLFU [19] and HCS-TinyLFU [19] that use both LRU and LFU caches in their operation. Thus, creating an access time-aware recency-based policy allows us to upgrade many policies at once. LRU uses a single doubly-linked list that orders all the cached items according to their last access time. The cache victim is always at the tail of the list, which is the least recently used item. An accessed item is moved to the head of the list to keep the list ordered. LRU operates in constant time and is simple to implement.

Developing a recency-based score: We follow the same design as in CATinyLFU. We want to assign items with scores that reflect their access time benefits and evict the minimally scored item. However, LRU does not assign explicit scores to items. It merely orders the items according to their last access time. Artificially forcing our terminology on LRU, we say that LRU only determines the order of scores but not their numerical values.

Thus, we need to assign some recency-based numerical estimation to follow the same design pattern used in CATinyLFU. However, there are infinitely many numerical scores that we can use. Therefore, our approach is to select the following family of decay functions: \(re_e = (\frac{1}{RC - rc_e+1})^{k}\). Similar decay functions were found useful to capture recency [34].

Once an entry e is accessed, \(rc_e = RC\) and thus \(re_e=1\), which is its maximal value. \(re_e\) decays as RC grows, and the least recently used item receives the lowest \(re_e\) of all cached entries. The meta-parameter k controls how quickly \(re_e\) decays over time, which affects the balance between cost-awareness and recency.

Once we have determined \(re_e\), we factor access time awareness into the considerations by raising \(\Delta _e\) to the power \(re_e\). That is, the score of an entry is \(s_e = sign(\Delta _e)\cdot (|\Delta _e|)^{re_e}\).

Avoiding overflows: Notice that the score calculation assumes unbounded variables for RC and \(rc_e\). However, as cache systems work indefinitely, overflows are unavoidable. We prevent overflows by periodically halving RC and \(rc_e\) when \(RC\ge RC_{max}\). Such an operation utilizes simple bitwise shift operations, and it keeps the maximal values of RC and \(rc_e\) bounded. E.g., if we halve them once per 10 million operations, their maximal value is 20 million. See Algorithm 3 (Lines 2 and 3) for pseudocode.
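The score computation and the periodic halving can be sketched as follows; the variable names follow Table 1, and the class is an illustration of the formulas above rather than the paper’s Algorithm 3.

```java
// Sketch of CRA's per-entry score: s_e = sign(Delta_e) * |Delta_e|^(re_e),
// where re_e = (1 / (RC - rc_e + 1))^k, with periodic halving to avoid overflow.
final class CraScore {
    private final double k;          // meta-parameter controlling recency decay (k = 1 in the paper)
    private final long rcMax;        // request-count threshold for the halving (reset) operation
    private long requestCount = 0;   // RC: current request count of the cache

    CraScore(double k, long rcMax) {
        this.k = k;
        this.rcMax = rcMax;
    }

    // Called on every request; returns the new rc_e of the accessed entry.
    long onAccess() {
        if (requestCount >= rcMax) {
            requestCount >>= 1;      // halve RC with a bitwise shift
            // all stored rc_e values are halved the same way (omitted in this sketch)
        }
        requestCount++;
        return requestCount;         // rc_e := RC at access time
    }

    double score(double deltaE, long rcE) {
        // Guard against stale rc_e after halving; re_e lies in (0, 1].
        double recency = Math.pow(1.0 / Math.max(1, requestCount - rcE + 1), k);
        return Math.signum(deltaE) * Math.pow(Math.abs(deltaE), recency);   // s_e
    }
}
```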

Baseline approach: The most straightforward eviction policy is to evict the minimally scored item. However, there are a few problems in implementing this approach efficiently. First, notice that accessed items do not necessarily go to the top of the list. Second, the order of items may change between accesses as the value of RC increases. We name this approach Cost Aware LRU (CALRU) and consider it an impractical suggestion since we have no efficient algorithms for finding the cache victim.

Practical policy: We suggest the Cost and Recency Aware (CRA) policy that approximates CALRU within acceptable complexity. The idea of CRA is to maintain multiple LRU lists, which are ordered only according to recency. However, each list contains only entries with similar benefit values (\(\Delta _e\)). The maximum number of lists is q, which controls the similarity between entries of the same list.

CRA’s data structure: The LRU lists are arranged in an array of size q (zero-based index), denoted as L. Each list in L holds entries of similar benefit (\(\Delta _{e}\)), normalized to the range \(T_{min}\) to \(T_{max}\), which are the normalization’s bias and factor and are computed on-the-fly. As seen in Algorithm 3 (Line 29), \(T_{max}\) is periodically set to \(\Delta _{max}\), a running mean of the \(\Delta _{e}\) values that exceed the current \(T_{max}\) (Algorithm 3, Line 16). For the initial \(T_{max}\) and \(\Delta _{max}\) we simply use the \(\Delta _{e}\) of the first request, while the initial \(T_{min}\) is set to 0. To compute \(\Delta _{max}\), we use a sample counter S, which is initialized to 0 and is used to compute the running mean. After collecting 1,000 samples that exceed the current value of \(T_{max}\), we update \(T_{max}\) by setting it to the learnt \(\Delta _{max}\). This method enables us to avoid “squashing” most entries into a single list due to a very high observed \(\Delta _{e}\). We used \(q=10\), which gives us lists corresponding to deciles of the range between \(T_{min}\) and \(T_{max}\) and which we found to be a reasonable compromise between the needs of diverse workloads and the operational complexity. Furthermore, to keep complexity even lower when iterating over the lists, we only use the lists represented in \(L_{Active}\), the set of indices of the currently active lists. As Table 3 shows, for most workloads the average number of active lists was well below q.

Admitting items to CRA: When admitting an item e to CRA, we need to find the LRU list of the corresponding decile (or q-ile in general). To do so, we compute the list’s index \(\ell _{in}\) as follows (Algorithm 3, Line 14): \(\ell _{in} = \lfloor \frac{\Delta _{e}-T_{min}}{T_{max}}\cdot q\rfloor\). In case \(\Delta _{e} \lt 0\), meaning it is faster to fetch the entry from outside the cache, we evict it immediately, as seen in Algorithm 3 (Lines 7–12). Once e is admitted, we may need to perform an eviction if the cache is full; this condition is checked in Algorithm 3 (Line 23).
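The following sketch illustrates the bucketing of entries by benefit and the index computation above. It uses plain Java deques for brevity (a production implementation would use intrusive doubly linked lists), and the clamping of out-of-range indices is our own simplifying choice.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;

// Illustrative CRA insertion path: q LRU lists bucketed by the entry's benefit Delta_e.
final class CraLists<K> {
    final int q;                         // maximum number of LRU lists (q = 10 in the paper)
    final Deque<K>[] lists;              // L: array of LRU lists, index 0..q-1
    final BitSet active = new BitSet();  // L_Active: indices of currently non-empty lists
    double tMin = 0;                     // T_min: normalization bias
    double tMax = 1;                     // T_max: normalization factor (learned on the fly)

    @SuppressWarnings("unchecked")
    CraLists(int q) {
        this.q = q;
        this.lists = new Deque[q];
        for (int i = 0; i < q; i++) {
            lists[i] = new ArrayDeque<>();
        }
    }

    // l_in = floor((Delta_e - T_min) / T_max * q), clamped to the valid index range.
    int listIndex(double deltaE) {
        int idx = (int) Math.floor((deltaE - tMin) / tMax * q);
        return Math.min(Math.max(idx, 0), q - 1);
    }

    // Insert a key as the MRU of the list matching its benefit.
    // Entries with Delta_e < 0 are not worth caching and are handled by the caller.
    void insert(K key, double deltaE) {
        int idx = listIndex(deltaE);
        lists[idx].addFirst(key);        // head = MRU, tail = LRU
        active.set(idx);
    }
}
```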

Determining the cache victim: The cache victim of CRA is one of the LRU items of the active lists. We compute \(s_e\) for each of these LRU items and evict the minimally scored one. Algorithm 4 illustrates the process of finding the victim’s list. The victim’s list is selected on Lines 4–10, the rest of the eviction process is illustrated on Lines 11–14, and if such a removal empties an LRU list, we remove that list from \(L_{Active}\) as well (Lines 15 and 16). For clarity, Figure 2 provides an illustrated example of CRA’s operation. Note that here lies the main difference between our approach and those of GDWheel and CAMP. Specifically, GDWheel and CAMP use the GreedyDual-Size policy [12] for eviction and only use access time to determine an object’s admission location; they break ties using recency only. In contrast, CRA’s evictions seek the item whose benefit is minimal, within complexity limitations.
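Continuing the sketch above, victim selection only inspects the LRU item of each active list, scores it, and evicts the minimally scored one; the score function is supplied by the caller (e.g., the CraScore sketch from earlier in this section).

```java
import java.util.function.ToDoubleFunction;

// Illustrative victim selection for CRA: compare only the LRU items of the active lists.
final class CraVictimSelector {
    // Returns the index of the list whose LRU item has the lowest score s_e,
    // or -1 when every list is empty (nothing to evict).
    static <K> int findVictimList(CraLists<K> cra, ToDoubleFunction<K> score) {
        int victimList = -1;
        double minScore = Double.POSITIVE_INFINITY;
        for (int i = cra.active.nextSetBit(0); i >= 0; i = cra.active.nextSetBit(i + 1)) {
            K lru = cra.lists[i].peekLast();          // tail of the list = least recently used
            double s = score.applyAsDouble(lru);
            if (s < minScore) {
                minScore = s;
                victimList = i;
            }
        }
        return victimList;
    }

    // Evicts the selected victim and deactivates its list if it becomes empty.
    static <K> K evict(CraLists<K> cra, ToDoubleFunction<K> score) {
        int idx = findVictimList(cra, score);
        if (idx < 0) {
            return null;
        }
        K victim = cra.lists[idx].pollLast();
        if (cra.lists[idx].isEmpty()) {
            cra.active.clear(idx);
        }
        return victim;
    }
}
```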

Operation complexity: CRA’s eviction operation iterates over all the active LRU lists to find the victim. Thus, the complexity of an eviction operation is \(O(q)\); however, in most cases, the number of active lists is less than q. Finding the list for insertion is done using a closed formula and has a complexity of \(O(1)\) in the worst case. Thus, the complexity of requesting an object from CRA is \(O(q)=O(10)=O(1)\).

Metadata overheads: CRA requires q pointers, one for the head of each of the underlying lists. Each element in one of the lists stores the \(rc_e\) field and the access time metadata in two integers. Thus, the extra space complexity per cache entry is three integers, amounting to 24 bytes of extra space.

3.4 Adaptive Access Time Aware Algorithms

We now form adaptive cost-aware algorithms by combining CATinyLFU and CRA, which follow the frequency and recency heuristics, respectively. Specifically, we use the recent HCS-W-TinyLFU algorithm of [18] as a baseline. HCS-W-TinyLFU extends the W-TinyLFU policy, which we explain below. W-TinyLFU operates a Window cache following the LRU cache policy and a Main cache that follows the SLRU policy, along with a TinyLFU admission filter. We admit new items to the Window cache and compare its victim to the Main cache’s victim using the TinyLFU admission policy. If the LRU victim is admitted to the Main cache, the Main cache’s victim is evicted. Otherwise, the LRU victim is the cache victim.

We use the W-TinyLFU policy but replace the eviction policy of the Window cache with CRA (instead of LRU) and that of the Main cache with SCRA (Segmented CRA, which contains two CRA structures instead of SLRU). Similarly, we replace the TinyLFU admission policy of the Main cache with our CATinyLFU policy. We denote our adaptation of W-TinyLFU as W-CATinyLFU. Figure 3 illustrates this method, and Figure 4 illustrates SCRA.
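To summarize the composition in Figures 3 and 4, the following sketch traces a request through the Window and Main caches. The BoundedCache interface and its methods are placeholders we introduce for illustration (they are not Caffeine’s API), and CATinyLFU refers to the earlier admission sketch.

```java
// Sketch of the W-CATinyLFU request flow: new items enter the CRA Window cache;
// when the Window overflows, its victim competes with the Main (SCRA) cache's
// victim under the CATinyLFU admission rule.
final class WCATinyLfu<K, V> {
    interface BoundedCache<K, V> {
        V getIfPresent(K key);
        void put(K key, V value);
        boolean isFull();
        K victim();               // the key the eviction policy would evict next
        V remove(K key);          // remove a key and return its value
        double delta(K key);      // Delta_e = missTime - hitTime of a cached key
    }

    private final BoundedCache<K, V> window;   // CRA-managed Window cache
    private final BoundedCache<K, V> main;     // SCRA-managed Main cache
    private final CATinyLFU<K> admission;      // admission filter from Section 3.2

    WCATinyLfu(BoundedCache<K, V> window, BoundedCache<K, V> main, CATinyLFU<K> admission) {
        this.window = window;
        this.main = main;
        this.admission = admission;
    }

    V get(K key) {
        V value = window.getIfPresent(key);
        return (value != null) ? value : main.getIfPresent(key);
    }

    void onMiss(K key, V value) {
        window.put(key, value);                // new items always enter the Window cache first
        if (!window.isFull()) {
            return;
        }
        K windowVictim = window.victim();
        double windowVictimDelta = window.delta(windowVictim);
        V windowVictimValue = window.remove(windowVictim);
        if (!main.isFull()) {
            main.put(windowVictim, windowVictimValue);
            return;
        }
        K mainVictim = main.victim();
        if (admission.admit(windowVictim, windowVictimDelta, mainVictim, main.delta(mainVictim))) {
            main.remove(mainVictim);                    // the Main victim leaves the cache
            main.put(windowVictim, windowVictimValue);  // the Window victim enters the Main cache
        }
        // Otherwise the Window victim is the overall cache victim and is simply dropped.
    }
}
```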

The HCS-W-TinyLFU algorithm extends W-TinyLFU by dynamically sizing the Window and Main caches according to the observed hit ratio. Specifically, it makes a step (e.g., increasing the Window cache). Then, it measures the hit ratio and compares the current hit ratio to the previously observed hit ratio. If the step improved the hit ratio, then HCS-W-TinyLFU makes another step in the same direction (e.g., increasing the Window cache again). Otherwise, it takes a step in the opposite direction. We change this hill climber approach to determine the direction according to the access latency (rather than the hit ratio). The process is illustrated in Figure 5.
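A minimal sketch of the simple, latency-driven climber follows; the fixed 5% step mirrors the climber of [18] described below, while the bookkeeping, the 80% Window cap (Caffeine’s default, see Section 5.5.3), and the names are illustrative.

```java
// Sketch of the simple hill climber, adapted to use average access time instead of hit ratio.
final class LatencyHillClimber {
    private double windowFraction = 0.01;  // fraction of the cache devoted to the Window (CRA) cache
    private double stepSize = 0.05;        // the simple climber always steps by 5% of the cache
    private double previousAat = Double.MAX_VALUE;

    private long sampleCount = 0;
    private double accessTimeSum = 0;

    // Collection stage: record the hit or miss time of every request.
    void record(double accessTime) {
        accessTimeSum += accessTime;
        sampleCount++;
    }

    // Decision stage: called once per n requests.
    void decide() {
        double currentAat = accessTimeSum / sampleCount;
        if (currentAat > previousAat) {
            stepSize = -stepSize;          // the last step hurt the average access time: flip direction
        }
        windowFraction = Math.min(0.8, Math.max(0.0, windowFraction + stepSize));
        previousAat = currentAat;
        accessTimeSum = 0;
        sampleCount = 0;
    }

    double windowFraction() {
        return windowFraction;
    }
}
```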

In the climber algorithm suggested in [18], the step size is always 5%, and the algorithm never converges; instead, it circles the optimal configuration indefinitely. The idea was that such a simplistic approach is sufficient to be competitive on a variety of workloads. However, after the hill climber algorithm was adopted by Caffeine [42], its developers further optimized the climbing algorithms with variable step sizes.

We extend all the hill-climbing algorithms to cost-aware settings: we determine the step direction according to changes in the average access time rather than the hit ratio used in their original implementations. We denote the simple hill climber algorithm from [18] as HCS-W-TinyLFU. In the meantime, Caffeine has implemented other climbers with adaptive step sizes based on heuristics used in gradient descent based learning. When the step size is based on ADAM [32], we name the algorithm HCA-W-TinyLFU, and when it is based on ADAM with Nesterov momentum [45], HCN-W-TinyLFU. While the implementations of these climbers can be found in the Caffeine project, they are not well documented. We compare against them mainly to demonstrate the robustness of our approach to multiple adaptation methods.

Similarly, our adaptive extensions are named HCS-W-CATinyLFU, HCA-W-CATinyLFU, and HCN-W-CATinyLFU, respectively.


4 VARIABLE ACCESS TIME DATASETS

We designed an extensive benchmark for access time-aware cache mechanisms. We did not find enough real traces that contain access times, and therefore, we had to simulate access times when they were missing. Our approach is to keep the workloads as real as possible, extending them with latency information where possible. Since the access time depends on hit/miss and the transient network state, almost all our traces need to be extended with access times.

4.1 Simulated DNS Traces/real Access Times

We created a DNS access workload that includes accesses to the top 1M most visited URLs published by Alexa [23]. We generated concrete traces, where we select the visited URL from a distribution according to the number of accesses estimated by Alexa. Then, we used the Bind [40] open-source DNS resolver to perform DNS lookups for the selected URL and recorded the resolution time. First, we cleared Bind’s cache before accessing the URL to estimate miss times, and then we performed a subsequent access to the entry to assess the hit time. We conducted this study around the clock for several weeks using six Linux-based servers. We include a trace denoted DNS, which contains 100 million accesses with times sampled from the times that we collected at different periods. Our timing results show that Bind’s hit time is quite similar across most DNS entries, except for occasional latency spikes. In contrast, miss times vary and expose cache policies to opportunities they can exploit. Figure 6(a) illustrates the distribution of hit and miss times in our measurement; the relative variance of hit times is one of our main motivations for using both hit and miss times in our benefit computation.

Fig. 6.

Fig. 6. Distribution of DNS hit (and miss) times for the 1 million most popular websites measured over an extended period, and distribution of the miss time of DuckDuckGo for performing the search queries in the AOL dataset. We cap the search time at 10 seconds for technical reasons, but very few searches reach 10 seconds.

4.2 Simulated Web Searches/real Access Times

We extracted the top 1M most searched queries on one of the search engines during 2017–2019 from the dataset of [50]. We used the frequency distribution of these queries to select a query randomly and search that query in the DuckDuckGo search engine [17], measuring the time it takes DuckDuckGo to resolve the query. We used three servers located in Google’s us-central1 zone and measured the resolution time from each server. We constructed the WS workload, containing 20 million search queries measured at different times.

4.3 Real Search Trace/real Access Times

We use a real dataset containing three months of queries to the AOL servers in 2006 [2]. We resolve these queries again according to the DuckDuckGo search engine. The AOL dataset includes \(\approx\)3.6M timed web search queries, and \(\approx\)1.2M unique search phrases. Figure 6(b) shows that there is a large variation in execution time of AOL search queries in the DuckDuckGo search engine.

4.4 Real Storage Trace/real Access Times

We use the I/O trace files from SYSTOR17 [33], which include read requests with both size and response time. Since we assume uniform size, we split each read request into blocks of 4KB and used the response time divided by the number of blocks as the miss time. We set the hit times to 0 since we observed miss times that were very close to 0, and we had no information regarding the actual hit times. This results in the SYSTOR17 workload, a trace of around 415 million real requests with real access times.
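The block splitting can be expressed as a short sketch; the record layout below is a simplified assumption rather than the trace’s exact format.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of splitting a read request into 4KB blocks, dividing the response time
// evenly among them (the hit time is set to 0 for this workload).
final class Systor17Split {
    record BlockAccess(long blockKey, double hitTime, double missTime) {}

    static List<BlockAccess> split(long offsetBytes, long lengthBytes, double responseTime) {
        final long blockSize = 4096;
        long firstBlock = offsetBytes / blockSize;
        long lastBlock = (offsetBytes + lengthBytes - 1) / blockSize;
        long numBlocks = Math.max(1, lastBlock - firstBlock + 1);
        double perBlockMissTime = responseTime / numBlocks;
        List<BlockAccess> accesses = new ArrayList<>();
        for (long b = firstBlock; b <= lastBlock; b++) {
            accesses.add(new BlockAccess(b, 0.0, perBlockMissTime));
        }
        return accesses;
    }
}
```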

4.5 Real Traces/simulated Access Times

Next, we used real access traces and simulated the access latency. Table 2 contains the timing information used in these traces. We determined the access times using the average transfer rates of various HDDs and SSDs as reported by userbenchmark.com. We partitioned the address space between the different devices and selected miss times according to the 4KB transfer rate, rotational speed, and seek time of the device storing the accessed address (a sketch of this computation appears after the trace list below). These traces include:

Table 2.
Hard drive | 4KB Read MB/s (min, average, max) | Seek Time | Rotational Latency
HDD1 | (1.23, 1.64, 1.9) | 6.9ms | 3ms
HDD2 | (0.6, 0.81, 1) | 12.7ms | 4.1ms
HDD3 | (0.4, 0.71, 0.8) | 13ms | 4.1ms
SSD1 | (36.8, 46.4, 52.6) | 35\(\mu\)s | N/A
SSD2 | (18.5, 28.7, 38.9) | 100\(\mu\)s | N/A

Table 2. Hard Drive Stats. The second column is the transfer rate in MB/s for 4KB reads; this information was collected from userbenchmark.com [53]

Table 3.
Workload | Avg. Max Active Lists | Avg. Mean Active Lists
GCC | 3.5 | 2.8
WIKI | 6.0 | 6.0
OLTP2 | 2.0 | 2.0
Gradle | 3.0 | 2.44
WS | 8.0 | 2.28
AOL | 10.0 | 7.51
MULTI1 | 4.88 | 2.54
MULTI2 | 4.0 | 2.78
MAC | 6.0 | 6.0
DNS | 10.0 | 8.85
LINUX | 3.0 | 2.38
SYSTOR17 | 10.0 | 9.03

Table 3. Statistics of Active Lists During Runs of CRA

Gradle: The Gradle trace from [42], provided by the Gradle project, with SSD1/HDD2/HDD3 selected with a ratio of 1:2:3.

GCC: The GCC trace from [51] with SSD1/HDD2 selected with a ratio of 1:2.

MULTI1: The Multi1 trace [28] where SSD1/SSD2/HDD1 are selected with ratios 1:2:5.

MULTI2: The Multi2 trace [28] where SSD1/SSD2/HDD1/HDD2 are selected with ratios 1:2:5:5.

OLTP2: Accesses to the file system of an OLTP server of financial transactions taken from [39] where SSD1/HDD1 are selected with a ratio of 4:5.

LINUX: Memory accesses from a Linux server taken from [54] where SSD1/SSD2/HDD1 are selected with a ratio of 1:1:3.

MAC: Memory accesses of a MacBook laptop taken from [54], where SSD1/SSD2 are selected with a ratio of 1:2.

WIKI: We used a Wikipedia traffic trace containing two months of Wikipedia accesses starting from September 2007 [52]. In this trace, we used a constant hit time of 1ms and randomly selected the miss time from one of the distributions of [37]. We used the following distribution of miss times: 80% - 10ms to 30ms, 15% - 120ms to 180ms, 5% - 350ms to 450ms.
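As referenced above, the following sketch shows one way to derive a simulated miss time from the device parameters in Table 2: seek time plus rotational latency plus the time to transfer a 4KB block. Drawing the transfer rate uniformly between the device’s minimum and maximum rates is our own simplifying assumption for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of a simulated miss time built from the Table 2 device parameters.
enum Device {
    HDD1(1.23, 1.9, 6.9e-3, 3.0e-3),
    HDD2(0.6, 1.0, 12.7e-3, 4.1e-3),
    HDD3(0.4, 0.8, 13.0e-3, 4.1e-3),
    SSD1(36.8, 52.6, 35e-6, 0.0),
    SSD2(18.5, 38.9, 100e-6, 0.0);

    final double minMBps, maxMBps, seekTimeSec, rotationalLatencySec;

    Device(double minMBps, double maxMBps, double seekTimeSec, double rotationalLatencySec) {
        this.minMBps = minMBps;
        this.maxMBps = maxMBps;
        this.seekTimeSec = seekTimeSec;
        this.rotationalLatencySec = rotationalLatencySec;
    }

    // Simulated miss time (in seconds) for a single 4KB read on this device.
    double sampleMissTimeSeconds() {
        double rateMBps = ThreadLocalRandom.current().nextDouble(minMBps, maxMBps);
        double transferSec = (4096.0 / (1024 * 1024)) / rateMBps;  // 4KB at the sampled MB/s
        return seekTimeSec + rotationalLatencySec + transferSec;
    }
}
```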


5 PERFORMANCE EVALUATION

We now perform an extensive evaluation of our access time-aware policies. The section is organized as follows: Section 5.1 explains the evaluation methodology and the performance metrics used, and Section 5.2 evaluates the accuracy of our access time approximations. Next, Section 5.3 evaluates the meta-parameters of our CRA algorithm and selects the values for these parameters that we use throughout the entire evaluation. Next, Section 5.4 evaluates the effectiveness of our access time-aware policies with respect to their access time-oblivious baselines. Finally, Section 5.5 compares our algorithms with existing comparable cost-aware algorithms when the cost is the measured miss time.

5.1 Methodology

We used Caffeine’s simulator [42] for our experiments, including its implementations of all the competitor policies. The simulator processes a given trace, which includes multiple ordered entries indicating the order of item requests. Our workloads’ entries have the following values: key, hit-time, and miss-time. The access time is the hit-time in case of a cache hit and the miss-time otherwise. We give no warm-up periods to the algorithms as the traces are long enough to make these unnecessary. We describe cache sizes as the number of cache entries and chose the cache sizes to show effective configurations for the various workloads. Our evaluation uses the following performance metrics: Hit-ratio, AAT (Average Access Time), and P99-Latency.

The AAT metric sums up the execution time and the access times for hits and misses and divides the total time by the number of accesses in the trace. Thus, intuitively, the metric considers the cache policy’s algorithmic complexity (execution time) along with its quality (hit and miss times). Thus, faster policies have an inherent advantage over complex policies as their execution time is shorter. AAT is defined as follows: \(\begin{equation*} AAT\ =\ \frac{ExecutionTime\ +\ TotalMissTimes\ +\ TotalHitTimes}{\#Requests} \end{equation*}\)

The P99-Latency metric is the 99th percentile of the observed access times in the workload, meaning 99% of the accesses are faster than the P99-Latency. Even though this performance metric is not our optimization goal, it is important since it is a widely used metric for evaluating the quality of service in real-world systems as an approximation for the upper bound latency for 99% of the requests. Therefore, we present results for this metric to verify that the reduction of AAT did not cause a spike for P99-Latency.
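For concreteness, both metrics can be computed from the recorded per-request access times as in the following sketch (a straightforward illustration, not the simulator’s code):

```java
import java.util.Arrays;

// Sketch of computing the AAT and P99-Latency metrics from recorded access times.
final class Metrics {
    // AAT = (executionTime + sum of hit times + sum of miss times) / number of requests.
    static double averageAccessTime(double executionTime, double[] accessTimes) {
        double total = executionTime;
        for (double t : accessTimes) {
            total += t;
        }
        return total / accessTimes.length;
    }

    // P99-Latency: 99% of the accesses are faster than this value.
    static double p99Latency(double[] accessTimes) {
        double[] sorted = accessTimes.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(0.99 * sorted.length) - 1;
        return sorted[Math.max(index, 0)];
    }
}
```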

Since most real systems utilize the LRU policy, we present the AAT and P99 metrics as values normalized w.r.t. LRU. All simulations were run separately on the same local system so as not to affect the execution times. Our system has an AMD Ryzen 7 4800H with Radeon Graphics, 8 cores (16 threads) running at 2.9GHz, BIOS version N.1.16ELU05 (type: UEFI), 32GB RAM, and a 64-bit Ubuntu 20.04 Linux distribution.

For clarity and compactness, we do not always show all the traces in every evaluation. Figures are omitted when they illustrate very similar behavior to the presented traces. Moreover, we do not show all cache sizes for every trace; instead, we only show the interesting range where performance improves with increasing size. Such a range depends on the number of distinct items in the trace, and longer traces tend to require larger caches.

5.2 Evaluation of Cost Approximations

We start by evaluating our suggested method for approximating future miss times by measuring the current miss time. We conducted this evaluation in two stages. First, we evaluated the MAE (Mean Absolute Error) of the approximations by collecting the absolute differences between the approximation of the miss time and the actual reported miss time in case of a hit. We then used this collected data to compute both the MAE and the standard deviation of the errors. We also computed the mean miss time in the workload and the standard deviation of these miss times, and we used the mean miss time of the workload to normalize the MAE and the standard deviations to remove the time units and to get a better sense of the MAE relative to the workload. This information is illustrated in Figure 7, and as can be observed, the approximations are relatively accurate. Second, we examine the effectiveness of our measurement method by testing how much we could improve the performance by artificially knowing the exact miss time. To this end, we created an “oracle” version that provides our algorithm with the exact miss time of the subsequent miss (rather than the measured time). If there is no future access in the workload, the oracle sets the future access time to zero. In Figure 8, we compare the normalized AAT w.r.t. LRU of our top algorithm with and without the oracle. As can be observed, the maximum improvement when using the “oracle” is around 5.5%, and the average improvement is around 23%. Note that knowing the following access times did not provide better results for the WS workload. We attribute this to the fact that the oracle knows only the next access time and not all future access times in advance. To conclude this experiment, we note that there is (limited) room for improvement in our method by employing better prediction methods. Such optimization is left for future research.

Fig. 7.

Fig. 7. Evaluation of Mean Absolute Error (MAE) of miss time approximation normalized with the real mean miss time.

Fig. 8.

Fig. 8. Comparison of normalized AAT w.r.t. LRU of HCA-W-CATinyLFU with and without a time oracle.

5.3 Under the Hood of CRA

We start by exploring our CRA policy’s meta-parameter k. We choose the most attractive k value overall, as we do not employ any per-workload parameter tuning; tuning the meta-parameters for a specific type of workload would give better results for that workload but might come at the expense of decreasing the general performance.

Meta parameter k: Figure 9 shows the average access time for the following values of k: {0, 1, 2, 3, 4, 5, 10, 15}. For the sake of simplicity, we greyed out all the lines except the best and worst lines for every workload. Observe that the best k value varies between workloads and that there is a large margin of up to \(\sim\)\(60\%\) reduction in average access times compared to LRU if we could select the best value for k. We continue with \(k=1\) as it often provides good performance.

Fig. 9.

Fig. 9. Effect for parameter k on the performance of CRA using the normalized AAT performance metric w.r.t. LRU. Notice that the performance of CRA is sensitive to parameter k.

Number of active lists: We collected the number of active lists every 100 requests, computing a running mean and keeping the maximum number. Table 3 presents this collected information, averaged across all cache sizes per workload.

5.4 Access Time Aware Algorithms vs Baselines

Next, we compare the performance of our access time-aware algorithms and their baselines. Section 5.4.1 evaluates our static building blocks, while Section 5.4.2 evaluates our adaptive algorithms. We configure literature based algorithms according to their corresponding authors’ suggestions.

5.4.1 Static Algorithms.

Figure 10 shows both the normalized AAT and the hit ratio for the static access time aware algorithms W-CATinyLFU and CRA against their baselines W-TinyLFU and LRU [24], and against the Clairvoyant [9] policy. Observe that CRA reduces the average access times by up to \(\sim\)\(60\%\) compared to LRU. CRA outperformed LRU in almost all our experiments. The exceptions include the AOL and WS workloads, which resulted in a tiny increase of the AAT, and the DNS workload, where CRA underperformed by a more significant factor for certain cache sizes. We attribute the lesser performance in these workloads to their nature, which is very frequency biased, while CRA’s eviction mechanism utilizes the recency-based score. W-CATinyLFU reduces the access times by up to 50% compared to W-TinyLFU, and its worst result is an increase of 1.8% in the average access time. Note that the increase was over 1% in only about 2% of the experiments we conducted. Note that even though we refer to CRA and W-CATinyLFU as static algorithms, they do encapsulate a somewhat adaptive approach by using both hit and miss times, which can adapt to future hardware changes with no change in the algorithm. For example, suppose a caching node employs an SSD cache in front of multiple HDD-based storage nodes, and one of these is replaced by a DRAM node. In that case, our algorithm might choose not to cache data from the DRAM node, since it is suddenly faster to retrieve the data from the node than from the cache.

Fig. 10.

Fig. 10. Comparison of our CRA and W-CATinyLFU algorithms to their latency oblivious baselines (LRU, and W-TinyLFU), and the Clairvoyant policy under the hit ratio and the normalized AAT performance metrics.

Notice that in Figure 10(c), CRA and W-CATinyLFU achieve lower access times than the Clairvoyant [9] policy for some of the points. This detail is surprising as Clairvoyant is an upper bound on the achievable hit ratio. This strengthens the observation that there is a tension between optimizing the hit ratio and optimizing the average access time, and that exploiting access time variations is sometimes more promising than optimizing the hit ratio.

5.4.2 Adaptive Algorithms.

Here we compare our adaptive HCA-W-CATinyLFU, HCN-W-CATinyLFU, and HCS-W-CATinyLFU with their baselines HCA-W-TinyLFU, HCN-W-TinyLFU, and HCS-W-TinyLFU. Figure 11 shows the normalized AAT and hit ratio over both web-based and storage-based workloads. Notice that the differences between the different climbers are relatively small. Still, our access time-aware climbers have lower access times than their (cost-oblivious) baselines but not necessarily a better hit ratio. Observe that the cost-aware algorithms improve the average access times significantly compared with their cost-oblivious baselines in most experiments. In Figure 11(d), our improvement reaches up to \(53\%\) compared to the baseline.

Fig. 11.

Fig. 11. Hit ratio and the normalized AAT over different workloads of our HC-W-CATinyLFU and its baseline using three different hill-climbing techniques.

5.5 Competitive Evaluation

We now position our top-performing adaptive policy (HCA-W-CATinyLFU) and the novel CRA (static) policy within the context of existing cost-aware cache policies such as GDWheel [37], CAMP [22], and Hyperbolic-CA [11] that utilize our timing mechanism to determine the costs. Our algorithms were run using the meta-parameters: \(k=1\) and \(q=10\). The competing algorithms are run using their suggested best parameters.

5.5.1 DNS, Wikipedia, and Web Search Workloads - Normalized AAT Metric.

Figure 12 shows the web-based workloads’ performance w.r.t. the normalized AAT metric (normalized w.r.t. LRU). As can be observed, both HCA-W-CATinyLFU and CRA outperform GDWheel and CAMP by a large margin on all web-based workloads w.r.t. this metric. HCA-W-CATinyLFU also outperforms Hyperbolic-CA on all web-based workloads, with an average improvement of around 6.6% over all web-based experiments we conducted. We attribute this improvement to successfully exploiting naturally occurring variations in access times. Note that in all web-based experiments, both CAMP and GDWheel underperform even compared to LRU, in some cases by a significant factor.

Fig. 12.

Fig. 12. Normalized AAT of HCA-W-CATinyLFU, CRA, and state-of-the-art cost-aware algorithms.

5.5.2 DNS, Wikipedia, and Web Search Workloads - Normalized P99-Latency Metric.

Figure 13 shows the performance on the web-based workloads w.r.t. the normalized P99-Latency metric. As can be observed, both HCA-W-CATinyLFU and CRA outperform GDWheel and CAMP on all web-based workloads, in some cases by a large margin w.r.t. this metric. HCA-W-CATinyLFU also outperforms Hyperbolic-CA on both the Wikipedia and web search workloads. However, on the DNS workload, Hyperbolic-CA achieves the best performance w.r.t. the P99-Latency metric for certain cache sizes. While the 99th percentile latency is not a design goal of our work, the figure shows that the improvements in average latency do not conflict with the P99 latency. Specifically, in some cases, CRA is even superior to the other policies in the P99 metric, as can be seen in Figure 13(b), where CRA improves the P99 latency by about 50 percent compared to the best alternative.

Fig. 13.

Fig. 13. Normalized P99-Latency of HCA-W-CATinyLFU, CRA, and state-of-the-art cost-aware algorithms.

5.5.3 Storage Workloads - Normalized AAT Metric.

Figure 14 shows the normalized average access time for the storage-based workloads. The GCC trace is very recency biased, and thus CRA is the top performer for this workload for all cache sizes but the smallest one we tested. Notice that CAMP is the top performer on the OLTP2 trace for most cache sizes we checked, with CRA a close second, which is surprising since OLTP mixes recency and frequency localities. As can be observed, in the OLTP2 workload, HCA-W-CATinyLFU and Hyperbolic-CA perform much better than on the recency-biased GCC workload, and HCA-W-CATinyLFU is the top performer among the two. For the SYSTOR17 workload, the top performer depends on the cache size; the top two policies for this workload are HCA-W-CATinyLFU and CAMP, with a maximum difference between them of less than \(\sim\)\(1\%\). We notice a similar result in the MULTI1 workload. For most cache sizes in the MAC workload experiment, the top performer is CAMP, with our HCA-W-CATinyLFU a close second with a maximum difference of \(\sim\)\(10\%\). On the Gradle workload, for most cache sizes, the only policy to improve the AAT compared to LRU is our CRA, as seen in Figure 14(a). In the LINUX and MULTI2 workloads, the top performer w.r.t. the normalized AAT metric is our HCA-W-CATinyLFU.

Fig. 14.

Fig. 14. Normalized AAT of HCA-W-CATinyLFU, CRA, and state-of-the-art cost-aware algorithms.

Note that we limit the maximum size of the Window cache to 80% of the total cache size. We chose this limit as it is the default configuration for the Caffeine library. Thus, HCA-W-CATinyLFU has a limit on how well it can perform on more recency-biased workloads as seen in Figure 14(a). In principle, one can increase the maximum Window cache size to achieve better performance on such workloads.

5.5.4 Storage Workloads - Normalized P99-Latency Metric.

Figure 15 shows the normalized P99-latency for storage-based workloads. Due to the nature of the access times distribution in storage workloads, the change in the P99-latency is relatively small. However, Figure 15 shows that the improvement in AAT is not at the expense of the P99-latency performance metric.

Fig. 15.

Fig. 15. Normalized P99-Latency of HCA-W-CATinyLFU, CRA, and state-of-the-art cost-aware algorithms.

5.6 Results Summary

Our evaluation includes diverse workloads such as DNS resolution, web searches, databases, and storage traces. These traces are very diverse in terms of the variability of access times and the access patterns they exhibit. Our evaluation shows that access time awareness is practical in all these workloads and offers opportunities to improve the average access time. Our method of measuring access times on the fly is sufficiently accurate to benefit from these variations. The competitive evaluation compares our algorithm to recently proposed cost-aware algorithms, where our timing methods determine the cost. The evaluation shows that HCA-W-CATinyLFU is competitive on all the tested workloads despite the variability in workload conditions. Moreover, HCA-W-CATinyLFU is the top-performing policy in almost every workload, and in the remaining few it is a close second. The best alternative is Hyperbolic-CA; we do not consider GD-Wheel or CAMP to be the best alternative due to their inconsistent performance. In some workloads, CRA is the top performer, but this is due to a technical reason. Specifically, in such workloads the access pattern is so recency biased that the best configuration is for HCA-W-CATinyLFU to behave like CRA (100% Window cache size). However, due to implementation issues inherited from Caffeine, our implementation of HCA-W-CATinyLFU can only scale the Window cache to 80%. Our work shows that there may be merit in improving the implementation of Caffeine to scale all the way to a 100% Window cache when required.


6 DISCUSSION

Our work is related to a family of cost-aware cache algorithms, which has been kept primarily academic until now. We exemplify that access time variations naturally arise within numerous important domains. We also show that by measuring these variations, we get reasonably accurate estimations of the hit and miss times that we can then use to optimize the average access time. Such measurements provide cost-aware algorithms with an edge over traditional algorithms, making cost-aware algorithms practical.

We introduce an adaptive and cost-aware algorithm that we named HCA-W-CATinyLFU and demonstrate that it is always competitive with the best state-of-the-art (cost-aware and cost oblivious) algorithm for each workload. We found no other competitive solution across all the tested workloads, implying that all other solutions require benchmarking to verify that their heuristics work well on the ‘typical’ domain workload. In contrast, our HCA-W-CATinyLFU successfully optimizes itself to each workload. Such a capability shortens the adaptation time and eases the deployment of HCA-W-CATinyLFU. Thus, our work would allow engineers to deploy our solutions without extensive performance studies knowing that our algorithm is likely competitive for their workloads.

En route to HCA-W-CATinyLFU, we developed two static cost-aware policies that serve as building blocks. Such a modular approach helps algorithm designers utilize our ideas within many algorithms. Specifically, they can seamlessly extend numerous cache algorithms with cost-aware notions by replacing their building blocks with cost-aware equivalents. Appendix A provides a detailed example of extending the ARC policy [43] in this manner. As another example, our implementation of HCA-W-CATinyLFU utilizes the SCRA cost-aware cache policy, which is obtained by replacing the LRU segments of the cost-oblivious SLRU policy with their cost-aware equivalents, CRA segments. Other notable policies that can be extended in a similar fashion include FRD [47], LIRS [28], and ARC [43].
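To make the substitution pattern concrete, the sketch below contrasts a cost-oblivious LRU segment with a hypothetical cost-aware segment that evicts the item whose estimated miss cost is lowest. The interface and both classes are illustrative; the cost-aware victim selection is a simplified stand-in for the actual CRA segment rather than its exact logic.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of the building-block substitution: any policy written against
// the Segment interface can switch from a cost-oblivious LRU segment to a
// cost-aware one without changing the surrounding policy logic.
interface Segment<K> {
    void recordAccess(K key, double missCost);
    K selectVictim();
    int size();
}

final class LruSegment<K> implements Segment<K> {
    // Access-order LinkedHashMap keeps the least recently used entry first.
    private final LinkedHashMap<K, Double> entries = new LinkedHashMap<>(16, 0.75f, true);
    public void recordAccess(K key, double missCost) { entries.put(key, missCost); }
    public K selectVictim() {
        return entries.isEmpty() ? null : entries.keySet().iterator().next();
    }
    public int size() { return entries.size(); }
}

final class CostAwareSegment<K> implements Segment<K> {
    private final Map<K, Double> costs = new LinkedHashMap<>();
    public void recordAccess(K key, double missCost) { costs.put(key, missCost); }
    public K selectVictim() { // evict the item whose miss would be cheapest to serve again
        return costs.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }
    public int size() { return costs.size(); }
}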

To conclude, our work demonstrates a constructive method for extending access time-oblivious algorithms with access time measurements. Further, we show that such an approach results in algorithms that are competitive even with algorithms designed specifically to be cost-aware. Looking into the future, we seek to make our access time estimations more accurate to further enhance the benefits of our approach.

We now pursue the adoption of HCA-W-CATinyLFU within the open-source community, possibly through the Caffeine library. We hope that our work will increase the community's interest in cost-aware cache algorithms. We will publicly release all the code and traces used in this work to expedite further research into cost-aware algorithms.

Code and Workloads Availability: The complete code for the algorithms in this paper is available here. Download links for the workloads are available on the web page of our code repository, along with notes and instructions on how to reproduce the main experiments.

Appendix

A COST-AWARE ADAPTIVE REPLACEMENT CACHE

In this section, we evaluate the use of CRA as a building block within another state-of-the-art cost-oblivious caching policy that relies on LRU as its basic building block, again by replacing LRU with CRA. The chosen caching policy for this experiment is ARC [43], and we compare it against the new variant, which we denote ARC-CA.

ARC (Adaptive Replacement Cache) is an adaptive cache policy built around four LRU-type lists. We implemented ARC-CA by replacing each LRU list with a CRA list. No other significant modifications were needed, and the ARC logic that dynamically sizes these lists remains the same.
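Conceptually, the change amounts to parameterizing ARC's internal lists by a list factory, so only the factory differs between ARC and ARC-CA. The skeleton below is a hedged illustration that reuses the hypothetical Segment interface sketched in the Discussion; it is not our actual ARC-CA code, and it omits ARC's hit/miss handling.

import java.util.function.Supplier;

// Skeleton of the ARC-CA substitution: the ARC sizing logic is untouched; only
// the factory that creates the four internal lists changes. The Segment,
// LruSegment, and CostAwareSegment types are the hypothetical ones from the
// earlier sketch. In real ARC, the two ghost lists (b1, b2) hold only
// metadata; they are treated uniformly here for brevity.
final class ArcLikeCache<K> {
    private final Segment<K> t1, t2, b1, b2; // two resident lists and two ghost lists

    ArcLikeCache(Supplier<Segment<K>> listFactory) {
        this.t1 = listFactory.get();
        this.t2 = listFactory.get();
        this.b1 = listFactory.get();
        this.b2 = listFactory.get();
    }

    static <K> ArcLikeCache<K> arc()   { return new ArcLikeCache<K>(LruSegment::new); }
    static <K> ArcLikeCache<K> arcCa() { return new ArcLikeCache<K>(CostAwareSegment::new); }
    // ... ARC's adaptive sizing and hit/miss handling would go here, unchanged ...
}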


A.1 DNS, Wikipedia, and Web Search Workloads - Normalized AAT

Figure 16 shows the web-based workloads’ performance w.r.t. the normalized AAT metric (normalized w.r.t. LRU). As can be observed, ARC-CA is superior to its baseline w.r.t. this metric in all tested web-based experiments, except for the AOL workload with a maximum cache size of 15K items. Note that even in this experiment, the change is only around \(-\)0.17%.
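For clarity, and assuming the normalization is a simple ratio of average access times over the same trace, the metric used in this appendix can be written as
\[
\mathrm{Normalized\ AAT}(\mathcal{P}) \;=\; \frac{\mathrm{AAT}(\mathcal{P})}{\mathrm{AAT}(\mathrm{LRU})},
\qquad
\mathrm{AAT}(\mathcal{P}) \;=\; \frac{1}{N}\sum_{i=1}^{N} t_i^{\mathcal{P}},
\]
where \(t_i^{\mathcal{P}}\) denotes the access time of the \(i\)-th request under policy \(\mathcal{P}\) and \(N\) is the number of requests in the trace; values below 1 indicate an improvement over LRU.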


Fig. 16. Normalized AAT of ARC and ARC-CA (a cost-aware variant using CRA) on web-based workloads.


A.2 DNS, Wikipedia, and Web Search Workloads - Normalized P99

Figure 17 shows the performance on the web-based workloads w.r.t. the normalized P99-Latency metric. As can be observed, ARC-CA is either equivalent to or better than ARC, the baseline, w.r.t. this performance metric in all the experiments presented in this figure.


Fig. 17. Normalized P99-Latency of ARC and ARC-CA (a cost-aware variant using CRA) on web-based workloads.


A.3 Storage Workloads - Normalized AAT Metric

Figures 18 and 19 show the normalized average access time for storage-based workloads. As can be observed, ARC-CA is superior to ARC w.r.t. this metric in almost all tested storage-based experiments; where ARC-CA is not better, the two algorithms perform almost identically. Note that the maximum improvement over the baseline across all storage-based workloads is around 31%, as seen in Figure 18(b) at the 1,250-item cache size.


Fig. 18. Normalized AAT of ARC and ARC-CA (a cost-aware variant using CRA) for small cache sizes on storage-based workloads.


Fig. 19. Normalized AAT of ARC and ARC-CA (a cost-aware variant using CRA) for large cache sizes on storage-based workloads.


A.4 Storage Workloads - Normalized P99-Latency Metric

Figures 20 and 21 show the normalized P99-latency for storage-based workloads. Due to the nature of the access time distribution in storage workloads, the change in the P99-latency is relatively small. Nevertheless, these figures show that the improvement in AAT does not come at the expense of the P99-latency performance metric.


Fig. 20. Normalized P99-Latency of ARC and ARC-CA (a cost-aware variant using CRA) for small cache sizes on storage-based workloads.


Fig. 21. Normalized P99-Latency of ARC and ARC-CA (a cost-aware variant using CRA) for large cache sizes on storage-based workloads.

REFERENCES

[1] Akhtar Shahid, Beck Andre, and Rimac Ivica. 2017. Caching online video: Analysis and proposed algorithm. ACM Trans. Multimedia Comput. Commun. Appl. 13, 4 (Aug. 2017), 48:1–48:21.
[2] AOL. 2006. AOL User Session Collection.
[3] Apache. 2010. Apache Cassandra. https://cassandra.apache.org.
[4] Apache. 2012. Apache Solr. https://solr.apache.org.
[5] Araldo Andrea, Mangili Michele, Martignon Fabio, and Rossi Dario. 2014. Cost-aware caching: Optimizing cache provisioning and object placement in ICN. In 2014 IEEE Global Communications Conference. 1108–1113.
[6] Arlitt Martin, Cherkasova Ludmila, Dilley John, Friedrich Rich, and Jin Tai. 1999. Evaluating content management techniques for Web proxy caches. In Proc. of the 2nd Workshop on Internet Server Performance.
[7] Arlitt Martin, Friedrich Rich, and Jin Tai. 2000. Performance evaluation of Web proxy cache replacement policies. Perform. Eval. 39, 1-4 (Feb. 2000), 149–164.
[8] Bakkal Emre, Altingovde Ismail Sengor, and Toroslu Ismail Hakki. 2015. Cost-aware result caching for meta-search engines. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’15). ACM, New York, NY, USA, 739–742.
[9] Belady L. A. 1966. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal 5, 2 (1966), 78–101.
[10] Berger Daniel S., Berg Benjamin, Zhu Timothy, Sen Siddhartha, and Harchol-Balter Mor. 2018. RobinHood: Tail latency aware caching – dynamic reallocation from cache-rich to cache-poor. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). USENIX Association, Carlsbad, CA, 195–212. https://www.usenix.org/conference/osdi18/presentation/berger.
[11] Blankstein Aaron, Sen Siddhartha, and Freedman Michael J. 2017. Hyperbolic caching: Flexible caching for web applications. In 2017 USENIX Annual Technical Conference (USENIX ATC’17). 499–511.
[12] Cao Pei and Irani Sandy. 1997. Cost-aware WWW proxy caching algorithms. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS’97). USENIX Association, USA, 1–8.
[13] Chakraborty Abhirup and Singh Ajit. 2011. Cost-aware caching schemes in heterogeneous storage systems. The Journal of Supercomputing 56 (Apr. 2011), 56–78.
[14] Cormode Graham and Muthukrishnan S. 2004. An improved data stream summary: The Count-Min sketch and its applications. J. Algorithms 55 (2004), 29–38.
[15] Dropwizard. 2011. Dropwizard is a sneaky way of making fast Java web applications. https://www.dropwizard.io.
[16] Du Xiaoming and Li Cong. 2021. SHARC: Improving adaptive replacement cache with shadow recency cache management. In Proceedings of the 22nd International Middleware Conference (Middleware’21). ACM, New York, NY, USA, 119–131.
[17] DuckDuckGo. 2008. Search engine. https://duckduckgo.com.
[18] Einziger Gil, Eytan Ohad, Friedman Roy, and Manes Ben. 2018. Adaptive software cache management. In Proceedings of the 19th International Middleware Conference (Middleware’18). ACM, New York, NY, USA, 94–106.
[19] Einziger G., Friedman R., and Manes B. 2017. TinyLFU: A highly efficient cache admission policy. ACM Transactions on Storage (TOS) (2017).
[20] Fan Li, Cao Pei, Almeida Jussara, and Broder Andrei Z. 2000. Summary cache: A scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw. 8, 3 (June 2000), 281–293.
[21] Forney Brian C. and Arpaci-Dusseau Andrea C. 2002. Storage-aware caching: Revisiting caching for heterogeneous storage systems. In Conference on File and Storage Technologies (FAST’02). USENIX Association, Monterey, CA. https://www.usenix.org/conference/fast-02/storage-aware-caching-revisiting-caching-heterogeneous-storage-systems.
[22] Ghandeharizadeh Shahram, Irani Sandy, Lam Jenny, and Yap Jason. 2014. CAMP: A cost adaptive multi-queue eviction policy for key-value stores. In Proceedings of the 15th International Middleware Conference (Middleware’14). ACM, New York, NY, USA, 289–300.
[23] Ghodke Sid. 2018. Alexa top 1 Million Sites, a listing of the top 1-million websites according to Alexa.com. https://www.kaggle.com/cheedcheed/top1m.
[24] Hennessy John L. and Patterson David A. 2012. Computer Architecture - A Quantitative Approach (5th ed.). Morgan Kaufmann.
[25] Hou Binbing and Chen Feng. 2017. GDS-LC: A latency- and cost-aware client caching scheme for cloud storage. ACM Trans. Storage 13, 4, Article 40 (Nov. 2017), 33 pages.
[26] Huang Yaning, Jin Hai, Shi Xuanhua, Wu Song, and Chen Yong. 2013. Cost-aware client-side file caching for data-intensive applications. In Proceedings of the 2013 IEEE International Conference on Cloud Computing Technology and Science - Volume 02 (CLOUDCOM’13). IEEE Computer Society, USA, 248–251.
[27] Jeong Jaeheon and Dubois Michel. 2003. Cost-sensitive cache replacement algorithms. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture (HPCA’03). IEEE Computer Society, USA, 327.
[28] Jiang Song and Zhang Xiaodong. 2002. LIRS: An efficient low inter-reference recency set replacement policy to improve buffer cache performance. In Proceedings of the International Conference on Measurements and Modeling of Computer Systems (SIGMETRICS 2002), June 15–19, 2002, Marina Del Rey, California, USA. 31–42.
[29] Karakostas G. and Serpanos D. N. 2002. Exploitation of different types of locality for web caches. In Proc. of the 7th Int. Symposium on Computers and Communications (ISCC’02).
[30] Karedla Ramakrishna, Love J. Spencer, and Wherry Bradley G. 1994. Caching strategies to improve disk system performance. Computer 27, 3 (March 1994), 38–46.
[31] Shah Ketan, Mitra Anirban, and Matani Dhruv. 2010. An O(1) algorithm for implementing the LFU cache eviction scheme. (2010).
[32] Kingma Diederik P. and Ba Jimmy. 2014. Adam: A method for stochastic optimization. arXiv:cs.LG/1412.6980 (2014).
[33] Lee Chunghan, Kumano Tatsuo, Matsuki Tatsuma, Endo Hiroshi, Fukumoto Naoto, and Sugawara Mariko. 2017. Understanding storage traffic characteristics on enterprise virtual desktop infrastructure. In Proceedings of the 10th ACM International Systems and Storage Conference (SYSTOR’17). ACM, New York, NY, USA, Article 13, 11 pages.
[34] Lee Donghee, Choi Jongmoo, Kim Jong-Hun, Noh Sam H., Min Sang Lyul, Cho Yookun, and Kim Chong-Sang. 2001. LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE Trans. Computers 50, 12 (2001), 1352–1361.
[35] Li Cong. 2018. DLIRS: Improving low inter-reference recency set cache replacement policy with dynamics. In Proceedings of the 11th ACM International Systems and Storage Conference (SYSTOR’18). ACM, New York, NY, USA, 59–64.
[36] Li Cong, Bao Jia, and Wang Haitao. 2017. Optimizing low memory killers for mobile devices using reinforcement learning. In 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC’17). 2169–2174.
[37] Li Conglong and Cox Alan. 2015. GD-Wheel: A cost-aware replacement policy for key-value stores. In Proceedings of the 10th European Conference on Computer Systems (EuroSys 2015) (Apr. 2015).
[38] Liang Shuang, Chen Ke, Jiang Song, and Zhang Xiaodong. 2007. Cost-aware caching algorithms for distributed storage servers. In Distributed Computing, Pelc Andrzej (Ed.). Springer Berlin, Berlin, 373–387.
[39] Liberatore Marc and Shenoy Prashant. 2016. UMass Trace Repository. http://traces.cs.umass.edu/index.php/Main/About.
[40] Liu Cricket and Albitz Paul. 2006. DNS and BIND (5th ed.). O’Reilly Media, Inc.
[41] Lv Yanfei, Chen Xuexuan, and Cui Bin. 2010. ACAR: An adaptive cost aware cache replacement approach for flash memory. In Web-Age Information Management, Chen Lei, Tang Changjie, Yang Jun, and Gao Yunjun (Eds.). Springer Berlin, Berlin, 558–569.
[42] Manes Ben. 2016. Caffeine: A high performance caching library for Java 8. https://github.com/ben-manes/caffeine.
[43] Megiddo Nimrod and Modha Dharmendra S. 2003. ARC: A self-tuning, low overhead replacement cache. In Proc. of the 2nd USENIX Conf. on File and Storage Technologies (FAST’03). 115–130.
[44] Neglia Giovanni, Carra Damiano, Feng Mingdong, Janardhan Vaishnav, Michiardi Pietro, and Tsigkari Dimitra. 2017. Access-time-aware cache algorithms. ACM Trans. Model. Perform. Eval. Comput. Syst. 2, 4, Article 21 (Nov. 2017), 29 pages.
[45] Nesterov Yurii. 1983. A method for solving the convex programming problem with convergence rate O(\(1/k^{2}\)). Proceedings of the USSR Academy of Sciences 269 (1983), 543–547.
[46] Ozcan Rifat, Altingovde Ismail, and Ulusoy Ozgur. 2011. Cost-aware strategies for query result caching in web search engines. TWEB 5 (May 2011), 9.
[47] Park Sejin and Park Chanik. 2017. FRD: A filtering based buffer cache algorithm that considers both frequency and reuse distance. In Proc. of the 33rd IEEE International Conference on Massive Storage Systems and Technology (MSST’17).
[48] Qureshi Moinuddin K., Lynch Daniel N., Mutlu Onur, and Patt Yale N. 2006. A case for MLP-aware cache replacement. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA’06). IEEE Computer Society, USA, 167–178.
[49] Redisson. 2012. Redisson - A Redis Java client with ultra-fast performance. https://redisson.pro.
[50] Siy Kaggle Hofe. 2020. 2017-2019 search engine keywords. https://www.kaggle.com/hofesiy/2019-search-engine-keywords.
[51] New Mexico State University. [n. d.]. NMSU TraceBase. http://tracebase.nmsu.edu/tracebase/traces.
[52] Urdaneta Guido, Pierre Guillaume, and Steen Maarten van. 2009. Wikipedia workload analysis for decentralized hosting. Elsevier Computer Networks 53, 11 (July 2009), 1830–1845.
[53] UserBenchmark. [n. d.]. UserBenchmark. https://www.userbenchmark.com.
[54] Wood Timothy, Tarasuk-Levin Gabriel, Shenoy Prashant, Desnoyers Peter, Cecchet Emmanuel, and Corner Mark D. 2009. Memory buddies: Exploiting page sharing for smart colocation in virtualized data centers. SIGOPS Oper. Syst. Rev. 43, 3 (July 2009), 27–36.
[55] Zhang M., Wang Q., Shen Z., and Lee P. P. C. 2019. Parity-only caching for robust straggler tolerance. In 2019 35th Symposium on Mass Storage Systems and Technologies (MSST’19). 257–268.
