
The what, The from, and The to: The Migration Games in Deduplicated Systems

Published: 15 November 2022


Abstract

Deduplication reduces the size of the data stored in large-scale storage systems by replacing duplicate data blocks with references to their unique copies. This creates dependencies between files that contain similar content and complicates the management of data in the system. In this article, we address the problem of data migration, in which files are remapped between different volumes as a result of system expansion or maintenance. The challenge of determining which files and blocks to migrate has been studied extensively for systems without deduplication. In the context of deduplicated storage, however, only simplified migration scenarios have been considered.

In this article, we formulate the general migration problem for deduplicated systems as an optimization problem whose objective is to minimize the system’s size while ensuring that the storage load is evenly distributed between the system’s volumes and that the network traffic required for the migration does not exceed its allocation.

We then present three algorithms for generating effective migration plans, each based on a different approach and representing a different trade-off between computation time and migration efficiency. Our greedy algorithm provides modest space savings but is appealing thanks to its exceptionally short runtime. Its results can be improved by using larger system representations. Our theoretically optimal algorithm formulates the migration problem as an integer linear programming (ILP) instance. Its migration plans consistently result in smaller and more balanced systems than those of the greedy approach, although its runtime is long and, as a result, the theoretical optimum is not always found. Our clustering algorithm enjoys the best of both worlds: its migration plans are comparable to those generated by the ILP-based algorithm, but its runtime is shorter, sometimes by an order of magnitude. It can be further accelerated at a modest cost in the quality of its results.


1 INTRODUCTION

Many large-scale storage systems employ data deduplication to reduce the size of the data that they store. The deduplication process identifies duplicate data blocks in different files and replaces them with pointers to a unique copy of the block stored in the system. This reduction in the system’s size comes at the cost of increased system complexity. While the complexity of reading, writing, and deleting data in deduplicated storage systems has been addressed by many academic studies and commercial systems, the high-level management aspects of large-scale systems, such as capacity planning, caching, and quality and cost of service, still need to be adapted to deduplicated storage [48].

This article focuses on the aspect of data migration, in which files are remapped between separate deduplication domains, or volumes. A volume may represent a single server within a large-scale system or an independent set of servers dedicated to a customer or dataset. Files might be remapped as a result of volumes reaching their capacity limitation or of other bottlenecks forming in the system. Deduplication introduces new considerations when choosing which files to migrate due to the data dependencies between files: when a file is migrated, some of its blocks may be deleted from its original volume, whereas others might still belong to files that remain on that volume. Similarly, some blocks need to be transferred to the target volume, whereas others may already be stored there. An efficient migration plan must optimize several possibly conflicting objectives: the physical size of the stored data after migration; the load balancing between the system’s volumes, that is, the physical size of the data stored on each volume; and the network bandwidth generated by the migration itself.

Several recent studies address specific (simplified) cases of data migration in deduplicated systems. Harnik et al. [30] address capacity estimation and propose a greedy algorithm for reducing the system’s size. Rangoli [45] is a greedy algorithm for space reclamation, in which a set of files is deleted to reclaim some of the system’s capacity. GoSeed [43, 44] is an integer linear programming (ILP)–based algorithm for the seeding problem, in which files are remapped into an initially empty target volume. While even the seeding problem is shown to be NP-hard [43], none of these studies addresses the conflicting objectives involved in the full data migration problem — the trade-off between minimizing the system size, minimizing the network traffic consumed during migration, and maximizing the load balance between the volumes in the system.

In this article, we address the general case of data migration for the first time. We begin by formulating the data migration problem in its most general form, as an optimization problem whose main goal is to minimize the overall size of the system. We add the traffic and load balancing considerations as constraints on the migration plan. The degree to which these constraints are enforced directly affects the solution space, allowing the system administrator to prioritize different costs. Thus, the problem of data migration in deduplication systems maps to finding what to migrate, where to migrate from, and where to migrate to within the traffic and load balancing constraints specified by the administrator.

We then introduce three novel algorithms for generating an efficient migration plan. The first is a greedy algorithm that is inspired by the greedy iterative process in [30]. Our extended algorithm distributes the data evenly between volumes while ensuring that the migration traffic does not exceed the maximum allocation. By breaking this process into several phases, we ensure that the allocated traffic is used for both load balancing and capacity reduction, balancing between the two possibly conflicting goals.

Our second algorithm is inspired by the ILP-based approach of GoSeed. GoSeed solves the seeding problem, whose single natural minimization objective is the system size. In contrast, our new algorithm addresses the inherently competing objectives (size, balance, traffic) of general migration. We reformulate the ILP problem with variables and constraints that express the traffic used during migration and the choice of volumes from which to remap files or to remap files onto. Our formulation for the general migration problem is naturally much more complex than the one required for seeding. Nevertheless, we successfully applied it to data migration in systems with hundreds of millions of blocks.

Our third algorithm is based on hierarchical clustering, which, to the best of our knowledge, has not been applied to data deduplication before. We group similar files into clusters, for which the target number of clusters is the number of volumes in the system. We incorporate the physical location of the files into the clustering process, such that the similarity between files expresses the blocks that they share as well as their initial locations. Clusters are assigned to volumes according to the blocks already stored on them, and the migration plan remaps each file to the volume assigned to its cluster.

We implemented our three algorithms and evaluated them on six system snapshots created from three public datasets [6, 10, 41]. Our results demonstrate that all algorithms can successfully reduce the system’s size while complying with traffic and load balancing constraints. Each algorithm has different advantages: the greedy algorithm produces a migration plan in the shortest runtime (often several seconds), although its reduction in system size is typically lower than that of the other algorithms. The ILP-based approach can efficiently utilize the allowed traffic consumption and improve as the load balancing constraints are relaxed. However, its execution must be timed out on the large problem instances, which often prevents it from yielding an optimal migration plan. The clustering algorithm empirically achieves comparable results to those of the ILP-based approach and sometimes even exceeds them. It does so in much shorter runtimes.

We summarize our main contributions as follows. We formulate the general migration problem with deduplication as an optimization problem (Section 3), and design and implement three algorithms for generating general migration plans: the greedy (Section 4) and ILP-based (Section 5) approaches are inspired by previous studies, whereas the clustering-based (Section 6) approach is entirely novel. We methodologically compare our algorithms to analyze the advantages and limitations of each approach (Section 7). This article extends the early publication of this work in the proceedings of the 20th USENIX Conference on File and Storage Technologies (FAST 22) [33] with additional details and analysis of our approaches. The implementation of all of our algorithms is open source and available online [8].


2 BACKGROUND AND RELATED WORK

Data deduplication. In a nutshell, the deduplication process splits incoming data into fixed or variable-sized chunks, which we refer to as blocks. The content of each block is hashed to create a fingerprint, which is used to identify duplicate blocks and to retrieve their unique copy from storage. Several aspects of this process must be optimized so as not to interfere with storage system performance. These include chunking and fingerprinting [13, 39, 42, 54, 55], indexing and lookups [14, 49, 58], efficient storage of blocks [19, 21, 34, 36, 37, 49, 56], and fast file reconstruction [26, 32, 35, 57]. Although the first commercial systems used deduplication for backup and archival data, deduplication is now commonly used in high-end primary storage [11, 12].

Data migration in distributed deduplication systems. Numerous distributed deduplication designs were introduced in commercial and academic studies [20, 24, 29]. We focus on designs that employ a separate fingerprint index in each physical server [17, 18, 22, 23, 30]. This design choice maintains a small index size and a low lookup cost, facilitates garbage collection at the server level, and simplifies the client-side logic. In this design, each server (volume) is a separate deduplication domain, that is, duplicate blocks are identified only within the same volume. Recipes of files mapped to a specific volume thus point to blocks that are physically stored in that volume.

Deduplicated systems are different from traditional distributed systems in that striping files across volumes might reduce deduplication even if it is done using a content-based chunking algorithm. Splitting files across a cluster also complicates garbage collection. Moreover, many storage systems (e.g., in DataDomain [25] and IBM [30]) are organized as a collection of independent clusters or “islands” of storage in the data center or across data centers. Deduplication is performed within each independent subsystem; however, files might be migrated between the different appliances or clusters as a means to rebalance the entire system’s utilization.

For example, if a subsystem becomes full while another subsystem has available capacity, migration is quicker and cheaper than adding capacity to the full subsystem. Existing mechanisms migrate files efficiently by transferring only the files’ metadata and the chunks that are not already present in the target subsystem [25]. Monthly migration aligns with the average retention period that is seen for typical backup customers.

The coupling of the logical file’s location and the physical location of its blocks implies that when a file is remapped from its volume, we must ensure that all of its blocks are stored in the new volume. At the same time, the file’s blocks cannot necessarily be removed from its original volume, because they might also belong to other files. For example, consider the initial system depicted in Figure 1(a) and assume that we remap file \( F_2 \) from volume \( V_2 \) to volume \( V_1 \), resulting in the alternative system in Figure 1(b). Block \( B_1 \) is deleted from \( V_2 \) because it is already stored in \( V_1 \). Block \( B_2 \) is deleted from \( V_2 \) but must be copied to \( V_1 \) because it was not there in the initial system. Block \( B_3 \) must also be copied to \( V_1 \) but is not deleted from \( V_2 \) because it also belongs to \( F_3 \). The total sizes of the initial system and of this alternative are the same: nine blocks.

Fig. 1. Initial system (a) and alternative migration plans: with optimal load balancing (b), optimal traffic (c), and optimal deletion (d). All blocks in the system are of size 1.

Existing approaches. Harnik et al. [30] presented a greedy iterative algorithm for reducing the total capacity of data in a system with multiple volumes. In each iteration, one file is remapped to a new volume, and the process continues until the total capacity is reduced by a predetermined deletion goal.

GoSeed [43, 44] addresses a simplified case of data migration called seeding, in which the initial system consists of many files mapped to a single volume. The migration goal is to delete a portion of this volume’s blocks by remapping files to an empty target volume [25]. GoSeed formulates the seeding problem as an ILP instance whose solution determines which files are remapped, which blocks are moved from the source volume to the target, and which are replicated to create copies on both volumes. This approach is made possible by the existence of open-source [4, 5, 9] and commercial [2, 3] ILP solvers: heuristic-based software tools for solving this NP-hard problem efficiently. GoSeed is applied to instances with millions of blocks using several acceleration heuristics, some of which we adapt to the generalized problem.

Rangoli [45] is a greedy algorithm for space reclamation—another specific case of data migration in which a set of files is chosen for deletion in order to delete a portion of the system’s physical size. Unlike the greedy and ILP-based approaches that inspire our own algorithms, the problem solved by Rangoli is oversimplified for it to be extended for general migration. Shilane et al. [48] discuss additional data migration scenarios and their resulting complexities in deduplicated systems.


3 MOTIVATION AND PROBLEM STATEMENT

Minimizing migration traffic. High-performance storage systems typically limit the portion of their network bandwidth that can be used for maintenance tasks such as reconstruction of data from failed storage nodes [31, 47]. Data migration naturally involves significant network bandwidth consumption; traditional data migration plans and mechanisms strive to minimize their network requirements as one of their optimization goals [15, 16, 25, 38, 40, 52]. In this work, we focus on the amount of data that is moved between nodes. The physical layout of the nodes and the precise scheduling of the migration are outside the scope of our current work.

In deduplicated storage, we distinguish between two costs associated with data migration. The migration traffic is the amount of data that is transferred between volumes during migration. The replication cost is the total size of duplicate blocks that are created as a result of remapping files to new volumes. Previous studies of data migration in deduplicated systems did not consider bandwidth explicitly. Harnik et al. [30] did not address this aspect at all. In the seeding problem addressed by GoSeed [43], the migration traffic is implicitly minimized as a result of minimizing the replication cost. In the general case, however, migration traffic is potentially independent of the amount of data replication.

For example, Alternative 1 in Figure 1(b) results in transferring two blocks, \( B_2 \) and \( B_3 \), between volumes, even though \( B_2 \) is eventually deleted from its source volume. In contrast, the alternative migration plan in Figure 1(c) does not consume traffic at all: file \( F_1 \) is remapped to \( V_2 \), which already stores its only block; thus, \( B_1 \) can simply be deleted from \( V_1 \). This alternative also reduces the system’s size to eight blocks, making it superior to the first alternative in terms of both objectives. We note, however, that this is not always the case and that minimizing the overall system size and minimizing the amount of data transferred might be conflicting objectives.

Load Balancing. Load balancing is a major objective in distributed storage systems, where it often conflicts with other objectives such as utilization and management overhead [16, 46, 53]. Distributed deduplication introduces an inherent trade-off between minimizing the total physical data size and maximizing load balancing: the system’s size is minimized when all of the files are mapped to a single volume, which clearly gives the worst possible load balancing. Thus, distributed deduplication systems weigh the benefit of mapping a file to the volume that contains similar files against the need to distribute the load evenly between the volumes. Load balancing can refer to various measures of load, such as input/output operations per second (IOPS), bandwidth requirements, or the number of files mapped to each volume.

We follow previous work and aim to evenly distribute the capacity load between volumes [18, 22]. Balancing capacity is especially important in deduplicated systems that route incoming files to volumes that already contain similar files. In such designs, volumes whose storage occupancy is slightly higher than others might quickly become overloaded due to their larger amount of data ‘attracting’ even more new files, and so on. Capacity load balancing can be expected to lead to better network utilization and prevent specific volumes from becoming a bottleneck or exhausting their capacity. While performance load balancing is not our main objective, we expect it to improve as a result of distributing capacity. All of our approaches can be extended to address it explicitly.

In this work, we capture the load balancing in the system with the balance metric, which is similar to a commonly used fairness metric [27]—the ratio between the size of the smallest volume in the system and that of the largest volume. For example, the balance of the initial system in Figure 1(a) is \( |V_1|/|V_2|= {1}/{5} \). Alternative 1 (Figure 1(b)) is perfectly balanced, with \( balance=1 \), whereas Alternative 2 (Figure 1(c)) has the worst balance: \( |V_1|/|V_2|=0 \).

Problem statement. There are various approaches for handling conflicting objectives in complex optimization systems. These include searching for the Pareto frontier [59] or defining a composite objective function of weighted individual objectives. We chose to keep the system’s size as our main objective and to address the migration traffic and load balancing as constraints on the migration plan. We define the general migration problem by extending the seeding problem from [43]; thus, we reuse some of their notations for compatibility.

For a storage system \( S \) with a set of volumes \( V \), let \( B =\lbrace b_0,b_1, \ldots , \rbrace \) be the set of unique blocks stored in the system and let \( size(b) \) be the size of block \( b \). Let \( F =\lbrace f_0,f_1, \ldots , \rbrace \) be the set of files in \( S \), and let \( I_S\subseteq B\times F \times V \) be an inclusion relation, where \( (b,f,v)\in I_S \) means that file \( f \) mapped to volume \( v \) contains block \( b \), which is stored in this volume. We use \( b \in v \) to denote that \( (b,f,v)\in I_S \) for some file \( f \). The size of a volume is the total size of the blocks stored in it, that is, \( size(v)=\Sigma _{b\in v}size(b) \). The size of the system is the total size of its volumes, that is, \( size(S)=size(V)=\Sigma _{v\in V}size(v) \).

The general migration problem is to find a set of files \( F_M \subseteq F \) to migrate, the volume each file is migrated to, the blocks that can be deleted from each volume, and the blocks that should be copied to each volume. Applying the migration plan results in a new system, \( S^{\prime } \). The migration goal is to minimize the size of \( S^{\prime } \). This is equivalent to maximizing the size of all blocks that can be deleted during the migration minus the size of all blocks that must be replicated.

The traffic constraint specifies \( T_{max} \)—the maximum traffic allowed during migration. It requires that the total size of blocks that are added to volumes they were not stored in is no larger than \( T_{max} \). The load balancing constraint determines how evenly the capacity is distributed between the volumes. It specifies a margin \( 0 \le \mu \lt 1 \) and requires that the size of each volume in the new system is within \( \mu \) of the average volume size. For a system with \( |V| \) volumes, this is equivalent to requiring that \( balance\ge {[( {size(S^{\prime })}/{|V|})\times (1-\mu)]}/{[( {size(S^{\prime })}/{|V|})\times (1+\mu)]} \), or simply, \( balance\ge {(1-\mu)}/{(1+\mu)} \).
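
The following minimal C++ sketch illustrates how a candidate migration plan can be checked against these two constraints. It assumes the post-migration volume sizes and the total size of copied blocks have already been computed; it is an illustration of the definitions above, not part of our implementation.

```cpp
// Illustration only: check the traffic and load-balancing constraints for a
// candidate plan. 'volumeSizes' holds each volume's physical size after the
// plan is applied, 'copiedSize' is the total size of blocks transferred to
// volumes that did not already store them, 'tMax' is the traffic allocation,
// and 'mu' is the load-balancing margin.
#include <algorithm>
#include <numeric>
#include <vector>

struct PlanCheck {
    bool trafficOk;
    bool balanceOk;
};

PlanCheck checkPlan(const std::vector<double>& volumeSizes,  // must be non-empty
                    double copiedSize, double tMax, double mu) {
    double systemSize = std::accumulate(volumeSizes.begin(), volumeSizes.end(), 0.0);
    double avg = systemSize / volumeSizes.size();
    double minV = *std::min_element(volumeSizes.begin(), volumeSizes.end());
    double maxV = *std::max_element(volumeSizes.begin(), volumeSizes.end());
    PlanCheck res;
    res.trafficOk = (copiedSize <= tMax);
    // Requiring every volume to be within mu of the average is equivalent to
    // balance = minV / maxV >= (1 - mu) / (1 + mu).
    res.balanceOk = (minV >= avg * (1.0 - mu)) && (maxV <= avg * (1.0 + mu));
    return res;
}
```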

For example, for the initial system in Figure 1(a), Alternative 1 (Figure 1(b)) is the only migration plan that satisfies the load balancing constraint (for any \( \mu \)). For \( T_{max} \) lower than \( {2}/{9} \), no migration is feasible. On the other hand, if we remove the load balancing constraint, the optimal migration plan depends on the traffic constraint alone: Alternative 2 (Figure 1(c)) is optimal for \( T_{max}=0 \), for example, and Alternative 3 (Figure 1(d)) is optimal for \( T_{max}=3 \).

Refinements. This generalized problem can be refined in several straightforward ways. For example, we can restrict the set of files that may be included in \( F_M \), the set of volumes from which files may be removed, or the set of volumes to which files can be remapped. Similarly, we can require that a specific volume be removed from the system (enforcing all its files to be remapped) or that an empty volume be added. We demonstrate some of these cases in our evaluation.


4 GREEDY

The basic greedy algorithm by Harnik et al. [30] iterates over all the files in each volume and calculates the space-saving ratio from remapping a single file to each of the other volumes: the ratio between the total size of the blocks that would be replicated and the blocks that would be deleted from the file’s original volume. In each iteration, the file with the lowest ratio is remapped. For example, if this basic greedy algorithm was applied to the initial system in Figure 1(a), it would first remap file \( F_1 \) to volume \( V_2 \), with a space-saving ratio of 0, resulting in Alternative 2 (Figure 1(c)). The process halts when the total capacity is reduced by a predetermined deletion goal. This algorithm is not directly applicable to the general migration problem because it does not consider traffic and load balancing.

Addressing the traffic constraint is relatively straightforward. In our extended greedy algorithm, we make it the halting condition: the iterations stop when there is no file that can be remapped without exceeding the maximum allocated traffic. Algorithm 1 gives the pseudocode for the choice of file to remap while considering the traffic constraint. A small challenge is that a file might be remapped in several iterations of the algorithm whereas, in the resulting migration plan, it will be remapped from its original volume to its final destination only. As a result, the sum of traffic of all individual iterations can be (and is, in practice) higher than the traffic required when executing the migration plan. This will not violate the traffic constraint but will cause the algorithm to halt before taking advantage of the maximum allowed traffic. Thus, we heuristically allow the algorithm to use 20% more traffic than the original traffic constraint to prevent it from halting prematurely. The required traffic for the resulting migration plan is calculated before its execution. Thus, if it violates the original traffic constraint, a new plan can be generated by the algorithm without this heuristic. We include this simple extension, without a load-balancing constraint, in our evaluation.
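
The sketch below outlines the per-iteration choice in a simplified form (it is not Algorithm 1 itself): files are represented as sets of block IDs, refCount[v][b] counts how many files on volume v contain block b, and the remap with the lowest space-saving ratio that fits in the remaining traffic budget is selected.

```cpp
// Simplified sketch of one greedy iteration. The space-saving ratio of
// remapping file f from its volume s to volume t is (size of blocks that must
// be added to t) / (size of blocks that become deletable from s); remaps that
// delete nothing or exceed the remaining traffic budget are skipped.
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct File { int volume; std::unordered_set<uint64_t> blocks; };

struct Choice {
    int file = -1;
    int target = -1;
    double ratio = std::numeric_limits<double>::infinity();
};

Choice bestRemap(const std::vector<File>& files,
                 const std::vector<std::unordered_map<uint64_t, int>>& refCount,
                 const std::unordered_map<uint64_t, double>& blockSize,
                 int numVolumes, double remainingTraffic) {
    Choice best;
    for (size_t f = 0; f < files.size(); ++f) {
        int s = files[f].volume;
        for (int t = 0; t < numVolumes; ++t) {
            if (t == s) continue;
            double added = 0, deleted = 0;
            for (uint64_t b : files[f].blocks) {
                double sz = blockSize.at(b);
                if (refCount[t].count(b) == 0) added += sz;   // must be copied to t
                if (refCount[s].at(b) == 1) deleted += sz;    // only f points to b on s
            }
            if (deleted == 0 || added > remainingTraffic) continue;
            double ratio = added / deleted;
            if (ratio < best.ratio) {
                best.file = static_cast<int>(f);
                best.target = t;
                best.ratio = ratio;
            }
        }
    }
    return best;  // best.file == -1 means no feasible remap, so the iterations halt
}
```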

Complying with the load-balancing constraint is more challenging. For example, if the basic greedy algorithm reached Alternative 2 (Figure 1(c)), it could no longer remap any single file to volume \( V_1 \) without increasing the system’s capacity. Thus, the system will remain unbalanced with at least one empty volume. A naïve extension to this algorithm could enforce the load-balancing constraint by preventing files from being remapped if this increases the system’s imbalance. However, such a strict requirement might preclude too many opportunities for optimization. For example, for the initial system in Figure 1(a), it would allow remapping file \( F_2 \) to volume \( V_1 \) only, resulting in Alternative 1 (Figure 1(b)). The system would be perfectly balanced, but the basic algorithm would then terminate without reducing its size at all.

We address this challenge with two main techniques. The first is defining two iteration types: one whose goal is to balance the system’s load and another whose goal is to reduce its size. We perform these iterations interchangeably to avoid the entire allocated traffic from being spent on only one goal. The second technique is to relax the load-balancing margin for the early iterations and continuously tighten it until the end of the execution. The idea is to let the early iterations remap files more freely and to ensure that the iterations at the end of the algorithm result in a balanced system.

Figure 2 illustrates the process of our extended greedy algorithm. We divide the algorithm’s process into phases. ① Each phase is allocated an even portion of the traffic allocated for migration, and is limited by a local load-balancing constraint. Each phase is composed of two steps. ② The load-balancing step iteratively remaps files from large volumes to small ones, until the volume sizes are within the margin defined for this phase or its traffic is exhausted. ③ The capacity-reduction step uses the remaining traffic to reduce the system’s size by remapping files between volumes, ensuring that volume sizes remain within the margin.

Fig. 2. Overview of our extended greedy algorithm.

Algorithm 2 gives the main structure of our algorithm. Each phase is limited by local traffic and load-balancing constraints, calculated at the beginning of the phase. The phase traffic (\( T_{phase} \)) determines the maximum traffic that can be used in each phase, and is roughly even for all phases. The local phase margin (\( \mu _{phase} \)) determines the minimum and maximum allowed volume sizes in each phase. It is larger than the global margin, \( \mu \), in the first phase and gradually decreases before each phase, until reaching \( \mu \) in the last phase. By default, our greedy algorithm consists of \( p=5 \) phases. The phase traffic for phase \( i \), \( 0 \le i \lt p \), is \( {1}/{(p-i)} \) of the unused traffic, and the phase margin for the first phase is \( \mu \times 1.5 \).
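
A short sketch of this schedule follows; the linear decrease of the phase margin between \( 1.5\mu \) and \( \mu \), and the assumption that each phase consumes its entire budget, are simplifications for illustration only.

```cpp
// Sketch of the per-phase traffic and margin schedule: phase i receives
// 1/(p - i) of the still-unused traffic, and the local margin shrinks from
// 1.5 * mu in the first phase to mu in the last one (linear decrease assumed).
#include <cstdio>

int main() {
    const int p = 5;             // number of phases (default)
    const double tMax = 1000.0;  // total traffic allocation (example value)
    const double mu = 0.05;      // global load-balancing margin
    double usedTraffic = 0.0;    // in the real algorithm, updated with actual consumption

    for (int i = 0; i < p; ++i) {
        double tPhase = (tMax - usedTraffic) / (p - i);
        double muPhase = mu * (1.5 - 0.5 * i / (p - 1));
        std::printf("phase %d: traffic budget %.1f, margin %.3f\n", i, tPhase, muPhase);
        usedTraffic += tPhase;   // here each phase is assumed to use its full budget
    }
    return 0;
}
```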

The load-balancing step, given in Algorithm 3, is the first step in each phase. In each of its iterations, the volumes are sorted according to their sizes. We attempt to remap files from the largest volumes to the smallest ones. A file can be remapped only if some blocks will be deleted from its source volume as a result. We look for a file to remap between a \( \langle source,target \rangle \) pair of volumes, where \( source \) is the largest volume and \( target \) is the smallest volume for which such a file exists. In each iteration, the amount of traffic required to remap the chosen file is calculated, and the iterations halt when the maximum allowed traffic or allowed volume sizes are reached.
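
The pair selection can be sketched as follows; this is a simplification of Algorithm 3 in which the per-pair check for a remappable file is abstracted away as a predicate.

```cpp
// Sketch of the <source, target> selection in the load-balancing step: the
// source is the largest volume, and the target is the smallest volume to which
// some file of the source can be remapped such that blocks are deleted from
// the source. 'existsRemappableFile' stands for that check and is assumed here.
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

std::pair<int, int> pickPair(
        const std::vector<double>& volumeSizes,
        const std::function<bool(int, int)>& existsRemappableFile) {
    std::vector<int> order(volumeSizes.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return volumeSizes[a] < volumeSizes[b]; });
    int source = order.back();                 // largest volume
    for (int target : order) {                 // smallest volumes first
        if (target != source && existsRemappableFile(source, target))
            return {source, target};
    }
    return {-1, -1};                           // no feasible pair; the step halts
}
```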

The capacity-reduction step, given in Algorithm 4, uses the remaining traffic allocation of the phase. It is similar to the original greedy algorithm, but it ensures that each file remap does not cause the volumes to become unbalanced. In other words, we can remap a file only if this does not cause its source volume to become too small or its target volume to become too large. This constraint is reflected in the use of a different routine for choosing the best file to remap, given in Algorithm 5. Note that the amount of traffic that remains for the capacity-reduction step depends on the degree of imbalance in the initial system. In the most extreme case of a highly unbalanced system, it is possible for the load-balancing step to consume all of the traffic allocated for the phase. In this case, the capacity-reduction step halts in the first iteration. For cases other than this extreme, a higher number of phases can divert more traffic for capacity-reduction at the cost of longer computation time due to the increased number of iterations.


5 ILP

Our ILP-based approach is inspired by GoSeed [43, 44], designed for the seeding problem, in which files can be remapped only from the source volume to the empty target volume. GoSeed thus defined three types of variables whose assignment specified (1) whether a file is remapped, (2) whether a block is replicated on both volumes, and (3) whether a block is deleted from the source and moved to the target. These limited options resulted in a fairly simple set of constraints, which cannot be directly applied to the general migration problem. The major difference is that the decision of whether or not we can delete a block from its source volume depends not only on the files initially mapped to this volume but also on the files that will be remapped to it as a result of the migration. Thus, in our ILP-based approach, every block transfer is modeled as creating a copy of this block and a separate decision is made whether or not to delete the block from its source volume.

The problem’s constraints are defined over the set of volumes, files, and blocks from the problem statement in Section 3, the maximum traffic \( T_{max} \), and the load-balancing margin \( \mu \). We define the target size of each volume \( v \) as \( w_v \), given as a percentage of the system size after migration. By default, \( w_v=1/|V| \). For each pair of volumes, \( v,u \), we define their intersection as the set of blocks that are stored on both volumes: \( Intersect_{vu}= \lbrace b|b \in u \wedge b \in v\rbrace \). The intersections are calculated before the constraints are assigned and are used in the formulation below for better readability.

The constraints are expressed in terms of three types of variables that denote the actions performed in the migration: \( x_{fst} \) denotes whether file \( f \) is remapped from its source volume \( s \) to another (target) volume \( t \). \( c_{bst} \) denotes whether block \( b \) is copied from its source volume \( s \) to another (target) volume \( t \). Finally, \( d_{bv} \) denotes whether block \( b \) is deleted from volume \( v \). The solution to the ILP instance is an assignment of 0 or 1 to these variables. The resulting migration plan remaps the set of files for which \( x_{fst}=1 \) (for some volume \( t \)), transfers the blocks for which \( c_{bst}=1 \) to their target volume, and deletes the blocks for which \( d_{bv}=1 \) from their respective volumes.

Constraints and objective. The ILP formulation for migration with load balancing consists of 13 constraint types.

(1)

All variables are Boolean: \( x_{fst},c_{bst},d_{bv}\in \lbrace 0,1\rbrace \).

(2)

A file can be remapped to at most one volume: for every file \( f \) in volume \( s \), \( \sum _{t\in V}x_{fst}\le 1 \).

(3)

A block can be deleted or copied only from a volume it was originally stored in: for every two volumes \( s,t \); if \( b \notin s \), then \( c_{bst}=d_{bs}=0 \).

(4)

A block can be deleted from a volume only if all the files containing it are remapped to other volumes: for every volume \( s \), every block \( b\in s \), and every file \( f \) such that \( (b,f,s)\in I_{S} \), \( d_{bs}\le \sum _{t}{x_{fst}} \).

(5)

A block can be deleted from a volume only if no file containing it is remapped to this volume: for every two volumes \( s,t \), every file \( f \) such that \( f\in s \) and \( f\notin t \), and every block \( b \) such that \( (b,f,s)\in I_{S} \), \( d_{bt}\le 1-x_{fst} \).

(6)

View all of the blocks in the volume intersections as having been copied: for every two volumes \( s,t \) and for every block \( b\in Intersect_{st} \), \( c_{bst}=1 \).

(7)

When a file is remapped, all of its blocks are either copied to the target volume or are initially there (as part of the intersection): for every two volumes \( s,t \) and every block \( b \) and file \( f \) such that \( (b,f,s)\in I_{S} \), \( x_{fst}\le \Sigma _{v\in V}c_{bvt} \).

(8)

A block can be copied to a target volume only from one source volume: for every block \( b \) and volume \( t \), \( \Sigma _{s \text{ such that } b\notin Intersect_{st}}c_{bst}\le 1 \).

(9)

A block must be deleted if there are no files containing it on the volume: for every two volumes \( s,v \) and all files \( f_{s}\in s \), \( f_{v}\in v \) and all blocks \( b \) where \( b\in f_{s} \) and \( b\in f_{v} \), \( d_{bs}\ge 1-\lbrace \Sigma _{f_{s}}(1-\Sigma _{v}x_{f_{s}sv}) + \Sigma _{f_{v}}(x_{f_{v}vs})\rbrace \).

(10)

A block cannot be copied to a target volume if no file will contain it there: for every volume \( t \) and every block \( b\notin t \), \( \Sigma _{s}c_{bst}\le \Sigma _{s}\Sigma _{f\in s \wedge b\in f}x_{fst} \).

(11)

A file cannot be migrated to its initial volume: for every file \( f \) and volume \( v \), \( x_{fvv}=0 \).

(12)

Traffic constraint: the size of all of the copied blocks is not larger than the maximum allowed traffic: \( \sum _{s\in V}\sum _{t\in V}\sum _{b\notin Intersect_{st}}c_{bst}\times size(b)\le T_{max} \).

(13)

Load-balancing constraint: for each volume \( v \),

\( (w_v - \mu)\times Size(S^{\prime }) \le Size(v^{\prime }) \le (w_v + \mu)\times Size(S^{\prime }) \), where \( Size(v^{\prime }) \) is the volume size after migration, i.e., the sum of its non-deleted blocks and blocks copied to it: \( Size(v^{\prime }) = \sum _{b\in v}(1-d_{bv})\times Size(b) + \sum _{s\in V}\sum _{b\notin Intersect_{sv}}c_{bsv}\times Size(b) \). \( Size(S^{\prime }) \) is the size of the system after migration: \( Size(S^{\prime })= \sum _{v\in V}Size(v^{\prime }) \).

Objective: Maximize the sum of sizes of all blocks that are deleted minus all blocks that are copied. This is equivalent to minimizing the overall system size:

\( Max(\sum _{b\in B}Size(b) \times \sum _{s\in V}[d_{bs}-\sum _{t\in V,b\notin Intersect_{st}}c_{bst}]) \).

Constraints 12 and 13 formulate the traffic and load-balancing goals. Constraints 8, 9, and 10 ensure that the solver does not create redundant copies of blocks to artificially comply with the load-balancing constraint. This is similar to the constraint that prevents orphan blocks in the seeding problem [43, 44]. For evaluation purposes, we will also refer to a relaxed formulation of the problem without the load-balancing constraint. In that version, constraints 8, 9, 10, and 13 are removed, considerably reducing the problem complexity.
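
To make the formulation concrete, the following sketch shows how a small part of it can be expressed with the Gurobi C++ interface used in our implementation: it creates the \( x \), \( c \), and \( d \) variables and adds constraints (2), (11), and (12) together with the objective. It is a heavily simplified illustration (dense variable arrays, no intersection handling, and most constraint types omitted), not our actual encoding.

```cpp
// Heavily simplified sketch of building part of the ILP with the Gurobi C++
// API. Variables are stored densely for brevity; the real formulation defines
// them sparsely and adds all 13 constraint types.
#include "gurobi_c++.h"
#include <vector>

void buildSketch(int numFiles, int numVolumes,
                 const std::vector<int>& fileVolume,     // initial volume of each file
                 const std::vector<double>& blockSize,   // size of each sampled block
                 double tMax) {
    GRBEnv env;
    GRBModel model(env);
    int numBlocks = static_cast<int>(blockSize.size());

    // x[f][t]: file f is remapped from its initial volume to volume t.
    std::vector<std::vector<GRBVar>> x(numFiles, std::vector<GRBVar>(numVolumes));
    for (int f = 0; f < numFiles; ++f)
        for (int t = 0; t < numVolumes; ++t)
            x[f][t] = model.addVar(0.0, 1.0, 0.0, GRB_BINARY);

    // c[b][s][t]: block b is copied from volume s to volume t.
    // d[b][v]: block b is deleted from volume v.
    std::vector<std::vector<std::vector<GRBVar>>> c(
        numBlocks, std::vector<std::vector<GRBVar>>(numVolumes, std::vector<GRBVar>(numVolumes)));
    std::vector<std::vector<GRBVar>> d(numBlocks, std::vector<GRBVar>(numVolumes));
    for (int b = 0; b < numBlocks; ++b) {
        for (int s = 0; s < numVolumes; ++s) {
            d[b][s] = model.addVar(0.0, 1.0, 0.0, GRB_BINARY);
            for (int t = 0; t < numVolumes; ++t)
                c[b][s][t] = model.addVar(0.0, 1.0, 0.0, GRB_BINARY);
        }
    }

    // Constraint (2): each file is remapped to at most one volume;
    // constraint (11): never back to its initial volume.
    for (int f = 0; f < numFiles; ++f) {
        GRBLinExpr remaps = 0;
        for (int t = 0; t < numVolumes; ++t) remaps += x[f][t];
        model.addConstr(remaps <= 1);
        model.addConstr(x[f][fileVolume[f]] == 0);
    }

    // Constraint (12): the total size of copied blocks is at most T_max
    // (blocks in Intersect_st are excluded in the full formulation).
    GRBLinExpr traffic = 0;
    for (int b = 0; b < numBlocks; ++b)
        for (int s = 0; s < numVolumes; ++s)
            for (int t = 0; t < numVolumes; ++t)
                if (s != t) traffic += blockSize[b] * c[b][s][t];
    model.addConstr(traffic <= tMax);

    // Objective: maximize deleted size minus copied size.
    GRBLinExpr objective = 0;
    for (int b = 0; b < numBlocks; ++b) {
        for (int v = 0; v < numVolumes; ++v) objective += blockSize[b] * d[b][v];
        for (int s = 0; s < numVolumes; ++s)
            for (int t = 0; t < numVolumes; ++t)
                if (s != t) objective -= blockSize[b] * c[b][s][t];
    }
    model.setObjective(objective, GRB_MAXIMIZE);
    model.optimize();
}
```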

The ILP formulation given in this article is designed for the most general case of data migration, in which any file can be remapped to any volume. In reality, the migration goal might restrict some of the remapping options, potentially simplifying the ILP instance. For example, we can limit the set of volumes that files can be migrated to by eliminating the \( x_{fst} \) and \( c_{bst} \) variables where \( t \) is not in this set. We can similarly restrict the set of volumes that files may be migrated from, or require that specific files are (or are not) remapped.

Complexity and runtime. The complexity of the ILP instance depends on \( |B| \), \( |F| \), and \( |V| \)—the number of blocks, files, and volumes, respectively. The number of variables is \( |V|^2|F| + |V|^2|B| + |V|\times |B| \), corresponding to variable types \( x_{fst} \), \( c_{bst} \), and \( d_{bv} \). Each of the constraints defined on these variables contributes a similar order of magnitude. An exception is constraint 13, which reformulates the system size, twice, to ensure that each individual volume’s size is within the required margin. Indeed, the relaxed formulation without this constraint is significantly simpler than the full formulation.

We use two of the acceleration methods suggested by GoSeed to address the high complexity of the ILP problem. The first is fingerprint sampling, where the problem is solved for a subset of the original system’s blocks. This subset (sample) is generated by preprocessing the block fingerprints and including only those that match a predefined pattern. Specifically, as suggested in [30], a sample generated with sampling degree \( k \) includes only blocks whose fingerprints contain \( k \) leading zeroes, reducing the number of blocks in the problem formulation to \( {1}/{2^k} \) of the original, on average. If the files are large enough, the sample will include the same set of files as the original system. Thus, migration decisions in the real system follow the decisions made for each file in the sample. We discuss the effects of small files in Section 7.

The second acceleration method is solver timeout, which halts the ILP solver’s execution after a predetermined runtime. As a result, the solver returns a feasible solution that is not necessarily optimal. A feasible solution to the ILP problem can be directly translated into a migration plan, that is, a list of files to migrate and their destination volumes. Thus, even if the solution is not optimal (due to sampling or timeout), the process still produces a valid plan for the original system.

We do not repeat the detailed analysis of the effectiveness of these heuristics, which were shown to be effective in earlier studies. Namely, the analysis of GoSeed showed that most of the solver’s progress happens in the beginning of its execution (hence, timing out does not degrade its quality too much) and that it is more effective to reduce the sample size than to run the solver longer on a larger sample as long as the sampling degree is not higher than \( k=13 \). Our experiments with the extended ILP formulation, omitted for brevity, confirmed these findings.

Internal sampling. We introduce a new acceleration method specific to the general migration problem. As noted earlier (and demonstrated in our evaluation), the load-balancing constraint increases the complexity of the problem significantly. This complexity can be reduced, at the cost of reduced accuracy, by defining it only on a subset of the system’s blocks. Recall that the ILP problem is defined on a sample of the original system. We create an even smaller sample on which the load-balancing constraint is defined.

Internal sampling of degree \( k^{\prime } \) creates a sample of the system that includes only blocks with \( k \) leading zeroes and \( k^{\prime } \) trailing zeroes. We choose to sample the least significant bits of the fingerprint to ensure that this sampling is independent of the sampling that generated the initial system snapshot. Table 1 shows a toy system with four blocks (\( B=\lbrace b_1, b_2, b_3,b_4\rbrace \)) and their fingerprints, and whether they are included in the basic and internal samples. For example, with an initial sampling degree \( k=1 \) and an internal sampling degree \( k^{\prime }=1 \), the internal sample consists of the blocks \( B_{k=1,k^{\prime }=1}=\lbrace b_2, b_3\rbrace \). In this example, all constraints will be defined for blocks \( B_{k=1}=\lbrace b_1,b_2, b_3\rbrace \), but block \( b_1 \) will not be included in the load balancing constraint. When defining this constraint, only the blocks in \( B_{k=1,k^{\prime }=1} \) will be considered for the calculation of the size of the system and its volumes before and after the migration.

Table 1. Small System with Four Blocks Indicating Whether They Are Included in the Sample According to the Sampling Rule

Block | Fingerprint | \( k=1 \) | \( k=2 \) | \( k^{\prime }=1 \) | \( k^{\prime }=2 \)
\( b_1 \) | 01110101 | ✓ | ✗ | ✗ | ✗
\( b_2 \) | 00110110 | ✓ | ✓ | ✓ | ✗
\( b_3 \) | 01110100 | ✓ | ✗ | ✓ | ✓
\( b_4 \) | 10110101 | ✗ | ✗ | ✗ | ✗

  • \( k \) is the number of leading zeroes in the basic sampling rule and \( k^{\prime } \) is the number of trailing zeroes in the internal sample.
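
A small sketch of the two sampling predicates follows, assuming fingerprints are available as raw bytes: a block enters the basic sample if its fingerprint has at least \( k \) leading zero bits, and enters the internal sample if it additionally has at least \( k^{\prime } \) trailing zero bits.

```cpp
// Sketch of the basic and internal sampling rules over a raw fingerprint.
// The fingerprint length (in bits) is assumed to be larger than k and k'.
#include <cstdint>
#include <vector>

static bool leadingZeroBits(const std::vector<uint8_t>& fp, int k) {
    for (int i = 0; i < k; ++i)                        // i-th bit from the start
        if (fp[i / 8] & (0x80u >> (i % 8))) return false;
    return true;
}

static bool trailingZeroBits(const std::vector<uint8_t>& fp, int kPrime) {
    for (int j = 0; j < kPrime; ++j)                   // j-th bit from the end
        if (fp[fp.size() - 1 - j / 8] & (1u << (j % 8))) return false;
    return true;
}

bool inBasicSample(const std::vector<uint8_t>& fp, int k) {
    return leadingZeroBits(fp, k);
}

bool inInternalSample(const std::vector<uint8_t>& fp, int k, int kPrime) {
    return leadingZeroBits(fp, k) && trailingZeroBits(fp, kPrime);
}
```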

Internal sampling might become a double-edged sword. Although the size of the linear constraints is dramatically reduced, their number remains the same. Moreover, the load-balancing constraint becomes “easier” to meet, that is, it increases the space of feasible solutions. As a result, partial solutions returned when the solver times out might be better. However, searching for the optimal solution might take longer, possibly causing the solver to time out before it is found. These effects are demonstrated in our evaluation in Section 7.4.


6 CLUSTERING

Overview. Clustering is a well-known technique for grouping objects based on their similarity [1]. It is fast, effective, and is directly applicable to our domain: files are similar if they share a large portion of their blocks. Thus, our approach is to create clusters of similar files and to assign each cluster to a volume, remapping those files that were assigned to a volume different from their original location. Despite its simplicity, three main challenges \( (Ch1-Ch3) \) are involved in applying this idea to the general migration problem.

  • (Ch1) Unpredictable traffic. The traffic required for a migration plan can be calculated only after the clusters have been assigned to volumes. When the clustering decisions are being made, their implications on the overall traffic are unknown and, thus, cannot be taken into consideration.

  • (Ch2) Unpredictable system size. The load-balancing constraint is given in terms of the system’s size after migration. However, this size is required to ensure that the created clusters are within the allowed sizes during the clustering process.

  • (Ch3) High sensitivity. The file similarity metric is based on the precise set of blocks in each file. When this metric is applied to a sample of the storage system’s fingerprints, it is highly sensitive to the sampling degree and rule.

We address these challenges with several heuristics \( (H1-H4) \):

  • (H1) Traffic weight. We define a new similarity metric for files. This metric is a weighted sum of the files’ content similarity and a new distance metric that indicates how many source volumes contain files within a cluster. Our algorithm considers files as similar if they contain the same blocks and are mapped to the same source volume. Assigning a higher weight (\( W_T \)) to the content similarity will result in a smaller system but higher migration traffic.

  • (H2a) Estimated system size. We further use the weight to estimate the size of the system after migration. We calculate the size of a hypothetical system without duplicates, and predict that higher migration traffic will bring the system closer to this hypothetical optimum.

  • (H2b) Clustering retries. We use the estimated final system size to determine the maximum allowed cluster size. During the clustering process, we stop adding files to clusters that reach this size. If the process halts due to this limitation, we increase the maximum size by a small margin and restart it.

  • (H3) Randomization. Instead of deterministic clustering decisions, we choose a random option from the set of best options. Different random seeds potentially result in different systems.

  • (H4) Multiple executions. Our heuristics introduce several parameters, and we would not want to overfit their values to a specific system. We use the same initial state to perform repeated executions of the clustering process with multiple sets of parameter combinations (180 in our case) and choose the best migration plan from those executions as our final output.

In the following, we give the required background on the clustering process and describe each of our optimizations in detail.

Hierarchical clustering. Hierarchical clustering [28, Chapter 7] is an iterative clustering process that, in each iteration, merges the most similar pair of clusters into a new cluster. The input is an initial set of objects, which are viewed as clusters of size 1. The process creates a tree whose leaves are the initial objects, and internal nodes are the clusters they are merged into. For example, Figure 3 shows the clustering hierarchy created from the set of initial objects \( \lbrace F_1, \ldots , F_5\rbrace \), where the clusters \( \lbrace C_1, \ldots , C_4\rbrace \) were created in order of their indices.

Fig. 3. Hierarchical clustering with the files from Figure 1 (a) and the respective distance matrices (b).

Hierarchical clustering naturally lends itself to grouping of files. Intuitively, files that share a large portion of their blocks are similar and should thus belong to the same cluster and eventually to the same volume. For example, the initial objects in Figure 3 represent the files in Figure 1(a): \( F_4 \) and \( F_5 \) share two blocks and are thus merged into the first cluster, \( C_1 \). Our clustering-based approach is simple: we group the files into a number of clusters equal to the number of volumes in the system and assign one cluster to each volume. This assignment implies which files should be remapped and which blocks should be transferred and/or deleted in the migration. For example, for a system with three volumes, we could halt the clustering process in Figure 3, resulting in a final set of three clusters: \( \lbrace C_1,C_2,F_3\rbrace \). We extend this basic approach to the general migration problem, that is, to maximize the deletion while complying with the traffic and load-balancing constraints.

File similarity. The hierarchical clustering process relies on a similarity function that indicates which pair of clusters to merge in each iteration. We use the commonly used Jaccard index [28] for this purpose. For two sets \( A \) and \( B \), their index is defined as \( {J(A,B)= {|A\cap B|}/{|A\cup B|}} \). We view each file as a set of blocks; thus, the Jaccard index for a pair of files is the portion of their shared blocks. From here on, we refer to the complement of the index: the Jaccard distance, defined as \( dist_J(A,B)=1-J(A,B) \). This is to comply with the standard terminology in which the two clusters with the smallest distance are merged in each iteration. For example, the leftmost table in Figure 3 depicts the distance matrix for the files in Figure 1. Indeed, the distance is smallest for the pair \( F_4 \) and \( F_5 \), which are the first to be merged.
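
As a concrete example, the Jaccard distance between two files can be computed directly from their (sampled) block sets, as in the following sketch.

```cpp
// Sketch: Jaccard distance between two files, each represented by the set of
// its (sampled) block fingerprints.
#include <cstddef>
#include <cstdint>
#include <unordered_set>

double jaccardDistance(const std::unordered_set<uint64_t>& a,
                       const std::unordered_set<uint64_t>& b) {
    if (a.empty() && b.empty()) return 0.0;   // two empty sets are identical
    std::size_t shared = 0;
    for (uint64_t blk : a)
        if (b.count(blk)) ++shared;
    std::size_t unionSize = a.size() + b.size() - shared;
    return 1.0 - static_cast<double>(shared) / static_cast<double>(unionSize);
}
```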

The Jaccard distance could easily be applied to entire clusters, which can themselves be viewed as sets of blocks. However, calculating the distance between each new cluster and all existing clusters would require repeated traversals of the original file recipes in each iteration. This complexity is addressed in hierarchical clustering by defining a linkage function, which determines the distance between the merged cluster and existing clusters based on the distances before the merge. We use complete linkage, defined as follows: \( dist_J(A\cup B,C) = max \lbrace dist_J(A,C), dist_J(B,C)\rbrace \). For example, the row for \( C_1 \) in the second distance matrix in Figure 3 lists the distances between \( C_1 \) and each of the remaining files.

Traffic considerations \( (H1) \). We limit the traffic required by our migration plan in two ways. The first is the assignment of the final clusters to volumes: we assign each cluster to the volume that contains the largest number of its blocks. We calculate the size of the intersection (in terms of the size of the shared blocks) between each cluster and each volume in the initial system, and then iteratively pick the \( \langle cluster,volume \rangle \) pair with the largest intersection from the clusters and volumes that have not yet been assigned.

This assignment alone might still result in excessive traffic, especially if highly similar files are initially scattered across many different volumes. To avoid such situations, our second measure incorporates the traffic considerations into the clustering process itself. We define the volume distance, \( dist_V(C) \), of a cluster as the portion of the system’s volumes whose files are included in the cluster. For example, in Figure 3, \( dist_V(C_1)= {1}/{3} \) and \( dist_V(C_2)= {2}/{3} \).

We then define a new weighted distance metric that combines the Jaccard distance and the volume distance: \( dist_W(A,B) = W_T\times dist_J(A,B) + (1-W_T)\times dist_V(A\cup B) \), where \( 0\le W_T \le 1 \) is the traffic weight. Intuitively, increasing \( W_T \) increases the amount of traffic allocated for the migration, which increases the priority of deduplication efficiency over the network transfer cost. Nevertheless, it does not guarantee compliance with a specific traffic constraint. We address this limitation by multiple executions, described later.
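
The weighted distance can be sketched as follows; each cluster is assumed to carry the set of initial volumes of its files, so that the volume distance of the merged cluster is simply the size of the union of these sets divided by the number of volumes.

```cpp
// Sketch of the weighted distance dist_W: distJ is the Jaccard distance of the
// two clusters, volumesA/volumesB are the initial volumes of their files, and
// trafficWeight is W_T in [0, 1].
#include <unordered_set>

double weightedDistance(double distJ,
                        const std::unordered_set<int>& volumesA,
                        const std::unordered_set<int>& volumesB,
                        int numVolumes, double trafficWeight) {
    std::unordered_set<int> merged(volumesA);
    merged.insert(volumesB.begin(), volumesB.end());
    double distV = static_cast<double>(merged.size()) / numVolumes;
    return trafficWeight * distJ + (1.0 - trafficWeight) * distV;
}
```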

Load-balancing considerations \( (H2) \). We enforce the load-balancing constraint by preventing merges that result in clusters that exceed the maximal volume size. We determine the maximal cluster size by estimating the system’s size after migration. Intuitively, we expect that increasing the traffic allocated for migration will increase the reduction in system size. We estimate this traffic with the \( W_T \) weight described earlier. Formally, we estimate the size of the final system as \( Size(W_T) = W_T \times Size_{uniq} + (1-W_T)\times size(S_{init}) \), where \( Size_{uniq} \) is the size of all of the unique blocks in the system. Thus, the maximal cluster size is \( C_{max}= {Size(W_T)}/{|V|} \).

In each clustering iteration, we ensure that the merged cluster is not larger than \( C_{max} \). This requirement might result in the algorithm halting before the target number of clusters is reached due to merging decisions made earlier in the process. If this happens, we increase the value of \( C_{max} \) by a small \( \epsilon \) and retry the clustering process. We continue retrying until the algorithm creates the required number of clusters. A small \( \epsilon \) can potentially yield the most balanced system but might require too many retries. We use \( \epsilon =5\% \) as our default.
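
The size estimate and the retry loop can be sketched as follows; the clustering attempt itself is abstracted as a placeholder predicate, and the numeric values are examples only.

```cpp
// Sketch of the maximal cluster size estimate (H2a) and the retry loop (H2b).
#include <cstdio>

int main() {
    const double sizeUniq = 700.0;   // total size of unique blocks (example value)
    const double sizeInit = 1000.0;  // initial system size (example value)
    const double wT = 0.6;           // traffic weight W_T
    const int numVolumes = 3;
    const double epsilon = 0.05;     // default relaxation step (5%)

    double estimated = wT * sizeUniq + (1.0 - wT) * sizeInit;
    double cMax = estimated / numVolumes;

    // Placeholder for one clustering attempt: it fails if the size cap
    // prevents reaching numVolumes clusters.
    auto tryClustering = [](double cap) { return cap >= 300.0; };

    while (!tryClustering(cMax)) {
        cMax *= 1.0 + epsilon;                 // relax the cap and retry
        std::printf("retrying with C_max = %.1f\n", cMax);
    }
    std::printf("final C_max = %.1f\n", cMax);
    return 0;
}
```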

Sensitivity to sample \( (H3) \). As in the ILP-based approach, we apply the hierarchical clustering process to a sample of the system rather than to the complete set of blocks, which can be too large. However, it turns out that the Jaccard distance is highly sensitive to the precise set of blocks that represent each file in the sample. We found in our initial experiments that different sampling degrees as well as different sampling rules (e.g., \( k \) leading ones instead of \( k \) leading zeroes in the fingerprint) result in small differences in the Jaccard distance of the file pairs.

Such small differences might change the entire clustering hierarchy, even if the practical difference between the pairs of files is very small. Thus, rather than merging the pair of clusters with the smallest distance, we merge a random pair from the set of pairs with the smallest distances. We include in this set only pairs whose distance is within a certain percentage of the minimum distance. Thus, for a maximum distance difference \( gap \), we choose a random pair \( \langle C_i, C_j \rangle \) from the 10 (or fewer) pairs for which \( dist_W(C_i, C_j) \le \text{minimum distance} \times (1+gap) \).
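
The randomized choice can be sketched as follows, assuming the candidate pairs and their weighted distances have already been computed for the current iteration.

```cpp
// Sketch of the randomized merge choice: among the pairs whose weighted
// distance is within 'gap' of the minimum, keep at most 10 and pick one at
// random. 'pairs' is assumed to be non-empty.
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Candidate { int clusterA; int clusterB; double dist; };

Candidate pickMerge(const std::vector<Candidate>& pairs, double gap, std::mt19937& rng) {
    double minDist = std::min_element(pairs.begin(), pairs.end(),
        [](const Candidate& x, const Candidate& y) { return x.dist < y.dist; })->dist;
    std::vector<Candidate> close;
    for (const Candidate& c : pairs)
        if (c.dist <= minDist * (1.0 + gap)) close.push_back(c);
    std::sort(close.begin(), close.end(),
              [](const Candidate& x, const Candidate& y) { return x.dist < y.dist; });
    if (close.size() > 10) close.resize(10);
    std::uniform_int_distribution<std::size_t> pick(0, close.size() - 1);
    return close[pick(rng)];
}
```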

Constructing the final migration plan \( (H4) \). The main advantage of our clustering-based approach is its relatively fast runtime. Constructing the initial distance matrix for the individual files is time-consuming, but the same initial matrix can be reused for all of the consecutive clustering processes on the same initial system. We exploit this advantage to eliminate the sensitivity of our clustering process to the many parameters introduced in this section. For the same given system and migration constraints, we execute the clustering process with six traffic weights (\( W_T \in \lbrace 0,0.2,0.4,0.6,0.8,1\rbrace \)), three gaps (\( gap \in \lbrace 0.5\%, 1\%, 3\% \rbrace \)), and ten random seeds. This results in a total of 180 executions, some of which are performed in parallel (depending on the resources of the evaluation platform). We calculate the deletion, traffic, and balance of each migration plan (on the sample used as the input for clustering). As our final result, we use the plan with the best deletion that satisfies the load-balancing and traffic constraints.

We also include in our evaluation a relaxed scheme without the load-balancing constraint (\( C_{max}=\infty \)). In this scheme, the final migration plan must satisfy only the traffic constraint.


7 EVALUATION

We wish to answer three main questions: (1) how the algorithms compare in terms of the final system size, load balancing, and runtime, (2) how the performance of the different algorithms is affected by the various system and problem parameters, and (3) how the performance of the different algorithms is affected by their internal parameters. In the following, we describe our evaluation setup and the experiments executed to answer those questions.

7.1 Experimental Setup

We ran our experiments on a server running Ubuntu 18.04.3, equipped with 128 GB DDR4 RAM (with 2,666 MHz bus speed), an Intel® Xeon® Silver 4114 CPU (with hyper-threading functionality) running at 2.20 GHz, one Dell® T1WH8 240 GB TLC SATA SSD, and one Micron 5200 Series 960 GB 3D TLC NAND Flash SSD.

File system snapshots. We used two static file system snapshots from datasets used to evaluate the seeding problem [43, 44]. The UBC dataset [7, 41] includes file systems of 857 Microsoft employees, of which we used the first 500 file systems (UBC-500). The FSL dataset [10] consists of snapshots of students’ home directories at the FSL Lab at Stony Brook University [50, 51]. We used nine weekly snapshots of nine users between August 28 and October 23, 2014 (Homes). These snapshots include, for each file, the fingerprints of its chunks and their sizes. Each snapshot file represents one entire file system, which is the migration unit in our model, and is represented as one file in our migration problem instances.

We created two additional sets of snapshots from the Linux version archive [6]. Our Linux-all dataset includes snapshots of all of the versions from 2.0 to 5.9.14. We also created a smaller dataset, Linux-skip, which consists of every fifth snapshot. The latter dataset is logically (approximately) \( 5\times \) smaller than the former, although their physical size is almost the same.

The UBC-500 and Homes snapshots were created with variably sized chunks with Rabin fingerprints, whose specified average chunk size is 64 KB. We created the Linux snapshots with an average chunk size of 8 KB because they are much smaller to begin with. We used these sets of snapshots to create six initial systems, with varying numbers of volumes. They are listed in Table 2. We emulate the ingestion of each snapshot into a simplified deduplication system that detects duplicates only within the same volume. In the UBC and Linux systems, we assigned the same number of arbitrary snapshots to each volume. In the Homes-week system, we assigned snapshots from the same week to the same volume such that each volume contains all of the users’ snapshots from a set of 3 weeks. In the Homes-user system, we assign each user to a dedicated volume such that each volume contains all of the weekly snapshots of a set of three users.

Table 2. System Snapshots in our Evaluation

System | Files | \( |V| \) | Chunks | Dedupe | Logical
UBC-500 | 500 | 5 | 382M | 0.39 | 19.5 TB
Homes-week | 81 | 3 | 19M | 0.38 | 8.9 TB
Homes-user | 81 | 3 | 19M | 0.16 | 8.9 TB
Linux-skip | 662 | 5 / 10 | 1.76M | 0.12 / 0.19 | 377 GB
Linux-all | 2703 | 5 | 1.78M | 0.03 | 1.8 TB

  • \( |V| \) is the number of volumes, Chunks is the number of unique chunks, Dedupe is the deduplication ratio (the ratio between the physical and logical size of each system), and Logical is the logical size.

Implementation. All of our algorithms are executed on a sample of the system’s fingerprints to reduce their memory consumption and runtime. We use a sampling degree of \( k=13 \) unless stated otherwise. The final system size after migration, the resulting balance, and the consumed traffic are calculated on the original system’s snapshot. We use a calculator similar to the one used in [43, 44]: we traverse the initial system’s volumes and sum the sizes of blocks that remain in each volume after migration and those that are added to the volume as a result of it. We experimented with three \( T_{max} \) values (20%, 40%, and 100% of each system’s initial size) and three \( \mu \) values (2%, 5%, and 10% of the system size after migration).
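The calculation can be approximated by the following sketch, which derives each volume’s physical size from the final file-to-volume mapping; the types and the single pass over the files are our simplification of the procedure described above.

```cpp
// Sketch of the post-migration size calculation on the full (unsampled)
// snapshots: each block is counted once per volume that holds at least one
// file containing it. Types here are illustrative simplifications.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct Block { std::string fingerprint; uint64_t size; };
struct File  { int volume; std::vector<Block> blocks; };  // volume after migration

std::unordered_map<int, uint64_t> volumeSizesAfterMigration(const std::vector<File>& files) {
    std::unordered_map<int, std::unordered_set<std::string>> stored;  // fingerprints per volume
    std::unordered_map<int, uint64_t> sizes;
    for (const File& f : files)
        for (const Block& b : f.blocks)
            if (stored[f.volume].insert(b.fingerprint).second)  // not yet counted on this volume
                sizes[f.volume] += b.size;
    return sizes;
}
```

The system’s size after migration is the sum of these per-volume sizes, and the balance is the ratio between the smallest and largest of them.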

For our greedy algorithm (Greedy), we maintain a matrix in which we record, for each block, the number of files pointing to it in each volume. We update this matrix to reflect the file remap performed in each iteration. To determine the space-saving ratio of each file, we reread its original snapshot file and look up the counters of its blocks in the matrix. This is more efficient than keeping the list of each file’s blocks in memory. Our Greedy implementation consists of 680 lines of C++ code.
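The following sketch illustrates this bookkeeping; the types and names are illustrative and not taken from our implementation. Applying a remap updates the counters and, as a side effect, yields the size of the blocks freed at the source and of those that must be transferred to the target; the same scan, without modifying the counters, can be used to score a candidate remap.

```cpp
// Sketch of Greedy's per-block reference counters: counters[b][v] is the number
// of files on volume v that contain block b. Applying a remap updates the
// counters and reveals which blocks are freed at the source (last reference
// removed) and which must be transferred to the target (first reference added).
#include <cstdint>
#include <utility>
#include <vector>

struct RemapEffect { uint64_t freedAtSource = 0, transferredToTarget = 0; };

RemapEffect applyRemap(std::vector<std::vector<int>>& counters,
                       const std::vector<std::pair<std::size_t, uint64_t>>& fileBlocks,  // (block id, size)
                       int source, int target) {
    RemapEffect e;
    for (const auto& [blockId, size] : fileBlocks) {
        if (--counters[blockId][source] == 0) e.freedAtSource       += size;
        if (counters[blockId][target]++ == 0) e.transferredToTarget += size;
    }
    return e;
}
```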

For our ILP-based algorithm (ILP), we use the commercial Gurobi optimizer [3] as our ILP solver and use its C++ interface to define our problem instances. We use a two-dimensional array, similar to the matrix used for Greedy, to calculate the set of blocks shared by each pair of volumes. We then create the variables and constraints as we process each snapshot file, freeing the original array from memory before invoking the optimization by Gurobi. Our program for converting the input files into an ILP instance and retrieving the solution from Gurobi consists of 1,860 lines of C++ code. We solve each ILP instance three times, each time with a different random seed. The results in this section are the average of the three executions. Unless stated otherwise, all of our experiments are performed without internal sampling.
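The mechanics of defining an instance through Gurobi’s C++ interface are sketched below with a toy model (three files, two volumes) whose only constraint is that each file resides on exactly one volume. The full formulation adds the block-level variables, the traffic and load-balancing constraints, and the deletion objective; the sketch only shows how variables, constraints, the time limit, and the random seed are set.

```cpp
// A toy sketch of building an ILP instance with the Gurobi C++ interface.
// It only illustrates the mechanics; the real model contains the block-level
// variables and the traffic, load-balancing, and deletion terms.
// Compile and link against the Gurobi libraries.
#include "gurobi_c++.h"
#include <string>
#include <vector>

int main() {
    GRBEnv env;
    GRBModel model(env);
    model.set(GRB_DoubleParam_TimeLimit, 6.0 * 3600);  // 6-hour timeout
    model.set(GRB_IntParam_Seed, 1);                   // one of the three random seeds

    const int numFiles = 3, numVolumes = 2;
    // x[f][v] = 1 iff file f is mapped to volume v after migration
    std::vector<std::vector<GRBVar>> x(numFiles, std::vector<GRBVar>(numVolumes));
    for (int f = 0; f < numFiles; ++f) {
        GRBLinExpr assignedOnce;
        for (int v = 0; v < numVolumes; ++v) {
            x[f][v] = model.addVar(0.0, 1.0, 0.0, GRB_BINARY,
                                   "x_" + std::to_string(f) + "_" + std::to_string(v));
            assignedOnce += x[f][v];
        }
        model.addConstr(assignedOnce == 1);  // each file resides on exactly one volume
    }
    // ... block-sharing variables, traffic and load-balancing constraints,
    // and the deletion objective are added here in the full formulation ...
    model.optimize();
    return 0;
}
```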

For our clustering algorithm (Cluster), we create an \( |F| \times |B| \) bit matrix to indicate whether each file contains each block, and use it to create the distance matrix (see Figure 3). The clustering process uses and updates only the lower triangle of this matrix. We use the upper triangle to record the initial distances and to reset the lower triangle when the clustering process is repeated for the same system and different input parameters (\( W_T \), \( gap \), or random seed). When the clustering process completes, we use the file-block bit matrix to determine the assignment of clusters to volumes. Our program consists of approximately 1,000 lines of C++ code. Each clustering process is performed on a private copy of the distance matrix within a single thread; our evaluation platform is sufficient for executing six processes in parallel.
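Resetting the matrix between runs then boils down to copying the preserved upper triangle back into the working lower triangle, as in the following sketch (the matrix layout is as described above; the function name is ours).

```cpp
// Sketch of reusing a single |F| x |F| distance matrix across clustering runs:
// the lower triangle holds the working distances that the clustering process
// merges and updates, while the upper triangle preserves the initial distances
// so the lower triangle can be restored before the next run (with a different
// W_T, gap, or random seed).
#include <cstddef>
#include <vector>

using DistanceMatrix = std::vector<std::vector<double>>;

// dist[i][j] with i > j is the working distance; dist[j][i] is the initial one.
void resetLowerTriangle(DistanceMatrix& dist) {
    const std::size_t n = dist.size();
    for (std::size_t i = 1; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j)
            dist[i][j] = dist[j][i];  // copy the initial distance back
}
```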

Each algorithm has different resource requirements. Greedy is single-threaded and requires a simple representation of the system’s snapshot in memory. The ILP solver uses as much memory and as many threads as possible (38 in our case). The clustering algorithm runs in six processes and uses approximately 50% of our server’s memory. We did not measure CPU utilization directly, although the runtime of the algorithms gives another indication of their compute overheads. We included in our evaluation the relaxed (R) version, without the load-balancing constraint, of each algorithm. Our implementation is open source and available online [8].

7.2 Basic Comparison between Algorithms

Figure 4 shows the deletion, that is, the percentage of the initial system’s physical size that was deleted by each algorithm. The deletion is higher for systems that were initially more balanced, that is, the Linux and Homes-week systems. For all systems except UBC-500, Greedy achieved the smallest deletion. For Homes-user, Greedy even increased the system’s size in an attempt to comply with the load-balancing constraint. In UBC-500, there is less similarity and, therefore, less dependency between files, which eliminates some of the advantage that Cluster and ILP have over Greedy; Greedy even outperforms them when \( T_{max}=100\% \).

Fig. 4. Reduction in system size of all systems and all algorithms (with and without load-balancing constraints; \( k=13 \) and \( \mu=2\% \)).

ILP and Cluster achieve deletions comparable to one another, even though the ILP solver attempts to find the theoretically optimal migration plan. We distinguish between two cases when explaining this similarity. In the first case (Linux-skip and Homes), the ILP solver produces an optimal solution on the system’s sample, but it is not optimal when applied to the full (unsampled) system. The deletion of Cluster is up to 1% higher than that of ILP in those cases. In the second case, marked by a red ‘x’ in the figures, ILP times out (after 6 hours in our experiments) and returns a suboptimal solution. Specifically, the complexity of the UBC-500 system demonstrates an interesting limitation of ILP: its deletion with \( T_{max}=20\% \) is higher than with \( T_{max}=100\% \). The reason is that the solution space grows with \( T_{max} \) and, thus, the best solution found when the solver times out is farther from the optimum.

The ‘relaxed’ (R) version of the algorithms, without the load-balancing constraint, usually achieves a higher deletion than the full version. The largest difference is 558%, although the difference is typically smaller and can be as low as 1.3%. In the case of Greedy in the Homes-user system, the relaxed version does not identify any file that can be remapped and does not return any solution.

Figure 5 shows the balance achieved by each algorithm. With a margin of \( \mu=2\% \) and 5 volumes, the balance should be at least \( {18}/{22}=0.82 \). In practice, however, the balance might be lower, for two main reasons. First, Greedy might fail to bring the system to a balanced state if it exhausts (or believes it has exhausted) the maximum traffic allowed for migration. Second, Cluster and ILP generate a migration plan that complies with the load-balancing constraint on the sample but violates it when applied to the full (unsampled) system. The violation is highest in the Linux systems, in which some files (i.e., entire Linux versions) are represented in the sample by only one or two blocks. Nevertheless, specifying the load-balancing constraint successfully improves the load balancing of the system. Without it, the relaxed Cluster and ILP versions create highly unbalanced systems, with some volumes storing no files at all or very few small files. Greedy typically avoids such extremes because it is unable to identify and group similar files in the same volume.
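The 0.82 bound cited above follows directly from the margin, under the assumption that each volume must hold between \( 1/|V|-\mu \) and \( 1/|V|+\mu \) of the system’s physical size and that the balance is the ratio between the smallest and largest volumes:

\[ \text{balance} \;\ge\; \frac{1/|V|-\mu}{1/|V|+\mu} \;=\; \frac{0.20-0.02}{0.20+0.02} \;=\; \frac{18}{22} \;\approx\; 0.82 \quad (|V|=5,\ \mu=2\%). \]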

Fig. 5. Resulting balance of all systems and all algorithms (with and without load-balancing constraints; \( k=13 \) and \( \mu=2\% \)).

Figure 6 shows the runtime of each of the algorithms (note the log scale of the y-axis). Greedy generates a migration plan in the shortest runtime: 20 s or less in all of our experiments. ILP requires the longest time because it attempts to solve an NP-hard problem. Indeed, except for the Homes systems, which have the fewest files, ILP requires more than an hour and often halts at the 6-hour timeout. The runtime of Cluster is longer than that of Greedy and usually shorter than that of ILP. It is still relatively long as a result of performing 180 executions of the clustering process. However, we note that this runtime can be shortened by reducing the number of executions, for example, by using fewer random seeds or gaps. We evaluate the effect of these parameters in the following subsection.

Fig. 6. Algorithm runtime for all systems and all algorithms (with and without load-balancing constraints; \( k=13 \) and \( \mu=2\% \)).

Removing the load-balancing constraint reduces the runtime of ILP and Cluster by one or two orders of magnitude. In ILP, this happens because the problem complexity is significantly reduced. In Cluster, the clustering is completed in a single attempt without having to restart it due to illegal cluster sizes. Surprisingly, removing this constraint from Greedy increases its runtime. The reason is that each iteration in the capacity-reduction step is much longer than those in the load-balancing step, as it examines all possible file remaps between all volume pairs in the system. In the relaxed Greedy version, all traffic is allocated to capacity savings; thus, its runtime increases.

Implications. Our basic comparison leads to several notable observations. (1) Cluster and ILP have a clear advantage over Greedy. This was not the case in previous studies that examined simple cases of migration, that is, seeding [43, 44] and space reclamation [45]. However, the increased complexity of the general migration problem increases the gap between the greedy and the optimal solutions. (2) Cluster is comparable to ILP and might even outperform it, despite the premise of optimality of the ILP-based approach. This is due to a combination of the high complexity of the ILP problem and the ability to execute multiple clustering processes quickly and in parallel. We conclude that hierarchical clustering is highly efficient for grouping similar files and that our heuristics for addressing the traffic and load-balancing constraints are highly effective. (3) In most systems, adding the load-balancing constraint limits the potential capacity reduction. However, this limit is usually modest, that is, several percentage points of the system’s size. The extent of this limitation depends on the degree of similarity between files and the balance of the initial system.

7.3 Sensitivity to Problem Parameters

Effect of sampling degree. Figure 7 shows the deletion, load balancing, and runtime of all of the algorithms on two samples of the Linux-skip system. The small and large samples were generated with sampling degrees of \( k=13 \) and \( k=8 \), respectively. The sample size affects each algorithm differently. Greedy achieves a higher deletion on the larger sample (by up to 238%), as it identifies more opportunities for capacity reduction. In contrast, ILP suffers from the increase in the problem size. It spends more time on finding a feasible solution and has less time for optimization; thus, its deletion on the larger sample is smaller. We repeated the execution of ILP on the large (\( k=8 \)) sample with a longer timeout — 12 hours instead of 6 — but the increase in deletion was minor. This confirmed the observation made for GoSeed [43, 44], that it is more effective to reduce the sample size than to increase the runtime of the ILP solver. The relaxed ILP instance is much simpler; thus, relaxed ILP does not suffer such degradation. Cluster returns similar results for both sample sizes. The differences in the accuracy of the sample are masked by its randomized process.

Fig. 7. Linux-skip system with 5 volumes, \( \mu=2\% \), and two sampling degrees, \( k=8,13 \).

All algorithms return a more balanced system for the larger sample (\( k=8 \)) because the load-balancing constraint is enforced on more blocks and, thus, more accurately. At the same time, as we expected, their runtime was higher by several orders of magnitude, as the large sample included \( 2^5\times \) more blocks than the small one. We note that Greedy is so much faster than ILP and Cluster that its runtime on the large sample is considerably shorter than their runtime on the small one. Thus, if the sample is generated on-the-fly for the purpose of constructing the migration plan, it is possible to execute Greedy on a larger sample for a better migration plan.

Effect of load balancing and traffic constraints. Figure 8 shows the deletion, balance, and traffic consumption of all algorithms on the UBC-500 system with different values of \( T_{max} \) and \( \mu \). The results on this system show the highest sensitivity to these constraints due to the lower similarity between the files. The deletion achieved by all of the algorithms increases as \( T_{max} \) increases, and their traffic consumption increases accordingly. Removing the load-balancing constraint also allows for more deletion, as we observed in Figure 4. At the same time, the precise value of the load-balancing margin, \( \mu \), has a much smaller effect on the achieved deletion, although, in most cases, a lower margin does guarantee a more balanced system. Increasing the margin increases Greedy’s runtime (not shown) as a result of more space-reduction iterations, as discussed earlier. The runtime of ILP and Cluster is not affected by the precise value of \( \mu \).

Fig. 8. UBC-500 system with \( k=13 \) and different load-balancing margins.

Effect of the number of volumes. Figure 9 shows the deletion and runtime of our algorithms on the Linux-skip system when the number of volumes is reduced (to 4), increased (to 6), or larger overall (10). Due to the high similarity between the Linux versions, the same deletion is achieved whether the number of volumes remains 5 or a volume is added or removed (the reduced performance of Cluster with \( \mu=2\% \) is an outlier). When the initial number of volumes is 10, there are more duplicates in the system. This provides more opportunities for deletion, which is indeed higher.

Fig. 9. Linux-skip with different numbers of target volumes (\( T_{max}=100 \), \( k=13 \), \( \mu=2\% \)).

The number of volumes affects the problem’s complexity, and it affects each algorithm differently. Greedy requires less time when a volume is added or removed (compared to a problem in which the number of volumes remains the same), because it spends most of its traffic, and hence more iterations, on the faster load-balancing step. The runtime for a system with 10 volumes is much longer than for a system with only 5 volumes because there are more volume pairs and, thus, more file remap options to consider in each iteration. The ILP problem complexity increases with every additional volume; thus, its runtime increases until it reaches the timeout. The clustering process could potentially stop at an earlier stage when more clusters are needed. However, as the number of clusters increases, the load-balancing constraint is encountered at an earlier stage, causing the clustering to restart more often when the number of volumes is higher. Nevertheless, all of our algorithms successfully generated migration plans for a varying number of volumes, most within less than an hour.

7.4 Sensitivity to Algorithm Parameters

Effect of timeout on ILP. To analyze the effect of the timeout value on ILP, we generated a migration plan for the UBC-500 system with \( \mu=2\% \) and different values of \( T_{max} \), repeating the experiment with increasing timeout values between 3 and 48 hours. The results, presented in Figure 10, show that the effect of the timeout depends on the space of feasible solutions. In this example, increasing \( T_{max} \) increases the number of solutions that meet the traffic constraint, which in turn increases the number of solutions that the ILP solver must consider when searching for the optimal one. With \( T_{max}=20 \), approximately 8 hours were required to find the optimal solution; increasing the timeout beyond this point had no effect. The solutions found within 3 and 6 hours were already very close to the optimal one. With \( T_{max}=40 \), increasing the timeout beyond 6 hours carried diminishing returns, indicating that the solution is likely very close to the optimal one. In contrast, with \( T_{max}=100 \), the solution keeps improving even after 48 hours due to the very large solution space.

Fig. 10. Deletion in the UBC-500 system with \( \mu=2\% \), \( k=13 \), and increasing timeout values.

These results are consistent with the analysis of GoSeed in [43, 44], which showed that the majority of the solver’s progress is typically achieved in the first half of its overall runtime. As we increased the problem’s complexity (by increasing \( T_{max} \)), we increased the time required for the solver to complete its execution, thus increasing its benefit from longer timeouts. The point of diminishing returns could possibly be identified in future heuristics, replacing the fixed timeout with one that dynamically adapts to the problem instance.

Effect of internal sampling on ILP. We evaluated the effectiveness of internal sampling on two systems, UBC-500 and Linux-skip, with four initial sampling degrees (\( k \)) and four internal sampling degrees (\( k^{\prime } \)). We use \( \mu =2\% \) as in the rest of the experiments. Figure 11 shows the results for the UBC-500 system. With \( T_{max}=20 \), the solution obtained with \( k=13 \) was close to optimal (see earlier discussion). Nevertheless, the internal sampling relaxed the constraints to allow more efficient solutions. Note that increasing the initial sampling to \( k=14,15 \) had a similar effect.

Fig. 11. Reduction in system size, balance, and runtime of UBC-500 with \( \mu=2\% \) and varying degrees of initial sampling, \( k \) (X axis), and internal sampling, \( k^{\prime} \).

With \( T_{max}=100 \), where the solution space is initially very large, the effect of internal sampling was different with different initial sampling degrees. With \( k=12 \), the solution space became so large that the deletion decreased with \( k^{\prime } \) as a result of the solver timing out farther from the optimal solution. With \( k=13 \), the solver found a better solution with an internal sample of \( k^{\prime }=1 \) but increasing \( k^{\prime } \) reduced the quality of the solution. With \( k=14,15 \), the initial sample was small enough to prevent these negative effects. In general, increasing the internal sampling degree increased the solver’s runtime and reduced the system’s balance when compared with migration plans without internal sampling.

Figure 12 shows the results for the Linux-skip system. Recall that this system is much smaller, with some files represented by as few as one or two blocks in the initial sample with \( k=13 \). Thus, we used smaller initial sampling degrees for this system. Nevertheless, internal sampling resulted in an infeasible ILP problem (i.e., there is no solution that satisfies its constraints) when the combined sampling degrees were too high: \( k=11,12 \) with \( k^{\prime}=3 \) and \( k=13 \) with \( k^{\prime}=2,3 \). This is the result of some files having a size of zero in the load-balancing constraint. As in the UBC-500 system, increasing the internal sampling degree increased the solver’s runtime and reduced the system’s balance, except for some anomalies due to the aggressive sampling.

Fig. 12. Reduction in system size, balance, and runtime of Linux-skip with \( \mu=2\% \) and varying degrees of initial sampling (X axis) and internal sampling (\( k^{\prime} \)). Red \( X \)’s indicate infeasible ILP instances.

We conclude that internal sampling is not an effective acceleration heuristic: it increases the space of feasible solutions without reducing the number of variables in the ILP instance. As a result, it increases the time required to find an optimal solution instead of reducing it. For a large system, it is more effective to increase the initial sampling degree. Doing so reduces the size of the problem (rather than its complexity), resulting in shorter runtimes and better migration plans. However, in systems with small files, care must be taken not to reduce the size of the sampled system excessively, as this might result in an infeasible ILP instance or may have negative effects on the system’s balance after migration.

Effect of fingerprint sampling on Cluster. When creating a fingerprint sample for representing the system, the sampling rule is the pattern that is required for including a fingerprint (and corresponding block) in the sample. The default sampling rule, described in [30] and used in our evaluation, is leading zeroes. The sampling rule is orthogonal to the sampling degree and does not affect the accuracy of the sketch. However, it does affect the precise set of blocks that are included in the sample. Our initial experiments revealed that the hierarchical clustering algorithm is highly sensitive to the blocks that represent each file in the sample.

To illustrate this sensitivity, we define two additional sampling rules. The leading-ones rule requires that the block fingerprint contains \( k \) leading ones, and the alternating zero-one rule requires a pattern of \( k \) alternating bits as the prefix of the fingerprint. For example, with \( k=7 \) and the alternating zero-one rule, only blocks whose fingerprint starts with 0101010 will be included in the sample. We created a system of five arbitrary files and calculated its sample with \( k=13 \) and the three sampling rules. We then calculated the Jaccard distance (dissimilarity) of all possible file pairs in each of the samples. The resulting dissimilarities are illustrated in Figure 13.
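The three rules amount to different prefix predicates on the fingerprint bits, and the dissimilarity of a file pair is the Jaccard distance between their sampled fingerprint sets. The following sketch illustrates both; fingerprints are represented as bit strings purely for readability, whereas the actual implementation inspects the raw hash bits.

```cpp
// Sketch of the sampling rules as k-bit prefix predicates on a fingerprint,
// and of the Jaccard distance between two files' sampled fingerprint sets.
// Fingerprints are shown as bit strings here for simplicity.
#include <cstddef>
#include <string>
#include <unordered_set>

bool leadingZeros(const std::string& fpBits, int k) {
    return fpBits.compare(0, k, std::string(k, '0')) == 0;
}
bool leadingOnes(const std::string& fpBits, int k) {
    return fpBits.compare(0, k, std::string(k, '1')) == 0;
}
bool alternatingZeroOne(const std::string& fpBits, int k) {
    for (int i = 0; i < k; ++i)                        // with k=7: required prefix 0101010
        if (fpBits[i] != ((i % 2 == 0) ? '0' : '1')) return false;
    return true;
}

// Jaccard distance (dissimilarity) of two sampled fingerprint sets:
// 1 - |intersection| / |union|.
double dissimilarity(const std::unordered_set<std::string>& a,
                     const std::unordered_set<std::string>& b) {
    std::size_t common = 0;
    for (const std::string& fp : a) common += b.count(fp);
    const std::size_t unionSize = a.size() + b.size() - common;
    return unionSize == 0 ? 0.0 : 1.0 - static_cast<double>(common) / unionSize;
}
```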

Fig. 13. Dissimilarities under different sampling rules with \( k=13 \).

The results show that different sampling methods result in different dissimilarity values and, more importantly, in a different ordering of the pairs based on that value. This order dictates which files are merged in the initial steps of the clustering process. Recall that the clustering algorithm merges the pair of files with the lowest dissimilarity regardless of its absolute value. In our example, the pair \( \langle C, D\rangle \) is the most similar pair in all of the samples, but this is not always the case. Consider, for example, the three pairs \( \langle A, B\rangle \) (white), \( \langle A, E\rangle \) (green), and \( \langle B, E\rangle \) (black). According to the leading-zeros sampling rule, the pairs are ranked (best to worst, low to high): \( \langle A, E\rangle \), \( \langle B, E\rangle \), \( \langle A, B\rangle \). Alternatively, with the leading-ones rule, the order becomes \( \langle B, E\rangle \), \( \langle A, B\rangle \), \( \langle A, E\rangle \). Finally, the alternating zero-one rule results in the order \( \langle B, E\rangle \), \( \langle A, E\rangle \), \( \langle A, B \rangle \).

We repeated this experiment with different sampling rules and degrees on the system snapshots from Table 2 and observed similar behavior: different samples resulted in different merging decisions in the initial stages of the clustering process, eventually leading to entirely different migration plans. This observation motivated the addition of randomization to the clustering process, as described in Section 6.

Effect of randomization on Cluster. Figure 14 shows the range of deletion values and traffic usage of the migration plans generated by Cluster for Linux-all with \( k=13 \). Each bar shows the 25th, 50th, and 75th percentiles, and the whiskers show the minimum and maximum values achieved with different random seeds for each combination of \( gap \) and \( W_T \).

Fig. 14. The distribution of migration traffic (a) and reduction in system size (b) in the set of plans returned by Cluster for Linux-all with \( k=13 \).

Our results show that different random seeds can result in large differences in the deletion and traffic: up to 84% and 400%, respectively, when \( W_T \) and \( gap \) are fixed. At the same time, \( W_T \) cannot predict the actual traffic used by the migration plan, since it is only used heuristically to simulate the traffic constraint. This is the reason for repeating the clustering process for a range of \( W_T \) values. Indeed, different \( W_T \) values result in very different values of deletion. For a given \( W_T \), the ranges of deletion and traffic values generated by different \( gap \) values are similar. Thus, as no \( gap \) consistently outperforms the others, executing the clustering with one or two gaps instead of three will likely have a limited effect on the results while significantly reducing the runtime.

We repeated the same experiment with different numbers of random seeds. Figure 15 shows that increasing the number of seeds from 5 to 15 (thus increasing the number of runs from 90 to 270) carries diminishing returns. Thus, in practice, it is possible to halt the algorithm when additional runs do not improve the best solution found so far.
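A possible realization of this early-stopping rule is sketched below. The Plan type and the placeholder helpers are the same illustrative stand-ins used earlier, and the patience threshold is an assumed knob rather than a parameter of our implementation.

```cpp
// Sketch of halting the seed sweep once 'patience' consecutive runs fail to
// improve the best feasible deletion; all names are illustrative assumptions.
#include <optional>

struct Plan { double deletion = 0, traffic = 0, balance = 0; };
Plan runClustering(double trafficWeight, double gap, int seed);  // placeholder
void evaluatePlan(Plan& p);                                      // placeholder

std::optional<Plan> sweepSeeds(double wT, double gap, double Tmax,
                               double minBalance, int patience = 5) {
    std::optional<Plan> best;
    int runsSinceImprovement = 0;
    for (int seed = 0; runsSinceImprovement < patience; ++seed) {
        Plan p = runClustering(wT, gap, seed);
        evaluatePlan(p);
        const bool feasible = p.traffic <= Tmax && p.balance >= minBalance;
        if (feasible && (!best || p.deletion > best->deletion)) {
            best = p;
            runsSinceImprovement = 0;   // improvement: keep going
        } else {
            ++runsSinceImprovement;     // no improvement: count toward stopping
        }
    }
    return best;
}
```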

Fig. 15. The distribution of migration traffic (a) and reduction in system size (b) in the set of plans returned by Cluster for Linux-all with \( k=13 \) under different numbers of random seeds.

Finally, we examine whether the sensitivity to the configuration parameters differs between systems. We execute Cluster with three gaps and 10 random seeds on three different systems. Figure 16 shows the results for each system and each value of \( W_T \). As we expected, the traffic consumption differs between systems. The results for Homes-user demonstrate that smaller systems are more sensitive to this parameter because each file constitutes a larger portion of the entire system. Thus, when the traffic weight of the similarity metric is small (small \( W_T \)), there are limited options for clustering in the early stages of the process and all random seeds result in the same plan. However, as this weight increases (\( W_T\gt 0.6 \)), the difference between the migration plans increases dramatically.

Fig. 16. The distribution of migration traffic (a) and reduction in system size (b) in the set of plans returned by Cluster for different datasets with \( k=13 \).


8 CONCLUSIONS AND FUTURE CHALLENGES

We formulated the general migration problem for storage systems with deduplication and presented three algorithms for generating an efficient migration plan. Our evaluation showed that the greedy approach is the fastest but least effective and that the clustering-based approach is comparable to the one based on ILP, despite ILP’s premise of optimality. While the ILP-based approach guarantees a near-optimal solution (given sufficient runtime), clustering lends itself to a range of optimizations that enable it to produce such a solution faster.

All of our approaches can be applied to more specific cases of migration, presenting additional opportunities for further optimizations in the future. For example, thanks to its short runtime, we can use Greedy to generate multiple plans with different traffic constraints. These plans are points on the Pareto frontier [59], that is, they represent different trade-offs between the conflicting objectives of maximizing deletion and minimizing traffic. The multiple executions in the clustering algorithm provide a similar set of options.
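For example, assuming each candidate plan is summarized by its deletion and the traffic it consumes, the non-dominated plans can be filtered with a simple sketch such as the following (the PlanPoint type is illustrative):

```cpp
// Sketch of extracting the Pareto frontier from a set of candidate plans:
// a plan is dominated if another plan deletes at least as much data while
// consuming no more traffic, and is strictly better in at least one of the two.
#include <vector>

struct PlanPoint { double deletion = 0, traffic = 0; };

std::vector<PlanPoint> paretoFrontier(const std::vector<PlanPoint>& plans) {
    std::vector<PlanPoint> frontier;
    for (const PlanPoint& p : plans) {
        bool dominated = false;
        for (const PlanPoint& q : plans) {
            if (q.deletion >= p.deletion && q.traffic <= p.traffic &&
                (q.deletion > p.deletion || q.traffic < p.traffic)) {
                dominated = true;
                break;
            }
        }
        if (!dominated) frontier.push_back(p);
    }
    return frontier;
}
```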

Applying our approach in a live deduplicated system introduces several challenges, such as collecting and generating the system’s snapshot as input to the algorithms, efficiently updating the metadata, determining the migration schedule, and adjusting it if new files are added to the system during this process. We leave these challenges for future work.


ACKNOWLEDGMENTS

We thank Aviv Nachman for help with the ILP approach, Nadav Elias for the Linux snapshots, and Danny Harnik for insightful discussions.

REFERENCES

[1] Cluster analysis. (n.d.). Retrieved October 24, 2020 from https://en.wikipedia.org/wiki/Cluster_analysis.
[2] CPLEX Optimizer. (n.d.). Retrieved October 24, 2018 from https://www.ibm.com/analytics/cplex-optimizer.
[3] The Fastest Mathematical Programming Solver. (n.d.). Retrieved October 24, 2018 from http://www.gurobi.com/.
[4] GLPK (GNU Linear Programming Kit). (n.d.). Retrieved October 24, 2018 from https://www.gnu.org/software/glpk/.
[5] Introduction to lp_solve 5.5.2.5. (n.d.). Retrieved October 24, 2018 from http://lpsolve.sourceforge.net/5.5/.
[6] Linux Kernel Archives. (n.d.). Retrieved October 24, 2020 from https://mirrors.edge.kernel.org/pub/linux/kernel/.
[7] SNIA IOTTA Repository. (n.d.). Retrieved October 24, 2018 from http://iotta.snia.org/tracetypes/6.
[8] Source code of migration algorithms. (n.d.). Retrieved February 22, 2022 from https://github.com/roei217/DedupMigration.
[9] SYMPHONY development home page. (n.d.). Retrieved October 24, 2018 from https://projects.coin-or.org/SYMPHONY.
[10] Traces and Snapshots Public Archive. (n.d.). Retrieved October 24, 2018 from http://tracer.filesystems.org/.
[11] What is data deduplication? (n.d.). Retrieved September 17, 2022 from https://www.netapp.com/data-management/what-is-data-deduplication/.
[12] 2022. DELL PowerStore T STORAGE FAMILY. Retrieved September 17, 2022 from https://www.delltechnologies.com/asset/he-il/products/storage/technical-support/dell-powerstore-3-0-spec-sheet.pdf.
[13] Aggarwal Bhavish, Akella Aditya, Anand Ashok, Balachandran Athula, Chitnis Pushkar, Muthukrishnan Chitra, Ramjee Ramachandran, and Varghese George. 2010. EndRE: An end-system redundancy elimination service for enterprises. In 7th USENIX Conference on Networked Systems Design and Implementation (NSDI’10). USENIX Association, San Jose, CA, 1–14.
[14] Allu Yamini, Douglis Fred, Kamat Mahesh, Prabhakar Ramya, Shilane Philip, and Ugale Rahul. 2018. Can’t we all get along? Redesigning protection storage for modern workloads. In 2018 USENIX Annual Technical Conference (USENIX ATC’18). USENIX Association, Boston, MA, 705–717.
[15] Anderson Eric, Hall Joseph, Hartline Jason D., Hobbs Michael, Karlin Anna R., Saia Jared, Swaminathan Ram, and Wilkes John. 2001. An experimental study of data migration algorithms. In 5th International Workshop on Algorithm Engineering (WAE’01). Springer-Verlag, Aarhus, Denmark, 145–158.
[16] Anderson Eric, Hobbs Michael, Keeton Kimberly, Spence Susan, Uysal Mustafa, and Veitch Alistair. 2002. Hippodrome: Running circles around storage administration. In 1st USENIX Conference on File and Storage Technologies (FAST’02). USENIX Association, Monterey, CA, 175–188.
[17] Balasubramanian Bharath, Lan Tian, and Chiang Mung. 2014. SAP: Similarity-aware partitioning for efficient cloud storage. In IEEE Conference on Computer Communications (INFOCOM’14). IEEE, Toronto, Canada, 592–600.
[18] Bhagwat Deepavali, Eshghi Kave, Long Darrell D. E., and Lillibridge Mark. 2009. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS’09). IEEE, London, UK, 1–9.
[19] Chen Feng, Luo Tian, and Zhang Xiaodong. 2011. CAFTL: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In 9th USENIX Conference on File and Storage Technologies (FAST’11). USENIX Association, San Jose, CA, 1–14.
[20] Clements Austin T., Ahmad Irfan, Vilayannur Murali, and Li Jinyuan. 2009. Decentralized deduplication in SAN cluster file systems. In 2009 Conference on USENIX Annual Technical Conference (USENIX’09). USENIX Association, San Diego, CA, 1–14.
[21] Debnath Biplob, Sengupta Sudipta, and Li Jin. 2010. ChunkStash: Speeding up inline storage deduplication using flash memory. In 2010 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’10). USENIX Association, Boston, MA, 1–16.
[22] Dong Wei, Douglis Fred, Li Kai, Patterson Hugo, Reddy Sazzala, and Shilane Philip. 2011. Tradeoffs in scalable data routing for deduplication clusters. In 9th USENIX Conference on File and Storage Technologies (FAST’11). USENIX Association, San Jose, CA, 1–15.
[23] Douglis Fred, Duggal Abhinav, Shilane Philip, Wong Tony, Yan Shiqin, and Botelho Fabiano. 2017. The logic of physical garbage collection in deduplicating storage. In 15th USENIX Conference on File and Storage Technologies (FAST’17). USENIX Association, Santa Clara, CA, 29–44.
[24] Dubnicki Cezary, Gryz Leszek, Heldt Lukasz, Kaczmarczyk Michal, Kilian Wojciech, Strzelczak Przemyslaw, Szczepkowski Jerzy, Ungureanu Cristian, and Welnicki Michal. 2009. HYDRAstor: A scalable secondary storage. In 7th Conference on File and Storage Technologies (FAST’09). USENIX Association, San Francisco, CA, 197–210.
[25] Duggal Abhinav, Jenkins Fani, Shilane Philip, Chinthekindi Ramprasad, Shah Ritesh, and Kamat Mahesh. 2019. Data domain cloud tier: Backup here, backup there, deduplicated everywhere! In 2019 USENIX Annual Technical Conference (USENIX ATC’19). USENIX Association, Renton, WA, 647–660.
[26] Fu Min, Feng Dan, Hua Yu, He Xubin, Chen Zuoning, Xia Wen, Huang Fangting, and Liu Qing. 2014. Accelerating restore and garbage collection in deduplication-based backup systems via exploiting historical information. In 2014 USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Philadelphia, PA, 181–192.
[27] Gabor Ron, Weiss Shlomo, and Mendelson Avi. 2007. Fairness enforcement in switch on event multithreading. ACM Transactions on Architecture and Code Optimization 4, 3 (Sept. 2007), 15–es.
[28] Greenacre Michael and Primicerio Raul. 2013. Multivariate Analysis of Ecological Data. Fundación BBVA, Bilbao, Chapter Hierarchical Cluster Analysis.
[29] Guo Fanglu and Efstathopoulos Petros. 2011. Building a high-performance deduplication system. In 2011 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’11). USENIX Association, Portland, OR, 1–25.
[30] Harnik Danny, Hershcovitch Moshik, Shatsky Yosef, Epstein Amir, and Kat Ronen. 2019. Sketching volume capacities in deduplicated storage. In 17th USENIX Conference on File and Storage Technologies (FAST’19). USENIX Association, Boston, MA, 107–119.
[31] Huang Cheng, Simitci Huseyin, Xu Yikang, Ogus Aaron, Calder Brad, Gopalan Parikshit, Li Jin, and Yekhanin Sergey. 2012. Erasure coding in Windows Azure storage. In 2012 USENIX Annual Technical Conference (USENIX ATC’12). USENIX Association, Boston, MA, 1–12.
[32] Kaczmarczyk Michal, Barczynski Marcin, Kilian Wojciech, and Dubnicki Cezary. 2012. Reducing impact of data fragmentation caused by in-line deduplication. In Proceedings of the 5th Annual International Systems and Storage Conference (SYSTOR’12). Association for Computing Machinery, Haifa, Israel, 1–12.
[33] Kisous Roei, Kolikant Ariel, Duggal Abhinav, Sheinvald Sarai, and Yadgar Gala. 2022. The what, the from, and the to: The migration games in deduplicated systems. In 20th USENIX Conference on File and Storage Technologies (FAST’22). USENIX Association, Santa Clara, CA, 265–280.
[34] Li Cheng, Shilane Philip, Douglis Fred, Shim Hyong, Smaldone Stephen, and Wallace Grant. 2014. Nitro: A capacity-optimized SSD cache for primary storage. In 2014 USENIX Annual Technical Conference (USENIX ATC’14). USENIX Association, Philadelphia, PA, 501–512.
[35] Lillibridge Mark, Eshghi Kave, and Bhagwat Deepavali. 2013. Improving restore speed for backup systems that use inline chunk-based deduplication. In 11th USENIX Conference on File and Storage Technologies (FAST’13). USENIX Association, San Jose, CA, 183–197.
[36] Lillibridge Mark, Eshghi Kave, Bhagwat Deepavali, Deolalikar Vinay, Trezise Greg, and Camble Peter. 2009. Sparse indexing: Large scale, inline deduplication using sampling and locality. In 7th Conference on File and Storage Technologies (FAST’09). USENIX Association, San Francisco, CA, 125–138.
[37] Lin Xing, Lu Guanlin, Douglis Fred, Shilane Philip, and Wallace Grant. 2014. Migratory compression: Coarse-grained data reordering to improve compressibility. In 12th USENIX Conference on File and Storage Technologies (FAST’14). USENIX Association, Santa Clara, CA, 265–273.
[38] Lu Chenyang, Alvarez Guillermo A., and Wilkes John. 2002. Aqueduct: Online data migration with performance guarantees. In 1st USENIX Conference on File and Storage Technologies (FAST’02). USENIX Association, Monterey, CA, 219–230.
[39] Manber Udi. 1994. Finding similar files in a large file system. In USENIX Winter 1994 Technical Conference (WTEC’94). USENIX Association, San Francisco, CA, 1–2.
[40] Matsuzawa Keiichi, Hayasaka Mitsuo, and Shinagawa Takahiro. 2018. The quick migration of file servers. In 11th ACM International Systems and Storage Conference (SYSTOR’18). Association for Computing Machinery, Haifa, Israel, 65–75.
[41] Meyer Dutch T. and Bolosky William J. 2011. A study of practical deduplication. In 9th USENIX Conference on File and Storage Technologies (FAST’11). USENIX Association, San Jose, CA, 1–13.
[42] Muthitacharoen Athicha, Chen Benjie, and Mazières David. 2001. A low-bandwidth network file system. In 18th ACM Symposium on Operating Systems Principles (SOSP’01). Association for Computing Machinery, Banff, Canada, 174–178.
[43] Nachman Aviv, Sheinvald Sarai, Kolikant Ariel, and Yadgar Gala. 2021. GoSeed: Optimal seeding plan for deduplicated storage. ACM Transactions on Storage 17, 3, Article 24 (Aug. 2021), 28 pages.
[44] Nachman Aviv, Yadgar Gala, and Sheinvald Sarai. 2020. GoSeed: Generating an optimal seeding plan for deduplicated storage. In 18th USENIX Conference on File and Storage Technologies (FAST’20). USENIX Association, Santa Clara, CA, 193–207.
[45] Nagesh P. C. and Kathpal Atish. 2013. Rangoli: Space management in deduplication environments. In 6th International Systems and Storage Conference (SYSTOR’13). Association for Computing Machinery, Haifa, Israel, 1–6.
[46] Nightingale Edmund B., Elson Jeremy, Fan Jinliang, Hofmann Owen, Howell Jon, and Suzue Yutaka. 2012. Flat datacenter storage. In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI’12). USENIX Association, Hollywood, CA, 1–15.
[47] Rashmi K. V., Shah Nihar B., Gu Dikang, Kuang Hairong, Borthakur Dhruba, and Ramchandran Kannan. 2013. A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster. In 5th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’13). USENIX Association, San Jose, CA, 1–5.
[48] Shilane Philip, Chitloor Ravi, and Jonnala Uday Kiran. 2016. 99 deduplication problems. In 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’16). USENIX Association, Denver, CO, 1–5.
[49] Srinivasan Kiran, Bisson Tim, Goodson Garth, and Voruganti Kaladhar. 2012. iDedup: Latency-aware, inline data deduplication for primary storage. In 10th USENIX Conference on File and Storage Technologies (FAST’12). USENIX Association, San Jose, CA, 1–14.
[50] Sun Zhen, Kuenning Geoff, Mandal Sonam, Shilane Philip, Tarasov Vasily, Xiao Nong, and Zadok Erez. 2016. A long-term user-centric analysis of deduplication patterns. In 32nd Symposium on Mass Storage Systems and Technologies (MSST’16). IEEE/NASA Goddard, Santa Clara, CA, 1–7.
[51] Tarasov Vasily, Mudrankit Amar, Buik Will, Shilane Philip, Kuenning Geoff, and Zadok Erez. 2012. Generating realistic datasets for deduplication analysis. In 2012 USENIX Annual Technical Conference (USENIX ATC’12). USENIX Association, Boston, MA, 261–272.
[52] Tran Nguyen, Aguilera Marcos K., and Balakrishnan Mahesh. 2011. Online migration for geo-distributed storage systems. In 2011 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC’11). USENIX Association, Portland, OR, 1–16.
[53] Weil Sage A., Brandt Scott A., Miller Ethan L., and Maltzahn Carlos. 2006. CRUSH: Controlled, scalable, decentralized placement of replicated data. In ACM/IEEE Conference on Supercomputing (SC’06). ACM/IEEE, Tampa, Florida, 1–12.
[54] Xia Wen, Jiang Hong, Feng Dan, Tian Lei, Fu Min, and Zhou Yukun. 2014. Ddelta: A deduplication-inspired fast delta compression approach. Performance Evaluation 79 (2014), 258–272.
[55] Xia Wen, Zhou Yukun, Jiang Hong, Feng Dan, Hua Yu, Hu Yuchong, Liu Qing, and Zhang Yucheng. 2016. FastCDC: A fast and efficient content-defined chunking approach for data deduplication. In 2016 USENIX Annual Technical Conference (USENIX ATC’16). USENIX Association, Denver, CO, 101–114.
[56] Yan Zhichao, Jiang Hong, Tan Yujuan, and Luo Hao. 2016. Deduplicating compressed contents in cloud storage environment. In 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’16). USENIX Association, Denver, CO, 1–5.
[57] Cao Zhichao, Wen Hao, Wu Fenggang, and Du David H. C. 2018. ALACC: Accelerating restore performance of data deduplication systems using adaptive look-ahead window assisted chunk caching. In 16th USENIX Conference on File and Storage Technologies (FAST’18). USENIX Association, Oakland, CA, 309–324.
[58] Zhu Benjamin, Li Kai, and Patterson Hugo. 2008. Avoiding the disk bottleneck in the data domain deduplication file system. In 6th USENIX Conference on File and Storage Technologies (FAST’08). USENIX Association, San Jose, CA, 269–282.
[59] Zitzler Eckart, Knowles Joshua, and Thiele Lothar. 2008. Quality assessment of Pareto set approximations. In Multiobjective Optimization: Interactive and Evolutionary Approaches, Branke Jürgen, Deb Kalyanmoy, Miettinen Kaisa, and Słowiński Roman (Eds.). Springer, Berlin, 373–404.
