Abstract
The imbalanced I/O load on large parallel file systems affects the parallel I/O performance of high-performance computing (HPC) applications. One of the main reasons for I/O imbalance is the lack of a global view of system-wide resource consumption. While approaches to address the problem already exist, the diversity of HPC workloads combined with different file striping patterns prevents their widespread adoption. In addition, load-balancing techniques should be transparent to client applications. To address these issues, we propose Tarazu, an end-to-end control plane for adaptive, transparent I/O load balancing on large-scale parallel file systems.
1 INTRODUCTION
Unbalanced load distribution and poor resource allocation schemes have been identified as major contributors to performance penalties in many HPC storage systems, including Lustre [10], one of the most widely used parallel file systems for scientific computing. Several recent works address the load-balancing issue. Server-side approaches [18, 67] aim to allocate resources for all concurrently running applications simultaneously; this includes the standard approach used in Lustre’s request ordering system, the Network Request Scheduler (NRS) [74]. Other techniques attempt to minimize resource contention on a per-application basis, i.e., client-side approaches [58, 91]. While client-side and server-side approaches work well in isolation for some applications, the diversity of HPC I/O workloads unfortunately leads to situations where both isolated approaches lose performance.
Another commonly seen but neglected aspect when managing large-scale parallel file systems with a diverse workload and different file I/O sizes is the use of poor file striping patterns [51]. In general, file striping enables parallel file I/O and ideally provides high application I/O throughput [36]. However, often a much lower stripe count is used than recommended for large files, resulting in a poor resource allocation scheme. This ultimately can lead to an imbalanced utilization of storage components or, in the worst case, can cause some storage targets to fill up completely. These suboptimal file layouts are not necessarily the result of conscious choices by the users, but simply the result of inherited file layouts that are configured as the system-wide default. These observations, coupled with the constantly increasing size of HPC systems, result in intensified I/O subsystem complexity and decreased system reliability (i.e., lower mean time to failure). Therefore, sub-optimal placement of file stripes can indirectly lead to lower I/O bandwidth. Load imbalance on the storage servers and targets also has a direct effect on I/O bandwidth utilization, as a higher load on a storage server can lead to I/O and network congestion for all the I/O requests forwarded to that server. There have been client-side approaches (for example, Reference [19]) that tackle jitter-free I/O, but there has been no work that tackles the end-to-end problem, including clients and storage servers, to achieve better I/O bandwidth utilization. As a result, since there is no “one-file-layout-for-all,” and a load-balanced set of servers and targets can lessen I/O congestion issues, a configurable and smart load-balancing framework is needed that can adapt to different file layouts, facilitate scientific code development for users, and make efficient use of extreme-scale parallel I/O and storage resources.
Previous works such as iez [87] and AIOT [103] combine the application-centric strengths of client-side approaches with the system-centric strengths of server-side approaches. For example, iez provides an application-agnostic global view of all resources to the Metadata Server (MDS). This includes the current statistics of Object Storage Servers (OSSs) and the set of Object Storage Targets (OSTs) where data resides. It coordinates the I/O requests from all concurrently running applications simultaneously to optimize the I/O placement strategy on a per-client basis. However, iez and AIOT have two major drawbacks. First, the algorithm for predicting application I/O request patterns runs in a centralized fashion, which limits the scalability of the load-balancing framework. Second, neither framework can efficiently adapt its load balancing to the different file layout requirements of different file sizes when different HPC workloads run simultaneously.
Our work focuses on three main areas. First, we introduce Tarazu, an end-to-end control plane that couples client-side I/O prediction with a global, server-side view of resource usage to balance load transparently across OSSs and OSTs. Second, we evaluate Tarazu on a real Lustre deployment. Third, we study its behavior at scale through simulation of real-world HPC workloads.
In a nutshell, this work makes the following contributions:
(1) We design and implement a prediction algorithm and placement library for the end-to-end control plane of Tarazu.
(2) We evaluate the effectiveness of Tarazu on a real Lustre testbed using representative HPC benchmarks and varying striping layouts.
(3) We demonstrate the effectiveness and scalability of Tarazu at scale, using a discrete-event Lustre simulator driven by Darshan-based traces of real-world HPC workloads.
2 BACKGROUND
Recent work [5, 18, 66, 69, 71, 91] has shown that unbalanced I/O load in HPC systems can lead to serious resource contention and degradation of overall I/O performance. An application’s I/O request traverses an inherently complex parallel I/O system consisting of myriad components such as I/O libraries, network resources, and back-end memory. Today’s HPC implementations lack a centralized, system-wide I/O coordination and control mechanism to address the overall problem of resource contention. As a result, existing parallel file and storage systems can only optimize some parts of the I/O path, but not the entire end-to-end path. In the following, we provide a brief overview of the Lustre file system and its file allocation policies. Afterwards, we discuss the relation between parallel I/O and file requests. The section concludes with a discussion of the progressive file layout and its benefits for emerging HPC workloads.
2.1 Introduction to Lustre
We have implemented Tarazu on top of the Lustre parallel file system; this section therefore introduces Lustre’s architecture and allocation policies.
2.1.1 Lustre Architecture.
Lustre is a scalable storage platform that is based on distributed object-based storage. Figure 1 shows a high-level overview of the Lustre architecture and its key building blocks. Lustre clients provide a POSIX-compliant interface between applications and the storage servers. The application data is managed by two types of servers, Metadata Server (MDS) and Object Storage Server (OSS). The MDS manages all namespace operations and stores the namespace metadata on one or more storage targets called Metadata Targets (MDTs). The bulk storage of the contents of application data files is provided by OSSs. Each OSS typically manages between two and eight Object Storage Targets (OSTs), although more are possible, and stores the data on one or more OSTs. OSTs are backed by direct-attached storage. Each data file is typically striped across multiple OSTs; the stripe count can be specified by the user. The distributed components are connected via the high-speed network protocol LNet [55], which supports different network technologies, such as Ethernet and InfiniBand [72]. LNet is designed to support full RDMA throughput and zero-copy communications when supported by the underlying network technology.
One of Lustre’s key performance features is file striping using RAID 0, the process of dividing a body of data into blocks and spreading them across multiple storage devices in a redundant array of independent disks (RAID) group. Striping allows segments or chunks of data in a file to be stored on different OSTs, as shown in Figure 2. The RAID 0 pattern stripes the data across a certain number of objects. The number of objects in a single file is called the stripe count. Each object contains chunks of data from the file, and chunks are distributed across the objects in a circular round-robin manner: when the chunk of data being written to a particular object reaches the configured stripe size, the next chunk of data in the file is stored on the next object. In Figure 2, the stripe size for file C is larger than the stripe size for file A, so more data can be stored in a single stripe for file C. File striping offers two main benefits (a sketch of the resulting offset-to-object mapping follows the list):
(1) The ability to store large files by placing chunks of a file on multiple OSTs, i.e., a file’s size is not limited to the space available on a single OST.
(2) An increase in bandwidth, because multiple processes can simultaneously access the same file, i.e., a file’s I/O bandwidth is not limited to a single OST.
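To make the round-robin mapping described above concrete, the following sketch (our own illustration, not Lustre code) computes which object and object-internal offset a given file offset lands on for a plain RAID 0 layout:

```python
# Illustrative only: map a file offset to (object index, offset within object)
# for a RAID 0 layout with round-robin stripe placement.
def stripe_location(file_offset: int, stripe_size: int, stripe_count: int):
    stripe_index = file_offset // stripe_size          # global stripe number
    obj = stripe_index % stripe_count                  # object chosen round-robin
    obj_offset = (stripe_index // stripe_count) * stripe_size \
        + file_offset % stripe_size                    # position inside that object
    return obj, obj_offset

# With 1 MiB stripes over 4 objects, byte offset 5 MiB lands on object 1:
print(stripe_location(5 * 2**20, 2**20, 4))            # -> (1, 1048576)
```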
2.1.2 Managing Free Spaces.
To optimize file system performance, the MDT assigns file stripes to OSTs based on location (which OSS) and size (free space) considerations. Less-used OSTs are preferentially selected for stripes, and stripes are preferentially spread out between OSSs to better utilize network bandwidth. The default OST load-balancing approach uses Lustre’s standard allocation (LSA) policy to distribute the I/O load over OSTs. Lustre comes with two stripe allocation methods: the round-robin allocator and the weighted allocator. Depending on the free-space imbalance on the OSTs, Lustre transparently switches between the faster round-robin allocator, which maximizes network balancing, and the weighted allocator, which fills less-used OSTs faster by using a weighted random algorithm.
The round-robin allocator alternates stripes between OSTs on different OSSs, so the OST used for stripe 0 of each file is evenly distributed among OSTs, regardless of the stripe count, as depicted in Figure 3(a). Note that the list of OSTs for a file is not necessarily sequential with regard to the OST index. In contrast, the weighted allocator uses a weighted random mechanism to select OSTs. OSTs that are the least full have a higher probability of being allocated (in an attempt to bring the storage system back into balance), but there is still some chance that full OSTs could be selected. The target window, as shown in Figure 3(b), specifies the allowed free-space imbalance and defines when to switch between the two strategies. Let max be the maximum amount of free space on any OST in the file system, min the minimum amount of free space on any OST, and Window the quality-of-service threshold of allowed free-space imbalance:

(1) \((\mathit{max} - \mathit{min}) \le \frac{\mathit{Window}}{100} \cdot \mathit{max}\)
If Equation (1) holds, then the OSTs are considered balanced and the round-robin allocator is used; all OST usages lie within a small window of each other (by default, between 17% and 20%). Otherwise, i.e., when any two OSTs are imbalanced beyond this window, the weighted allocator is used.
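A minimal sketch of this switching logic, directly following Equation (1) (the function name and inputs are our own):

```python
# Decide between Lustre's two stripe allocators based on Equation (1).
def choose_allocator(free_space_per_ost, window_pct=17):
    hi, lo = max(free_space_per_ost), min(free_space_per_ost)
    if hi - lo <= window_pct / 100 * hi:
        return "round-robin"       # OSTs balanced: maximize network balancing
    return "weighted-random"       # imbalanced: fill less-used OSTs faster

print(choose_allocator([90, 85, 88]))  # small imbalance -> 'round-robin'
```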
2.2 Introduction to Parallel I/O and File Requests
In the context of HPC systems, parallel I/O [9, 73] describes the ability to perform multiple input and output operations at the same time, for instance, simultaneous outputs to storage and display devices. It is a fundamental feature of large-scale HPC environments. Parallel file systems distribute the workload over multiple I/O paths and components to satisfy the I/O requirements in terms of performance, capacity, and scalability. Scientific I/O is performed by large-scale applications from different scientific domains. HPC applications frequently issue I/O operations to access (i.e., read or write) TBs of data; some applications produce a few hundred TBs or even PBs of data. Typically, domain scientists think about their data in terms of their science problems, e.g., molecules, atoms, grid cells, and particles. Ultimately, physical disks store bytes of data, which makes such workloads difficult to handle for the storage system. Most HPC storage systems employ a parallel file system such as Lustre or GPFS to hide the complex nature of the underlying storage infrastructure, e.g., solid state drives (SSDs), spinning disks, and RAID arrays, and provide a single address space for reading and writing to files. The I/O behavior of an application depends on multiple factors such as type of I/O operation (i.e., read or write), file-sharing strategy (single-shared-file versus file-per-process), I/O intensity, and current system load when the application is executed.
There are three common file-sharing strategies used by applications to interact with the parallel file system. In Single Writer I/O, also known as sequential I/O, one process aggregates data from all other processes and then performs I/O operations to one or more files. The ratio of writers to running processes is 1 to N, as depicted in Figure 4(a). This pattern is very simple and can provide good performance for very small I/O sizes, but it does not scale for large-scale application runs, since it is limited by a single I/O process.

In File-Per-Process I/O, each process performs I/O operations on individual files, as shown in Figure 4(b). If an application runs with N processes, then N or more files are created and accessed (an N:M ratio with \(N \le M\)). Up to a certain point, this pattern can perform very well, but it is limited by the performance of each individual process that performs I/O. It is the simplest implementation of parallel I/O that can take advantage of an underlying parallel file system. However, it can quickly accumulate many files. Parallel file systems often perform well with this strategy up to several thousands of files, but synchronizing metadata for a large collection of files introduces a potential bottleneck. Also, an increasing number of simultaneous disk accesses creates contention on file system resources.

Finally, the Single-Shared-File pattern allows many processes to share a common file handle but write to exclusive regions of a file. Figures 4(c) and 4(d) show the independent and collective buffering variants of this strategy. In the independent variant, all processes of an application write to the same file, while in the collective buffering variant, the performance of shared file access is improved by offloading some of the coordination work from the file system to the application. The data layout within the file is very important to reduce concurrent accesses to the same region: contending processes can introduce a significant overhead, since the file system uses a lock manager to serialize access and guarantee file consistency. The advantage of the single-shared-file I/O pattern lies in data management and portability, e.g., when using a high-level I/O library such as HDF5.
Ultimately, I/O operations are translated into file requests accessing the parallel shared file system. Figure 5 uses Lustre as an example to show how an I/O operation is turned into a file request by the Lustre client running on a compute node. First, the Lustre client sends a remote procedure call (RPC) to the MDS via the logical metadata volume (LMV) and metadata client (MDC) to request a file lock (1). This can either be a read lock with look-up intent or a write lock with create intent. When the MDS has processed the request, it returns a file lock and all available metadata and file layout attributes to the Lustre client (2). If the file does not exist yet (i.e., file create request is performed), then the MDS will also allocate OST objects via the logical object volume (LOV) and object storage client (OSC) for the file based on the requested striping layout and current allocator policy when the file is opened for the first time. With the help of the file layout information, the client is able to access the file directly on the OSTs (3).
2.3 Progressive File Layout and Emerging Hybrid HPC Workloads
Striping enables users to obtain a high parallel I/O performance [36]. Files are divided into stripes, which are stored across multiple OSTs. This mechanism enables parallel read and write accesses to files and therefore parallel I/O. The Progressive File Layout (PFL) [51] is a recent Lustre feature where a file can have different striping patterns for different regions of the file to balance the space and bandwidth usage against the stripe count. Using PFL, a file can have several non-overlapping extents, with each extent having different striping parameters. This can provide lower overhead for small files that require only a single stripe, higher bandwidth for larger files, and wide distribution of storage usage for a very large file.
An example PFL configuration with four sub-layouts is shown in Figure 6. The first extent has a single stripe up to 128 MB, the second extent has three stripes up to 512 MB, the third extent has eight stripes up to 2 GB, and the last component extends to the end of the file with up to 16 stripes. The PFL feature is implemented using composite file layouts. The number of sub-layouts in each file and the number of stripes in each sub-layout can be specified either as a system-wide default or by the user via the lfs setstripe command (for the layout in Figure 6, roughly: lfs setstripe -E 128M -c 1 -E 512M -c 3 -E 2G -c 8 -E -1 -c 16 <dir>).
HPC applications are evolving to include not only traditional scale-up modeling and simulation bulk-synchronous workloads but also scale-out workloads [57] such as advanced data analytics and machine learning [97, 100], deep learning [14], and data-intensive workflows [16, 17, 24]—challenging the long- and widely held belief that HPC workloads are write-intensive, as shown by a recent I/O behavior analysis [65]. In contrast to the traditional well-structured HPC I/O patterns (for example, checkpoint/restart, multi-dimensional I/O access), emerging workflows will often utilize non-sequential, metadata-intensive, and small-transaction reads and writes, and invoke file read requests to the HPC parallel file systems [24]. PFL has been designed to cope with the changing landscape of I/O workloads so applications observe a reasonable performance for a variety of file I/O patterns without the need to explicitly understand the underlying I/O model.
3 RELATED WORK
In this work, we seek to design an end-to-end I/O load-balancing control plane for large-scale parallel file systems. Therefore, two research areas are of particular interest for this work: end-to-end I/O monitoring and resource load balancing.
3.1 End-to-end I/O Monitoring
Existing work in end-to-end I/O monitoring has focused mainly on I/O tracing and profiling tools, which can be divided into two main categories: application-oriented tools and back-end-oriented tools. Recent research work also focuses on the end-to-end I/O path analysis, thus introducing end-to-end I/O monitoring tools as a third category.
Application-oriented tools focus on collecting detailed information about particular application runs to tune applications for increased scientific productivity or to gain insight into trends in large-scale computing systems. These tools include, for example, Darshan [12], IPM [85], and RIOT [94], all of which are designed to capture an accurate picture of application I/O behavior, including key characteristics such as access patterns within files, in the context of the parallel I/O stack on compute nodes with a minimal overhead. Patel et al. [64], for example, used Darshan to perform an in-depth characterization of access, reuse, and sharing characteristics of I/O-intensive files. Wu et al. introduced a scalable tracing and replay methodology for MPI and I/O event tracing called ScalaTrace [53, 95, 96]. Another popular tool is Recorder [49], a multi-level I/O tracing tool that captures HDF5, MPI-I/O, and POSIX I/O calls, which requires no modification or recompilation of the application. It has been extended to also support tracing of most metadata POSIX calls [89].
Back-end-oriented tools focus on collecting I/O performance data on the system-level in the form of summary statistics. Example tools include LIOProf [99], LustreDU [45, 62], and LMT [27]. Apollo [75] is a real-time storage resource monitoring tool, which relies on publisher-subscriber semantics for low latency and low overhead. Its target is to provide a current view of the system to aid middleware services in making more optimal decisions. Finally, Paul et al. [66] analyzed application-agnostic file system statistics gathered on compute nodes as well as metadata and object storage file system servers.
Finally, end-to-end I/O monitoring tools try to provide holistic insight from an application and system perspective, including factors such as the network, I/O, resource allocation, and system software stack. An initial attempt was to utilize static instrumentation to trace parallel I/O calls. For example, SIOX [93] and IOPin [35] extended the application-level I/O instrumentation introduced by Darshan to other system levels to characterize I/O workloads across the parallel I/O stack. However, their overhead impedes their use in large-scale HPC production environments [78]. In recent years, end-to-end frameworks have become increasingly popular. TOKIO [39], for example, relies on the combination of front-end tools (Darshan, Recorder) and back-end tools (LMT). UMAMI [48] combines on-demand, modular synthesis of I/O characterization data into a unified monitoring and metrics interface, which provides cross-layer I/O performance analysis and visualization. GUIDE [86] is a framework used to collect, federate, and analyze center-wide and multi-source log data from the Oak Ridge Leadership Computing Facility (OLCF). The MAWA-HPC (Modular and Automated Workload Analysis for HPC Systems) [108, 109] project aims to develop a generic workflow and tooling suite that can be transparently applied to applications and workloads from different science domains. Through its modular design, the workflow is able to support various community tools, which increases its compatibility with different applications. Similar to UMAMI, MAWA-HPC provides cross-layer performance analysis and visualization. Finally, Beacon [101, 102] complements previous work by providing a real-time end-to-end I/O monitoring framework. It can be used to analyze performance and resource utilization, but also for automatic anomaly detection and continuous per-application I/O pattern profiling. Beacon is currently deployed on the TaihuLight system.
When designing an end-to-end I/O control plane, key aspects such as low latency, low overhead, and an application-agnostic global view of resources play an important role in the overall system design. Therefore, this work combines different approaches from the discussed related works to provide transparent coordination of I/O requests and intelligent and adaptive placement of file I/O. For example, Tarazu pairs lightweight, Recorder-inspired client-side tracing with low-overhead, publisher-subscriber-based collection of server-side statistics.
3.2 Resource Load Balancing
Given typical non-uniform data allocation patterns across storage resources, the striping of application data across multiple OSTs often leads to load imbalance. The main limitation is that LSA only aims to balance the load on OSTs, without any consideration of other components, such as the MDS and OSSs. Previous work on HPC I/O behavior [55, 67, 87] has shown that LSA can take a long time to balance a system, since it is unable to capture the complex behavior of modern HPC applications. Consequently, the default policy falls short of providing the desired balanced I/O load across the storage system.
Load balancing and resource contention mitigation in large-scale parallel file system deployments are extensively studied research topics [4]. One approach is to address the problem from the client-side on a per-application basis [56, 58, 91]. For example, the I/O calls can be intercepted on the client-side during runtime and the OST assignments can be managed accordingly to mitigate resource contention [29, 44, 83, 107]. One example is TAPP-I/O [58], which transparently intercepts metadata operations, supports both statically and dynamically linked applications, and provides a heuristic-based placement strategy for FPP and SSF. However, the main limitation of these approaches is that they do not consider the requirements of other applications running concurrently on the system due to lack of a global system view and therefore only tune the I/O of individual applications.
Another approach is to maintain a global view of storage servers and server-side statistics and to consider load balance and job interference across all applications instead of on a per-job basis. Here, the load-balancing problem is handled from the server-side perspective [18, 67, 80, 105]. The main limitations of such approaches are that they require modification of application source code and do not consider different file-sharing strategies (SSF or FPP) and striping layouts.
Recent work [2, 6, 7, 8] has introduced auto-tuning approaches for specific high-level I/O libraries such as MPI-IO and HDF5 that learn and predict the I/O behavior of HPC applications to improve the parallel read and write performance. Another alternative is presented by the Optimal Overloaded IO Protection System (OOOPS) [32], which detects and throttles I/O-intensive workloads to reduce excessive pressure on the metadata servers and service nodes. Ji et al. [33] introduced an application-adaptive dynamic forwarding resource allocation (DFRA), which, based on monitoring data from the real-time I/O monitoring system Beacon [101], determines whether an application should be granted more forwarding resources or given dedicated forwarding nodes. Hence, DFRA attempts to mitigate the load imbalance at the forwarding layer and can be considered complementary to this work. In 2022, Yang et al. [103] presented an end-to-end and adaptive I/O optimization tool (AIOT), which is also based on the Beacon framework. AIOT tunes system parameters across multiple layers of the storage system by using automatically identified application I/O behaviors and the instantaneous workload status of the storage system. The main drawback of AIOT is its centralized design for predicting and tuning I/O behavior.
The aforementioned approaches improve the parallel I/O performance for individual applications by effectively reducing resource contention and improving load balance, but they fail to exploit the opportunity of an interference-aware, end-to-end I/O path optimization. They also fail to achieve effective resource utilization (e.g., bandwidth) and performance improvements by adapting the load balance to different I/O sizes. In contrast, Tarazu combines an application-agnostic global view of resources with decentralized, client-side prediction to adapt its load balancing to different file layouts and I/O sizes along the entire end-to-end I/O path.
4 MOTIVATION: LOAD IMBALANCE IN DEFAULT LUSTRE SETUPS
In the following, we use two well-known HPC benchmarks to highlight the load imbalance in a default Lustre deployment and to motivate the need for a framework such as Tarazu.
4.1 Use Cases and Benchmarks
The Hardware Accelerated Cosmology Code (HACC) [28] application uses N-body techniques to simulate the formation of structure in collision-less fluids under the influence of gravity in an expanding universe. HACC-I/O mimics the I/O patterns and evaluates the performance of the HACC simulation code. It can be used with the MPI-I/O and POSIX I/O interfaces and differentiates between FPP and SSF file-sharing modes.
The Interleaved-Or-Random (IOR) [40] benchmark provides a flexible way to measure a parallel file system’s I/O performance under different read/write sizes, concurrencies, file formats, and file layout strategies. It measures the performance for different configurations, including I/O interfaces ranging from traditional POSIX I/O to advanced parallel I/O interfaces like MPI-IO, and differentiates between file-per-process and single-shared-file parallel I/O strategies.
4.2 Observed Load Imbalance
To highlight the load imbalance in a default Lustre setup, we use Lustre’s standard allocation (LSA) strategy to distribute the I/O load on the OSTs. We deployed a testbed consisting of a 10-node cluster with 1 MDS, 7 OSSs, and 2 clients. Each OSS manages 5 OSTs with a capacity of 10 GB each; hence, the cluster has 35 OSTs in total with a combined capacity of 350 GB. IOR was run with 16 processes for two different configurations, resulting in 32 GB and 128 GB of data stored on the OSTs in the FPP access mode. In addition, HACC-I/O was run with 8 and 16 processes for 50 million and 20 million particles, generating 14.3 GB and 11.7 GB of data, respectively. All experiments were run for both the PFL and non-PFL setups. For the PFL setup, we used the same configuration as shown in Figure 6, referred to as Configuration 1. For the non-PFL setup, the stripe count was set to eight.
Figure 7(a) shows the OST utilization for different runs of IOR and HACC-IO in the FPP mode with the non-PFL file striping configuration. In a balanced load setting, these graphs would be straight lines, but in the studied scenario, the load is observed to be imbalanced with some OSTs getting a higher I/O load than others. A similar pattern can be seen in Figure 7(b) for IOR and HACC-I/O in the FPP mode using the PFL configuration. These results show that a default Lustre deployment, which relies on LSA to allocate OSTs for each job, can suffer from a significant load imbalance at the server-level. The load imbalance persists at different scales and different striping layouts (PFL and non-PFL) and thus can lead to imbalanced resource usage and contention.
It should be noted that the OST utilization for non-PFL and PFL files looks similar. The reason for this is how Lustre internally maps data blocks onto the stripe objects on the OSTs. By default, when the free space across OSTs differs by less than 20%, round-robin is used to distribute the file I/O across multiple OSTs. For example, for HACC-I/O with FPP, 8 processes, and 50 million particles, the non-PFL layout writes 8 stripes per file with 219 MB per stripe, while PFL divides each file into 12 stripes: four 128 MB stripes and seven 192 MB stripes (the last stripe of the third extent remains unallocated, since each file is only 1,830.4 MB in size).
Regarding the observed load imbalance, the following points should be noted:
— The load imbalance on the OSTs occurs during the file creation phase. Each file creation request contains two parameters: the number of stripes and the file size associated with the file. Therefore, our load-balancing algorithm needs to optimally place the files during the file creation phase.
— The jagged line plots of the OST utilization in Figures 7(a) and 7(b) indicate that the load on the OSTs is not balanced during an application run; a balanced set of OSTs exhibits a straight line for OST utilization. This load imbalance on the OSTs results in an imbalance in the read and write requests arriving at the OSSs, which leads to I/O congestion and thus lower overall I/O bandwidth for the application. Therefore, our load-balancing algorithm should balance the load on both OSTs and OSSs to improve the overall application bandwidth.
— The goal of a parallel file system with load balancing should be to keep the total load of the storage targets within reasonable limits and to use all OSTs and OSSs in a similar manner, so that the majority of I/O requirements is not served by only a small set of OSSs and OSTs. The percentage of OST utilization should also not get very close to 100%, because storage targets that reach 100% utilization operate more slowly and cause I/O bottlenecks.
4.3 Parallel File Access with Varying Striping Layouts
Before we discuss the Tarazu system design, we illustrate how parallel file accesses map onto varying striping layouts for the FPP and SSF file-sharing modes.
In the FPP mode example, four processes each write 8 GB files to the parallel file system. In the non-PFL layout, each 8 GB file is split into a predefined number of equal-sized stripes (in this example, the stripe count is 8), while in PFL layout, each 8 GB file is partitioned according to the defined PFL layout for these files or directories (here Configuration 1). In SSF mode, a single process creates a single file while all other processes perform I/O operations on the file. This file is divided into a predetermined number of stripes according to a non-PFL or a PFL layout.
It is evident that the different stripes can be of different sizes, depending on the file-sharing mode and striping layout. Also, for non-PFL files, individual stripes can be very large, which can cause load imbalance. In addition, data segments are stored in a RAID 0 pattern during striping, as explained in Section 2.1. This can pose a significant problem, especially when several processes want to write to the same file: Lustre provides file locking on a per-server basis, which can lead to contention for concurrent file operations, especially when accessing segments of files in the RAID 0 pattern (i.e., in a circular round-robin manner). The challenge is to select OSTs for placing different-sized stripes of concurrent workloads with different striping layouts such that the load is balanced across all OSTs and resource contention is reduced, improving I/O performance for different workload characteristics.
5 TARAZU SYSTEM DESIGN
In the following, we introduce our design philosophy, provide an overview of the software architecture, and discuss each software component individually. We have implemented Tarazu on top of the Lustre parallel file system.
5.1 Design Philosophy and Contributions
The design of Tarazu is guided by the following key principles:
— End-to-end Control Plane: One of the most important design features in Tarazu is a control plane that spans the entire I/O path, coordinating clients and storage servers instead of optimizing either side in isolation.
— Application-agnostic Global View of Resources: Storage servers act indifferently to the applications sending I/O requests, that is, servers do not have application-level information. Therefore, the placement algorithm in Tarazu maintains an application-agnostic, global view of resource usage across all OSSs and OSTs.
— Automatic Coordination of I/O Requests from Concurrent Workloads: All applications perform I/O on the same shared file system. Therefore, in Tarazu, the predicted file creation requests of all concurrently running applications are coordinated on the MDS to balance the aggregate load.
— Intelligent and Adaptive Placement Algorithm: The placement algorithm should be intelligent enough to make accurate predictions about future file requests by tracking the I/O behavior of the application. Since this depends on the behavior of individual clients, Tarazu runs its prediction model on each client, keeping prediction decentralized and scalable.
— Transparent Placement of Application Files: Applications should have no knowledge of how the entire placement algorithm works. Therefore, one of the most important design decisions in Tarazu is to apply the computed placements transparently, without requiring any changes to application source code.
5.2 End-to-end Control Flow
Figure 9 shows the high-level control flow of Tarazu. In Phase 1, a lightweight tracing library captures the file create and write behavior of an application, and the collected traces are used to train the client-side prediction model.
This trained model is then used to predict file creation requests (Phase 2a). As discussed previously, file creation requests lead to load imbalance in OSTs. Therefore, before the actual application run, our prediction model will predict the file size for all file creation requests in an application. The number of file stripes is also collected from the configuration file. This set of predicted file create requests (file size and the number of stripes) from the client is sent to the OST allocation algorithm running on the MDS that has a global view of the system.
The MDS collects real-time statistics on OSS and OST resource usage (Phase 2b) asynchronously and in parallel with Phase 2a. Based on the set of file creation requests sent by all clients and the server statistics, the OST allocation algorithm running on the MDS maps each file to a set of OSTs (Phase 3) such that there is OST and OSS-level load balance in the system. The file creation requests and the corresponding set of load-balanced OSTs are given back to the respective clients.
When the actual file creation requests come from the application, the mapped, load-balanced set of OSTs is allotted for the corresponding file creation request and the metadata information is applied to the actual request (Phase 4). This helps reduce the latency of file creation requests and achieve a scalable load-balanced design.
Phases 1 through 3, which involve retraining the model based on historical traces and making a prediction, are only required if there are many actual file creation requests that are not included in the predicted set of requests, resulting in a higher miss rate, or if the application striping pattern and file-sharing strategy change. This separation of Phases 1–3 and Phase 4 helps design a transparent load-balancing framework that does not require changing the application source code. At the same time, it enables seamless load balancing by reducing the overall latency of file creation requests, resulting in better application I/O throughput. This control flow also results in the prediction model not interfering with the actual application flow.
5.3 System Overview
Figure 10 shows an overview of the Tarazu software architecture and its main components.
When applications are initially run, their file create and write requests are traced on the clients to generate the training data for the prediction model (Phase 1).
On the server side (i.e., on the OSSs), we collect various statistics such as CPU and memory usage information, the associated OST capacity (kbytestotal), and the number of bytes available on the OSTs (kbytesavail). These statistics are collected from the OSSs via an asynchronous publisher-subscriber mechanism built on ZeroMQ (see Section 5.6).
Mapping the system design of Tarazu onto the control flow of Figure 9, the following subsections discuss each phase in detail.
5.4 Phase 1: Trace Collection and Training Data
We implement a simple, lightweight I/O tracing library, miniRecorder, to capture the file create and write behavior of applications.
The load imbalance on OSSs and OSTs is due to the write and create requests [49, 67], as this is when the actual stripes for the files are allocated on the OSTs, as described in Section 2.1. Therefore, the tracing tool only needs to be used to capture the I/O creation and write behavior of an application or workflow. Afterwards, miniRecorder is only run whenever the ARIMA-inspired prediction model performs poorly and the OST prediction algorithm needs to be re-trained (see Section 5.5.1). In the next step, the data collected by miniRecorder is converted into a readable comma-separated values (.csv) file by a parsing script; this file serves as the training data for the prediction model.
5.5 Phase 2a: Application File Create Request Prediction
For each application, our prediction model predicts the file size of all file create and write requests performed by the application. To achieve this, we rely on predictions based on ARIMA time-series modeling and a configuration manager responsible for determining the striping layout of a file.
5.5.1 ARIMA-Inspired Prediction Algorithm (AIPA).
Recent work [67] suggests that I/O patterns of HPC applications are predictable. This observation is also confirmed by multiple HPC practitioners. Therefore, the Tarazu prediction algorithm treats the traced file create and write requests as a time series and forecasts the file sizes of future requests on each client.
Previous work [2, 6, 8] has presented auto-tuning approaches for MPI-IO and Lustre that learn and predict I/O parameters to improve the read and write performance of HPC applications. Researchers have previously used the AutoRegressive Integrated Moving Average (ARIMA) model [11], Seasonal Integrated ARMA (SARIMA), and Fractionally Integrated ARMA (ARFIMA) to estimate CPU, RAM, and network usage for HPC workloads [37]. Formal grammar has also been used to predict I/O behaviors in HPC [20], which confirms that HPC I/O is predictable; however, formal grammar is ineffective at predicting data in a time-series manner while keeping resource consumption low. The Markov Chain Model [76] has also been used to exploit knowledge of spatial and temporal I/O requests [60, 67]. Recently, a regression-based I/O performance prediction scheme for HPC environments was also proposed [34]. Based on these previous studies, the time-series nature of the traces provided by our tracing tool leaves us two options for our prediction model: ARIMA and Markov chain. We initially measured the I/O request prediction accuracy and resource consumption of both models. We observe that for IOR data, ARIMA has a \(99.1\%\) accuracy with \(1.2\%\) CPU overhead and \(0.01\%\) memory usage, while the Markov chain model yields an accuracy of \(95.5\%\) utilizing \(4.5\%\) CPU and \(0.01\%\) memory.
Prediction models should not interfere with the client-side I/O activities. Therefore, to ensure online prediction on the client side, we use the ARIMA model, which provides better accuracy at lower resource consumption.
Two design aspects for the ARIMA-based prediction model are its prediction accuracy and its resource consumption on the client.
Our prediction model is implemented using the statsmodels.tsa.arima_model package in Python. Our results show a \(98.3\%\) accuracy on HACC-I/O data and \(99.1\%\) accuracy on IOR data.
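As a rough sketch of this kind of client-side size prediction (the trace values below are made up, and newer statsmodels releases expose ARIMA as statsmodels.tsa.arima.model.ARIMA rather than the older statsmodels.tsa.arima_model):

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical per-request write sizes (MB) extracted from an application trace.
sizes = pd.Series([128.0, 128.0, 256.0, 256.0, 512.0, 512.0])

model = ARIMA(sizes, order=(1, 1, 1)).fit()  # (p, d, q) tuned per application
print(model.forecast(steps=1))               # predicted size of the next request
```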
5.5.2 Configuration Manager.
The Tarazu configuration manager is responsible for determining the striping layout (PFL or non-PFL) of each predicted file. It reads the configured stripe counts from the configuration file and computes the corresponding stripe sizes for the predicted file sizes.
An important design concern when calculating the stripe size for both layouts is the 64k-alignment constraint imposed by Lustre. This constraint states that the stripe size should be an even multiple of 64k, or 65,536 bytes (the Alignment Parameter (AP)). For file sizes that are not AP-aligned, we use Equations (2) and (3). The method ensures a 64k-aligned stripe size for all stripes allocated on the stripeCount OSTs by allocating a slightly bigger file than is requested by the client. For example, a 766.175 MB file will be allocated an 833 KB (0.1%) bigger file, which ensures equal-sized stripes on all OSTs and hence contributes to a load-balanced setup. Further details on the equations can be found in Reference [87].

(2) \(writeBytes = AP \cdot 2 \cdot N \cdot stripeCount\)

with

(3) \(N = \left\lceil \frac{writeBytes}{AP \cdot 2 \cdot stripeCount} \right\rceil\).
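A small sketch of Equations (2) and (3), assuming requested sizes in bytes (the helper name is ours):

```python
import math

AP = 65536  # Lustre alignment parameter: stripe sizes must be multiples of 64k

def aligned_write_bytes(requested_bytes: int, stripe_count: int) -> int:
    """Pad the requested size so every OST receives an equal, 64k-aligned stripe."""
    n = math.ceil(requested_bytes / (AP * 2 * stripe_count))   # Equation (3)
    return AP * 2 * n * stripe_count                           # Equation (2)

req = int(766.175 * 2**20)                   # the ~766 MB example from the text
print(aligned_write_bytes(req, 8) - req)     # padding added, in bytes
```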
5.5.3 Interaction Database.
The Tarazu interaction database stores, on each client node, the set of predicted file create requests of the client’s applications together with their striping parameters; the placement library later queries this database during file creation (see Section 5.8). The table below shows example entries for two 8 GB IOR files using PFL Configuration 1:
File Name | File Size | Extent ID | Extent Start | Extent End | Stripe Size | Stripe Count | MPI Rank
---|---|---|---|---|---|---|---
/mnt/lustre/ior/test.0 | 8,589,934,592 | 1 | 0 | 134,217,728 | 134,217,728 | 1 | 0 |
/mnt/lustre/ior/test.0 | 8,589,934,592 | 2 | 134,217,728 | 536,870,912 | 134,217,728 | 3 | 0 |
/mnt/lustre/ior/test.0 | 8,589,934,592 | 3 | 536,870,912 | 2,147,483,648 | 201,326,592 | 8 | 0 |
/mnt/lustre/ior/test.0 | 8,589,934,592 | 4 | 2,147,483,648 | 8,589,934,592 | 402,653,184 | 16 | 0 |
/mnt/lustre/ior/test.1 | 8,589,934,592 | 1 | 0 | 134,217,728 | 134,217,728 | 1 | 1 |
/mnt/lustre/ior/test.1 | 8,589,934,592 | 2 | 134,217,728 | 536,870,912 | 134,217,728 | 3 | 1 |
/mnt/lustre/ior/test.1 | 8,589,934,592 | 3 | 536,870,912 | 2,147,483,648 | 201,326,592 | 8 | 1 |
/mnt/lustre/ior/test.1 | 8,589,934,592 | 4 | 2,147,483,648 | 8,589,934,592 | 402,653,184 | 16 | 1 |
As every client application has an individual database on its client node, the size of the database depends only on the total number of files created by that application. This decentralized design aids scalability, since the enormous number of files created by all applications running in the HPC cluster need not be stored in a single database. The database querying frequency is also kept to a minimum, as the database needs to be accessed only for file creation requests; subsequent file read and write accesses do not need the database.
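A sketch of the create-time lookup, with sqlite3 standing in for the per-client MySQL database and a schema condensed from the table above:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE requests (
    file TEXT, size INTEGER, extent INTEGER,
    stripe_size INTEGER, stripe_count INTEGER, rank INTEGER)""")
db.execute("INSERT INTO requests VALUES "
           "('/mnt/lustre/ior/test.0', 8589934592, 1, 134217728, 1, 0)")

# Queried only on file creation; subsequent reads/writes bypass the database.
rows = db.execute(
    "SELECT extent, stripe_size, stripe_count FROM requests "
    "WHERE file = ? ORDER BY extent",
    ("/mnt/lustre/ior/test.0",)).fetchall()
print(rows)  # [(1, 134217728, 1)]
```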
5.6 Phase 2b: Server Statistics Collection
The statistics collection needs to be lightweight and scalable so it can handle up to thousands of OSSs in a seamless manner without affecting file system activities. Therefore, ZeroMQ (ØMQ) [31] is used as the message queue, which has been proven to be lightweight and efficient at large scale. To ensure scalability, an asynchronous publish-subscribe model is used, where the OSSs act as publishers and the MDS acts as a subscriber. Table 3 shows the list of system metrics collected.
Component | Factors | Discussion
---|---|---
Metadata Server (MDS) | CPU and memory utilization | Reflect the system load.
 | LNet statistics | Load on the Lustre networking layer connected to the MDS.
Object Storage Server (OSS) | CPU and memory utilization | Reflects the system load of the management server.
 | LNet statistics | Load on the Lustre networking layer connected to the OSS.
Object Storage Target (OST) | stats | Overall statistics per OST.
 | job_stats | Statistics per job per OST.
 | kbytesavail | Available disk space per OST.
 | brw_stats | I/O read/write time and sizes per OST.
Each OSS has a lightweight statistics publisher that periodically collects the local metrics listed in Table 3 and publishes them to the subscriber on the MDS via ZeroMQ.
The statistics collector running on the MDS is responsible for:
(1) collecting CPU, memory, and network utilization from the MDS,
(2) subscribing to statistics from the OSSs via ZeroMQ, and
(3) parsing and preparing all the collected statistics.
The CPU, memory, and network usage recorded on the MDS is important for determining when to run the OST allocation algorithm. To avoid interrupting normal MDS activity, the OST allocation algorithm runs only when the CPU and memory utilization fall below \(70\%\) and \(50\%\), respectively. These utilization thresholds can be manually tuned. The collected statistics from the MDS and the OSSs are parsed and used as input to the OST allocation algorithm. Our results show that the statistics collection on the MDS has a CPU utilization of \(0.1\%\) and negligible memory usage.
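A condensed sketch of the asynchronous publish-subscribe flow (the endpoints and the psutil-based sampling are our assumptions, not the paper's exact implementation):

```python
import json, time
import psutil  # assumed here for CPU/memory sampling
import zmq

def oss_publisher(mds_endpoint="tcp://mds:5556", interval=5.0):
    sock = zmq.Context().socket(zmq.PUB)
    sock.connect(mds_endpoint)                 # OSSs publish, the MDS subscribes
    while True:
        stats = {"cpu": psutil.cpu_percent(),
                 "mem": psutil.virtual_memory().percent}
        sock.send_multipart([b"oss.stats", json.dumps(stats).encode()])
        time.sleep(interval)

def mds_subscriber(bind="tcp://*:5556"):
    sock = zmq.Context().socket(zmq.SUB)
    sock.bind(bind)
    sock.setsockopt(zmq.SUBSCRIBE, b"oss.stats")
    _, payload = sock.recv_multipart()         # asynchronous: OSSs never block
    return json.loads(payload)
```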
5.7 Phase 3: OST Allocation Algorithm
Algorithm 1 shows the steps employed for allocating OSTs for a client write request. The inputs to the Tarazu OST allocation algorithm are the sets of predicted file create requests from all clients (Phase 2a) and the server statistics collected on the MDS (Phase 2b).
The cost to reach an OST is the load of the OSS containing the OST. The cost of an OST is defined as the ratio of bytes already used in the OST to the total size of the OST. The allocation algorithm should be able to handle both PFL and non-PFL applications. The PFL requests have varied stripe sizes. Therefore, to have consistency in the allocation algorithm, we compute the maximum stripe size across both PFL and non-PFL databases from all client applications. The capacity of an OST is defined as the number of stripes that can be handled by the OST. This is calculated by dividing the available space in the OST by the maximum stripe size.
To construct the flow graph shown in Figure 13, source and sink nodes need to be identified. The total demand for the source node is the total number of stripes requested by all application requests, and the total demand for the sink node is the negative of the total number of stripes requested. The Ford-Fulkerson algorithm [84] is used to solve the minimum-cost maximum-flow problem. This approach outputs a list of OSTs (OSTAllocationList), which will yield a balanced load over all OSSs and OSTs. For our implementation, we use the Python library NetworkX.
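As an illustration of formulating OST allocation as a minimum-cost flow with node demands (a toy graph, not the paper's exact construction; NetworkX's convention uses negative demand for supply nodes):

```python
import networkx as nx

G = nx.DiGraph()
total_stripes = 4                               # stripes requested by all clients
G.add_node("src", demand=-total_stripes)        # negative demand = supply
G.add_node("sink", demand=total_stripes)

# Edge weight models OSS load plus OST fullness; capacity models how many
# stripes an OST can still hold (free space / maximum stripe size).
G.add_edge("src", "oss0", weight=0, capacity=total_stripes)
G.add_edge("oss0", "ost0", weight=30, capacity=2)
G.add_edge("oss0", "ost1", weight=10, capacity=3)
G.add_edge("ost0", "sink", weight=0, capacity=2)
G.add_edge("ost1", "sink", weight=0, capacity=3)

flow = nx.min_cost_flow(G)
print(flow["oss0"])  # {'ost0': 1, 'ost1': 3}: the cheaper OST receives more stripes
```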
The MDS uses a scalable publisher-subscriber model via ZeroMQ [31] to interoperate with the interaction databases, which helps in scaling Tarazu to a large number of clients.
The OSTAllocationList is then shared with the respective clients using our publisher-subscriber model via ZeroMQ. The complete set of requests, extended with the allocated OST lists, is stored in the interaction database on each client, as shown in the table below:
File Name | File Size | Ext.ID | Ext.Start | Ext.End | St.Size | #St. | Rank | OST List
---|---|---|---|---|---|---|---|---
/mnt/lustre/ior/test.0 | 8,589,934,592 | 1 | 0 | 134,217,728 | 134,217,728 | 1 | 0 | 10 |
/mnt/lustre/ior/test.0 | 8,589,934,592 | 2 | 134,217,728 | 536,870,912 | 134,217,728 | 3 | 0 | 24 29 34 |
/mnt/lustre/ior/test.0 | 8,589,934,592 | 3 | 536,870,912 | 2,147,483,648 | 201,326,592 | 8 | 0 | 15 2 28 25 11 21 7 33 |
/mnt/lustre/ior/test.0 | 8,589,934,592 | 4 | 2,147,483,648 | 8,589,934,592 | 402,653,184 | 16 | 0 | 14 23 16 30 1 6 4 26 32 20 12 31 22 35 5 17 |
/mnt/lustre/ior/test.1 | 8,589,934,592 | 1 | 0 | 134,217,728 | 134,217,728 | 1 | 1 | 24 |
/mnt/lustre/ior/test.1 | 8,589,934,592 | 2 | 134,217,728 | 536,870,912 | 134,217,728 | 3 | 1 | 13 9 29 |
/mnt/lustre/ior/test.1 | 8,589,934,592 | 3 | 536,870,912 | 2,147,483,648 | 201,326,592 | 8 | 1 | 22 11 35 21 5 7 17 10 |
/mnt/lustre/ior/test.1 | 8,589,934,592 | 4 | 2,147,483,648 | 8,589,934,592 | 402,653,184 | 16 | 1 | 14 23 16 30 1 6 15 4 26 2 32 20 12 28 31 25 |
5.8 Phase 4: Applying Metadata Information
The Tarazu placement library runs on each client and transparently applies the precomputed OST allocations when the actual file creation requests arrive (Phase 4).
For each file creation request, the placement library queries the client’s interaction database for the matching entry and retrieves the predicted striping parameters together with the allocated OST list.
Algorithm 2 describes a simplified version of the placement library’s mode of operation, which is run on the clients. If the result returned by the MySQL query contains only one row, then the non-PFL layout is used by allocating a Layout Extended Attributes (Layout EA) with the predicted striping pattern on the MDS. The striping pattern is applied by initializing the layout EA with the stripe count, stripe size, and the list of OSTs retrieved from the interaction database. It should be noted that the configured striping pattern differs from the default RAID 0 pattern typically applied by Lustre. Instead of writing multiple data segments in a round-robin fashion as introduced in Section 2.1, Tarazu lays out each file as equal-sized stripes, with each stripe placed contiguously on one of the OSTs allocated by the OST allocation algorithm.
6 SIMULATOR ENVIRONMENT AND WORKLOAD GENERATION
To enable scaling experiments, we implement a discrete-event simulator. In the following, we describe the design of the simulator and its Darshan-based workload generation, and briefly validate the simulation results.
6.1 Overview of Existing Parallel File System Simulators
Before we discuss the design of the discrete-event simulator, we present an overview of previous work on file system simulation and argue why existing simulators cannot be directly applied to our research.
The Lustre simulator [104] was developed as an event-driven simulation platform to research scalability, analyze I/O behaviors, and design various algorithms at large scale. It simulates disks, the Linux I/O elevator, a file system with mballoc block allocation, a packet-level network, and three Lustre subsystems: client, MDS, and OSS. The main focus of this simulation tool is the evaluation of the Network Request Scheduler (NRS). Since this simulator was developed in 2009, it is based on Lustre 1.8 and therefore is not compatible with our experiments.
Another open-source simulator developed in 2009 is IMPIOUS (Imprecisely Modelling I/O is Usually Successful) [52]. IMPIOUS is trace-driven and provides abstract models that capture the key characteristics of three parallel file systems: PVFS, PanFS, and Ceph. Depending on the simulated file system, the simulator can be configured to distinguish different characteristics such as data placement strategies, resource locking protocols, redundancy strategies, and client-side caching strategies. Due to its age and its lack of Lustre support, IMPIOUS cannot be used for our evaluation.
Liu et al. have introduced PFSsim [46, 47], which is also a trace-driven simulator designed for evaluating I/O scheduling algorithms in parallel file systems. It uses OMNeT++ for detailed network models and relies on DiskSim [38] to simulate disk operations. Since PFSsim only supports PVFS2 and mainly focuses on I/O scheduling algorithms, it cannot be used to evaluate Tarazu.
In 2012, the parallel file system simulator FileSim [21] was introduced by Erazo et al., which is based on SimCore, a generic discrete-event simulation library. It provides pluggable models with different levels of modeling abstraction for different parallel file system components. In addition, FileSim supports trace-driven simulation, which can be used to validate parallel file system models by comparing against the behavior observed in real systems. Even though the description of the simulator would fit the requirements for simulating Tarazu, FileSim is not publicly available.
The Hybrid Parallel I/O and Storage System Simulator (HPIS3) [23] was introduced in 2014 by Feng et al. It provides a co-design tool targeting the optimization of hybrid parallel I/O and storage systems, where a set of SSDs and HDDs are deployed as storage nodes. HPIS3 is built on the Rensselaer Optimistic Simulation System (ROSS) [13], a parallel simulation platform, and is capable of simulating a variety of parallel storage systems with two distinct types of hybrid system design, namely, buffered-SSD and tiered-SSD storage systems. Hence, HPIS3 targets a different scenario than Tarazu.
Other simulators relying on ROSS are CODES [15] and BigSim [106]. CODES provides a tool for I/O and storage system simulations. Its main target is the exploration and co-design of exascale storage systems for different I/O workloads. Initially, workloads could only be described via the CODES I/O language. In 2015, Snyder et al. [79] proposed an I/O workload abstraction (IOWA). IOWA describes different techniques to generate workload for simulation frameworks depending on the use case, including workload generation from Recorder [49] and Darshan [12]. Since our simulation use case is mostly concerned with the bandwidth performance when reading or writing to the parallel file system, we only adopt the workload generation techniques proposed by IOWA.
6.2 Simulator Design
As discussed in the previous section, there is no existing simulator that can both simulate the different components of the Lustre file system and integrate various OST allocation algorithms to help evaluate Tarazu. Therefore, we implement our own discrete-event Lustre simulator on top of ns-3.
The simulator consists of four key components that closely mirror Lustre’s OST, OSS, MDT, and MDS. These implement the various Lustre operations and allow us to collect data about the system behavior. The MDS is also equipped with multiple strategies for OST selection, such as round-robin, random, and the OST allocation algorithm designed for Tarazu.
The steps for building the simulator are shown in Algorithm 3. The application trace file generated from the Darshan traces, which is explained in the next section, serves as input to the simulator, along with the configuration file for PFL, the number of I/O routers, and the number of LNet routers. First, the time-series set of requests is generated from the application traces by reading the configuration file and calculating the request size and stripe count. The network topology of the simulator is built next.
The major components of our Lustre Simulator are as follows:
— Application: Parses the trace file and the PFL configuration. The trace file is converted into a time-series event list with request size and stripe count.
— I/O Router: Simulated as a network component placed between the client nodes (Application) and the storage nodes. It is one of the most important components for simulating the network traffic from multiple applications in our simulator.
— LNet Router: Handles the OSSs in the cluster. All packets arriving at the I/O Router are redirected through the LNet Router, which keeps track of the OSS to which each packet is sent and updates the load on that OSS accordingly.
— OSS: This module handles the load distribution on the OSTs under each OSS. For each OSS, it keeps track of the list of active OSTs and the CPU usage of the OSS, which is calculated from the combined usage of all associated OSTs.
— OST: This module handles the final step of the packet transfer from the Application. It reduces the free disk space on the OST in accordance with the packet size and contributes to the calculation of the OSS CPU usage.
This network topology helps in simulating the application requests. As discussed before, the time-series set of requests is stored as an event list. At each simulated time instant, an event is taken from the event list, and the stripe size is calculated by dividing the write bytes by the number of stripes. The stripe size and the number of stripes are used to obtain the OST list by running the appropriate OST allocation algorithm (LSA or the Tarazu allocation algorithm).
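The per-event processing loop, reduced to a sketch under the assumptions above (the class and function names are ours; allocate_osts stands in for LSA or the Tarazu algorithm):

```python
class OST:
    def __init__(self, free_bytes):
        self.free_bytes = free_bytes  # reduced as simulated packets arrive

def process_events(events, allocate_osts):
    """events: (time, write_bytes, stripe_count) tuples processed in time order."""
    for _, write_bytes, stripe_count in sorted(events):
        stripe_size = write_bytes // stripe_count      # equal-sized stripes
        for ost in allocate_osts(stripe_count):        # chosen allocation policy
            ost.free_bytes -= stripe_size              # update simulated OST state
```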
6.3 Darshan-based Workload Generation
Darshan logs [12] are used to generate I/O traces of real-world HPC workloads. The Darshan I/O characterization tool maintains details of every file opened by an application and records the I/O interface used to access it: POSIX, MPI-IO, STDIO, HDF5, or PnetCDF. For the purpose of this article, we only focus on the POSIX and MPI-IO interfaces. The major counters collected by Darshan for every file at the POSIX interface are:
— Timestamps of the first file open/read/write/close operation: POSIX_F_OPEN_START_TIMESTAMP, POSIX_F_READ_START_TIMESTAMP, POSIX_F_WRITE_START_TIMESTAMP, POSIX_F_CLOSE_START_TIMESTAMP
— Timestamps of the last file open/read/write/close operation: POSIX_F_OPEN_END_TIMESTAMP, POSIX_F_READ_END_TIMESTAMP, POSIX_F_WRITE_END_TIMESTAMP, POSIX_F_CLOSE_END_TIMESTAMP
— Cumulative time spent on reading from a file: POSIX_F_READ_TIME
— Cumulative time spent on writing to a file: POSIX_F_WRITE_TIME
— Total number of bytes that were read from a file: POSIX_BYTES_READ
— Total number of bytes written to a file: POSIX_BYTES_WRITTEN
— Rank that accessed a file: the rank field of the file record (with \(-1\) denoting a file shared by all ranks)
The process of converting the file-wise Darshan records into a time-series I/O trace is discussed below. The ranks accessing the file indicate whether the application was run in file-per-process mode or single-shared-file mode. Each file is arranged based on increasing order of first open timestamp. The I/O idle time for every file is calculated by subtracting the cumulative time spent on reading and writing from the duration between the file open start and file close end timestamps. This idle time (or delay) along with the cumulative bytes read or written are uniformly distributed for every file within the file open start timestamp and file close end timestamp. Once the distribution of every file’s I/O activity is done, insertion sort is used to sort and combine the I/O activity of the application in a time-series manner. This process of regenerating I/O workload of an application from its Darshan record is in sync with a technique proposed by Snyder et al. [79]. Figure 14 shows the general process of transforming Darshan logs into comprehensive I/O workloads. In this work, we use this technique to build the trace files of three real-world workloads discussed in Section 7.1.3 that are fed as inputs to the Lustre simulator discussed in the previous section.
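A compact sketch of this transformation, assuming simplified per-file records with only the fields discussed above (and a merge of sorted lists in place of the insertion sort described in the text):

```python
import heapq

def expand(record, slices=4):
    """Uniformly distribute a file's written bytes over its open..close window."""
    start, end = record["open_start"], record["close_end"]
    step = (end - start) / slices
    per_slice = record["bytes_written"] / slices
    return [(start + i * step, record["file"], per_slice) for i in range(slices)]

def build_trace(records):
    # Merge the per-file (already time-sorted) event lists into one trace.
    return list(heapq.merge(*(expand(r) for r in records)))

trace = build_trace([
    {"file": "f0", "open_start": 0.0, "close_end": 4.0, "bytes_written": 1024},
    {"file": "f1", "open_start": 1.0, "close_end": 3.0, "bytes_written": 512},
])
print(trace[:3])
```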
6.4 Validation of the Lustre Simulator
To validate the ns-3-based Lustre simulator and the Darshan-based workload trace generation, we use the Darshan logs of IOR [40] runs on Summit [41], the world’s fourth-largest supercomputer according to the latest Top500 list [82]. As the first step, time-series-based IOR traces are generated from the Darshan logs. These traces are then fed into the Lustre simulator. Since Summit is based on IBM Spectrum Scale, we use the real bandwidth results of IOR as well as the Lustre setup from Titan [22], a decommissioned supercomputer housed at Oak Ridge National Laboratory that used the Lustre file system as its storage backend.
The results from the IOR runs on the Lustre simulator and on Titan are shown in Table 6. As seen from the results, for both 128 MB and 512 MB file sizes and a large number of nodes, the Lustre simulator provides similar file system bandwidth. This enables us to use the Darshan-based workload trace generation approach along with the Lustre simulator for the scalability evaluation of Tarazu.
In addition, we validate the correctness of the simulator by using the same cluster system setup for Lustre as described in Section 7.1.1 (35 OSTs, 7 OSSs) and executing the traces of HACC-I/O (8 processes, 50 million particles) under PFL Configuration 2 and IOR (8 processes, 64 GB) under PFL Configuration 1 simultaneously. The simulator provides a similar OST utilization percentage for both LSA and Tarazu.
7 EVALUATION
To the best of the authors' knowledge, Tarazu is the first framework to address end-to-end I/O load balancing, spanning clients, storage servers, and storage targets, in large-scale parallel file systems.
7.1 Test Environment
7.1.1 Cluster System Setup.
We evaluate Tarazu on a 10-node Lustre testbed consisting of seven OSSs serving a total of 35 OSTs and two client nodes.
7.1.2 Performance Measurements and Metrics.
We analyze the following performance metrics: effective read and write bandwidth, load balance, and resource utilization. To capture the degree of load balancing across all OSTs for a given test run, we define the metric OST Cost as the ratio of the maximum utilization of any OST to the average utilization of all OSTs, as shown in Equation (4). An ideally load-balanced system has an OST Cost of 1.

(4) \(\text{OST Cost} = \dfrac{\text{Maximum OST Utilization}}{\text{Average OST Utilization}}\)
The OST Utilization of an OST is the storage used by the client application on that OST relative to the total storage available on the OST.
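For clarity, the two metrics can be computed as follows (a minimal sketch; the input values are placeholders):

```python
def ost_utilization(used_bytes, capacity_bytes):
    """Utilization of one OST: storage used by the application / total storage."""
    return used_bytes / capacity_bytes

def ost_cost(utilizations):
    """OST Cost (Equation (4)): max utilization / mean utilization; 1.0 is ideal."""
    return max(utilizations) / (sum(utilizations) / len(utilizations))

# A perfectly balanced system yields 1.0; a skewed one yields > 1.0.
assert ost_cost([0.4, 0.4, 0.4, 0.4]) == 1.0
print(ost_cost([0.8, 0.2, 0.2, 0.2]))  # ~2.29: one hot OST dominates
```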
7.1.3 Large-scale HPC Workloads.
We generate the traces of three real-world HPC workloads from different domain sciences using the process discussed in Section 6.3. The details of the workloads, which ran on Summit [41], are outlined in Table 7.
The genomicPrediction code uses the DeepGP package [110], which implements multilayer perceptron (MLP) networks, convolutional neural networks (CNNs), ridge regression, and lasso regression for genomic prediction. This workload thus uses deep learning models to predict complex traits and falls into the category of emerging HPC workloads. For this reason, it performs comparatively more reads than writes.
The Energy Exascale Earth System Model (E3SM) [98] workload is part of an ongoing, state-of-the-science earth system modeling, simulation, and prediction project that optimizes the use of Department of Energy (DOE) laboratory resources to meet the science needs of the nation and the mission needs of DOE. A major motivation for the E3SM project is the paradigm shift in computing architectures and their related programming models as capability moves into the exascale era. The E3SM model simulates the fully coupled climate system at high resolution (15–25 km), will include coupling with energy systems, and has a unique capability for variable-resolution modeling using unstructured grids in all its earth system component models. This workload therefore represents a real-world exascale simulation workload and accordingly produces far more writes than reads.
The cosmoFlow [50] workload processes large 3D cosmology datasets on modern HPC platforms. It adapts a deep learning network to a scalable architecture for large voxel-based problem sizes and predicts three cosmological parameters. The workload uses efficient MKL-DNN primitives for 3D convolutional neural networks within an optimized TensorFlow [1] framework for CPU architectures. This workload thus represents the extreme use case of emerging machine learning workloads on HPC systems, reflected in the large difference between the amounts of data read and written.
The traces of these three workloads are fed as input (representing data usage by client nodes) to the simulator described in Section 6.2. The evaluation is presented in Section 7.5.
7.1.4 File Striping Layouts.
For the non-PFL setup, the stripe count is set to 8 and the stripe size is calculated as described in Section 5.5.2. To evaluate Tarazu with PFL, we use two PFL configurations, referred to as Configuration 1 and Configuration 2 throughout the evaluation.
7.2 OST Utilization
7.2.1 Load Balance for IOR in FPP Mode.
Figure 15 shows the comparison of the load across OSTs under LSA and Tarazu for IOR in FPP mode.
As can be seen in Figures 16 and 17, Tarazu achieves a noticeably more even distribution of data across the OSTs than LSA.
It is important to note that for large files, such as the second IOR experiment with 8 GB non-PFL files per process, the data distribution with Tarazu remains well balanced across the OSTs.
7.2.2 Load Balance for HACC-I/O in FPP Mode.
For HACC-I/O, we evaluate Tarazu in FPP mode under the non-PFL layout as well as both PFL configurations.
Similar to IOR, we observe a significant improvement in load balancing for HACC-I/O, as shown in Figure 18, when compared to the default LSA policy. Note that in Figure 18, C1 denotes PFL Configuration 1, C2 Configuration 2, and N the non-PFL layout. The OST Cost for 16 processes is consistently lower with Tarazu than with LSA across all three layouts.
7.2.3 Load Balance for Single Shared Files.
In SSF mode, all processes write into and read from a single shared file. We run IOR in SSF mode with 8 processes generating 8 GB and 16 GB shared files for both the non-PFL layout and PFL Configuration 1. The results are shown in Figure 19. We observe that Tarazu maintains a better load balance than LSA in SSF mode as well.
7.2.4 Load Balance for Concurrent Application Runs.
We also evaluate the load balance under concurrent application runs by executing HACC-I/O and IOR simultaneously; Tarazu again achieves a lower OST Cost than LSA.
7.3 OSS Utilization
We want to achieve an end-to-end load balance in the file system. Therefore, Tarazu balances the load not only across the OSTs but also across the OSSs that serve them.
Figure 21 shows the comparison of the OSS utilization of all seven OSSs of our testbed under LSA and Tarazu.
7.4 I/O Performance
Next, we compare the effective read and write performance for HACC-I/O and IOR. We measure the I/O rate for storing data to and reading data from the OSTs. Figures 22(a)–22(c) show the read performance results for the FPP and SSF sharing modes with both PFL and non-PFL striping layouts. We see an improvement of up to \(43\%\) in read performance for Tarazu over LSA.
It should be noted that the read performance improvements for FPP with PFL Configuration 2 (denoted as C2 in the graphs) are slightly lower than with Configuration 1. These results are consistent with our earlier assumption that I/O performance degrades when the total number of stripes in a PFL file exceeds the number of available OSTs. A similar trend can be observed in SSF mode, which is why we did not pursue further experiments with PFL Configuration 2 on our 10-node testbed. Moreover, for small systems such as our Lustre testbed, PFL yields no significant benefit, because both the system and the workloads are too small to exploit such an advanced feature.
The write performance results for HACC-I/O and IOR are shown in Figures 22(d)–22(f). Note that these are small-scale experiments run on a testbed with only seven OSSs and two client nodes. As the number of writing processes grows, so does the number of competing processes on the small testbed, which increases file locking contention when accessing the MDS and OSSs. Hence, the write performance with Tarazu on this small testbed is bounded by lock contention and understates the gains we expect at larger scale.
7.5 Scalability Study
To showcase the scalability of our proposed framework, we evaluate Tarazu with the ns-3-based Lustre simulator (Section 6.2) using the traces of the three real-world workloads described in Section 7.1.3.
Figures 23(a)–23(c) show the read performance results of the concurrent application runs using the non-PFL, PFL Configuration 1, and PFL Configuration 2 file layouts. For all file striping layouts, Tarazu achieves higher read bandwidth than LSA.
The simulator confirms that Tarazu retains its load-balancing and performance benefits at scales far beyond our physical testbed.
8 DISCUSSION AND FUTURE WORK
8.1 Emerging Workloads
High-performance computing (HPC) workloads are no longer restricted to traditional checkpoint/restart applications. The growing popularity and capability of machine learning and deep learning approaches in various science domains, such as biology, earth science, and physics, have made the workload mix increasingly read-intensive [68].
8.2 Initial Training through Statistical Analysis and Darshan Logs
Currently, Tarazu's prediction model is trained online from live request statistics; bootstrapping it with a statistical analysis of historical Darshan logs could shorten the warm-up phase and improve early prediction accuracy.
8.3 Rebalancing of Existing Application Datasets
One of the main limitations of Tarazu is that it influences only the placement of newly created files; datasets that already reside on the storage targets are not rebalanced. Migrating existing file stripes from overloaded to underutilized OSTs, transparently to the application, is a natural extension of this work.
8.4 Metadata Load Balancing
One of the main contributions of Tarazu is end-to-end load balancing across OSSs and OSTs; the load on the metadata servers, however, is not yet taken into account.
In the future, we plan to extend the prediction model to not only facilitate sophisticated file striping layouts for individual files, but also provide transparent metadata load balancing to better support data-intensive workloads.
8.5 Integration of Tarazu into Other Hierarchical HPC File Systems
The latest implementation of Tarazu targets Lustre, but its design is not tied to it and can, in principle, be carried over to other hierarchical HPC file systems such as BeeGFS [30] and Ceph [92].
The main challenge for supporting file systems such as BeeGFS and Ceph will be the integration of the stripe placement mechanisms. Here, special API extensions will be necessary to apply the striping patterns and advanced file layouts. For example, we are currently working on an extension to the Ceph user API that will adapt striping functionality from the llapi library to Ceph.
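As a rough sketch of what such an extension needs to do on the Ceph side, the following Python snippet sets a striping layout through CephFS's virtual `ceph.file.layout` extended attributes; the mount point is hypothetical, and layouts can only be set on files that do not yet contain data.

```python
import os

def set_cephfs_layout(path: str, stripe_unit: int, stripe_count: int) -> None:
    """Apply a striping layout to a new, still-empty CephFS file.

    CephFS exposes file layouts as virtual extended attributes
    (ceph.file.layout.*); they are only mutable before data is written.
    Requires a Linux client with CephFS mounted at the given path.
    """
    open(path, "x").close()  # layout can only be changed while the file is empty
    os.setxattr(path, "ceph.file.layout.stripe_unit", str(stripe_unit).encode())
    os.setxattr(path, "ceph.file.layout.stripe_count", str(stripe_count).encode())

# Example (hypothetical mount point): 4 stripes of 1 MiB each.
# set_cephfs_layout("/mnt/cephfs/out.dat", 1 << 20, 4)
```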
8.6 System-specific Autotuning of PFL Configuration
PFL simplifies the use of Lustre so users can expect reasonable performance for a variety of file I/O patterns without having to explicitly understand the parallel I/O model. Specifically, users do not need to know the size or concurrency of the output files before they are created, nor explicitly specify an optimal layout for each file, to achieve good performance. Therefore, the integration of features like PFL is an essential step to support future HPC workloads. For PFL, it is recommended that small files have a lower stripe count (to reduce overhead), and as the file size increases, the stripe count should also be increased (to spread the storage footprint and increase bandwidth). In addition, the layout should only be expanded until the total number of stripes reaches or exceeds the number of OSTs. At this point, it is beneficial to add a final layout extension to EOF that spans all available OSTs to maximize the bandwidth at the end of the file (if it continues to grow significantly in size).
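To make this rule of thumb concrete, the sketch below composes such a layout and renders it as an `lfs setstripe` command; the extent boundaries and stripe counts are illustrative values, not the PFL configurations evaluated in this article.

```python
def pfl_command(path: str, ost_count: int) -> str:
    """Build an 'lfs setstripe' command following the rule of thumb above:
    low stripe counts for small extents, growing with file size, and a final
    extent to EOF (-E -1) striped across all OSTs (-c -1)."""
    components = [("64M", 1), ("512M", 4), ("4G", min(16, ost_count))]
    args = []
    for end, count in components:
        args += ["-E", end, "-c", str(count)]
    args += ["-E", "-1", "-c", "-1"]  # last component spans all available OSTs
    return " ".join(["lfs", "setstripe", *args, path])

# Example for a system with 35 OSTs:
print(pfl_command("/lustre/project/output.h5", ost_count=35))
```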
Currently, the PFL configurations used with Tarazu are defined manually. We plan to autotune the component boundaries and stripe counts of the PFL layout to the characteristics of a specific system and its workloads.
8.7 Further Improving the Scalability of Tarazu
For simplicity reasons, parts of Tarazu currently run in a centralized fashion. Distributing these remaining components would further improve the scalability of the framework.
9 CONCLUSION
This article proposes Tarazu, an adaptive end-to-end I/O load-balancing framework for large-scale parallel file systems. Tarazu balances the load across both storage servers and storage targets, adapts to different file striping layouts including PFL, and remains transparent to client applications.
We evaluate Tarazu on a real Lustre testbed and, at larger scale, with an ns-3-based Lustre simulator replaying traces of real-world HPC workloads. Compared to Lustre's default stripe allocation, Tarazu achieves a substantially better load balance across OSTs and OSSs and improves read performance by up to \(43\%\).
REFERENCES
[1] 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16). USENIX Association, 265–283. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi
[2] 2019. Active learning-based automatic tuning and prediction of parallel I/O performance. In IEEE/ACM 4th International Parallel Data Systems Workshop (PDSW'19). IEEE, 20–29.
[3] 2017. Network Flows: Theory, Algorithms, and Applications. Pearson Education, Chennai, India.
[4] 2018. Towards Efficient and Flexible Object Storage Using Resource and Functional Partitioning. Ph.D. Dissertation. Virginia Tech.
[5] 2016. MOS: Workload-aware elasticity for cloud object stores. In 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC'16). ACM, New York, NY, 177–188.
[6] 2020. Improving collective I/O performance with machine learning supported auto-tuning. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW'20). IEEE, 814–821.
[7] 2021. Improving the MPI-IO performance of applications with genetic algorithm based auto-tuning. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW'21). IEEE, 798–805.
[8] 2019. Optimizing I/O performance of HPC applications with autotuning. ACM Trans. Parallel Comput. 5, 4 (2019), 27 pages.
[9] 2023. I/O access patterns in HPC applications: A 360-degree survey. ACM Comput. Surv. 56, 2 (2023), 41 pages.
[10] 2004. The Lustre Storage Architecture. Technical Report. http://wiki.lustre.org/
[11] 2016. Introduction to Time Series and Forecasting (3rd ed.). Springer International Publishing, Cham, Switzerland.
[12] 2009. 24/7 characterization of petascale I/O workloads. In IEEE International Conference on Cluster Computing and Workshops. IEEE, 12 pages.
[13] 1999. Efficient optimistic parallel simulations using reverse computation. ACM Trans. Model. Comput. Simul. 9, 3 (1999), 224–253.
[14] 2019. I/O characterization and performance evaluation of BeeGFS for deep learning. In 48th International Conference on Parallel Processing (ICPP'19). ACM, New York, NY.
[15] 2011. CODES: Enabling co-design of multi-layer exascale storage architectures. In Workshop on Emerging Supercomputing Technologies.
[16] 2021. Workflows community summit: Advancing the state-of-the-art of scientific workflows management systems research and development. CoRR abs/2106.05177 (2021).
[17] 2020. Performance characterization of scientific workflows for the optimal use of burst buffers. Fut. Gen. Comput. Syst. 110 (2020), 468–480.
[18] 2012. A dynamic and adaptive load balancing strategy for parallel file system with large-scale I/O servers. J. Parallel Distrib. Comput. 72, 10 (2012), 1254–1268.
[19] 2012. Damaris: How to efficiently leverage multicore parallelism to achieve scalable, jitter-free I/O. In IEEE International Conference on Cluster Computing. IEEE, 155–163.
[20] 2015. Using formal grammars to predict I/O behaviors in HPC: The Omnisc'IO approach. IEEE Trans. Parallel Distrib. Syst. 27, 8 (2015), 2435–2449.
[21] 2012. Toward comprehensive and accurate simulation performance prediction of parallel file systems. In IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'12). IEEE, 1–12.
[22] 2022. Titan Supercomputer. Oak Ridge National Laboratory. https://www.olcf.ornl.gov/olcf-resources/compute-systems/titan/
[23] 2014. HPIS3: Towards a high-performance simulator for hybrid parallel I/O and storage systems. In 9th Parallel Data Storage Workshop. IEEE, 37–42.
[24] 2017. A characterization of workflow management systems for extreme-scale applications. Fut. Gen. Comput. Syst. 75 (2017), 228–238.
[25] 2011. An overview of the HDF5 technology suite and its applications. In EDBT/ICDT'11 Workshop on Array Databases. ACM, New York, NY.
[26] 2020. New Lustre features to improve Lustre metadata and small-file performance. Concurr. Comput.: Pract. Exper. 32, 20 (2020), 6 pages.
[27] 2010. Lustre Monitoring Tool (LMT). https://github.com/LLNL/lmt
[28] 2013. HACC: Extreme scaling and performance across diverse architectures. In International Conference on High Performance Computing, Networking, Storage and Analysis (SC'13). ACM, New York, NY.
[29] 2015. HAS: Heterogeneity-aware selective data layout scheme for parallel file systems on hybrid servers. In IEEE International Parallel and Distributed Processing Symposium. IEEE.
[30] Jan Heichler. 2014. An Introduction to BeeGFS v1.1. https://www.beegfs.de/docs/whitepapers/Introduction_to_BeeGFS_by_ThinkParQ.pdf. Accessed February 16, 2024.
[31] 2013. ZeroMQ: Messaging for Many Applications. O'Reilly Media, Inc.
[32] 2020. OOOPS: An innovative tool for IO workload management on supercomputers. In IEEE 26th International Conference on Parallel and Distributed Systems (ICPADS'20). IEEE.
[33] 2019. Automatic, application-aware I/O forwarding resource allocation. In 17th USENIX Conference on File and Storage Technologies (FAST'19). USENIX Association, 265–279. https://www.usenix.org/conference/fast19/presentation/ji
[34] 2020. Towards HPC I/O performance prediction through large-scale log analysis. In 29th International Symposium on High-Performance Parallel and Distributed Computing. ACM, New York, NY, 77–88.
[35] 2012. IOPin: Runtime profiling of parallel I/O in HPC systems. In SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE, 18–23.
[36] 2016. Utilizing progressive file layout leveraging SSDs in HPC cloud environments. In IEEE 1st International Workshops on Foundations and Applications of Self* Systems (FAS*W'16). IEEE, 90–95.
[37] 2016. Forecasting HPC workload using ARMA models and SSA. In International Conference on Information Technology (ICIT'16). IEEE, 294–297.
[38] 2024. The DiskSim Simulation Environment (V4.0). Carnegie Mellon University. https://www.pdl.cmu.edu/DiskSim/
[39] 2022. TOKIO: Total Knowledge of I/O. https://www.nersc.gov/research-and-development/storage-and-i-o-technologies/tokio/
[40] 2021. IOR Benchmark Summary. https://asc.llnl.gov/sequoia/benchmarks/IORsummaryv1.0.pdf
[41] 2021. Summit Supercomputer. https://www.olcf.ornl.gov/summit/
[42] 2022. Frontier Supercomputer. https://www.olcf.ornl.gov/frontier/
[43] 2022. Oak Ridge National Laboratory storage ecosystem. In Platform for Advanced Scientific Computing Conference (PASC'22). https://linklings.s3.amazonaws.com/organizations/pasc/pasc22/submissions/stype117/PYFgV-msa274s1.pdf
[44] 2014. Enabling dynamic file I/O path selection at runtime for parallel file system. J. Supercomput. 68, 2 (2014), 996–1021.
[45] 2017. Scientific user behavior and data-sharing trends in a petascale file system. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17). ACM, New York, NY, 1–12.
[46] 2011. Towards simulation of parallel file system scheduling algorithms with PFSsim. In 7th IEEE International Workshop on Storage Network Architecture and Parallel I/O (SNAPI'11). IEEE, 12 pages.
[47] 2013. On the design and implementation of a simulator for parallel file system research. In IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST'13). IEEE, 5 pages.
[48] 2017. UMAMI: A recipe for generating meaningful metrics through holistic I/O performance analysis. In 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems. ACM, New York, NY, 55–60.
[49] 2013. A multi-level approach for understanding I/O activity in HPC applications. In IEEE International Conference on Cluster Computing (CLUSTER'13). IEEE, 5 pages.
[50] 2018. CosmoFlow: Using deep learning to learn the universe at scale. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC'18). IEEE, 11 pages.
[51] 2016. Evaluating progressive file layouts for Lustre. In Cray User Group Conference (CUG'16).
[52] 2009. Building a parallel file system simulator. J. Phys.: Conf. Series 180 (July 2009), 012050.
[53] 2012. ScalaTrace: Tracing, analysis and modeling of HPC codes at scale. In Applied Parallel and Scientific Computing. Springer, Berlin, 410–418.
[54] 2008. High-Performance Features and Flexible Support for a Wide Array of Networks.
[55] 2018. Accelerating Network Communication and I/O in Scientific High Performance Computing Environments. Ph.D. Dissertation. Heidelberg University, Germany.
[56] 2016. An I/O load balancing framework for large-scale applications (BPIO 2.0). In Poster at SC'16.
[57] 2021. Parallel I/O evaluation techniques and emerging HPC workloads: A perspective. In IEEE International Conference on Cluster Computing (CLUSTER'21). IEEE, 671–679.
[58] 2017. Automatic and transparent resource contention mitigation for improving large-scale parallel file system performance. In IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS'17). IEEE, 604–613.
[59] 2023. ns-3 Network Simulator. https://www.nsnam.org/
[60] 2002. Markov model prediction of I/O requests for scientific applications. In 16th International Conference on Supercomputing (ICS'02). ACM, New York, NY, 147–155.
[61] 2021. Lustre Operations Manual 2.x. https://www.lustre.org/documentation/
[62] Sarp Oral, James Simmons, Jason Hill, Dustin Leverman, Feiyi Wang, Matt Ezell, Ross Miller, Douglas Fuller, Raghul Gunasekaran, Youngjae Kim, Saurabh Gupta, Devesh Tiwari, Sudharshan S. Vazhkudai, James H. Rogers, David Dillow, Galen M. Shipman, and Arthur S. Bland. 2014. Best practices and lessons learned from deploying and operating large-scale data-centric parallel file systems. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC'14). IEEE, 217–228.
[63] 2020. Analytical modelling of distributed file systems (GlusterFS and CephFS). In Reliability, Safety and Hazard Assessment for Risk-Based Technologies. Springer, Singapore, 213–222.
[64] 2020. Uncovering access, reuse, and sharing characteristics of I/O-intensive files on large-scale production HPC systems. In 18th USENIX Conference on File and Storage Technologies (FAST'20). USENIX Association, 91–101. https://www.usenix.org/conference/fast20/presentation/patel-hpc-systems
[65] 2019. Revisiting I/O behavior in large-scale storage systems: The expected and the unexpected. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19). ACM, New York, NY, Article 65, 13 pages.
[66] 2020. Understanding HPC application I/O behavior using system level statistics. In IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC'20). IEEE, 202–211.
[67] 2017. I/O load balancing for big data HPC applications. In IEEE International Conference on Big Data (Big Data'17). IEEE, 233–242.
[68] 2021. Characterizing machine learning I/O workloads on leadership scale HPC systems. In 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS'21). IEEE, 1–8.
[69] 2017. Dynamic virtual machine placement in cloud computing. In Resource Management and Efficiency in Cloud Computing Environments. IGI Global, 136–167.
[70] 2017. Toward scalable monitoring on large-scale storage for software defined cyberinfrastructure. In 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems. ACM, New York, NY, 49–54.
[71] 2016. CHOPPER: Optimizing data partitioning for in-memory data analytics frameworks. In IEEE International Conference on Cluster Computing (CLUSTER'16). IEEE, 110–119.
[72] 2001. An introduction to the InfiniBand architecture. High Perform. Mass Stor. Parallel I/O 42 (2001), 617–632.
[73] 2014. High Performance Parallel I/O. CRC Press.
[74] 2009. A novel network request scheduler for a large scale storage system. Comput. Sci.-Res. Devel. 23, 3-4 (2009), 143–148.
[75] 2021. Apollo: An ML-assisted real-time storage resource observer. In 30th International Symposium on High-Performance Parallel and Distributed Computing. ACM, New York, NY, 147–159.
[76] 2000. Link prediction and path analysis using Markov chains. Comput. Netw. 33, 1-6 (2000), 377–386.
[77] 2002. GPFS: A shared-disk file system for large computing clusters. In 1st USENIX Conference on File and Storage Technologies (FAST'02). USENIX Association, 231–244.
[78] 2016. Modular HPC I/O characterization with Darshan. In 5th Workshop on Extreme-Scale Programming Tools (ESPT'16). IEEE, 9–17.
[79] 2015. Techniques for modeling large-scale HPC I/O workloads. In 6th International Workshop on Performance Modeling, Benchmarking, and Simulation of High Performance Computing Systems. ACM, New York, NY, 11 pages.
[80] 2011. A segment-level adaptive data layout scheme for improved load balance in parallel file systems. In 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid'11). IEEE, 414–423.
[81] 1999. On implementing MPI-IO portably and with high performance. In 6th Workshop on I/O in Parallel and Distributed Systems (IOPADS'99). ACM, New York, NY, 23–32.
[82] 2022. TOP500 List. https://www.top500.org/lists/top500/2022/06/
[83] 2017. Alleviating I/O interference through workload-aware striping and load-balancing on parallel file systems. In ISC High Performance (ISC'17). Springer International Publishing, Cham, 315–333.
[84] 1977. A note on convergence of the Ford-Fulkerson flow algorithm. Math. Oper. Res. 2, 2 (1977), 143–144.
[85] 2010. Parallel I/O performance: From events to ensembles. In IEEE International Symposium on Parallel & Distributed Processing (IPDPS'10). IEEE, 1–11.
[86] 2017. GUIDE: A scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC'17). ACM, New York, NY, 1–12.
[87] 2019. iez: Resource contention aware load balancing for large-scale parallel file systems. In IEEE International Parallel and Distributed Processing Symposium (IPDPS'19). IEEE, 610–620.
[88] 1995. The POSIX family of standards. StandardView 3, 1 (Mar. 1995), 11–17.
[89] 2020. Recorder 2.0: Efficient parallel I/O tracing and analysis. In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW'20). IEEE, 1–8.
[90] 2013. Performance and scalability evaluation of the Ceph parallel file system. In 8th Parallel Data Storage Workshop. ACM, New York, NY, 14–19.
[91] 2014. Improving large-scale storage system performance via topology-aware and balanced data placement. In 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS'14). IEEE, 656–663.
[92] 2006. Ceph: A scalable, high-performance distributed file system. In 7th Conference on Operating Systems Design and Implementation (OSDI'06). USENIX Association, 307–320.
[93] Marc C. Wiedemann, Julian M. Kunkel, Michaela Zimmer, Thomas Ludwig, Michael Resch, Thomas Bönisch, Xuan Wang, Andriy Chut, Alvaro Aguilera, Wolfgang E. Nagel, Michael Kluge, and Holger Mickler. 2013. Towards I/O analysis of HPC systems and a generic architecture to collect access patterns. Comput. Sci.-Res. Devel. 28 (2013), 241–251.
[94] 2013. Parallel file system analysis through application I/O tracing. Comput. J. 56, 2 (2013), 141–155.
[95] 2013. Elastic and scalable tracing and accurate replay of non-deterministic events. In 27th International ACM Conference on Supercomputing (ICS'13). ACM, New York, NY, 59–68.
[96] 2011. Probabilistic communication and I/O tracing with deterministic replay at scale. In International Conference on Parallel Processing. IEEE, 196–205.
[97] 2016. Big data analytics on HPC architectures: Performance and cost. In IEEE International Conference on Big Data (Big Data'16). IEEE, 2286–2295.
[98] Shaocheng Xie, Wuyin Lin, Philip J. Rasch, Po-Lun Ma, Richard Neale, Vincent E. Larson, Yun Qian, Peter A. Bogenschutz, Peter Caldwell, Philip Cameron-Smith, Jean-Christophe Golaz, Salil Mahajan, Balwinder Singh, Qi Tang, Hailong Wang, Jin-Ho Yoon, Kai Zhang, and Yuying Zhang. 2018. Understanding cloud and convective characteristics in version 1 of the E3SM atmosphere model. J. Adv. Model. Earth Syst. 10, 10 (2018), 2618–2644.
[99] 2016. LIOProf: Exposing Lustre file system behavior for I/O middleware. In Cray User Group Meeting (CUG'16).
[100] 2017. Accelerating big data analytics on HPC clusters using two-level storage. Parallel Comput. 61 (2017), 18–34.
[101] 2019. End-to-end I/O monitoring on a leading supercomputer. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI'19). USENIX Association, 379–394. https://www.usenix.org/conference/nsdi19/presentation/yang
[102] 2023. End-to-end I/O monitoring on leading supercomputers. ACM Trans. Storage 19, 1, Article 3 (Jan. 2023), 35 pages.
[103] 2022. An end-to-end and adaptive I/O optimization tool for modern HPC storage systems. In IEEE International Parallel and Distributed Processing Symposium (IPDPS'22). IEEE, 1294–1304.
[104] 2009. Lustre Simulator. https://github.com/yingjinqian/Lustre-Simulator
[105] 2008. On the design of distributed object placement and load balancing strategies in large-scale networked multimedia storage systems. IEEE Trans. Knowl. Data Eng. 20, 3 (2008), 369–382.
[106] 2004. BigSim: A parallel simulator for performance prediction of extremely large parallel machines. In 18th International Parallel and Distributed Processing Symposium. IEEE, 10 pages.
[107] 2013. HySF: A striped file assignment strategy for parallel file system with hybrid storage. In IEEE 10th International Conference on High Performance Computing and Communications and IEEE International Conference on Embedded and Ubiquitous Computing (HPCC & EUC'13). IEEE, 511–517.
[108] 2023. MAWA-HPC: Modular and automated workload analysis for HPC systems. In Poster at ISC High Performance Conference (ISC'23).
[109] 2022. A comprehensive I/O knowledge cycle for modular and automated HPC workload analysis. In IEEE International Conference on Cluster Computing (CLUSTER'22). IEEE, 581–588.
[110] 2022. Deep Learning for Genomic Prediction (DeepGP). https://github.com/lauzingaretti/DeepGP