Abstract
This paper offers a solution to the complexities of production-system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40,960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Drawing on Beacon’s deployment on TaihuLight for more than three years, we demonstrate its effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others currently being addressed. Encouraged by Beacon’s success in I/O monitoring, we extend it to monitor interconnection networks, another contention point on supercomputers. In addition, we demonstrate Beacon’s generality by extending it to other supercomputers. Both Beacon’s code and part of the collected monitoring data have been released.
1 INTRODUCTION
Modern supercomputers are networked systems with increasingly deep storage hierarchies, serving applications with growing scale and complexity. The long I/O path from storage media to application, combined with complex software stacks and hardware configurations, makes I/O optimizations increasingly challenging for application developers and supercomputer administrators. In addition, because I/O utilizes heavily shared system components (unlike computation or memory accesses), it usually suffers from substantial inter-workload interference, causing high performance variance [23, 32, 37, 45, 52, 63, 71].
Online tools that can capture/analyze I/O activities and guide optimization are urgently needed. They also need to provide I/O usage information and performance records to guide future systems’ design, configuration, and deployment. To this end, several profiling/tracing tools and frameworks have been developed, including application-side (e.g., Darshan [9], ScalableIOTrace [81], and IOPin [34]), back-end side (e.g., LustreDU [7], IOSI [40], and LIOProf [91]), and multi-layer tools (e.g., EZIOTracer [49], GUIDE [80], and Logaider [12]).
These proposed tools, however, have one or more of the following limitations. Application-oriented tools often require developers to instrument their source code or link extra libraries. They also do not offer intuitive ways to analyze inter-application I/O performance behaviors such as interference issues. Back-end-oriented tools can collect system-level performance data and monitor cross-application interactions but have difficulty identifying performance issues for specific applications and finding their root causes. Finally, problematic applications that issue inefficient I/O requests escape the radar of back-end-side analytical methods [40, 41], which rely on identifying high-bandwidth applications.
This paper reports the design, implementation, and deployment of a lightweight, end-to-end I/O resource monitoring and diagnosis system, Beacon, for TaihuLight, currently the fourth-ranked supercomputer in the world [29]. It works with TaihuLight’s 40,960 compute nodes (over ten million cores in total), 288 forwarding nodes, 288 storage nodes, and two metadata nodes. Beacon integrates front-end tracing and back-end profiling into a seamless framework, enabling tasks such as automatic per-application I/O behavior profiling, I/O bottleneck/interference analysis, and system anomaly detection.
To the best of our knowledge, this is the first system-level, multi-layer monitoring and real-time diagnosis framework deployed on ultra-scale supercomputers. Beacon collects performance data simultaneously from different types of nodes (including the compute, I/O forwarding, storage, and metadata nodes) and analyzes them collaboratively, without requiring any involvement of application developers. Its carefully designed collection scheme and aggressive compression minimize the system cost: only 85 part-time servers are needed to monitor the entire 40,960-node system, with \(\lt \!1\%\) performance overhead in user applications.
We have deployed Beacon for production use since April 2017. It has already helped the TaihuLight system administration and I/O performance team identify several performance degradation problems. With its rich I/O performance data collection and real-time system monitoring, Beacon successfully exposes the mismatch between application I/O patterns and widely adopted underlying storage design/configurations. To help application developers and users, it enables detailed per-application I/O behavior study, with novel inter-application interference identification and analysis. Beacon also performs automatic anomaly detection. Finally, we have recently started to expand Beacon beyond I/O to network switch monitoring.
Based on our design and deployment experience, we argue that having such an end-to-end, detailed I/O monitoring framework is highly rewarding. Beacon’s system-level monitoring decouples it from language, library, or compiler constraints, enabling the collection and analysis of monitoring data for all applications and users. Much of its infrastructure reuses existing server/network/storage resources, and it has proved to have negligible overhead. In exchange, users and administrators harvest deep insights into the complex I/O system components’ operations and interactions, and reduce both the human resources and machine core-hours wasted on unnecessarily slow/jittery I/O or system anomalies.
2 TAIHULIGHT NETWORK STORAGE
Let us first introduce the TaihuLight supercomputer (and its Icefish I/O subsystem) used to perform our implementation and deployment. Though the rest of our discussion is based on this specific platform, many aspects of Beacon’s design and operation can be applied to other large-scale supercomputers or clusters.
TaihuLight, currently the fourth-ranked supercomputer in the world, is a many-core accelerated 125-petaflop system [22]. Figure 1 illustrates its architecture, highlighting the Icefish storage subsystem. The 40,960 260-core compute nodes are organized into 40 cabinets, each containing four supernodes. Through dual-rail FDR InfiniBand, all the 256 compute nodes in one supernode are fully connected and then connected to Icefish via a Fat-tree network. In addition, Icefish serves an Auxiliary Compute Cluster (ACC) with Intel Xeon processors, mainly used for data pre- and post-processing.
The Icefish back end employs the Lustre parallel file system [4], with an aggregate capacity of 10 PB on top of 288 storage nodes and 144 Sugon DS800 disk enclosures. An enclosure contains 60 1.2-TB SAS HDD drives, composing six Object Storage Targets (OSTs), each an 8+2 RAID6 array. The controller within each enclosure connects to two storage nodes, via two fiber channels for path redundancy. Therefore, every storage node manages three OSTs, while the two adjacent storage nodes sharing a controller form a failover pair.
Between the compute nodes and the Lustre back end is a layer of 288 I/O forwarding nodes. Each plays a dual role, both as a Lightweight File System (LWFS) based on the Gluster [13] server to the compute nodes and a client to the Lustre back end. This I/O forwarding practice is adopted by multiple other platforms that operate at such a scale [6, 44, 53, 82, 95].
A forwarding node provides a bandwidth of 2.5 GB/s, aggregating to over 720 GB/s for the entire forwarding system. Each back-end controller provides about 1.8 GB/s, amounting to a file system bandwidth of around 260 GB/s. Overall, Icefish delivers 240 GB/s and 220 GB/s aggregate bandwidths for reads and writes, respectively.
TaihuLight debuted on the Top500 list in June 2016. At the time of this study, Icefish was equally partitioned into two namespaces: Online1 (for everyday workloads) and Online2 (reserved for ultra-scale jobs that occupy the majority of the compute nodes), with disjoint sets of forwarding nodes. A batch job can only use one of the two namespaces. I/O requests from a compute node are served by a specified forwarding node using a static mapping strategy for easy maintenance (48 fixed forwarding nodes for the ACC and 80 fixed forwarding nodes for Sunway compute nodes).
Therefore, the two namespaces, along with statically partitioned back-end resources, are currently utilized separately by routine jobs and “VIP” jobs. One motivation for deploying an end-to-end monitoring system is to analyze the I/O behavior of the entire supercomputer’s workloads and design more flexible I/O resource allocation/scheduling mechanisms. For example, motivated by the findings of our monitoring system, a dynamic forwarding allocation system [31] for better forwarding resource utilization was developed, tested, and deployed.
3 BEACON DESIGN AND IMPLEMENTATION
3.1 Beacon Architecture Overview
Figure 2 shows the three components of Beacon: the monitoring component, the storage component, and a dedicated Beacon server. Beacon performs I/O monitoring at six components of TaihuLight: the LWFS client (on the compute nodes), the LWFS server, the Lustre client (the latter two both on the forwarding nodes), the Lustre server (on the storage nodes), the Lustre metadata server (on the metadata nodes), and the job scheduler (on the scheduler node). For the first five, Beacon deploys lightweight daemons that collect I/O-relevant events, status, and performance data locally, then deliver the aggregated and compressed data to Beacon’s distributed databases, deployed on 84 part-time servers. Aggressive first-pass compression is conducted on all compute nodes for efficient per-application I/O trace collection/storage. For the job scheduler, Beacon interacts with the job queuing system to keep track of per-job information, and then sends the job information to the MySQL database (on the 85th part-time server). Details of Beacon’s monitoring component can be found in Section 3.2.
Beacon’s storage component is deployed on 85 of the 288 storage nodes. Beacon distributes its major back-end processing and storage workflow across these storage nodes with their node-local disks, achieving low overall overhead and satisfying service stability. To this end, Beacon divides the 40,960 compute nodes into 80 groups and enlists 80 of the 288 storage nodes to communicate with one group each. Two more storage nodes collect data from the forwarding nodes, plus one for the storage nodes and a final one for the Metadata Server (MDS). Together, these 84 “part-time” servers (shown as “N1” to “N84” in Figure 2) are called log servers and host Beacon’s distributed I/O record database. Given that data are collected from more than 50,000 nodes in total, spreading the collection across this many servers benefits Beacon’s stability and concurrent-access efficiency. In addition, one more storage node (N85 in Figure 2) hosts Beacon’s job database (implemented using MySQL [16]). By leveraging the hardware already available on the supercomputer, we can deploy Beacon quickly.
These log servers adopt a layered software architecture built upon mature open-source frameworks. They collect I/O-relevant events, status, and performance data through Logstash [78], a server-side log processing pipeline for simultaneously ingesting data from multiple sources. The data are then imported to Redis [65], a widely used in-memory data store, acting as a cache to quickly absorb monitoring output. Persistent data storage and subsequent analysis are done via Elasticsearch [36], a distributed lightweight search and analytics engine supporting a NoSQL database. It also supports efficient Beacon queries for real-time and offline analysis.
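As a rough illustration of this pipeline, a minimal Logstash configuration on a log server might ingest local monitoring output and buffer it in Redis; the file path, host, and key below are invented for illustration, not Beacon’s actual settings.

```
input {
  file { path => "/var/log/beacon/*.log" }   # local monitoring output
}
output {
  redis {                                    # buffer into the Redis cache
    host      => "127.0.0.1"
    data_type => "list"
    key       => "beacon-monitoring"
  }
}
```

A second pipeline stage would then drain the Redis list into Elasticsearch for persistent storage and querying.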
Finally, Beacon conducts data analytics and visualizes the results for its users (either system administrators or application users) on a dedicated Beacon server. This server performs two kinds of offline data analysis periodically: (1) second-pass, inter-node compression, which further removes data redundancy by comparing and combining logs from compute nodes running the same job, and (2) extraction of per-job statistic summaries, cached in MySQL as SQL views, along with generation of common performance visualization results, cached in Redis, to facilitate speedy user response. Log and monitoring data, after the two-pass compression, are permanently stored using Elasticsearch on this dedicated Beacon server. Data in the distributed I/O record database are kept for six months. Considering the typical daily data collection size of 10–100 GB, the server’s 120-TB RAID5 capacity far exceeds the system’s lifetime storage space needs.
Beacon’s web interface uses the Vue [93]+Django [19] framework, which can efficiently separate the front end (a user-friendly GUI for processing and visualizing the I/O-related job/system information queries) and the back end (the service for obtaining the analysis results of Beacon and feeding them back to the front end). For instance, application users can query a summary of their programs’ I/O behavior based on the job ID, along the entire I/O path, to help diagnose I/O performance problems. Moreover, system administrators can monitor real-time load levels on all forwarding nodes, storage nodes, and metadata servers, facilitating future job scheduling optimizations and center-level resource allocation policies. Figure 3 shows the corresponding screenshots. Section 4 provides more details, with concrete case studies.
All communication among Beacon entities uses a low-cost, easy-to-maintain Ethernet connection (marked in green in Figure 1) that is separate from both the main computation and the storage interconnects.
3.2 Multi-layer I/O Monitoring
Figure 4 shows the format of all data collected by Beacon, including the LWFS client trace entry, LWFS server log entry, Lustre client log entry, Lustre server log entry, Lustre MDS log entry, and Job scheduler log entry. For details, see the following section.
3.2.1 Compute Nodes.
On each of the 40,960 compute nodes, Beacon collects LWFS client trace logs by instrumenting the FUSE (Filesystem in Userspace) [17] layer. Each log entry contains the node’s IP, I/O operation type, file descriptor, offset, request size, and timestamp.
On a typical day, such raw trace data alone amount to over 100 GB, making their collection/processing a non-trivial task for Beacon’s I/O record database, which takes away resources from the storage nodes. However, there exists abundant redundancy in HPC workloads’ I/O operations. For example, as each compute node is usually dedicated to one job at a time, the job IDs are identical among many trace entries. Similarly, owing to the regular, tightly coupled nature of many parallel applications, adjacent I/O operations likely have common components, such as the target file, operation type, and request size. Recognizing this, Beacon performs aggressive online compression on each compute node to dramatically reduce the I/O trace size. This is done by a simple, linear algorithm that compares adjacent log items and combines those with an identical operation type, file descriptor, and request size that access contiguous areas. Such log items are replaced with a single item plus a counter. Given the low computing overhead, we perform this parallel first-pass compression on the compute nodes themselves.
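The first-pass merging can be sketched as follows; the entry fields mirror the trace format described above, but the function and field names are illustrative rather than Beacon’s actual implementation.

```python
# Hypothetical sketch of Beacon's first-pass, on-node trace compression:
# adjacent entries with the same operation type, file descriptor, and
# request size that access contiguous offsets collapse into one entry
# carrying a repeat counter.

def compress_trace(entries):
    """entries: dicts with "op", "fd", "size", and "offset" keys,
    in arrival order."""
    compressed = []
    for e in entries:
        if compressed:
            last = compressed[-1]
            # The next contiguous offset follows the end of the merged run.
            contiguous = (last["offset"] + last["count"] * last["size"]
                          == e["offset"])
            if (last["op"] == e["op"] and last["fd"] == e["fd"]
                    and last["size"] == e["size"] and contiguous):
                last["count"] += 1  # extend the run instead of appending
                continue
        compressed.append(dict(e, count=1))
    return compressed
```

A node writing a file sequentially in fixed-size chunks thus reduces thousands of entries to a single record, consistent with the heavy redundancy described above.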
Beacon conducts offline log processing and second-pass compression on the dedicated server. Here, it extracts the feature vector \(\lt\)time, operation, file descriptor, size, offset\(\gt\) from the original log records and performs inter-node compression by comparing feature vector lists from all nodes and merging identical vectors, using a similar approach as in block trace modeling [77] or ScalaTrace [54].
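The inter-node second pass can likewise be sketched, here in a deliberately simplified form that only merges nodes whose feature-vector lists are exactly identical (ScalaTrace-style merging is more sophisticated; all names are illustrative):

```python
# Hypothetical sketch of the second-pass, inter-node compression: compute
# nodes whose <time, operation, file descriptor, size, offset> vector
# lists match exactly are recorded once, with the node list attached.

def merge_across_nodes(node_logs):
    """node_logs: {node_id: list of feature-vector tuples}."""
    merged = {}
    for node, vectors in node_logs.items():
        key = tuple(tuple(v) for v in vectors)  # hashable signature
        merged.setdefault(key, []).append(node)
    return [{"nodes": nodes, "vectors": list(key)}
            for key, nodes in merged.items()]
```

For tightly coupled SPMD applications, many ranks issue identical request sequences, so this merge removes most of the remaining cross-node redundancy.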
Table 1 summarizes the effectiveness of Beacon’s monitoring data compression. It gives the compression ratios of the two methods for eight applications, including six open-source applications (APT [84],
3.2.2 Forwarding Nodes.
On each forwarding node, Beacon profiles both the LWFS server and Lustre client. It collects the latency and processing time for each LWFS server request by instrumenting all I/O operations at the POSIX layer and the request queue length for each LWFS server by sampling the queue status once per 1,000 requests. Rather than saving the per-request traces, the Beacon daemon periodically processes new traces and only saves I/O request statistics such as latency and queue length distribution.
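The reduction from per-request traces to statistics might look like the following sketch, which condenses a window of request latencies into a histogram plus aggregates (the bucket boundaries and names are invented for illustration):

```python
# Hypothetical sketch of the periodic trace-to-statistics reduction on a
# forwarding node: keep a latency distribution and aggregates, discard
# the raw per-request records.

def summarize_latencies(latencies_us, bucket_bounds=(100, 1000, 10000)):
    """latencies_us: per-request latencies (microseconds) in one window.
    Returns a histogram over the given bucket bounds plus aggregates."""
    hist = [0] * (len(bucket_bounds) + 1)
    for lat in latencies_us:
        for i, bound in enumerate(bucket_bounds):
            if lat < bound:
                hist[i] += 1
                break
        else:
            hist[-1] += 1  # beyond the last bound
    return {"count": len(latencies_us),
            "mean": sum(latencies_us) / max(len(latencies_us), 1),
            "hist": hist}
```

Storing only such summaries keeps the monitoring footprint nearly constant regardless of the request rate.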
For the Lustre client, Beacon collects request statistics by sampling the status of all outstanding RPC requests once every second. Each sample contains the forwarding ID and RPC request size sent to the Lustre server.
3.2.3 Storage Nodes and MDS.
On the storage nodes, Beacon daemons periodically sample the Lustre OST status table, record data items such as the OST ID and OST total data size, and further send high-level statistics such as the count of RPC requests and average per-RPC data size in the past time window. On the Lustre MDS, Beacon also periodically collects and records statistics on active metadata operations (such as open and lookup) at 1-second intervals while storing a summary of the periodic statistics in its database.
3.3 Multi-layer I/O Profiling
All the aforementioned monitoring data are transmitted as JSON objects to the database on the dedicated Beacon server for long-term storage and processing, on top of which Beacon builds I/O monitoring/profiling services. These include automatic anomaly detection, which runs periodically, as well as query and visualization tools, which supercomputer users and administrators can use interactively. Below, we describe these functions in more detail.
3.3.1 Automatic Anomaly Detection.
Beacon performs two types of automatic anomaly detection. The first locates job I/O performance anomalies, which are common in complicated HPC environments. Various factors can cause them, I/O interference being among the major ones; as supercomputer architectures become more complicated, such interference becomes increasingly difficult to identify and locate. The second type identifies node anomalies. Outright failure, where a node is entirely out of service, is a common node anomaly that can be detected relatively straightforwardly in a large system and is commonly handled by tools such as heartbeat detection [67, 74]; we do not discuss it in this paper. Instead, we focus on faulty system components: components that are alive yet slow, such as forwarding nodes and OSTs under performance degradation. These may continue to serve requests, but at a much slower pace, draining the entire application’s performance and reducing overall system utilization. In a dynamic storage system serving multiple platforms and many concurrent applications, such stragglers are difficult to identify.
With Beacon’s continuous, end-to-end, multi-layer I/O monitoring, application developers and supercomputer administrators gain a new option for examining job performance and system health: connecting statistics on application-issued I/O requests to individual OSTs’ bandwidth measurements. Such a connection guides Beacon in deducing what is the norm and what is an exception. Leveraging this capability, we design and implement a lightweight, automatic anomaly detection tool. Figure 5 shows its workflow.
The left part of the figure shows the job I/O performance anomaly detection workflow. Beacon detects job I/O performance anomalies by checking newly measured I/O performance results against historical records, based on the assumption that most data-intensive applications have relatively consistent I/O behavior. First, it adopts the automatic I/O phase identification technique of the IOSI system [40], developed on the Oak Ridge National Laboratory Titan supercomputer, which uses the Discrete Wavelet Transform (DWT) to find distinct “I/O bursts” in continuous I/O bandwidth time-series data. Then, Beacon deploys a two-stage approach to detect jobs’ abnormal I/O phases. In the first stage, Beacon classifies the I/O phases into several distinct categories in terms of their I/O mode and total I/O volume using the DBSCAN clustering algorithm [18]. In the second stage, Beacon calculates the I/O phases’ performance vectors for each category, clusters these vectors with DBSCAN again, and then identifies each job’s abnormal I/O phases from the clustering results. Here, we propose a new measurement feature, the performance vector, which describes the I/O phase’s throughput waveform. Intuitively, the throughput of an abnormal I/O phase is substantially lower for most of the phase’s duration compared to an I/O phase with normal performance. Therefore, the throughput distribution can be an important feature for differentiating whether an I/O phase is abnormal.
The process of calculating the performance vector is shown in Algorithm 1. We divide the throughput range between the phase’s minimum and maximum into N intervals and determine the time span the I/O phase spends in each. Here, we take
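A minimal sketch of this computation, assuming the phase’s throughput is given as an evenly sampled time series and using N equal-width intervals (function and parameter names are illustrative, not Algorithm 1 verbatim):

```python
# Hypothetical sketch of the performance-vector computation: the fraction
# of a phase's samples whose throughput falls into each of n_intervals
# equal-width bins between the phase's minimum and maximum.

def performance_vector(throughput_series, n_intervals=10):
    lo, hi = min(throughput_series), max(throughput_series)
    width = (hi - lo) / n_intervals or 1.0  # guard against a flat series
    vec = [0] * n_intervals
    for t in throughput_series:
        idx = min(int((t - lo) / width), n_intervals - 1)
        vec[idx] += 1
    total = len(throughput_series)
    return [c / total for c in vec]
```

A phase that lingers near its minimum throughput yields a vector weighted toward the low bins, which is the signature the second DBSCAN stage separates from normal phases.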
Then, Beacon utilizes its rich monitoring data to examine neighbor jobs that share forwarding node(s) with the abnormal job when outliers are found. In particular, it judges the cause of the anomaly by whether such neighbors have interference-prone features, such as high MDOPS, high I/O bandwidth, high IOPS, or N:1 I/O mode. The I/O mode indicates the parallel file sharing mode among processes, where common modes include “N:N” (each compute process accesses a separated file), “N:1” (all processes share one file), “N:M” (N processes perform I/O aggregation to access M files, M\(\lt\)N), and “1:1” (only one of all processes performs sequential I/O on a single file). Such findings are saved in the Beacon database and provided to users via the Beacon web-based application I/O query tool. Applications, of course, need to accumulate at least several executions for such detection to take effect.
The right part of Figure 5 shows the workflow of Beacon’s node anomaly detection, which relies on the execution of large-scale jobs (those using 1,024 or more compute nodes in our current implementation). To spot outliers, it leverages the common homogeneity in I/O behavior across compute and server nodes. Beacon’s multi-level monitoring allows the correlation of I/O activities or loads back to actual client-side issued requests. Again, by using clustering algorithms like DBSCAN and configurable thresholds, Beacon performs outlier detection across forwarding nodes and OSTs involved in a single job, where the vast majority of entities report a highly similar performance, while only a few members produce contrasting readings. Figure 15 in Section 4.3 gives an example of per-OST bandwidth data within the same execution.
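Beacon’s actual outlier step uses DBSCAN with configurable thresholds; as a simplified, hypothetical stand-in, flagging OSTs far below the median bandwidth of their peers in the same job conveys the idea:

```python
# Simplified, hypothetical straggler detection: within one large job,
# most OSTs report similar bandwidth, so an OST far below the peer
# median is flagged as a suspected faulty (alive-yet-slow) component.

def find_straggler_osts(bandwidths, rel_threshold=0.5):
    """bandwidths: {ost_id: measured bandwidth}. Returns flagged OST IDs."""
    vals = sorted(bandwidths.values())
    mid = len(vals) // 2
    median = vals[mid] if len(vals) % 2 else (vals[mid - 1] + vals[mid]) / 2
    return sorted(ost for ost, bw in bandwidths.items()
                  if bw < rel_threshold * median)
```

The same peer-comparison logic applies across the forwarding nodes serving one job, exploiting the homogeneity mentioned above.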
3.3.2 Per-job I/O Performance Analysis.
Upon a job’s completion, Beacon performs automatic analysis of its I/O monitoring data collected from all layers. It performs inter-layer correlation by first identifying jobs from the job database that ran on the given compute node(s) at the log entry collection time. The involved forwarding nodes, and thus the relevant forwarding monitoring data, are then located via the compute-to-forwarding node mapping, using a system-wide mapping table lookup. As mentioned above, this mapping is statically configured on TaihuLight. Finally, the relevant OSTs and corresponding storage-node monitoring data entries are found by a file system lookup using the Lustre command
From the above data, Beacon derives and stores coarse-grained information for quick query, including the average and peak I/O bandwidth, average IOPS, runtime, number of processes (and compute nodes) performing I/O, I/O mode, total count of metadata operations, and average metadata operations per second during I/O phases.
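Deriving these coarse-grained metrics from a per-job aggregate bandwidth time series can be sketched as follows (a simplified illustration; the interval length and field names are assumptions, not Beacon’s schema):

```python
# Hypothetical sketch of per-job summary derivation from an aggregate
# bandwidth time series (bytes/s per sampling interval). Intervals with
# zero bandwidth are treated as outside I/O phases.

def job_summary(bandwidth_ts, interval_s=1.0):
    active = [b for b in bandwidth_ts if b > 0]
    return {
        "peak_bw": max(bandwidth_ts, default=0),
        "avg_bw": sum(active) / len(active) if active else 0,
        "io_time_s": len(active) * interval_s,
        "total_bytes": sum(bandwidth_ts) * interval_s,
    }
```

Precomputing such summaries at job completion is what makes the later web queries respond quickly without touching the raw traces.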
To help users understand/debug their applications’ I/O performance, Beacon provides web-based I/O data visualization. This diagnosis system can be queried using a job ID, and after appropriate authentication, it allows visualizing the I/O statistics of the job, both real-time and post-mortem. It reports the measured I/O metrics (such as bandwidth and IOPS) and inferred characteristics (such as the number of I/O processes and I/O mode). Users are also presented with user-configurable visualization tools, showing time-series measurement in I/O metrics, statistics information such as request type/size distribution, and performance variances. Our powerful I/O monitoring database allows for further user-initiated navigation, such as per-compute-node traffic history and zooming control to examine data at different granularity. For security/privacy, users are only allowed to view I/O data from compute, forwarding, and storage nodes involved in and for the duration of their jobs’ execution.
3.3.3 I/O Subsystem Monitoring for Administrators.
Beacon also provides administrators with the capability to monitor the I/O status for any time period, on any node.
Besides all the user-visible information and facilities mentioned above, administrators can further obtain and visualize: (1) the detailed I/O bandwidth and IOPS for each compute node, forwarding node, and storage node, (2) resource utilization status of forwarding nodes, storage nodes and the MDS, including detailed request queue length statistics, and (3) I/O request latency distribution on forwarding nodes. Additionally, Beacon grants administrators direct I/O record database access to facilitate in-depth analysis.
Combining such facilities, administrators can perform powerful and thorough I/O traffic and performance analysis, for example, by checking multi-level traffic, latency, and throughput monitoring information regarding a job execution.
3.4 Generality
Beacon is not an ad-hoc I/O monitoring system for TaihuLight: it can be adapted both to data collection in other domains and to other platforms. Beacon’s building blocks, such as its operation log collection, compression, and data management components, are equally suitable for collecting data from other domains. Section 4.5.1 shows an example of collecting network data.
In addition, Beacon is also applicable to other advanced supercomputers with the I/O forwarding architecture. Beacon’s multi-layer data collection and storage, scheduler-assisted per-application data correlation and analysis, history-based anomaly identification, automatic I/O mode detection, and built-in interference analysis can all be performed on other supercomputers. Its data management components, such as Logstash, Redis, and ElasticSearch, are open-source software that can run on these machines as well. Our forwarding layer design validation and load analysis can also help recent platforms with a layer of burst buffer nodes, such as NERSC’s Cori [10]. Section 4.5.2 gives an example of extending Beacon to another supercomputer with the I/O forwarding architecture.
Finally, we find that while Beacon is designed and deployed on a cutting-edge supercomputer with multi-layer architectures, it can also be applied to traditional two-layer supercomputers. An example of extending Beacon to a traditional two-layer supercomputer is given in Section 4.5.3.
4 BEACON USE CASES
We now discuss several use cases of Beacon. Beacon has been deployed on TaihuLight for over three years, gathering massive amounts of I/O information and accumulating around 25 TB of trace data (after two passes of compression) from April 2017 to July 2020. As TaihuLight’s back-end storage changed in August 2020, we use data from before August 2020 for analysis. This history contains 1,460,662 jobs using at least 32 compute nodes and consuming 789,308,498 core-hours in total. Of these jobs, 238,585 (16.3%) featured non-trivial I/O, with per-job I/O volume over 200 MB.
The insights and issues revealed by Beacon’s monitoring and diagnosis have already helped TaihuLight administrators fix several design flaws, develop a dynamic and automatic forwarding node allocation tool, and improve system reliability and application efficiency. Owing to Beacon’s success on TaihuLight, we extend Beacon to other platforms. In this section, we focus on four types of use cases and the extended applications of Beacon for network monitoring and monitoring of different storage architectures:
(1) System performance overview
(2) Performance issue diagnosis
(3) Automatic I/O anomaly diagnosis
(4) Application and user behavior analysis
4.1 System Performance Overview
Beacon’s multi-layer monitoring, especially its I/O subsystem monitoring, gives us an overview of the whole system, which helps in managing current storage systems and designing future ones. Liu et al. [41] used Titan as an example to show that individual pieces of hardware (such as storage nodes and disks) are often under-utilized in HPC storage systems, and we make similar observations on TaihuLight. Figure 7 shows eight months of back-end utilization statistics for the Lustre parallel file system on the TaihuLight supercomputer. For each object storage target (OST), a disk array, we plot the percentage of time it reaches a certain average throughput, normalized to its peak throughput. OSTs are almost idle (using less than 1% of the I/O bandwidth) at least 60% of the time, and their utilization stays below 5% about 70% of the time. We can therefore conclude that OSTs are under-utilized most of the time. Moreover, Beacon’s multi-layer monitoring data lead us to similar conclusions for compute and forwarding nodes.
Beyond conclusions about individual layers, Beacon can also discover relationships between different layers, which traditional trace tools cannot capture. Figure 8 shows the daily access volume at three layers during the sample period. For read operations, the total daily volume requested by the compute layer is larger than that of the forwarding layer most of the time, indicating effective caching by the Lustre clients on the forwarding layer. Occasionally, the read volume requested by the forwarding layer is much larger than that of the compute layer, revealing cache thrashing, which we discuss in detail later in this section. For write operations, the total daily volume requested by the forwarding layer is always slightly larger than that of the compute layer. A major reason is write amplification, caused by writes being aligned to a 4-KB request size (or multiples of 4 KB).
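The write amplification can be illustrated with a small sketch that rounds every request up to a 4-KB multiple; this is a hypothetical model of the alignment effect described above, not Beacon’s measurement code:

```python
# Model of 4-KB alignment: the forwarding layer issues each write rounded
# up to a multiple of the 4096-byte block size, so the back-end volume
# exceeds the volume the compute layer actually requested.

def amplified_write_volume(request_sizes, block=4096):
    """Total back-end bytes written for the given request sizes (bytes)."""
    return sum((size + block - 1) // block * block for size in request_sizes)
```

For example, a 100-byte request still costs a full 4,096 bytes at the back end, so workloads with many small unaligned writes show the largest gap between the two layers.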
The OST layer, however, tells a different story. We find that both the read and write volumes at the compute and forwarding layers are much smaller than at the OST layer. Besides write amplification, there are other reasons for this gap. In addition to the compute and forwarding nodes of TaihuLight, other nodes, such as login or ACC nodes, can also access the shared Lustre back-end storage system; currently, Beacon does not monitor these nodes. Still, from the figure we can conclude that system administrators should also pay attention to the file-system access load from login and ACC nodes. According to our survey, users often perform heavy file I/O there, such as copying data from local file systems to Lustre or from one directory to another on login nodes, or running data post-processing on ACC nodes. More details are given in Section 4.4.
4.2 Performance Issue Diagnosis
4.2.1 Forwarding Node Cache Thrashing.
Beacon’s end-to-end monitoring facilitates cross-layer correlation of I/O profiling data at different temporal and spatial granularities. By comparing the total request volume at each layer, Beacon helped TaihuLight’s infrastructure management team identify a previously unknown performance issue, as detailed below.
A major driver for the adoption of I/O forwarding or burst buffer layers is the opportunity to perform prefetching, caching, and buffering, so as to reduce pressure on slower disk storage. Figure 9 shows the read volume at the compute and forwarding node layers during two sampled 70-hour periods in August 2017. Figure 9(a) shows a case with expected behavior, where the total volume requested by the compute nodes is significantly higher than that requested by the forwarding nodes, signaling good access locality and effective caching. Figure 9(b), however, tells the opposite story, to the surprise of system administrators: the forwarding layer incurs much higher read traffic from the back end than requested by user applications, reading much more data from the storage nodes than it returns to the compute nodes. No such gap appears for writes, where Beacon always shows matching aggregate bandwidth across the two layers.
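The expected-vs-thrashing distinction reduces to a ratio test between the two layers’ read volumes over a common window. A minimal sketch, with hypothetical names and an illustrative slack factor of our own choosing:

```python
def caching_effectiveness(compute_read_bytes, forwarding_read_bytes):
    """Ratio of the read volume requested by compute nodes to the volume
    the forwarding layer reads from back-end storage over the same
    window. Ratio > 1 indicates effective caching; ratio < 1 means the
    forwarding layer re-reads data it has already evicted."""
    if forwarding_read_bytes == 0:
        return float("inf")
    return compute_read_bytes / forwarding_read_bytes

def is_thrashing(compute_read_bytes, forwarding_read_bytes, slack=0.9):
    # slack < 1 tolerates small mismatches from timing skew between layers.
    return caching_effectiveness(compute_read_bytes, forwarding_read_bytes) < slack
```

In Figure 9(a) the ratio stays well above 1; in Figure 9(b) it drops far below, which is what triggered the investigation described next.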
Further analysis of the applications executed and their assigned forwarding nodes during the problem period in Figure 9(b) reveals an unknown cache thrashing problem, caused by the N:N sequential data access behavior. By default, the Lustre client has a 40-MB read-ahead cache for each file. Under the N:N sequential read scenarios, such aggressive prefetching causes severe memory contention, with data repeatedly read from the back end (and evicted on forwarding nodes). For example, a 1024-process
4.2.2 Bursty Forwarding Node Utilization.
Beacon’s continuous end-to-end I/O monitoring gives center management a global picture on system resource utilization. While such systems have often been built and configured using rough estimates based on past experience, Beacon collects detailed resource usage history to help improve the current system’s efficiency and assist future system upgrade and design.
Figure 10 gives one example, again on the forwarding load distribution, by showing two 1-day samples from July 2017. Each row portrays the by-hour peak load on one of the same 40 forwarding nodes randomly sampled from the 80 active ones. The darkness reflects the maximum bandwidth reached within that hour. The labels “high”, “mid”, “low”, and “idle” correspond to the maximum residing in the >90%, 50–90%, 10–50%, or 0–10% interval (relative to the benchmarked per-forwarding-node peak bandwidth), respectively.
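The labeling rule above is a straightforward binning of each hour’s peak bandwidth; a sketch (function name is ours, intervals are those of Figure 10):

```python
def load_label(max_bw, peak_bw):
    """Map an hour's maximum observed bandwidth to the heat-map label,
    using the intervals from Figure 10: >90%, 50-90%, 10-50%, 0-10%."""
    u = max_bw / peak_bw
    if u > 0.9:
        return "high"
    if u > 0.5:
        return "mid"
    if u > 0.1:
        return "low"
    return "idle"
```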
Figure 10(a) shows the more typical load distribution, where the majority of forwarding nodes stay lightly used for the vast majority of the time (90.7% of cells show a maximum load of under 50% of peak bandwidth). Figure 10(b) gives a different picture, with a significant set of sampled forwarding nodes serving I/O-intensive large jobs for a good part of the day. Moreover, 35.7% of the cells actually see a maximum load of over 99% of the peak forwarding node bandwidth.
These results indicate that (1) overall, there is forwarding resource overprovisioning (confirming prior findings [27, 41, 47, 62]); (2) even with the more representative low-load scenarios, it is not rare for the forwarding node bandwidth to be saturated by application I/O; and (3) a load imbalance across forwarding nodes exists regardless of load level, making idle resources potentially helpful to I/O-intensive applications.
4.2.3 MDS Request Priority Setting.
Overall, we find that most TaihuLight jobs are rather metadata-light, but Beacon does observe a small fraction of parallel jobs (0.69%) with a high metadata request rate (more than 300 metadata operations/s on average during I/O phases). Beacon finds that these metadata-heavy (“high-MDOPS”) applications tend to cause significant I/O performance interference. Among jobs with Beacon-detected I/O performance anomalies, those sharing forwarding nodes with high-MDOPS jobs experience an average 13.6× increase in read/write request latency during the affected time periods.
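Flagging such jobs amounts to comparing a job’s average metadata operation rate during its I/O phases against the 300 ops/s threshold. A sketch under that assumption (names and data layout are illustrative, not Beacon’s actual interface):

```python
def is_high_mdops(metadata_ops, io_phase_seconds, threshold=300.0):
    """Flag a job as metadata-heavy when it averages more than `threshold`
    metadata operations per second across its I/O phases."""
    if io_phase_seconds == 0:
        return False
    return metadata_ops / io_phase_seconds > threshold

def high_mdops_fraction(jobs, threshold=300.0):
    """Fraction of jobs flagged high-MDOPS; `jobs` is a list of
    (total_metadata_ops, total_io_phase_seconds) tuples."""
    if not jobs:
        return 0.0
    flagged = sum(1 for ops, secs in jobs if is_high_mdops(ops, secs, threshold))
    return flagged / len(jobs)
```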
Such severe delays, together with Beacon’s forwarding node queue status history, prompted us to examine the TaihuLight LWFS server policy. We find that metadata requests are given priority over file I/O, owing to the single-MDS design and the need to provide fast responses to interactive user operations such as
4.3 Automatic I/O Anomaly Diagnosis
In extreme-scale supercomputers, users typically accept jittery application performance, recognizing the widespread resource sharing among jobs. System administrators, meanwhile, observe different behaviors among homogeneously configured system components, but cannot tell how much of the difference stems from the components’ own functioning and how much from the diversity of the tasks they serve.
Beacon’s multi-layer monitoring capability therefore presents a new window for supercomputer administrators to examine system health, connecting statistics on application-issued I/O requests all the way to individual OSTs’ bandwidth measurements.
4.3.1 Overview of Anomaly Detection Results of Applications.
Figure 12 shows the results of anomaly detection with historical data collected from April 2017 to July 2020. Our results show that about 4.8% of all jobs that featured non-trivial I/O have experienced abnormal performance.
Figure 12(a) shows abnormal jobs’ categories distribution. Low-bandwidth jobs make up the majority of all jobs, and
Figure 12(b) categorizes the factors that neighbor jobs impose on abnormal jobs into three groups: (1) system anomaly, (2) I/O interference, and (3) unknown factors. I/O interference factors include the N:1 I/O mode, high MDOPS, high I/O bandwidth, high IOPS, mixed causes, and multiple jobs. The figure shows that interfering neighbor jobs account for more than 90% of the cases, implying that inter-application interference is the predominant cause of performance degradation. Among these, interference caused by jobs with the N:1 I/O mode occupies the largest share, meaning that N:1 jobs are not only susceptible to disturbance but also interfere with other applications; Section 4.4 provides more details. Mixed causes and high-MDOPS jobs rank second and third, respectively. The LWFS server thread pool on each forwarding node is restricted to 16 threads, and jobs suffer performance degradation when I/O operations on the same forwarding node exceed the thread pool’s service capability.
4.3.2 Applications Affected by Interference.
Figure 13 illustrates an example of 1024-process
4.3.3 Application-driven Anomaly Detection.
Most I/O-intensive applications have distinct I/O phases (i.e., episodes in their execution where they perform I/O continuously), such as those to read input files during initialization or to write intermediate results or checkpoints. For a given application, such I/O phase behavior is often consistent. Taking advantage of such repeated I/O operations and its multi-layer I/O information collection, Beacon performs automatic I/O phase recognition, on top of which it conducts I/O-related anomaly detection. More specifically, larger applications (e.g., those using 1024 compute nodes or more) spread their I/O load to multiple forwarding nodes and back-end nodes, giving us opportunities to directly compare the behavior of servers processing requests known to Beacon as homogeneous or highly similar.
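The peer-comparison step can be sketched as follows: for servers known to carry near-identical load for the same application, flag those whose throughput falls well below the peer median. The function name and the 50% threshold are our own illustrative choices:

```python
from statistics import median

def deviant_servers(bandwidth_by_server, rel_threshold=0.5):
    """Given per-server bandwidth measured over the same application I/O
    phase (servers expected to process homogeneous request streams),
    return the servers whose throughput falls below `rel_threshold`
    of the peer median."""
    med = median(bandwidth_by_server.values())
    return sorted(name for name, bw in bandwidth_by_server.items()
                  if bw < rel_threshold * med)

# One forwarding node lags far behind its peers serving the same job.
suspects = deviant_servers({"fwd0": 1000.0, "fwd1": 980.0, "fwd2": 300.0})
```

Using the median rather than the mean keeps a single slow server from dragging down the baseline it is compared against.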
Figure 14 gives an example of a 6000-process
4.3.4 Anomaly Alert and Node Screening.
Such continuous, online application performance anomaly detection can identify forwarding nodes or back-end units with deviant performance metrics, which in turn triggers Beacon’s more detailed monitoring and analysis. If it finds such a system component to consistently under-perform relative to peers serving similar workloads, with configurable thresholds in monitoring window and degree of behavior deviation, it reports this as an automatically detected system anomaly. By generating and sending an alarm email to the system administration team, Beacon prompts system administrators to do a thorough examination, where its detailed performance history information and visualization tools are also helpful.
Such anomaly screening is particularly important for expensive, large-scale executions. For example, among all applications running on TaihuLight so far, the parallel graph engine
However, without Beacon’s back-end monitoring, applications like
Beacon has been deployed on TaihuLight since April 2017, with features and tools incrementally developed and added to production use. Table 2 summarizes the automatically identified I/O system anomalies at the two service layers from April 2017 to July 2020. An anomaly is identified when the measured maximum bandwidth stays under 30% of the known peak value for at least 60 minutes; both parameters can be configured to adjust the detector’s sensitivity. Most performance anomalies are found to be transient, lasting under 4 hours.
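The 30%-of-peak, 60-minute rule amounts to finding sufficiently long low-bandwidth runs in each node’s time series. A minimal sketch under those parameters (function and sampling interval are our own, illustrative choices):

```python
def anomaly_intervals(series, peak, frac=0.3, min_minutes=60, step_minutes=1):
    """Scan a per-node bandwidth time series (one sample per
    `step_minutes`) and return (start_index, length_in_minutes) for
    each run where bandwidth stays below `frac` of the known peak
    for at least `min_minutes`."""
    intervals, start = [], None
    for i, bw in enumerate(series):
        if bw < frac * peak:
            if start is None:
                start = i
        elif start is not None:
            length = (i - start) * step_minutes
            if length >= min_minutes:
                intervals.append((start, length))
            start = None
    if start is not None:  # run extends to the end of the series
        length = (len(series) - start) * step_minutes
        if length >= min_minutes:
            intervals.append((start, length))
    return intervals
```

Raising `frac` or lowering `min_minutes` makes the detector more sensitive, matching the tunability described above.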
There are a total of 70 performance anomalies lasting over 4 hours on the forwarding layer and 98 on the back-end layer, confirming the existence of fail-slow situations common in data centers [28]. Causes of such relatively long yet “self-healed” anomalies include service migration and RAID reconstruction. With our rather conservative settings during the initial deployment period, Beacon sends the aforementioned alert email when a detected anomaly lasts beyond 96 hours (except for large-scale production runs as in the
4.4 Application and User Behavior Analysis
With its powerful information collection and multi-layer I/O activity correlation, Beacon provides a new capability to perform detailed application or user behavior analysis. Results of such analysis assist in performance optimization, resource provisioning, and future system design. Here, we showcase several application/user behavior studies, some of which have led to corresponding optimizations or design changes to the TaihuLight system.
4.4.1 Application I/O Mode Analysis.
First, Table 3 gives an overview of the I/O volume across all profiled jobs with non-trivial I/O, categorized by per-job core-hour consumption. Here, 1,000 K core-hours corresponds to a 10-hour run using 100,000 cores on 25,000 compute nodes; jobs at this scale or above write more than 40 TB of data on average. Further examination reveals that in each core-hour category, average read/write volumes are dominated by a minority of heavy consumers. Overall, the amount of data read/written grows with compute resource consumption: less resource-intensive applications tend to perform more reads, while larger consumers are more write-intensive.
Figure 16 shows the breakdown of I/O-mode adoption among all TaihuLight jobs performing non-trivial I/O, by total read/write volume. The first impression one takes from these results is that the rather “extreme” cases, such as N:N and 1:1, form the dominant choices, especially for writes. We suspected that this distribution might be skewed by a large number of small jobs doing limited I/O, and therefore calculated the average per-job read/write volume for each I/O mode. The results (Table 4) show that this is not the case: applications that use the 1:1 mode for writes actually have a much higher average write volume.
The 1:1 mode is the closest to sequential processing behavior and is conceptually simple. However, it obviously lacks scalability and fails to utilize the abundant hardware parallelism in the TaihuLight I/O system. The wide presence of this I/O mode may help explain the overall under-utilization of forwarding resources discussed earlier in Section 4.2. Echoing similar (though less extreme) findings on other supercomputers [47] (including Intrepid [30], Mira [58], and Edison [51]), effective user education on I/O performance and scalability can both improve storage system utilization and reduce wasted compute resources.
The N:1 mode tells a different story. It is an intuitive parallel I/O solution that allows compute processes to directly read into or write from their local memory without gather-scatter operations, while retaining the convenience of a single input/output file. However, our detailed monitoring finds it to be a damaging I/O mode that users should steer away from, as explained below.
First, our monitoring results confirm the findings of existing research [2, 46]: the N:1 mode offers low application I/O performance (reading from and writing to a shared file). Even with a large N, such applications receive no more than 250 MB/s of aggregate I/O throughput, despite the TaihuLight back end’s combined peak bandwidth of 260 GB/s. For reads, users here also rarely modify the default Lustre stripe width, confirming behavior reported in a recent ORNL study [38]. The problem is much worse for writes, where performance severely degrades owing to file system locking.
This study, however, finds that applications with the N:1 mode are extraordinarily disruptive, as they harm all kinds of neighbor applications that share forwarding nodes with them, particularly when N is large (e.g., over 32 compute nodes).
The reason is that each forwarding node runs an LWFS server thread pool (currently sized at 16) that serves its assigned compute nodes. Applications using the N:1 mode tend to flood this thread pool with bursts of requests. Unlike the N:N or N:M modes, N:1 suffers from the aforementioned poor back-end performance of a single shared file. This makes N:1 requests slow to process, further exacerbating their congestion in the queue and delaying requests from other applications, even when those victims access disjoint back-end servers and OSTs.
Here, we give a concrete example of I/O mode-induced performance interference, featuring an earthquake simulation
Table 5 lists the two applications’ average request wait times, processing times, and forwarding node queue lengths during these runs. Note that with the “co-run”, the queue is shared by both applications. We find that the average wait time of
In our case, the Beacon developers worked with the
This change produced an over 400% improvement in I/O performance. Note that the GB Prize submission does not report I/O time; we find that
4.4.2 Metadata Server Usage.
Unlike forwarding nodes’ utilization (discussed earlier), the Lustre MDS is found with rather evenly distributed load levels by Beacon’s continuous load monitoring (Figure 18(a)). In particular, 26.8% of the time, the MDS experiences a load level (in requests per second) above 75% of its peak processing throughput.
Beacon allows us to further split the requests between systems sharing the MDS, including the TaihuLight forwarding nodes, login nodes, and the ACC. To the surprise of TaihuLight administrators, over 80% of the metadata access workload actually comes from the ACC (Figure 18(b)).
Note that the login nodes and the ACC have their own local file systems (ext4 and GPFS [66], respectively), which users are encouraged to use for purposes such as application compilation and data post-processing/visualization. However, as these users are likely TaihuLight users too, we find that most of them prefer to directly use the main Lustre scratch file system intended for TaihuLight jobs, for convenience. While the I/O bandwidth/IOPS consumed by such tasks is negligible, interactive user activities (such as compilation or post-processing) turn out to be metadata-heavy.
Large waves of such unintended user activity correspond to the heaviest-load periods at the tail of Figure 18(a), and have led to MDS crashes that directly affect applications running on TaihuLight. According to our survey, many other machines, including two of the top 10 supercomputers (Sequoia [83] and Sierra [33]), also use a single MDS, presumably assuming that their users follow similar usage guidelines.
4.4.3 Jobs’ Request Size Analysis.
Figure 19 plots applications by their bandwidth and IOPS; the points form five lines, representing jobs dominated by five request sizes: 1 KB, 16 KB, 64 KB, 128 KB, and 512 KB. Among them, 128 KB for reads and 512 KB for writes are the most common request sizes, matching Icefish’s system configuration. On Sunway compute nodes, applications’ small I/O requests are merged and larger requests are split into multiple requests before being transferred to the forwarding nodes via the LWFS client. We conclude that the average request size of most applications reaches the configured upper limit, implying that this limit could be raised appropriately to let applications obtain better read and write performance. Further statistical analysis reveals that 6.89% of jobs still have an average I/O request size under 1 KB; such small requests reflect inefficient I/O behavior that cannot make good use of the high-performance parallel file system.
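The lines in Figure 19 arise because bandwidth divided by IOPS yields the average request size, so jobs sharing a dominant size fall on one line. A sketch of that inference (names and the snapping helper are our own, illustrative):

```python
def avg_request_size_kb(bandwidth_kb_per_s, iops):
    """Average request size (KB) implied by a job's bandwidth/IOPS ratio;
    jobs with the same dominant request size fall on one line in a
    bandwidth-vs-IOPS plot."""
    return bandwidth_kb_per_s / iops

def nearest_request_class(bandwidth_kb_per_s, iops,
                          classes=(1, 16, 64, 128, 512)):
    """Snap the implied size to the closest of the five common classes."""
    size = avg_request_size_kb(bandwidth_kb_per_s, iops)
    return min(classes, key=lambda c: abs(c - size))
```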
4.5 Extended Applications of Beacon
4.5.1 Extension to Network Monitoring.
Encouraged by Beacon’s success in I/O monitoring, in summer 2018, we began to design and test its extension to monitor and analyze network problems, motivated by the network performance debugging needs of ultra-large-scale applications. Figure 20 shows the architecture of this new module. Beacon samples performance counters, such as per-port sent and received volumes, on the 5984 Mellanox InfiniBand network switches. Again, the collected data are passed to low-overhead daemons on Beacon log servers, more specifically, 75 of its 85 part-time servers, each assigned about 80 switches. Similar processing and compression are conducted, with result data persisting in Beacon’s distributed database and then being periodically relocated to its dedicated server for user queries and permanent storage.
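Two pieces of this pipeline are easy to make concrete: turning successive cumulative port-counter readings into throughput, and spreading the switches across the log servers. The sketch below is illustrative (our own names; we assume 64-bit cumulative counters, not the actual OFED interface):

```python
def port_throughput(prev, curr, interval_s):
    """Convert two successive readings of a port's cumulative byte
    counter into average throughput (bytes/s), tolerating one wrap
    of a 64-bit register."""
    WRAP = 1 << 64
    delta = (curr - prev) % WRAP
    return delta / interval_s

def assign_switches(switches, servers):
    """Round-robin the switch list across log servers (e.g., 5984
    switches over 75 servers yields about 80 each)."""
    groups = {s: [] for s in servers}
    for i, sw in enumerate(switches):
        groups[servers[i % len(servers)]].append(sw)
    return groups
```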
This Beacon network monitoring prototype was tested in time to help in the aforementioned
4.5.2 Extension to the Cutting-edge Supercomputer with I/O Forwarding Architecture.
The Sunway next-generation supercomputer inherits and extends the TaihuLight architecture and is built on a homegrown high-performance heterogeneous many-core processor, the SW26010P. It consists of more than 100,000 compute nodes, each equipped with a 390-core SW26010P CPU. Compared to TaihuLight’s 10 million cores, the new machine has more than four times as many. Figure 22 shows an architecture overview. As in TaihuLight, the compute nodes connect to the storage nodes through forwarding nodes, and the storage nodes run Lustre servers that provide users with a global file system. Unlike TaihuLight, the next-generation machine also provides a burst buffer file system on the forwarding nodes [89]; each forwarding node backs it with two high-performance NVMe SSDs.
To extend Beacon to the Sunway next-generation supercomputer, we upgraded Beacon’s collection module in January 2021 to support data collection on the burst buffer file system; Beacon’s other components run on this machine unchanged. Figure 23 shows an example use case on the next-generation supercomputer. We find that the load on the NVMe SSDs is low most of the time, largely because users use the global file system far more often than the burst buffer file system; further statistical analysis confirms this.2 Although the burst buffer file system can provide jobs with high I/O performance, users must modify their applications with a specific I/O API to use it, which is inconvenient and contributes to its low usage. We also find the load across NVMe SSDs to be imbalanced, an important reason being their control strategy: NVMe SSDs are assigned through static configuration files, and each user can only access the SSDs listed in a configuration file issued by an administrator. Lacking real-time load information, the administrator finds it difficult to balance each SSD’s load.
4.5.3 Extension to the Traditional Two-layer Supercomputer.
In addition to Beacon’s adoption on multi-layer cutting-edge supercomputers, some of Beacon’s components and methods can also be applied to traditional two-layer supercomputers. We have deployed Beacon on the Sugon Pai supercomputer [72], a traditional two-layer machine, since March 2020. Sugon Pai is a homogeneous computing cluster with 424 compute nodes and eight storage nodes, and uses the ParaStor file system to provide highly concurrent I/O. The architecture of Beacon’s monitoring and storage module is shown in Figure 24. Beacon performs I/O monitoring on the compute and storage nodes, which run the ParaStor [73] client and server, respectively. Beacon divides the 424 compute nodes into four groups and enlists four “part-time” servers, each communicating with one group. Data collected from the eight storage nodes are transferred to another “part-time” server, and a MySQL database stores jobs’ run information on Sugon Pai. To reduce data transmission and storage overhead, Beacon also conducts online compression similar to that used on TaihuLight.
Table 6 shows the I/O modes adopted by jobs performing non-trivial I/O on the Sugon Pai supercomputer from March 2020 to April 2020. We reach some similar conclusions, for example, that the N:N and 1:1 I/O modes dominate writes. There are also new findings on Sugon Pai: the N:1 I/O mode accounts for most reads. Further analysis shows that the N:1 mode performs relatively well on Sugon Pai. Figure 25 shows an example of a molecular simulation application using the N:1 mode on Sugon Pai, where high read performance is obtained. A plausible reason is that Sugon Pai’s primary storage system, ParaStor, supports the N:1 mode better than LWFS and Lustre on TaihuLight do. This finding also shows that different platforms support I/O behaviors differently, implying that an application’s I/O behavior should be matched to the underlying platform to achieve good performance.
5 BEACON FRAMEWORK EVALUATION
We now evaluate Beacon’s per-application profiling accuracy and its performance overhead.
5.1 Accuracy Verification
Beacon collects full traces from the compute node side, thus giving it access to complete application-level I/O operation information. However, because the LWFS client trace interface provides only coarse timestamp data (at per-second granularity), and owing to the clock drift across compute nodes, it is possible that the I/O patterns recovered from Beacon logs deviate from the application-level captured records.
To evaluate the degree of such errors, we compare the I/O throughput statistics reported by the
The accuracy evaluation results are shown in Figure 26. We plot the average error in Beacon, measured as the percentage of deviation of the recorded aggregate compute node-side I/O throughput from the application-level throughput reported by the MPI-IO library.
We find that Beacon is able to accurately capture application performance, even for applications with non-trivial parallel I/O activities. More precisely, Beacon’s recorded throughput deviates from the
Beacon’s accuracy can be attributed to the fact that it records all compute node-side trace logs, facilitated by its efficient and lossless compression method (described in Section 3.2). We find that even though individual trace items may be off in timestamps, data-intensive applications on supercomputers seldom perform isolated, fast I/O operations (which are not of interest for profiling purposes). Instead, they exhibit I/O phases with a sustained high I/O intensity. By collecting multi-layer I/O trace entries for each application, Beacon is able to paint an accurate picture of an application’s I/O behavior and performance.
5.2 Monitoring and Query Overhead
We now evaluate Beacon’s monitoring overhead in a production environment. We compare the performance of important I/O-intensive real-world applications and the
These results show that the Beacon tool introduces very low overhead, under 1% across all test cases. Moreover, the overhead does not grow with the application execution scale; it actually appears smaller (below 0.25%) for the two largest jobs, which use 130 K processes or more. Such a cost is particularly negligible considering the significant I/O performance enhancements and run-time savings produced by optimizations and problem diagnoses based on Beacon-supplied information.
Table 8 lists the CPU and memory usage of Beacon’s data collection daemon. In addition, the storage overhead from Beacon’s deployment on TaihuLight since April 2017 is around 10 TB. Such low operational overhead and scalable operation attest to Beacon’s lightweight design, with background trace-collection and compression generating negligible additional resource consumption. Also, having a separate monitoring network and storage avoids potential disturbance to the application execution.
Finally, we assess Beacon’s query processing performance. We measure the query processing time of 2,000 Beacon queries in September 2018, including both application users accessing job performance analysis and system administrators checking forwarding/storage nodes performance. In particular, we examine the impact of Beacon’s in-memory cache system between the web interface and Elasticsearch, as shown in Figure 2. Figure 27 gives the CDF of queries in processing time and demonstrates that (1) the majority of Beacon user queries can be processed within 1 second, and 95.6% of them can be processed under 10 seconds (visualization queries take longer), and (2) Beacon’s in-memory caching significantly improves the user experience. Additional checking reveals that about 95% of these queries can be served from cached data.
6 RELATED WORK
Several I/O tracing and profiling tools have been proposed for HPC systems, which can be divided into two categories: application-oriented tools and back-end-oriented tools.
Application-oriented tools can provide detailed information about a particular execution on a function-by-function basis. Work in this area includes Darshan [9], IPM [79], and RIOT [86], all of which aim to build an accurate picture of application I/O behavior by capturing key characteristics of the mainstream I/O stack on compute nodes. Carns et al. evaluated the performance and runtime overheads of Darshan [8], and Patel et al. used Darshan to characterize and analyze accesses to I/O-intensive files [60]. Wu et al. proposed a scalable methodology for MPI and I/O event tracing [48, 87, 88]. Recorder [46] focuses on collecting additional HDF5 trace data. Tools like Darshan provide user-transparent monitoring via automatic environment configuration; still, instrumentation-based tools impose restrictions on programming languages or libraries/linkers. In contrast, Beacon is designed as a non-stop, full-system I/O monitoring system capturing I/O activities at the system level.
Back-end-oriented tools collect system-level I/O performance data across applications and provide summary statistics (e.g., LIOProf [91], LustreDU [7, 38, 56], and LMT [24]). Neeraj et al. [64] aimed to provide applications and middleware with real-time system resource status, while Patel et al. [59] focused on system-level characteristics using LMT. Paul et al. [61] likewise analyzed Lustre server statistics in an application-agnostic manner.
However, identifying application performance issues and finding the cause of application performance degradation are difficult with these tools. While back-end analytical methods [40, 41] have made progress in identifying high-throughput applications using back-end logs only, they lack application-side information. Beacon, in contrast, holds complete cross-layer monitoring data to enable such tasks.
Along this line, there are tools that collect multi-layer data. Static instrumentation has been used to trace parallel I/O calls from MPI down to PVFS servers [35]. SIOX [85] and IOPin [34] characterize HPC I/O workloads across the I/O stack. These projects extend the application-level I/O instrumentation approach of Darshan [9] to other system layers; however, their overhead hinders deployment in large-scale production environments [70].
Regarding end-to-end frameworks, the TOKIO [3] architecture combines front-end tools (Darshan, Recorder) and back-end tools (LMT). The UMAMI monitoring interface [43] provides cross-layer I/O performance analysis and visualization. In addition, OVIS [5] uses the Cray specific tool LDMS [1] to provide scalable failure and anomaly detection. GUIDE [80] performs center-wide and multi-source log collection and motivated further analysis and optimizations. Beacon differs through its aggressive real-time performance and utilization monitoring, automatic anomaly detection, and continuous per-application I/O pattern profiling.
I/O interference is identified as an important cause of performance variability in HPC systems [41, 57]. Fang et al. [96] uncovered interference in an in situ analytics system. Kuo et al. [37] focused on interference from different file access patterns using synchronized time-slice profiles. Yildiz et al. [92] studied the root causes of cross-application I/O interference across software and hardware configurations. To the best of our knowledge, Beacon is the first monitoring framework with built-in features for inter-application interference analysis. Our study confirms findings on large-scale HPC applications’ adoption of poor I/O design choices [47], further suggesting that aside from workload-dependent, I/O-aware scheduling [14, 41], interference should be countered with application I/O mode optimization and adaptive I/O resource allocation.
Finally, on network monitoring, there are dedicated tools [42, 50, 68] for monitoring switch performance, anomaly detection, and resource utilization optimization. There are also tools specializing in network monitoring/debugging for data centers [75, 76, 94]. However, these tools/systems typically do not target the InfiniBand interconnections commonly used on supercomputers. To this end, Beacon adopts the open-source OFED stack [11, 55] to retrieve relevant information from the IB network. More importantly, it leverages its scalable and efficient monitoring infrastructure, originally designed for I/O, for network problems.
7 CONCLUSION
We have presented Beacon, an end-to-end I/O resource monitoring and diagnosis system for the leading supercomputer TaihuLight. It facilitates comprehensive I/O behavior analysis along the long I/O path and has identified hidden performance and user I/O behavior issues as well as system anomalies. Enhancements enabled by Beacon in the past 38 months have significantly improved ultra-large-scale applications’ I/O performance and the overall TaihuLight I/O resource utilization. More generally, our results and experience indicate that this type of detailed multi-layer I/O monitoring/profiling is affordable on state-of-the-art supercomputers, offering valuable insights at low cost. In addition, we have explored the public release of Beacon-collected supercomputer I/O profiling data to the HPC and storage communities.
Our future work will focus on cross-layer application I/O portraits, as well as automated I/O scheduling, resource allocation, and optimization via real-time interaction with Beacon.
Footnotes
1 Github link: https://github.com/Beaconsys/Beacon.
2 More than 90% of jobs run on the global file system.
REFERENCES
- [1] 2014. The lightweight distributed metric service: A scalable infrastructure for continuous monitoring of large scale computing systems and applications. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 154–165.
- [2] 2009. PLFS: A checkpoint filesystem for parallel applications. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Portland, 1–12.
- [3] 2017. TOKIO: Total knowledge of I/O. http://www.nersc.gov/research-and-development/tokio.
- [4] 2002. Lustre: A scalable, high performance file system. Cluster File Systems, Inc 8, 11 (2002), 3429–3441.
- [5] 2009. Resource monitoring and management with OVIS to enable HPC in cloud computing environments. In International Symposium on Parallel and Distributed Processing. IEEE, Rome, 1–8.
- [6] 2010. Blue Gene/Q resource management architecture. In Workshop on Many-Task Computing on Grids and Supercomputers. IEEE, New Orleans, 1–5.
- [7] 2012. Practical support solutions for a workflow-oriented Cray environment. In Cray User Group Conference. Cray, Stuttgart, 1–7.
- [8] 2012. Performance Analysis of Darshan 2.2.3 on the Cray XE6 Platform. Technical Report. Argonne National Lab. (ANL), Argonne, IL (United States).
- [9] 2009. 24/7 characterization of Petascale I/O workloads. In International Conference on Cluster Computing and Workshops. IEEE, New Orleans, 1–10.
- [10] 2017. Cray burst buffer in Cori. https://docs.nersc.gov/filesystems/cori-burst-buffer/.
- [11] 2011. INAM: A scalable InfiniBand network analysis and monitoring tool. In European Conference on Parallel Processing. Springer, Bordeaux, 166–177.
- [12] 2017. LogAider: A tool for mining potential correlations of HPC log events. In International Symposium on Cluster, Cloud and Grid Computing. IEEE, Madrid, 442–451.
- [13] 2014. Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiments analysis. In Journal of Physics: Conference Series. IOP Publishing, Yokohama, 042014.
- [14] 2014. CALCioM: Mitigating I/O interference in HPC systems through cross-application coordination. In International Parallel and Distributed Processing Symposium. IEEE, Phoenix, 155–164.
- [15] 2018. Redesigning LAMMPS for Petascale and hundred-billion-atom simulation on Sunway TaihuLight. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Dallas, 148–159.
- [16] 2013. MySQL. Addison-Wesley Professional, Boston.
- [17] 1993. File systems in user space. In USENIX Winter. 229–240.
- [18] 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD. AAAI, Portland, 226–231.
- [19] 2008. Python Web Development with Django. Addison-Wesley Professional.
- [20] 2017. 9-pflops nonlinear earthquake simulation on Sunway TaihuLight: Enabling depiction of 18-Hz and 8-meter scenarios. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
- [21] 2017. Redesigning CAM-SE for Peta-scale climate modeling performance and ultra-high resolution on Sunway TaihuLight. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
- [22] 2016. The Sunway TaihuLight supercomputer: System and applications. Science China Information Sciences 59 (2016), 1–16.
- [23] 2015. Scheduling the I/O of HPC applications under congestion. In International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, 1013–1022.
- [24] 2010. Lustre monitoring tool. https://github.com/LLNL/lmt.
- [25] 2010. DNDC: A process-based model of greenhouse gas fluxes from agricultural soils. Agriculture, Ecosystems & Environment 136 (2010), 292–300.
- [26] 2008. LANL MPI-IO test. http://freshmeat.sourceforge.net/projects/mpiiotest.
- [27] 2015. Comparative I/O workload characterization of two leadership class storage clusters. In Proceedings of the Parallel Data Storage Workshop. IEEE, Austin, 31–36.
- [28] 2018. Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS) 14 (2018), 1–26.
- [29] 2020. Top 500 list. https://www.top500.org/resources/top-systems/.
- [30] 2008. Intrepid. https://www.alcf.anl.gov/intrepid.
- [31] 2019. Automatic, application-aware I/O forwarding resource allocation. In Conference on File and Storage Technologies. USENIX, Boston, 265–279.
- [32] 2015. Quiet neighborhoods: Key to protect job performance predictability. In International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, 449–459.
- [33] 2019. 2.1 Summit and Sierra: Designing AI/HPC supercomputers. In International Solid-State Circuits Conference. IEEE, San Francisco, 42–43.
- [34] 2012. IOPin: Runtime profiling of parallel I/O in HPC systems. In Companion: High Performance Computing, Networking Storage and Analysis. IEEE, Salt Lake City, 18–23.
- [35] 2010. Automated tracing of I/O stack. In European MPI Users’ Group Meeting. Springer, Stuttgart, 72–81.
- [36] 2013. Elasticsearch Server. Packt Publishing Ltd, Birmingham.
- [37] 2014. How file access patterns influence interference among cluster applications. In International Conference on Cluster Computing. IEEE, Madrid, 185–193.
- [38] 2017. Scientific user behavior and data-sharing trends in a Petascale file system. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
- [39] 2018. ShenTu: Processing multi-trillion edge graphs on millions of cores in seconds. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Dallas, 706–716.
- [40] 2014. Automatic identification of application I/O signatures from noisy server-side traces. In Conference on File and Storage Technologies. USENIX, Oakland, 213–228.
- [41] 2016. Server-side log data analytics for I/O workload characterization and coordination on large shared storage systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 819–829.
- [42] 2016. One sketch to rule them all: Rethinking network flow monitoring with UnivMon. In Proceedings of the 2016 ACM SIGCOMM Conference. ACM, Los Angeles, 101–114.
- [43] 2017. UMAMI: A recipe for generating meaningful metrics through holistic I/O performance analysis. In Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems. ACM, Denver, 55–60.
- [44] 2016. DAOS and friends: A proposal for an Exascale storage system. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 585–596.
- [45] 2010. Managing variability in the IO performance of Petascale storage systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 1–12.
- [46] 2013. A multi-level approach for understanding I/O activity in HPC applications. In International Conference on Cluster Computing. IEEE, Indianapolis, 1–5.
- [47] 2015. A multiplatform study of I/O behavior on Petascale supercomputers. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Portland, 33–44.
- [48] 2010. ScalaTrace: Tracing, analysis and modeling of HPC codes at scale. In International Workshop on Applied Parallel Computing. Springer, Reykjavík, 410–418.
- [49] 2021. EZIOTracer: Unifying kernel and user space I/O tracing for data-intensive applications. In Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems. ACM, Edinburgh, 1–11.
- [50] 2017. Demonstration of the Marple system for network performance monitoring. In Proceedings of the SIGCOMM Posters and Demos. ACM, Los Angeles, 57–59.
- [51] 2014. Edison. https://www.top500.org/system/178443/.
- [52] 2016. Using balanced data placement to address I/O contention in production environments. In International Symposium on Computer Architecture and High Performance Computing. IEEE, Los Angeles, 9–17.
- [53] 2014. Piz Daint Supercomputer Shows the Way Ahead on Efficiency.
- [54] 2009. ScalaTrace: Scalable compression and replay of communication traces for high-performance computing. J. Parallel and Distrib. Comput. 69 (2009), 696–710.
- [55] 2010. OpenFabrics enterprise distribution (OFED). http://www.openfabrics.org/.
- [56] 2014. Best practices and lessons learned from deploying and operating large-scale data-centric parallel file systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 217–228.
- [57] 2015. Achieving performance isolation with lightweight co-kernels. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Portland, 149–160.
- [58] 2013. Mira: Argonne’s 10-Petaflops Supercomputer. Technical Report. Argonne National Laboratory (ANL), Argonne, IL (United States).
- [59] 2019. Revisiting I/O behavior in large-scale storage systems: The expected and the unexpected. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–13.
- [60] 2020. Uncovering access, reuse, and sharing characteristics of I/O-intensive files on large-scale production HPC systems. In Conference on File and Storage Technologies. USENIX, Santa Clara, 91–101.
- [61] 2020. Understanding HPC application I/O behavior using system level statistics. In International Conference on High Performance Computing, Data, and Analytics. IEEE, Pune, 202–211.
- [62] 2017. I/O load balancing for big data HPC applications. In International Conference on Big Data. IEEE, Boston, 233–242.
- [63] 2020. Taming I/O variation on QoS-less HPC storage: What can applications do? In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Pune, 1–13.
- [64] 2021. Apollo: An ML-assisted real-time storage resource observer. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Stockholm, 147–159.
- [65] 2016. Redis. http://redis.io/topics/faq.
- [66] 2002. GPFS: A shared-disk file system for large computing clusters. In Conference on File and Storage Technologies. USENIX, Monterey, 1–15.
- [67] 2001. Impact of a failure detection mechanism on the performance of consensus. In Pacific Rim International Symposium on Dependable Computing. IEEE, Seoul, 137–145.
- [68] 2012. DECOR: A distributed coordinated resource monitoring system. In International Workshop on Quality of Service. IEEE, Coimbra, 1–9.
- [69] 2005. A Description of the Advanced Research WRF Version 2. Technical Report. National Center for Atmospheric Research, Boulder, CO, Mesoscale and Microscale.
- [70] 2016. Modular HPC I/O characterization with Darshan. In Workshop on Extreme-scale Programming Tools. IEEE, Salt Lake City, 9–17.
- [71] 2011. Server-side I/O coordination for parallel file systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Seattle, 1–11.
- [72] 2018. Pai system, Sugon. https://www.top500.org/system/179425/.
- [73] 2015. ParaStor200 Distributed Parallel Storage System. http://hpc.sugon.com/en/HPC-Components/parastor.html.
- [74] 2004. Cluster-based failure detection service for large-scale ad hoc wireless network applications. In International Conference on Dependable Systems and Networks. IEEE, Florence, 805–814.
- [75] 2016. Simplifying datacenter network debugging with pathdump. In Symposium on Operating Systems Design and Implementation. USENIX, Savannah, 233–248.
- [76] 2018. Distributed network monitoring and debugging with SwitchPointer. In Symposium on Networked Systems Design and Implementation. USENIX, Renton, 453–456.
- [77] 2012. Extracting flexible, replayable models from large block traces. In Conference on File and Storage Technologies. USENIX, San Jose, 22.
- [78] 2013. The Logstash Book. James Turnbull.
- [79] 2010. Parallel I/O performance: From events to ensembles. In International Symposium on Parallel and Distributed Processing. IEEE, Atlanta, 1–11.
- [80] 2017. GUIDE: A scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
- [81] 2009. Scalable I/O tracing and analysis. In Annual Workshop on Petascale Data Storage. IEEE, Portland, 26–31.
- [82] 2010. Accelerating I/O forwarding in IBM Blue Gene/P systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 1–10.
- [83] 2012. BlueGene/Q Sequoia and Mira. Technical Report. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States).
- [84] 2017. The accurate particle tracer code. Computer Physics Communications 220 (2017), 212–229.
- [85] 2013. Towards I/O analysis of HPC systems and a generic architecture to collect access patterns. Computer Science-Research and Development 28 (2013), 241–251.
- [86] 2013. Parallel file system analysis through application I/O tracing. Comput. J. 56, 2 (2013), 141–155.
- [87] 2013. Elastic and scalable tracing and accurate replay of non-deterministic events. In International Conference on Supercomputing. ACM, Eugene, 59–68.
- [88] 2011. Probabilistic communication and I/O tracing with deterministic replay at scale. In International Conference on Parallel Processing. IEEE, Taipei, 196–205.
- [89] 2021. Symplectic structure-preserving particle-in-cell whole-volume simulation of tokamak plasmas to 111.3 trillion particles and 25.7 billion grids. In International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, New York, 1–13.
- [90] 2012. Characterizing output bottlenecks in a supercomputer. In International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 1–11.
- [91] 2016. LIOProf: Exposing Lustre file system behavior for I/O middleware. In Cray User Group Meeting. Cray, London, 1–9.
- [92] 2016. On the root causes of cross-application I/O interference in HPC storage systems. In International Parallel and Distributed Processing Symposium. IEEE, Chicago, 750–759.
- [93] 2020. Vuejs framework. https://vuejs.org.
- [94] 2011. Profiling network performance for multi-tier data center applications. In Symposium on Networked Systems Design and Implementation. USENIX, Boston, 5–5.
- [95] 2008. Performance characterization and optimization of parallel I/O on the Cray XT. In International Symposium on Parallel and Distributed Processing. IEEE, Sydney, 1–11.
- [96] 2013. GoldRush: Resource efficient in situ scientific data analytics using fine-grained interference aware execution. In International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.