Abstract
This paper offers a solution to the complexities of production-system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40,960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Drawing on Beacon’s deployment on TaihuLight for more than three years, we demonstrate its effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others currently being addressed. Encouraged by Beacon’s success in I/O monitoring, we extend it to monitor interconnection networks, another contention point on supercomputers. In addition, we demonstrate Beacon’s generality by extending it to other supercomputers. Both Beacon’s code and part of the collected monitoring data have been released.
1 INTRODUCTION
Modern supercomputers are networked systems with increasingly deep storage hierarchies, serving applications with growing scale and complexity. The long I/O path from storage media to application, combined with complex software stacks and hardware configurations, makes I/O optimizations increasingly challenging for application developers and supercomputer administrators. In addition, because I/O utilizes heavily shared system components (unlike computation or memory accesses), it usually suffers from substantial inter-workload interference, causing high performance variance [23, 32, 37, 45, 52, 63, 71].
Online tools that can capture/analyze I/O activities and guide optimization are urgently needed. They also need to provide I/O usage information and performance records to guide future systems’ design, configuration, and deployment. To this end, several profiling/tracing tools and frameworks have been developed, including application-side (e.g., Darshan [9], ScalableIOTrace [81], and IOPin [34]), back-end side (e.g., LustreDU [7], IOSI [40], and LIOProf [91]), and multi-layer tools (e.g., EZIOTracer [49], GUIDE [80], and Logaider [12]).
These proposed tools, however, have one or more of the following limitations. Application-oriented tools often require developers to instrument their source code or link extra libraries. They also do not offer intuitive ways to analyze inter-application I/O performance behaviors such as interference issues. Back-end-oriented tools can collect system-level performance data and monitor cross-application interactions but have difficulty identifying performance issues for specific applications and finding their root causes. Finally, problematic applications that issue inefficient I/O requests escape the radar of back-end-side analytical methods [40, 41], which rely on identifying high-bandwidth applications.
This paper reports the design, implementation, and deployment of a lightweight, end-to-end I/O resource monitoring and diagnosis system, Beacon, for TaihuLight, currently the fourth-ranked supercomputer in the world [29]. It works with TaihuLight’s 40,960 compute nodes (over ten million cores in total), 288 forwarding nodes, 288 storage nodes, and two metadata nodes. Beacon integrates front-end tracing and back-end profiling into a seamless framework, enabling tasks such as automatic per-application I/O behavior profiling, I/O bottleneck/interference analysis, and system anomaly detection.
To the best of our knowledge, this is the first system-level, multi-layer monitoring and real-time diagnosis framework deployed on ultra-scale supercomputers. Beacon collects performance data simultaneously from different types of nodes (including the compute, I/O forwarding, storage, and metadata nodes) and analyzes them collaboratively, without requiring any involvement of application developers. Its carefully designed collection scheme and aggressive compression minimize the system cost: only 85 part-time servers are needed to monitor the entire 40,960-node system, with \(\lt \!1\%\) performance overhead in user applications.
We have deployed Beacon for production use since April 2017. It has already helped the TaihuLight system administration and I/O performance team identify several performance degradation problems. With its rich I/O performance data collection and real-time system monitoring, Beacon successfully exposes the mismatch between application I/O patterns and widely adopted underlying storage design/configurations. To help application developers and users, it enables detailed per-application I/O behavior study, with novel inter-application interference identification and analysis. Beacon also performs automatic anomaly detection. Finally, we have recently started to expand Beacon beyond I/O to network switch monitoring.
Based on our design and deployment experience, we argue that having such an end-to-end, detailed I/O monitoring framework is highly rewarding. Beacon’s system-level monitoring decouples it from language, library, or compiler constraints, enabling the collection and analysis of monitoring data for all applications and users. Much of its infrastructure reuses existing server/network/storage resources, and it has proved to have negligible overhead. In exchange, users and administrators harvest deep insights into the complex I/O system components’ operations and interactions, and reduce both the human resources and machine core-hours wasted on unnecessarily slow/jittery I/O or system anomalies.
2 TAIHULIGHT NETWORK STORAGE
Let us first introduce the TaihuLight supercomputer (and its Icefish I/O subsystem) used to perform our implementation and deployment. Though the rest of our discussion is based on this specific platform, many aspects of Beacon’s design and operation can be applied to other large-scale supercomputers or clusters.
TaihuLight, currently the fourth-ranked supercomputer in the world, is a many-core accelerated 125-petaflop system [22]. Figure 1 illustrates its architecture, highlighting the Icefish storage subsystem. The 40,960 260-core compute nodes are organized into 40 cabinets, each containing four supernodes. Through dual-rail FDR InfiniBand, all the 256 compute nodes in one supernode are fully connected and then connected to Icefish via a Fat-tree network. In addition, Icefish serves an Auxiliary Compute Cluster (ACC) with Intel Xeon processors, mainly used for data pre- and post-processing.
The Icefish back end employs the Lustre parallel file system [4], with an aggregate capacity of 10 PB on top of 288 storage nodes and 144 Sugon DS800 disk enclosures. An enclosure contains 60 1.2-TB SAS HDD drives, composing six Object Storage Targets (OSTs), each an 8+2 RAID6 array. The controller within each enclosure connects to two storage nodes, via two fiber channels for path redundancy. Therefore, every storage node manages three OSTs, while the two adjacent storage nodes sharing a controller form a failover pair.
Between the compute nodes and the Lustre back end is a layer of 288 I/O forwarding nodes. Each plays a dual role, both as a Lightweight File System (LWFS) based on the Gluster [13] server to the compute nodes and a client to the Lustre back end. This I/O forwarding practice is adopted by multiple other platforms that operate at such a scale [6, 44, 53, 82, 95].
A forwarding node provides a bandwidth of 2.5 GB/s, aggregating to over 720 GB/s for the entire forwarding system. Each back-end controller provides about 1.8 GB/s, amounting to a file system bandwidth of around 260 GB/s. Overall, Icefish delivers 240 GB/s and 220 GB/s aggregate bandwidths for reads and writes, respectively.
TaihuLight debuted on the Top500 list in June 2016. At the time of this study, Icefish was equally partitioned into two namespaces: Online1 (for everyday workloads) and Online2 (reserved for ultra-scale jobs that occupy the majority of the compute nodes), with disjoint sets of forwarding nodes. A batch job can only use one of the two namespaces. I/O requests from a compute node are served by a specified forwarding node using a static mapping strategy for easy maintenance (48 fixed forwarding nodes for the ACC and 80 fixed forwarding nodes for Sunway compute nodes).
Therefore, the two namespaces, along with statically partitioned back-end resources, are currently utilized separately by routine jobs and “VIP” jobs. One motivation for deploying an end-to-end monitoring system is to analyze the I/O behavior of the entire supercomputer’s workloads and design more flexible I/O resource allocation/scheduling mechanisms. For example, motivated by the findings of our monitoring system, a dynamic forwarding allocation system [31] for better forwarding resource utilization was developed, tested, and deployed.
3 BEACON DESIGN AND IMPLEMENTATION
3.1 Beacon Architecture Overview
Figure 2 shows the three components of Beacon: the monitoring component, the storage component, and a dedicated Beacon server. Beacon performs I/O monitoring at six components of TaihuLight: the LWFS client (on the compute nodes), the LWFS server, the Lustre client (the latter two both on the forwarding nodes), the Lustre server (on the storage nodes), the Lustre metadata server (on the metadata nodes), and the job scheduler (on the scheduler node). For the first five, Beacon deploys lightweight daemons that collect I/O-relevant events, status, and performance data locally, then deliver the aggregated and compressed data to Beacon’s distributed databases, deployed on 84 part-time servers. Aggressive first-pass compression is conducted on all compute nodes for efficient per-application I/O trace collection/storage. For the job scheduler, Beacon interacts with the job queuing system to keep track of per-job information, and then sends the job information to the MySQL database (on the 85th part-time server). Details of Beacon’s monitoring component can be found in Section 3.2.
Beacon’s storage component is deployed on 85 of the 288 storage nodes. Beacon distributes its major back-end processing and storage workflow across these storage nodes with their node-local disks, achieving low overall overhead and satisfying service stability. To this end, Beacon divides the 40,960 compute nodes into 80 groups and enlists 80 of the 288 storage nodes to communicate with one group each. Two more storage nodes collect data from the forwarding nodes, plus one for the storage nodes and a final one for the Metadata Server (MDS). Together, these 84 “part-time” servers (shown as “N1” to “N84” in Figure 2) are called log servers and host Beacon’s distributed I/O record database. Given that data are collected from more than 50,000 nodes in total, spreading the collection across this many servers benefits Beacon’s stability and concurrent-access efficiency. In addition, one more storage node (N85 in Figure 2) hosts Beacon’s job database (implemented using MySQL [16]). By leveraging the hardware already available on the supercomputer, we can deploy Beacon quickly.
These log servers adopt a layered software architecture built upon mature open-source frameworks. They collect I/O-relevant events, status, and performance data through Logstash [78], a server-side log processing pipeline for simultaneously ingesting data from multiple sources. The data are then imported to Redis [65], a widely used in-memory data store, acting as a cache to quickly absorb monitoring output. Persistent data storage and subsequent analysis are done via Elasticsearch [36], a distributed lightweight search and analytics engine supporting a NoSQL database. It also supports efficient Beacon queries for real-time and offline analysis.
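As a rough illustration of this pipeline, a minimal Logstash configuration on a log server might ingest local monitoring output and buffer it in Redis; the file path, host, and key below are invented for illustration, not Beacon’s actual settings.

```
input {
  file { path => "/var/log/beacon/*.log" }   # local monitoring output
}
output {
  redis {                                    # buffer into the Redis cache
    host      => "127.0.0.1"
    data_type => "list"
    key       => "beacon-monitoring"
  }
}
```

A second pipeline stage would then drain the Redis list into Elasticsearch for persistent storage and querying.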
Finally, Beacon conducts data analytics and visualizes the results for its users (either system administrators or application users) on a dedicated Beacon server. This server performs two kinds of offline data analysis periodically: (1) second-pass, inter-node compression, which further removes data redundancy by comparing and combining logs from compute nodes running the same job, and (2) extraction of per-job statistic summaries, cached in MySQL as SQL views, along with generation of common performance visualization results, cached in Redis, to facilitate speedy user response. Log and monitoring data, after the two-pass compression, are permanently stored using Elasticsearch on this dedicated Beacon server. Data in the distributed I/O record database are kept for six months. Considering the typical daily data collection size of 10–100 GB, the server’s 120-TB RAID5 capacity far exceeds the system’s lifetime storage space needs.
Beacon’s web interface uses the Vue [93]+Django [19] framework, which can efficiently separate the front end (a user-friendly GUI for processing and visualizing the I/O-related job/system information queries) and the back end (the service for obtaining the analysis results of Beacon and feeding them back to the front end). For instance, application users can query a summary of their programs’ I/O behavior based on the job ID, along the entire I/O path, to help diagnose I/O performance problems. Moreover, system administrators can monitor real-time load levels on all forwarding nodes, storage nodes, and metadata servers, facilitating future job scheduling optimizations and center-level resource allocation policies. Figure 3 shows the corresponding screenshots. Section 4 provides more details, with concrete case studies.
All communication among Beacon entities uses a low-cost, easy-to-maintain Ethernet connection (marked in green in Figure 1) that is separate from both the main computation and the storage interconnects.
3.2 Multi-layer I/O Monitoring
Figure 4 shows the format of all data collected by Beacon, including the LWFS client trace entry, LWFS server log entry, Lustre client log entry, Lustre server log entry, Lustre MDS log entry, and Job scheduler log entry. For details, see the following section.
3.2.1 Compute Nodes.
On each of the 40,960 compute nodes, Beacon collects LWFS client trace logs by instrumenting the FUSE (Filesystem in Userspace) [17] layer. Each log entry contains the node’s IP, I/O operation type, file descriptor, offset, request size, and timestamp.
On a typical day, such raw trace data alone amount to over 100 GB, making their collection/processing a non-trivial task for Beacon’s I/O record database, which takes away resources from the storage nodes. However, there exists abundant redundancy in HPC workloads’ I/O operations. For example, as each compute node is usually dedicated to one job at a time, the job IDs are identical among many trace entries. Similarly, owing to the regular, tightly coupled nature of many parallel applications, adjacent I/O operations likely have common components, such as the target file, operation type, and request size. Recognizing this, Beacon performs aggressive online compression on each compute node to dramatically reduce the I/O trace size. This is done by a simple, linear algorithm that compares adjacent log items and combines those with an identical operation type, file descriptor, and request size that access contiguous areas. Such log items are replaced with a single item plus a counter. Given the low computing overhead, we perform this parallel first-pass compression on the compute nodes themselves.
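The first-pass merging can be sketched as follows; the entry fields mirror the trace format described above, but the function and field names are illustrative rather than Beacon’s actual implementation.

```python
# Hypothetical sketch of Beacon's first-pass, on-node trace compression:
# adjacent entries with the same operation type, file descriptor, and
# request size that access contiguous offsets collapse into one entry
# carrying a repeat counter.

def compress_trace(entries):
    """entries: dicts with "op", "fd", "size", and "offset" keys,
    in arrival order."""
    compressed = []
    for e in entries:
        if compressed:
            last = compressed[-1]
            # The next contiguous offset follows the end of the merged run.
            contiguous = (last["offset"] + last["count"] * last["size"]
                          == e["offset"])
            if (last["op"] == e["op"] and last["fd"] == e["fd"]
                    and last["size"] == e["size"] and contiguous):
                last["count"] += 1  # extend the run instead of appending
                continue
        compressed.append(dict(e, count=1))
    return compressed
```

A node writing a file sequentially in fixed-size chunks thus reduces thousands of entries to a single record, consistent with the heavy redundancy described above.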
Beacon conducts offline log processing and second-pass compression on the dedicated server. Here, it extracts the feature vector \(\lt\)time, operation, file descriptor, size, offset\(\gt\) from the original log records and performs inter-node compression by comparing feature vector lists from all nodes and merging identical vectors, using a similar approach as in block trace modeling [77] or ScalaTrace [54].
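The inter-node second pass can likewise be sketched, here in a deliberately simplified form that only merges nodes whose feature-vector lists are exactly identical (ScalaTrace-style merging is more sophisticated; all names are illustrative):

```python
# Hypothetical sketch of the second-pass, inter-node compression: compute
# nodes whose <time, operation, file descriptor, size, offset> vector
# lists match exactly are recorded once, with the node list attached.

def merge_across_nodes(node_logs):
    """node_logs: {node_id: list of feature-vector tuples}."""
    merged = {}
    for node, vectors in node_logs.items():
        key = tuple(tuple(v) for v in vectors)  # hashable signature
        merged.setdefault(key, []).append(node)
    return [{"nodes": nodes, "vectors": list(key)}
            for key, nodes in merged.items()]
```

For tightly coupled SPMD applications, many ranks issue identical request sequences, so this merge removes most of the remaining cross-node redundancy.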
Table 1 summarizes the effectiveness of Beacon’s monitoring data compression. It gives the compression ratios of the two methods for eight applications, including six open-source applications (APT [84],
3.2.2 Forwarding Nodes.
On each forwarding node, Beacon profiles both the LWFS server and Lustre client. It collects the latency and processing time for each LWFS server request by instrumenting all I/O operations at the POSIX layer and the request queue length for each LWFS server by sampling the queue status once per 1,000 requests. Rather than saving the per-request traces, the Beacon daemon periodically processes new traces and only saves I/O request statistics such as latency and queue length distribution.
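The reduction from per-request traces to statistics might look like the following sketch, which condenses a window of request latencies into a histogram plus aggregates (the bucket boundaries and names are invented for illustration):

```python
# Hypothetical sketch of the periodic trace-to-statistics reduction on a
# forwarding node: keep a latency distribution and aggregates, discard
# the raw per-request records.

def summarize_latencies(latencies_us, bucket_bounds=(100, 1000, 10000)):
    """latencies_us: per-request latencies (microseconds) in one window.
    Returns a histogram over the given bucket bounds plus aggregates."""
    hist = [0] * (len(bucket_bounds) + 1)
    for lat in latencies_us:
        for i, bound in enumerate(bucket_bounds):
            if lat < bound:
                hist[i] += 1
                break
        else:
            hist[-1] += 1  # beyond the last bound
    return {"count": len(latencies_us),
            "mean": sum(latencies_us) / max(len(latencies_us), 1),
            "hist": hist}
```

Storing only such summaries keeps the monitoring footprint nearly constant regardless of the request rate.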
For the Lustre client, Beacon collects request statistics by sampling the status of all outstanding RPC requests once every second. Each sample contains the forwarding ID and RPC request size sent to the Lustre server.
3.2.3 Storage Nodes and MDS.
On the storage nodes, Beacon daemons periodically sample the Lustre OST status table, record data items such as the OST ID and OST total data size, and further send high-level statistics such as the count of RPC requests and average per-RPC data size in the past time window. On the Lustre MDS, Beacon also periodically collects and records statistics on active metadata operations (such as open and lookup) at 1-second intervals while storing a summary of the periodic statistics in its database.
3.3 Multi-layer I/O Profiling
All the aforementioned monitoring data are transmitted as JSON objects to the database on the dedicated Beacon server for long-term storage and processing, on top of which Beacon builds I/O monitoring/profiling services. These include automatic anomaly detection, which runs periodically, as well as query and visualization tools, which supercomputer users and administrators can use interactively. Below, we describe these functions in more detail.
3.3.1 Automatic Anomaly Detection.
Beacon performs two types of automatic anomaly detection. The first locates job I/O performance anomalies, which are common in complicated HPC environments. Various factors can cause them, I/O interference being among the major ones; as supercomputer architectures become more complicated, such interference becomes increasingly difficult to identify and locate. The second type identifies node anomalies. Outright failure, where a node is entirely out of service, is a common node anomaly that can be detected relatively straightforwardly in a large system and is commonly handled by tools such as heartbeat detection [67, 74]; we do not discuss it in this paper. Instead, we focus on faulty system components: components that are alive yet slow, such as forwarding nodes and OSTs under performance degradation. These may continue to serve requests, but at a much slower pace, draining the entire application’s performance and reducing overall system utilization. In a dynamic storage system serving multiple platforms and many concurrent applications, such stragglers are difficult to identify.
With Beacon’s continuous, end-to-end, multi-layer I/O monitoring, application developers and supercomputer administrators gain a new option for examining job performance and system health: connecting statistics on application-issued I/O requests to individual OSTs’ bandwidth measurements. Such a connection guides Beacon in deducing what is the norm and what is an exception. Leveraging this capability, we design and implement a lightweight, automatic anomaly detection tool. Figure 5 shows its workflow.
The left part of the figure shows the job I/O performance anomaly detection workflow. Beacon detects job I/O performance anomalies by checking newly measured I/O performance results against historical records, based on the assumption that most data-intensive applications have relatively consistent I/O behavior. First, it adopts the automatic I/O phase identification technique of the IOSI system [40], developed on the Oak Ridge National Laboratory Titan supercomputer, which uses the Discrete Wavelet Transform (DWT) to find distinct “I/O bursts” in continuous I/O bandwidth time-series data. Then, Beacon deploys a two-stage approach to detect jobs’ abnormal I/O phases. In the first stage, Beacon classifies the I/O phases into several distinct categories in terms of their I/O mode and total I/O volume using the DBSCAN clustering algorithm [18]. In the second stage, Beacon calculates the I/O phases’ performance vectors for each category, clusters these vectors with DBSCAN again, and then identifies each job’s abnormal I/O phases from the clustering results. Here, we propose a new measurement feature, the performance vector, which describes the I/O phase’s throughput waveform. Intuitively, the throughput of an abnormal I/O phase is substantially lower for most of the phase’s duration compared to an I/O phase with normal performance. Therefore, the throughput distribution can be an important feature for differentiating whether an I/O phase is abnormal.
The process of calculating the performance vector is shown in Algorithm 1. We divide the throughput range between the phase’s minimum and maximum into N intervals and determine the time span the I/O phase spends in each. Here, we take
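A minimal sketch of this computation, assuming the phase’s throughput is given as an evenly sampled time series and using N equal-width intervals (function and parameter names are illustrative, not Algorithm 1 verbatim):

```python
# Hypothetical sketch of the performance-vector computation: the fraction
# of a phase's samples whose throughput falls into each of n_intervals
# equal-width bins between the phase's minimum and maximum.

def performance_vector(throughput_series, n_intervals=10):
    lo, hi = min(throughput_series), max(throughput_series)
    width = (hi - lo) / n_intervals or 1.0  # guard against a flat series
    vec = [0] * n_intervals
    for t in throughput_series:
        idx = min(int((t - lo) / width), n_intervals - 1)
        vec[idx] += 1
    total = len(throughput_series)
    return [c / total for c in vec]
```

A phase that lingers near its minimum throughput yields a vector weighted toward the low bins, which is the signature the second DBSCAN stage separates from normal phases.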
Then, Beacon utilizes its rich monitoring data to examine neighbor jobs that share forwarding node(s) with the abnormal job when outliers are found. In particular, it judges the cause of the anomaly by whether such neighbors have interference-prone features, such as high MDOPS, high I/O bandwidth, high IOPS, or N:1 I/O mode. The I/O mode indicates the parallel file sharing mode among processes, where common modes include “N:N” (each compute process accesses a separated file), “N:1” (all processes share one file), “N:M” (N processes perform I/O aggregation to access M files, M\(\lt\)N), and “1:1” (only one of all processes performs sequential I/O on a single file). Such findings are saved in the Beacon database and provided to users via the Beacon web-based application I/O query tool. Applications, of course, need to accumulate at least several executions for such detection to take effect.
The right part of Figure 5 shows the workflow of Beacon’s node anomaly detection, which relies on the execution of large-scale jobs (those using 1,024 or more compute nodes in our current implementation). To spot outliers, it leverages the common homogeneity in I/O behavior across compute and server nodes. Beacon’s multi-level monitoring allows the correlation of I/O activities or loads back to actual client-side issued requests. Again, by using clustering algorithms like DBSCAN and configurable thresholds, Beacon performs outlier detection across forwarding nodes and OSTs involved in a single job, where the vast majority of entities report a highly similar performance, while only a few members produce contrasting readings. Figure 15 in Section 4.3 gives an example of per-OST bandwidth data within the same execution.
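Beacon’s actual outlier step uses DBSCAN with configurable thresholds; as a simplified, hypothetical stand-in, flagging OSTs far below the median bandwidth of their peers in the same job conveys the idea:

```python
# Simplified, hypothetical straggler detection: within one large job,
# most OSTs report similar bandwidth, so an OST far below the peer
# median is flagged as a suspected faulty (alive-yet-slow) component.

def find_straggler_osts(bandwidths, rel_threshold=0.5):
    """bandwidths: {ost_id: measured bandwidth}. Returns flagged OST IDs."""
    vals = sorted(bandwidths.values())
    mid = len(vals) // 2
    median = vals[mid] if len(vals) % 2 else (vals[mid - 1] + vals[mid]) / 2
    return sorted(ost for ost, bw in bandwidths.items()
                  if bw < rel_threshold * median)
```

The same peer-comparison logic applies across the forwarding nodes serving one job, exploiting the homogeneity mentioned above.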
3.3.2 Per-job I/O Performance Analysis.
Upon a job’s completion, Beacon performs automatic analysis of its I/O monitoring data collected from all layers. It performs inter-layer correlation by first identifying jobs from the job database that ran on the given compute node(s) at the log entry collection time. The involved forwarding nodes, and thus the relevant forwarding monitoring data, are then located via the compute-to-forwarding node mapping, using a system-wide mapping table lookup. As mentioned above, this mapping is statically configured on TaihuLight. Finally, the relevant OSTs and corresponding storage-node monitoring data entries are found by a file system lookup using the Lustre command
From the above data, Beacon derives and stores coarse-grained information for quick query, including the average and peak I/O bandwidth, average IOPS, runtime, number of processes (and compute nodes) performing I/O, I/O mode, total count of metadata operations, and average metadata operations per second during I/O phases.
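Deriving these coarse-grained metrics from a per-job aggregate bandwidth time series can be sketched as follows (a simplified illustration; the interval length and field names are assumptions, not Beacon’s schema):

```python
# Hypothetical sketch of per-job summary derivation from an aggregate
# bandwidth time series (bytes/s per sampling interval). Intervals with
# zero bandwidth are treated as outside I/O phases.

def job_summary(bandwidth_ts, interval_s=1.0):
    active = [b for b in bandwidth_ts if b > 0]
    return {
        "peak_bw": max(bandwidth_ts, default=0),
        "avg_bw": sum(active) / len(active) if active else 0,
        "io_time_s": len(active) * interval_s,
        "total_bytes": sum(bandwidth_ts) * interval_s,
    }
```

Precomputing such summaries at job completion is what makes the later web queries respond quickly without touching the raw traces.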
To help users understand/debug their applications’ I/O performance, Beacon provides web-based I/O data visualization. This diagnosis system can be queried using a job ID, and after appropriate authentication, it allows visualizing the I/O statistics of the job, both real-time and post-mortem. It reports the measured I/O metrics (such as bandwidth and IOPS) and inferred characteristics (such as the number of I/O processes and I/O mode). Users are also presented with user-configurable visualization tools, showing time-series measurement in I/O metrics, statistics information such as request type/size distribution, and performance variances. Our powerful I/O monitoring database allows for further user-initiated navigation, such as per-compute-node traffic history and zooming control to examine data at different granularity. For security/privacy, users are only allowed to view I/O data from compute, forwarding, and storage nodes involved in and for the duration of their jobs’ execution.
3.3.3 I/O Subsystem Monitoring for Administrators.
Beacon also provides administrators with the capability to monitor the I/O status for any time period, on any node.
Besides all the user-visible information and facilities mentioned above, administrators can further obtain and visualize: (1) the detailed I/O bandwidth and IOPS for each compute node, forwarding node, and storage node, (2) resource utilization status of forwarding nodes, storage nodes and the MDS, including detailed request queue length statistics, and (3) I/O request latency distribution on forwarding nodes. Additionally, Beacon grants administrators direct I/O record database access to facilitate in-depth analysis.
Combining such facilities, administrators can perform powerful and thorough I/O traffic and performance analysis, for example, by checking multi-level traffic, latency, and throughput monitoring information regarding a job execution.
3.4 Generality
Beacon is not an ad-hoc I/O monitoring system for TaihuLight: it can be adapted both to data collection in other domains and to other platforms. Beacon’s building blocks, such as its operation log collection, compression, and data management components, are equally suitable for collecting data from other domains. Section 4.5.1 shows an example of collecting network data.
In addition, Beacon is also applicable to other advanced supercomputers with the I/O forwarding architecture. Beacon’s multi-layer data collection and storage, scheduler-assisted per-application data correlation and analysis, history-based anomaly identification, automatic I/O mode detection, and built-in interference analysis can all be performed on other supercomputers. Its data management components, such as Logstash, Redis, and ElasticSearch, are open-source software that can run on these machines as well. Our forwarding layer design validation and load analysis can also help recent platforms with a layer of burst buffer nodes, such as NERSC’s Cori [10]. Section 4.5.2 gives an example of extending Beacon to another supercomputer with the I/O forwarding architecture.
Finally, we find that while Beacon is designed and deployed on a cutting-edge supercomputer with multi-layer architectures, it can also be applied to traditional two-layer supercomputers. An example of extending Beacon to a traditional two-layer supercomputer is given in Section 4.5.3.
4 BEACON USE CASES
We now discuss several use cases of Beacon. Beacon has been deployed on TaihuLight for over three years, gathering massive amounts of I/O information and accumulating around 25 TB of trace data (after two passes of compression) from April 2017 to July 2020. As TaihuLight’s back-end storage changed in August 2020, we use data from before August 2020 for analysis. This history contains 1,460,662 jobs using at least 32 compute nodes and consuming 789,308,498 core-hours in total. Of these jobs, 238,585 (16.3%) featured non-trivial I/O, with per-job I/O volume over 200 MB.
The insights and issues revealed by Beacon’s monitoring and diagnosis have already helped TaihuLight administrators fix several design flaws, develop a dynamic and automatic forwarding node allocation tool, and improve system reliability and application efficiency. Owing to Beacon’s success on TaihuLight, we extend Beacon to other platforms. In this section, we focus on four types of use cases and the extended applications of Beacon for network monitoring and monitoring of different storage architectures:
(1) System performance overview
(2) Performance issue diagnosis
(3) Automatic I/O anomaly diagnosis
(4) Application and user behavior analysis
4.1 System Performance Overview
Beacon’s multi-layer monitoring, especially its I/O subsystem monitoring, gives us an overview of the whole system, which helps in managing current storage systems and designing future ones. Liu et al. [41] used Titan as an example to show that individual pieces of hardware (such as storage nodes and disks) are often under-utilized in HPC storage systems, and we make similar observations on TaihuLight. Figure 7 shows eight months of back-end utilization statistics for the Lustre parallel file system on the TaihuLight supercomputer. For each object storage target (OST), a disk array, we plot the percentage of time it reaches a certain average throughput, normalized to its peak throughput. OSTs are almost idle (using less than 1% of the I/O bandwidth) at least 60% of the time, and their utilization stays below 5% about 70% of the time. We can therefore conclude that OSTs are under-utilized most of the time. Moreover, Beacon’s multi-layer monitoring data lead us to similar conclusions for compute and forwarding nodes.
Beyond conclusions about individual layers, Beacon can also discover relationships between different layers, which traditional trace tools cannot capture. Figure 8 shows the daily access volume at three layers during the sample period. For read operations, the total daily volume requested by the compute layer is larger than that of the forwarding layer most of the time, indicating effective caching by the Lustre clients on the forwarding layer. Occasionally, the read volume requested by the forwarding layer is much larger than that of the compute layer, revealing cache thrashing, which we discuss in detail later in this section. For write operations, the total daily volume requested by the forwarding layer is always slightly larger than that of the compute layer. A major reason is write amplification, caused by writes being aligned to a 4-KB request size (or multiples of 4 KB).
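The write amplification can be illustrated with a small sketch that rounds every request up to a 4-KB multiple; this is a hypothetical model of the alignment effect described above, not Beacon’s measurement code:

```python
# Model of 4-KB alignment: the forwarding layer issues each write rounded
# up to a multiple of the 4096-byte block size, so the back-end volume
# exceeds the volume the compute layer actually requested.

def amplified_write_volume(request_sizes, block=4096):
    """Total back-end bytes written for the given request sizes (bytes)."""
    return sum((size + block - 1) // block * block for size in request_sizes)
```

For example, a 100-byte request still costs a full 4,096 bytes at the back end, so workloads with many small unaligned writes show the largest gap between the two layers.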
The OST layer, however, tells a different story. We find that both the read and write volumes at the compute and forwarding layers are much smaller than at the OST layer. Besides write amplification, there are other reasons for this gap. In addition to the compute and forwarding nodes of TaihuLight, other nodes, such as login or ACC nodes, can also access the shared Lustre back-end storage system; currently, Beacon does not monitor these nodes. Still, from the figure we can conclude that system administrators should also pay attention to the file-system access load from login and ACC nodes. According to our survey, users often perform heavy file I/O there, such as copying data from local file systems to Lustre or from one directory to another on login nodes, or running data post-processing on ACC nodes. More details are given in Section 4.4.
4.2 Performance Issue Diagnosis
4.2.1 Forwarding Node Cache Thrashing.
Beacon’s end-to-end monitoring facilitates cross-layer correlation of I/O profiling data at different temporal and spatial granularities. By comparing the total request volume at each layer, Beacon helped TaihuLight’s infrastructure management team identify a previously unknown performance issue, as detailed below.
A major driver for the adoption of I/O forwarding or burst buffer layers is the opportunity to perform prefetching, caching, and buffering, so as to reduce pressure on slower disk storage. Figure 9 shows the read volume at the compute and forwarding node layers during two sampled 70-hour periods in August 2017. Figure 9(a) shows a case with expected behavior, where the total volume requested by the compute nodes is significantly higher than that requested by the forwarding nodes, signaling good access locality and effective caching. Figure 9(b), however, tells the opposite story, to the surprise of system administrators: the forwarding layer incurs much higher read traffic from the back end than requested by user applications, reading much more data from the storage nodes than it returns to the compute nodes. No such gap appears for writes, where Beacon always shows matching aggregate bandwidth across the two layers.
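The expected-vs-thrashing distinction reduces to a ratio test between the two layers’ read volumes over a common window. A minimal sketch, with hypothetical names and an illustrative slack factor of our own choosing:

```python
def caching_effectiveness(compute_read_bytes, forwarding_read_bytes):
    """Ratio of the read volume requested by compute nodes to the volume
    the forwarding layer reads from back-end storage over the same
    window. Ratio > 1 indicates effective caching; ratio < 1 means the
    forwarding layer re-reads data it has already evicted."""
    if forwarding_read_bytes == 0:
        return float("inf")
    return compute_read_bytes / forwarding_read_bytes

def is_thrashing(compute_read_bytes, forwarding_read_bytes, slack=0.9):
    # slack < 1 tolerates small mismatches from timing skew between layers.
    return caching_effectiveness(compute_read_bytes, forwarding_read_bytes) < slack
```

In Figure 9(a) the ratio stays well above 1; in Figure 9(b) it drops far below, which is what triggered the investigation described next.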
Further analysis of the applications executed and their assigned forwarding nodes during the problem period in Figure 9(b) reveals an unknown cache thrashing problem, caused by the N:N sequential data access behavior. By default, the Lustre client has a 40-MB read-ahead cache for each file. Under the N:N sequential read scenarios, such aggressive prefetching causes severe memory contention, with data repeatedly read from the back end (and evicted on forwarding nodes). For example, a 1024-process
4.2.2 Bursty Forwarding Node Utilization.
Beacon’s continuous end-to-end I/O monitoring gives center management a global picture on system resource utilization. While such systems have often been built and configured using rough estimates based on past experience, Beacon collects detailed resource usage history to help improve the current system’s efficiency and assist future system upgrade and design.
Figure 10 gives one example, again on the forwarding load distribution, by showing two 1-day samples from July 2017. Each row portrays the by-hour peak load on one of the same 40 forwarding nodes randomly sampled from the 80 active ones. The darkness reflects the maximum bandwidth reached within that hour. The labels “high”, “mid”, “low”, and “idle” correspond to the maximum residing in the >90%, 50–90%, 10–50%, or 0–10% interval (relative to the benchmarked per-forwarding-node peak bandwidth), respectively.
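The labeling rule above is a straightforward binning of each hour’s peak bandwidth; a sketch (function name is ours, intervals are those of Figure 10):

```python
def load_label(max_bw, peak_bw):
    """Map an hour's maximum observed bandwidth to the heat-map label,
    using the intervals from Figure 10: >90%, 50-90%, 10-50%, 0-10%."""
    u = max_bw / peak_bw
    if u > 0.9:
        return "high"
    if u > 0.5:
        return "mid"
    if u > 0.1:
        return "low"
    return "idle"
```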
Figure 10(a) shows the more typical load distribution, where the majority of forwarding nodes stay lightly used for the vast majority of the time (90.7% of cells show a maximum load of under 50% of peak bandwidth). Figure 10(b) gives a different picture, with a significant set of sampled forwarding nodes serving I/O-intensive large jobs for a good part of the day. Moreover, 35.7% of the cells actually see a maximum load of over 99% of the peak forwarding node bandwidth.
These results indicate that (1) overall, there is forwarding resource overprovisioning (confirming prior findings [27, 41, 47, 62]); (2) even with the more representative low-load scenarios, it is not rare for the forwarding node bandwidth to be saturated by application I/O; and (3) a load imbalance across forwarding nodes exists regardless of load level, making idle resources potentially helpful to I/O-intensive applications.
4.2.3 MDS Request Priority Setting.
Overall, we find that most TaihuLight jobs are rather metadata-light, but Beacon does observe a small fraction of parallel jobs (0.69%) with a high metadata request rate (more than 300 metadata operations/s on average during I/O phases). Beacon finds that these metadata-heavy (“high-MDOPS”) applications tend to cause significant I/O performance interference. Among jobs with Beacon-detected I/O performance anomalies, those sharing forwarding nodes with high-MDOPS jobs experience an average 13.6× increase in read/write request latency during the affected time periods.
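Flagging such jobs amounts to comparing a job’s average metadata operation rate during its I/O phases against the 300 ops/s threshold. A sketch under that assumption (names and data layout are illustrative, not Beacon’s actual interface):

```python
def is_high_mdops(metadata_ops, io_phase_seconds, threshold=300.0):
    """Flag a job as metadata-heavy when it averages more than `threshold`
    metadata operations per second across its I/O phases."""
    if io_phase_seconds == 0:
        return False
    return metadata_ops / io_phase_seconds > threshold

def high_mdops_fraction(jobs, threshold=300.0):
    """Fraction of jobs flagged high-MDOPS; `jobs` is a list of
    (total_metadata_ops, total_io_phase_seconds) tuples."""
    if not jobs:
        return 0.0
    flagged = sum(1 for ops, secs in jobs if is_high_mdops(ops, secs, threshold))
    return flagged / len(jobs)
```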
Such severe delays, together with Beacon’s forwarding node queue status history, prompted us to examine the TaihuLight LWFS server policy. We find that metadata requests are given priority over file I/O, owing to the single-MDS design and the need to provide fast responses to interactive user operations such as
4.3 Automatic I/O Anomaly Diagnosis
In extreme-scale supercomputers, users typically accept jittery application performance, recognizing the widespread resource sharing among jobs. System administrators, meanwhile, observe different behaviors among homogeneously configured system components, but cannot tell how much of the difference stems from the components’ own functioning and how much from the diversity of the tasks they serve.
Beacon’s multi-layer monitoring capability therefore presents a new window for supercomputer administrators to examine system health, connecting statistics on application-issued I/O requests all the way to individual OSTs’ bandwidth measurements.
4.3.1 Overview of Anomaly Detection Results of Applications.
Figure 12 shows the results of anomaly detection with historical data collected from April 2017 to July 2020. Our results show that about 4.8% of all jobs that featured non-trivial I/O have experienced abnormal performance.
Figure 12(a) shows abnormal jobs’ categories distribution. Low-bandwidth jobs make up the majority of all jobs, and
Figure 12(b) categorizes the factors that neighbor jobs impose on abnormal jobs into three groups: (1) system anomaly, (2) I/O interference, and (3) unknown factors. I/O interference factors include the N:1 I/O mode, high MDOPS, high I/O bandwidth, high IOPS, mixed causes, and multiple jobs. The figure shows that interfering neighbor jobs account for more than 90% of the cases, implying that inter-application interference is the predominant cause of performance degradation. Among these, interference caused by jobs with the N:1 I/O mode occupies the largest share, meaning that N:1 jobs are not only susceptible to disturbance but also interfere with other applications; Section 4.4 provides more details. Mixed causes and high-MDOPS jobs rank second and third, respectively. The LWFS server thread pool on each forwarding node is restricted to 16 threads, and jobs suffer performance degradation when I/O operations on the same forwarding node exceed the thread pool’s service capability.
4.3.2 Applications Affected by Interference.
Figure 13 illustrates an example of 1024-process
4.3.3 Application-driven Anomaly Detection.
Most I/O-intensive applications have distinct I/O phases (i.e., episodes in their execution where they perform I/O continuously), such as those to read input files during initialization or to write intermediate results or checkpoints. For a given application, such I/O phase behavior is often consistent. Taking advantage of such repeated I/O operations and its multi-layer I/O information collection, Beacon performs automatic I/O phase recognition, on top of which it conducts I/O-related anomaly detection. More specifically, larger applications (e.g., those using 1024 compute nodes or more) spread their I/O load to multiple forwarding nodes and back-end nodes, giving us opportunities to directly compare the behavior of servers processing requests known to Beacon as homogeneous or highly similar.
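The peer-comparison step can be sketched as follows: for servers known to carry near-identical load for the same application, flag those whose throughput falls well below the peer median. The function name and the 50% threshold are our own illustrative choices:

```python
from statistics import median

def deviant_servers(bandwidth_by_server, rel_threshold=0.5):
    """Given per-server bandwidth measured over the same application I/O
    phase (servers expected to process homogeneous request streams),
    return the servers whose throughput falls below `rel_threshold`
    of the peer median."""
    med = median(bandwidth_by_server.values())
    return sorted(name for name, bw in bandwidth_by_server.items()
                  if bw < rel_threshold * med)

# One forwarding node lags far behind its peers serving the same job.
suspects = deviant_servers({"fwd0": 1000.0, "fwd1": 980.0, "fwd2": 300.0})
```

Using the median rather than the mean keeps a single slow server from dragging down the baseline it is compared against.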
Figure 14 gives an example of a 6000-process
4.3.4 Anomaly Alert and Node Screening.
Such continuous, online application performance anomaly detection can identify forwarding nodes or back-end units with deviant performance metrics, which in turn triggers Beacon’s more detailed monitoring and analysis. If it finds such a system component to consistently under-perform relative to peers serving similar workloads, with configurable thresholds in monitoring window and degree of behavior deviation, it reports this as an automatically detected system anomaly. By generating and sending an alarm email to the system administration team, Beacon prompts system administrators to do a thorough examination, where its detailed performance history information and visualization tools are also helpful.
Such anomaly screening is particularly important for expensive, large-scale executions. For example, among all applications running on TaihuLight so far, the parallel graph engine
However, without Beacon’s back-end monitoring, applications like
Beacon has been deployed on TaihuLight since April 2017, with features and tools incrementally developed and added to production use. Table 2 summarizes the automatically identified I/O system anomalies at the two service layers from April 2017 to July 2020. An anomaly is identified when the measured maximum bandwidth stays under 30% of the known peak value for at least 60 minutes; both parameters can be configured to adjust the detector’s sensitivity. Most performance anomalies are found to be transient, lasting under 4 hours.
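The 30%-of-peak, 60-minute rule amounts to finding sufficiently long low-bandwidth runs in each node’s time series. A minimal sketch under those parameters (function and sampling interval are our own, illustrative choices):

```python
def anomaly_intervals(series, peak, frac=0.3, min_minutes=60, step_minutes=1):
    """Scan a per-node bandwidth time series (one sample per
    `step_minutes`) and return (start_index, length_in_minutes) for
    each run where bandwidth stays below `frac` of the known peak
    for at least `min_minutes`."""
    intervals, start = [], None
    for i, bw in enumerate(series):
        if bw < frac * peak:
            if start is None:
                start = i
        elif start is not None:
            length = (i - start) * step_minutes
            if length >= min_minutes:
                intervals.append((start, length))
            start = None
    if start is not None:  # run extends to the end of the series
        length = (len(series) - start) * step_minutes
        if length >= min_minutes:
            intervals.append((start, length))
    return intervals
```

Raising `frac` or lowering `min_minutes` makes the detector more sensitive, matching the tunability described above.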
There are a total of 70 performance anomalies lasting over 4 hours on the forwarding layer and 98 on the back-end layer, confirming the existence of fail-slow situations common in data centers [28]. Causes of such relatively long yet “self-healed” anomalies include service migration and RAID reconstruction. With our rather conservative settings during the initial deployment period, Beacon sends the aforementioned alert email when a detected anomaly lasts beyond 96 hours (except for large-scale production runs as in the
4.4 Application and User Behavior Analysis
With its powerful information collection and multi-layer I/O activity correlation, Beacon provides a new capability to perform detailed application or user behavior analysis. Results of such analysis assist in performance optimization, resource provisioning, and future system design. Here, we showcase several application/user behavior studies, some of which have led to corresponding optimizations or design changes to the TaihuLight system.
4.4.1 Application I/O Mode Analysis.
First, Table 3 gives an overview of the I/O volume across all profiled jobs with non-trivial I/O, categorized by per-job core-hour consumption. Here, 1,000 K core-hours corresponds to a 10-hour run using 100,000 cores on 25,000 compute nodes; jobs at this scale or above write more than 40 TB of data on average. Further examination reveals that in each core-hour category, average read/write volumes are dominated by a minority of heavy consumers. Overall, the amount of data read/written grows with compute resource consumption: less resource-intensive applications tend to perform more reads, while larger consumers are more write-intensive.
Figure 16 shows the breakdown of I/O-mode adoption among all TaihuLight jobs performing non-trivial I/O, by total read/write volume. The first impression one takes from these results is that the rather “extreme” cases, such as N:N and 1:1, form the dominant choices, especially for writes. We suspected that this distribution might be skewed by a large number of small jobs doing limited I/O, and therefore calculated the average per-job read/write volume for each I/O mode. The results (Table 4) show that this is not the case: applications that use the 1:1 mode for writes actually have a much higher average write volume.
The 1:1 mode is the closest to sequential processing behavior and is conceptually simple. However, it obviously lacks scalability and fails to utilize the abundant hardware parallelism in the TaihuLight I/O system. The wide presence of this I/O mode may help explain the overall under-utilization of forwarding resources discussed earlier in Section 4.2. Echoing similar (though less extreme) findings on other supercomputers [47] (including Intrepid [30], Mira [58], and Edison [51]), effective user education on I/O performance and scalability can both improve storage system utilization and reduce wasted compute resources.
The N:1 mode tells a different story. It is an intuitive parallel I/O solution that allows compute processes to directly read into or write from their local memory without gather-scatter operations, while retaining the convenience of a single input/output file. However, our detailed monitoring finds it to be a damaging I/O mode that users should steer away from, as explained below.
First, our monitoring results confirm the findings of existing research [2, 46]: the N:1 mode offers low application I/O performance (reading from and writing to a shared file). Even with a large N, such applications receive no more than 250 MB/s of aggregate I/O throughput, despite the TaihuLight back end’s combined peak bandwidth of 260 GB/s. For reads, users here also rarely modify the default Lustre stripe width, confirming behavior reported in a recent ORNL study [38]. The problem is much worse for writes, where performance severely degrades owing to file system locking.
This study, however, finds that applications with the N:1 mode are extraordinarily disruptive, as they harm all kinds of neighbor applications that share forwarding nodes with them, particularly when N is large (e.g., over 32 compute nodes).
The reason is that each forwarding node runs an LWFS server thread pool (currently sized at 16) that serves its assigned compute nodes. Applications using the N:1 mode tend to flood this thread pool with bursts of requests. Unlike the N:N or N:M modes, N:1 suffers from the aforementioned poor back-end performance of a single shared file. This makes N:1 requests slow to process, further exacerbating their congestion in the queue and delaying requests from other applications, even when those victims access disjoint back-end servers and OSTs.
Here, we give a concrete example of I/O mode-induced performance interference, featuring an earthquake simulation
Table 5 lists the two applications’ average request wait times, processing times, and forwarding node queue lengths during these runs. Note that with the “co-run”, the queue is shared by both applications. We find that the average wait time of
In our case, the Beacon developers worked with the
This change produced an over 400% improvement in I/O performance. Note that the GB Prize submission does not report I/O time; we find that
4.4.2 Metadata Server Usage.
Unlike forwarding nodes’ utilization (discussed earlier), the Lustre MDS is found with rather evenly distributed load levels by Beacon’s continuous load monitoring (Figure 18(a)). In particular, 26.8% of the time, the MDS experiences a load level (in requests per second) above 75% of its peak processing throughput.
Beacon allows us to further split the requests between systems sharing the MDS, including the TaihuLight forwarding nodes, login nodes, and the ACC. To the surprise of TaihuLight administrators, over 80% of the metadata access workload actually comes from the ACC (Figure 18(b)).
Note that the login nodes and the ACC have their own local file systems (ext4 and GPFS [66], respectively), which users are encouraged to use for purposes such as application compilation and data post-processing/visualization. However, as these users are likely TaihuLight users too, we find that most of them prefer to directly use the main Lustre scratch file system intended for TaihuLight jobs, for convenience. While the I/O bandwidth/IOPS consumed by such tasks is negligible, interactive user activities (such as compilation or post-processing) turn out to be metadata-heavy.
Large waves of such unintended user activity correspond to the heaviest-load periods at the tail of Figure 18(a), and have led to MDS crashes that directly affect applications running on TaihuLight. According to our survey, many other machines, including two of the top 10 supercomputers (Sequoia [83] and Sierra [33]), also use a single MDS, presumably assuming that their users follow similar usage guidelines.
4.4.3 Jobs’ Request Size Analysis.
Figure 19 plots applications by their bandwidth and IOPS; the points form five lines, representing jobs dominated by five request sizes: 1 KB, 16 KB, 64 KB, 128 KB, and 512 KB. Among them, 128 KB for reads and 512 KB for writes are the most common request sizes, matching Icefish’s system configuration. On Sunway compute nodes, applications’ small I/O requests are merged and larger requests are split into multiple requests before being transferred to the forwarding nodes via the LWFS client. We conclude that the average request size of most applications reaches the configured upper limit, implying that this limit could be raised appropriately to let applications obtain better read and write performance. Further statistical analysis reveals that 6.89% of jobs still have an average I/O request size under 1 KB; such small requests reflect inefficient I/O behavior that cannot make good use of the high-performance parallel file system.
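The lines in Figure 19 arise because bandwidth divided by IOPS yields the average request size, so jobs sharing a dominant size fall on one line. A sketch of that inference (names and the snapping helper are our own, illustrative):

```python
def avg_request_size_kb(bandwidth_kb_per_s, iops):
    """Average request size (KB) implied by a job's bandwidth/IOPS ratio;
    jobs with the same dominant request size fall on one line in a
    bandwidth-vs-IOPS plot."""
    return bandwidth_kb_per_s / iops

def nearest_request_class(bandwidth_kb_per_s, iops,
                          classes=(1, 16, 64, 128, 512)):
    """Snap the implied size to the closest of the five common classes."""
    size = avg_request_size_kb(bandwidth_kb_per_s, iops)
    return min(classes, key=lambda c: abs(c - size))
```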
4.5 Extended Applications of Beacon
4.5.1 Extension to Network Monitoring.
Encouraged by Beacon’s success in I/O monitoring, in summer 2018, we began to design and test its extension to monitor and analyze network problems, motivated by the network performance debugging needs of ultra-large-scale applications. Figure 20 shows the architecture of this new module. Beacon samples performance counters, such as per-port sent and received volumes, on the 5984 Mellanox InfiniBand network switches. Again, the collected data are passed to low-overhead daemons on Beacon log servers, more specifically, 75 of its 85 part-time servers, each assigned about 80 switches. Similar processing and compression are conducted, with result data persisting in Beacon’s distributed database and then being periodically relocated to its dedicated server for user queries and permanent storage.
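Two pieces of this pipeline are easy to make concrete: turning successive cumulative port-counter readings into throughput, and spreading the switches across the log servers. The sketch below is illustrative (our own names; we assume 64-bit cumulative counters, not the actual OFED interface):

```python
def port_throughput(prev, curr, interval_s):
    """Convert two successive readings of a port's cumulative byte
    counter into average throughput (bytes/s), tolerating one wrap
    of a 64-bit register."""
    WRAP = 1 << 64
    delta = (curr - prev) % WRAP
    return delta / interval_s

def assign_switches(switches, servers):
    """Round-robin the switch list across log servers (e.g., 5984
    switches over 75 servers yields about 80 each)."""
    groups = {s: [] for s in servers}
    for i, sw in enumerate(switches):
        groups[servers[i % len(servers)]].append(sw)
    return groups
```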
This Beacon network monitoring prototype was tested in time to help in the aforementioned
4.5.2 Extension to the Cutting-edge Supercomputer with I/O Forwarding Architecture.
The Sunway next-generation supercomputer inherits and extends the TaihuLight architecture and is built on a homegrown high-performance heterogeneous many-core processor, the SW26010P. It consists of more than 100,000 compute nodes, each equipped with a 390-core SW26010P CPU. Compared to TaihuLight’s 10 million cores, the new machine has more than four times as many. Figure 22 shows an architecture overview. As in TaihuLight, the compute nodes connect to the storage nodes through forwarding nodes, and the storage nodes run Lustre servers that provide users with a global file system. Unlike TaihuLight, the next-generation machine also provides a burst buffer file system on the forwarding nodes [89]; each forwarding node backs it with two high-performance NVMe SSDs.
To extend Beacon to the Sunway next-generation supercomputer, we upgraded Beacon’s collection module in January 2021 to support data collection on the burst buffer file system; Beacon’s other components run on this machine unchanged. Figure 23 shows an example use case on the next-generation supercomputer. We find that the load on the NVMe SSDs is low most of the time, largely because users use the global file system far more often than the burst buffer file system; further statistical analysis confirms this.2 Although the burst buffer file system can provide jobs with high I/O performance, users must modify their applications with a specific I/O API to use it, which is inconvenient and contributes to its low usage. We also find the load across NVMe SSDs to be imbalanced, an important reason being their control strategy: NVMe SSDs are assigned through static configuration files, and each user can only access the SSDs listed in a configuration file issued by an administrator. Lacking real-time load information, the administrator finds it difficult to balance each SSD’s load.
4.5.3 Extension to the Traditional Two-layer Supercomputer.
In addition to Beacon’s adoption on multi-layer cutting-edge supercomputers, some of Beacon’s components and methods can also be applied to traditional two-layer supercomputers. We have deployed Beacon on the Sugon Pai supercomputer [72], a traditional two-layer machine, since March 2020. Sugon Pai is a homogeneous computing cluster with 424 compute nodes and eight storage nodes, and uses the ParaStor file system to provide highly concurrent I/O. The architecture of Beacon’s monitoring and storage module is shown in Figure 24. Beacon performs I/O monitoring on the compute and storage nodes, which run the ParaStor [73] client and server, respectively. Beacon divides the 424 compute nodes into four groups and enlists four “part-time” servers, each communicating with one group. Data collected from the eight storage nodes are transferred to another “part-time” server, and a MySQL database stores jobs’ run information on Sugon Pai. To reduce data transmission and storage overhead, Beacon also conducts online compression similar to that used on TaihuLight.
Table 6 shows the I/O modes adopted by jobs performing non-trivial I/O on the Sugon Pai supercomputer from March 2020 to April 2020. We reach some similar conclusions, for example, that the N:N and 1:1 I/O modes dominate writes. There are also new findings on Sugon Pai: the N:1 I/O mode accounts for most reads. Further analysis shows that the N:1 mode performs relatively well on Sugon Pai. Figure 25 shows an example of a molecular simulation application using the N:1 mode on Sugon Pai, where high read performance is obtained. A plausible reason is that Sugon Pai’s primary storage system, ParaStor, supports the N:1 mode better than LWFS and Lustre on TaihuLight do. This finding also shows that different platforms support I/O behaviors differently, implying that an application’s I/O behavior should be matched to the underlying platform to achieve good performance.
5 BEACON FRAMEWORK EVALUATION
We now evaluate Beacon’s per-application profiling accuracy and its performance overhead.
5.1 Accuracy Verification
Beacon collects full traces from the compute node side, thus giving it access to complete application-level I/O operation information. However, because the LWFS client trace interface provides only coarse timestamp data (at per-second granularity), and owing to the clock drift across compute nodes, it is possible that the I/O patterns recovered from Beacon logs deviate from the application-level captured records.
To evaluate the degree of such errors, we compare the I/O throughput statistics reported by the
The accuracy evaluation results are shown in Figure 26. We plot the average error in Beacon, measured as the percentage of deviation of the recorded aggregate compute node-side I/O throughput from the application-level throughput reported by the MPI-IO library.
We find that Beacon is able to accurately capture application performance, even for applications with non-trivial parallel I/O activities. More precisely, Beacon’s recorded throughput deviates from the
Beacon’s accuracy can be attributed to the fact that it records all compute node-side trace logs, facilitated by its efficient and lossless compression method (described in Section 3.2). We find that even though individual trace items may be off in timestamps, data-intensive applications on supercomputers seldom perform isolated, fast I/O operations (which are not of interest for profiling purposes). Instead, they exhibit I/O phases with a sustained high I/O intensity. By collecting multi-layer I/O trace entries for each application, Beacon is able to paint an accurate picture of an application’s I/O behavior and performance.
5.2 Monitoring and Query Overhead
We now evaluate Beacon’s monitoring overhead in a production environment. We compare the performance of important I/O-intensive real-world applications and the
These results show that the Beacon tool introduces very low overhead, under 1% across all test cases. Moreover, the overhead does not grow with the application execution scale; it actually appears smaller (below 0.25%) for the two largest jobs, which use 130 K processes or more. Such a cost is particularly negligible considering the significant I/O performance enhancements and run-time savings produced by optimizations and problem diagnoses based on Beacon-supplied information.
Table 8 lists the CPU and memory usage of Beacon’s data collection daemon. In addition, the storage overhead from Beacon’s deployment on TaihuLight since April 2017 is around 10 TB. Such low operational overhead and scalable operation attest to Beacon’s lightweight design, with background trace-collection and compression generating negligible additional resource consumption. Also, having a separate monitoring network and storage avoids potential disturbance to the application execution.
Finally, we assess Beacon’s query processing performance. We measure the query processing time of 2,000 Beacon queries in September 2018, including both application users accessing job performance analysis and system administrators checking forwarding/storage nodes performance. In particular, we examine the impact of Beacon’s in-memory cache system between the web interface and Elasticsearch, as shown in Figure 2. Figure 27 gives the CDF of queries in processing time and demonstrates that (1) the majority of Beacon user queries can be processed within 1 second, and 95.6% of them can be processed under 10 seconds (visualization queries take longer), and (2) Beacon’s in-memory caching significantly improves the user experience. Additional checking reveals that about 95% of these queries can be served from cached data.
6 RELATED WORK
Several I/O tracing and profiling tools have been proposed for HPC systems, which can be divided into two categories: application-oriented tools and back-end-oriented tools.
Application-oriented tools can provide detailed information about a particular execution on a function-by-function basis. Work in this area includes Darshan [9], IPM [79], and RIOT [86], all of which aim to build an accurate picture of application I/O behavior by capturing key characteristics of the mainstream I/O stack on compute nodes. Carns et al. evaluated the performance and runtime overheads of Darshan [8], and Patel et al. used Darshan to characterize and analyze accesses to I/O-intensive files [60]. Wu et al. proposed a scalable methodology for MPI and I/O event tracing [48, 87, 88]. Recorder [46] focuses on collecting additional HDF5 trace data. Tools like Darshan provide user-transparent monitoring via automatic environment configuration; still, instrumentation-based tools impose restrictions on programming languages or libraries/linkers. In contrast, Beacon is designed as a non-stop, full-system I/O monitoring system capturing I/O activities at the system level.
Back-end-oriented tools collect system-level I/O performance data across applications and provide summary statistics (e.g., LIOProf [91], LustreDU [7, 38, 56], and LMT [24]). Neeraj et al. [64] aimed to provide applications and middleware with real-time system resource status, while Patel et al. [59] focused on system-level characteristics using LMT. Paul et al. [61] likewise analyzed Lustre server statistics in an application-agnostic manner.
However, identifying application performance issues and finding the cause of application performance degradation are difficult with these tools. While back-end analytical methods [40, 41] have made progress in identifying high-throughput applications using back-end logs only, they lack application-side information. Beacon, in contrast, holds complete cross-layer monitoring data to enable such tasks.
Along this line, there are tools that collect multi-layer data. Static instrumentation has been used to trace parallel I/O calls from MPI down to PVFS servers [35]. SIOX [85] and IOPin [34] characterize HPC I/O workloads across the I/O stack. These projects extend the application-level I/O instrumentation approach of Darshan [9] to other system layers; however, their overhead hinders deployment in large-scale production environments [70].
Regarding end-to-end frameworks, the TOKIO [3] architecture combines front-end tools (Darshan, Recorder) and back-end tools (LMT). The UMAMI monitoring interface [43] provides cross-layer I/O performance analysis and visualization. In addition, OVIS [5] uses the Cray specific tool LDMS [1] to provide scalable failure and anomaly detection. GUIDE [80] performs center-wide and multi-source log collection and motivated further analysis and optimizations. Beacon differs through its aggressive real-time performance and utilization monitoring, automatic anomaly detection, and continuous per-application I/O pattern profiling.
I/O interference is identified as an important cause of performance variability in HPC systems [41, 57]. Fang et al. [96] uncovered interference in an in situ analytics system. Kuo et al. [37] focused on interference from different file access patterns using synchronized time-slice profiles. Yildiz et al. [92] studied the root causes of cross-application I/O interference across software and hardware configurations. To the best of our knowledge, Beacon is the first monitoring framework with built-in features for inter-application interference analysis. Our study confirms findings on large-scale HPC applications’ adoption of poor I/O design choices [47], further suggesting that aside from workload-dependent, I/O-aware scheduling [14, 41], interference should be countered with application I/O mode optimization and adaptive I/O resource allocation.
Finally, on network monitoring, there are dedicated tools [42, 50, 68] for monitoring switch performance, anomaly detection, and resource utilization optimization. There are also tools specializing in network monitoring/debugging for data centers [75, 76, 94]. However, these tools/systems typically do not target the InfiniBand interconnections commonly used on supercomputers. To this end, Beacon adopts the open-source OFED stack [11, 55] to retrieve relevant information from the IB network. More importantly, it leverages its scalable and efficient monitoring infrastructure, originally designed for I/O, for network problems.
7 CONCLUSION
We have presented Beacon, an end-to-end I/O resource monitoring and diagnosis system for the leading supercomputer TaihuLight. It facilitates comprehensive I/O behavior analysis along the long I/O path and has identified hidden performance and user I/O behavior issues as well as system anomalies. Enhancements enabled by Beacon in the past 38 months have significantly improved ultra-large-scale applications’ I/O performance and the overall TaihuLight I/O resource utilization. More generally, our results and experience indicate that this type of detailed multi-layer I/O monitoring/profiling is affordable on state-of-the-art supercomputers, offering valuable insights at low cost. In addition, we have explored the public release of Beacon-collected supercomputer I/O profiling data to the HPC and storage communities.
Our future work will focus on cross-layer application I/O portraits, as well as automated I/O scheduling, resource allocation, and optimization via real-time interaction with Beacon.
Footnotes
1 Github link: https://github.com/Beaconsys/Beacon.
2 More than 90% of jobs run on the global file system.
REFERENCES
- [1] 2014. The lightweight distributed metric service: A scalable infrastructure for continuous monitoring of large scale computing systems and applications. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 154–165.
- [2] 2009. PLFS: A checkpoint filesystem for parallel applications. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Portland, 1–12.
- [3] 2017. TOKIO: Total knowledge of I/O. http://www.nersc.gov/research-and-development/tokio.
- [4] 2002. Lustre: A scalable, high performance file system. Cluster File Systems, Inc 8, 11 (2002), 3429–3441.
- [5] 2009. Resource monitoring and management with OVIS to enable HPC in cloud computing environments. In International Symposium on Parallel and Distributed Processing. IEEE, Rome, 1–8.
- [6] 2010. Blue Gene/Q resource management architecture. In Workshop on Many-Task Computing on Grids and Supercomputers. IEEE, New Orleans, 1–5.
- [7] 2012. Practical support solutions for a workflow-oriented Cray environment. In Cray User Group Conference. Cray, Stuttgart, 1–7.
- [8] 2012. Performance Analysis of Darshan 2.2.3 on the Cray XE6 Platform. Technical Report. Argonne National Lab. (ANL), Argonne, IL (United States).
- [9] 2009. 24/7 characterization of Petascale I/O workloads. In International Conference on Cluster Computing and Workshops. IEEE, New Orleans, 1–10.
- [10] 2017. Cray burst buffer in Cori. https://docs.nersc.gov/filesystems/cori-burst-buffer/.
- [11] 2011. INAM: A scalable InfiniBand network analysis and monitoring tool. In European Conference on Parallel Processing. Springer, Bordeaux, 166–177.
- [12] 2017. LogAider: A tool for mining potential correlations of HPC log events. In International Symposium on Cluster, Cloud and Grid Computing. IEEE, Madrid, 442–451.
- [13] 2014. Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiments analysis. In Journal of Physics: Conference Series. IOP Publishing, Yokohama, 042014.
- [14] 2014. CALCioM: Mitigating I/O interference in HPC systems through cross-application coordination. In International Parallel and Distributed Processing Symposium. IEEE, Phoenix, 155–164.
- [15] 2018. Redesigning LAMMPS for Petascale and hundred-billion-atom simulation on Sunway TaihuLight. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Dallas, 148–159.
- [16] 2013. MySQL. Addison-Wesley Professional, Boston.
- [17] 1993. File systems in user space. In USENIX Winter. 229–240.
- [18] 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD. AAAI, Portland, 226–231.
- [19] 2008. Python Web Development with Django. Addison-Wesley Professional.
- [20] 2017. 9-pflops nonlinear earthquake simulation on Sunway TaihuLight: Enabling depiction of 18-Hz and 8-meter scenarios. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
- [21] 2017. Redesigning CAM-SE for Peta-scale climate modeling performance and ultra-high resolution on Sunway TaihuLight. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
- [22] 2016. The Sunway TaihuLight supercomputer: System and applications. Science China Information Sciences 59 (2016), 1–16.
- [23] 2015. Scheduling the I/O of HPC applications under congestion. In International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, 1013–1022.
- [24] 2010. Lustre monitoring tool. https://github.com/LLNL/lmt.
- [25] 2010. DNDC: A process-based model of greenhouse gas fluxes from agricultural soils. Agriculture, Ecosystems & Environment 136 (2010), 292–300.
- [26] 2008. LANL MPI-IO test. http://freshmeat.sourceforge.net/projects/mpiiotest.
- [27] 2015. Comparative I/O workload characterization of two leadership class storage clusters. In Proceedings of the Parallel Data Storage Workshop. IEEE, Austin, 31–36.
- [28] 2018. Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS) 14 (2018), 1–26.
- [29] 2020. Top 500 list. https://www.top500.org/resources/top-systems/.
- [30] 2008. Intrepid. https://www.alcf.anl.gov/intrepid.
- [31] 2019. Automatic, application-aware I/O forwarding resource allocation. In Conference on File and Storage Technologies. USENIX, Boston, 265–279.
- [32] 2015. Quiet neighborhoods: Key to protect job performance predictability. In International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, 449–459.
- [33] 2019. 2.1 Summit and Sierra: Designing AI/HPC supercomputers. In International Solid-State Circuits Conference. IEEE, San Francisco, 42–43.
- [34] 2012. IOPin: Runtime profiling of parallel I/O in HPC systems. In Companion: High Performance Computing, Networking Storage and Analysis. IEEE, Salt Lake City, 18–23.
- [35] 2010. Automated tracing of I/O stack. In European MPI Users’ Group Meeting. Springer, Stuttgart, 72–81.
- [36] 2013. Elasticsearch Server. Packt Publishing Ltd, Birmingham.
- [37] 2014. How file access patterns influence interference among cluster applications. In International Conference on Cluster Computing. IEEE, Madrid, 185–193.
- [38] 2017. Scientific user behavior and data-sharing trends in a Petascale file system. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
- [39] 2018. ShenTu: Processing multi-trillion edge graphs on millions of cores in seconds. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Dallas, 706–716.
- [40] 2014. Automatic identification of application I/O signatures from noisy server-side traces. In Conference on File and Storage Technologies. USENIX, Oakland, 213–228.
- [41] 2016. Server-side log data analytics for I/O workload characterization and coordination on large shared storage systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 819–829.
- [42] 2016. One sketch to rule them all: Rethinking network flow monitoring with UnivMon. In Proceedings of the 2016 ACM SIGCOMM Conference. ACM, Los Angeles, 101–114.
- [43] 2017. UMAMI: A recipe for generating meaningful metrics through holistic I/O performance analysis. In Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems. ACM, Denver, 55–60.
- [44] 2016. DAOS and friends: A proposal for an Exascale storage system. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 585–596.
- [45] 2010. Managing variability in the IO performance of Petascale storage systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 1–12.
- [46] 2013. A multi-level approach for understanding I/O activity in HPC applications. In International Conference on Cluster Computing. IEEE, Indianapolis, 1–5.
- [47] 2015. A multiplatform study of I/O behavior on Petascale supercomputers. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Portland, 33–44.
- [48] 2010. ScalaTrace: Tracing, analysis and modeling of HPC codes at scale. In International Workshop on Applied Parallel Computing. Springer, Reykjavík, 410–418.
- [49] 2021. EZIOTracer: Unifying kernel and user space I/O tracing for data-intensive applications. In Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems. ACM, Edinburgh, 1–11.
- [50] 2017. Demonstration of the Marple system for network performance monitoring. In Proceedings of the SIGCOMM Posters and Demos. ACM, Los Angeles, 57–59.
- [51] 2014. Edison. https://www.top500.org/system/178443/.
- [52] 2016. Using balanced data placement to address I/O contention in production environments. In International Symposium on Computer Architecture and High Performance Computing. IEEE, Los Angeles, 9–17.
- [53] 2014. Piz Daint Supercomputer Shows the Way Ahead on Efficiency.
- [54] 2009. ScalaTrace: Scalable compression and replay of communication traces for high-performance computing. J. Parallel and Distrib. Comput. 69 (2009), 696–710.
- [55] 2010. OpenFabrics enterprise distribution (OFED). http://www.openfabrics.org/.
- [56] 2014. Best practices and lessons learned from deploying and operating large-scale data-centric parallel file systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 217–228.
- [57] 2015. Achieving performance isolation with lightweight co-kernels. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Portland, 149–160.
- [58] 2013. Mira: Argonne’s 10-Petaflops Supercomputer. Technical Report. Argonne National Laboratory (ANL), Argonne, IL (United States).
- [59] 2019. Revisiting I/O behavior in large-scale storage systems: The expected and the unexpected. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–13.
- [60] 2020. Uncovering access, reuse, and sharing characteristics of I/O-intensive files on large-scale production HPC systems. In Conference on File and Storage Technologies. USENIX, Santa Clara, 91–101.
- [61] 2020. Understanding HPC application I/O behavior using system level statistics. In International Conference on High Performance Computing, Data, and Analytics. IEEE, Pune, 202–211.
- [62] 2017. I/O load balancing for big data HPC applications. In International Conference on Big Data. IEEE, Boston, 233–242.
- [63] 2020. Taming I/O variation on QoS-less HPC storage: What can applications do? In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Pune, 1–13.
- [64] 2021. Apollo: An ML-assisted real-time storage resource observer. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Stockholm, 147–159.
- [65] 2016. Redis. http://redis.io/topics/faq.
- [66] 2002. GPFS: A shared-disk file system for large computing clusters. In Conference on File and Storage Technologies. USENIX, Monterey, 1–15.
- [67] 2001. Impact of a failure detection mechanism on the performance of consensus. In Pacific Rim International Symposium on Dependable Computing. IEEE, Seoul, 137–145.
- [68] 2012. DECOR: A distributed coordinated resource monitoring system. In International Workshop on Quality of Service. IEEE, Coimbra, 1–9.
- [69] 2005. A Description of the Advanced Research WRF Version 2. Technical Report. National Center for Atmospheric Research, Boulder, CO, Mesoscale and Microscale.
- [70] 2016. Modular HPC I/O characterization with Darshan. In Workshop on Extreme-scale Programming Tools. IEEE, Salt Lake City, 9–17.
- [71] 2011. Server-side I/O coordination for parallel file systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Seattle, 1–11.
- [72] 2018. Pai system, Sugon. https://www.top500.org/system/179425/.
- [73] 2015. ParaStor200 Distributed Parallel Storage System. http://hpc.sugon.com/en/HPC-Components/parastor.html.
- [74] 2004. Cluster-based failure detection service for large-scale ad hoc wireless network applications. In International Conference on Dependable Systems and Networks. IEEE, Florence, 805–814.
- [75] 2016. Simplifying datacenter network debugging with pathdump. In Symposium on Operating Systems Design and Implementation. USENIX, Savannah, 233–248.
- [76] 2018. Distributed network monitoring and debugging with SwitchPointer. In Symposium on Networked Systems Design and Implementation. USENIX, Renton, 453–456.
- [77] 2012. Extracting flexible, replayable models from large block traces. In Conference on File and Storage Technologies. USENIX, San Jose, 22.
- [78] 2013. The Logstash Book. James Turnbull.
- [79] 2010. Parallel I/O performance: From events to ensembles. In International Symposium on Parallel and Distributed Processing. IEEE, Atlanta, 1–11.
- [80] 2017. GUIDE: A scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
- [81] 2009. Scalable I/O tracing and analysis. In Annual Workshop on Petascale Data Storage. IEEE, Portland, 26–31.
- [82] 2010. Accelerating I/O forwarding in IBM Blue Gene/P systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 1–10.
- [83] 2012. BlueGene/Q Sequoia and Mira. Technical Report. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States).
- [84] 2017. The accurate particle tracer code. Computer Physics Communications 220 (2017), 212–229.
- [85] 2013. Towards I/O analysis of HPC systems and a generic architecture to collect access patterns. Computer Science-Research and Development 28 (2013), 241–251.
- [86] 2013. Parallel file system analysis through application I/O tracing. Comput. J. 56, 2 (2013), 141–155.
- [87] 2013. Elastic and scalable tracing and accurate replay of non-deterministic events. In International Conference on Supercomputing. ACM, Eugene, 59–68.
- [88] 2011. Probabilistic communication and I/O tracing with deterministic replay at scale. In International Conference on Parallel Processing. IEEE, Taipei, 196–205.
- [89] 2021. Symplectic structure-preserving particle-in-cell whole-volume simulation of tokamak plasmas to 111.3 trillion particles and 25.7 billion grids. In International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, New York, 1–13.
- [90] 2012. Characterizing output bottlenecks in a supercomputer. In International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 1–11.
- [91] 2016. LIOProf: Exposing Lustre file system behavior for I/O middleware. In Cray User Group Meeting. Cray, London, 1–9.
- [92] 2016. On the root causes of cross-application I/O interference in HPC storage systems. In International Parallel and Distributed Processing Symposium. IEEE, Chicago, 750–759.
- [93] 2020. Vuejs framework. https://vuejs.org.
- [94] 2011. Profiling network performance for multi-tier data center applications. In Symposium on Networked Systems Design and Implementation. USENIX, Boston, 5–5.
- [95] 2008. Performance characterization and optimization of parallel I/O on the Cray XT. In International Symposium on Parallel and Distributed Processing. IEEE, Sydney, 1–11.
- [96] 2013. GoldRush: Resource efficient in situ scientific data analytics using fine-grained interference aware execution. In International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.