
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads

Published: 14 December 2023


Abstract

Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method, oblivious to the memory subsystem, to gauge the upper bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. Through a large volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56× for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.


1 INTRODUCTION

Historically, the reliable performance increase of von Neumann-based general-purpose processors (CPUs) was driven by two technological trends. The first, observed by Gordon E. Moore [76], is that the number of transistors in an integrated circuit doubles roughly every two years. The second, called Dennard’s scaling [30], postulates that as transistors get smaller, their power density stays constant. These trends synergized well, allowing computer architectures to continuously improve performance through, for example, aggressive pipelining and superscalar techniques, while avoiding thermal limitations by, e.g., reducing the operating voltage. In the early 2000s, Dennard’s scaling ended [51] and forced architects to shift their attention from improving instruction-level parallelism to exploiting on-chip multiple-instruction multiple-data parallelism [43]. This immediate remedy to the end of Dennard’s scaling applies to this day in the form of processors such as Fujitsu A64FX [96], AMD Ryzen [105], or NVIDIA GPUs [79, 86].

Unfortunately, the termination of Moore’s law is impending [107], and we are entering a post-Moore era [112], home to a diversity of architectures, such as quantum, neuromorphic, or reconfigurable computing [49]. Many of these prototypes hold promise but are still immature, focus on a niche use case, or incur long development cycles. However, there is one salient solution that is growing in maturity and which can facilitate performance improvements in the decades to come even for the classic von Neumann CPUs we have come to rely upon—3D integrated circuit (IC) stacking [14]. 3D IC refers to a family of technologies for building integrated circuits vertically, which can be realized in multiple ways, such as by stacking multiple discrete dies and connecting them with coarse through-silicon vias (TSVs), or by growing the 3D integrated circuit monolithically on the wafer [100].

Recent advances in 3D integrated circuits have enabled many times higher capacity for on-chip memory (caches) than traditional systems (e.g., AMD V-Cache [40]). Intuition tells us that an increased cache size, resulting from 3D-stacking, will help alleviate the performance bottlenecks of key scientific applications. To demonstrate this, we conduct a pilot study where we execute one of the important proxy-apps from the DoE Exascale Computing Project (ECP) suite, MiniFE [50] (cf. Section 3.3), on AMD EPYC Milan and Milan-X CPUs—two architecturally similar processors with vastly different L3 cache sizes [17]. Figure 1 summarizes the results of this pilot study: for a subset of problem sizes, in particular the 160 × 160 × 160 input, the three times larger L3 capacity of Milan-X yields up to 3.4× improvement over the baseline Milan for this memory-bound application, which motivates us to further research 3D-stacked caches.

Fig. 1.

Fig. 1. MiniFE: relative performance improvement of AMD EPYC 7773X Milan-X over 7763 Milan (for details cf. Table 1), and Figure of Merit; Input problem scaled from 100 × 100 × 100 to 400 × 400 × 400; Benchmarks executed with 16 MPI ranks and 8 OpenMP threads.

3D integrated circuits have various benefits [52], including (i) shorter wire lengths in the interconnect leading to reduced power consumption, (ii) improved memory bandwidth through on-chip integration that can alleviate performance bottlenecks in memory-bound applications, (iii) higher package density yielding more compute and smaller system footprint, and (iv) possibly lower fabrication cost due to smaller die size (thus improved yield). All these are very desirable benefits in today’s exascale (and future) High-Performance Computing (HPC) systems. But how far can 3D ICs (with a focus on increased on-chip cache) take us in HPC?

Table 1.
                          7763 Milan        7773X Milan-X
Sockets                   2                 2
CPU config. per socket:
  Cores                   64                64
  CCDs                    8                 8
  Freq.                   2.45 GHz          2.20 GHz
  TDP                     280 W             280 W
  L3                      256 MiB           768 MiB
Cache per core:
  L2                      512 KiB           512 KiB
  L1 I+D                  32+32 KiB         32+32 KiB
Memory                    1 TiB DDR4, 16 channels, 409.6 GB/s

Table 1. Systems Configuration for the Benchmarked AMD EPYC 7763 Milan and 7773X Milan-X (for More Details: See Zen 3 Microarch)

Contributions: We study our research questions from three different levels of abstraction: (i) we design a novel exploration framework that allows us to simulate HPC applications running on a hypothetical processor with an infinitely large L1D cache; we use this framework, which is orders of magnitude faster than cycle-accurate simulators, to estimate an upper bound for cache-based improvements; (ii) we model a hypothetical LARge Cache processor (LARC) that builds on the design of A64FX, with a last-level cache (LLC) composed of eight stacked SRAM dies under a 1.5 nm manufacturing assumption; and (iii) we complement our study with a plethora of simulations of HPC proxy-applications and CPU micro-benchmarks. Lastly, (iv) we find that over half (31 out of 52) of the simulated applications experience a \(\ge \,2\times\) speedup on LARC’s Core Memory Group (CMG), which occupies only one fourth of the area of the baseline A64FX CMG. For applications that are responsive to larger cache capacity, this would translate to an average improvement of 9.56× (geometric mean) when we assume ideal scaling and compare at the full chip level.

The novelty in this paper lies in the purpose which LARC serves, and not the design of LARC itself. As Figure 2 shows, the capacity (and bandwidth; not shown) of the LLC has increased at a moderately gradual slope over the last two decades—with Milan-X being a noticeable outlier in per-core LLC. However, we are querying the effect on HPC applications of an LLC that is an order of magnitude above the trend line depicted in Figure 2. On top of our provided baseline, further application-specific restructuring to utilize large caches [69] would result in even greater benefit.

Fig. 2.

Fig. 2. A sample of representative server-grade CPUs of each generational micro-architecture in comparison to our study of LARC; Left: total on-chip last-level cache (in GiB); Right: per-core last-level cache (in MiB) for the same CPUs; The two LARC variants will be discussed in detail in Section 5.1.


2 CPUS EMPOWERED WITH HIGH-CAPACITY CACHE: THE FUTURE OF HPC?

The memory bandwidth of modern systems has been the bottleneck (the “memory wall” [71]) ever since CPU performance started to outgrow the bandwidth of memory subsystems in the early 1990s [70]. Today, this trend continues to shape the performance optimization landscape in high-performance computing [83, 85]. Diverse memory technologies are emerging to overcome said data movement bottleneck, such as Processing-in-Memory (PIM) [12], 3D-stackable High-Bandwidth Memory (HBM) [74], deeper (and more complex) memory hierarchies [115], and—the topic of the present paper—novel 3D-stacked caches [14, 68, 98].

In this study, our aspiration is to gauge the far end of processor technology and how it may evolve six to eight years from now, circa 2028, when processors using 1.5 nm technology are expected to be available according to the IEEE IRDS Roadmap [53, Figure ES9]. More specifically, as 3D-stacked SRAM memory [120] becomes more common, what are the performance implications for common HPC workloads, and what new challenges lie ahead for the community? However, before attempting to understand what performance may look like six years from now, we must describe how the processor itself might change. In this section, we introduce, motivate, and reason about our design choices for what we envision as a hypothetical CPU that capitalizes on large-capacity 3D-stacked cache, called LARC (LARge Cache processor) for short. Before looking at LARC, we must first set and analyze a baseline processor.

2.1 LARC’s Baseline: The A64FX Processor

We choose to base our future CPU design on the A64FX [118]. Fujitsu’s Arm-based A64FX is powering Supercomputer Fugaku [96], leader of the HPCG (TOP500 [104]; cf. Section 3.3) and Graph500 performance charts. A64FX is manufactured in 7 nm technology and has a total of 52 Arm cores (with Scalable Vector Extensions [103]) distributed across four compute clusters, called Core Memory Groups (CMGs). In each CMG, twelve cores are available to the user, and one core is exclusively used for management. Each core has local 64 KiB instruction- and data-caches, and is capable of delivering 70.4 Gflop/s (IEEE-754 double-precision) performance—accumulated: 845 Gflop/s per CMG (user cores) or 3.4 Tflop/s for the entire chip. Each CMG contains an 8 MiB L2 cache slice, delivering over 900 GB/s bandwidth to the CMG [118]. The combined L2 cache, which is the CPU’s 32 MiB last-level cache (LLC), is kept coherent through a ring interconnect that connects the four CMGs. Inside the CMG, a crossbar switch is used to connect the cores and the L2 slice. The L2 cache has 16-way set associativity and a line size of 256 bytes, and the bus width between the L1 and L2 cache is 128 bytes (read) and 64 bytes (write).

We emphasize that our aim is not to propose a successor of A64FX, nor are we particularly restricting our vision by the design constraints of A64FX (e.g., power budget). However, we build our design on A64FX because: (i) as mentioned above, A64FX represents the high end in performance for commercially available CPUs, so it is a logical starting point; (ii) A64FX is the only commercially available CPU, currently in continued production, with HBM, and the expected bandwidth ratio between future HBM and future 3D-stacked caches is similar to the ratio between traditional DRAM and LLC bandwidths [80], which is what applications and performance models are accustomed to; and (iii) the A64FX LLC design (particularly the L2 slices connected by a crossbar switch) happens to be convenient, and thus requires minimal effort to extend in a simulated environment.

In conclusion, while we extend the A64FX architecture, our workflow itself can be generalized to cover any of the processors supported by CPU simulators (e.g., variants of gem5 [13] can simulate other architectures, including x86).

2.2 Floorplan Analysis for Fujitsu A64FX

In order to estimate the floorplan of the future LARC processor built on 1.5 nm technology, we first need the floorplan of the current A64FX processor built in 7 nm. We know that the die size of A64FX is \({\approx }400\text{mm}^{2}\) [96]. Using openly available die shots with the processor core segments highlighted [82], we can estimate most of the A64FX floorplan, including the size of CMGs and processor cores, as shown in Figure 3. Overall, each CMG is \({\approx }48\text{mm}^{2}\) in area, of which an A64FX core occupies \({\approx }2.25\text{mm}^{2}\). The remaining parts of the CMG consist of the L2 cache slice and controller as well as the interconnect for intra-CMG communication.

Fig. 3.

Fig. 3. Difference between A64FX’s Core Memory Group (CMG) and a LARC CMG in various performance-governing parameters; Most notable (for our study) is the 48× increase in per-CMG L2 cache capacity; Note: despite appearing similar in the figure, the LARC CMG is, in fact, four times smaller.

2.3 From A64FX’s to LARC’s CMG Layout

Knowing the floorplan, we proceed to describe how we envision the CMG design in 1.5 nm technology. We scale the CMG by moving four generations, from 7 nm to 1.5 nm, and reduce the silicon footprint by around 8× (\({\approx }\,\text{1.7$\times $}\) per generation) for the entire CMG [39]. The new CMG consumes as little as 6 mm2 of silicon area. Next, we reclaim the area currently occupied by the L2 cache and controller and replace it with three additional CPU cores, yielding a total of 16. Further, in line with the projected year 2019\(\rightarrow\)2028 growth in the number of cores [54, Table SA-1], we double the core count of the CMG to 32, which leads to it occupying \({\approx }\ 12 \text{mm}^{2}\) of silicon area. We pessimistically leave the interconnect area unchanged and continue to use it as the primary means for communication. We call this new variant LARC’s CMG. Finally, we assume the same die size, and hence LARC would have 16 CMGs, each with 32 cores, in comparison to A64FX’s 4 CMGs with 12+1 cores each. For LARC, we ignore the management core. However, our performance analysis will remain at the CMG level, instead of the full chip, due to limitations we detail in Section 3.2.
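To make this area arithmetic easy to retrace, the following minimal Python sketch reproduces the scaling chain above; the ≈48 mm2 CMG area and the ≈1.7× per-generation shrink are the values assumed in the text, and the printed figures are approximations rather than design data.

```python
# Minimal sketch of the CMG area scaling from 7 nm to 1.5 nm (Section 2.3).
# Inputs are the approximate values quoted in the text.

CMG_AREA_7NM   = 48.0   # mm^2, estimated A64FX CMG (12 cores + L2 + interconnect)
GENERATIONS    = 4      # 7 nm -> 1.5 nm in four technology generations
SHRINK_PER_GEN = 1.7    # assumed area reduction per generation

total_shrink = SHRINK_PER_GEN ** GENERATIONS      # ~8.4x, "around 8x" in the text
cmg_16core   = CMG_AREA_7NM / total_shrink        # ~5.7 mm^2; the reclaimed L2 area
                                                  # hosts 3 extra cores -> 16 cores
cmg_32core   = 2 * cmg_16core                     # ~11.5 mm^2 after doubling cores

print(f"area shrink 7 nm -> 1.5 nm: {total_shrink:.1f}x")
print(f"16-core LARC CMG: {cmg_16core:.1f} mm^2 (~6 mm^2 in the text)")
print(f"32-core LARC CMG: {cmg_32core:.1f} mm^2 (~12 mm^2 in the text)")
print(f"CMGs fitting into A64FX's 4 x 48 mm^2 of CMG area: {int(4 * CMG_AREA_7NM / cmg_32core)}")
```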

2.4 LARC’s Vertically Stacked Cache

In the above design, we removed the L2 cache and controller from the CMG of LARC. We now assume that the L2 cache can be placed directly on top of the CMG through 3D stacking [68]. We build our estimations on experiments from Shiba et al. [98], who demonstrated the feasibility of stacking up to eight SRAM dies on top of a processor using a ThruChip Interface (TCI). The capacity and bandwidth of stacked memory is a function of several parameters: the number of channels available (\(N_\text{ch}\)), the per-channel capacity (\(N_\text{cap}\) in KiB), their width (W in bytes), the number of stacked dies (\(N_\text{dies}\)), and the operating frequency (\(f_\text{clk}\) in GHz). Shiba et al. [98] estimated that at a 10 nm process technology, eight stacks would provide \({\approx }\ 512 \text{MiB}\) of aggregated SRAM capacity for a footprint of \({\approx }\ 121 \text{mm}^{2}\). In their design, each stack has 128 channels of 512 KiB capacity. In our work, we conservatively assume an 8× density scaling from 10 nm to 1.5 nm, and thus, at 12 mm2 area (the size of one LARC CMG, i.e., roughly one tenth of their footprint), \(N_\text{ch}\) on each die would be \({\approx }\,\text{102}\) (= 128 × 8 / 10).

We round \(N_\text{ch}\) down to a nearby sum of powers of two, viz., \(N_\text{ch}=\text{96}\) (= 64 + 32). Thus, with eight stacked dies (\(N_\text{dies}=\text{8}\)), our 3D SRAM cache has a total storage capacity of \(N_\text{dies} \cdot N_\text{ch} \cdot N_\text{cap}= 384\ \text{MiB}\) per CMG. We estimate the bandwidth in a similar way. We know from previous studies [98] that 3D-stacked SRAM, built on 40 nm technology, can operate at 300 MHz. We conservatively expect the same SRAM to operate at (\(f_\text{clk}\)=)1 GHz when moving from 40 nm\(\rightarrow\)1.5 nm. To account for the increased working set size of future applications, we assume a channel width (W) of 16 bytes, compared to the 4-byte width assumed in [98]. With this, the CMG bandwidth becomes: \(N_\text{ch} \cdot f_\text{clk} \cdot W = 1536\ \text{GB/s}\). The read and write latency of their SRAM cache is 3 cycles, including the vertical data movement overhead [98].
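The per-CMG capacity and bandwidth follow directly from these parameters; the short sketch below (same assumptions as above, including the 8× density scaling and the 1 GHz operating frequency) reproduces the two headline numbers.

```python
# Per-CMG 3D SRAM cache capacity and bandwidth (Section 2.4), using the
# parameters assumed in the text.

N_DIES = 8      # stacked SRAM dies
N_CH   = 96     # channels per die (rounded down from ~102 after area scaling)
N_CAP  = 512    # KiB per channel
F_CLK  = 1.0    # GHz, assumed operating frequency at 1.5 nm
WIDTH  = 16     # bytes per channel transfer (widened from the original 4 B)

capacity_mib = N_DIES * N_CH * N_CAP / 1024   # = 384 MiB per CMG
bandwidth_gb = N_CH * F_CLK * WIDTH           # = 1536 GB/s per CMG

print(f"per-CMG L2 capacity:  {capacity_mib:.0f} MiB")
print(f"per-CMG L2 bandwidth: {bandwidth_gb:.0f} GB/s")
```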

While stacked DRAM caches theoretically provide higher capacity than stacked SRAM caches, they have limitations. For example, the latency of stacked DRAM is only about 50% lower than that of DDR3 DRAM, and hence stacked DRAM caches exacerbate miss latency; they require refresh operations, which consume energy and reduce availability; and due to their large size, stacked DRAM caches require special techniques for managing metadata and avoiding bandwidth bloat [23, 74]. The aggregate tag size of a stacked DRAM cache may exceed the on-chip SRAM LLC capacity, and hence the tags may need to be stored in the DRAM itself, which worsens hit latency. Set-associative designs and serial tag-data accesses further increase hit latency. Proposed architectural techniques and mitigation strategies, such as the Loh-Hill cache [67], have yet to solve these problems. By contrast, 3D SRAM caches do not suffer from any of these issues. In fact, at iso-capacity, a 3D SRAM cache has even lower access latency than a 2D SRAM cache. Since stacked 3D SRAM caches have lower capacity than stacked DRAM, their metadata (e.g., tags) can easily be stored in the SRAM itself, further reducing the access latency.

For our cache design, we assume a 256 B cache block, which avoids bandwidth bloat. Each tag takes 6 B, and as such, the total tag array size for each CMG becomes 9 MiB. This tag array can easily be placed in the cache itself. We assume that tag and data accesses happen sequentially. The tags and data of a cache set are stored on a single die; hence, on every access, only one die needs to be activated. Since this takes only a few cycles, the overall miss penalty remains small and comparable to that of A64FX’s LLC.
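As a quick consistency check of the tag-array sizing (using the block and tag sizes just stated): \(\frac{384\ \text{MiB}}{256\ \text{B/block}} = 1.5\ \text{Mi blocks}\), and \(1.5\ \text{Mi blocks} \cdot 6\ \text{B/tag} = 9\ \text{MiB}\) of tag storage per CMG, i.e., roughly 2.3% of the 384 MiB data capacity.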

To show that our cache projections are realistic, we compare them with AMD’s 3D V-Cache design. It uses a single stacked die for the L3 cache, providing 64 MiB capacity (in addition to the 32 MiB cache in the base die) at 7 nm [26, 40], with only 3 to 4 cycles of extra latency compared to the non-stacked version [21]. It has a 36 mm2 footprint and a bandwidth of 2 TB/s. When stacking additional dies on top, and assuming an 8× area scaling when going from 7 nm to 1.5 nm, we speculate that the LLC capacity of this commercial processor could easily exceed that of our proposed LARC.

2.5 LARC’s Core Memory Group (CMG)

At last, we detail our experimental CMG built on a hypothetical 1.5 nm technology: the LARC CMG. An illustration of this system is shown in Figure 3. Each CMG consists of 32 A64FX-like cores, which retain A64FX’s 64 KiB L1 instruction- and data-caches, yielding a per-CMG performance of \(\approx \,\)2.3 Tflop/s (IEEE-754 double-precision). A 384 MiB L2 cache is stacked vertically on top of the CMG through eight SRAM layers.

We keep the HBM memory bandwidth per CMG to its current A64FX value of 256 GB/s to be able to quantify performance improvements from the proposed large capacity 3D cache in isolation from any improvements that would come from increased HBM bandwidth. Furthermore, we make no assumption on the technology scaling of blocks that contain hard-to-scale-down analog components (e.g., TofuD or PCIe IP blocks) and instead focus exclusively on scaling the CMG-part of the System-on-Chip (i.e., processing cores, L1/L2 caches, and intra-chip interconnects).

While our study focuses on evaluating a single CMG, we conclude that a complete, hypothetical LARC CPU with a die size similar to the current A64FX would contain 512 processing cores, 6 GiB of stacked L2 cache, a peak L2 bandwidth of 24.6 TB/s, a peak HBM bandwidth of 4.1 TB/s, and a total of 36 Tflop/s of raw, double-precision compute. The A64FX processor has a peak HBM bandwidth of 1 TB/s, whereas our envisioned LARC CPU has 4× more CMGs and hence a peak HBM bandwidth of 4.1 TB/s; thus, compared to A64FX, LARC has a higher effective external-memory bandwidth. Further changes to the HBM generation are beyond the scope of this study.
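The chip-level figures above are simple aggregations of the per-CMG values; the sketch below restates that arithmetic (16 CMGs, per-CMG values from Sections 2.4 and 2.5), purely as a consistency check.

```python
# Full-chip LARC aggregates from the per-CMG figures (Sections 2.4-2.5).

CMGS           = 16
CORES_PER_CMG  = 32
FLOPS_PER_CORE = 70.4    # Gflop/s, FP64, A64FX-like core
L2_PER_CMG     = 384     # MiB of stacked L2
L2_BW_PER_CMG  = 1536    # GB/s
HBM_BW_PER_CMG = 256     # GB/s, kept at the A64FX value

print(f"cores:     {CMGS * CORES_PER_CMG}")                                       # 512
print(f"L2 cache:  {CMGS * L2_PER_CMG / 1024:.0f} GiB")                           # 6 GiB
print(f"L2 BW:     {CMGS * L2_BW_PER_CMG / 1000:.1f} TB/s")                       # 24.6 TB/s
print(f"HBM BW:    {CMGS * HBM_BW_PER_CMG / 1000:.1f} TB/s")                      # 4.1 TB/s
print(f"peak FP64: {CMGS * CORES_PER_CMG * FLOPS_PER_CORE / 1000:.0f} Tflop/s")   # 36 Tflop/s
```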

2.6 LARC’s Power and Thermal Considerations

To estimate the power consumption of LARC, we analyze A64FX’s current consumption and extrapolate to 1.5 nm by leveraging public technology roadmaps. A64FX’s peak power, achieved while running DGEMM, is 122 W [117], of which 95 W correspond to the 48 compute cores and 15 W to the four memory interfaces (MIF); hence, we conclude 1.98 W/core and 3.75 W/MIF. Therefore, a LARC CMG with 32 cores in 7 nm would consume 67.1 W. TSMC projects that shrinking from 7 nm to 5 nm yields a power reduction of about 30% [99], i.e., 46.98 W for LARC’s CMG in 5 nm. IRDS’s roadmap [53, Figure ES9] indicates a further compounded power reduction (at iso-frequency) of 42% when moving from 5 nm to 1.5 nm, i.e., 27.37 W for LARC’s CMG in 1.5 nm. As the full LARC chip is estimated to include 16 CMGs, we project a total power of 438 W (not including the L2 cache).

Next, we estimate the power consumed by the principal part of this study—the 384 MiB L2 cache. A 4 MiB SRAM L2 cache in 7 nm consumes 64 mW of static power [44]. Assuming a similar (pessimistic) static power consumption at 1.5 nm and extrapolating to 384 MiB, we find that our cache would have a static power consumption of 6.14 W per CMG. Scaled to the full 16 CMGs of our hypothetical LARC, we arrive at a static power consumption of 98.3 W. The static power consumption of caches represents between 90% and 98% of their entire power consumption (at 350 K temperature, see, e.g., [5, 20]), where the remainder is the dynamic power consumption. If we assume a pessimistic 9:1 ratio between static and dynamic power, then this yields a total power consumption of 109.23 W for 6 GiB of chip-wide stacked L2 cache.

To conclude, a LARC processor (16 CMGs) would have to be designed for a thermal design power (TDP) of 547 W. While this expected TDP is higher than that of the current A64FX, it is not entirely unlike emerging architectures, such as NVIDIA’s H100 [81], which consumes up to 700 W, or the AMD Instinct MI250X GPU [3] at 560 W. We stress that our estimate of 547 W is the peak power draw achieved only during parallel DGEMM execution. Adjusting for STREAM Triad, based on the breakdown in [117], we conclude a realistic, and considerably lower, power consumption of 420 W for bandwidth-bound applications running on the whole LARC chip.
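To summarize the derivation, the sketch below retraces the full power budget; the per-core and per-MIF figures, the 30% and 42% node-to-node reductions, and the SRAM static-power figure are the values quoted above, and the small deviation from the 438 W and 547 W stated in the text is due to rounding of intermediate results.

```python
# Power budget of the hypothetical LARC chip (Section 2.6), retraced from
# the values quoted in the text.

W_PER_CORE, W_PER_MIF = 1.98, 3.75          # derived from A64FX's 122 W DGEMM peak
CMGS, CORES_PER_CMG   = 16, 32

cmg_7nm   = CORES_PER_CMG * W_PER_CORE + W_PER_MIF   # ~67.1 W per CMG at 7 nm
cmg_5nm   = cmg_7nm * (1 - 0.30)                     # ~47.0 W at 5 nm (TSMC projection)
cmg_1p5nm = cmg_5nm * (1 - 0.42)                     # ~27.3 W at 1.5 nm (IRDS projection)
cores_w   = CMGS * cmg_1p5nm                         # ~436 W for cores + MIFs (text: 438 W)

static_per_4mib = 0.064                              # W of static power per 4 MiB SRAM
cache_static_w  = static_per_4mib * (384 / 4) * CMGS # ~98.3 W for 6 GiB of stacked L2
cache_total_w   = cache_static_w / 0.9               # pessimistic 9:1 static:dynamic ratio

print(f"cores+MIF: {cores_w:.0f} W, L2 cache: {cache_total_w:.0f} W, "
      f"TDP: {cores_w + cache_total_w:.0f} W")       # ~545 W (text: 547 W)
```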

Finally, while this L2 cache power estimation might appear pessimistic, there are ample opportunities to further reduce power consumption. To save static energy, all un-accessed dies can be put into a data-retentive, low-power (sleep) state. Thermal issues can be mitigated by stacking the cache layers underneath the cores instead of on top; remaining hot spots can additionally be addressed with simple direct-die cooling or advanced techniques [18, 106], such as high-\(\kappa\) thermal compound [42], microfluidic cooling [114], or thermal-aware floorplanning, task-scheduling, and data-placement optimizations. Specifically, microfluidic cooling can handle power densities of 3.5 W/mm2 and hot-spot power levels of over 20 W/mm2 for 3D-stacked chips [1]. By contrast, our LARC CPU has a power density of 2.85 W/mm2 at 192 mm2 if we ignore adjunct components such as the I/O die, PCIe, TofuD interface, and the like, and around half that power density at 400 mm2 if these components are included.


3 PROJECTING PERFORMANCE IMPROVEMENT IN SIMULATED ENVIRONMENTS

Analyzing LARC’s feasibility is only the first step, and hence we have to demonstrate the effects of the proposed changes on real workloads to allow a meaningful cost-benefit analysis by CPU vendors. This section details two simulation approaches (one novel; one established) and discusses the HPC applications, which we evaluate extensively in Sections 4 and 5.

3.1 Simulating Unrestricted Locality with MCA

Designing and executing even initial studies (i.e., without complex memory models, etc.) with cycle-level gem5 simulations of realistic workloads takes substantial time with unknown outcome. Therefore, one would want to have a first-order approximation of a very large and fast cache. Regrettably, and to the best of our knowledge, existing approaches for fast first-order approximations generally do not support complex HPC applications, i.e., the existing tools neither handle multi-threading correctly nor do they support MPI applications [6]. Hence, we design a simulation approach, using Machine Code Analyzers (MCA), which can estimate the speedup for a given application orders of magnitude faster than gem5 (typically hours instead of months; cf. next section). This upper bound in expected performance improvement allows us to: (i) get a perspective on the best possible performance improvement if all reads/writes can be satisfied from the cache; and (ii) justify more accurate simulations and classify their results with respect to the baseline and the upper bound.

Machine Code Analyzers, such as llvm-mca [66], have been designed to study microarchitectures, improve compilers, and investigate resource pressure for application kernels. Usually, the input for these tools is a short Assembly sequence, and they output, among other things, an expected throughput for a given CPU when the sequence is executed many times and all data is available in the L1 data cache. For most real applications, the latter assumption is obviously incorrect; however, it is ideal for gauging an upper bound on performance when all memory bottlenecks disappear.

Unfortunately, it is neither feasible to record all executed instructions in one long sequence, nor to analyze a full program sequence with llvm-mca. Hence, we break the program execution into basic blocks (at most tens or hundreds of instructions) and evaluate their throughput individually. For a given combination of a program and input (called workload hereafter), the basic blocks and their dependencies create a directed Control Flow Graph (CFG) [56] with one source (program start) and one sink (program termination). All intermediate nodes (representing basic blocks) of the graph can have multiple parent- and dependent-nodes, as well as self-references (e.g., basic blocks of for-loops). Knowing the “runtime” of each basic block and the number of invocations per basic block, we can estimate the runtime of the entire workload by summation of the parts.

We utilize the Software Development Emulator (SDE) [57] from Intel to record the basic blocks and their caller/callee dependencies for a workload with moderate runtime overhead (typically on the order of a 1000× slowdown). SDE also records the number of invocations per CFG edge for a workload, i.e., how often the program counter (PC) jumped from one specific basic block to another specific block. We developed a program which parses the output of Intel SDE and establishes an internal representation of the Control Flow Graph. The internal CFG nodes are then amended with Assembly extracted from the program’s binary, since SDE’s Assembly output is not compatible with Machine Code Analyzers. Our program subsequently executes a Machine Code Analyzer for each basic block, getting in return an estimated cycles-per-iteration metric (CPIter). We record the per-block CPIter at the directed CFG edge from caller to callee, which already holds the number of invocations of this edge, effectively creating a “weighted” graph. Figure 4 showcases the result, and it is easy to see that the summation over all edges in the CFG is equivalent to the estimated runtime of the entire workload (assuming all data resides in the L1 data cache).

Fig. 4.

Fig. 4. Illustration of our runtime estimation pipeline with the MCA-based tool for an accumulative kernel executed with \(n=\text{42}\) ; Dotted line: branch not taken; Solid line: kernel execution as recorded by SDE; Edges in directed CFG annotated by number of jumps between basic blocks; Details in Section 3.1.

The above outlined approach works for both sequential and parallel programs. Intel SDE can record the instruction execution and caller/callee dependencies for thread-parallel programs, e.g., pthreads, OpenMP, or TBB. Furthermore, we can attach SDE to individual MPI ranks to collect per-rank data. Therefore, we are able to estimate the runtime for MPI+X parallelized HPC applications by the following equation: (1) \(\begin{equation} \textstyle \text{t}_\text{app} := \frac{ \max \limits _{r\,\in \,\text{ranks}} \big (\max \limits _{t\,\in \,\text{threads}_{r}} (\,\sum \limits _{\text{edges}\,e\,\in \,\text{CFG}_{t,r}} \text{CPIter}_{e} \cdot \#\text{calls}_{e}\,)\big) }{ \text{processor frequency in Hz} } \end{equation}\) under the assumption that MPI ranks and threads do not share computational resources,1 where we sum up the number of cycles required for each block (i.e., over the CFG edges), consider only the “slowest” thread and rank, and divide by the CPU frequency to convert the total cycles into runtime.
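The following minimal Python sketch implements Equation (1); the nested `cfgs[rank][thread]` structure of (CPIter, #calls) pairs per CFG edge is a hypothetical representation of the data our tool extracts, used here only for illustration.

```python
# Sketch of Equation (1): upper-bound runtime from the weighted per-thread CFGs.

def estimate_runtime(cfgs, freq_hz):
    """cfgs[rank][thread] is a list of (cpiter, num_calls) pairs, one per CFG edge."""
    slowest_cycles = max(
        max(
            sum(cpiter * calls for cpiter, calls in thread_edges)
            for thread_edges in rank_threads
        )
        for rank_threads in cfgs
    )
    return slowest_cycles / freq_hz      # seconds, assuming all data resides in L1D

# Example: 2 ranks x 2 threads with synthetic edge weights, at 2.2 GHz.
cfgs = [[[(4.0, 1_000_000), (12.5, 2_000)], [(4.0, 900_000)]],
        [[(6.0, 1_200_000)], [(4.0, 1_100_000), (3.0, 500)]]]
print(f"t_app = {estimate_runtime(cfgs, 2.2e9) * 1e3:.2f} ms")   # ~3.27 ms
```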

A self-imposed restriction of Machine Code Analyzers is their limited accuracy compared to cycle-accurate simulators, due to their distinct design goal. To improve our CPIter estimate, we rely on four different MCAs, namely llvm-mca [66], Intel ACA (IACA) [55], uiCA [2], and OSACA [65], and take the median of their results. Another shortcoming of MCA tools is that most of them estimate the throughput of basic blocks in isolation while assuming looping behavior of the assembly block (the PC jumps from the last instruction back to the first). Neither “block looping” nor an empty instruction pipeline (a single iteration of the block) is realistic for some blocks. Hence, for non-looping basic blocks, we estimate the CPIter by feeding the MCA tool with the blocks of caller and callee, and the callee’s CPIter is calculated by subtracting the retirement cycle of the caller’s last instruction from that of the callee’s last instruction (instead of starting the count when the callee’s first instructions are decoded, which can overlap with the execution of caller instructions). Further, we correct some cycle estimates for specific instructions within our tool in post-processing, since we encountered a few unsupported or grossly mis-estimated instructions while validating our tool against benchmarks. We refer the reader to Section 4.1 for more details.
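The two CPIter refinements can be summarized as follows; the sketch is illustrative only, and the `run` and `retire_cycle_of` callables stand in for invocations of the actual MCA back-ends (llvm-mca, IACA, uiCA, OSACA) rather than real APIs.

```python
# Illustrative form of the two CPIter refinements described above.
from statistics import median

def cpiter(block_asm, backends):
    """Median cycles-per-iteration estimate over all available MCA back-ends."""
    return median(run(block_asm) for run in backends)

def cpiter_non_looping(caller_asm, callee_asm, retire_cycle_of):
    """Cost of a non-looping callee: retirement cycle of the callee's last
    instruction minus that of the caller's last instruction, when the
    concatenated caller+callee sequence is analyzed once."""
    combined = caller_asm + callee_asm
    return (retire_cycle_of(combined, len(combined) - 1)
            - retire_cycle_of(combined, len(caller_asm) - 1))
```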

3.2 Cycle-level Accuracy: CPUs Simulated in gem5

While the MCAs can give a first-order approximation, we still require highly accurate predictions for our 3D-stacked, cache-rich CPU. Hence, we employ an open-source system architecture simulator, called gem5 [13]. It supports Arm, x86, and RISC-V CPUs to varying degrees of accuracy, and can be extended with memory models for higher simulation fidelity of the memory subsystem. We use gem5’s “syscall emulation” mode to execute applications directly without booting a Linux kernel.

Fortunately, RIKEN released their gem5 version, which was specially tailored for A64FX’s co-design to support SVE, HBM2, and other advanced features [94]. Hence, it is well suited to simulate our LARC proposal from Section 2.4. This version of gem5 has been validated for A64FX [62] and can be used with production compilers from Fujitsu. However, while evaluating RIKEN’s gem5, we noticed a few drawbacks, such as the lack of support for: (i) dynamically linked binaries; (ii) adequate memory management (freeing memory after the application’s free() calls); (iii) simulating more than 16 CPU cores due to limits in the cache coherence protocol; (iv) multi-rank MPI-based programs; and (v) simulating more than one A64FX CMG.

We modify gem5 to remedy the first three problems. However, the last two problems remain intractable without major changes to the simulator’s codebase, and hence we limit ourselves to single-CMG simulations (with one MPI rank). Relying on the assumption that most HPC codes are weak scaled across multiple NUMA domains and compute nodes, we believe the single-rank approach still serves as a solid foundation for future performance projection. However, even single-rank MPI binaries require numerous unsupported system calls. To circumvent this problem, we extend and deploy an MPI stub library [101].

3.3 Relevant HPC (Proxy-)Apps and Benchmarks

Instead of relying on a narrow set of cherry-picked applications, we attempt to cover a broad spectrum of typical scientific/HPC workloads. We customize and extend a publicly available benchmarking framework2 [34, 35] with a few additional benchmarks and necessary features to perform the MCA- and gem5-based simulations. The benchmark complexity ranges from simple kernels to large code bases (O(100,000s) lines-of-code) which are used by vendors for architecture comparisons and used by HPC centers for hardware procurements [41]. Hereafter, we detail the list of 127 included workloads, summed up across all benchmark suites, which are sized to fit within a single node and which could be simulated with gem5 in a reasonable time (\(\le \,\)six months).

Polyhedral Benchmark Suite.

The PolyBench/C suite contains 30 single-threaded, scientific kernels which can be parameterized in memory occupancy (\(\in [16\ \text{KiB}, 120\ \text{MiB}]\)) [90]. Unless stated otherwise, we use the largest configuration.

TOP500, STREAM, and Deep Learning Benchmarks.

High Performance Linpack (HPL) [36] solves a dense system of linear equations \(Ax = b\) of size 36,864 in our case. High Performance Conjugate Gradients (HPCG) [37] applies a conjugate gradient solver to a system of linear equation (with sparse matrix A). We choose \(120^3\) for HPCG’s global problem size. BabelStream [29] evaluates the memory subsystem of CPUs and accelerators, and we configure 2 GiB input vectors. Moreover, we implement a micro-benchmark, DLproxy, to isolate the single-precision GEMM operation (\(m=\text{1577088}; n=\text{27}; k=\text{32}\)) which is commonly found in 2D deep convolutional neural networks, such as 224×224 ImageNet classification workloads [111].

NASA Advanced Supercomputing Parallel Benchmarks.

The NAS Parallel Benchmarks (NPB) [11, 110] consists of nine kernels and proxy-apps which are common in computational fluid dynamics (CFD). The original MPI-only set has been expanded with ten OpenMP-only benchmarks [60] and we select the class B input size for all of them.

RIKEN’s Fiber Mini-Apps and TAPP Kernels.

To aid the co-design of Supercomputer Fugaku, RIKEN developed the Fiber proxy-application set [92], a benchmark suite representing the scientific priority areas of Japan. Additionally, RIKEN released scaled-down TAPP kernels [93] of their priority applications which are tailored for fast simulations with gem5 [62]. Our workloads are as follows: FFB [46] with the 3D-flow problem discretized into 50\(\times 50\times\)50 sub-regions; FFVC [84] using 144\(\times 144\times\)144 cuboids; MODYLAS [9] with the wat222 workload; mVMC [73] with the strong-scaling test reduced to 1/8th of the samples and 1/3rd of the lattice size; NICAM [108] with a single (not 11) simulated day; NTChem [78] with the H2O workload; QCD [16] with the class 2 input.

Exascale Computing Project Proxy-Applications.

The US-based supercomputing centers curated a co-design benchmarking suite for their recent exascale efforts [41]. We select eleven applications of the aforementioned benchmarking framework with the following workloads. AMG [87] with the problem 1 workload; CoMD [75] with the 256,000-atom strong-scaling test; Laghos [32] modelling a 3D Sedov blast but with 1/6th of the timesteps; MACSio [31] with an \(\approx \,\)1.14 GiB data dump distributed across many JSON files; MiniAMR [50] simulating a sphere moving diagonally through 3D space; MiniFE [50] with 128\(\times 128\times\)128 grid size; MiniTri [116] testing triangle- and largest clique-detection on BCSSTK30 (MatrixMarket [15]); Nekbone [10] with 8,640 elements and polynomial order of 8; SW4lite [88] simulating a pointsource; SWFFT [47] with 32 forward and backward tests for a 128\(\times 128\times\)128 grid; XSBench [109] with the small problem and 15 million particle lookups.

3.3.1 SPEC CPU & SPEC OMP Benchmarks.

The Standard Performance Evaluation Corporation [102] offers, among others, two HPC-focused benchmark suites: SPEC CPU® 2017[speed] (ten integer-heavy, single-threaded benchmarks; ten OpenMP-parallelized, floating-point benchmarks) and SPEC OMP® 2012 (14 OpenMP-parallelized benchmarks). All SPEC tests hereafter are based on non-compliant runs with the train input configuration.


4 MCA-BASED SIMULATION RESULTS

Sections 4.1 and 4.2 are dedicated to our MCA-based estimation of the upper bound on performance improvement with abundant L1 cache. First, we evaluate the accuracy of this approach, and then apply the novel methodology to our benchmarking sets.

4.1 MCA-based Simulator Validation

During the development of our MCA-based simulator, we implemented numerous micro-benchmarks to fine-tune the CPI estimation capabilities while comparing the results to an Intel® Xeon® processor E5-2650v4 (formerly code named Broadwell). Our micro-benchmarks comprise MPI-/OpenMP-only, MPI+OpenMP, and single-threaded tests (exercising recursive functions, floating-point- or integer-intensive operations, L1-localised, or stream-like operation).

Needless to say, applying MCA-based simulations to full workloads or complex application kernels is still error-prone, since these tools are designed to analyze small Assembly sequences without any guarantee of accurate absolute performance numbers. Regardless, we validate the current status of our tool using PolyBench/C with MINI inputs. In theory, these input sizes (\(\approx \,\)16 KiB) should all fit into the 32 KiB L1D cache of the Broadwell. Hence, measuring the kernel execution time for these PolyBench tests should yield numbers close to the MCA-based runtime estimates. For the baseline measurements, we set all cores of the Broadwell to 2.2 GHz, set the uncore to 2.7 GHz, and disable turbo boost; compile each workload with Intel’s Parallel Studio XE,3 and execute every test 100 times (since many only run for a few ms) to determine the fastest possible execution time. The difference between the real baseline results and our MCA-based estimates is visualized in Figure 5 as the projected relative runtime difference.

Fig. 5.

Fig. 5. Validation of MCA-based runtime predictions against PolyBench/C MINI with inputs fitting into L1D; Relative runtime shown (vs. Intel E5-2650v4 measurements); Values \(\le \,\) 1 show prediction of faster execution.

The data shows that, on average, our MCA-based method is slightly optimistic: the MCA approach predicts faster execution times than it should. Only seven out of 30 workloads are expected to run slower than what we observe on the real Broadwell (i.e., y-value \(\ge\)1). For eight of the PolyBench tests, our tool estimates the runtime to be over 2× faster than our measurements. Hence, we can conclude that for 73% of the micro-benchmarks, the MCA-based method is reasonably accurate: within 2× slower to 2× faster. While a 2× discrepancy might appear high, we have to point out that our cross-validations using SST [95, 113] and third-party gem5 models [7] for Intel CPUs yield similar inaccuracies,4 but our MCA-based method is substantially faster.

Another indicator for the accuracy of our MCA-approach can be drawn from DGEMM (double precision gemm benchmark in Figure 5). Theoretically, DGEMM performs close to peak and is not memory-bound for large matrices, and hence the measured runtime and MCA-based estimates are expected to match. Unfortunately, PolyBench’s Gflop/s rate for gemm is far from peak (due to its hand-coded loop-nest), and therefore we replace it with an Intel MKL-based implementation of equal matrix dimensions. For the PolyBench input sizes \(\texttt {MINI},\ldots ,\texttt {EXTRALARGE}\) in our MKL-based implementation, our MCA tool estimates a faster runtime by 6.4×, 75%, 11%, 1.9%, and 1.5%, respectively. This closely matches the achievable single-core Gflop/s of the E5-2650v4: for MINI and the MKL-based runs, we measure only 2 Gflop/s , while for EXTRALARGE we peak out at the expected 32 Gflop/s . The low Gflop/s measurements for MINI (and SMALL) demonstrate that MKL is not yet compute-bound, and hence causes the 6.4× (and 75%) misprediction.

4.2 Speedup-potential with Unrestricted Locality

In this section, we take on the entire benchmark suite from Section 3.3 with the MCA-based approach and evaluate their speedup potential when all data fits into L1.

The baseline measurements for the speedup estimates are conducted on a dual-socket Intel Broadwell E5-2650v4 system with 24 cores (48 hardware threads; 2-way hyper-threading enabled, cores set to 2.2 GHz, turbo boost disabled). For all listed benchmarks, excluding SPEC CPU and OMP, we focus on the solver times only, i.e., we ignore data initialization and post-processing phases. Since most proxy-apps are parallelized with MPI and/or OpenMP, we perform an initial sweep of possible configurations of ranks and threads to determine the fastest time-to-solution (TTS) for our strong-scaling benchmarks, and the highest figure-of-merit (as reported by the benchmarks) for weak-scaling workloads. The highest-performing configuration is executed ten times to determine the TTS of the kernel as our reference point in Figure 6.

Fig. 6.

Fig. 6. Projected speedup against a baseline dual-socket Intel Broadwell E5-2650v4 system while assuming all data fits into L1D with “optimistic” load-to-use latency; Top row, left to right: PolyBench, RIKEN TAPP kernels, NPB (OMP); Bottom row, left to right: NPB (MPI), TOP500, etc., ECP proxies, RIKEN Fiber apps, SPEC CPU[int/single] and CPU[float/OMP], SPEC OMP.

The same MPI/OMP configurations are then used for our MCA-based estimate. Under the assumption that some MPI-parallelized benchmarks experience imbalances, we randomly sample up to nine ranks (in addition to rank 0),5 execute the selected ranks with Intel SDE (and the remaining ranks normally), and calculate the estimated runtime using Equation (1) and the 2.2 GHz processor frequency. The measured runtime is then divided by the resulting runtime estimate to determine the upper-bound speedup potential per application when all of its data would fit into L1D, see Figure 6.

For PolyBench/C workloads, we see similar speedup trends as for the smallest inputs used in Figure 5, although the expected speedup for EXTRALARGE increases to a peak of 8.4× for the ludcmp kernel. Only four kernels show no performance increase, presumably because they are compute-bound rather than bandwidth-bound: 2mm, 3mm, doitgen, and trisolv. Overall, the MCA-based approach estimates a geometric mean (GM) speedup of 2.9× from fitting all data into L1D. RIKEN’s TAPP kernels benefit the most from unrestricted locality. Especially kernel 20 (SpMV), which represents one core function of the FFB application, shows a speedup of 20×. Altogether, we see a projection of (GM=)2.6× increased performance, but also two cases (kernels 5 and 9) where the MCA tool estimates a \(\approx \,\)50% slowdown. These two are from GENESIS [61] and NICAM, respectively, but as detailed in Section 4.1, some inaccuracy is expected as the trade-off for the faster simulation time.

NPB’s OpenMP version of a conjugate gradient (CG) solver is another workload with a large theoretical performance gain of 13.1×. In total, we expect a (GM=)3× gain for all NAS Parallel Benchmarks; specifically, (GM=)4× for the OpenMP versions and (GM=)2.3× for the MPI versions. The potential gain for CG is not surprising, since these solvers are predominantly bound by memory bandwidth and are sensitive to memory latency [38]. High Performance Linpack is unsurprisingly not expected to gain any performance by placing all its data into L1 cache, as this benchmark is compute-bound. In fact, our MCA tool expected a small runtime decrease of 11%. By contrast, DLproxy, which uses MKL’s SGEMM, would benefit from a large L1, since MKL cannot achieve peak Gflop/s for the tall/skinny matrix in this workload (cf. Section 3.3). XSBench and miniAMR show the highest gains for ECP’s and RIKEN’s proxy-apps, with a value of 7.3× and 7.4×, respectively. This appears to be in line with the expectation from the roofline characteristics of the benchmarks when measured on a similar compute node [33].

A deeper look at roofline analysis in [33] reveals that there is no strong correlation between the position of an application on the roofline model and the expected performance gain from solely running out of L1D cache. We speculate that other, hidden bottlenecks are exposed by our MCA approach, such as data dependencies and lack of concurrency in the applications, which limit the expected speedup. Apart from noticeable outliers in the expected speedup, such as lbm, ilbdc, and especially swim, the potential from enlarged L1D is rather slim for SPEC, and only (GM=)1.9× runtime reduction can be expected across all 34 workloads.


5 GEM5-BASED SIMULATION RESULTS

In Section 5.1, we detail our choice of the simulated architectures in gem5. Mirroring the structure of the MCA-based simulations, Section 5.2 validates gem5 for our proposed CPU architectures, Section 5.3 evaluates numerous benchmarks and proxy-applications on said architectures, and Section 5.4 summarizes the results.

5.1 LARC CMG Models in gem5 and A64FXS Baseline

As we discussed in Section 2.4, we envision one LARC CMG to have 32 cores, 384 MiB of L2 cache, and 1.6 TB/s of L2 bandwidth. Regretfully, gem5 (at least RIKEN’s version) can only be configured with L2 cache sizes that are powers of two, and therefore we have to scale LARC’s L2 cache size either up or down. Hence, we explore both as distinct options: one conservative and one technologically aggressive configuration. The conservative option, called LARCC, is limited to 256 MiB of L2 cache at \(\sim \,\)800 GB/s, while the aggressive version, LARCA, doubles both values, to 512 MiB and \(\sim \,\)1.6 TB/s, respectively.

Starting from the baseline, i.e., a simulated version of A64FX which we label A64FXS, and in order to materialize the properties of the LARC CMG (cf. Section 2.4), we modify three parameters in our gem5 model: (i) the number of cores in the system, increased to 32 (up from A64FXS’ baseline of 12); (ii) the size of the total L2 cache, set to match the capacity of the eight stacked layers (256/512 MiB, up from A64FXS’ L2 size of 8 MiB per CMG); and (iii) the number of L2 banks, adjusted in LARCA to control the bandwidth.
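For illustration, the fragment below shows how such a CMG-level cache configuration could look in a gem5-style Python script; it uses parameter names from mainline gem5’s classic cache model, whereas the exact options and structure of RIKEN’s A64FX fork may differ, so this is a hedged sketch rather than the script we actually used, and the MSHR settings are placeholders.

```python
# Hypothetical gem5-style fragment reflecting the three LARC modifications.
from m5.objects import Cache

class LarcL2Slice(Cache):
    size = "256MB"        # (ii)  LARCC; "512MB" for LARCA, "8MB" for the A64FXS baseline
    assoc = 16            # 16-way set-associative (cf. Table 2)
    tag_latency = 37      # cycles, kept identical across all four configurations
    data_latency = 37
    response_latency = 37
    mshrs = 64            # placeholder value, not taken from Table 2
    tgts_per_mshr = 20    # placeholder value, not taken from Table 2

NUM_CORES    = 32         # (i)   up from 12 in the A64FXS baseline
L2_BANK_BITS = 2          # (iii) 2**bankbits L2 banks; doubled in LARCA for ~2x bandwidth
```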

We introduce a fourth gem5 configuration, called A64FX32, which simulates one baseline A64FXS CMG but with 32 cores. These four configurations A64FXS \(\rightarrow\)A64FX32 \(\rightarrow\)LARCC \(\rightarrow\)LARCA should allow us to determine the speedup gains from the larger core count and larger L2 cache, individually. The core frequency is universally set to 2.2 GHz . Table 2 summarizes the four gem5 configurations.

Table 2.
                      A64FXS        A64FX32       LARCC         LARCA
Cores                 12            32            32            32
CMGs                  4             4             16            16
Core config.          Arm v8.2 + SVE, 512-bit SIMD, 2.2 GHz, OoO, 128 ROB entries, dispatch width 4
Branch pred.          Bi-mode: 16 K global predictor, 16 K choice predictor
Per-core L1D          64 KiB 4-way set-assoc, 3 cycles, adjacent line prefetcher
L2 cache per CMG:
  L2 size             8 MiB         8 MiB         256 MiB       512 MiB
  BW                  ~800 GB/s     ~800 GB/s     ~800 GB/s     ~1600 GB/s
L2 cache aggregated:
  L2 size             32 MiB        32 MiB        4096 MiB      8192 MiB
  BW                  ~3.2 TB/s     ~3.2 TB/s     ~12.8 TB/s    ~25.6 TB/s
L2 config.            16-way set-associative, 37 cycles, inclusive, 256 B block
Main Memory           32 GiB HBM2, 4 channels, 256 GB/s

Table 2. Chip Area and Simulator Configurations for gem5

5.2 gem5-based Simulation and Configuration Validation

We perform OpenMP tests to verify our gem5 simulator for up to 32 cores. For the L2 cache size and bandwidth changes, we employ a STREAM Triad benchmark, parameterized to avoid cache line conflicts among participating threads. Splitting the A64FXS CMG L2 cache into 12 chunks (one per thread) yields a working size of 683 KiB . Hence, the three 128 KiB vectors of the Triad operation will fit into the L2 cache. We increase the total vector size in proportion to the number of threads and test the achievable L2 bandwidth for LARCC and LARCA. Additionally, Figure 7(a) includes the baseline A64FXS CMG scaled to 12 cores. The simulation shows that LARCC’s L2 bandwidth peaks out at 792 GB/s and LARCA’s bandwidth goes up to 1450 GB/s for this particular test case, which is, respectively, 1% and 9% lower than our estimates shown above. The baseline A64FXS closely matches the bandwidth of the real A64FX CPU executing this test.
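The working-set sizing behind this validation is simple enough to restate as a sanity check (a sketch of the sizing argument only, not the benchmark itself):

```python
# Per-thread L2 share vs. Triad working set for the A64FXS validation run.
L2_MIB, THREADS, VEC_KIB = 8, 12, 128

share_kib   = L2_MIB * 1024 / THREADS    # ~683 KiB of L2 per thread
working_kib = 3 * VEC_KIB                # three Triad vectors (a, b, c)

assert working_kib <= share_kib          # the Triad data fits into the L2 slice
print(f"L2 share/thread: {share_kib:.0f} KiB, Triad working set: {working_kib} KiB")
```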

Fig. 7.

Fig. 7. Validation with simulated STREAM Triad; Both LARC configurations with 32 cores; A64FXS scaled to 12 cores; Real A64FX measurements on 1 CMG for reference; Dashed lines highlight trend (not measured).

Another validation test we perform is setting the number of cores to the maximum (12 and 32, respectively) and scaling the vector size from 2 KiB per core to a total of 1 GiB for the three vectors. Figure 7(b) shows the results of this simulation. In the memory range of tens to hundreds of KiB, the Triad operation can be served from the L1 cache, for which LARCC and LARCA show higher bandwidth: their 2.7× higher core count results in 2.6× higher aggregated L1 bandwidth. For memory sizes that fit into the L2 cache, we see behavior similar to Figure 7(a). Past 8 MiB, the A64FXS configuration shows the expected bandwidth drop to HBM2 level, while for LARCC and LARCA, the expected L2 cache bandwidth is maintained until 256 MiB and 512 MiB, respectively. This validates that our gem5 settings yield the expected LLC characteristics.

Lastly, to validate the LARC configuration and to see how the changes apply to more complex science kernels, we perform a sensitivity analysis of cache parameters with the RIKEN TAPP kernels. In Figure 8, we vary L2 cache access latency, size, and bandwidth in ranges beyond our LARCC and LARCA target architectures. This analysis will help us adjust our expectations when future LARC-like architectures deviate from our design parameters, e.g., by stacking fewer SRAM layers or having a higher L2 access latency. In this parameter sweep, LARCC is the baseline, and we vary one parameter while keeping the others fixed. The top row of Figure 8 shows the latency sweep, where we choose 22 cycles as the best latency (which is 2× the data load latency from L1 for SVE instructions in A64FX). The worst case of 52 cycles is equidistant from our baseline in the opposite direction, and two additional latencies are selected in between. Similarly, we adjust the L2 size (middle row; simulating more or fewer SRAM stacks, or a larger or smaller semiconductor process node) and the L2 bank bits in gem5, see bottom row of Figure 8. The latter indirectly controls the L2 bandwidth of the simulated architectures. The latency change has minimal impact, since HPC applications are typically not latency bound. However, the L2 cache capacity and bandwidth can have a significant impact on performance, as expected, since they determine the amount of data that can be stored and accessed quickly. For some of the TAPP kernels, though, the performance is unaffected by these parameters,6 since these kernels are actually shrunk-down versions specifically designed for cycle-level architecture simulations, and therefore have a low memory footprint.

Fig. 8.

Fig. 8. Sensitivity study of cache parameters using RIKEN’s TAPP [93] kernels; Relative runtime compared to LARCC baseline (37 cycle latency, 256 MiB , 2 bankbits; middle bar among the five) is shown; Top row: L2 latency modified; Middle row: L2 capacity; Bottom row: adjusting L2 bandwidth via bankbits (#banks = \(2^x\) ).

5.3 Speedup-potential with Restricted Locality

To further refine our projections gained by abundant cache, we proceed with the cycle-level simulations of the proxy-applications and benchmarks listed in Section 3.3.

We compile all benchmarks with Fujitsu’s Software Technical Computing Suite (v4.6.1) targeting the real A64FX, and simulate the single-rank workloads in gem5 for our four configurations. Unfortunately, three of our MPI-based benchmarks, MODYLAS, NICAM, and NTChem, require multi-rank MPI, and hence we omit them. Furthermore, we skip the MPI-only versions of NPB. Hereafter, we only report proxy-applications and benchmarks which ran to completion within gem5 (i.e., gem5 crashes or simulated application crashes are excluded when infeasible to patch, and simulations exceeding the six-month time limit are ignored).

The per-configuration speedup is given relative to the baseline A64FXS configuration. We exclude initialization and post-processing times, and measure only the main kernel runtime, except for the SPEC benchmarks as described in Section 4.2. These results are presented in Figure 9 and show the effects of the gradual expansion of simulated resources. The average (single CMG) speedups from LARCC and LARCA are \(\approx \,1.9\times\) and \(\approx \,2.1\times\), respectively, with some applications reaching \(\approx \,4.4\times\) for LARCC and \(\approx \,4.6\times\) for LARCA.

Fig. 9.

Fig. 9. gem5-based, simulated speedups of A64FX32, LARCC and LARCA in comparison to baseline A64FXS; Left to right: RIKEN TAPP kernels, NPB (OMP), TOP500 etc., ECP proxies, SPEC CPU[int/single] and CPU[float/OMP], SPEC OMP; Added MCA-based estimations from Figure 6 for reference; TAPP kernels 3–6 (multiple Nbody kernels) and 18 (MatVecDotP) are limited to 12 threads, hence we omit A64FX32; Missing benchmarks (cf. Figure 6) primarily due to gem5 issues or exceeding simulation time limit. PolyBench results (single core) are also omitted due to limited speedup across all of them and no noteworthy outliers.

As expected, most benchmarks benefit from the additional cores and cache capacity, most prominently MG-OMP, which gains a small speedup of \(\approx \,1.3\times\) from the extra cores, a \(\approx \,2\times\) speedup from the extra cache, and, with 512 MiB cache and higher bandwidth, reaches a \(\approx \,4.6\times\) speedup. Comparable incremental improvements across all three architecture steps are observable in other workloads, such as TAPP kernels 7 (DifferOpVer) and 17 (MatVecSplit), which scale well to more cores and are memory-bound, and hence benefit from both the additional cores and the larger cache capacity. TAPP kernels 19 and 20, XSBench, roms, and imagick (SPEC OMP) show similar runtime gains, but the difference between LARCC and LARCA is smaller, implying that the problem size either fits into the 256 MiB L2 (e.g., XSBench) or the workload arrives at a point of diminishing returns from the 2× larger cache. TAPP kernels 8, 9, 12–15, and FT-OMP suffer a slowdown from cache contention on A64FX32. LARCC and LARCA avoid this cache contention, resulting in speedups similar to the benchmarks discussed earlier. EP-OMP, CoMD, and other compute-bound benchmarks benefit only from the higher core count, with both LARCs providing a speedup similar to A64FX32.

Expectedly, single-threaded workloads (all of PolyBench’s benchmarks) show little to no improvement over A64FXS, i.e., they do not benefit from more cores. However, these benchmarks also do not show a performance gain from the larger 3D-stacked L2 cache, despite their working set sizes exceeding A64FXS’ 8 MiB L2 yet fitting into LARC’s larger cache. We only see a limited speedup of (GM=)4.3% across all of them and no noteworthy outliers, and hence omit them from Figure 9. We attribute other outliers, such as the slowdown of imagick (SPEC CPU), to similar intrinsic properties of the benchmarks: our testing on a real A64FX reveals that imagick has a sweet spot at 8 OpenMP threads and scales negatively thereafter; and the TAPP kernels 3–6 and 18 were customized for the 12-core A64FX CMG and cannot run effectively on 32 threads without a rewrite. Hence, we limit gem5 to 12 cores for these TAPP kernels, and we see that only the MatVecDotP kernel of the ADVENTURE application [4] benefits from a larger L2. Further proxy-applications and benchmarks missing from Figure 9, yet appearing in Figure 6, are the unfavorable result of persistent, repeatable simulator errors—sometimes occurring after months of simulation.

We should note that in some cases the benchmarks’ implementation and the quality of the compiler may skew the results, for instance, BabelStream measuring memory bandwidth on a 2 GiB buffer. Being unoptimized for A64FX, BabelStream’s baseline underperforms in terms of per-core bandwidth (compared to the STREAM Triad tests in Figures 7(a) and 7(b)), which in turn results in a performance gain when the number of cores increases to 32.

Overall, the speedup on A64FX32 can originate from the following reasons: (i) the program is compute-bound (a valid result); (ii) the workload exhibits both compute-bound and memory-bound tendencies in different components of a proxy-application (a valid result); (iii) the program is highly latency-bound, and hence the speedup can be the result of the larger aggregate L1 cache (a valid result); or (iv) a poor baseline resulting in a slightly misleading result.

We confirm the validity of attributing the improvements to the high-capacity L2 by inspecting the L2 cache-miss rates of our gem5 simulations (the miss rates of selected examples are listed in Table 3). The reduction in cache-miss rates reported in the table is consistent with the performance improvements we observe in Figure 9.

Proxy-App                            | A64FXS | A64FX32 | LARCC | LARCA
NICAM's ImplicitVer (kernel 12)      | 36.6   | 47.6    | 10.5  | 9.1
ADVENTURE's MatVecSplit (kernel 17)  | 46.7   | 49.5    | 48.7  | 34.8
FFB's FrontFlow (kernel 19)          | 73.8   | 69.6    | 49.1  | 48.9
FT-OMP                               | 11.6   | 48.2    | 6.4   | 3.8
MG-OMP                               | 59.8   | 70.9    | 29.4  | 0.4
XSBench                              | 32.1   | 36.4    | 0.1   | 0.1

Table 3. L2 Cache-miss Rate [in %] of Representative Proxies
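For readers who want to reproduce such numbers, the miss rates in Table 3 can be derived from gem5's statistics output; the sketch below is a minimal, hypothetical parser, and the stat names (overall_misses/overall_accesses under an L2 object called system.l2) are assumptions that vary across gem5 versions and cache configurations.

```python
# A minimal sketch for extracting an L2 miss rate from a gem5 "stats.txt".
# The stat names below are assumptions; they differ between gem5 versions
# (snake_case vs. camelCase) and between cache configurations.
import re
import sys

def l2_miss_rate(stats_path, l2_name="system.l2"):
    misses = accesses = 0.0
    pattern = re.compile(
        rf"^{re.escape(l2_name)}\.overall_(misses|accesses)::total\s+([0-9.eE+-]+)"
    )
    with open(stats_path) as f:
        for line in f:
            m = pattern.match(line)
            if m:
                if m.group(1) == "misses":
                    misses += float(m.group(2))
                else:
                    accesses += float(m.group(2))
    return 100.0 * misses / accesses if accesses else float("nan")

if __name__ == "__main__":
    print(f"L2 miss rate: {l2_miss_rate(sys.argv[1]):.1f}%")
```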

5.4 Summary of the Results

Our gem5 simulations indicate that more than half (31 out of 52) of the applications experienced a more than two-times speedup on LARCA compared to the baseline A64FXS CMG. For over two-thirds (24 out of 31) of these applications, the performance gains are directly attributable to the larger (3D-stacked) cache, i.e., at least one of the two LARC configurations gains 10% or more over the A64FX32 variant. Most notably, of all the RIKEN TAPP kernels that experienced a meaningful speedup on LARC, a majority benefited from the expanded cache rather than from the increased core count. This carries particular importance as these kernels are highly tuned for A64FX.
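The same bookkeeping can be written down as a small filter over per-application speedups; the sketch below applies the thresholds stated above (more than 2× over A64FXS, and at least a 10% gain of either LARC configuration over A64FX32) to hypothetical example entries, not to the full result set.

```python
# Classify applications as in the summary: ">2x on LARCA vs. A64FXS", and,
# within those, "cache-attributed" if either LARC configuration gains at
# least 10% over the core-count-only A64FX32 variant. All speedups are
# relative to A64FXS; the entries are hypothetical placeholders.
speedups = {
    #  app        (A64FX32, LARCC, LARCA)
    "MG-OMP":     (1.3,     2.6,   4.6),
    "XSBench":    (1.4,     2.9,   3.0),
    "EP-OMP":     (2.4,     2.4,   2.5),
}

faster_than_2x = {a: s for a, s in speedups.items() if s[2] > 2.0}
cache_attributed = {
    a: s for a, s in faster_than_2x.items()
    if max(s[1], s[2]) >= 1.10 * s[0]
}

print(f"{len(faster_than_2x)} of {len(speedups)} apps exceed 2x on LARCA")
print("cache-attributed:", sorted(cache_attributed))
```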

6 DISCUSSION AND LIMITATIONS

In this study, we simulated a single LARC CMG in gem5 to assess its potential future effect on common HPC workloads.

6.1 The Prospect of LARC

In reality, if a LARC processor were realized in 2028, it would contain 16 LARC CMGs, corresponding to the same silicon area as the current A64FX CPU, and it is important to understand what impact such a processor would have on the HPC community and its applications. Unfortunately, it is hard to give a conclusive answer to such a forward-looking question today. However, if we assume ideal scaling of both A64FX and LARC CMGs and compare at the full-chip level, then a LARC system in 2028 could provide between 4.91× (xz; SPEC CPU) and 18.57× (MG-OMP; NPB) performance improvements over the current A64FX processor, with an average improvement of (GM=)9.56× for applications that are responsive to larger cache capacity. For applications that do not yet benefit from a larger cache, future studies should (continue to) consider algorithmic improvements [69], as well as investigate the potential of allocating parts of the cache to various compute capabilities, for example, processing-in-memory [12] or alternative compute modules, e.g., CGRAs [89].
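For clarity, the ideal-scaling arithmetic behind these chip-level numbers is sketched below; it simply multiplies a per-CMG speedup by the ratio of CMGs per equal silicon area (16 LARC CMGs versus 4 A64FX CMGs), and the per-CMG inputs are back-derived approximations rather than new measurements, so the printed values differ from the text only by rounding.

```python
# Ideal-scaling projection from a single simulated CMG to a full chip:
# a LARC CMG occupies roughly a quarter of an A64FX CMG's area, so 16 LARC
# CMGs replace the 4 A64FX CMGs of a chip-sized die. Assuming perfect
# scaling across CMGs, the chip-level speedup is the per-CMG speedup
# times 16/4 = 4. The per-CMG values below are back-derived examples
# (e.g., MG-OMP at ~4.64x per CMG), not new measurements.
A64FX_CMGS_PER_CHIP = 4
LARC_CMGS_PER_CHIP = 16

def chip_speedup(per_cmg_speedup):
    return per_cmg_speedup * LARC_CMGS_PER_CHIP / A64FX_CMGS_PER_CHIP

for app, s in {"xz": 1.23, "MG-OMP": 4.64}.items():
    print(f"{app}: {chip_speedup(s):.2f}x at the full-chip level")
```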

6.2 Considerations and Limitations

Our MCA-based estimation framework only gives a first-order approximation for a hypothetical CPU with an L1 cache large enough to host the entire data structures of a specific workload. This approach has advantages and disadvantages and should be used with caution, but it also has capabilities that we have not yet detailed, such as estimating the runtime of the same binary/workload on different (ISA-compatible) x86 systems by simply replacing the MCA target architecture and adjusting the CPU clock frequency.
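As an illustration of that retargeting capability, the sketch below re-runs llvm-mca on the same assembly with a different -mcpu target and rescales the estimated cycle count by the clock frequency; it is a hypothetical fragment (the file kernel.s and the chosen targets are placeholders) and omits the Intel SDE-based trace extraction of the full framework.

```python
# A hedged sketch of re-targeting an MCA estimate: run llvm-mca on the same
# assembly with a different -mcpu, then convert the estimated cycle count
# into time using the target's clock frequency.
import re
import subprocess

def estimated_seconds(asm_file, mcpu, ghz, iterations=100):
    out = subprocess.run(
        ["llvm-mca", f"-mcpu={mcpu}", f"-iterations={iterations}", asm_file],
        capture_output=True, text=True, check=True,
    ).stdout
    cycles = int(re.search(r"Total Cycles:\s+(\d+)", out).group(1))
    return cycles / (ghz * 1e9)

# Example: compare two x86 targets for the same (placeholder) kernel.
for cpu, ghz in [("skylake-avx512", 2.4), ("znver3", 2.45)]:
    print(cpu, f"{estimated_seconds('kernel.s', cpu, ghz):.6e} s")
```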

We emphasize that we run applications as they are, i.e., without any algorithmic optimizations for the larger last-level cache, in our MCA- and gem5-based simulations. The same holds for our motivating experiment shown in Figure 1. While the cache capacity of AMD's Milan-X CPU is about three times that of Milan, it is far from what we envision for 2028. Hence, our Milan-X results serve as a first-order indication of what SRAM, in its currently available state of the art, can offer.

Another notable aspect, which is outside the main scope of this extrapolation study, is the heat dissipation of CPU cores in the presence of the 3D-stacked cache. It has been reported that AMD's Milan-X carefully stacks its additional cache above areas of the chip that are not used for compute, i.e., mostly above the existing caches [77]. Our assumption is that, by 2028, manufacturing technologies will have advanced enough to overcome this limitation. Yet, for interested readers, we provide further details on thermal and power estimates for our hypothetical LARC CPU in Section 2.6.

7 RELATED WORK

Stacked Memory and Caches: The size of the LLC has increased over the last 25 years [58], a trend anticipated to continue into the future. Yet, 2D ICs are becoming hard to exploit for additional performance, despite recent attempts by IBM [27, 59]. 3D-stacking, however, is becoming a promising alternative [52], as demonstrated by AMD's 3D V-Cache [40], Samsung's proposed 3D SRAM stacking solution [64] based on 7 nm TSVs, or the most recent study of 7 nm TCI-based 2- and 4-layer SRAM stacks by Shiba et al. [97]. Moreover, academics have explored 3D-stacked DRAM caches [48, 119], but these incur much higher latency and power consumption [74, 98]. Non-volatile memory has been considered as an LLC alternative, yet it suffers from similar latency issues [63]. Lastly, NVIDIA applied for a patent on an 8-layer memory stack fused with a processor die [28], theorizing a 50× improvement in bytes-to-flop ratio. However, what differentiates our work from the work of our peers is: (i) we focus on the real-world impact of future caches several orders of magnitude larger than those found today.

Performance modeling tools and methodologies: Computer architecture research is often based on simulators, such as the Structural Simulation Toolkit (SST) [95] or CODES [24], for efficiently evaluating and optimizing HPC architectures and applications. The gem5 simulator, by Binkert et al. [13], is widely used by academia and vendors for micro- and full-system architecture emulation and simulation, and it supports validated models for x86 [7] and Arm [62]. We refer the interested reader to www.gem5.org/publications/ for a comprehensive library of gem5-based research and derivative works. However, what differentiates our work from the work of our peers is: (ii) unlike prior work that utilizes (relatively) small kernels, our work operates on large-scale MPI/OpenMP-parallelized proxy-applications in order to quantify the impact of caches on realistic workloads. To our knowledge of reported research-driven gem5 simulations, this is the largest scale of cycle-accurate simulations conducted in terms of the aggregate number of instructions simulated (\(6.08 \times 10^{13}\)).
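To give a flavor of what such a simulation setup involves (without reproducing our CMG model), the sketch below declares an enlarged shared L2 using gem5's classic cache model; it only runs inside gem5, and the capacity, latency, associativity, and MSHR values are illustrative assumptions rather than the calibrated parameters used in this study.

```python
# A minimal, hypothetical gem5 (classic memory system) cache declaration
# that approximates a large 3D-stacked shared L2. Parameter names follow
# gem5's Cache SimObject; the values are assumptions, not the calibrated
# numbers of the LARC CMG model.
from m5.objects import Cache

class StackedL2(Cache):
    size = "512MB"            # LARC-A-like capacity (vs. 8 MiB per A64FX CMG)
    assoc = 16
    tag_latency = 40          # assumed extra latency of the stacked array
    data_latency = 40
    response_latency = 40
    mshrs = 64
    tgts_per_mshr = 16
    clusivity = "mostly_incl"
```

A complete CMG model additionally needs core, interconnect, and main-memory configurations, which this fragment deliberately leaves out.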

Other methods, such as MUSA by Grass et al. [45], are closer to our MCA-based approach, since MUSA uses PIN, which is the basis for Intel SDE (used in this study), but it focuses on MPI analysis and multi-node workloads. We are not the first to utilize Machine Code Analyzers, see [2, 65] and derivative works such as [8, 19, 22, 25, 72, 91]. However, what differentiates our work from the work of our peers is: (iii) instead of estimating accurate performance of existing system architectures, our MCA-based approach aims to gauge the upper bound in obtainable performance, and it exposes bottlenecks of common HPC applications better than the roofline approach.

8 CONCLUSION

We aspire to understand the performance implications of emerging SRAM-based die-stacking on future HPC processors. We first designed a methodology to project the upper bound that an infinitely large cache would provide for relevant HPC applications. We find that several well-known HPC applications and benchmarks have ample opportunities to exploit an increased cache capacity.

We further expand our study by proposing a hypothetical processor (called LARC) in a 1.5 nm technology. This processor would have nearly 6 GiB of L2 cache, compared to our baseline A64FXS CPU architecture with 32 MiB of L2 cache. Next, we exercise a single LARC CMG with a plethora of HPC applications and benchmarks using the gem5 simulator and contrast the observed performance against the existing A64FXS CMG. We find that the LARC CMG would (on average) be 1.9× faster than the corresponding A64FXS CMG, while consuming only \(\frac{1}{4}\) of its area. When area-normalized to the real A64FX CMG (by assuming optimistic ideal scaling), we can expect to see an average boost of 9.56× for cache-sensitive HPC applications by the end of this decade.

Finally, we expect that larger caches will motivate and facilitate algorithmic advances that, in combination with the abundant cache, can potentially yield an order-of-magnitude gain in performance, as demonstrated by tile low-rank (TLR) approximations [69]. These approaches, however, require a minimum cache size to reach their full potential. We firmly believe that the combination of high-bandwidth, large, 3D-stacked caches and algorithmic advances is the path forward for the next generation of HPC processors when attempting to break the “memory wall”.

9 FAIR COMMITMENT BY THE AUTHORS

We developed a framework of scripts and git submodules to manage the R&D of LARC, to set up the benchmarking infrastructure, and to perform the simulations. After cloning our repository https://gitlab.com/domke/LARC (or downloading the artifacts from https://doi.org/10.5281/zenodo.6420658), one has access to all benchmarks (see Section 3.3), patches, scripts, and our collected data. Only minor modifications to the configuration files should be necessary before testing on another system, such as changing host names, adjusting paths to compilers, or downloading licensed third-party software. If users deviate from our OS version (CentOS Linux release 7.9.2009, with the intel_pstate=disable kernel parameter), then some additional changes might be required.
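A minimal, hypothetical starting point for fetching the artifact and checking two of the environment assumptions mentioned above is sketched below; the repository's own scripts perform the actual setup, so this fragment is only a convenience wrapper.

```python
# A minimal sketch for fetching the artifact and checking two environment
# assumptions mentioned above (OS release and the intel_pstate=disable
# kernel parameter). The repository's own scripts perform the real setup.
import pathlib
import subprocess

REPO = "https://gitlab.com/domke/LARC"

def clone(dest="LARC"):
    if not pathlib.Path(dest).exists():
        subprocess.run(
            ["git", "clone", "--recurse-submodules", REPO, dest], check=True
        )

def check_environment():
    cmdline = pathlib.Path("/proc/cmdline").read_text()
    if "intel_pstate=disable" not in cmdline:
        print("warning: intel_pstate=disable not set; results may differ")
    release = pathlib.Path("/etc/os-release").read_text()
    if "CentOS Linux 7" not in release:
        print("warning: tested OS was CentOS Linux release 7.9.2009")

if __name__ == "__main__":
    clone()
    check_environment()
```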

ACKNOWLEDGMENT

We thank Masazumi Nakamura from AMD for providing us with early access to the Milan-X platform.

JD, EV, BG, and MW designed the study; JD and EV conducted the experiments; BG fixed gem5 issues; JD, EV, MW, AP, MP, LZ and PC analyzed the results; SM and AP developed the stacked cache model, and all authors participated in brainstorming and writing the manuscript.

Footnotes

  1. Resource over-subscription is outside the scope of this study and our tool.
  2. Exact benchmark versions, git commits, inputs, and the like, are provided in our artifacts, which are referenced in Section 9.
  3. For details of flags, tools, versions, and execution environments, please refer to Section 9.
  4. A large-scale survey of academic simulators in realistic scenarios, beyond carefully selected and tuned micro-kernels, is, in our humble opinion, consequential, yet outside the scope of this paper; reference [6], however, provides a data point.
  5. Sampling at most ten out of all MPI ranks should not substantially alter the result but saves resources, since we have to execute SDE once per rank.
  6. The MatVecSplit oddity (runtime increases for 128 MiB) needs further investigation. It shows an enlarged counter of LoadLockedRequests; this artifact could be attributed to software (such as the barrier implementation in the OpenMP runtime).

REFERENCES

  [1] 2023. Heterogeneous Integration Roadmap 2023 Edition - Chapter 20: Thermal. Technical Report. IEEE Electronics Packaging Society. 1–39. https://eps.ieee.org/images/files/HIR_2023/ch20_thermalfinal.pdf
  [2] Abel Andreas and Reineke Jan. 2021. A Parametric Microarchitecture Model for Accurate Basic Block Throughput Prediction on Recent Intel CPUs. https://arxiv.org/pdf/2107.14210.pdf
  [3] Advanced Micro Devices, Inc. 2021. AMD Instinct™ MI250X Accelerator. https://www.amd.com/en/products/server-accelerators/instinct-mi250x
  [4] ADVENTURE Project. 2019. Development of Computational Mechanics System for Large Scale Analysis and Design — ADVENTURE Project. https://adventure.sys.t.u-tokyo.ac.jp/
  [5] Agarwal A., Li Hai, and Roy K.. 2003. A single-V\(_t\) low-leakage gated-ground cache for deep submicron. IEEE Journal of Solid-State Circuits 38, 2 (2003), 319–328.
  [6] Akram Ayaz and Sawalha Lina. 2019. A survey of computer architecture simulation techniques and tools. IEEE Access 7 (2019), 78120–78145.
  [7] Akram Ayaz and Sawalha Lina. 2019. Validation of the Gem5 simulator for X86 architectures. In 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE, Denver, CO, USA, 53–58.
  [8] Alappat Christie, Meyer Nils, Laukemann Jan, Gruber Thomas, Hager Georg, Wellein Gerhard, and Wettig Tilo. 2021. Execution-cache-memory modeling and performance tuning of sparse matrix-vector multiplication and lattice quantum chromodynamics on A64FX. Concurrency and Computation: Practice and Experience (Aug. 2021), 30.
  [9] Andoh Yoshimichi, Yoshii Noriyuki, Fujimoto Kazushi, Mizutani Keisuke, Kojima Hidekazu, Yamada Atsushi, Okazaki Susumu, Kawaguchi Kazutomo, Nagao Hidemi, Iwahashi Kensuke, Mizutani Fumiyasu, Minami Kazuo, Ichikawa Shin-ichi, Komatsu Hidemi, Ishizuki Shigeru, Takeda Yasuhiro, and Fukushima Masao. 2013. MODYLAS: A highly parallelized general-purpose molecular dynamics simulation program for large-scale systems with long-range forces calculated by fast multipole method (FMM) and highly scalable fine-grained new parallel processing algorithms. Journal of Chemical Theory and Computation 9, 7 (2013), 3201–3209.
  [10] Argonne National Laboratory. 2022. NEK5000. http://nek5000.mcs.anl.gov
  [11] Bailey D. H., Barszcz E., Barton J. T., Browning D. S., Carter R. L., Dagum L., Fatoohi R. A., Frederickson P. O., Lasinski T. A., Schreiber R. S., Simon H. D., Venkatakrishnan V., and Weeratunga S. K.. 1991. The NAS parallel benchmarks – summary and preliminary results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (SC’91). ACM, New York, NY, USA, 158–165.
  [12] Balasubramonian Rajeev, Chang Jichuan, Manning Troy, Moreno Jaime H., Murphy Richard, Nair Ravi, and Swanson Steven. 2014. Near-data processing: Insights from a MICRO-46 workshop. IEEE Micro 34, 4 (2014), 36–42.
  [13] Binkert Nathan, Beckmann Bradford, Black Gabriel, Reinhardt Steven K., Saidi Ali, Basu Arkaprava, Hestness Joel, Hower Derek R., Krishna Tushar, Sardashti Somayeh, Sen Rathijit, Sewell Korey, Shoaib Muhammad, Vaish Nilay, Hill Mark D., and Wood David A.. 2011. The Gem5 simulator. ACM SIGARCH Computer Architecture News 39, 2 (Aug. 2011), 1–7.
  [14] Black Bryan, Annavaram Murali, Brekelbaum Ned, DeVale John, Jiang Lei, Loh Gabriel H., McCaule Don, Morrow Pat, Nelson Donald W., Pantuso Daniel, Reed Paul, Rupley Jeff, Shankar Sadasivan, Shen John, and Webb Clair. 2006. Die stacking (3D) microarchitecture. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 39). IEEE Computer Society, USA, 469–479.
  [15] Boisvert Ronald F., Pozo Roldan, Remington Karin, Barrett Richard F., and Dongarra Jack J.. 1997. Matrix market: A web resource for test matrix collections. In Proceedings of the IFIP TC2/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement. Chapman & Hall, Ltd., London, UK, 125–137. http://dl.acm.org/citation.cfm?id=265834.265854
  [16] Boku T., Ishikawa K. I., Kuramashi Y., Minami K., Nakamura Y., Shoji F., Takahashi D., Terai M., Ukawa A., and Yoshie T.. 2012. Multi-block/Multi-core SSOR preconditioner for the QCD quark solver for K computer. Proceedings, 30th International Symposium on Lattice Field Theory (Lattice 2012): Cairns, Australia, June 24–29, 2012, LATTICE2012 (2012), 188.
  [17] Bonshor Gavin. 2022. AMD Releases Milan-X CPUs With 3D V-Cache: EPYC 7003 Up to 64 Cores and 768 MB L3 Cache. https://www.anandtech.com/show/17323/amd-releases-milan-x-cpus-with-3d-vcache-epyc-7003
  [18] Cao Kun, Zhou Junlong, Wei Tongquan, Chen Mingsong, Hu Shiyan, and Li Keqin. 2019. A survey of optimization techniques for thermal-aware 3D processors. Journal of Systems Architecture 97, C (Aug. 2019), 397–415.
  [19] Cabezas Victoria Caparrós and Stanley-Marbell Phillip. 2011. Parallelism and data movement characterization of contemporary application classes. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11). Association for Computing Machinery, New York, NY, USA, 95–104.
  [20] Chakraborty Shounak and Kapoor Hemangee K.. 2018. Analysing the role of last level caches in controlling chip temperature. IEEE Transactions on Sustainable Computing 3, 4 (2018), 289–305.
  [21] Cheese. 2022. AMD’s V-Cache Tested: The Latency Teaser. https://chipsandcheese.com/2022/01/14/amds-v-cache-tested-the-latency-teaser/
  [22] Chen Yishen, Brahmakshatriya Ajay, Mendis Charith, Renda Alex, Atkinson Eric, Sýkora Ondřej, Amarasinghe Saman, and Carbin Michael. 2019. BHive: A benchmark suite and measurement framework for validating X86-64 basic block performance models. In 2019 IEEE International Symposium on Workload Characterization (IISWC). IEEE Press, Orlando, FL, USA, 167–177.
  [23] Chou Chiachen, Jaleel Aamer, and Qureshi Moinuddin K.. 2015. BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA’15). Association for Computing Machinery, New York, NY, USA, 198–210.
  [24] Cope J., Liu N., Lang Samuel, Carothers C. D., and Ross Robert B.. 2011. CODES: Enabling co-design of multi-layer exascale storage architectures. In Workshop on Emerging Supercomputing Technologies 2011 (WEST 2011). OSTI.GOV, Tuscon, Arizona, USA, 1–6.
  [25] Corda Stefano, Singh Gagandeep, Awan Ahsan Javed, Jordans Roel, and Corporaal Henk. 2019. Memory and parallelism analysis using a platform-independent approach. In Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems (SCOPES’19). Association for Computing Machinery, New York, NY, USA, 23–26.
  [26] Cutress Ian. 2021. AMD Demonstrates Stacked 3D V-Cache Technology: 192 MB at 2 TB/Sec. https://www.anandtech.com/show/16725/amd-demonstrates-stacked-vcache-technology-2-tbsec-for-15-gaming
  [27] Cutress Ian. 2021. Did IBM Just Preview the Future of Caches? https://www.anandtech.com/show/16924/did-ibm-just-preview-the-future-of-caches
  [28] Dally William James, Gray Carl Thomas, Keckler Stephen W., and O’Connor James Michael. [n. d.]. Memory Stacked on Processor for High Bandwidth. https://patents.justia.com/patent/20230275068
  [29] Deakin Tom, Price James, Martineau Matt, and McIntosh-Smith Simon. 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. In High Performance Computing, Taufer Michela, Mohr Bernd, and Kunkel Julian M. (Eds.). Springer, Cham, 489–507.
  [30] Dennard Robert H., Gaensslen Fritz H., Yu Hwa-Nien, Rideout V. Leo, Bassous Ernest, and LeBlanc Andre R.. 1974. Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE Journal of Solid-State Circuits 9, 5 (1974), 256–268.
  [31] Dickson James, Wright Steven, Maheswaran Satheesh, Herdmant Andy, Miller Mark C., and Jarvis Stephen. 2016. Replicating HPC I/O workloads with proxy applications. In Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS’16). IEEE Press, Piscataway, NJ, USA, 13–18.
  [32] Dobrev V., Kolev T., and Rieben R.. 2012. High-order curvilinear finite element methods for lagrangian hydrodynamics. SIAM Journal on Scientific Computing 34, 5 (2012), B606–B641.
  [33] Domke Jens, Matsumura Kazuaki, Wahib Mohamed, Zhang Haoyu, Yashima Keita, Tsuchikawa Toshiki, Tsuji Yohei, Podobas Artur, and Matsuoka Satoshi. 2019. Double-precision FPUs in high-performance computing: An embarrassment of riches?. In 2019 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019, Rio de Janeiro, Brazil, May 20–24, 2019. IEEE Press, Rio de Janeiro, Brazil, 78–88.
  [34] Domke Jens and Vatai Emil. 2021. Matrix Engine Study. https://gitlab.com/domke/MEstudy
  [35] Domke Jens, Vatai Emil, Drozd Aleksandr, Peng Chen, Oyama Yosuke, Zhang Lingqi, Salaria Shweta, Mukunoki Daichi, Podobas Artur, Wahib Mohamed, and Matsuoka Satoshi. 2021. Matrix engines for high performance computing: A paragon of performance or grasping at straws?. In 2021 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2021, Portland, Oregon, USA, May 17–21, 2021. IEEE Press, Portland, Oregon, USA, 1056–1065.
  [36] Dongarra Jack. 1988. The LINPACK benchmark: An explanation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 456–474. http://dl.acm.org/citation.cfm?id=647970.742568
  [37] Dongarra Jack, Heroux Michael, and Luszczek Piotr. 2015. HPCG Benchmark: A New Metric for Ranking High Performance Computing Systems. Technical Report ut-eecs-15-736. University of Tennessee. https://library.eecs.utk.edu/pub/594
  [38] Dongarra Jack, Heroux Michael A., and Luszczek Piotr. 2016. A new metric for ranking high-performance computing systems. National Science Review 3, 1 (2016), 30–35.
  [39] Esmaeilzadeh Hadi, Blem Emily, Amant Renee St., Sankaralingam Karthikeyan, and Burger Doug. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA’11). Association for Computing Machinery, New York, NY, USA, 365–376.
  [40] Evers Mark, Barnes Leslie, and Clark Mike. 2022. The AMD next generation Zen 3 core. IEEE Micro 42, 3 (2022), 7–12.
  [41] Exascale Computing Project. 2018. ECP Proxy Apps Suite. https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/
  [42] Gomes Wilfred, Khushu Sanjeev, Ingerly Doug B., Stover Patrick N., Chowdhury Nasirul I., O’Mahony Frank, Balankutty Ajay, Dolev Noam, Dixon Martin G., Jiang Lei, Prekke Surya, Patra Biswajit, Rott Pavel V., and Kumar Rajesh. 2020. 8.1 Lakefield and mobility compute: A 3D stacked 10nm and 22FFL hybrid processor system in 12×12mm2, 1mm package-on-package. In 2020 IEEE International Solid-State Circuits Conference (ISSCC). IEEE Press, San Francisco, CA, USA, 144–146.
  [43] Gottlieb A., Grishman R., Kruskal C. P., McAuliffe K. P., Rudolph L., and Snir M.. 1983. The NYU ultracomputer—designing an MIMD shared memory parallel computer. IEEE Trans. Comput. C-32, 2 (1983), 175–189.
  [44] Goud A. Arun, Venkatesan Rangharajan, Raghunathan Anand, and Roy Kaushik. 2015. Asymmetric underlapped FinFET based Robust SRAM design at 7nm node. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE’15). EDA Consortium, San Jose, CA, USA, 659–664.
  [45] Grass Thomas, Allande César, Armejach Adrià, Rico Alejandro, Ayguadé Eduard, Labarta Jesus, Valero Mateo, Casas Marc, and Moreto Miquel. 2016. MUSA: A multi-level simulation approach for next-generation HPC machines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’16). IEEE Press, Salt Lake City, UT, USA, 526–537.
  [46] Guo Yang, Kato Chisachi, and Yamade Yoshinobu. 2006. Basic features of the fluid dynamics simulation software “FrontFlow/Blue”. Seisan Kenkyu 58, 1 (2006), 11–15.
  [47] Habib Salman, Morozov Vitali, Frontiere Nicholas, Finkel Hal, Pope Adrian, Heitmann Katrin, Kumaran Kalyan, Vishwanath Venkatram, Peterka Tom, Insley Joe, Daniel David, Fasel Patricia, and Lukić Zarija. 2016. HACC: Extreme scaling and performance across diverse architectures. Commun. ACM 60, 1 (Dec. 2016), 97–104.
  [48] Hameed Fazal, Khan Asif Ali, and Castrillon Jeronimo. 2021. Improving the performance of block-based DRAM caches via tag-data decoupling. IEEE Trans. Comput. 70, 11 (2021), 1914–1927.
  [49] Hemsoth Nicole. 2018. A Rogues Gallery of Post-Moore’s Law Options. https://www.nextplatform.com/2018/08/27/a-rogues-gallery-of-post-moores-law-options/
  [50] Heroux Michael A., Doerfler Douglas W., Crozier Paul S., Willenbring James M., Edwards H. Carter, Williams Alan, Rajan Mahesh, Keiter Eric R., Thornquist Heidi K., and Numrich Robert W.. 2009. Improving Performance via Mini-applications. Technical Report SAND2009-5574. Sandia National Laboratories.
  [51] Hruska Joel. 2012. The Death of CPU Scaling: From One Core to Many – and Why We’re Still Stuck. https://www.extremetech.com/computing/116561-the-death-of-cpu-scaling-from-one-core-to-many-and-why-were-still-stuck
  [52] Hu Xing, Stow Dylan C., and Xie Yuan. 2018. Die stacking is happening. IEEE Micro 38, 1 (2018), 22–28.
  [53] IEEE IRDS™. 2021. International Roadmap for Devices and Systems (IRDS™) 2021 Edition – Executive Summary. IEEE IRDS™ Roadmap. IEEE. 64 pages. https://irds.ieee.org/images/files/pdf/2021/2021IRDS_ES.pdf
  [54] IEEE IRDS™. 2021. International Roadmap for Devices and Systems (IRDS™) 2021 Edition – Systems and Architectures. IEEE IRDS™ Roadmap. IEEE. 23 pages. https://irds.ieee.org/images/files/pdf/2021/2021IRDS_SA.pdf
  [55] Intel Corporation. 2012. Intel® Architecture Code Analyzer – User’s Guide. https://www.intel.com/content/dam/develop/external/us/en/documents/intel-architecture-code-analyzer-2-0-users-guide-157548.pdf
  [56] Intel Corporation. 2020. Dynamic Control-Flow Graph (DCFG) and DCFG-Trace Format Specifications – For Format Version 1.00. https://www.intel.com/content/dam/develop/external/us/en/documents/dcfg-format-548994.pdf
  [57] Intel Corporation. 2021. Intel® Software Development Emulator. https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
  [58] Iyer Ravi R., De Vivek, Illikkal Ramesh, Koufaty David A., Chitlur Bhushan, Herdrich Andrew, Khellah Muhammad M., Hamzaoglu Fatih, and Karl Eric. 2021. Advances in microprocessor cache architectures over the last 25 years. IEEE Micro 41, 6 (2021), 78–88.
  [59] Jacobi Christian. 2021. Real-time AI for enterprise workloads: The IBM Telum processor. In 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE Computer Society, Palo Alto, CA, USA, 22. https://hc33.hotchips.org/assets/program/conference/day1/HC2021.C1.3IBMCristianJacobiFinal.pdf
  [60] Jin H., Frumkin M., and Yan J.. 1999. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011. NASA Ames Research Center. 26 pages. https://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf
  [61] Jung Jaewoon, Mori Takaharu, Kobayashi Chigusa, Matsunaga Yasuhiro, Yoda Takao, Feig Michael, and Sugita Yuji. 2015. GENESIS: A hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations. WIREs Computational Molecular Science 5, 4 (2015), 310–323.
  [62] Kodama Yuetsu, Odajima Tetsuya, Asato Akira, and Sato Mitsuhisa. 2020. Accuracy improvement of memory system simulation for modern shared memory processor. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia2020). Association for Computing Machinery, New York, NY, USA, 142–149.
  [63] Korgaonkar Kunal, Bhati Ishwar, Liu Huichu, Gaur Jayesh, Manipatruni Sasikanth, Subramoney Sreenivas, Karnik Tanay, Swanson Steven, Young Ian, and Wang Hong. 2018. Density tradeoffs of non-volatile memory as a replacement for SRAM based last level cache. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE Press, Los Angeles, CA, USA, 315–327.
  [64] Lau John H.. 2021. 3D IC integration and 3D IC packaging. In Semiconductor Advanced Packaging. Springer, Singapore, 343–378.
  [65] Laukemann Jan, Hammer Julian, Hofmann Johannes, Hager Georg, and Wellein Gerhard. 2018. Automated instruction stream throughput prediction for Intel and AMD microarchitectures. In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE Press, Dallas, TX, USA, 121–131.
  [66] LLVM Project. 2022. llvm-mca - LLVM Machine Code Analyzer. https://llvm.org/docs/CommandGuide/llvm-mca.html
  [67] Loh Gabriel H. and Hill Mark D.. 2011. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). Association for Computing Machinery, New York, NY, USA, 454–464.
  [68] Loh Gabriel H., Xie Yuan, and Black Bryan. 2007. Processor design in 3D die-stacking technologies. IEEE Micro 27, 3 (May 2007), 31–48.
  [69] Ltaief Hatem, Cranney Jesse, Gratadour Damien, Hong Yuxi, Gatineau Laurent, and Keyes David E.. 2021. Meeting the real-time challenges of ground-based telescopes using low-rank matrix computations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’21). ACM, New York, NY, USA, 29:1–29:16.
  [70] McCalpin J. D.. 1995. Memory bandwidth and machine balance in current high performance computers. IEEE Technical Committee on Computer Architecture (TCCA) Newsletter 2, 19–25 (Dec. 1995), 17.
  [71] McKee Sally A.. 2004. Reflections on the memory wall. In Proceedings of the 1st Conference on Computing Frontiers (CF’04). Association for Computing Machinery, New York, NY, USA, 162.
  [72] Mendis Charith, Renda Alex, Amarasinghe Saman, and Carbin Michael. 2019. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Chaudhuri Kamalika and Salakhutdinov Ruslan (Eds.). PMLR, Long Beach, California, USA, 4505–4515. https://proceedings.mlr.press/v97/mendis19a.html
  [73] Misawa Takahiro, Morita Satoshi, Yoshimi Kazuyoshi, Kawamura Mitsuaki, Motoyama Yuichi, Ido Kota, Ohgoe Takahiro, Imada Masatoshi, and Kato Takeo. 2018. mVMC–open-source software for many-variable variational Monte Carlo method. Computer Physics Communications 235, Feb. 2019 (2018), 447–462.
  [74] Mittal Sparsh and Vetter Jeffrey S.. 2016. A survey of techniques for architecting DRAM caches. IEEE Transactions on Parallel and Distributed Systems 27, 6 (June 2016), 1852–1863.
  [75] Mohd-Yusof Jamaludin, Swaminarayan Sriram, and Germann Timothy C.. 2013. Co-Design for Molecular Dynamics: An Exascale Proxy Application. Technical Report LA-UR 13-20839. Los Alamos National Laboratory. http://www.lanl.gov/orgs/adtsc/publications/science_highlights_2013/docs/Pg88_89.pdf
  [76] Moore Gordon E.. 1975. Progress in digital integrated electronics. International Electron Devices Meeting, IEEE 21 (1975), 11–13.
  [77] Morgan Timothy P.. 2022. “Milan-X” 3D Vertical Cache Yields Epyc HPC Bang for the Buck Boost. https://www.nextplatform.com/2022/03/21/milan-x-3d-vertical-cache-yields-epyc-hpc-bang-for-the-buck-boost/
  [78] Nakajima Takahito, Katouda Michio, Kamiya Muneaki, and Nakatsuka Yutaka. 2014. NTChem: A high-performance software package for quantum molecular simulation. International Journal of Quantum Chemistry 115, 5 (Dec. 2014), 349–359.
  [79] Nickolls John and Dally William J.. 2010. The GPU computing era. IEEE Micro 30, 2 (2010), 56–69.
  [80] Nori Anant Vithal, Gaur Jayesh, Rai Siddharth, Subramoney Sreenivas, and Wang Hong. 2018. Criticality aware tiered cache hierarchy: A fundamental relook at multi-level cache hierarchies. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE Press, Los Angeles, CA, USA, 96–109.
  [81] NVIDIA Corporation. 2022. NVIDIA H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/
  [82] Okazaki Ryohei, Tabata Takekazu, Sakashita Sota, Kitamura Kenichi, Takagi Noriko, Sakata Hideki, Ishibashi Takeshi, Nakamura Takeo, and Ajima Yuichiro. 2020. Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption. Fujitsu Technical Review. Fujitsu Limited. 9 pages. https://www.fujitsu.com/global/documents/about/resources/publications/technicalreview/2020-03/article03.pdf
  [83] Oliveira Geraldo F., Gómez-Luna Juan, Orosa Lois, Ghose Saugata, Vijaykumar Nandita, Fernandez Ivan, Sadrosadati Mohammad, and Mutlu Onur. 2021. DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks. IEEE Access 9 (2021), 134457–134502.
  [84] Ono Kenji, Iwata Masako, Tamaki Tsuyoshi, Kawashima Yasuhiro, Akasaka Kei, Suzuki Soichiro, Onishi Junya, Uzawa Ken, Hamaguchi Kazuhiro, Miyazaki Yohei, and Imano Masashi. 2016. FFV-C Package. http://avr-aics-riken.github.io/ffvc_package/
  [85] Or-Bach Zvi. 2017. A 1,000x improvement in computer systems by bridging the processor-memory gap. In 2017 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE Press, Burlingame, CA, USA, 1–4.
  [86] Owens John D., Houston Mike, Luebke David, Green Simon, Stone John E., and Phillips James C.. 2008. GPU computing. Proc. IEEE 96, 5 (2008), 879–899.
  [87] Park Jongsoo, Smelyanskiy Mikhail, Yang Ulrike Meier, Mudigere Dheevatsa, and Dubey Pradeep. 2015. High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’15). ACM, Austin, TX, USA, 54:1–54:12.
  [88] Petersson N. A. and Sjögreen B.. 2017. User’s Guide to SW4, Version 2.0. Technical Report LLNL-SM-741439. Lawrence Livermore National Laboratory.
  [89] Podobas Artur, Sano Kentaro, and Matsuoka Satoshi. 2020. A survey on coarse-grained reconfigurable architectures from a performance perspective. IEEE Access 8 (July 2020).
  [90] Pouchet Louis-Noel and Taylor Mark. 2016. PolyBench/C 4.2.1 (Beta). https://sourceforge.net/projects/polybench/
  [91] Renda Alex, Chen Yishen, Mendis Charith, and Carbin Michael. 2020. DiffTune: Optimizing CPU simulator parameters with learned differentiable surrogates. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Press, Athens, Greece, 442–455.
  [92] RIKEN AICS. 2015. Fiber Miniapp Suite. https://fiber-miniapp.github.io/
  [93] RIKEN Center for Computational Science. 2021. The Kernel Codes from Priority Issue Target Applications. https://github.com/RIKEN-RCCS/fs2020-tapp-kernels
  [94] RIKEN-RCCS. 2020. Riken_simulator. https://github.com/RIKEN-RCCS/riken_simulator
  [95] Rodrigues Arun, Cooper-Balis Elliot, Bergman Keren, Ferreira Kurt, Bunde David, and Hemmert K. Scott. 2012. Improvements to the structural simulation toolkit. In Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques (SIMUTOOLS’12). ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium, 190–195.
  [96] Sato Mitsuhisa, Ishikawa Yutaka, Tomita Hirofumi, Kodama Yuetsu, Odajima Tetsuya, Tsuji Miwako, Yashiro Hisashi, Aoki Masaki, Shida Naoyuki, Miyoshi Ikuo, Hirai Kouichi, Furuya Atsushi, Asato Akira, Morita Kuniki, and Shimizu Toshiyuki. 2020. Co-design for A64FX manycore processor and “Fugaku”. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’20). IEEE Press, Atlanta, GA, USA, 1–15.
  [97] Shiba Kota, Okada Mitsuji, Kosuge Atsutake, Hamada Mototsugu, and Kuroda Tadahiro. 2022. A 7-nm FinFET 1.2-TB/s/mm\(^2\) 3D-stacked SRAM module with 0.7-pJ/b inductive coupling interface using over-SRAM coil and Manchester-encoded synchronous transceiver. IEEE Journal of Solid-State Circuits (2022), 1–12.
  [98] Shiba Kota, Omori Tatsuo, Ueyoshi Kodai, Takamaeda-Yamazaki Shinya, Motomura Masato, Hamada Mototsugu, and Kuroda Tadahiro. 2021. A 96-MB 3D-stacked SRAM using inductive coupling with 0.4-V transmitter, termination scheme and 12:1 SerDes in 40-nm CMOS. IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I) 68, 2 (Feb. 2021), 692–703.
  [99] Shilov Anton. 2022. TSMC Roadmap Update: N3E in 2024, N2 in 2026, Major Changes Incoming. https://www.anandtech.com/show/17356/tsmc-roadmap-update-n3e-in-2024-n2-in-2026-major-changes-incoming
  [100] Shulaker Max M., Wu Tony F., Sabry Mohamed M., Wei Hai, Wong H.-S. Philip, and Mitra Subhasish. 2015. Monolithic 3D integration: A path from concept to reality. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE’15). EDA Consortium, San Jose, CA, USA, 1197–1202.
  [101] Sorby Hugh. 2017. MPI Stub. https://github.com/hsorby/mpistub
  [102] Standard Performance Evaluation Corporation. 2020. SPEC’s Benchmarks. https://www.spec.org/benchmarks.html
  [103] Stephens Nigel, Biles Stuart, Boettcher Matthias, Eapen Jacob, Eyole Mbou, Gabrielli Giacomo, Horsnell Matt, Magklis Grigorios, Martinez Alejandro, Premillieu Nathanael, Reid Alastair, Rico Alejandro, and Walker Paul. 2017. The ARM scalable vector extension. IEEE Micro 37, 02 (March 2017), 26–39.
  [104] Strohmaier Erich, Dongarra Jack, Simon Horst, and Meuer Martin. 2021. TOP500. http://www.top500.org/
  [105] Suggs David, Subramony Mahesh, and Bouvier Dan. 2020. The AMD “Zen 2” processor. IEEE Micro 40, 2 (2020), 45–52.
  [106] Tavakkoli Fatemeh, Ebrahimi Siavash, Wang Shujuan, and Vafai Kambiz. 2016. Analysis of critical thermal issues in 3D integrated circuits. International Journal of Heat and Mass Transfer 97 (2016), 337–352.
  [107] Theis Thomas N. and Wong H.-S. Philip. 2017. The end of Moore’s law: A new beginning for information technology. Computing in Science Engineering 19, 2 (2017), 41–50.
  [108] Tomita Hirofumi and Satoh Masaki. 2004. A new dynamical framework of nonhydrostatic global model using the icosahedral grid. Fluid Dynamics Research 34, 6 (2004), 357–400. http://stacks.iop.org/1873-7005/34/i=6/a=A03
  [109] Tramm John R., Siegel Andrew R., Islam Tanzima, and Schulz Martin. 2014. XSBench - The development and verification of a performance abstraction for Monte Carlo reactor analysis. In PHYSOR 2014 - The Role of Reactor Physics toward a Sustainable Future. JAEA, Kyoto, 1–13.
  [110] Van der Wijngaart Rob F.. 2002. The NAS Parallel Benchmarks 2.4. Technical Report NAS-02-007. NASA Ames Research Center. 8 pages. https://www.nas.nasa.gov/assets/pdf/techreports/2002/nas-02-007.pdf
  [111] Vasudevan Aravind, Anderson Andrew, and Gregg David. 2017. Parallel multi channel convolution using general matrix multiplication. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE Press, Seattle, WA, USA, 19–24.
  [112] Vetter Jeffery S., DeBenedictis Erik P., and Conte Thomas M.. 2017. Architectures for the Post-Moore era. IEEE Micro 37, 04 (July 2017), 6–8.
  [113] Voskuilen Gwendolyn, Rodrigues Arun F., and Hammond Simon D.. 2016. Analyzing allocation behavior for multi-level memory. In Proceedings of the Second International Symposium on Memory Systems (MEMSYS’16). Association for Computing Machinery, New York, NY, USA, 204–207.
  [114] Wang Shaoxi, Yin Yue, Hu Chenxia, and Rezai Pouya. 2018. 3D integrated circuit cooling with microfluidics. Micromachines 9, 6 (2018), 1–14.
  [115] Warnock James, Curran Brian, Badar John, Fredeman Gregory, Plass Donald, Chan Yuen, Carey Sean, Salem Gerard, Schroeder Friedrich, Malgioglio Frank, Mayer Guenter, Berry Christopher, Wood Michael, Chan Yiu-Hing, Mayo Mark, Isakson John, Nagarajan Charudhattan, Werner Tobias, Sigal Leon, Nigaglioni Ricardo, Cichanowski Mark, Zitz Jeffrey, Ziegler Matthew, Bronson Tim, Strevig Gerald, Dreps Daniel, Puri Ruchir, Malone Douglas, Wendel Dieter, Mak Pak-Kin, and Blake Michael. 2015. 4.1 22nm Next-generation IBM system z microprocessor. In 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers. IEEE Press, San Francisco, CA, USA, 1–3.
  [116] Wolf M. M., Berry J. W., and Stark D. T.. 2015. A task-based linear algebra building blocks approach for scalable graph analytics. In 2015 IEEE High Performance Extreme Computing Conference (HPEC’15). IEEE Press, Waltham, MA, USA, 1–6.
  [117] Yamamura Shuji, Akizuki Yasunobu, Sekiguchi Hideyuki, Maruyama Takumi, Sano Tsutomu, Miyazaki Hiroyuki, and Yoshida Toshio. 2022. A64FX: 52-core processor designed for the 442PetaFLOPS Supercomputer Fugaku. In IEEE International Solid-State Circuits Conference, ISSCC 2022, San Francisco, CA, USA, February 20–26, 2022. IEEE, San Francisco, CA, USA, 352–354.
  [118] Yoshida Toshio. 2018. Fujitsu high performance CPU for the Post-K computer. In 2018 IEEE Hot Chips 30 Symposium (HCS). IEEE Computer Society, California, USA, 22. http://www.fujitsu.com/jp/Images/20180821hotchips30.pdf
  [119] Young Vinson, Chou Chiachen, Jaleel Aamer, and Qureshi Moinuddin. 2018. ACCORD: Enabling associativity for gigascale DRAM caches by coordinating way-install and way-prediction. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Angeles, CA, USA, 328–339.
  [120] Zhang Yuang, Li Li, Lu Zhonghai, Jantsch Axel, Gao Minglun, Pan Hongbing, and Han Feng. 2014. A survey of memory architecture for 3D chip multi-processors. Microprocessors and Microsystems 38, 5 (2014), 415–430.

Published in: ACM Transactions on Architecture and Code Optimization, Volume 20, Issue 4 (December 2023), 426 pages. ISSN: 1544-3566, EISSN: 1544-3973, DOI: 10.1145/3630263. Issue Editor: David Kaeli.

Publisher: Association for Computing Machinery, New York, NY, United States.

Copyright © 2023 held by the owner/author(s). This work is licensed under a Creative Commons Attribution International 4.0 License.

Publication History: Received 21 December 2022; Revised 5 October 2023; Accepted 13 October 2023; Online AM 25 October 2023; Published 14 December 2023.
