Abstract
Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56× for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.
1 INTRODUCTION
Historically, the reliable performance increase of von Neumann-based general-purpose processors (CPUs) was driven by two technological trends. The first, observed by Gordon E. Moore [76], is that the number of transistors in an integrated circuit doubles roughly every two years. The second, called Dennard’s scaling [30], postulates that as transistors get smaller their power density stays constant. These trends synergized well, allowing computer architectures to continuously improve performance through, for example, aggressive pipelining and superscalar techniques without running into thermal limitations by, e.g., reducing the operating voltage. In the early 2000s, Dennard’s scaling ended [51] and forced architects to shift their attention from improving instruction-level parallelism to exploiting on-chip multiple-instruction multiple-data parallelism [43]. This immediate remedy to the end of Dennard’s scaling applies to this day in the form of processors such as Fujitsu A64FX [96], AMD Ryzen [105], or NVIDIA GPUs [79, 86].
Unfortunately, Moore’s law is nearing its end [107], and we are entering a post-Moore era [112], home to a diversity of architectures, such as quantum-, neuromorphic-, or reconfigurable computing [49]. Many of these prototypes hold promise but are still immature, focus on a niche use case, or incur long development cycles. However, there is one salient solution that is growing in maturity and which can facilitate performance improvements in the decades to come even for the classic von Neumann CPUs we have come to rely upon—3D integrated circuit (IC) stacking [14]. 3D ICs refer to the general technologies of vertically building integrated circuits and can be realized in multiple ways, such as by stacking multiple discrete dies and connecting them using coarse through-silicon vias (TSVs) or by growing the 3D integrated circuit monolithically on the wafer [100].
Recent advances in 3D integrated circuits have enabled many times higher capacity for on-chip memory (caches) than traditional systems (e.g., AMD V-Cache [40]). Intuition tells us that an increased cache size, resulting from 3D-stacking, will help alleviate the performance bottlenecks of key scientific applications. To demonstrate this, we conduct a pilot study where we execute one of the important proxy-apps from the DoE Exascale Computing Project (ECP) suite, MiniFE [50] (cf. Section 3.3), on AMD EPYC Milan and Milan-X CPUs—two architecturally similar processors with vastly different L3 cache sizes [17]. Figure 1 overviews the result of our pilot study: for a subset of problem sizes, in particular the 160 × 160 × 160 input, the 3× larger L3 capacity of Milan-X yields up to 3.4× improvement over the baseline Milan for this memory-bound application, which motivates us to further research 3D-stacked caches.
3D integrated circuits have various benefits [52], including (i) shorter wire lengths in the interconnect leading to reduced power consumption, (ii) improved memory bandwidth through on-chip integration that can alleviate performance bottlenecks in memory-bound applications, (iii) higher package density yielding more compute and smaller system footprint, and (iv) possibly lower fabrication cost due to smaller die size (thus improved yield). All these are very desirable benefits in today’s exascale (and future) High-Performance Computing (HPC) systems. But how far can 3D ICs (with a focus on increased on-chip cache) take us in HPC?
Contributions: We study our research questions from different levels of abstraction: (i) we design a novel exploration framework that allows us to simulate HPC applications running on a hypothetical processor with an infinitely large L1D cache. We use this framework, which is orders of magnitude faster than cycle-accurate simulators, to estimate an upper bound for cache-based improvements; (ii) we model a hypothetical LARge Cache processor (LARC) that builds on the design of A64FX, with a last-level cache (LLC) composed of eight stacked SRAM dies under a 1.5 nm manufacturing assumption; (iii) we complement our study with a plethora of simulations of HPC proxy-applications and CPU micro-benchmarks; and lastly (iv) we find that over half (31 out of 52) of the simulated applications experience a \(\ge \,2\times\) speedup on LARC’s Core Memory Group (CMG), which occupies only one fourth the area of the baseline A64FX CMG. For applications that are responsive to larger cache capacity, this translates to an average improvement of 9.56× (geometric mean) when we assume ideal scaling and compare at the full-chip level.
The novelty in this paper lies in the purpose which LARC serves, not in the design of LARC itself. As Figure 2 shows, the capacity (and bandwidth; not shown) of the LLC has increased at a moderately gradual slope over the last two decades—with Milan-X being a noticeable outlier in per-core LLC. We, however, investigate the effect on HPC applications of an LLC that is an order of magnitude above the trend line depicted in Figure 2. On top of our provided baseline, further application-specific restructuring to utilize large caches [69] will yield even greater benefit.
2 CPUS EMPOWERED WITH HIGH-CAPACITY CACHE: THE FUTURE OF HPC?
The memory bandwidth of modern systems has been the bottleneck (the “memory wall” [71]) ever since CPU performance started to outgrow the bandwidth of memory subsystems in the early 1990s [70]. Today, this trend continues to shape the performance optimization landscape in high-performance computing [83, 85]. Diverse memory technologies are emerging to overcome said data movement bottleneck, such as Processing-in-Memory (PIM) [12], 3D-stackable High-Bandwidth Memory (HBM) [74], deeper (and more complex) memory hierarchies [115], and—the topic of the present paper—novel 3D-stacked caches [14, 68, 98].
In this study, our aspiration is to gauge the far end of processor technology and how it may evolve six to eight years from now, circa 2028, when processors using 1.5 nm technology are expected to be available according to the IEEE IRDS Roadmap [53, Figure ES9]. More specifically, as 3D-stacked SRAM memory [120] becomes more common, what are the performance implications for common HPC workloads, and what new challenges lie ahead for the community? Before attempting to understand what performance may look like six years from now, however, we must describe how the processor itself might change. In this section, we introduce, motivate, and reason about our design choices for what we envision as a hypothetical CPU that capitalizes on large-capacity 3D-stacked cache, called LARC (LARge Cache processor) for short. Before looking at LARC, we must first set and analyze a baseline processor.
2.1 LARC’s Baseline: The A64FX Processor
We choose to base our future CPU design on the A64FX [118]. Fujitsu’s Arm-based A64FX is powering Supercomputer Fugaku [96], leader of the HPCG (TOP500 [104]; cf. Section 3.3) and Graph500 performance charts. A64FX is manufactured in 7 nm technology and has a total of 52 Arm cores (with Scalable Vector Extensions [103]) distributed across four compute clusters, called Core Memory Groups (CMGs). Within each CMG, twelve cores are available to the user and one core is exclusively used for management. Each core has local 64 KiB instruction- and data-caches, and is capable of delivering 70.4 Gflop/s (IEEE-754 double-precision) performance—accumulated: 845 Gflop/s per CMG (user cores) or 3.4 Tflop/s for the entire chip. Each CMG contains an 8 MiB L2 cache slice, delivering over 900 GB/s bandwidth to the CMG [118]. The combined L2 cache, which is the CPU’s 32 MiB last-level cache (LLC), is kept coherent through a ring interconnect that connects the four CMGs. Inside the CMG, a crossbar switch is used to connect the cores and the L2 slice. The L2 cache has 16-way set associativity, a line size of 256 bytes, and the bus width between the L1 and L2 cache is set to 128 bytes (read) and 64 bytes (write).
We emphasize that our aim is not to propose a successor of A64FX, nor are we particularly restricting our vision by the design constraints of A64FX (e.g., power budget). However, we build our design on A64FX because: (i) as mentioned above, A64FX represents the high end in performance for commercially available CPUs, so it is a logical starting point; (ii) A64FX is the only commercially available CPU, currently in continued production, with HBM. The expected bandwidth ratio between future HBM and future 3D-stacked caches is similar to the ratio between traditional DRAM and LLC bandwidths [80], which is what applications and performance models are accustomed to; (iii) the A64FX LLC design (particularly the L2 slices connected by a crossbar switch) happens to be convenient and thus requires minimal effort to extend in a simulated environment.
In conclusion, while we extend the A64FX architecture, our workflow itself can be generalized to cover any of the processors supported by CPU simulators (e.g., variants of gem5 [13] can simulate other architectures, including x86).
2.2 Floorplan Analysis for Fujitsu A64FX
In order to estimate the floorplan of the future LARC processor built on 1.5 nm technology, we first need the floorplan of the current A64FX processor built at 7 nm. We know that the die size of A64FX is \({\approx }400\text{mm}^{2}\) [96]. With the openly available die shots including highlighted processor core segments [82], we can estimate most of the A64FX floorplan, including the size of CMGs and processor cores, as shown in Figure 3. Overall, each CMG is \({\approx }48\text{mm}^{2}\) in area, and each A64FX core occupies \({\approx }2.25\text{mm}^{2}\). The remaining parts of the CMG consist of the L2 cache slice and controller as well as the interconnect for intra-CMG communication.
2.3 From A64FX’s to LARC’s CMG Layout
Knowing the floorplan, we proceed to describe how we envision the CMG design in 1.5 nm technology. We scale the CMG by moving four generations, from 7 nm to 1.5 nm, which reduces the silicon footprint by around 8× (\({\approx }\,\text{1.7$\times $}\) per generation) for the entire CMG [39]. The new CMG consumes as little as 6 mm² of silicon area. Next, we reclaim the area currently occupied by the L2 cache and controller and replace it with three additional CPU cores, yielding a total of 16. Further, in line with the projected year 2019\(\rightarrow\)2028 growth in the number of cores [54, Table SA-1], we double the core count of the CMG to 32, which leads to it occupying \({\approx }\ 12 \text{mm}^{2}\) of silicon area. We pessimistically leave the interconnect area unchanged and continue to use it as the primary means for communication. We call this new variant LARC’s CMG. Finally, we assume the same die size, and hence LARC would have 16 CMGs, each with 32 cores, compared to A64FX’s four CMGs with 12+1 cores each. For LARC, we ignore the management core. However, our performance analysis will remain on the CMG level, instead of the full chip, due to limitations we detail in Section 3.2.
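The area arithmetic above can be summarized in a few lines (a sketch; the \({\approx }1.7\times\) per-generation shrink is the scaling assumption taken from [39], and doubling the core count is approximated as doubling the CMG area):

```python
# Sketch of the 7 nm -> 1.5 nm CMG area scaling described in the text.
A64FX_CMG_AREA = 48.0   # mm^2 per CMG at 7 nm (from the floorplan analysis)
SHRINK_PER_GEN = 1.7    # assumed area reduction per node generation [39]
GENERATIONS = 4         # 7 nm -> 1.5 nm spans four generations

shrink = SHRINK_PER_GEN ** GENERATIONS        # ~8.35x overall ("around 8x")
cmg_area_15nm = A64FX_CMG_AREA / shrink       # ~5.7 mm^2 ("as little as 6 mm^2")
cmg_area_32_cores = 2 * cmg_area_15nm         # double the cores -> ~12 mm^2

print(round(shrink, 2), round(cmg_area_15nm, 1), round(cmg_area_32_cores, 1))
```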
2.4 LARC’s Vertically Stacked Cache
In the above design, we removed the L2 cache and controller from the CMG of LARC. We now assume that the L2 cache can be placed directly on top of the CMG through 3D stacking [68]. We build our estimations on experiments by Shiba et al. [98], who demonstrated the feasibility of stacking up to eight SRAM dies on top of a processor using a ThruChip Interface (TCI). The capacity and bandwidth of stacked memory is a function of several parameters: the number of channels available (\(N_\text{ch}\)), the per-channel capacity (\(N_\text{cap}\) in KiB), their width (W in bytes), the number of stacked dies (\(N_\text{dies}\)), and the operating frequency (\(f_\text{clk}\) in GHz). Shiba et al. [98] estimated that at a 10 nm process technology, eight stacks would provide \({\approx }\ 512 \text{MiB}\) of aggregated SRAM capacity on a footprint of \({\approx }\ 121 \text{mm}^{2}\). In their design, each die has 128 channels of 512 KiB capacity. In our work, we conservatively assume an 8× scaling from 10 nm to 1.5 nm, and thus, at 12 mm² area (the size of one LARC CMG), \(N_\text{ch}\) on each die would be \({\approx }\,\text{102}\) (i.e., the original 128 channels scaled by the 8× density gain on a 12 mm² instead of a 121 mm² footprint: \(128 \cdot 8 \cdot 12/121 \approx 102\)).
We round \(N_\text{ch}\) to a nearby sum of powers of two, viz., \(N_\text{ch}=\text{96}\) (= 64 + 32). Thus, with eight stacked dies (\(N_\text{dies}=\text{8}\)), our 3D SRAM cache has a total storage capacity of \(N_\text{dies} \cdot N_\text{ch} \cdot N_\text{cap}= 384\ \text{MiB}\) per CMG. We estimate the bandwidth in a similar way. We know from previous studies [98] that 3D-stacked SRAM, built on 40 nm technology, can operate at 300 MHz. We conservatively expect the same SRAM to operate at \(f_\text{clk}=1\) GHz when moving from 40 nm \(\rightarrow\) 1.5 nm. To account for the increased working-set size of future applications, we assume a channel width (W) of 16 bytes, compared to the 4-byte width assumed in [98]. With this, the CMG bandwidth becomes: \(N_\text{ch} \cdot f_\text{clk} \cdot W = 1536\ \text{GB/s}\). The read and write latency of their SRAM cache is 3 cycles, including the vertical data-movement overhead [98].
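For reference, the two formulas evaluate as follows with the parameter values chosen above (a sketch; variable names mirror the symbols in the text):

```python
# Evaluating the stacked-SRAM capacity and bandwidth formulas from the text.
N_DIES = 8        # stacked SRAM dies
N_CH = 96         # channels per die (rounded down from ~102)
N_CAP_KIB = 512   # per-channel capacity in KiB
F_CLK_GHZ = 1.0   # assumed operating frequency at 1.5 nm
W_BYTES = 16      # assumed channel width

capacity_mib = N_DIES * N_CH * N_CAP_KIB / 1024  # per-CMG capacity in MiB
bandwidth_gbs = N_CH * F_CLK_GHZ * W_BYTES       # per-CMG bandwidth in GB/s
print(capacity_mib, bandwidth_gbs)  # 384.0 1536.0
```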
While stacked DRAM caches theoretically provide higher capacity than stacked SRAM caches, they have limitations. For example, the latency of stacked DRAM is only about 50% lower than that of DDR3 DRAM, and hence misses in such a cache remain costly; stacked DRAM requires refresh operations, which consume energy and reduce availability; and due to their large size, stacked DRAM caches require special techniques for managing metadata and avoiding bandwidth bloat [23, 74]. The tag array of a stacked DRAM cache may exceed the capacity of an on-die SRAM LLC, and hence the tags may need to be stored in the DRAM itself, which worsens hit latency. Set-associative designs and serial tag-data accesses further increase hit latency. Proposed architectural techniques and mitigation strategies, such as the Loh-Hill cache [67], have yet to fully solve these problems. By contrast, 3D SRAM caches do not suffer from any of these issues. In fact, at iso-capacity, a 3D SRAM cache has even lower access latency than a 2D SRAM cache. Since stacked 3D SRAM caches have lower capacity than stacked DRAM, their metadata (e.g., tags) can be easily stored in SRAM itself, further reducing the access latency.
For our cache design, we assume a 256 B cache block, which avoids bandwidth bloat. Each tag takes 6 B, and as such, the total tag-array size for each CMG becomes 9 MiB. This tag array can be easily placed in the cache itself. We assume that tag and data accesses happen sequentially. The tags and data of a cache set are stored on a single die; hence, on every access, only one die needs to be activated. Since this takes only a few cycles, the overall miss penalty remains small and comparable to that of A64FX’s LLC.
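The 9 MiB figure follows directly from the block and tag sizes; as a quick sketch:

```python
# Tag-array sizing for the 384 MiB, 256 B-block cache described in the text.
CAPACITY_B = 384 * 1024 * 1024  # 384 MiB data capacity per CMG
BLOCK_B = 256                   # cache block size
TAG_B = 6                       # tag bytes per block

n_blocks = CAPACITY_B // BLOCK_B                  # number of cache blocks
tag_array_mib = n_blocks * TAG_B / (1024 * 1024)  # total tag storage in MiB
print(n_blocks, tag_array_mib)  # 1572864 9.0
```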
To show that our cache projections are realistic, we compare them with AMD’s 3D V-Cache design. It uses a single stacked die for the L3 cache, providing 64 MiB capacity (in addition to the 32 MiB cache in the base die) at 7 nm [26, 40] with only 3 to 4 cycles of extra latency compared to the non-stacked version [21]. The stacked die occupies 36 mm² and provides a bandwidth of 2 TB/s. When stacking additional dies on top, and assuming an 8× area scaling by going from 7 nm to 1.5 nm, we speculate that the LLC capacity of this commercial processor could easily exceed that of our proposed LARC.
2.5 LARC’s Core Memory Group (CMG)
At last, we detail our experimental CMG built on a hypothetical 1.5 nm technology: the LARC CMG. An illustration of this system is shown in Figure 3. Each CMG consists of 32 A64FX-like cores, each of which keeps its 64 KiB L1 instruction- and data-caches, yielding a per-CMG performance of \(\approx \,\)2.3 Tflop/s (IEEE-754 double-precision). A 384 MiB L2 cache is stacked vertically on top of the CMG through eight SRAM layers.
We keep the HBM memory bandwidth per CMG to its current A64FX value of 256 GB/s to be able to quantify performance improvements from the proposed large capacity 3D cache in isolation from any improvements that would come from increased HBM bandwidth. Furthermore, we make no assumption on the technology scaling of blocks that contain hard-to-scale-down analog components (e.g., TofuD or PCIe IP blocks) and instead focus exclusively on scaling the CMG-part of the System-on-Chip (i.e., processing cores, L1/L2 caches, and intra-chip interconnects).
While our study focuses on evaluating a single CMG, we conclude that a complete, hypothetical LARC CPU with a die size similar to the current A64FX would contain 512 processing cores, 6 GiB of stacked L2 cache, a peak L2 bandwidth of 24.6 TB/s, a peak HBM bandwidth of 4.1 TB/s, and a total of 36 Tflop/s of raw, double-precision compute. The A64FX processor has a peak HBM bandwidth of 1 TB/s, whereas our envisioned LARC CPU has 4× more CMGs and hence a peak HBM bandwidth of 4.1 TB/s; compared to A64FX, LARC thus has a higher effective external-memory bandwidth. Further changes to the HBM generation are beyond the scope of this study.
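The chip-level figures follow directly from the per-CMG values; a minimal sketch (variable names are ours):

```python
# Aggregating the per-CMG figures to the full hypothetical LARC chip.
N_CMGS = 16
CORES_PER_CMG = 32
FLOPS_PER_CORE = 70.4  # Gflop/s, double precision (A64FX-like core)
L2_PER_CMG_MIB = 384   # stacked L2 per CMG
L2_BW_PER_CMG = 1536   # GB/s per CMG
HBM_BW_PER_CMG = 256   # GB/s per CMG, kept at the A64FX value

cores = N_CMGS * CORES_PER_CMG               # 512 cores
l2_gib = N_CMGS * L2_PER_CMG_MIB / 1024      # 6 GiB stacked L2
l2_bw_tbs = N_CMGS * L2_BW_PER_CMG / 1000    # ~24.6 TB/s peak L2 bandwidth
hbm_bw_tbs = N_CMGS * HBM_BW_PER_CMG / 1000  # ~4.1 TB/s peak HBM bandwidth
flops_tf = cores * FLOPS_PER_CORE / 1000     # ~36 Tflop/s double precision

print(cores, l2_gib, round(l2_bw_tbs, 1), round(hbm_bw_tbs, 1), round(flops_tf, 1))
```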
2.6 LARC’s Power and Thermal Considerations
To estimate the power consumption of LARC, we analyze A64FX’s current consumption and extrapolate to 1.5 nm by leveraging public technology roadmaps. A64FX’s peak power, achieved while running DGEMM, is 122 W [117], of which 95 W correspond to core power and 15 W to the memory interfaces (MIFs); hence, we conclude 1.98 W/core and 3.75 W/MIF. Therefore, a LARC CMG with 32 cores in 7 nm would consume 67.1 W. TSMC projects that shrinking from 7 nm to 5 nm yields a power reduction of about 30% [99], i.e., 46.98 W for LARC’s CMG in 5 nm. IRDS’s roadmap [53, Figure ES9] indicates a further compounded power reduction (at iso-frequency) of 42% when moving from 5 nm to 1.5 nm, i.e., 27.37 W for LARC’s CMG in 1.5 nm. As the full LARC chip is estimated to include 16 CMGs, we project a total power of 438 W (not including the L2 cache).
Next, we estimate the power consumed by the principal part of this study—the 384 MiB L2 cache. A 4 MiB SRAM L2 cache in 7 nm consumes 64 mW of static power [44]. Assuming a similar (pessimistic) static power consumption at 1.5 nm and extrapolating to 384 MiB, we find that our cache would have a static power consumption of 6.14 W. Scaled to the full 16 CMGs of our hypothetical LARC, we arrive at a static power consumption of 98.3 W. The static power consumption of caches represents between 90% and 98% of their entire power consumption (at 350 K temperature; see, e.g., [5, 20]), where the remainder is the dynamic power consumption. If we assume a pessimistic 9:1 ratio between static and dynamic power, this yields a total power consumption of 109.23 W for 6 GiB of chip-wide stacked L2 cache.
To conclude, a LARC processor (16 CMG) would have to be designed for a thermal design power (TDP) of 547 W. While this expected TDP is more than the current A64FX, it is not entirely unlike emerging architectures, such as NVIDIA’s H100 [81] that consumes up to 700 W or the AMD Instinct MI250X GPU [3] at 560 W . We stress that our estimate of 547 W is peak power draw achieved only during parallel DGEMM execution. Adjusting for Stream Triad, based on the breakdown in [117], we conclude a realistic, and considerably lower, power consumption of 420 W for bandwidth-bound applications running on the whole LARC chip.
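As a sanity check, the chain of estimates above can be reproduced in a few lines (a sketch; inputs are the measured A64FX breakdown and the roadmap scaling factors cited in the text, and the small deviations from 438 W and 547 W stem from intermediate rounding in the text):

```python
# Power-budget sketch for the hypothetical LARC chip.
W_PER_CORE_7NM = 95 / 48  # ~1.98 W/core (DGEMM peak, 48 compute cores)
W_PER_MIF_7NM = 15 / 4    # 3.75 W per memory interface
CORES_PER_CMG, N_CMGS = 32, 16

cmg_7nm = CORES_PER_CMG * W_PER_CORE_7NM + W_PER_MIF_7NM  # ~67.1 W per CMG
cmg_5nm = cmg_7nm * (1 - 0.30)   # TSMC: ~30% reduction at 5 nm
cmg_15nm = cmg_5nm * (1 - 0.42)  # IRDS: further ~42% reduction at 1.5 nm
logic_w = N_CMGS * cmg_15nm      # ~436 W (text rounds to 438 W), sans L2

static_cache_w = N_CMGS * (384 // 4) * 0.064  # 64 mW per 4 MiB slice -> ~98.3 W
cache_w = static_cache_w / 0.9                # 9:1 static:dynamic -> ~109.2 W
tdp = logic_w + cache_w                       # ~545 W (text: 547 W)
print(round(logic_w), round(cache_w, 1), round(tdp))
```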
Finally, while this L2 cache power estimation might appear pessimistic, there are ample opportunities to further reduce power consumption. To save static energy, all un-accessed dies can be switched to a data-retentive, low-power (sleep) state. To deal with remaining thermal issues, after stacking the cache layers underneath the cores instead of on top, one can additionally adopt simple direct-die cooling or advanced techniques [18, 106], such as high-\(\kappa\) thermal compound [42], microfluidic cooling [114], or thermal-aware floorplanning, task-scheduling, and data-placement optimizations. Specifically, microfluidic cooling can handle power densities of 3.5 W/mm² and hot-spot power levels of over 20 W/mm² for 3D-stacked chips [1]. By contrast, our LARC CPU has a power density of 2.85 W/mm² at 192 mm² if we ignore adjunct components such as the I/O die, PCIe, TofuD interface, and the like, and around half that power density at 400 mm² if these components are included.
3 PROJECTING PERFORMANCE IMPROVEMENT IN SIMULATED ENVIRONMENTS
Analyzing LARC’s feasibility is only the first step; we also have to demonstrate the effects of the proposed changes on real workloads to allow a meaningful cost-benefit analysis by CPU vendors. This section details two simulation approaches (one novel; one established) and discusses the HPC applications that we evaluate extensively in Sections 4 and 5.
3.1 Simulating Unrestricted Locality with MCA
Designing and executing even initial studies (i.e., no complex memory models, etc.) with cycle-level gem5 simulations of realistic workloads takes substantial time with unknown outcome. Therefore, one would want a first-order approximation of a very large and fast cache. Regrettably, and to the best of our knowledge, existing approaches for fast first-order approximations generally do not support complex HPC applications, i.e., the existing tools neither handle multi-threading correctly nor support MPI applications [6]. Hence, we design a simulation approach, using Machine Code Analyzers (MCA), which can estimate the speedup for a given application orders of magnitude faster than gem5 (typically hours instead of months; cf. next section). This upper bound in expected performance improvement allows us to: (i) get a perspective on the best possible performance improvement if all reads/writes can be satisfied from the cache; and (ii) justify more accurate simulations and classify their results with respect to the baseline and the upper bound.
Machine Code Analyzers, such as llvm-mca [66], have been designed to study microarchitectures, improve compilers, and investigate resource pressure for application kernels. Usually, the input for these tools is a short Assembly sequence, and they output, among other things, an expected throughput for a given CPU when the sequence is executed many times and all data is available in the L1 data cache. For most real applications, the latter assumption is obviously incorrect; however, it is ideal for gauging an upper bound on performance when all memory bottlenecks disappear.
Unfortunately, it is neither feasible to record all executed instructions in one long sequence, nor to analyze a full program sequence with llvm-mca. Hence, we break the program execution into basic blocks (at most tens or hundreds of instructions) and evaluate their throughput individually. For a given combination of a program and input (called workload hereafter), the basic blocks and their dependencies create a directed Control Flow Graph (CFG) [56] with one source (program start) and one sink (program termination). All intermediate nodes (representing basic blocks) of the graph can have multiple parent- and dependent-nodes, as well as self-references (e.g., basic blocks of loop bodies).
We utilize the Software Development Emulator (SDE) [57] from Intel to record the basic blocks and their caller/callee dependencies for a workload with modest runtime overhead (typically on the order of a 1000× slowdown). SDE also notes down the number of invocations per CFG edge for a workload, i.e., how often the program counter (PC) jumped from one specific basic block to another. We developed a program which parses the output of Intel SDE and establishes an internal representation of the Control Flow Graph. The internal CFG nodes are then amended with Assembly extracted from the program’s binary, since SDE’s Assembly output is not compatible with Machine Code Analyzers. Our program subsequently executes a Machine Code Analyzer for each basic block, obtaining an estimated cycles-per-iteration metric (CPIter). We record the per-block CPIter at the directed CFG edge from caller to callee, which already holds the number of invocations of this edge, effectively creating a “weighted” graph. Figure 4 showcases the result, and it is easy to see that the summation over all edges in the CFG is equivalent to the estimated runtime of the entire workload (assuming all data resides in the L1 data cache).
The above outlined approach works for both sequential and parallel programs. Intel SDE can record the instruction execution and caller/callee dependencies for thread-parallel programs, e.g., pthreads, OpenMP, or TBB. Furthermore, we can attach SDE to individual MPI ranks to get the data for each rank. Therefore, we are able to estimate the runtime for MPI+X parallelized HPC applications by the following equation: (1) \(\begin{equation} \textstyle \text{t}_\text{app} := \frac{ \max \limits _{r\,\in \,\text{ranks}} \big (\max \limits _{t\,\in \,\text{threads}_{r}} (\,\sum \limits _{\text{edges}\,e\,\in \,\text{CFG}_{t,r}} \text{CPIter}_{e} \cdot \#\text{calls}_{e}\,)\big) }{ \text{processor frequency in Hz} } \end{equation}\) under the assumption that MPI ranks and threads do not share computational resources,1 where we sum up the number of cycles required for each block (i.e., CFG edges) considering only the “slowest” thread and rank, and divide by the CPU frequency to convert the total cycles into runtime.
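Equation (1) can be sketched in a few lines; the data layout, the function name, the example workload, and the 2 GHz frequency are illustrative assumptions, not part of our tool:

```python
# Minimal sketch of Eq. (1): per-(rank, thread) CFG edges are weighted by
# CPIter * #calls, the slowest thread of the slowest rank dominates, and the
# resulting cycle count is converted to seconds.
FREQ_HZ = 2.0e9  # assumed CPU frequency

def estimate_runtime(cfg_edges, freq_hz=FREQ_HZ):
    """cfg_edges: {rank: {thread: [(cpiter, n_calls), ...]}}."""
    worst_cycles = max(                      # slowest rank ...
        max(                                 # ... of its slowest thread
            sum(cpiter * calls for cpiter, calls in edges)
            for edges in threads.values()
        )
        for threads in cfg_edges.values()
    )
    return worst_cycles / freq_hz

# Two ranks with two threads each; edge lists hold (CPIter, #calls) pairs.
workload = {
    0: {0: [(4.0, 1_000_000), (2.5, 500_000)], 1: [(4.0, 900_000)]},
    1: {0: [(3.0, 2_000_000)], 1: [(3.0, 100_000)]},
}
print(estimate_runtime(workload))  # rank 1/thread 0 dominates: 6e6 cycles -> 0.003 s
```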
A self-imposed restriction of Machine Code Analyzers is their limited accuracy compared to cycle-accurate simulators, owing to their distinct design goal. To improve our CPIter estimate, we rely on four different MCAs, namely llvm-mca [66], Intel ACA (IACA) [55], uiCA [2], and OSACA [65], and take the median of the results. Another shortcoming of MCA tools is that most of them estimate the throughput of basic blocks in isolation while assuming looping behavior of the Assembly block (the PC jumps from the last back to the first instruction). Neither “block looping” nor an empty instruction pipeline (single iteration of the block) is realistic for some blocks. Hence, for non-looping basic blocks, we estimate the CPIter by feeding the MCA tool with the blocks of both caller and callee, and calculate the callee’s CPIter by subtracting the retirement cycle of the caller’s last instruction from that of the callee’s last instruction (instead of starting the count when the callee’s first instructions are decoded, which can overlap with the execution of caller instructions). Further, we correct some cycle estimates for specific instructions within our tool in post-processing, since we encountered a few unsupported or grossly mis-estimated instructions while validating our tool against benchmarks. We refer the reader to Section 4.1 for more details.
3.2 Cycle-level Accuracy: CPUs Simulated in gem5
While the MCAs can give a first-order approximation, we still require highly accurate predictions for our 3D-stacked, cache-rich CPU. Hence, we employ an open-source system architecture simulator, called gem5 [13]. It supports Arm, x86, and RISC-V CPUs to varying degrees of accuracy, and can be extended with memory models for higher simulation fidelity of the memory subsystem. We use gem5’s “syscall emulation” mode to execute applications directly without booting a Linux kernel.
Fortunately, RIKEN released their gem5 version, which was specially tailored for A64FX’s co-design to support SVE, HBM2, and other advanced features [94]. Hence, it is well suited to simulate our LARC proposal in Section 2.4. This version of gem5 has been validated for A64FX [62] and can be used with production compilers from Fujitsu. However, while evaluating RIKEN’s gem5, we noticed a few drawbacks, such as the lack of support for: (i) dynamically linked binaries; (ii) adequate memory management (freeing memory after application’s
We modify gem5 to remedy the first three problems. However, the last two problems remain intractable without major changes to the simulator’s codebase, and hence we limit ourselves to single-CMG simulations (with one MPI rank). Relying on the assumption that most HPC codes are weak-scaled across multiple NUMA domains and compute nodes, we believe the single-rank approach still serves as a solid foundation for future performance projections. However, even single-rank MPI binaries require numerous unsupported system calls. To circumvent this problem, we extend and deploy an MPI stub library [101].
3.3 Relevant HPC (Proxy-)Apps and Benchmarks
Instead of relying on a narrow set of cherry-picked applications, we attempt to cover a broad spectrum of typical scientific/HPC workloads. We customize and extend a publicly available benchmarking framework2 [34, 35] with a few additional benchmarks and necessary features to perform the MCA- and gem5-based simulations. The benchmark complexity ranges from simple kernels to large code bases (O(100,000s) lines-of-code) which are used by vendors for architecture comparisons and used by HPC centers for hardware procurements [41]. Hereafter, we detail the list of 127 included workloads, summed up across all benchmark suites, which are sized to fit within a single node and which could be simulated with gem5 in a reasonable time (\(\le \,\)six months).
Polyhedral Benchmark Suite.
The PolyBench/C suite contains 30 single-threaded, scientific kernels which can be parameterized in memory occupancy (\(\in [16\ \text{KiB}, 120\ \text{MiB}]\)) [90]. Unless stated otherwise, we use the largest configuration.
TOP500, STREAM, and Deep Learning Benchmarks.
High Performance Linpack (HPL) [36] solves a dense system of linear equations \(Ax = b\), of size 36,864 in our case. High Performance Conjugate Gradients (HPCG) [37] applies a conjugate gradient solver to a system of linear equations (with a sparse matrix A). We choose \(120^3\) for HPCG’s global problem size. BabelStream [29] evaluates the memory subsystem of CPUs and accelerators, and we configure 2 GiB input vectors. Moreover, we implement a micro-benchmark, DLproxy, to isolate the single-precision GEMM operation (\(m=\text{1577088}; n=\text{27}; k=\text{32}\)) which is commonly found in 2D deep convolutional neural networks, such as 224×224 ImageNet classification workloads [111].
NASA Advanced Supercomputing Parallel Benchmarks.
The NAS Parallel Benchmarks (NPB) [11, 110] consists of nine kernels and proxy-apps which are common in computational fluid dynamics (CFD). The original MPI-only set has been expanded with ten OpenMP-only benchmarks [60] and we select the
RIKEN’s Fiber Mini-Apps and TAPP Kernels.
To aid the co-design of Supercomputer Fugaku, RIKEN developed the Fiber proxy-application set [92], a benchmark suite representing the scientific priority areas of Japan. Additionally, RIKEN released scaled-down TAPP kernels [93] of their priority applications which are tailored for fast simulations with gem5 [62]. Our workloads are as follows: FFB [46] with the 3D-flow problem discretized into 50\(\times 50\times\)50 sub-regions; FFVC [84] using 144\(\times 144\times\)144 cuboids; MODYLAS [9] with the
Exascale Computing Project Proxy-Applications.
The US-based supercomputing centers curated a co-design benchmarking suite for their recent exascale efforts [41]. We select eleven applications of the aforementioned benchmarking framework with the following workloads. AMG [87] with the
3.3.1 SPEC CPU & SPEC OMP Benchmarks.
The Standard Performance Evaluation Corporation [102] offers, among others, two HPC-focused benchmark suites: SPEC CPU® 2017[speed] (ten integer-heavy, single-threaded benchmarks; ten OpenMP-parallelized, floating-point benchmarks) and SPEC OMP® 2012 (14 OpenMP-parallelized benchmarks). All SPEC tests hereafter are based on non-compliant runs with the
4 MCA-BASED SIMULATION RESULTS
Sections 4.1 and 4.2 are dedicated to our MCA-based estimation of the upper bound on performance improvement with abundant L1 cache. First, we evaluate the accuracy of this approach, and then apply the novel methodology to our benchmarking sets.
4.1 MCA-based Simulator Validation
During the development of our MCA-based simulator, we implemented numerous micro-benchmarks to fine-tune the CPI estimation capabilities while comparing the results to an Intel® Xeon® processor E5-2650v4 (formerly code named Broadwell). Our micro-benchmarks comprise MPI-/OpenMP-only, MPI+OpenMP, and single-threaded tests (exercising recursive functions, floating-point- or integer-intensive operations, L1-localised, or stream-like operation).
Needless to say, applying MCA-based simulations to full workloads or complex application kernels remains error-prone, since these tools are designed to analyze small assembly sequences and give no guarantee of accurate absolute performance numbers. Regardless, we validate the current state of our tool using PolyBench/C with
The data shows that, on average, our MCA-based method slightly overestimates performance, i.e., it predicts faster execution times than observed. Only seven out of 30 workloads are expected to run slower than what we measure on the real Broadwell (i.e., y-value \(\le\)1). For eight of the PolyBench tests, our tool estimates the runtime to be over 2× faster than our measurements. Hence, we conclude that for 73% of the micro-benchmarks, the MCA-based method is reasonably accurate: within a band from 2× slower to 2× faster. While a 2× discrepancy might appear high, we point out that our cross-validations using SST [95, 113] and third-party gem5 models [7] for Intel CPUs yield similar inaccuracies,4 while our MCA-based method is substantially faster.
Another indicator for the accuracy of our MCA-approach can be drawn from DGEMM (double precision
4.2 Speedup-potential with Unrestricted Locality
In this section, we take on the entire benchmark suite from Section 3.3 with the MCA-based approach and evaluate their speedup potential when all data fits into L1.
The baseline measurements for the speedup estimates are conducted on a dual-socket Intel Broadwell E5-2650v4 system with 48 cores (2-way hyper-threading enabled, cores fixed at 2.2 GHz, turbo boost disabled). For all listed benchmarks, excluding SPEC CPU and OMP, we focus on the solver times only, i.e., we ignore data initialization and post-processing phases. Since most proxy-apps are parallelized with MPI and/or OpenMP, we perform an initial sweep over possible configurations of ranks and threads to determine the fastest time-to-solution (TTS) for our strong-scaling benchmarks, and the highest figure-of-merit (as reported by the benchmarks) for weak-scaling workloads. The highest-performing configuration is then executed ten times to determine the TTS of the kernel as our reference point in Figure 6.
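This sweep-then-repeat procedure can be sketched as follows (a minimal illustration: the helper names are ours, not part of the benchmarking framework, and a real driver would also pin ranks and threads to cores and launch via mpirun):

```python
import statistics

CORES = 48  # dual-socket Broadwell E5-2650v4 baseline

def candidate_splits(cores=CORES):
    """All (ranks, threads) pairs whose product occupies every core."""
    return [(r, cores // r) for r in range(1, cores + 1) if cores % r == 0]

def fastest_config(time_once, repeats=10, cores=CORES):
    """Pick the split with the lowest median time-to-solution (TTS).

    `time_once(ranks, threads)` is a caller-supplied stand-in that
    launches one benchmark run and returns its measured TTS."""
    results = []
    for ranks, threads in candidate_splits(cores):
        tts = statistics.median(time_once(ranks, threads) for _ in range(repeats))
        results.append((tts, (ranks, threads)))
    return min(results)
```

For weak-scaling workloads, the same loop would maximize the reported figure-of-merit instead of minimizing the TTS.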
The same MPI/OpenMP configurations are then used for our MCA-based estimate. Under the assumption that some MPI-parallelized benchmarks experience imbalances, we randomly sample up to nine ranks (in addition to rank 0),5 execute the selected ranks with Intel SDE (and the remaining ranks normally), and calculate the estimated runtime using Equation (1) and the 2.2 GHz processor frequency. The resulting runtime estimate is divided by the measured runtime to determine the upper-bound speedup potential per application if all of its data were to fit into L1D, see Figure 6.
For PolyBench/C workloads, we see similar speedup trends as for its smallest inputs which we used in Figure 5, although the expected speedup for
NPB’s OpenMP version of a conjugate gradient (CG) solver is another workload with a large theoretical performance gain of 13.1×. In total, we expect a (GM=)3× gain for all NAS Parallel Benchmarks; specifically, (GM=)4× for the OpenMP versions and (GM=)2.3× for the MPI versions. The potential gain for CG is not surprising, since these solvers are predominantly bound by memory bandwidth and are sensitive to memory latency [38]. High Performance Linpack is, unsurprisingly, not expected to gain any performance by placing all its data into L1 cache, as this benchmark is compute-bound; in fact, our MCA tool even predicts a small slowdown of 11%. By contrast, DLproxy, which uses MKL’s SGEMM, would benefit from a large L1, since MKL cannot achieve peak Gflop/s for the tall/skinny matrix in this workload (cf. Section 3.3). XSBench and miniAMR show the highest gains among ECP’s and RIKEN’s proxy-apps, at 7.3× and 7.4×, respectively. This appears to be in line with the expectation from the roofline characteristics of the benchmarks when measured on a similar compute node [33].
A deeper look at the roofline analysis in [33] reveals no strong correlation between the position of an application on the roofline model and the expected performance gain from running solely out of L1D cache. We speculate that our MCA approach exposes other, hidden bottlenecks, such as data dependencies and lack of concurrency in the applications, which limit the expected speedup. Apart from noticeable outliers in the expected speedup, such as lbm, ilbdc, and especially swim, the potential from an enlarged L1D is rather slim for SPEC: only a (GM=)1.9× runtime reduction can be expected across all 34 workloads.
5 GEM5-BASED SIMULATION RESULTS
In Section 5.1, we detail our choice of simulated architectures in gem5. Mirroring the structure of the MCA-based simulations, Sections 5.2 and 5.3 present our validation of gem5 for the proposed CPU architectures and evaluate numerous benchmarks and proxy-applications on said architectures; we summarize the results in Section 5.4.
5.1 LARC CMG Models in gem5 and A64FXS Baseline
As we discussed in Section 2.4, we envision one LARC CMG to have 32 cores, 384 MiB L2 cache, and 1.6 TB/s L2 bandwidth. Regrettably, gem5 (at least RIKEN’s version) can only be configured with L2 cache sizes that are powers of two, and therefore we have to scale LARC’s L2 cache size either up or down. Hence, we explore both as distinct options: one conservative and one technologically aggressive configuration. The conservative option, called LARCC, is limited to 256 MiB L2 cache at \(\sim \,\)800 GB/s, while the aggressive version, LARCA, doubles both values, to 512 MiB and \(\sim \,\)1.6 TB/s, respectively.
Starting from a baseline, i.e., a simulated version of A64FX which we label A64FXS, and in order to materialize the properties of the LARC CMG (cf. Section 2.4), we modify three parameters in our gem5 model: (i) the number of cores in the system, to 32 (up from A64FXS’ baseline of 12); (ii) the size of the total L2 cache, to match the capacity of the eight stacked layers (256/512 MiB, up from A64FXS’ L2 size of 8 MiB per CMG); and (iii) the number of L2 banks in LARCA, to control the bandwidth.
We introduce a fourth gem5 configuration, called A64FX32, which simulates one baseline A64FXS CMG but with 32 cores. These four configurations, A64FXS \(\rightarrow\)A64FX32 \(\rightarrow\)LARCC \(\rightarrow\)LARCA, allow us to determine the speedup gains from the larger core count and the larger L2 cache individually. The core frequency is universally set to 2.2 GHz. Table 2 summarizes the four gem5 configurations.
5.2 gem5-based Simulation and Configuration Validation
We perform OpenMP tests to verify our gem5 simulator for up to 32 cores. To validate the L2 cache size and bandwidth changes, we employ a STREAM Triad benchmark, parameterized to avoid cache line conflicts among participating threads. Splitting the A64FXS CMG L2 cache into 12 chunks (one per thread) yields a working size of 683 KiB; hence, the three 128 KiB vectors of the Triad operation fit into the L2 cache. We increase the total vector size in proportion to the number of threads and test the achievable L2 bandwidth for LARCC and LARCA. Additionally, Figure 7(a) includes the baseline A64FXS CMG scaled to 12 cores. The simulation shows that LARCC’s L2 bandwidth peaks at 792 GB/s and LARCA’s bandwidth reaches 1450 GB/s for this particular test case, which is, respectively, 1% and 9% lower than our estimates shown above. The baseline A64FXS closely matches the bandwidth of the real A64FX CPU executing this test.
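The sizing arithmetic behind this test can be written down explicitly (all values from the text; the rounding to 683 KiB is ours):

```python
# Per-thread share of the baseline A64FXS CMG L2 (8 MiB split across 12 threads).
L2_BYTES = 8 * 1024 * 1024
THREADS = 12
per_thread = L2_BYTES // THREADS        # 699050 B, i.e., ~683 KiB

# STREAM Triad touches three vectors per thread: a[i] = b[i] + scalar * c[i].
VECTOR_BYTES = 128 * 1024               # the 128 KiB vectors from the text
fits_in_l2 = 3 * VECTOR_BYTES <= per_thread
```

Since 3 × 128 KiB = 384 KiB sits well below the ~683 KiB per-thread budget, the Triad working set is guaranteed to be L2-resident on the baseline.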
In another validation test, we set the number of cores to the maximum (12 and 32, respectively) and scale the vector size from 2 KiB per core up to a total of 1 GiB for the three vectors. Figure 7(b) shows the results of this simulation. In the memory range of tens to hundreds of KiB, the Triad operation can be served from L1 cache, for which LARCC and LARCA show higher bandwidth: their 2.7× higher core count results in 2.6× higher aggregate L1 bandwidth. For memory sizes that fit into the L2 cache, we see behavior similar to Figure 7(a). Past 8 MiB, the A64FXS configuration shows the expected bandwidth drop to HBM2 level, while for LARCC and LARCA, the expected L2 cache bandwidth is maintained until 256 MiB and 512 MiB, respectively. This validates that our gem5 settings yield the expected LLC characteristics.
Lastly, to validate the LARC configuration and to observe how the changes affect more complex science kernels, we perform a sensitivity analysis of cache parameters with the RIKEN TAPP kernels. In Figure 8, we vary the L2 cache access latency, size, and bandwidth in ranges beyond our LARCC and LARCA target architectures. This analysis helps us adjust our expectations should future LARC-like architectures deviate from our design parameters, e.g., by stacking fewer SRAM layers or exhibiting higher L2 access latency. In this parameter sweep, LARCC serves as the baseline, and we vary one parameter while keeping the others fixed. The top row of Figure 8 shows the latency sweep, where we choose 22 cycles as the best latency (2× the data load latency from L1 for SVE instructions in A64FX). The worst case of 52 cycles is equidistant from our baseline in the opposite direction, and two additional latencies are selected in between. Similarly, we adjust the L2 size (middle row; simulating more or fewer SRAM stacks, or a larger or smaller semiconductor process node) and the L2 bank bits (bottom row of Figure 8), the latter of which indirectly controls the L2 bandwidth of the simulated architectures. The latency change has minimal impact, since HPC applications are typically not latency-bound. However, the L2 cache capacity and bandwidth can have a significant impact on performance, as expected, since they determine the amount of data that can be stored and accessed quickly. For some of the TAPP kernels, though, the performance is unaffected by these parameters,6 since these kernels are shrunk-down versions specifically designed for cycle-level architecture simulations, and therefore have a low memory footprint.
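The one-at-a-time sweep itself is simple to express; the sketch below is illustrative only (the dictionary keys are not gem5 option names, and apart from the 22/52-cycle extremes and the LARC_C capacity, the listed values are placeholders, with the 37-cycle baseline implied by the text's "equidistant" statement):

```python
# LARC_C serves as the baseline; each sweep varies one parameter while
# holding the others fixed. 22 and 52 cycles are equidistant from an
# implied baseline latency of 37 cycles; remaining values are placeholders.
BASELINE = {"l2_latency_cycles": 37, "l2_size_mib": 256, "l2_bank_bits": 4}
SWEEP = {
    "l2_latency_cycles": [22, 30, 44, 52],   # best/worst from the text, two in between
    "l2_size_mib": [64, 128, 512],           # fewer or more simulated SRAM stacks
    "l2_bank_bits": [3, 5],                  # indirectly scales L2 bandwidth
}

def configs():
    """Yield the baseline plus every one-parameter deviation from it."""
    yield dict(BASELINE)
    for key, values in SWEEP.items():
        for value in values:
            cfg = dict(BASELINE)
            cfg[key] = value
            yield cfg
```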
5.3 Speedup-potential with Restricted Locality
To further refine our projections gained by abundant cache, we proceed with the cycle-level simulations of the proxy-applications and benchmarks listed in Section 3.3.
We compile all benchmarks with Fujitsu’s Software Technical Computing Suite (v4.6.1) targeting the real A64FX, and simulate the single-rank workloads in gem5 for our four configurations. Unfortunately, three of our MPI-based benchmarks (MODYLAS, NICAM, and NTChem) require multi-rank MPI, and hence we omit them. Furthermore, we skip the MPI-only versions of NPB. Hereafter, we only report proxy-applications and benchmarks which ran to completion within gem5 (i.e., gem5 crashes or simulated application crashes are excluded when infeasible to patch, and simulations exceeding the six-month time limit are ignored).
The per-configuration speedup is given relative to the baseline A64FXS configuration. We exclude initialization and post-processing times, and measure only the main kernel runtime, except for the SPEC benchmarks as described in Section 4.2. These results are presented in Figure 9 and show the effects of the gradual expansion of simulated resources. The average (single CMG) speedups from LARCC and LARCA are \(\approx \,1.9\times\) and \(\approx \,2.1\times\), respectively, with some applications reaching \(\approx \,4.4\times\) for LARCC and \(\approx \,4.6\times\) for LARCA.
As expected, most benchmarks benefit from the additional cores and cache capacity, most prominently MG-OMP, which gains a small speedup of \(\approx \,1.3\times\) from the extra cores, a \(\approx \,2\times\) speedup from the extra cache, and reaches \(\approx \,4.6\times\) speedup with 512 MiB cache and higher bandwidth. Comparable incremental improvements across all three architecture steps are observable in other workloads, such as TAPP kernels 7 (DifferOpVer) and 17 (MatVecSplit), which scale well to multiple cores and are memory-bound, and hence benefit from both the additional cores and the cache capacity. TAPP kernels 19 and 20, XSBench, roms, and imagick (SPEC OMP) show similar runtime gains, but the difference between LARCC and LARCA is smaller, implying that the problem size either fits into the 256 MiB L2 (e.g., XSBench) or the workload reaches a point of diminishing returns with the 2× larger cache. TAPP kernels 8, 9, 12–15, and FT-OMP suffer a slowdown from cache contention on A64FX32; LARCC and LARCA avoid this contention, resulting in speedups similar to the benchmarks discussed earlier. EP-OMP, CoMD, and other compute-bound benchmarks benefit only from the higher core count, with both LARCs providing similar speedup as A64FX32.
Expectedly, single-threaded workloads (all of PolyBench’s benchmarks) show little to no improvement over A64FXS, i.e., they do not benefit from more cores. However, these benchmarks also do not show a performance gain from the larger 3D-stacked L2 cache, even though their working-set sizes exceed A64FXS’ 8 MiB L2 yet fit into LARC’s larger cache. We only see a limited speedup of (GM=)4.3% across all of them and no noteworthy outliers, and hence omit them from Figure 9. We attribute other outliers, such as the slowdown of imagick (SPEC CPU), to similar intrinsic properties of the benchmarks: our testing on a real A64FX reveals that imagick has a sweet spot at 8 OpenMP threads and scales negatively thereafter; and TAPP kernels 3–6 and 18 were customized for the 12-core A64FX CMG and cannot run effectively on 32 threads without a rewrite. Hence, we limit gem5 to 12 cores for these TAPP kernels, and we see that only the MatVecDotP kernel of the ADVENTURE application [4] benefits from a larger L2. Further proxy-applications and benchmarks missing from Figure 9, yet appearing in Figure 6, are the unfavorable result of persistent, repeatable simulator errors, which sometimes occur only after months of simulation.
We should note that, in some cases, the benchmarks’ implementation and the quality of the compiler may skew the results; for instance, BabelStream measures memory bandwidth on a 2 GiB buffer. Being unoptimized for A64FX, BabelStream’s baseline underperforms in per-core bandwidth (compared to the STREAM Triad tests in Figures 7(a) and 7(b)), which in turn inflates the performance gain when the number of cores increases to 32.
Overall, the speedup on A64FX32 can originate from the following reasons: (i) the program is compute-bound (a valid result); (ii) the workload exhibits both compute-bound and memory-bound tendencies in different components of a proxy-application (a valid result); (iii) the program is highly latency-bound, and hence the speedup can be the result of the larger aggregate L1 cache (a valid result); or (iv) a poor baseline resulting in a slightly misleading result.
We confirm the validity of attributing improvements to the high-capacity L2 by inspecting the L2 cache-miss rates of our gem5 simulations (with the miss rates of selected examples listed in Table 3). The reduction in cache-miss rates reported in the table is consistent with the performance improvements we observe in Figure 9.
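For reference, pulling such a miss rate out of a gem5 `stats.txt` dump can look like the sketch below; the stat key follows gem5's classic-cache naming convention, but the exact name varies across gem5 versions and cache models, so treat it as an assumption rather than a fixed interface:

```python
import re

def l2_miss_rate(stats_text):
    """Return the first L2 overall miss rate found in a gem5 stats dump,
    or None if no matching stat line is present."""
    pattern = re.compile(
        r"^\S*l2\S*\.overall_miss_rate::total\s+([0-9.]+)", re.MULTILINE)
    match = pattern.search(stats_text)
    return float(match.group(1)) if match else None
```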
5.4 Summary of the Results
Our gem5 simulations indicate that more than half (31 out of 52) of the applications experience a more than 2× speedup on LARCA compared to the baseline A64FXS CMG. For over two-thirds (24 out of 31) of these applications, the performance gains are directly attributable to the larger (3D-stacked) cache, i.e., at least a 10% gain by either of the two LARC configurations over the A64FX32 variant. Most notably, of all the RIKEN TAPP kernels that experience a meaningful speedup on LARC, a majority benefit from the expanded cache rather than from the increased number of cores. This carries particular weight, as these kernels are highly tuned for A64FX.
6 DISCUSSION AND LIMITATIONS
In this study, we simulated a single LARC CMG in gem5, and its potential future effect on common HPC workloads.
6.1 The Prospect of LARC
In reality, if a LARC processor were realized in 2028, it would contain 16 LARC CMGs, corresponding to the same silicon area as the current A64FX CPU, and it is important to understand what impact such a processor would have on the HPC community and its applications. Unfortunately, it is hard to give a conclusive answer to such a forward-looking question today. However, if we assume ideal scaling of both A64FX and LARC CMGs and compare at the full-chip level, then a LARC system in 2028 could give between 4.91× (
6.2 Considerations and Limitations
Our MCA-based estimation framework only gives a first-order approximation for a hypothetical CPU with an L1 cache large enough to host the entire data structures of a specific workload. This approach has advantages and disadvantages and should be used with caution, but it also has capabilities which we have not yet detailed, such as estimating the runtime of the same binary/workload on different (ISA-compatible) x86 systems by simply replacing the MCA target architecture and altering the CPU clock frequency.
We emphasize that we run applications as they are, i.e., without any algorithmic optimizations for the larger last-level cache, in our MCA- and gem5-based simulators. This is also true for our motivating experiment shown in Figure 1. While the cache capacity of AMD’s Milan-X CPU is about three times that of Milan, it is far from what we envision for 2028. Hence, our Milan-X results serve as a first-order indication of what SRAM, in its currently available state of the art, can offer.
Another notable aspect, which is outside the main scope of this extrapolation study, is the heat dissipation of CPU cores in the face of the 3D-stacked cache. It has been reported that AMD’s Milan-X carefully stacks caches above areas of the chip that are not used for compute, i.e., mostly above caches [77]. Our assumption is that, by 2028, manufacturing technologies will have advanced enough to overcome this limitation. Yet, for interested readers we provide further details on thermal and power estimates for our hypothetical LARC CPU in Section 2.6.
7 RELATED WORK
Stacked Memory and Caches: The size of the LLC has increased over the last 25 years [58], a trend anticipated to continue into the future. Yet, 2D ICs are becoming hard to exploit for additional performance, despite recent attempts by IBM [27, 59], and 3D-stacking is emerging as a promising alternative [52], as demonstrated by AMD’s 3D V-Cache [40], Samsung’s proposed 3D SRAM stacking solution [64] based on 7 nm TSVs, or the most recent study of 7 nm TCI-based 2- and 4-layer SRAM stacks by Shiba et al. [97]. Moreover, academics have explored 3D-stacked DRAM caches [48, 119], but these incur much higher latency and power consumption [74, 98]. Non-volatile memory has been considered as an LLC alternative, yet it suffers from similar latency issues [63]. Lastly, NVIDIA applied for a patent on an 8-layer memory stack fused with a processor die [28], theorizing a 50× improvement in bytes-to-flop ratio. What differentiates our work from that of our peers is: (i) we focus on the real-world impact of future caches, several orders of magnitude larger than those found today.
Performance modeling tools and methodologies: Computer architecture research is often based on simulators, such as the Structural Simulation Toolkit (SST) [95] or CODES [24], for efficiently evaluating and optimizing HPC architectures and applications. The gem5 simulator, by Binkert et al. [13], is widely used by academia and vendors for micro- and full-system architecture emulation and simulation, and supports validated models for x86 [7] and Arm [62]. We refer the interested reader to www.gem5.org/publications/ for a comprehensive library of gem5-based research and derivative works. What differentiates our work from that of our peers is: (ii) unlike prior work that utilizes (relatively) small kernels, our work operates on large-scale MPI/OpenMP-parallelized proxy-applications in order to quantify the impact of caches on realistic workloads. To our knowledge of reported research-driven gem5 simulations, this is the largest cycle-accurate simulation campaign conducted in terms of the aggregate number of instructions simulated (\(6.08 \times 10^{13}\)).
Other methods, such as MUSA by Grass et al. [45], are closer to our MCA-based approach, since MUSA uses PIN, which is the basis for Intel SDE (used in this study), but they focus on MPI analysis and multi-node workloads. We are not the first to utilize Machine Code Analyzers, see [2, 65] and derivative works such as [8, 19, 22, 25, 72, 91]. What differentiates our work from that of our peers is: (iii) instead of estimating accurate performance of existing system architectures, our MCA-based approach gauges the upper bound in obtainable performance, and exposes bottlenecks better than the roofline approach for common HPC applications.
8 CONCLUSION
We aspire to understand the performance implications of emerging SRAM-based die-stacking on future HPC processors. We first designed a methodology to project the upper bound that an infinitely large cache would have on relevant HPC applications. We find that several well-known HPC applications and benchmarks have ample opportunities to exploit an increased cache capacity.
We further expand our study by proposing a hypothetical processor (called LARC) in a 1.5 nm technology. This processor would have nearly 6 GiB of L2 cache, compared to our baseline A64FXS CPU architecture with 32 MiB of L2 cache. Next, we exercise a single LARC CMG with a plethora of HPC applications and benchmarks in the gem5 simulator and contrast the observed performance against the existing A64FXS CMG. We find that the LARC CMG would, on average, be 1.9× faster than the corresponding A64FXS CMG, while consuming only a quarter of the area. When area-normalized to the real A64FX CMG (assuming optimistic ideal scaling), we can expect an average boost of 9.56× for cache-sensitive HPC applications by the end of this decade.
Finally, we expect that larger caches will motivate and facilitate algorithmic advances which, in combination with the abundant cache, can potentially yield an order-of-magnitude gain in performance, as demonstrated by tile low-rank (TLR) approximations [69]. These approaches, however, require a minimum cache size to reach their full potential. We firmly believe that the combination of high-bandwidth, large, 3D-stacked caches and algorithmic advances is the path forward for the next generation of HPC processors attempting to break the “memory wall”.
9 FAIR COMMITMENT BY THE AUTHORS
We developed a framework of scripts and git submodules to manage the R&D of LARC, to set up the benchmarking infrastructure, and to perform the simulations. After cloning our repository https://gitlab.com/domke/LARC
(or downloading the artifacts from https://doi.org/10.5281/zenodo.6420658), one has access to all benchmarks (see Section 3.3), patches, scripts, and our collected data. Only minor modifications to the configuration files should be necessary, such as changing host names, paths to compilers, or downloading licensed third-party software, before testing on another system. If users deviate from our OS version (CentOS Linux release 7.9.2009, and
ACKNOWLEDGMENT
Furthermore, we thank Masazumi Nakamura from AMD for providing us early access to the Milan-X platform.
JD, EV, BG, and MW designed the study; JD and EV conducted the experiments; BG fixed gem5 issues; JD, EV, MW, AP, MP, LZ and PC analyzed the results; SM and AP developed the stacked cache model, and all authors participated in brainstorming and writing the manuscript.
Footnotes
1 Resource over-subscription is outside the scope of this study and our tool.
2 Exact benchmark versions, git commits, inputs, and the like, are provided in our artifacts which are referenced in Section 9.
3 For details of flags, tools, versions, and execution environments, please refer to Section 9.
4 A large-scale survey of academic simulators in realistic scenarios, beyond carefully selected and tuned micro-kernels, is, in our humble opinion, consequential, yet outside the scope of this paper. Reference [6], however, provides a data point.
5 Sampling at most ten out of all MPI ranks should not substantially alter the result but saves resources, since we have to execute SDE once per rank.
6 The MatVecSplit oddity (runtime increases for 128 MiB) needs further investigation. It shows an enlarged counter of LoadLockedRequests; this artifact could be attributed to software (such as the barrier implementation in the OpenMP runtime).
- [1] 2023. Heterogeneous Integration Roadmap 2023 Edition - Chapter 20: Thermal.
Technical Report . IEEE Electronics Packaging Society. 1–39. https://eps.ieee.org/images/files/HIR_2023/ch20_thermalfinal.pdfGoogle Scholar - [2] . 2021. A Parametric Microarchitecture Model for Accurate Basic Block Throughput Prediction on Recent Intel CPUs. https://arxiv.org/pdf/2107.14210.pdfGoogle Scholar
- [3] . 2021. AMD Instinct™ MI250X Accelerator. https://www.amd.com/en/products/server-accelerators/instinct-mi250xGoogle Scholar
- [4] . 2019. Development of Computational Mechanics System for Large Scale Analysis and Design — ADVENTURE Project. https://adventure.sys.t.u-tokyo.ac.jp/Google Scholar
- [5] . 2003. A single-V/Sub t/ low-leakage gated-ground cache for deep submicron. IEEE Journal of Solid-State Circuits 38, 2 (2003), 319–328. Google ScholarCross Ref
- [6] . 2019. A survey of computer architecture simulation techniques and tools. IEEE Access 7 (2019), 78120–78145. Google ScholarCross Ref
- [7] . 2019. Validation of the Gem5 simulator for X86 architectures. In 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE, Denver, CO, USA, 53–58. Google ScholarCross Ref
- [8] . 2021. Execution-cache-memory modeling and performance tuning of sparse matrix-vector multiplication and lattice quantum chromodynamics on A64FX. Concurrency and Computation: Practice and Experience (
Aug. 2021), 30. Google ScholarCross Ref - [9] . 2013. MODYLAS: A highly parallelized general-purpose molecular dynamics simulation program for large-scale systems with long-range forces calculated by fast multipole method (FMM) and highly scalable fine-grained new parallel processing algorithms. Journal of Chemical Theory and Computation 9, 7 (2013), 3201–3209. Google ScholarCross Ref
- [10] . 2022. NEK5000. http://nek5000.mcs.anl.govGoogle Scholar
- [11] . 1991. The NAS parallel benchmarks – summary and preliminary results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (SC’91). ACM, New York, NY, USA, 158–165. Google ScholarDigital Library
- [12] . 2014. Near-data processing: Insights from a MICRO-46 workshop. IEEE Micro 34, 4 (2014), 36–42. Google ScholarCross Ref
- [13] . 2011. The Gem5 simulator. ACM SIGARCH Computer Architecture News 39, 2 (
Aug. 2011), 1–7. Google ScholarDigital Library - [14] . 2006. Die stacking (3D) microarchitecture. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 39). IEEE Computer Society, USA, 469–479. Google ScholarDigital Library
- [15] . 1997. Matrix market: A web resource for test matrix collections. In Proceedings of the IFIP TC2/WG2.5 Working Conference on Quality of Numerical Software: Assessment and Enhancement. Chapman & Hall, Ltd., London, UK, 125–137. http://dl.acm.org/citation.cfm?id=265834.265854Google ScholarDigital Library
- [16] . 2012. Multi-block/Multi-core SSOR preconditioner for the QCD quark solver for K computer. Proceedings, 30th International Symposium on Lattice Field Theory (Lattice 2012): Cairns, Australia, June 24–29, 2012 LATTICE2012 (2012), 188. Google ScholarCross Ref
- [17] . 2022. AMD Releases Milan-X CPUs With 3D V-Cache: EPYC 7003 Up to 64 Cores and 768 MB L3 Cache. https://www.anandtech.com/show/17323/amd-releases-milan-x-cpus-with-3d-vcache-epyc-7003Google Scholar
- [18] . 2019. A survey of optimization techniques for thermal-aware 3D processors. Journal of Systems Architecture 97, C (
Aug. 2019), 397–415. Google ScholarDigital Library - [19] . 2011. Parallelism and data movement characterization of contemporary application classes. In Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA’11). Association for Computing Machinery, New York, NY, USA, 95–104. Google ScholarDigital Library
- [20] 2018. Analysing the role of last level caches in controlling chip temperature. IEEE Transactions on Sustainable Computing 3, 4 (2018), 289–305.
- [21] 2022. AMD's V-Cache Tested: The Latency Teaser. https://chipsandcheese.com/2022/01/14/amds-v-cache-tested-the-latency-teaser/
- [22] 2019. BHive: A benchmark suite and measurement framework for validating X86-64 basic block performance models. In 2019 IEEE International Symposium on Workload Characterization (IISWC). IEEE Press, Orlando, FL, USA, 167–177.
- [23] 2015. BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA'15). Association for Computing Machinery, New York, NY, USA, 198–210.
- [24] 2011. CODES: Enabling co-design of multi-layer exascale storage architectures. In Workshop on Emerging Supercomputing Technologies 2011 (WEST 2011). OSTI.GOV, Tucson, Arizona, USA, 1–6.
- [25] 2019. Memory and parallelism analysis using a platform-independent approach. In Proceedings of the 22nd International Workshop on Software and Compilers for Embedded Systems (SCOPES'19). Association for Computing Machinery, New York, NY, USA, 23–26.
- [26] 2021. AMD Demonstrates Stacked 3D V-Cache Technology: 192 MB at 2 TB/Sec. https://www.anandtech.com/show/16725/amd-demonstrates-stacked-vcache-technology-2-tbsec-for-15-gaming
- [27] 2021. Did IBM Just Preview the Future of Caches? https://www.anandtech.com/show/16924/did-ibm-just-preview-the-future-of-caches
- [28] [n. d.]. Memory Stacked on Processor for High Bandwidth. https://patents.justia.com/patent/20230275068
- [29] 2016. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models. In High Performance Computing. Springer, Cham, 489–507.
- [30] 1974. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits 9, 5 (1974), 256–268.
- [31] 2016. Replicating HPC I/O workloads with proxy applications. In Proceedings of the 1st Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS'16). IEEE Press, Piscataway, NJ, USA, 13–18.
- [32] 2012. High-order curvilinear finite element methods for Lagrangian hydrodynamics. SIAM Journal on Scientific Computing 34, 5 (2012), B606–B641.
- [33] 2019. Double-precision FPUs in high-performance computing: An embarrassment of riches? In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019). IEEE Press, Rio de Janeiro, Brazil, 78–88.
- [34] 2021. Matrix Engine Study. https://gitlab.com/domke/MEstudy
- [35] 2021. Matrix engines for high performance computing: A paragon of performance or grasping at straws? In 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2021). IEEE Press, Portland, Oregon, USA, 1056–1065.
- [36] 1988. The LINPACK benchmark: An explanation. In Proceedings of the 1st International Conference on Supercomputing. Springer-Verlag, London, UK, 456–474. http://dl.acm.org/citation.cfm?id=647970.742568
- [37] 2015. HPCG Benchmark: A New Metric for Ranking High Performance Computing Systems. Technical Report ut-eecs-15-736. University of Tennessee. https://library.eecs.utk.edu/pub/594
- [38] 2016. A new metric for ranking high-performance computing systems. National Science Review 3, 1 (2016), 30–35.
- [39] 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA'11). Association for Computing Machinery, New York, NY, USA, 365–376.
- [40] 2022. The AMD next generation Zen 3 core. IEEE Micro 42, 3 (2022), 7–12.
- [41] 2018. ECP Proxy Apps Suite. https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/
- [42] 2020. 8.1 Lakefield and mobility compute: A 3D stacked 10nm and 22FFL hybrid processor system in 12×12mm², 1mm package-on-package. In 2020 IEEE International Solid-State Circuits Conference (ISSCC). IEEE Press, San Francisco, CA, USA, 144–146.
- [43] 1983. The NYU Ultracomputer—designing an MIMD shared memory parallel computer. IEEE Trans. Comput. C-32, 2 (1983), 175–189.
- [44] 2015. Asymmetric underlapped FinFET based robust SRAM design at 7nm node. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE'15). EDA Consortium, San Jose, CA, USA, 659–664.
- [45] 2016. MUSA: A multi-level simulation approach for next-generation HPC machines. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'16). IEEE Press, Salt Lake City, UT, USA, 526–537.
- [46] 2006. Basic features of the fluid dynamics simulation software "FrontFlow/Blue". Seisan Kenkyu 58, 1 (2006), 11–15.
- [47] 2016. HACC: Extreme scaling and performance across diverse architectures. Commun. ACM 60, 1 (Dec. 2016), 97–104.
- [48] 2021. Improving the performance of block-based DRAM caches via tag-data decoupling. IEEE Trans. Comput. 70, 11 (2021), 1914–1927.
- [49] 2018. A Rogues Gallery of Post-Moore's Law Options. https://www.nextplatform.com/2018/08/27/a-rogues-gallery-of-post-moores-law-options/
- [50] 2009. Improving Performance via Mini-applications. Technical Report SAND2009-5574. Sandia National Laboratories.
- [51] 2012. The Death of CPU Scaling: From One Core to Many – and Why We're Still Stuck. https://www.extremetech.com/computing/116561-the-death-of-cpu-scaling-from-one-core-to-many-and-why-were-still-stuck
- [52] 2018. Die stacking is happening. IEEE Micro 38, 1 (2018), 22–28.
- [53] 2021. International Roadmap for Devices and Systems (IRDS™) 2021 Edition – Executive Summary. IEEE IRDS™ Roadmap. IEEE. 64 pages. https://irds.ieee.org/images/files/pdf/2021/2021IRDS_ES.pdf
- [54] 2021. International Roadmap for Devices and Systems (IRDS™) 2021 Edition – Systems and Architectures. IEEE IRDS™ Roadmap. IEEE. 23 pages. https://irds.ieee.org/images/files/pdf/2021/2021IRDS_SA.pdf
- [55] 2012. Intel® Architecture Code Analyzer – User's Guide. https://www.intel.com/content/dam/develop/external/us/en/documents/intel-architecture-code-analyzer-2-0-users-guide-157548.pdf
- [56] 2020. Dynamic Control-Flow Graph (DCFG) and DCFG-Trace Format Specifications – For Format Version 1.00. https://www.intel.com/content/dam/develop/external/us/en/documents/dcfg-format-548994.pdf
- [57] 2021. Intel® Software Development Emulator. https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html
- [58] 2021. Advances in microprocessor cache architectures over the last 25 years. IEEE Micro 41, 6 (2021), 78–88.
- [59] 2021. Real-time AI for enterprise workloads: The IBM Telum processor. In 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE Computer Society, Palo Alto, CA, USA, 22. https://hc33.hotchips.org/assets/program/conference/day1/HC2021.C1.3IBMCristianJacobiFinal.pdf
- [60] 1999. The OpenMP Implementation of NAS Parallel Benchmarks and Its Performance. Technical Report NAS-99-011. NASA Ames Research Center. 26 pages. https://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf
- [61] 2015. GENESIS: A hybrid-parallel and multi-scale molecular dynamics simulator with enhanced sampling algorithms for biomolecular and cellular simulations. WIREs Computational Molecular Science 5, 4 (2015), 310–323.
- [62] 2020. Accuracy improvement of memory system simulation for modern shared memory processor. In Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (HPCAsia2020). Association for Computing Machinery, New York, NY, USA, 142–149.
- [63] 2018. Density tradeoffs of non-volatile memory as a replacement for SRAM based last level cache. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA'18). IEEE Press, Los Angeles, CA, USA, 315–327.
- [64] 2021. 3D IC integration and 3D IC packaging. In Semiconductor Advanced Packaging. Springer, Singapore, 343–378.
- [65] 2018. Automated instruction stream throughput prediction for Intel and AMD microarchitectures. In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). IEEE Press, Dallas, TX, USA, 121–131.
- [66] 2022. llvm-mca – LLVM Machine Code Analyzer. https://llvm.org/docs/CommandGuide/llvm-mca.html
- [67] 2011. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). Association for Computing Machinery, New York, NY, USA, 454–464.
- [68] 2007. Processor design in 3D die-stacking technologies. IEEE Micro 27, 3 (May 2007), 31–48.
- [69] 2021. Meeting the real-time challenges of ground-based telescopes using low-rank matrix computations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'21). ACM, New York, NY, USA, 29:1–29:16.
- [70] 1995. Memory bandwidth and machine balance in current high performance computers. IEEE Technical Committee on Computer Architecture (TCCA) Newsletter 2, 19–25 (Dec. 1995), 1–7.
- [71] 2004. Reflections on the memory wall. In Proceedings of the 1st Conference on Computing Frontiers (CF'04). Association for Computing Machinery, New York, NY, USA, 162.
- [72] 2019. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97). PMLR, Long Beach, California, USA, 4505–4515. https://proceedings.mlr.press/v97/mendis19a.html
- [73] 2018. mVMC–open-source software for many-variable variational Monte Carlo method. Computer Physics Communications 235 (2019), 447–462.
- [74] 2016. A survey of techniques for architecting DRAM caches. IEEE Transactions on Parallel and Distributed Systems 27, 6 (June 2016), 1852–1863.
- [75] 2013. Co-Design for Molecular Dynamics: An Exascale Proxy Application. Technical Report LA-UR 13-20839. Los Alamos National Laboratory. http://www.lanl.gov/orgs/adtsc/publications/science_highlights_2013/docs/Pg88_89.pdf
- [76] 1975. Progress in digital integrated electronics. International Electron Devices Meeting, IEEE 21 (1975), 11–13.
- [77] 2022. "Milan-X" 3D Vertical Cache Yields Epyc HPC Bang for the Buck Boost. https://www.nextplatform.com/2022/03/21/milan-x-3d-vertical-cache-yields-epyc-hpc-bang-for-the-buck-boost/
- [78] 2014. NTChem: A high-performance software package for quantum molecular simulation. International Journal of Quantum Chemistry 115, 5 (Dec. 2014), 349–359.
- [79] 2010. The GPU computing era. IEEE Micro 30, 2 (2010), 56–69.
- [80] 2018. Criticality aware tiered cache hierarchy: A fundamental relook at multi-level cache hierarchies. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA'18). IEEE Press, Los Angeles, CA, USA, 96–109.
- [81] 2022. NVIDIA H100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h100/
- [82] 2020. Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption. Fujitsu Technical Review. Fujitsu Limited. 9 pages. https://www.fujitsu.com/global/documents/about/resources/publications/technicalreview/2020-03/article03.pdf
- [83] 2021. DAMOV: A new methodology and benchmark suite for evaluating data movement bottlenecks. IEEE Access 9 (2021), 134457–134502.
- [84] 2016. FFV-C Package. http://avr-aics-riken.github.io/ffvc_package/
- [85] 2017. A 1,000x improvement in computer systems by bridging the processor-memory gap. In 2017 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE Press, Burlingame, CA, USA, 1–4.
- [86] 2008. GPU computing. Proc. IEEE 96, 5 (2008), 879–899.
- [87] 2015. High-performance algebraic multigrid solver optimized for multi-core based distributed parallel systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'15). ACM, Austin, TX, USA, 54:1–54:12.
- [88] 2017. User's Guide to SW4, Version 2.0. Technical Report LLNL-SM-741439. Lawrence Livermore National Laboratory.
- [89] 2020. A survey on coarse-grained reconfigurable architectures from a performance perspective. IEEE Access 8 (July 2020).
- [90] 2016. PolyBench/C 4.2.1 (Beta). https://sourceforge.net/projects/polybench/
- [91] 2020. DiffTune: Optimizing CPU simulator parameters with learned differentiable surrogates. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE Press, Athens, Greece, 442–455.
- [92] 2015. Fiber Miniapp Suite. https://fiber-miniapp.github.io/
- [93] 2021. The Kernel Codes from Priority Issue Target Applications. https://github.com/RIKEN-RCCS/fs2020-tapp-kernels
- [94] 2020. Riken_simulator. https://github.com/RIKEN-RCCS/riken_simulator
- [95] 2012. Improvements to the structural simulation toolkit. In Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques (SIMUTOOLS'12). ICST, Brussels, Belgium, 190–195.
- [96] 2020. Co-design for A64FX manycore processor and "Fugaku". In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'20). IEEE Press, Atlanta, GA, USA, 1–15.
- [97] 2022. A 7-nm FinFET 1.2-TB/s/mm² 3D-stacked SRAM module with 0.7-pJ/b inductive coupling interface using over-SRAM coil and Manchester-encoded synchronous transceiver. IEEE Journal of Solid-State Circuits (2022), 1–12.
- [98] 2021. A 96-MB 3D-stacked SRAM using inductive coupling with 0.4-V transmitter, termination scheme and 12:1 SerDes in 40-nm CMOS. IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I) 68, 2 (Feb. 2021), 692–703.
- [99] 2022. TSMC Roadmap Update: N3E in 2024, N2 in 2026, Major Changes Incoming. https://www.anandtech.com/show/17356/tsmc-roadmap-update-n3e-in-2024-n2-in-2026-major-changes-incoming
- [100] 2015. Monolithic 3D integration: A path from concept to reality. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE'15). EDA Consortium, San Jose, CA, USA, 1197–1202.
- [101] 2017. MPI Stub. https://github.com/hsorby/mpistub
- [102] 2020. SPEC's Benchmarks. https://www.spec.org/benchmarks.html
- [103] 2017. The ARM scalable vector extension. IEEE Micro 37, 2 (March 2017), 26–39.
- [104] 2021. TOP500. http://www.top500.org/
- [105] 2020. The AMD "Zen 2" processor. IEEE Micro 40, 2 (2020), 45–52.
- [106] 2016. Analysis of critical thermal issues in 3D integrated circuits. International Journal of Heat and Mass Transfer 97 (2016), 337–352.
- [107] 2017. The end of Moore's law: A new beginning for information technology. Computing in Science & Engineering 19, 2 (2017), 41–50.
- [108] 2004. A new dynamical framework of nonhydrostatic global model using the icosahedral grid. Fluid Dynamics Research 34, 6 (2004), 357–400. http://stacks.iop.org/1873-7005/34/i=6/a=A03
- [109] 2014. XSBench – The development and verification of a performance abstraction for Monte Carlo reactor analysis. In PHYSOR 2014 – The Role of Reactor Physics toward a Sustainable Future. JAEA, Kyoto, 1–13.
- [110] 2002. The NAS Parallel Benchmarks 2.4. Technical Report NAS-02-007. NASA Ames Research Center. 8 pages. https://www.nas.nasa.gov/assets/pdf/techreports/2002/nas-02-007.pdf
- [111] 2017. Parallel multi channel convolution using general matrix multiplication. In 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE Press, Seattle, WA, USA, 19–24.
- [112] 2017. Architectures for the Post-Moore era. IEEE Micro 37, 4 (July 2017), 6–8.
- [113] 2016. Analyzing allocation behavior for multi-level memory. In Proceedings of the Second International Symposium on Memory Systems (MEMSYS'16). Association for Computing Machinery, New York, NY, USA, 204–207.
- [114] 2018. 3D integrated circuit cooling with microfluidics. Micromachines 9, 6 (2018), 1–14.
- [115] 2015. 4.1 22nm next-generation IBM System z microprocessor. In 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers. IEEE Press, San Francisco, CA, USA, 1–3.
- [116] 2015. A task-based linear algebra building blocks approach for scalable graph analytics. In 2015 IEEE High Performance Extreme Computing Conference (HPEC'15). IEEE Press, Waltham, MA, USA, 1–6.
- [117] 2022. A64FX: 52-core processor designed for the 442-PetaFLOPS supercomputer Fugaku. In 2022 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, San Francisco, CA, USA, 352–354.
- [118] 2018. Fujitsu high performance CPU for the Post-K computer. In 2018 IEEE Hot Chips 30 Symposium (HCS). IEEE Computer Society, California, USA, 22. http://www.fujitsu.com/jp/Images/20180821hotchips30.pdf
- [119] 2018. ACCORD: Enabling associativity for gigascale DRAM caches by coordinating way-install and way-prediction. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA'18). IEEE, Los Angeles, CA, USA, 328–339.
- [120] 2014. A survey of memory architecture for 3D chip multi-processors. Microprocessors and Microsystems 38, 5 (2014), 415–430.
- At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads