1 Introduction

Aided by increasing computational power, modern numerical weather prediction allows for precise, global, short-range forecasts of a few days. Numerical models were developed as early as 1922 [1]; however, a single person then needed several months to create a prediction for a small area. After a century of development and research in weather prediction, the Global Forecast System (GFS) model [2] produces four global forecasts per day, each reaching up to 16 days ahead. As those forecasts influence the daily life of people and businesses, they must be produced not only within an acceptable amount of time but also with an acceptable accuracy. Imprecise or wrong predictions may lead to inadequate conclusions and thus to insufficient preparations for forecasted weather events, as happened with the low-pressure system “Bernd” in July 2021 in Germany, where insufficient preparations led to large areas being devastated and many lives lost during a single night [3, 4]. A more accurate prediction might have resulted in a higher level of preparedness and thus fewer casualties.

Creating a more accurate weather prediction imposes a trade-off with the execution time of the simulation: higher accuracy can be achieved by increasing the amount of computation (e.g., the number of simulation points), but a prediction that arrives too late is no longer useful. Thus, a fast and precise simulation is required to provide accurate results as early as possible. In particular, extraordinary weather events such as heavy rainfall are often hard to predict with current weather models, as they require a fine mesh size or a regional simulation focus to achieve high accuracy. Such advanced methods therefore have to be accelerated through parallelization that exploits the underlying computational environment. Moreover, extending the existing hardware with specialized high-performance computing units such as General-Purpose Graphics Processing Units (GPUs) provides a tremendous increase in computational power due to their capability for massive parallelism. This, however, requires homogeneous processing patterns that exploit the underlying single instruction multiple threads (SIMT) execution model, although this restriction relaxes on newer GPU generations as the executed GPU threads become more independent. Together with the host and other computing nodes, the use of GPUs leads to a heterogeneous system in which the computational load can be distributed among a range of available components.

The Weather Research and Forecasting Model (WRF) [5] provides a framework for environmental and atmospheric simulations, developed for use in research and teaching. WRF provides and combines a set of category modules, each of which simulates a specific facet of the overall simulation. The most recent microphysics scheme of this category is the WRF Single-Moment 7-class (WSM7) scheme [6], which simulates the behavior of water vapor, hydrometeors, and precipitation in the atmosphere. However, due to dependencies on other calculations within the same time step and on other time steps, the computation of the microphysics forms a bottleneck, so that runtime improvements would have a significant impact on the overall simulation time. The WSM7 scheme models the expected precipitation using seven categories of water substance (water vapor, cloud water, cloud ice, rain, snow, graupel, and hail), their atmospheric interactions with each other, and additional external factors. The WSM7 simulation is based on a three-dimensional grid structure representing the physical simulation area. The simulation is executed for each grid point and each vertical layer and has to be performed in every time step of the overall weather simulation. Each instance of the WSM7 simulation requires data of neighboring grid points and vertical layers, which limits the possibilities for parallelization across multiple time steps. Within the same time step, however, the division of the computations into independent grid points makes the WSM7 amenable to the SIMT execution scheme of GPUs.

The WSM7 reference implementation is provided in Fortran at [7] and does not exploit any parallelism. The increased accuracy of the WSM7 scheme introduces additional calculations and thus increases the simulation time, which in turn limits the time frame that can be predicted. The work in this article aims at a suitable parallelization of the WSM7 computations in order to decrease the simulation time per time step. For this purpose, a C++-based parallelization of the simulation is proposed which exploits the capabilities of multicore CPUs. Additionally, the capabilities of GPUs can be utilized, which allows for a hybrid CPU/GPU implementation of the WSM7 simulations. This requires an analysis of the data structures and data transfers. To ensure an adequate placement of the computations on CPU and GPU, the simulation is split into separate tasks, each comprising a well-defined portion of the computations. For these tasks, the placement on CPU or GPU can be determined individually in each simulation time step, which allows a dynamic load balancing mechanism to be introduced. The parallelization approach is evaluated with respect to the accuracy and computational performance of the WSM7 simulations, using the weather situation in central Europe during the low-pressure system “Bernd” in July 2021 as an example.

In detail, this article provides the following contributions:

  • A C++/CPU parallelization for the WSM7 scheme

  • A CUDA/GPU parallelization for the WSM7 scheme

  • An accuracy and performance evaluation of the parallel CPU/GPU implementations

  • A method for dynamic load balancing of the WSM7 tasks for CPU/GPU execution

This article is structured as follows: Sect. 2 introduces the WSM7 simulation and highlights related work. In Sect. 3, an overview of the heterogeneous parallelization and implementation details is presented. Then, the accuracy (Sect. 4) and performance (Sect. 5) of the resulting parallel implementation are evaluated and discussed. In Sect. 5.4 a mechanism for dynamic load balancing between CPU and GPU is presented and evaluated. Section 6 summarizes and concludes.

2 WRF Single-Moment 7-class (WSM7) scheme

An advanced simulation of ice-microphysical processes was proposed by Hong et al. [8] in 2004 with the class of WRF Single-Moment Microphysics schemes (WSMMPs). The first member of the WSMMPs, the WRF Single-Moment 3-class (WSM3) scheme [8], describes three categories of hydrometeors: water vapor, cloud water/ice, and rain/snow. The WRF Single-Moment 5-class (WSM5) scheme [8] additionally treats the composed classes water/ice and rain/snow as individual categories. Parallel C and GPU implementations of the WSM5 are provided by Mielikainen et al. [9] and Ridwan et al. [10], reaching speedup values of up to 206 (including memory transfers) and 403, respectively.

The inclusion of graupel as a sixth category of hydrometeors by Hong and Lim [11] leads to the WRF Single-Moment 6-class (WSM6) scheme. Various GPU parallelizations accelerate the WSM6, including the C-based implementation of Huang et al. [12], which achieves a speedup of 216. Another approach is followed by da Silva et al. [13], who focus on the usage of OpenACC directives in combination with a job scheduler to minimize the programming effort required when porting the original Fortran implementation to another language; this approach reaches a speedup of 67 compared to a 24-thread CPU execution. For the same reason, Kim et al. [14] provide an approach relying on CUDA Fortran features without the necessity of porting the model to another language; however, this limits the achieved speedup to values between 4 and 5. These different parallelization approaches show that the structure of these simulations is amenable to parallelization on GPUs.

The WSM7 scheme by Bae et al. [6], the most recent WSMMPs class, extends the WSM6 scheme to a four-ice microphysics scheme by considering the hydrometeor hail as an additional category rather than relying solely on graupel. The inclusion of hail affects the simulated precipitation: heavy rain increases and the rain is better organized along convective squall lines, whereas the stratiform region decreases. The WSM7 scheme is therefore important for predicting heavy rain events such as those that occurred during the low-pressure system “Bernd” [3, 4] in July 2021.

3 Heterogeneous parallelization of WSM7

This section describes the parallelization approach and implementation details of the WSM7 scheme for the C++ and CUDA implementations.

3.1 Parallelization of the WSM7 scheme

Fig. 1: Parallelization approach of the WSM7 scheme. While model calculations are performed for each mesh point, each column along the vertical axis, as marked in the displayed mesh, is independent of the others. Therefore, the columns at the horizontal mesh points along the longitude and latitude coordinates can be processed in parallel

The computations of the WSM7 scheme are performed at mesh points along the longitude and latitude of the simulated region. For each of these mesh points, the vertical column is simulated independently of its neighboring columns. The general parallelization of the WSM7 scheme relies on this independence of the vertical columns, as displayed in Fig. 1. The calculations for a specific column of the horizontal mesh are grouped into a task, thus allowing the parallel and independent execution of different tasks, i.e., columns.
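This column-wise decomposition maps naturally onto the SIMT model. The following sketch illustrates the idea with a hypothetical CUDA kernel in which one GPU thread processes one vertical column; all identifiers (wsm7_column_kernel, ide, jde, kte, the index layout) are illustrative assumptions and not taken from the actual implementation.

```cpp
// Hypothetical mapping of the column decomposition onto the GPU: one thread
// per (i, j) mesh point, looping over the vertical levels of its own column.
__global__ void wsm7_column_kernel(float* th, float* q, /* further fields */
                                   int ide, int jde, int kte) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // longitude index
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // latitude index
    if (i >= ide || j >= jde) return;

    for (int k = 0; k < kte; ++k) {
        // Per-level WSM7 calculations for column (i, j): only data of this
        // column is read and written, so no inter-thread synchronization is
        // needed within a time step.
        int idx = (k * jde + j) * ide + i;  // assumed memory layout
        th[idx] += 0.0f;                    // placeholder for the actual physics
        q[idx]  += 0.0f;
    }
}
```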

For each of the mesh points, a series of calculations is executed, which is described in detail in [6]. This article focuses on the parallelization of these calculations to obtain a hybrid CPU-GPU implementation and on additional optimizations. In the following, the details of the parallelization approach are presented. After that, the data management is described, since the data has to be distributed among the different components. Finally, a custom dynamic load balancing method for heterogeneous architectures is presented, which aims to achieve the lowest execution time for each time step.

3.2 Implementation overview

The original implementation of the WSM7 scheme by Bae et al. [6] is written in Fortran. For this work, the Fortran version is reimplemented in C++ with CUDA, since the creation of separate tasks relies heavily on object orientation, which cannot be expressed in Fortran-based approaches such as CUDA Fortran. The re-implementation in C++ preserves the original program structure as far as possible, so that the extensions required by the new implementation language have only a minor impact on code readability and the resulting code remains readable and maintainable by domain scientists.

The calculations on the GPU are organized into tasks, each of which executes the calculations of one column (see Fig. 1). These tasks are executed asynchronously using CUDA streams, and synchronizations are applied according to a set of predecessor tasks. Asynchronous data transfers between host and device have to respect preceding tasks as well. The predecessors are determined from dependencies caused by the usage of shared variables. A task T can perform the following operations on a shared variable V: (a) T only reads V, (b) T writes V, and (c) T reads and modifies V. Case (a) leads to a read-after-write (true) dependency on an earlier writer of V, whereas read-after-read dependencies are not considered since they do not lead to inconsistencies. For case (b), write-after-write and write-after-read dependencies have to be considered among the involved tasks, and case (c) is a combination of cases (a) and (b). Since the CPU part does not use an asynchronous execution, these considerations apply only to the GPU part.
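As an illustration of how such predecessor relations can be enforced without blocking the host, the following sketch uses CUDA events on streams; the GpuTask structure and the enqueue helper are assumptions for illustration and not the names used in the actual implementation.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Sketch: every task records an event on its stream after being enqueued;
// a dependent task makes its own stream wait for the events of all of its
// predecessor tasks before its kernel or memory transfer is enqueued.
struct GpuTask {
    cudaStream_t stream;                 // stream this task is enqueued on
    cudaEvent_t  done;                   // recorded once the task is enqueued
    std::vector<GpuTask*> predecessors;  // tasks sharing variables with this one
};

void enqueue(GpuTask& task, void (*launch)(cudaStream_t)) {
    // Enforce read-after-write, write-after-write and write-after-read ordering
    // on the device without blocking the host thread.
    for (GpuTask* pred : task.predecessors)
        cudaStreamWaitEvent(task.stream, pred->done, 0);
    launch(task.stream);                       // enqueue kernel or cudaMemcpyAsync
    cudaEventRecord(task.done, task.stream);   // mark completion for successors
}
```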

The distribution of the computational load among the heterogeneous components in the system is based on the factor \(p_\textrm{dev}\in [0,1]\), which can be set manually and denotes the fraction of horizontal mesh points to be calculated on the GPU. In each time step, the device functions for the corresponding number of mesh points are called asynchronously. After the device functions have been started, the host executes the calculations for the remaining mesh points. Finally, host and device are synchronized and the results are passed back to the WRF model.
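A minimal sketch of this per-time-step flow is given below; the helper functions launchDeviceColumns() and computeHostColumns() are hypothetical stand-ins for the actual task invocations.

```cpp
#include <cuda_runtime.h>
#include <cmath>
#include <cstddef>

// Per-time-step driver (illustrative): split the columns according to p_dev,
// enqueue the GPU share asynchronously, compute the CPU share meanwhile,
// then join both shares before handing the results back to WRF.
void wsm7TimeStep(std::size_t totalColumns, double p_dev,
                  void (*launchDeviceColumns)(std::size_t first, std::size_t count),
                  void (*computeHostColumns)(std::size_t first, std::size_t count)) {
    const auto devCols =
        static_cast<std::size_t>(std::round(p_dev * totalColumns));
    launchDeviceColumns(0, devCols);                      // asynchronous GPU share
    computeHostColumns(devCols, totalColumns - devCols);  // CPU share in parallel
    cudaDeviceSynchronize();                              // join both shares
}
```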

The whole execution has to be thread-safe, since the underlying WRF model may use multithreading such that each thread receives a portion of the horizontal mesh to calculate. In this work, this concept is extended by a designated master thread which coordinates the GPU invocations of all threads and additionally handles its own host share, while the other threads only handle their host functions.

3.3 Memory management for heterogeneous computing

Fig. 2: Data categorization for the heterogeneous WSM7 approach for the host share (left) and the device share (right). The following categories are used: Fortran data, host data and device data

The original memory and data layout of the WSM7 scheme is not suitable for the heterogeneous approach, especially when moving data between CPU and GPU. The data is therefore categorized into three types which follow the possible execution options, as shown in Fig. 2: (a) Fortran data, (b) host data and (c) device data. Category (a) denotes the data used in the original Fortran implementation, which can also be accessed directly inside the host share. However, (a) may contain additional data which is not needed inside the WSM7 scheme, so host invocations remove these elements and may rearrange the data to provide a more efficient access pattern, which leads to memory type (b). The GPU, on the other hand, cannot access (a) directly, so data transfers between host and device are necessary; these are implemented as a separate type of task and follow the same principle of predecessor tasks. Similarly to (b), the other types of tasks may rearrange variables in (c) to achieve an efficient access pattern and thereby reduce the memory latency on the device.

With the re-implementation of the WSM7 scheme for the heterogeneous approach, further optimizations become possible concerning variables that are frequently used together. For example, the terms \(P_\textrm{raci}\), \(P_\textrm{saci}\), \(P_\textrm{gaci}\), \(P_\textrm{haci}\) for the accretion of ice by the different hydrometeors in the WSM7 scheme are calculated inside a single task and are therefore accessed together. Instead of accessing each of them separately, these values are merged into a single variable \(P_{{\textbf {X}}\mathrm aci}\) using the native float4 data structure of CUDA, which allows the GPU to read four float elements at once and reduces the overall number of required memory accesses. The same technique is used for four-dimensional variables, where the first three dimensions are aligned according to the dimensions of the domain and the fourth dimension holds additional information.
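The following sketch illustrates the float4 packing with a hypothetical kernel; the array name p_xaci, the kernel name and the summation are illustrative only.

```cpp
#include <cuda_runtime.h>

// Sketch of the float4 packing described above: the four accretion terms
// P_raci, P_saci, P_gaci and P_haci of a mesh point are stored contiguously,
// so the GPU fetches them with a single 128-bit (float4) load instead of
// four separate 32-bit loads.
__global__ void useAccretionTerms(const float4* __restrict__ p_xaci,
                                  float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float4 p = p_xaci[idx];            // one coalesced load of all four terms
    out[idx] = p.x + p.y + p.z + p.w;  // e.g. total accretion of ice at this point
}
```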

3.4 Dynamic load distribution

The heterogeneous approach aims to fully utilize both resources, the CPU and the GPU; thus, a balanced distribution of the calculation load between them is required. In general, the factor \(p_\textrm{dev}\) already allows for a constant distribution. However, using a constant value throughout a simulation run might not deliver a minimal runtime, and certain runtime characteristics would have to be known beforehand to determine an optimal value of \(p_\textrm{dev}\).

To avoid a manually chosen constant for the distribution, a method for the dynamic distribution of the load is presented in the following. This method assumes that the calculation load changes only minimally between two consecutive time steps. The dynamic task distribution calculates a new factor \(\widetilde{p}_\textrm{dev}\) for the following time step based on the host and device runtimes \(t_H\) and \(t_D\) of the current time step. To achieve this, three steps are performed:

  1. To initialize the calculation, the first two time steps are fully executed on the GPU (\(p_\textrm{dev}=1\)). In the very first time step, some initialization is performed that should be ignored in the adjustment of \(p_\textrm{dev}\).

  2. In the third time step, the host is included using an arbitrary value \(p_\textrm{dev}<1\) to obtain independent values for \(t_H\) and \(t_D\).

  3. For all following time steps a presumed optimal value \(\widetilde{p}_\textrm{dev}\) is calculated such that the host and device runtimes should match: \(\widetilde{t}_{H}=\widetilde{t}_{D}\).

This optimization is realized by calculating the achieved GPU Speedup \(S_\textrm{dev}\) for the current time step in relation to the host component:

$$\begin{aligned} S_\textrm{dev}:=\frac{t_{H}}{(1-p_\textrm{dev})}\cdot \frac{p_\textrm{dev}}{t_D}, \text {with } 0<p_\textrm{dev}<1. \end{aligned}$$
(1)

Since this specific speedup \(S_\textrm{dev}\) describes the relation between the device load and the host load, the new load distribution for the next time step is calculated from the currently achieved speedup:

$$\begin{aligned} \widetilde{p}_\textrm{dev}:=1-\frac{1}{S_\textrm{dev}}, {\text {with } 0<\widetilde{p}_\textrm{dev}<1}. \end{aligned}$$
(2)

After that, suitable values have been found and step (3) is repeated until the end of the simulation, providing a dynamic task distribution based on the achieved speedup of the GPU execution compared with the CPU.

This method of distribution, however, has the drawback that minor fluctuations in the measured runtimes \(t_H\) and \(t_D\) may force a small but constant change of \(p_\textrm{dev}\) even when a value close to the optimum has already been found in the previous time step. Therefore, the conditions for any future adjustment of \(\widetilde{p}_\textrm{dev}\) after step (3) are tightened: since the following runtimes have to match (\(\widetilde{t}_{H}=\widetilde{t}_{D}\)), step (3) is only performed when the deviation between the runtimes \(t_H\) and \(t_D\) in the current time step is greater than a given threshold. This adjustment loosely meets the criterion of matching runtimes \(t_H\) and \(t_D\) and serves as a coarse adjustment.

When \(t_H\) is smaller than \(t_D\), the host is able to handle slightly more load and therefore receives a single column (task) from the device. Otherwise, when \(t_H\) is larger than \(t_D\), the host handles more load than the device, which should be avoided since the host also has to manage the tasks and control the GPU, which induces a small overhead. In this case, the host reduces its workload by a single task, which is then calculated by the device. The resulting runtimes might not match perfectly, but the host gains extra time to manage the heterogeneous approach. In contrast to the general calculation in step (3), these adjustments provide a fine-grained tuning of the distribution in order to avoid constant fluctuations, so that a near-optimal distribution can be held for a longer period of simulation time.
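The combination of the coarse adjustment from Eqs. (1) and (2) and the fine adjustment described above could look as sketched below; the deviation threshold and the fraction corresponding to a single task are illustrative values, not the ones used in the implementation.

```cpp
#include <algorithm>
#include <cmath>

// Sketch of the per-time-step update of p_dev, assuming 0 < p_dev < 1 and the
// measured host/device runtimes t_H and t_D of the current time step.
double adjustPdev(double p_dev, double t_H, double t_D,
                  double threshold = 0.05, double taskFraction = 0.001) {
    const double deviation = std::abs(t_H - t_D) / std::max(t_H, t_D);
    if (deviation > threshold) {
        // Coarse adjustment: recompute p_dev from the measured speedup,
        // following Eq. (1) and Eq. (2).
        const double s_dev = (t_H / (1.0 - p_dev)) * (p_dev / t_D);
        p_dev = 1.0 - 1.0 / s_dev;
    } else if (t_H < t_D) {
        // Fine adjustment: the host finishes earlier and takes one more column.
        p_dev -= taskFraction;
    } else if (t_H > t_D) {
        // The host is the slower side: hand one column back to the device.
        p_dev += taskFraction;
    }
    return std::clamp(p_dev, 0.01, 0.99);  // keep 0 < p_dev < 1
}
```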

4 Accuracy of the CUDA-WSM7

Since weather forecasts try to predict a chaotic system, minimal deviations in intermediate steps may have unforeseeable impacts on the prediction result. Therefore, this section compares the accuracy of the parallelized CUDA-WSM7 implementation with that of the Fortran implementation. The deviations in calculation accuracy are studied by comparing the precipitation forecast of “Bernd” for both implementations over the whole period of interest. The discussion first analyzes the short term effects, i.e., the deviations of the calculation results within a single time step, where each time step is started from the values of the Fortran version. Afterward, the effects of accumulating these deviations over multiple consecutive time steps (long term effects) are discussed and compared with the results of the Fortran version.

4.1 Execution environment

As a basis for the evaluation of the heterogeneous CUDA-WSM7 implementation, the meteorological occurrence of the low-pressure system “Bernd” from mid-July 2021 is simulated using the presented CUDA-WSM7 approach. The \(1750 \times 1750\ \text{km}\) simulation domain, shown in Fig. 3, focuses on central Europe and covers the period from 13.07.2021 0:00 UTC until 16.07.2021 0:00 UTC, when “Bernd” produced the heaviest rainfalls in parts of the viewed area; some of the local rainfall can be categorized as a “once in a century” event [3]. The underlying meteorological data is taken from the GFS Analysis Model [2], which provides the worldwide atmospheric state with a resolution of 0.5° in time intervals of 6 h. This data is then processed by the WRF Preprocessing System (WPS), which is included in the WRF project and combines the GFS data with additional information so that it can be used for a simulation run.

Fig. 3: Simulated domain of central Europe for the evaluation of the heterogeneous CUDA-WSM7 implementation

The simulation runs are performed on a single CPU-GPU system consisting of a host with an Intel Core i7-7700 CPU, which provides 4 physical cores (8 logical cores) with a maximum frequency of 4.2 GHz. The host system provides 32 GB of main memory in 4 modules with transfer rates of 2400 MT/s. The system includes an NVIDIA GTX 1070 GPU with 8 GB of GDDR5 memory and a measured average bandwidth of 195.565 GB/s without any specific caching effects. The GPU has 15 streaming multiprocessors and a total of 1920 CUDA cores. Regarding the data transfers between host and device, the GPU is capable of reaching a maximum bandwidth of 15.75 GB/s, whereas an average value of 12.9959 GB/s has been measured in tests. For compilation, the GCC and GFORTRAN compilers in version 9.4 are used for the host with the options -O3 -ftree-vectorize -funroll-loops --restrict, and for the device the NVCC compiler is used in version 11.6.

In the WRF framework, the basic physics suite option is set to CONUS (Continental U.S.) with the following additions:

  • Microphysics (mp_physics) is described by using the WSM7 scheme,

  • Cumulus Parameterization (cu_physics) is provided by the Kain-Fritsch scheme,

  • Both the longwave (ra_lw_physics) and shortwave (ra_sw_physics) radiation are set to the CAM scheme,

  • The surface layer (sf_sfclay_physics) uses the revised MM5 surface layer scheme,

  • Urban regions are considered with the urban surface (sf_urban_physics) option using the urban canopy scheme.

Additionally, the WRF framework was compiled utilizing single precision float values.

4.2 Short term effects

For an evaluation of the short term effects of the GPU implementation, the results of both the C++ host and the CUDA device implementations are compared with the Fortran version. The focus lies on the impact on the following resulting WSM7 variables:

 

  Variable     Meaning
  th           Potential temperature
  q            Mixing ratio of water vapor
  qc           Mixing ratio of cloud water
  qr           Mixing ratio of rain
  qi           Mixing ratio of ice
  qh           Mixing ratio of hail
  qg           Mixing ratio of graupel
  rain         Accumulated precipitation of rain
  snow         Accumulated precipitation of snow
  graupel      Accumulated precipitation of graupel
  hail         Accumulated precipitation of hail
  rainncv      Time step precipitation of rain
  snowncv      Time step precipitation of snow
  graupelncv   Time step precipitation of graupel
  hailncv      Time step precipitation of hail
  sr           Precipitation ratio of snow

These variables are first calculated for the C++ conversion in each time step, leading to the C++ results \(V_C\). Then, the Fortran results \(V_F\) are calculated and the mean squared error (\(\textrm{MSE}\)) between the two results is used for comparison:

$$\begin{aligned} \textrm{MSE}=\frac{1}{n}\sum _{i=1}^{n}(V_{C_i}-V_{F_i})^2, \end{aligned}$$
(3)

where n denotes the total number of considered mesh points. Finally, due to considerable fluctuations in the results between single time steps, the evolution of the average \(\overline{\textrm{MSE}}\) over the whole simulation is considered. The comparison covers 720 time steps beginning at 14.07.2021 12:00 UTC with a constant step size of 30 s, using the reference domain with \(350\times 350\) horizontal mesh points.
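For reference, Eq. (3) corresponds to the following straightforward computation over the result arrays of one variable (a minimal sketch; the array names are illustrative):

```cpp
#include <cstddef>

// Mean squared error between the C++/CUDA results v_c and the Fortran
// reference results v_f over all n considered mesh points, cf. Eq. (3).
double meanSquaredError(const float* v_c, const float* v_f, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double d = static_cast<double>(v_c[i]) - static_cast<double>(v_f[i]);
        sum += d * d;
    }
    return sum / static_cast<double>(n);
}
```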

Fig. 4: Average \(\textrm{MSE}\) values for the host share for potential temperature and specific mixing ratios (left) and mixing ratios of precipitation categories (right)

At first, the results for the C++ host implementation are studied in detail; they are displayed in Fig. 4. As shown, the deviations for the mixing ratios as well as for the potential temperature th are greater than zero, indicating that the C++ and Fortran versions differ in their results. The snow precipitation ratio sr, however, shows no deviation at all, and neither do the resulting precipitation variables; the corresponding \(\overline{\textrm{MSE}}\) values are therefore not displayed separately. Even though the C++ conversion may lead to small differences in the results for certain variables of interest, the overall effect on the simulated precipitation within single time steps is negligible.

Fig. 5: Average \(\textrm{MSE}\) values for the device share for precipitation variables

For the GPU implementation, a higher deviation in the precipitation results occurs, as Fig. 5 indicates. In particular, the accumulated precipitation variables and their per-time-step counterparts show \(\textrm{MSE}\) values in orders of magnitude higher than \(10^{-20}\) and \(10^{-22}\), respectively, indicating a small deviation in the precipitation results after all time steps.

One reason for this slightly deviating behavior between the host and device implementations lies in the different calculation precision of the functions exp(x), log(x) and pow(x, y) on the CPU and the GPU. As explained in the CUDA C++ Programming Guide [15], the single-precision versions of these device functions can have errors of up to 2 units in the last place (ulp) [16] for expf(x), 1 ulp for logf(x) and up to 10 ulp for powf(x, y) due to differences in their implementation on the GPU hardware. A possible solution is the usage of the double-precision counterparts of these functions on the GPU and the host system. Switching all calculations to double precision indeed reduces the \(\textrm{MSE}\) to 0, but the resulting runtime highly depends on the actual GPU hardware and thus limits comparability.
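The effect can be reproduced with a small stand-alone CUDA program that evaluates powf() in single precision next to its double-precision counterpart on the device; the chosen input values are arbitrary and only serve as an illustration.

```cpp
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

// Illustration: the single-precision device powf() may deviate by several ulp
// from a double-precision evaluation of the same expression.
__global__ void comparePow(float x, float y, float* out) {
    out[0] = powf(x, y);                             // single precision on the device
    out[1] = (float)pow((double)x, (double)y);       // double-precision device path
}

int main() {
    float* d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(float));
    comparePow<<<1, 1>>>(1.000123f, 7.5f, d_out);
    float h_out[2];
    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("device float: %.9g  device double: %.9g  host double: %.9g\n",
                h_out[0], h_out[1], std::pow(1.000123, 7.5));
    cudaFree(d_out);
    return 0;
}
```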

4.3 Long term effects

As shown in the previous section, certain mathematical device functions do not reach the same accuracy as their host counterparts, resulting in deviating precipitation results over the course of multiple time steps. This subsection shows the consequences of this behavior of the GPU implementation for a forecast over a longer period.

The forecast is calculated for the period from 13.07.2021 0:00 UTC until 16.07.2021 0:00 UTC with a constant time step size of 30 s, leading to 8640 time steps using \(350\times 350\) horizontal mesh points. At first, the simulation is performed using only the current Fortran implementation and the resulting precipitation values are accumulated; then another run using the parallel GPU implementation is performed for comparison. In doing so, the precipitation categories rain, snow, graupel and hail are added up, leading to a total amount of accumulated precipitation for each mesh point.

Fig. 6: Accumulated amount of precipitation for the simulation period from 13.07.2021 0:00 UTC until 16.07.2021 0:00 UTC. On the left the accumulated precipitation for the Fortran implementation is displayed whereas on the right the results for the GPU version are shown

As shown in Fig. 6, deviations in the accumulated precipitation between the two forecasts are evident, both in intensity and in location. Averaged over all horizontal mesh points, the resulting accumulated precipitation is 9.3277 mm for the Fortran version and 9.3272 mm for the GPU version, a negligible difference of \(0.0058\%\); thus only the distribution of the precipitation varies. The GPU version predicts higher peaks of locally restricted precipitation and forecasts a broader coherent band of intense rainfall, which turns out to be weaker in the Fortran version.

In addition to the accumulated precipitation, the number of heavy rainfall events is counted according to the definition of the Deutscher Wetterdienst [17], where precipitation of at least 15 mm within 1 h or at least 20 mm within a 6 h period at a single horizontal mesh point is classified as one heavy rainfall event. This consideration reveals a slight increase of \(2.86\%\) for such events in the 1 h period for the GPU version in comparison with the Fortran version and a minor increase of \(0.93\%\) for the 6 h period. In conclusion, the deviations of the GPU version result in a small but measurable increase in heavy rainfall events during the forecasted period.
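The counting of such events can be sketched as follows, assuming an illustrative array of hourly precipitation sums per mesh point; the thresholds are the ones quoted above, while the data layout and function names are assumptions.

```cpp
#include <cstddef>
#include <vector>

struct HeavyRainCounts {
    std::size_t oneHour = 0;   // events with >= 15 mm within 1 h
    std::size_t sixHours = 0;  // events with >= 20 mm within 6 h
};

// precipPerHour[t][p]: assumed layout holding the precipitation sum (mm) of
// hour t at horizontal mesh point p.
HeavyRainCounts countHeavyRainEvents(
        const std::vector<std::vector<double>>& precipPerHour) {
    HeavyRainCounts c;
    const std::size_t hours = precipPerHour.size();
    if (hours == 0) return c;
    const std::size_t points = precipPerHour[0].size();
    for (std::size_t p = 0; p < points; ++p) {
        for (std::size_t t = 0; t < hours; ++t) {
            if (precipPerHour[t][p] >= 15.0) ++c.oneHour;     // 1 h criterion
            if (t + 6 <= hours) {                             // 6 h criterion
                double sum = 0.0;
                for (std::size_t k = t; k < t + 6; ++k) sum += precipPerHour[k][p];
                if (sum >= 20.0) ++c.sixHours;
            }
        }
    }
    return c;
}
```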

4.4 Summary of accuracy evaluation

The accuracy of the CUDA implementation has been analyzed in terms of short term and long term effects. The short term effects show deviations between individual time steps of the simulation which mainly result from the different implementations of several math library functions. The usage of more accurate functions with a more precise data type reduces this inaccuracy but does not eliminate it.

This leads to some deviations in long term forecasts for the GPU implementation. The overall amount of precipitation predicted by the CUDA implementation differs by \(0.0058\%\) from that of the Fortran implementation. The GPU implementation tends to predict heavy rainfall events with a frequency that is between 1 and 2% higher than for the Fortran implementation, and the areas of heavy rainfall tend to spread over a wider area.

5 Performance evaluation

After the evaluation of the accuracy of the WSM7 results of the heterogeneous CPU-GPU implementation, this section presents a performance evaluation. The performance of the GPU implementation is analyzed in detail first, followed by the performance of the host implementation. Finally, the impact of the dynamic load distribution is evaluated in order to estimate the additional performance gains of the heterogeneous approach.

5.1 Execution environment

The performance evaluation is conducted using the setup described in Sect. 4.1 with the following differences: the central European domain (see Fig. 3) is used with configurations of \(200\times 200\) (8.75 km), \(300\times 300\) (5.50 km) and \(400\times 400\) (4.40 km) horizontal mesh points (mesh size) and 45 vertical levels. For each run, 720 time steps are simulated covering the period from 14.07.2021 12:00 UTC until 15:00 UTC with a time step size of 15 s.

5.2 Performance of the GPU implementation

The GPU implementation is evaluated in relation to the existing Fortran version, with each task of the heterogeneous implementation executed on the GPU. This process is repeated for different numbers of CPU threads. However, only a single GPU is used in each configuration, so multiple CPU threads have to share the same GPU.

Fig. 7: Achieved average speedup values (inverse ratio to one thread) for the WSM7 scheme for the reference domain over 720 time steps on the GPU in relation to the Fortran implementation using \(200 \times 200\), \(300 \times 300\) and \(400 \times 400\) horizontal mesh points. Additionally, a maximum speedup is displayed for each configuration, where the maximum is calculated from the number of memory accesses and data transfers on the GPU for each time step

The results of this study are displayed in Fig. 7, which depicts the ratio \(t_F/t_G\) of the execution time of one WSM7 time step in the Fortran implementation (\(t_F\)) and in the GPU implementation (\(t_G\)). Using only one thread leads to the highest speedup values for each configuration, up to 28.51, whereas a higher number of threads reduces the measured speedup to as low as 7.94. The GPU is not able to scale with the number of CPU threads, resulting in a reduction of the speedup with an increasing number of CPU threads. A higher number of CPU threads allows the overlapping of memory and computational tasks on the GPU, which in theory can increase the overall GPU utilization, but in this case does not further reduce the runtime.

In addition to the number of threads, the number of mesh points, and therefore the overall utilization, influences the achieved speedup. For each thread configuration, the lowest speedup values are measured for \(200\times 200\) horizontal mesh points, whereas for one and four threads the configuration with \(400\times 400\) mesh points achieves the highest values. The main reason for this behavior is the overhead induced by the heterogeneous approach, since the various tasks have to be initiated in each time step. As the \(200\times 200\) configuration has the fewest calculations, the administration of the individual tasks has a higher relative impact than for a larger mesh, which results in the lowest speedup values. Consequently, a higher overall utilization of the GPU should be preferred, especially to overcome such overhead.

In comparison with the Fortran version, the memory accesses on the GPU may form a bottleneck, as the memory bandwidth cannot match the available GPU performance. Figure 7 additionally shows a maximum speedup, which is calculated from the number of memory accesses and data transfers per time step in relation to the available memory bandwidth while ignoring the time required for calculations. This indicates whether there is further need for optimization of the implementation of the calculations or whether the presented implementation is already limited by the available memory bandwidth. The difference between the measured speedup and the calculated maximum speedup decreases with an increasing number of threads. This reduction is caused by the higher GPU utilization, which allows a higher degree of overlap between memory transfer and calculation kernels on the CUDA streams. On the one hand, the decreasing difference between the achieved and maximum speedup with an increasing number of CPU threads indicates a higher GPU utilization; on the other hand, the performance of the GPU execution is significantly limited by the memory bandwidth. In conclusion, the large number of accesses to the GPU memory significantly limits further possible performance gains. The experiments also show that a higher number of mesh points, i.e., a smaller grid size, leads to a slightly higher speedup.
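One way such a bandwidth-only bound can be obtained is sketched below: the time a time step would take if it consisted solely of GPU memory traffic and host-device transfers is computed and related to the Fortran runtime. The exact accounting of accesses in the paper may differ; this is only an illustration, and all names are assumptions.

```cpp
// Bandwidth-only bound for the maximum speedup: ignore all computation time
// and assume the time step is limited purely by GPU memory accesses and
// host-device transfers. The byte counts are measured per time step.
double maxSpeedupBound(double fortranTimeSec,
                       double deviceBytesAccessed, double deviceBandwidthBytesPerSec,
                       double transferBytes, double pcieBandwidthBytesPerSec) {
    const double memTime = deviceBytesAccessed / deviceBandwidthBytesPerSec
                         + transferBytes / pcieBandwidthBytesPerSec;
    return fortranTimeSec / memTime;  // upper bound on the achievable speedup
}
```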

5.3 Performance of the host implementation

Fig. 8: Runtime of the WSM7 scheme for the reference domain over 720 time steps on the CPU using the C++ host components of each task in comparison with the Fortran implementation using \(200 \times 200\) and \(300 \times 300\) horizontal mesh points. Additionally, the achieved speedup of the C++ host functions in relation to their Fortran counterparts is displayed in parentheses

The presented heterogeneous approach aims to use both the GPU and the CPU for the execution of the tasks. In this section, the C++ host component that implements the task execution model is discussed and compared to the Fortran implementation. The resulting runtime and speedup values are displayed in Fig. 8. The C++ implementation shows a higher runtime for every execution configuration, which indicates a substantial slowdown in comparison with the Fortran version. The main reason for this behavior lies in the different optimization possibilities of the Fortran implementation and the task-based approach in C++.

Fig. 9: Runtime comparison of the WSM7 scheme between optimized and unoptimized Fortran and C++ implementations using \(350 \times 350\) mesh points and eight threads. Additionally, the achieved speedup of the C++ host functions in relation to their Fortran counterparts is displayed in parentheses

Figure 9 shows the runtimes of the Fortran and C++ implementations with and without compiler optimizations. The C++ implementation uses a different code structure which isolates the separate tasks. Thus, the compiler cannot perform optimizations across multiple steps but only within a single task. In the Fortran version, however, all calculations are described within a single module inside a single loop, which allows for more sophisticated compiler optimizations and results in a higher CPU performance. This effect is reflected in the reduction of the speedup of the C++ implementation, which benefits less from these additional compiler optimizations. In addition, the small performance loss in the unoptimized configuration arises from the overhead induced by splitting the calculations into tasks in the heterogeneous approach. Therefore, the host part of the heterogeneous approach should preferably be performed with the Fortran implementation, since a significantly higher overall performance can be reached this way.

5.4 Evaluation of the dynamic load distribution

For the evaluation of the dynamic distribution of the workload, the period from 14.07.2021 6:00 UTC until 18:00 UTC is simulated with a time step size of 15 s. This results in 2880 time steps, which are executed with four threads. For these simulations, the reference domain of central Europe is used with the mesh point configurations \(300\times 300\), \(400\times 400\) and \(500\times 500\). The standard Fortran version of the WSM7 scheme and the heterogeneous approach with dynamic distribution are compared. The heterogeneous approach utilizes the CPU through the Fortran implementation, where one of the four threads acts as the master thread for the GPU implementation. In each time step, the tasks on the GPU are initiated first, then the Fortran version calculates the host part, and afterward the results of the components are synchronized.

Fig. 10: Comparison of the achieved speedup (and runtimes) of the WSM7 scheme between a GPU-only configuration and the deployment of the dynamic CPU/GPU workload distribution for different mesh point configurations. The speedup is in relation to the existing Fortran version and is calculated for a simulation run over 2880 time steps with a time step size of 15 s

The gains from using the dynamic CPU/GPU workload distribution are displayed in Fig. 10. The results show that for the smallest mesh point configuration, \(300 \times 300\), no performance advantage is achieved with the dynamic load distribution compared with a GPU-only execution. For the larger configurations, however, the dynamic workload distribution yields an increase in performance, as the corresponding speedup values exceed those of the GPU-only runs.

Fig. 11: Development of the GPU load factor \(p_\textrm{dev}\) during a simulation run with 2880 time steps under the dynamic load distribution for the WSM7 scheme. The raw values are shown in blue; the red lines display the result of an applied low-pass filter

The explanation for the reduced speedup with \(300 \times 300\) mesh points is found in the development of the GPU load factor \(p_\textrm{dev}\) shown in Fig. 11. The raw values of the GPU load factor fluctuate strongly, whereas the fluctuations are smaller for \(400\times 400\) mesh points and minimal for \(500\times 500\) mesh points. Due to these strong fluctuations, the distribution never stays at the optimal point, so either the host receives too much workload, resulting in unused waiting time on the device, or vice versa. These fluctuations have a small negative impact on the runtime, which accumulates over the time steps and leads to a higher runtime than using only the GPU. The additional adaptations, i.e., the coarse and fine adjustments of the distribution, should avoid this problem; however, the computational load per time step is not stable enough to suppress the fluctuations. For the larger configurations, these fluctuations are less pronounced, as the corresponding workload per time step is more stable. This allows the dynamic distribution to stay at near-optimal positions over multiple time steps and thereby reduce the overall runtime.

Figure 11 additionally shows that a constant workload factor \(p_\textrm{dev}\) kept throughout the whole execution does not provide an optimal solution, as the characteristics of the simulation vary between time steps. In the first time steps, the simulation lacks sufficient precipitation of any kind, and thus whole passages of the WSM7 implementation are not executed or are only partially executed. On the GPU, this results in fewer calculations, which reduces the overlapping of memory accesses and calculations and leads to an overall reduced performance and thus a lower workload assigned to the GPU. In the same situation, the CPU is not affected by the memory accesses in this way, which makes it suitable for computing even more mesh points and gives it a performance advantage over the GPU.

During the simulated weather condition, the workload increases as the amount of precipitation increases, the GPU gains an advantage, and the workload therefore shifts toward the GPU. Thereafter, the fluctuations, especially in the \(500\times 500\) configuration, follow the same scheme: in phases of reduced precipitation the distribution prefers the host, while the GPU is preferred during higher computational loads, repeatedly adapting to the simulation conditions.

5.5 Summary of performance evaluation

Regarding the performance gains, the GPU version reaches a speedup of up to 28.51. The usage of CPU multithreading delivers a speedup of 9.03 for four threads compared with the sequential Fortran implementation. The splitting into multiple tasks further limits the performance of the C++ host implementation. Thus, the Fortran implementation is chosen to execute the calculations on the host system, and the C++ module only handles the necessary communication with the GPU.

The cooperation of the CPU and GPU implementations through the dynamic load distribution increases the speedup compared with a GPU-only execution of the WSM7 scheme. During simulation phases of low precipitation, the dynamic load distribution tends to prefer the CPU, since the memory latency has a smaller impact on the runtime there. During phases with a higher calculation load, an automatic shift toward the GPU occurs.

6 Conclusion

In this article, an approach for the parallelization of the WSM7 microphysics scheme for precipitation simulation within the WRF framework for atmospheric simulations has been presented. The approach provides an implementation for heterogeneous systems consisting of a host (CPU) and a device (GPU). For this heterogeneous approach, a general implementation has been presented and analyzed. The computations are split into different tasks that can be distributed to the nodes of the heterogeneous system. The required memory accesses are transformed from the original Fortran layout into a C++- and CUDA-friendly format. Additionally, a dynamic load distribution scheme has been implemented which adjusts the distribution of the tasks in each time step.

The presented implementation has been evaluated with respect to the accuracy and execution time of the precipitation simulation. The simulation results of the heterogeneous program show a small deviation of up to \(0.0058\%\) compared to the original Fortran implementation. The performance evaluation shows a speedup of up to 28.51 for the heterogeneous implementation of the WSM7 scheme compared to the original Fortran implementation. This is achieved by using the original Fortran implementation for the host execution and a C++/CUDA module for the GPU communication and computation. The presented distribution scheme for the calculations between CPU and GPU is able to balance the calculations and thus reduces the overall runtime for larger simulation systems.

The investigations of this article have shown that task-based load balancing techniques for CPU-GPU platforms can lead to computational performance gains in the simulation of atmospheric interactions without compromising the prediction behavior. The results thus show that this kind of heterogeneous parallelization is a promising approach which could be applied to other heterogeneous systems, such as multi-GPU systems, multi-processor systems, or cluster and grid systems. Each of these extensions will require further studies concerning the adequate parallel programming technique and the impact of the parallel programming on the specific simulation method chosen.