COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loop
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2024-01-19, DOI: 10.1145/3633331
Prasoon Mishra, V. Krishna Nandivada

Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well on balanced workloads, in computations that involve highly imbalanced workloads it is difficult to obtain an efficient distribution of work, even with non-static scheduling methods such as dynamic and guided. In this paper, we present a scheme called COst aware Work Stealing (COWS) to efficiently extend the idea of work-stealing to OpenMP.
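
To make the load-balancing problem concrete, the following is a minimal simulation sketch in Python, not OpenMP code and not taken from the paper. It assumes an idealized model with known per-iteration costs and zero scheduling overhead; the function names `static_makespan` and `dynamic_makespan` are hypothetical. The first splits the iterations into contiguous near-equal blocks, in the spirit of a static schedule; the second lets whichever thread becomes free first take the next chunk, in the spirit of a dynamic schedule.

```python
import heapq


def static_makespan(costs, num_threads):
    """Finish time when iterations are split into contiguous, near-equal blocks.

    Assumption: one block per thread, similar in spirit to a static schedule.
    """
    base, extra = divmod(len(costs), num_threads)
    finish_times = []
    start = 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)  # first `extra` threads get one more
        finish_times.append(sum(costs[start:start + size]))
        start += size
    return max(finish_times)


def dynamic_makespan(costs, num_threads, chunk=1):
    """Finish time when the earliest-free thread always takes the next chunk.

    Assumption: handing out a chunk costs nothing, which real runtimes
    do not achieve.
    """
    free_at = [0] * num_threads  # min-heap of times at which threads become free
    heapq.heapify(free_at)
    for i in range(0, len(costs), chunk):
        t = heapq.heappop(free_at)
        heapq.heappush(free_at, t + sum(costs[i:i + chunk]))
    return max(free_at)
```

With `costs = [8, 8, 8, 8, 1, 1, 1, 1]` and two threads, `static_makespan` returns 32 while `dynamic_makespan` returns 18, because the static split gives all four expensive iterations to one thread. The model deliberately ignores the scheduling overheads that matter in practice.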

In contrast to traditional work-stealing schedulers, COWS takes into consideration that (i) not all iterations of a parallel-for-loop may take the same amount of time, (ii) identifying a suitable victim for stealing is important for load-balancing, and (iii) queues lead to significant overheads in traditional work-stealing and should be avoided. We present two variations of COWS: WSRI (a naive work-stealing scheme based on the number of remaining iterations) and WSRW (a work-stealing scheme based on the amount of remaining workload). Since in irregular loops, such as those found in graph analytics, it is not possible to statically compute the cost of the iterations of the parallel-for-loops, we use a combined compile-time + runtime approach, where the remaining workload of a loop is computed efficiently at runtime by utilizing the code generated by our compile-time component. We have performed an evaluation over seven different benchmark programs, using five different input datasets, on two different hardware platforms, across a varying number of threads, leading to a total of 275 configurations. We show that in 225 out of 275 configurations, compared to the best OpenMP scheduling scheme for that configuration, our approach achieves clear performance gains.
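
The difference between the two victim-selection criteria can be illustrated with a small Python sketch. This is not the authors' implementation: it assumes each thread owns a half-open range `(next, end)` of iteration indices instead of a queue, and that per-iteration costs are available through a prefix-sum array, which stands in for whatever the paper's compile-time component generates. All function names are hypothetical.

```python
from itertools import accumulate


def prefix_costs(costs):
    """prefix[i] is the total cost of iterations 0 .. i-1 (assumed cost model)."""
    return [0] + list(accumulate(costs))


def remaining_iterations(rng):
    lo, hi = rng
    return hi - lo


def remaining_work(rng, prefix):
    lo, hi = rng
    return prefix[hi] - prefix[lo]


def pick_victim_wsri(ranges, thief):
    """WSRI-style choice: the other thread with the most remaining iterations."""
    candidates = [t for t in range(len(ranges))
                  if t != thief and remaining_iterations(ranges[t]) > 0]
    if not candidates:
        return None  # nothing left to steal
    return max(candidates, key=lambda t: remaining_iterations(ranges[t]))


def pick_victim_wsrw(ranges, thief, prefix):
    """WSRW-style choice: the other thread with the most remaining workload."""
    candidates = [t for t in range(len(ranges))
                  if t != thief and remaining_iterations(ranges[t]) > 0]
    if not candidates:
        return None  # nothing left to steal
    return max(candidates, key=lambda t: remaining_work(ranges[t], prefix))


# Hypothetical state: thread 0 is idle, thread 1 has five cheap iterations
# left, thread 2 has two expensive ones.
costs = [1] * 8 + [50, 60]
ranges = [(3, 3), (3, 8), (8, 10)]
prefix = prefix_costs(costs)
print(pick_victim_wsri(ranges, 0))          # prints 1
print(pick_victim_wsrw(ranges, 0, prefix))  # prints 2
```

In this example the iteration count points the thief at thread 1, whose remaining work is only 5 cost units, while the workload-based criterion picks thread 2, which still holds 110 units.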




Updated: 2024-01-19