joinTree: A novel join-oriented multivariate operator for spatio-temporal data management in Flink
GeoInformatica (IF 2) Pub Date: 2022-08-04, DOI: 10.1007/s10707-022-00470-5
Hangxu Ji, Gang Wu, Yuhai Zhao, Shiye Wang, Guoren Wang, George Y. Yuan

In the era of the intelligent Internet, managing and analyzing massive spatio-temporal data is an important step toward realizing intelligent applications and building smart cities, and the interaction of multi-source data is the basis of that management and analysis. Flink, an important platform for interactive computation over massive data, provides the high-level Join operator to simplify user program development. In a Flink job that joins multiple data sources, the choice of join sequence and the data communication in the repartition phase are both key factors in job efficiency. However, Flink provides no optimization mechanism for either factor, which leads to low job efficiency. Enumerating all join sequences to find the optimal one cannot be done in polynomial time, so exhaustive search does not achieve a practical optimization. We investigate these problems, design and implement joinTree, a more advanced operator that supports multi-source joins in Flink, and introduce two optimization strategies into the operator. The advantages of our work are as follows: (1) the operator enables Flink to support multi-source join operations, and it reduces the amount of calculation and data communication through lightweight optimization strategies, improving job efficiency; (2) with the join-sequence optimization strategy, the total running time is reduced by 29% and the data communication by 34% compared with traditional sequential execution; (3) the data-repartition optimization strategy yields a further 35% performance improvement and, in the average case, reduces data communication by 43%.
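To make the join-sequence problem concrete, the following is a minimal sketch, not the joinTree algorithm itself, which the abstract does not describe. It assumes a simple cost model (the sum of intermediate result sizes of a left-deep join) and uses hypothetical source names, row counts, and selectivities. It compares the order in which the sources happen to be listed, an exhaustive search over all n! orders, and a lightweight greedy heuristic.

```python
from itertools import permutations

# Hypothetical inputs: row counts per source and selectivity per joinable pair.
sizes = {"trips": 1000, "stations": 10, "weather": 100, "events": 50}
selectivity = {
    frozenset(("trips", "stations")): 0.1,
    frozenset(("trips", "weather")): 0.01,
    frozenset(("stations", "events")): 0.02,
}


def sequence_cost(order, sizes, selectivity):
    """Sum of intermediate result sizes for a left-deep join in this order."""
    joined = [order[0]]
    rows = float(sizes[order[0]])
    cost = 0.0
    for name in order[1:]:
        # Combine the selectivities linking the new source to those already
        # joined; pairs with no join predicate default to 1.0 (cross product).
        factor = 1.0
        for prev in joined:
            factor *= selectivity.get(frozenset((prev, name)), 1.0)
        rows = rows * sizes[name] * factor
        cost += rows
        joined.append(name)
    return cost


def exhaustive_order(sizes, selectivity):
    """Try every permutation: optimal under this model, but n! candidates."""
    return min(permutations(sizes),
               key=lambda o: sequence_cost(o, sizes, selectivity))


def greedy_order(sizes, selectivity):
    """Start from the smallest source, then repeatedly add the source that
    keeps the accumulated cost lowest. Polynomial time, not always optimal."""
    remaining = sorted(sizes)  # sorted for deterministic tie-breaking
    order = [min(remaining, key=lambda n: sizes[n])]
    remaining.remove(order[0])
    while remaining:
        nxt = min(remaining,
                  key=lambda n: sequence_cost(order + [n], sizes, selectivity))
        order.append(nxt)
        remaining.remove(nxt)
    return tuple(order)


naive = tuple(sizes)  # the order in which the sources happen to be listed
best = exhaustive_order(sizes, selectivity)
greedy = greedy_order(sizes, selectivity)
for label, order in (("naive", naive), ("exhaustive", best), ("greedy", greedy)):
    print(label, order, sequence_cost(order, sizes, selectivity))
```

With four sources the exhaustive search inspects only 24 orders, but the count grows factorially with the number of sources, which is why a lightweight strategy is attractive. The greedy heuristic shown here is just one possible choice and carries no optimality guarantee in general.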




Updated: 2022-08-04