Abstract
Data Stream Processing (DSP) applications analyze data flows in near real-time by means of operators, which process and transform incoming data. Operators handle high data rates by running parallel replicas across multiple processors and hosts. To guarantee consistent performance without wasting resources in the face of variable workloads, auto-scaling techniques have been studied to adapt operator parallelism at run-time. However, most of this effort assumes homogeneous computing infrastructures, neglecting the complexity of modern environments.
We consider the problem of deciding both how many operator replicas should be executed and which types of computing nodes should be acquired. We devise heterogeneity-aware policies by means of a two-layered hierarchy of controllers. While application-level components steer the adaptation process for whole applications, aiming to guarantee user-specified requirements, lower-layer components control auto-scaling of single operators. We tackle the fundamental challenge of performance and workload uncertainty, exploiting Bayesian optimization (BO) and reinforcement learning (RL) to devise policies. The evaluation shows that our approach is able to meet users’ requirements in terms of response time and adaptation overhead, while minimizing the cost due to resource usage, outperforming state-of-the-art baselines. We also demonstrate how partial model information is exploited to reduce training time for learning-based controllers.
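As a rough illustration of the two-layer idea, the following Python sketch has an application-level controller split an end-to-end response-time requirement across operators, while each operator-level controller picks the cheapest combination of node type and replica count that meets its share. Everything specific here is an assumption made for illustration and not taken from the paper: the `NodeType` and `Operator` classes, the speedup and cost figures, the even budget split over a pipeline of operators, and the M/M/1 response-time model. Exhaustive enumeration stands in for the BO- and RL-based policies that the paper actually devises.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class NodeType:
    name: str
    speedup: float  # hypothetical relative processing speed
    cost: float     # hypothetical cost per replica per time unit


@dataclass(frozen=True)
class Operator:
    name: str
    service_rate: float  # tuples/s handled by one replica on a speedup-1.0 node


def response_time(op, node, replicas, input_rate):
    """Assumed M/M/1 response time of one replica, with load split evenly."""
    mu = op.service_rate * node.speedup
    lam = input_rate / replicas
    if lam >= mu:
        return float("inf")  # replica is overloaded
    return 1.0 / (mu - lam)


def operator_controller(op, input_rate, budget, node_types, max_replicas=16):
    """Lower layer: cheapest (node type, replicas, cost) meeting the budget."""
    best = None
    for node, k in product(node_types, range(1, max_replicas + 1)):
        if response_time(op, node, k, input_rate) <= budget:
            cost = node.cost * k
            if best is None or cost < best[2]:
                best = (node.name, k, cost)
    return best  # None if no configuration satisfies the budget


def application_controller(ops, input_rate, max_response_time, node_types):
    """Upper layer: split the end-to-end budget evenly across a pipeline."""
    budget = max_response_time / len(ops)
    return {
        op.name: operator_controller(op, input_rate, budget, node_types)
        for op in ops
    }


# Hypothetical infrastructure and application.
node_types = [NodeType("small", 1.0, 1.0), NodeType("large", 2.0, 1.8)]
ops = [Operator("parse", 100.0), Operator("aggregate", 50.0)]
plan = application_controller(
    ops, input_rate=150.0, max_response_time=0.2, node_types=node_types
)
for name, choice in plan.items():
    print(name, choice)
```

In this toy setting the faster node type is the cheaper choice for both operators, because fewer replicas are needed to stay within the budget. A learning-based controller would replace the closed-form `response_time` model with estimates obtained from observed performance.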
Index Terms
- Hierarchical Auto-scaling Policies for Data Stream Processing on Heterogeneous Resources