Hybrid actor-critic algorithm for quantum reinforcement learning at CERN beam lines, Quantum Science and Technology

Quantum Science and Technology (IF 6.7), Pub Date: 2024-02-21, DOI: 10.1088/2058-9565/ad261b
Michael Schenk, Elías F Combarro, Michele Grossi, Verena Kain, Kevin Shing Bruce Li, Mircea-Marian Popa, Sofia Vallecorsa

Free energy-based reinforcement learning (FERL) with clamped quantum Boltzmann machines (QBM) was shown to significantly improve learning efficiency compared to classical Q-learning, although only in discrete state-action space environments. In this paper, the FERL approach is extended to multi-dimensional continuous state-action space environments to open the door to a broader range of real-world applications. First, free energy-based Q-learning is studied for environments with discrete action spaces but continuous state spaces, and the impact of experience replay on sample efficiency is assessed. In a second step, a hybrid actor-critic (A-C) scheme for continuous state-action spaces is developed based on the deep deterministic policy gradient algorithm, combining a classical actor network with a QBM-based critic. The results obtained with quantum annealing (QA), both simulated and on D-Wave QA hardware, are discussed, and the performance is compared to classical reinforcement learning methods. The environments used throughout represent existing particle accelerator beam lines at the European Organisation for Nuclear Research (CERN). Among others, the hybrid A-C agent is evaluated on the actual electron beam line of the Advanced Wakefield Experiment (AWAKE).
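The core idea of FERL, approximating the Q-value by the negative free energy of a Boltzmann machine whose visible units are clamped to a state-action pair, can be illustrated with a minimal sketch. This is not the authors' implementation. It swaps the clamped QBM for a classical restricted Boltzmann machine at unit temperature, because there the free energy has a closed form and no annealer samples are needed. All function names and the toy parameters are hypothetical.

```python
import itertools
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def free_energy(v, W, b, c):
    """Closed-form free energy of a classical binary RBM with visible units clamped to v.

    W[i][j] couples visible unit i to hidden unit j; b and c are the
    visible and hidden biases. Temperature is fixed to 1 (assumption).
    """
    visible_term = sum(b_i * v_i for b_i, v_i in zip(b, v))
    hidden_term = 0.0
    for j, c_j in enumerate(c):
        field = c_j + sum(v[i] * W[i][j] for i in range(len(v)))
        hidden_term += softplus(field)
    return -visible_term - hidden_term


def free_energy_brute_force(v, W, b, c):
    """Same quantity, obtained by summing exp(-E) over all hidden configurations."""
    z = 0.0
    for h in itertools.product((0.0, 1.0), repeat=len(c)):
        energy = -sum(b[i] * v[i] for i in range(len(v)))
        energy -= sum(c[j] * h[j] for j in range(len(c)))
        energy -= sum(
            v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(c))
        )
        z += math.exp(-energy)
    return -math.log(z)


def q_value(state, action, W, b, c):
    # FERL convention: Q(s, a) is approximated by the negative free energy
    # with the visible units clamped to the concatenated (state, action).
    return -free_energy(list(state) + list(action), W, b, c)


# Hypothetical toy parameters: 2 state bits, 1 action bit, 2 hidden units.
W = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
b = [0.1, -0.2, 0.05]
c = [0.0, 0.3]
q = q_value([1.0, 0.0], [1.0], W, b, c)
```

In the quantum setting described in the abstract, the closed-form expression is not available, and the free energy would instead be estimated from samples returned by simulated or hardware quantum annealing. The brute-force function is included only as a consistency check for the closed form on small models.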

Updated: 2024-02-21
