Abstract
As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods, which sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach that improves sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. We then introduce a clustering buffer, consisting of on-policy data, off-policy data, and expert data, to model the internal hierarchy and to evaluate actions from the clusters in the candidate set during the exploration stage. In this way, our approach can exploit the supervision information contained in the expert demonstration data. Experiments on six continuous locomotion environments demonstrate the superior performance and faster convergence of selective sampling. In particular, on the LGSVL task, our method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. Our code is open-sourced for reproducibility at https://github.com/Shihwin/SelectiveSampling.
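To make the sampling loop described above concrete, below is a minimal, illustrative Python sketch of one selective-sampling step: draw a set of candidate actions from the policy, cluster them, and act with the cluster representative that a critic scores highest. This is a sketch under stated assumptions, not the authors' implementation: the clustering is a plain k-means stand-in, and all names (kmeans, select_action, policy_sample, q_value) are hypothetical. In the actual method the critic would be trained on the clustering buffer of on-policy, off-policy, and expert data, which this sketch does not reproduce.

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain k-means over candidate actions X of shape (n, action_dim).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each candidate action to its nearest cluster center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def select_action(policy_sample, q_value, state, n_candidates=64, n_clusters=4):
    # One selective-sampling step: candidate set -> clusters -> best-scored center.
    candidates = np.stack([policy_sample(state) for _ in range(n_candidates)])
    centers = kmeans(candidates, n_clusters)
    scores = np.array([q_value(state, a) for a in centers])
    return centers[int(np.argmax(scores))]

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    policy_sample = lambda s: rng.normal(size=2)   # stand-in stochastic Gaussian policy
    q_value = lambda s, a: -float(np.sum(a ** 2))  # toy critic that prefers small actions
    print(select_action(policy_sample, q_value, state=np.zeros(4)))

Under these assumptions, evaluating only the n_clusters representatives rather than all n_candidates actions keeps the number of critic queries per step fixed, which is one plausible source of the sampling cost saving.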
Data availability
The code is available at https://github.com/Shihwin/SelectiveSampling. The other data that support the findings of this study are available from the corresponding author upon reasonable request.
Author information
Contributions
Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, and Junping ZHANG designed the research. Shihmin WANG and Binqi ZHAO processed the data. Shihmin WANG drafted the paper. Junping ZHANG and Jian PU helped organize the paper. Shihmin WANG, Junping ZHANG, and Jian PU revised and finalized the paper.
Ethics declarations
Junping ZHANG is an editorial board member and Jian PU is a corresponding expert of Frontiers of Information Technology & Electronic Engineering; neither was involved in the peer review of this paper. All authors declare that they have no conflict of interest.
Additional information
Project supported by the National Natural Science Foundation of China (No. 62176059), the Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), Zhangjiang Lab, and the Shanghai Center for Brain Science and Brain-inspired Technology
About this article
Cite this article
Wang, S., Zhao, B., Zhang, Z. et al. Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning. Front Inform Technol Electron Eng 24, 1541–1556 (2023). https://doi.org/10.1631/FITEE.2300084