Abstract
As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods, which sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach that improves sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. We then introduce a clustering buffer, consisting of on-policy data, off-policy data, and expert data, to model the internal hierarchy and to evaluate actions from the clusters in the candidate set during the exploration stage. In this way, our approach can exploit the supervision information contained in the expert demonstration data. Experiments on six continuous locomotion environments demonstrate the superior performance and faster convergence of selective sampling. In particular, on the LGSVL task, our method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. Our code is open-sourced for reproducibility at https://github.com/Shihwin/SelectiveSampling.
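To make the sampling loop described above concrete, below is a minimal, illustrative Python sketch of one selective-sampling step: draw a set of candidate actions from the policy, cluster them, and act with the cluster representative that a critic scores highest. This is a sketch under stated assumptions, not the authors' implementation: the clustering is a plain k-means stand-in, and all names (kmeans, select_action, policy_sample, q_value) are hypothetical. In the actual method the critic would be trained on the clustering buffer of on-policy, off-policy, and expert data, which this sketch does not reproduce.

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain k-means over candidate actions X of shape (n, action_dim).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each candidate action to its nearest cluster center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def select_action(policy_sample, q_value, state, n_candidates=64, n_clusters=4):
    # One selective-sampling step: candidate set -> clusters -> best-scored center.
    candidates = np.stack([policy_sample(state) for _ in range(n_candidates)])
    centers = kmeans(candidates, n_clusters)
    scores = np.array([q_value(state, a) for a in centers])
    return centers[int(np.argmax(scores))]

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    policy_sample = lambda s: rng.normal(size=2)   # stand-in stochastic Gaussian policy
    q_value = lambda s, a: -float(np.sum(a ** 2))  # toy critic that prefers small actions
    print(select_action(policy_sample, q_value, state=np.zeros(4)))

Under these assumptions, evaluating only the n_clusters representatives rather than all n_candidates actions keeps the number of critic queries per step fixed, which is one plausible source of the sampling cost saving.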
Data availability
The code is available at https://github.com/Shihwin/SelectiveSampling. The other data that support the findings of this study are available from the corresponding author upon reasonable request.
Author information
Contributions
Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, and Junping ZHANG designed the research. Shihmin WANG and Binqi ZHAO processed the data. Shihmin WANG drafted the paper. Junping ZHANG and Jian PU helped organize the paper. Shihmin WANG, Junping ZHANG, and Jian PU revised and finalized the paper.
Ethics declarations
Junping ZHANG is an editorial board member and Jian PU is a corresponding expert of Frontiers of Information Technology & Electronic Engineering; neither was involved in the peer review of this paper. All authors declare that they have no conflict of interest.
Additional information
Project supported by the National Natural Science Foundation of China (No. 62176059), the Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), Zhangjiang Lab, and the Shanghai Center for Brain Science and Brain-inspired Technology
About this article
Cite this article
Wang, S., Zhao, B., Zhang, Z. et al. Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning. Front Inform Technol Electron Eng 24, 1541–1556 (2023). https://doi.org/10.1631/FITEE.2300084