
Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning


  • Research Article, Frontiers of Information Technology & Electronic Engineering

Abstract

As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods that sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach to improve sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. Then we introduce a clustering buffer for modeling the internal hierarchy, which consists of on-policy data, off-policy data, and expert data, to evaluate actions from the clusters in the action candidate set in the exploration stage. In this way, our approach is able to take advantage of the supervision information in the expert demonstration data. Experiments on six continuous locomotion environments demonstrate that selective sampling achieves superior reinforcement learning performance and faster convergence. In particular, on the LGSVL task, our method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. Furthermore, our code is open-sourced for reproducibility and is available at https://github.com/Shihwin/SelectiveSampling.
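To make the mechanism concrete, below is a minimal Python sketch of the two ideas the abstract describes: clustering policy samples into an action candidate set, and drawing training batches from a buffer that mixes on-policy, off-policy, and expert data. All names and numbers here (policy_sample, q_value, the cluster count, the 40/40/20 mixing ratio) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import numpy as np
from sklearn.cluster import KMeans

def selective_sample(policy_sample, q_value, state,
                     n_candidates=64, n_clusters=4, rng=None):
    """Sketch of selective sampling: pick an action from the
    highest-valued cluster of policy samples. `policy_sample` and
    `q_value` are hypothetical callables standing in for the policy
    and the critic."""
    rng = rng or np.random.default_rng()
    # Draw a candidate set of actions from the current policy.
    candidates = np.stack([policy_sample(state) for _ in range(n_candidates)])
    # Cluster the candidates to expose structure in the action space.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(candidates)
    # Score each cluster by the mean critic value of its members.
    scores = [q_value(state, candidates[labels == k]).mean()
              for k in range(n_clusters)]
    # Explore within the best-scoring cluster by sampling one member.
    members = candidates[labels == int(np.argmax(scores))]
    return members[rng.integers(len(members))]

def sample_mixed_batch(on_policy, off_policy, expert,
                       batch_size=256, ratios=(0.4, 0.4, 0.2), rng=None):
    """Sketch of a clustering-buffer draw: a minibatch mixing on-policy,
    off-policy, and expert transitions. The 40/40/20 split is an
    illustrative placeholder, not the paper's setting."""
    rng = rng or np.random.default_rng()
    batch = []
    for pool, r in zip((on_policy, off_policy, expert), ratios):
        idx = rng.integers(len(pool), size=int(batch_size * r))
        batch.extend(pool[i] for i in idx)
    return batch
```

In this reading, clustering turns a flat set of sampled actions into a small number of alternatives that can be compared by value before acting, while the mixed buffer lets expert demonstrations supervise that comparison during exploration.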



Data availability

The code is available at https://github.com/Shihwin/SelectiveSampling. The other data that support the findings of this study are available from the corresponding author upon reasonable request.


Author information


Contributions

Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, and Junping ZHANG designed the research. Shihmin WANG and Binqi ZHAO processed the data. Shihmin WANG drafted the paper. Junping ZHANG and Jian PU helped organize the paper. Shihmin WANG, Junping ZHANG, and Jian PU revised and finalized the paper.

Corresponding author

Correspondence to Jian PU (浦剑).

Ethics declarations

Junping ZHANG is an editorial board member and Jian PU is a corresponding expert of Frontiers of Information Technology & Electronic Engineering; neither was involved in the peer review of this paper. All authors declare that they have no conflict of interest.

Additional information

Project supported by the National Natural Science Foundation of China (No. 62176059), the Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01), Zhangjiang Lab, and the Shanghai Center for Brain Science and Brain-inspired Technology.


About this article


Cite this article

Wang, S., Zhao, B., Zhang, Z. et al. Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning. Front Inform Technol Electron Eng 24, 1541–1556 (2023). https://doi.org/10.1631/FITEE.2300084


