Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning
Frontiers of Information Technology & Electronic Engineering (IF 3), Pub Date: 2023-12-07, DOI: 10.1631/fitee.2300084
Shihmin Wang, Binqi Zhao, Zhengfeng Zhang, Junping Zhang, Jian Pu

As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods, which sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach that improves sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. We then introduce a clustering buffer for modeling the internal hierarchy, consisting of on-policy data, off-policy data, and expert data, to evaluate actions from the clusters in the candidate set during the exploration stage. In this way, our approach can exploit the supervision information in the expert demonstration data. Experiments on six continuous locomotion environments demonstrate the superior reinforcement learning performance and faster convergence of selective sampling. In particular, on the LGSVL task, our method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. Our code is open-sourced for reproducibility and is available at https://github.com/Shihwin/SelectiveSampling.
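
The sketch below is a rough, self-contained illustration of the selective sampling idea summarized in the abstract, not the authors' implementation (see the linked repository for that). The Gaussian policy, the use of k-means to cluster candidate actions, and the nearest-neighbor scoring of cluster representatives against a small buffer standing in for the on-policy/off-policy/expert clustering buffer are all simplifying assumptions made only for illustration.

    # Minimal sketch of clustering-based selective action sampling (illustrative only).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    def sample_candidate_actions(policy_mean, policy_std, n_candidates=64):
        """Draw a candidate action set from a placeholder Gaussian policy."""
        return rng.normal(policy_mean, policy_std,
                          size=(n_candidates, policy_mean.shape[0]))

    def cluster_candidates(actions, n_clusters=8):
        """Group candidate actions into clusters; return one representative per cluster."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(actions)
        return km.cluster_centers_

    def score_with_buffer(state, actions, buf_states, buf_actions, buf_returns):
        """Score each candidate by the return of the nearest (state, action) pair in the
        buffer, which stands in for the mixed on-policy/off-policy/expert clustering buffer."""
        scores = []
        for a in actions:
            d = (np.linalg.norm(buf_states - state, axis=1)
                 + np.linalg.norm(buf_actions - a, axis=1))
            scores.append(buf_returns[np.argmin(d)])
        return np.asarray(scores)

    def selective_sample(state, policy_mean, policy_std, buffer):
        """Pick the cluster representative with the highest buffer-based score."""
        candidates = sample_candidate_actions(policy_mean, policy_std)
        reps = cluster_candidates(candidates)
        scores = score_with_buffer(state, reps, *buffer)
        return reps[np.argmax(scores)]

    # Toy usage with random data standing in for a real environment and buffer.
    state_dim, action_dim = 4, 2
    buffer = (rng.normal(size=(256, state_dim)),   # states
              rng.normal(size=(256, action_dim)),  # actions
              rng.normal(size=256))                # returns (expert transitions would score high)
    action = selective_sample(rng.normal(size=state_dim),
                              np.zeros(action_dim), np.ones(action_dim), buffer)
    print(action)

In the paper's setting, the buffer-based scoring would draw on learned value estimates rather than raw nearest-neighbor returns; the point of the sketch is only the sample-then-cluster-then-evaluate structure of the exploration step.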



Updated: 2023-12-07