Multigoal Reinforcement Learning via Exploring Entropy-Regularized Successor Matching
IEEE Transactions on Games (IF 2.3), Pub Date: 2023-08-11, DOI: 10.1109/tg.2023.3304315
Xiaoyun Feng, Yun Zhou

Multigoal reinforcement learning (RL) algorithms aim to achieve, and generalize over, diverse goals. However, unlike single-goal agents, multigoal agents struggle to break through the exploration bottleneck within a fair share of interactions, because goal-oriented experiences with sparse goal-reaching rewards are rarely reusable. Well-arranged behavior goals during training are therefore essential for multigoal agents, especially in long-horizon tasks. To this end, we propose efficient multigoal exploration based on maximizing the entropy of successor features and Exploring entropy-regularized Successor Matching, namely E$^{2}$SM. E$^{2}$SM adopts the idea of successor features and extends it to an entropy-regularized goal-reaching successor mapping that serves as a more stable state feature under sparse rewards. The key contribution of our work is intrinsic goal setting with behavior goals that are more likely to be achieved in terms of future state occupancies and that are promising for expanding the exploration frontier. Experiments on challenging long-horizon manipulation tasks show that E$^{2}$SM copes well with sparse rewards and, in pursuit of maximal state coverage, efficiently identifies valuable behavior goals for specific goal-reaching by matching the successor mapping.
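The abstract does not spell out the formal definitions, but the successor-feature idea it builds on is standard. Below is a minimal illustrative sketch in a tabular setting using the closely related successor representation; the helper names (`successor_representation`, `score_behavior_goals`), the occupancy-entropy bonus, and the weight `beta` are assumptions for exposition, not the paper's exact formulation of entropy-regularized successor matching.

```python
import numpy as np

def successor_representation(P_pi, gamma=0.99):
    """Tabular successor representation: M = (I - gamma * P_pi)^{-1}.
    M[s, s'] is the expected discounted number of visits to s' when
    starting from s and following the current policy."""
    S = P_pi.shape[0]
    return np.linalg.inv(np.eye(S) - gamma * P_pi)

def score_behavior_goals(M, current_state, candidate_goals, beta=1.0):
    """Illustrative entropy-regularized goal scoring.

    For each candidate goal g:
      * "matching" term  M[current_state, g]: discounted occupancy of g,
        i.e. how achievable g looks from the current state;
      * "entropy" term   H(normalized M[g]): how spread out the future
        state occupancy from g is, i.e. how much g could expand the
        exploration frontier.
    The additive combination with weight beta is a stand-in, not the
    paper's rule.
    """
    scores = []
    for g in candidate_goals:
        match = M[current_state, g]
        occ = M[g] / M[g].sum()                       # occupancy distribution from g
        entropy = -(occ * np.log(occ + 1e-12)).sum()  # future-occupancy entropy
        scores.append(match + beta * entropy)
    return np.array(scores)

# Toy usage: random 6-state MDP, pick among 3 candidate behavior goals.
rng = np.random.default_rng(0)
P_pi = rng.dirichlet(np.ones(6), size=6)              # row-stochastic transitions under the policy
M = successor_representation(P_pi, gamma=0.95)
scores = score_behavior_goals(M, current_state=0, candidate_goals=[2, 4, 5])
best_goal = [2, 4, 5][int(np.argmax(scores))]
```

In this toy version, the matching term favors goals the agent can already reach, while the entropy term favors goals whose future occupancy is broad, which is one plausible way to read "achievable yet frontier-expanding" behavior goals.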
