Policy generation network for zero-shot policy learning
Computational Intelligence (IF 2.8). Pub Date: 2023-07-04. DOI: 10.1111/coin.12591
Yiming Qian, Fengyi Zhang, Zhiyong Liu

Lifelong reinforcement learning continually accumulates shared knowledge by estimating inter-task relationships from the training data of previously learned tasks, so that knowledge reuse can accelerate learning on new tasks. Existing methods represent these inter-task relationships with a linear model over task features, which allows a new task to be accomplished without any learning. However, such methods may be ineffective in general scenarios, because a linear model must map low-dimensional task features into a high-dimensional policy-parameter space. Moreover, a deficiency in how errors are computed from the objective function can arise during lifelong reinforcement learning: due to inter-parameter correlation, errors in some policy parameters suppress the errors in others. In this paper, we develop a policy generation network that models the inter-task relationships nonlinearly by mapping low-dimensional task features to high-dimensional policy parameters, representing the shared knowledge more effectively. We also propose a novel objective function for lifelong reinforcement learning that alleviates this error-calculation deficiency by imposing weight constraints on the errors. We empirically demonstrate that our method improves zero-shot policy performance across a variety of dynamical systems.
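To make the architecture concrete, below is a minimal PyTorch-style sketch of a network that nonlinearly maps a low-dimensional task feature vector to a high-dimensional policy-parameter vector, paired with a per-parameter weighted error. The class and function names, the network shape, and the exact form of the weight-constrained objective are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn as nn

class PolicyGenerationNetwork(nn.Module):
    """Illustrative sketch: nonlinearly map a low-dimensional task
    feature vector to the flattened parameters of a task policy."""
    def __init__(self, task_feat_dim, policy_param_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(task_feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, policy_param_dim),
        )

    def forward(self, task_features):
        return self.net(task_features)

def weighted_param_error(pred_params, target_params, weights):
    # Per-parameter weights keep large errors in some policy parameters
    # from masking errors in others (an assumed form of the paper's
    # weight-constrained objective, not its actual equation).
    return torch.mean(weights * (pred_params - target_params) ** 2)

# Zero-shot use on a new task: generate policy parameters directly from
# the task's features, with no further learning on that task.
gen = PolicyGenerationNetwork(task_feat_dim=4, policy_param_dim=256)
theta_new = gen(torch.randn(4))  # parameters for the new task's policy

# Training-time error against known policy parameters of a learned task.
target = torch.randn(256)        # stand-in for learned task parameters
loss = weighted_param_error(theta_new, target, torch.ones(256))
```

In this reading, the generator plays the role the linear model plays in prior work, but the nonlinear mapping lets it express richer inter-task structure between the feature space and the parameter space.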
