VDSC: Enhancing Exploration Timing with Value Discrepancy and State Counts
arXiv - CS - Artificial Intelligence Pub Date : 2024-03-26 , DOI: arxiv-2403.17542 Marius Captari, Remo Sasso, Matthia Sabatelli
Despite the considerable attention given to the questions of \textit{how
much} and \textit{how to} explore in deep reinforcement learning, the
investigation into \textit{when} to explore remains relatively less researched.
While more sophisticated exploration strategies can excel in specific, often
sparse reward environments, existing simpler approaches, such as
$\epsilon$-greedy, persist in outperforming them across a broader spectrum of
domains. The appeal of these simpler strategies lies in their ease of
implementation and generality across a wide range of domains. The downside is
that these methods are essentially a blind switching mechanism, which
completely disregards the agent's internal state. In this paper, we propose to
leverage the agent's internal state to decide \textit{when} to explore,
addressing the shortcomings of blind switching mechanisms. We present Value
Discrepancy and State Counts through homeostasis (VDSC), a novel approach for
efficient exploration timing. Experimental results on the Atari suite
demonstrate the superiority of our strategy over traditional methods such as
$\epsilon$-greedy and Boltzmann, as well as more sophisticated techniques like
Noisy Nets.
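
The abstract describes the core idea only at a high level: instead of a blind switching rule like $\epsilon$-greedy, the agent's internal signals (a value discrepancy and state visitation counts) are regulated through homeostasis to decide *when* to explore. The sketch below is an illustrative reading of that idea, not the paper's exact algorithm: the signal combination, the count-based novelty bonus, and all names (`HomeostaticExplorationTrigger`, `target_rate`, etc.) are assumptions for demonstration. The homeostatic part adapts a threshold so that exploration settles near a target rate.

```python
import random
from collections import defaultdict


class HomeostaticExplorationTrigger:
    """Illustrative sketch of homeostasis-gated exploration timing.

    Each step, an "interest" signal is formed from a value-discrepancy
    term (e.g. a TD-error magnitude) plus a count-based novelty bonus.
    The agent explores when the signal exceeds an adaptive threshold,
    and the threshold self-regulates toward a target exploration rate.
    """

    def __init__(self, target_rate=0.05, lr=0.01):
        self.target_rate = target_rate  # desired fraction of exploratory steps
        self.lr = lr                    # threshold adaptation speed
        self.threshold = 0.0
        self.counts = defaultdict(int)  # visit counts per (discretised) state

    def signal(self, state_key, value_discrepancy):
        # Combine the two internal signals named in the abstract:
        # value discrepancy and a state-count novelty bonus.
        self.counts[state_key] += 1
        novelty = 1.0 / (self.counts[state_key] ** 0.5)
        return abs(value_discrepancy) + novelty

    def should_explore(self, state_key, value_discrepancy):
        s = self.signal(state_key, value_discrepancy)
        explore = s > self.threshold
        # Homeostasis: raise the threshold after an exploratory step,
        # lower it after an exploitative one, so the long-run rate of
        # exploration drifts toward target_rate.
        self.threshold += self.lr * ((1.0 if explore else 0.0) - self.target_rate)
        return explore


# Usage: gate the exploratory action inside an ordinary control loop.
trigger = HomeostaticExplorationTrigger(target_rate=0.05)
td_error = 0.3  # placeholder value discrepancy from the learner
if trigger.should_explore(state_key="s0", value_discrepancy=td_error):
    action = random.randrange(4)   # exploratory action
else:
    action = 0                     # greedy action from the policy
```

Early on, high novelty keeps the signal above the threshold, so exploration is frequent; as counts grow and value estimates stabilise, the homeostatic threshold tracks the decaying signal and exploration settles near the target rate, which is the informed analogue of a fixed $\epsilon$ schedule.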
Updated: 2024-03-28