VDSC: Enhancing Exploration Timing with Value Discrepancy and State Counts
arXiv - CS - Artificial Intelligence Pub Date : 2024-03-26 , DOI: arxiv-2403.17542 Marius Captari, Remo Sasso, Matthia Sabatelli
Despite the considerable attention given to the questions of \textit{how
much} and \textit{how to} explore in deep reinforcement learning, the
investigation into \textit{when} to explore remains relatively less researched.
While more sophisticated exploration strategies can excel in specific, often
sparse reward environments, existing simpler approaches, such as
$\epsilon$-greedy, persist in outperforming them across a broader spectrum of
domains. The appeal of these simpler strategies lies in their ease of
implementation and generality across a wide range of domains. The downside is
that these methods are essentially a blind switching mechanism, which
completely disregards the agent's internal state. In this paper, we propose to
leverage the agent's internal state to decide \textit{when} to explore,
addressing the shortcomings of blind switching mechanisms. We present Value
Discrepancy and State Counts through homeostasis (VDSC), a novel approach for
efficient exploration timing. Experimental results on the Atari suite
demonstrate the superiority of our strategy over traditional methods such as
$\epsilon$-greedy and Boltzmann, as well as more sophisticated techniques like
Noisy Nets.
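
The abstract describes the core idea only at a high level: instead of a blind switching rule like $\epsilon$-greedy, the agent's internal signals (a value discrepancy and state visitation counts) are regulated through homeostasis to decide *when* to explore. The sketch below is an illustrative reading of that idea, not the paper's exact algorithm: the signal combination, the count-based novelty bonus, and all names (`HomeostaticExplorationTrigger`, `target_rate`, etc.) are assumptions for demonstration. The homeostatic part adapts a threshold so that exploration settles near a target rate.

```python
import random
from collections import defaultdict


class HomeostaticExplorationTrigger:
    """Illustrative sketch of homeostasis-gated exploration timing.

    Each step, an "interest" signal is formed from a value-discrepancy
    term (e.g. a TD-error magnitude) plus a count-based novelty bonus.
    The agent explores when the signal exceeds an adaptive threshold,
    and the threshold self-regulates toward a target exploration rate.
    """

    def __init__(self, target_rate=0.05, lr=0.01):
        self.target_rate = target_rate  # desired fraction of exploratory steps
        self.lr = lr                    # threshold adaptation speed
        self.threshold = 0.0
        self.counts = defaultdict(int)  # visit counts per (discretised) state

    def signal(self, state_key, value_discrepancy):
        # Combine the two internal signals named in the abstract:
        # value discrepancy and a state-count novelty bonus.
        self.counts[state_key] += 1
        novelty = 1.0 / (self.counts[state_key] ** 0.5)
        return abs(value_discrepancy) + novelty

    def should_explore(self, state_key, value_discrepancy):
        s = self.signal(state_key, value_discrepancy)
        explore = s > self.threshold
        # Homeostasis: raise the threshold after an exploratory step,
        # lower it after an exploitative one, so the long-run rate of
        # exploration drifts toward target_rate.
        self.threshold += self.lr * ((1.0 if explore else 0.0) - self.target_rate)
        return explore


# Usage: gate the exploratory action inside an ordinary control loop.
trigger = HomeostaticExplorationTrigger(target_rate=0.05)
td_error = 0.3  # placeholder value discrepancy from the learner
if trigger.should_explore(state_key="s0", value_discrepancy=td_error):
    action = random.randrange(4)   # exploratory action
else:
    action = 0                     # greedy action from the policy
```

Early on, high novelty keeps the signal above the threshold, so exploration is frequent; as counts grow and value estimates stabilise, the homeostatic threshold tracks the decaying signal and exploration settles near the target rate, which is the informed analogue of a fixed $\epsilon$ schedule.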
Updated: 2024-03-28