Finding the optimal exploration-exploitation trade-off online through Bayesian risk estimation and minimization
Artificial Intelligence (IF 14.4) Pub Date: 2024-02-21, DOI: 10.1016/j.artint.2024.104096
Stewart Jamieson, Jonathan P. How, Yogesh Girdhar

We propose Bayesian risk estimation and minimization (EBRM) over policy sets as an approach to online learning across a wide range of settings. Many real-world online learning problems have complexities such as action- and belief-dependent rewards, time-discounting of reward, and heterogeneous costs for actions and feedback; we find that existing online learning heuristics cannot leverage most problem-specific information, to the detriment of their performance. We introduce a belief-space Markov decision process (BMDP) model that can capture these complexities, and further apply the concepts of aleatoric, epistemic, and process risks to online learning. These risk functions describe the risk inherent to the learning problem, the risk due to the agent's lack of knowledge, and the relative quality of its policy, respectively. We demonstrate how computing and minimizing these risk functions guides the online learning agent towards the optimal exploration-exploitation trade-off in any stochastic online learning problem, constituting the basis of the EBRM approach. We also show how Bayes' risk, the minimization objective in stochastic online learning problems, can be decomposed into the aforementioned aleatoric, epistemic, and process risks.
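As a reading aid (not part of the paper's abstract), the stated decomposition of Bayes' risk can be sketched as an additive identity over a belief state b and a policy π; the additive form and the symbols below are assumptions made for illustration, not the authors' notation:

\[
\mathcal{R}_{\text{Bayes}}(b, \pi) \;=\; \underbrace{\mathcal{R}_{\text{aleatoric}}(b)}_{\text{inherent to the problem}} \;+\; \underbrace{\mathcal{R}_{\text{epistemic}}(b)}_{\text{agent's lack of knowledge}} \;+\; \underbrace{\mathcal{R}_{\text{process}}(b, \pi)}_{\text{relative policy quality}}
\]

Under this reading, the first two terms depend on the current belief but not on the policy, so an EBRM-style agent would steer the exploration-exploitation trade-off by choosing the policy that minimizes the process-risk term.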

Updated: 2024-02-21