Imitating Cost-Constrained Behaviors in Reinforcement Learning
arXiv - CS - Artificial Intelligence Pub Date: 2024-03-26, DOI: arxiv-2403.17456
Qian Shao, Pradeep Varakantham, Shih-Fen Cheng

Complex planning and scheduling problems have long been solved using various optimization or heuristic approaches. In recent years, imitation learning, which aims to learn from expert demonstrations, has been proposed as a viable alternative for solving these problems. Generally speaking, imitation learning is designed to learn either the reward (or preference) model or the behavioral policy directly by observing the behavior of an expert. Existing work in imitation learning and inverse reinforcement learning has focused primarily on imitation in unconstrained settings (e.g., no limit on fuel consumed by the vehicle). However, in many real-world domains, the behavior of an expert is governed not only by reward (or preference) but also by constraints. For instance, decisions on self-driving delivery vehicles depend not only on route preferences/rewards (derived from past demand data) but also on the fuel in the vehicle and the time available. In such problems, imitation learning is challenging because decisions are dictated not only by the reward model but also by a cost-constrained model. In this paper, we provide multiple methods that match expert distributions in the presence of trajectory cost constraints through (a) a Lagrangian-based method; (b) meta-gradients that find a good trade-off between maximizing expected return and minimizing constraint violation; and (c) a cost-violation-based alternating gradient method. We empirically show that leading imitation learning approaches imitate cost-constrained behaviors poorly, and that our meta-gradient-based approach achieves the best performance.
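The abstract does not include pseudocode, but the Lagrangian-based idea in (a) can be illustrated with a minimal sketch: alternate a primal (policy) step that ascends an imitation objective penalized by λ times the trajectory cost, and a dual ascent step that raises λ whenever the cost constraint is violated. Everything in the sketch below (the one-step toy environment, the 0.8 expert action probability, the per-action cost of 2, COST_LIMIT) is an illustrative assumption, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

COST_LIMIT = 1.0           # d: maximum allowed expected trajectory cost (assumed)
LR_POLICY, LR_DUAL = 0.05, 0.1

theta = 0.0                # toy policy parameter: P(action=1) = sigmoid(theta)
lam = 0.0                  # Lagrange multiplier for the cost constraint

def rollout(theta, n=256):
    """Toy one-step 'trajectories': the agent picks action 1 with prob sigmoid(theta);
    a hypothetical expert picks action 1 with prob 0.8; each action-1 step costs 2."""
    p = 1.0 / (1.0 + np.exp(-theta))
    actions = rng.random(n) < p
    reward = np.where(actions, np.log(0.8), np.log(0.2))  # agreement with the expert distribution
    cost = 2.0 * actions
    return p, reward.mean(), cost.mean()

for step in range(300):
    p, avg_reward, avg_cost = rollout(theta)
    # Primal step: ascend the Lagrangian  E[imitation reward] - lam * E[cost].
    # The toy expectations are analytic, so the gradient w.r.t. theta is closed-form.
    grad = (np.log(0.8) - np.log(0.2) - lam * 2.0) * p * (1.0 - p)
    theta += LR_POLICY * grad
    # Dual step: raise lam when the sampled cost exceeds the limit; keep lam non-negative.
    lam = max(0.0, lam + LR_DUAL * (avg_cost - COST_LIMIT))

print(f"P(action=1)={p:.2f}  avg cost={avg_cost:.2f} (limit {COST_LIMIT})  lambda={lam:.2f}")
```

In this toy setting an unconstrained imitator would simply copy the expert's preferred action, but the dual variable keeps rising until the expected cost settles at the limit, which is the reward-versus-constraint trade-off the abstract describes.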

Updated: 2024-03-28