DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
arXiv - CS - Machine Learning Pub Date : 2024-03-26 , DOI: arxiv-2403.17827
Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin

Generating natural hand-object interactions in 3D is challenging because the resulting hand and object motions must be both physically plausible and semantically meaningful. Furthermore, generalization to unseen objects is hindered by the limited scale of available hand-object interaction datasets. We propose DiffH2O, a novel method to synthesize realistic, one- or two-handed object interactions from a text prompt and the geometry of the object. The method introduces three techniques that enable effective learning from limited data. First, we decompose the task into a grasping stage and a text-based interaction stage and use separate diffusion models for each. In the grasping stage, the model generates only hand motions, whereas in the interaction stage both hand and object poses are synthesized. Second, we propose a compact representation that tightly couples hand and object poses. Third, we propose two guidance schemes that allow more control over the generated motions: grasp guidance and detailed textual guidance. Grasp guidance takes a single target grasping pose and steers the diffusion model to reach this grasp at the end of the grasping stage, providing control over the grasping pose. Given a grasping motion from this stage, multiple different actions can then be prompted in the interaction stage. For textual guidance, we contribute comprehensive text descriptions to the GRAB dataset and show that they give our method more fine-grained control over hand-object interactions. Our quantitative and qualitative evaluation demonstrates that the proposed method outperforms baseline methods and produces natural hand-object motions. Moreover, we demonstrate the practicality of our framework by using a hand pose estimate from an off-the-shelf pose estimator for guidance, and then sampling multiple different actions in the interaction stage.
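The two-stage idea above can be illustrated with a toy sketch of the grasp-guidance step: during denoising of the grasping stage, the predicted hand pose is pulled toward a single target grasp, with the pull growing stronger as the stage ends. This is a minimal, hypothetical illustration in NumPy; the function names, the linear guidance schedule, and the stand-in denoiser are assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

def grasp_guidance(x_hat, target_grasp, t, t_total, strength=1.0):
    """Nudge the denoiser's pose estimate toward a target grasp pose.

    Hypothetical linear schedule: the pull weight w grows from 0 at the
    start of denoising (t = t_total) to near `strength` at the end (t = 1),
    so the trajectory ends in the desired grasp while earlier steps stay
    free to explore.
    """
    w = strength * (1.0 - t / t_total)
    return x_hat + w * (target_grasp - x_hat)

# Toy grasping-stage loop: only hand poses are generated here; in the
# paper's pipeline a second diffusion model would then synthesize both
# hand and object poses in the interaction stage.
rng = np.random.default_rng(0)
pose_dim = 6                          # toy hand-pose dimensionality
t_total = 10
target = np.ones(pose_dim)            # hypothetical target grasp pose
x = rng.normal(size=pose_dim)         # start from noise

for t in range(t_total, 0, -1):
    x_hat = 0.9 * x                   # stand-in for the denoiser's estimate
    x = grasp_guidance(x_hat, target, t, t_total)
```

Because the guidance weight peaks at the final step, the sampled motion ends close to the target grasp while the intermediate steps remain diffusion-driven, which is what makes it possible to reuse one grasping motion with many different interaction prompts afterwards.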

Updated: 2024-03-27