Trimodal Navigable Region Segmentation Model: Grounding Navigation Instructions in Urban Areas, IEEE Robotics and Automation Letters

IEEE Robotics and Automation Letters (IF 5.2), Pub Date: 2024-03-18, DOI: 10.1109/lra.2024.3376957
Naoki Hosomi₁, Shumpei Hatanaka₁, Yui Iioka₁, Wei Yang₁, Katsuyuki Kuyo₁, Teruhisa Misu₂, Kentaro Yamada₃, Komei Sugiura₁

In this study, we develop a model that enables mobilities to interact with users in a friendlier way. Specifically, we focus on the referring navigable regions task, in which a model grounds navigable regions of the road using the mobility's camera image and natural language navigation instructions. This task is challenging because it requires vision-and-language comprehension in rapidly changing environments that include other mobilities. The performance of existing methods is insufficient, partly because they do not consider features related to scene context, such as semantic segmentation information. Therefore, it is important to incorporate these features into a multimodal encoder. We propose a trimodal encoder-decoder model, the Trimodal Navigable Region Segmentation Model, whose three modalities are language, image, and mask. We introduce the Text-Mask Encoder Block to process semantic segmentation masks and the Day-Night Classification Branch to balance the input modalities. We validated our model on the Talk2Car-RegSeg dataset. The results demonstrated that our method outperformed the baseline method on standard metrics.

Updated: 2024-03-18
