Disentangling Structure and Appearance in ViT Feature Space
ACM Transactions on Graphics (IF 6.2), Pub Date: 2023-11-30, DOI: 10.1145/3630096
Narek Tumanyan, Omer Bar-Tal, Shir Amir, Shai Bagon, Tali Dekel
We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are “painted” with the visual appearance of their semantically related objects in a target appearance image. To integrate semantic information into our framework, our key idea is to leverage a pre-trained, fixed Vision Transformer (ViT) model. In particular, we derive novel disentangled representations of structure and appearance from deep ViT features. We then establish an objective function that splices the desired structure and appearance representations, interweaving them in the space of ViT features. Based on this objective, we propose two semantic appearance transfer frameworks: “Splice”, which trains a generator on a single, arbitrary pair of structure-appearance images, and “SpliceNet”, a feed-forward, real-time appearance transfer model trained on a dataset of images from a specific domain. Our frameworks involve no adversarial training and require no additional inputs such as semantic segmentation or correspondences. We demonstrate high-resolution results on a variety of in-the-wild image pairs, under significant variations in the number of objects, pose, and appearance. Code and supplementary material are available on our project page: splice-vit.github.io.
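
To make the splicing idea concrete, here is a minimal PyTorch sketch, not the authors' released implementation. It assumes a frozen DINO ViT loaded from torch.hub, with appearance taken as the deepest [CLS] token and structure as the cosine self-similarity of the deepest layer's keys; the MSE loss terms and the weight alpha are illustrative placeholders.

```python
# Minimal sketch of the two representations and the splicing objective.
# Assumptions (not stated in the abstract): a frozen DINO ViT-S/16 from
# torch.hub; appearance = deepest [CLS] token; structure = cosine
# self-similarity of the deepest layer's keys; MSE terms and alpha are
# placeholders. All images are assumed to share one resolution so the
# self-similarity matrices are comparable.
import torch
import torch.nn.functional as F

vit = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vit.eval()
for p in vit.parameters():
    p.requires_grad_(False)  # the ViT stays fixed; gradients still reach the input image

_feats = {}

def _grab_keys(module, _inputs, output):
    # qkv output of the deepest attention block: (B, N, 3*dim)
    B, N, _ = output.shape
    heads = vit.blocks[-1].attn.num_heads
    qkv = output.reshape(B, N, 3, heads, -1).permute(2, 0, 3, 1, 4)
    _feats['keys'] = qkv[1]  # keys: (B, heads, N, head_dim)

vit.blocks[-1].attn.qkv.register_forward_hook(_grab_keys)

def structure_and_appearance(img):
    """Return (key self-similarity, [CLS] token) for a (B, 3, H, W) batch."""
    cls_token = vit(img)                                  # appearance: deepest [CLS] token
    keys = _feats['keys'].permute(0, 2, 1, 3).flatten(2)  # concat heads: (B, N, heads*head_dim)
    keys = F.normalize(keys, dim=-1)
    self_sim = keys @ keys.transpose(1, 2)                # structure: (B, N, N) cosine self-similarity
    return self_sim, cls_token

def splice_loss(output, structure_img, appearance_img, alpha=0.1):
    """Splicing objective: match target appearance, preserve source structure."""
    s_out, a_out = structure_and_appearance(output)
    with torch.no_grad():  # the two inputs are fixed targets
        s_src, _ = structure_and_appearance(structure_img)
        _, a_tgt = structure_and_appearance(appearance_img)
    return F.mse_loss(a_out, a_tgt) + alpha * F.mse_loss(s_out, s_src)
```

Under this sketch, “Splice” would minimize such a loss through the frozen ViT to train a generator for a single structure-appearance pair, while “SpliceNet” would amortize the same objective over a domain-specific dataset in one feed-forward model.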


