MetaCap: Meta-learning Priors from Multi-View Imagery for Sparse-view Human Performance Capture and Rendering arXiv.cs.CV Pub Date : 2024-03-27 Guoxing Sun, Rishabh Dabral, Pascal Fua, Christian Theobalt, Marc Habermann
Faithful human performance capture and free-view rendering from sparse RGB observations is a long-standing problem in Vision and Graphics. The main challenges are the lack of observations and the inherent ambiguities of the setting, e.g. occlusions and depth ambiguity. As a result, radiance fields, which have shown great promise in capturing high-frequency appearance and geometry details in dense setups
-
Benchmarking Object Detectors with COCO: A New Path Forward arXiv.cs.CV Pub Date : 2024-03-27 Shweta Singh, Aayan Yadav, Jitesh Jain, Humphrey Shi, Justin Johnson, Karan Desai
The Common Objects in Context (COCO) dataset has been instrumental in benchmarking object detectors over the past decade. Like every dataset, COCO contains subtle errors and imperfections stemming from its annotation procedure. With the advent of high-performing models, we ask whether these errors of COCO are hindering its utility in reliably benchmarking further progress. In search for an answer,
-
ObjectDrop: Bootstrapping Counterfactuals for Photorealistic Object Removal and Insertion arXiv.cs.CV Pub Date : 2024-03-27 Daniel Winter, Matan Cohen, Shlomi Fruchter, Yael Pritch, Alex Rav-Acha, Yedid Hoshen
Diffusion models have revolutionized image editing but often generate images that violate physical laws, particularly the effects of objects on the scene, e.g., occlusions, shadows, and reflections. By analyzing the limitations of self-supervised approaches, we propose a practical solution centered on a "counterfactual" dataset. Our method involves capturing a scene before and after removing a single
-
Garment3DGen: 3D Garment Stylization and Texture Generation arXiv.cs.CV Pub Date : 2024-03-27 Nikolaos Sarafianos, Tuur Stuyck, Xiaoyu Xiang, Yilei Li, Jovan Popovic, Rakesh Ranjan
We introduce Garment3DGen, a new method to synthesize 3D garment assets from a base mesh, given a single input image as guidance. Our proposed approach allows users to generate 3D textured clothes based on both real and synthetic images, such as those generated by text prompts. The generated assets can be directly draped and simulated on human bodies. First, we leverage the recent progress of image to
-
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models arXiv.cs.CV Pub Date : 2024-03-27 Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three
-
Gamba: Marry Gaussian Splatting with Mamba for single view 3D reconstruction arXiv.cs.CV Pub Date : 2024-03-27 Qiuhong Shen, Xuanyu Yi, Zike Wu, Pan Zhou, Hanwang Zhang, Shuicheng Yan, Xinchao Wang
We tackle the challenge of efficiently reconstructing a 3D asset from a single image with growing demands for automated 3D content creation pipelines. Previous methods primarily rely on Score Distillation Sampling (SDS) and Neural Radiance Fields (NeRF). Despite their significant success, these approaches encounter practical limitations due to lengthy optimization and considerable memory usage. In
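For context on the SDS objective such optimization-based pipelines minimize, the standard score-distillation gradient from DreamFusion (a published general formula, not anything specific to Gamba) is:

```latex
% Score Distillation Sampling (SDS) gradient: theta are the 3D parameters,
% x = g(theta) the rendered image, epsilon_phi the frozen diffusion model's
% noise prediction for prompt y at timestep t, and w(t) a weighting schedule.
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
  \qquad x = g(\theta).
```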
-
Object Pose Estimation via the Aggregation of Diffusion Features arXiv.cs.CV Pub Date : 2024-03-27 Tianfu Wang, Guosheng Hu, Hongguang Wang
Estimating the pose of objects from images is a crucial task in 3D scene understanding, and recent approaches have shown promising results on very large benchmarks. However, these methods experience a significant performance drop when dealing with unseen objects. We believe this results from the limited generalizability of image features. To address this problem, we conduct an in-depth analysis of
-
SplatFace: Gaussian Splat Face Reconstruction Leveraging an Optimizable Surface arXiv.cs.CV Pub Date : 2024-03-27 Jiahao Luo, Jing Liu, James Davis
We present SplatFace, a novel Gaussian splatting framework designed for 3D human face reconstruction without reliance on accurate pre-determined geometry. Our method is designed to simultaneously deliver both high-quality novel view rendering and accurate 3D mesh reconstructions. We incorporate a generic 3D Morphable Model (3DMM) to provide a surface geometric structure, making it possible to reconstruct
-
Towards Image Ambient Lighting Normalization arXiv.cs.CV Pub Date : 2024-03-27 Florin-Alexandru Vasluianu, Tim Seizinger, Zongwei Wu, Rakesh Ranjan, Radu Timofte
Lighting normalization is a crucial but underexplored restoration task with broad applications. However, existing works often simplify this task within the context of shadow removal, limiting the light sources to one and oversimplifying the scene, thus excluding complex self-shadows and restricting surface classes to smooth ones. Although promising, such simplifications hinder generalizability to more
-
SAT-NGP : Unleashing Neural Graphics Primitives for Fast Relightable Transient-Free 3D reconstruction from Satellite Imagery arXiv.cs.CV Pub Date : 2024-03-27 Camille Billouard, Dawa Derksen, Emmanuelle Sarrazin, Bruno Vallet
Current stereo-vision pipelines produce high-accuracy 3D reconstructions when using multiple pairs or triplets of satellite images. However, these pipelines are sensitive to the changes between images that can occur as a result of multi-date acquisitions. Such variations are mainly due to variable shadows, reflections and transient objects (cars, vegetation). To take such changes into account, Neural
-
Dense Vision Transformer Compression with Few Samples arXiv.cs.CV Pub Date : 2024-03-27 Hanxiao Zhang, Yifan Zhou, Guo-Hua Wang, Jianxin Wu
Few-shot model compression aims to compress a large model into a more compact one with only a tiny training set (even without labels). Block-level pruning has recently emerged as a leading technique for achieving high accuracy and low latency in few-shot CNN compression. However, few-shot compression for Vision Transformers (ViT) remains largely unexplored and presents a new challenge. In particular
-
Annolid: Annotate, Segment, and Track Anything You Need arXiv.cs.CV Pub Date : 2024-03-27 Chen Yang, Thomas A. Cleland
Annolid is a deep learning-based software package designed for the segmentation, labeling, and tracking of research targets within video files, focusing primarily on animal behavior analysis. Based on state-of-the-art instance segmentation methods, Annolid now harnesses the Cutie video object segmentation model to achieve resilient, markerless tracking of multiple animals from single annotated frames
-
Deep Learning for Robust and Explainable Models in Computer Vision arXiv.cs.CV Pub Date : 2024-03-27 Mohammadreza Amirian
Recent breakthroughs in machine and deep learning (ML and DL) research have provided excellent tools for leveraging enormous amounts of data and optimizing huge models with millions of parameters to obtain accurate networks for image processing. These developments open up tremendous opportunities for using artificial intelligence (AI) in the automation and human-assisted AI industry. However, as more
-
FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing arXiv.cs.CV Pub Date : 2024-03-27 Trong-Tung Nguyen, Duc-Anh Nguyen, Anh Tran, Cuong Pham
Our work addresses limitations seen in previous approaches for object-centric editing problems, such as unrealistic results due to shape discrepancies and limited control in object replacement or insertion. To this end, we introduce FlexEdit, a flexible and controllable editing framework for objects where we iteratively adjust latents at each denoising step using our FlexEdit block. Initially, we optimize
-
Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding arXiv.cs.CV Pub Date : 2024-03-27 Run Shao, Zhaoyang Zhang, Chao Tao, Yunsheng Zhang, Chengli Peng, Haifeng Li
The tokenizer, as one of the fundamental components of large models, has long been overlooked or even misunderstood in visual tasks. One key factor behind the great comprehension power of large language models is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as
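A minimal sketch of the ViT-style patch tokenization the abstract contrasts with language tokenizers; every token is a fixed grid cell regardless of semantic content (the function name and patch size are illustrative):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flat patch tokens on a fixed grid."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    tokens = (
        image.reshape(h // patch, patch, w // patch, patch, c)
             .swapaxes(1, 2)                     # group by grid cell
             .reshape(-1, patch * patch * c)     # one flat token per cell
    )
    return tokens  # (num_patches, patch*patch*C)

# e.g. a 224x224 RGB image -> 196 tokens of dimension 768
toks = patchify(np.zeros((224, 224, 3)), patch=16)
```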
-
HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions arXiv.cs.CV Pub Date : 2024-03-27 Hao Xu, Haipeng Li, Yinqiao Wang, Shuaicheng Liu, Chi-Wing Fu
Reconstructing 3D hand mesh robustly from a single image is very challenging, due to the lack of diversity in existing real-world datasets. While data synthesis helps relieve the issue, the syn-to-real gap still hinders its usage. In this work, we present HandBooster, a new approach to uplift the data diversity and boost the 3D hand-mesh reconstruction performance by training a conditional generative
-
Artifact Reduction in 3D and 4D Cone-beam Computed Tomography Images with Deep Learning -- A Review arXiv.cs.CV Pub Date : 2024-03-27 Mohammadreza Amirian, Daniel Barco, Ivo Herzig, Frank-Peter Schilling
Deep learning based approaches have been used to improve image quality in cone-beam computed tomography (CBCT), a medical imaging technique often used in applications such as image-guided radiation therapy, implant dentistry or orthopaedics. In particular, while deep learning methods have been applied to reduce various types of CBCT image artifacts arising from motion, metal objects, or low-dose acquisition
-
CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection arXiv.cs.CV Pub Date : 2024-03-27 Jiayi Zhu, Qing Guo, Felix Juefei-Xu, Yihao Huang, Yang Liu, Geguang Pu
Co-salient object detection (CoSOD) aims to identify the common and salient (usually foreground) regions across a given group of images. Despite significant progress, state-of-the-art CoSOD methods can be easily affected by adversarial perturbations, leading to substantial accuracy reduction. The adversarial perturbations can mislead CoSOD methods but do not change the high-level semantic
-
Attention Calibration for Disentangled Text-to-Image Personalization arXiv.cs.CV Pub Date : 2024-03-27 Yanbing Zhang, Mengping Yang, Qin Zhou, Zhe Wang
Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation, 3D and video composition. Further, personalized techniques enable appealing customized production of a novel concept given only several images as reference. However, an intriguing problem persists: Is it possible to capture multiple
-
OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning arXiv.cs.CV Pub Date : 2024-03-27 Noor Ahmed, Anna Kukleva, Bernt Schiele
Few-Shot Class-Incremental Learning (FSCIL) introduces a paradigm in which the problem space expands with limited data. FSCIL methods inherently face the challenge of catastrophic forgetting as data arrives incrementally, making models susceptible to overwriting previously acquired knowledge. Moreover, given the scarcity of labeled samples available at any given time, models may be prone to overfitting
-
A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint arXiv.cs.CV Pub Date : 2024-03-27 Xiaofeng Cong, Jie Gui, Jing Zhang, Junming Hou, Hao Shen
Existing research based on deep learning has extensively explored the problem of daytime image dehazing. However, few studies have considered the characteristics of nighttime hazy scenes. There are two distinctions between nighttime and daytime haze. First, there may be multiple active colored light sources with lower illumination intensity in nighttime scenes, which may cause haze, glow and noise
-
ParCo: Part-Coordinating Text-to-Motion Synthesis arXiv.cs.CV Pub Date : 2024-03-27 Qiran Zou, Shangyuan Yuan, Shian Du, Yu Wang, Chang Liu, Yi Xu, Jie Chen, Xiangyang Ji
We study a challenging task: text-to-motion synthesis, aiming to generate motions that align with textual descriptions and exhibit coordinated movements. Currently, the part-based methods introduce part partition into the motion synthesis process to achieve finer-grained generation. However, these methods encounter challenges such as the lack of coordination between different part motions and difficulties
-
VersaT2I: Improving Text-to-Image Models with Versatile Reward arXiv.cs.CV Pub Date : 2024-03-27 Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang
Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that can boost the performance with multiple rewards of any
-
I2CKD : Intra- and Inter-Class Knowledge Distillation for Semantic Segmentation arXiv.cs.CV Pub Date : 2024-03-27 Ayoub Karine, Thibault Napoléon, Maher Jridi
This paper proposes a new knowledge distillation method tailored for image semantic segmentation, termed Intra- and Inter-Class Knowledge Distillation (I2CKD). The focus of this method is on capturing and transferring knowledge between the intermediate layers of teacher (cumbersome model) and student (compact model). For knowledge extraction, we exploit class prototypes derived from feature maps. To
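One plausible reading of "class prototypes derived from feature maps" is masked average pooling over intermediate features; this hypothetical helper is an illustration under that assumption, not I2CKD's exact formulation:

```python
import torch

def class_prototypes(feats: torch.Tensor, labels: torch.Tensor,
                     num_classes: int) -> torch.Tensor:
    """Masked average pooling: one prototype vector per class.

    feats:  (B, C, H, W) feature maps from an intermediate layer.
    labels: (B, H, W) segmentation labels, downsampled to (H, W).
    """
    B, C, H, W = feats.shape
    protos = feats.new_zeros(num_classes, C)
    for k in range(num_classes):
        mask = (labels == k).unsqueeze(1).float()       # (B, 1, H, W)
        denom = mask.sum().clamp(min=1.0)               # avoid divide-by-zero
        protos[k] = (feats * mask).sum(dim=(0, 2, 3)) / denom
    return protos  # teacher/student prototypes can then be matched per class
```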
-
DiffusionFace: Towards a Comprehensive Dataset for Diffusion-Based Face Forgery Analysis arXiv.cs.CV Pub Date : 2024-03-27 Zhongxi Chen, Ke Sun, Ziyin Zhou, Xianming Lin, Xiaoshuai Sun, Liujuan Cao, Rongrong Ji
The rapid progress in deep learning has given rise to hyper-realistic facial forgery methods, leading to concerns related to misinformation and security risks. Existing face forgery datasets have limitations in generating high-quality facial images and addressing the challenges posed by evolving generative techniques. To combat this, we present DiffusionFace, the first diffusion-based face forgery
-
Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds arXiv.cs.CV Pub Date : 2024-03-27 Zhimin Yuan, Wankang Zeng, Yanfei Su, Weiquan Liu, Ming Cheng, Yulan Guo, Cheng Wang
3D synthetic-to-real unsupervised domain adaptive segmentation is crucial to annotating new domains. Self-training is a competitive approach for this task, but its performance is limited by different sensor sampling patterns (i.e., variations in point density) and incomplete training strategies. In this work, we propose a density-guided translator (DGT), which translates point density between domains
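As a toy stand-in for "translating point density between domains" (the abstract does not specify the mechanism, so this resampling scheme is purely illustrative):

```python
import numpy as np

def match_density(points: np.ndarray, target_size: int,
                  rng: np.random.Generator) -> np.ndarray:
    """Resample a point cloud to a target point count.

    Down-samples by random selection without replacement; up-samples by
    jittered duplication. A crude density translation, not the paper's DGT.
    """
    n = len(points)
    if n >= target_size:
        idx = rng.choice(n, size=target_size, replace=False)
        return points[idx]
    extra = rng.choice(n, size=target_size - n, replace=True)
    jitter = rng.normal(scale=1e-3, size=(target_size - n, points.shape[1]))
    return np.concatenate([points, points[extra] + jitter])
```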
-
DiffStyler: Diffusion-based Localized Image Style Transfer arXiv.cs.CV Pub Date : 2024-03-27 Shaoxu Li
Image style transfer aims to imbue digital imagery with the distinctive attributes of style targets, such as colors, brushstrokes, and shapes, while concurrently preserving the semantic integrity of the content. Despite the advancements in arbitrary style transfer methods, a prevalent challenge remains the delicate equilibrium between content semantics and style attributes. Recent developments in large-scale
-
Scaling Vision-and-Language Navigation With Offline RL arXiv.cs.CV Pub Date : 2024-03-27 Valay Bundele, Mahesh Bhupati, Biplab Banerjee, Aditya Grover
The study of vision-and-language navigation (VLN) has typically relied on expert trajectories, which may not always be available in real-world situations due to the significant effort required to collect them. On the other hand, existing approaches to training VLN agents that go beyond available expert data involve data augmentations or online exploration which can be tedious and risky. In contrast
-
$\mathrm{F^2Depth}$: Self-supervised Indoor Monocular Depth Estimation via Optical Flow Consistency and Feature Map Synthesis arXiv.cs.CV Pub Date : 2024-03-27 Xiaotong Guo, Huijie Zhao, Shuwei Shao, Xudong Li, Baochang Zhang
Self-supervised monocular depth estimation methods have received increasing attention because they do not require large labelled datasets. Such self-supervised methods require high-quality salient features and consequently suffer from severe performance drops in indoor scenes, where the low-textured regions that dominate the scene are almost indiscriminative. To address the issue, we propose
-
Backpropagation-free Network for 3D Test-time Adaptation arXiv.cs.CV Pub Date : 2024-03-27 Yanshuo Wang, Ali Cheraghian, Zeeshan Hayder, Jie Hong, Sameera Ramasinghe, Shafin Rahman, David Ahmedt-Aristizabal, Xuesong Li, Lars Petersson, Mehrtash Harandi
Real-world systems often encounter new data over time, which leads to target domain shifts. Existing Test-Time Adaptation (TTA) methods tend to apply computationally heavy and memory-intensive backpropagation-based approaches to handle this. Here, we propose a novel backpropagation-free approach to TTA for the specific case of 3D data. Our model uses a two-stream architecture
-
ECNet: Effective Controllable Text-to-Image Diffusion Models arXiv.cs.CV Pub Date : 2024-03-27 Sicheng Li, Keqiang Sun, Zhixin Lai, Xiaoshi Wu, Feng Qiu, Haoran Xie, Kazunori Miyata, Hongsheng Li
The conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised mainly for two reasons, ambiguous condition input and inadequate condition guidance over single denoising loss. To address the challenges, we introduce two innovative solutions. Firstly, we propose a Spatial Guidance Injector (SGI) which enhances
-
A Channel-ensemble Approach: Unbiased and Low-variance Pseudo-labels is Critical for Semi-supervised Classification arXiv.cs.CV Pub Date : 2024-03-27 Jiaqi Wu, Junbiao Pang, Baochang Zhang, Qingming Huang
Semi-supervised learning (SSL) is a practical challenge in computer vision. Pseudo-label (PL) methods, e.g., FixMatch and FreeMatch, achieve state-of-the-art (SOTA) performance in SSL. These approaches employ a threshold-to-pseudo-label (T2L) process to generate PLs by truncating the confidence scores of unlabeled data predicted by the self-training method. However, self-trained models typically
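A minimal sketch of the T2L step the abstract describes, in the FixMatch style (the helper name and threshold value are illustrative):

```python
import torch
import torch.nn.functional as F

def threshold_to_pseudo_label(logits: torch.Tensor, tau: float = 0.95):
    """Keep a prediction as a pseudo-label only if its confidence clears tau.

    Returns (pseudo_labels, mask); masked-out samples contribute no loss.
    """
    probs = F.softmax(logits.detach(), dim=-1)
    conf, pseudo = probs.max(dim=-1)
    mask = conf.ge(tau)              # high-confidence samples only
    return pseudo, mask

# unsupervised loss term, as in FixMatch:
# loss_u = (mask * F.cross_entropy(logits_strong, pseudo,
#                                  reduction='none')).mean()
```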
-
BAM: Box Abstraction Monitors for Real-time OoD Detection in Object Detection arXiv.cs.CV Pub Date : 2024-03-27 Changshun Wu, Weicheng He, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem
Out-of-distribution (OoD) detection techniques for deep neural networks (DNNs) have become crucial because they filter out abnormal inputs, especially when DNNs are used in safety-critical applications and interact with an open and dynamic environment. Nevertheless, integrating OoD detection into state-of-the-art (SOTA) object detection DNNs poses significant challenges, partly due to the complexity
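A minimal sketch of the box-abstraction idea the title refers to: record per-class axis-aligned bounds over training features and flag test features that fall outside. BAM's actual construction (clustering, box enlargement, detector integration) may differ:

```python
import numpy as np

class BoxMonitor:
    """Per-class axis-aligned box over feature vectors seen in training."""

    def __init__(self):
        self.lo, self.hi = {}, {}

    def fit(self, feats: np.ndarray, labels: np.ndarray):
        # feats: (N, D) features; labels: (N,) class ids
        for k in np.unique(labels):
            fk = feats[labels == k]
            self.lo[k], self.hi[k] = fk.min(axis=0), fk.max(axis=0)

    def is_ood(self, feat: np.ndarray, pred: int) -> bool:
        # A feature outside its predicted class's box is flagged as OoD.
        if pred not in self.lo:
            return True
        return bool(np.any(feat < self.lo[pred]) or
                    np.any(feat > self.hi[pred]))
```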
-
ViTAR: Vision Transformer with Any Resolution arXiv.cs.CV Pub Date : 2024-03-27 Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
This paper tackles a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. Our work introduces two key innovations to address this issue. Firstly, we propose a novel module for dynamic resolution adjustment, designed
-
Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation arXiv.cs.CV Pub Date : 2024-03-27 Ba Hung Ngo, Nhat-Tuong Do-Tran, Tuan-Ngoc Nguyen, Hae-Gon Jeon, Tae Jong Choi
Most domain adaptation (DA) methods are based on either convolutional neural networks (CNNs) or vision transformers (ViTs), using them as encoders to align the distribution differences between domains without considering their unique characteristics. For instance, ViT excels in accuracy due to its superior ability to capture global representations, while CNN has an advantage in capturing local representations
-
MonoHair: High-Fidelity Hair Modeling from a Monocular Video arXiv.cs.CV Pub Date : 2024-03-27 Keyu Wu, Lingchen Yang, Zhiyi Kuang, Yao Feng, Xutao Han, Yuefan Shen, Hongbo Fu, Kun Zhou, Youyi Zheng
Undoubtedly, high-fidelity 3D hair is crucial for achieving realism, artistic expression, and immersion in computer graphics. While existing 3D hair modeling methods have achieved impressive performance, the challenge of achieving high-quality hair reconstruction persists: they either require strict capture conditions, making practical applications difficult, or heavily rely on learned prior data,
-
Learning Inclusion Matching for Animation Paint Bucket Colorization arXiv.cs.CV Pub Date : 2024-03-27 Yuekun Dai, Shangchen Zhou, Qinyue Li, Chongyi Li, Chen Change Loy
Colorizing line art is a pivotal task in the production of hand-drawn cel animation. This typically involves digital painters using a paint bucket tool to manually color each segment enclosed by lines, based on RGB values predetermined by a color designer. This frame-by-frame process is both arduous and time-intensive. Current automated methods mainly focus on segment matching. This technique migrates
-
DODA: Diffusion for Object-detection Domain Adaptation in Agriculture arXiv.cs.CV Pub Date : 2024-03-27 Shuai Xiang, Pieter M. Blok, James Burridge, Haozhou Wang, Wei Guo
The diverse and high-quality content generated by recent generative models demonstrates the great potential of using synthetic data to train downstream models. However, in vision, and especially in object detection, this potential remains underexplored: synthetic images are merely used to balance the long tails of existing datasets, and the accuracy of the generated labels is low; the full potential
-
PIPNet3D: Interpretable Detection of Alzheimer in MRI Scans arXiv.cs.CV Pub Date : 2024-03-27 Lisa Anita De Santi, Jörg Schlötterer, Michael Scheschenja, Joel Wessendorf, Meike Nauta, Vincenzo Positano, Christin Seifert
Information from neuroimaging examinations (CT, MRI) is increasingly used to support diagnoses of dementia, e.g., Alzheimer's disease. While current clinical practice is mainly based on visual inspection and feature engineering, Deep Learning approaches can be used to automate the analysis and to discover new image-biomarkers. Part-prototype neural networks (PP-NN) are an alternative to standard blackbox
-
Uncertainty-Aware SAR ATR: Defending Against Adversarial Attacks via Bayesian Neural Networks arXiv.cs.CV Pub Date : 2024-03-27 Tian Ye, Rajgopal Kannan, Viktor Prasanna, Carl Busart
Adversarial attacks have demonstrated the vulnerability of Machine Learning (ML) image classifiers in Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) systems. An adversarial attack can deceive the classifier into making incorrect predictions by perturbing the input SAR images, for example, with a few scatterers attached to the on-ground objects. Therefore, it is critical to develop
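A generic sketch of how a Bayesian approximation can surface uncertainty on perturbed inputs, here via Monte-Carlo dropout rather than the paper's specific BNN:

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, T: int = 20):
    """Average T stochastic forward passes with dropout left active.

    High predictive entropy can flag adversarially perturbed SAR inputs.
    (A full implementation would re-enable only the dropout layers rather
    than putting the whole model in train mode.)
    """
    model.train()                      # keep dropout stochastic at test time
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(T)])
    mean = probs.mean(dim=0)
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(dim=-1)
    return mean, entropy               # high entropy => low confidence
```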
-
Multi-scale Unified Network for Image Classification arXiv.cs.CV Pub Date : 2024-03-27 Wenzhuo Liu, Fei Zhu, Cheng-Lin Liu
Convolutional Neural Networks (CNNs) have advanced significantly in visual representation learning and recognition. However, they face notable challenges in performance and computational efficiency when dealing with real-world, multi-scale image inputs. Conventional methods rescale all input images into a fixed size, wherein a larger fixed size favors performance, but rescaling small-size images to
-
Efficient Test-Time Adaptation of Vision-Language Models arXiv.cs.CV Pub Date : 2024-03-27 Adilbek Karmanov, Dayan Guan, Shijian Lu, Abdulmotaleb El Saddik, Eric Xing
Test-time adaptation with pre-trained vision-language models has attracted increasing attention for tackling distribution shifts during the test time. Though prior studies have achieved very promising performance, they involve intensive computation which is severely unaligned with test-time adaptation. We design TDA, a training-free dynamic adapter that enables effective and efficient test-time adaptation
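A simplified sketch in the spirit of a training-free dynamic adapter: cache test features with pseudo-labels and classify by a similarity-weighted vote, with no backpropagation. TDA itself also maintains a negative cache and per-class capacity rules, so this is an assumption-laden illustration:

```python
import torch
import torch.nn.functional as F

class FeatureCache:
    """Training-free cache of (image feature, pseudo-label) pairs."""

    def __init__(self, num_classes: int, capacity: int = 8):
        self.items = []                     # list of (feat, label)
        self.num_classes, self.capacity = num_classes, capacity

    def add(self, feat: torch.Tensor, label: int):
        self.items.append((F.normalize(feat, dim=-1), label))
        self.items = self.items[-self.capacity * self.num_classes:]

    def logits(self, feat: torch.Tensor) -> torch.Tensor:
        # Cosine-similarity vote over cached features; added to the
        # zero-shot text-similarity logits in a full pipeline.
        out = feat.new_zeros(self.num_classes)
        q = F.normalize(feat, dim=-1)
        for f, y in self.items:
            out[y] += (q @ f).clamp(min=0)
        return out
```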
-
Towards Non-Exemplar Semi-Supervised Class-Incremental Learning arXiv.cs.CV Pub Date : 2024-03-27 Wenzhuo Liu, Fei Zhu, Cheng-Lin Liu
Deep neural networks perform remarkably well in closed-world scenarios. However, novel classes emerge continually in real applications, making it necessary to learn incrementally. Class-incremental learning (CIL) aims to gradually recognize new classes while maintaining the discriminability of old ones. Existing CIL methods have two limitations: a heavy reliance on preserving old data for forgetting
-
SGDM: Static-Guided Dynamic Module Make Stronger Visual Models arXiv.cs.CV Pub Date : 2024-03-27 Wenjie Xing, Zhenchao Cui, Jing Qi
The spatial attention mechanism has been widely used to improve object detection performance. However, its operation is currently limited to static convolutions lacking content-adaptive features. This paper innovatively approaches the problem from the perspective of dynamic convolution. We propose Razor Dynamic Convolution (RDConv) to address the two flaws in dynamic weight convolution that make it hard to implement
-
AIR-HLoc: Adaptive Image Retrieval for Efficient Visual Localisation arXiv.cs.CV Pub Date : 2024-03-27 Changkun Liu, Huajian Huang, Zhengyang Ma, Tristan Braud
State-of-the-art (SOTA) hierarchical localisation pipelines (HLoc) rely on image retrieval (IR) techniques to establish 2D-3D correspondences by selecting the $k$ most similar images from a reference image database for a given query image. Although higher values of $k$ enhance localisation robustness, the computational cost for feature matching increases linearly with $k$. In this paper, we observe
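A minimal sketch of the IR step the abstract describes: shortlist the k reference images most similar to the query by global-descriptor cosine similarity, after which 2D-3D matching cost grows linearly with k (the function name is illustrative):

```python
import numpy as np

def retrieve_top_k(query: np.ndarray, db: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k database descriptors closest to the query.

    query: (D,) global descriptor; db: (N, D) reference descriptors.
    """
    q = query / np.linalg.norm(query)
    d = db / np.linalg.norm(db, axis=1, keepdims=True)
    sims = d @ q                        # cosine similarity to every reference
    return np.argsort(-sims)[:k]        # top-k candidate images for matching
```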
-
DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment arXiv.cs.CV Pub Date : 2024-03-27 Jiuming Liu, Dong Zhuo, Zhiheng Feng, Siting Zhu, Chensheng Peng, Zhe Liu, Hesheng Wang
The information in visual and LiDAR data is highly complementary, deriving from the fine-grained texture of images and the massive geometric information in point clouds. However, it remains challenging to explore effective visual-LiDAR fusion, mainly due to the intrinsic data-structure inconsistency between the two modalities: images are regular and dense, but LiDAR points are unordered and sparse. To address
-
Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding arXiv.cs.CV Pub Date : 2024-03-27 Zhiheng Cheng, Qingyue Wei, Hongru Zhu, Yan Wang, Liangqiong Qu, Wei Shao, Yuyin Zhou
The Segment Anything Model (SAM) has garnered significant attention for its versatile segmentation abilities and intuitive prompt-based interface. However, its application in medical imaging presents challenges, requiring either substantial training costs and extensive medical datasets for full model fine-tuning or high-quality prompts for optimal performance. This paper introduces H-SAM: a prompt-free
-
Toward Interactive Regional Understanding in Vision-Large Language Models arXiv.cs.CV Pub Date : 2024-03-27 Jungbeom Lee, Sanghyuk Chun, Sangdoo Yun
Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. Nevertheless, these models heavily rely on image-text pairs that capture only coarse and global information of an image, leading to a limitation in their regional understanding ability. In this work, we introduce \textbf{RegionVLM}, equipped with explicit regional modeling capabilities, allowing them to understand
-
Enhancing Generative Class Incremental Learning Performance with Model Forgetting Approach arXiv.cs.CV Pub Date : 2024-03-27 Taro Togo, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama
This study presents a novel approach to Generative Class Incremental Learning (GCIL) by introducing a forgetting mechanism, aimed at dynamically managing class information for better adaptation to streaming data. GCIL is one of the hot topics in computer vision, and the continual learning of generative models is considered one of its crucial tasks. The ability
-
TAFormer: A Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes arXiv.cs.CV Pub Date : 2024-03-27 Liangyu Xu, Wanxuan Lu, Hongfeng Yu, Yongqiang Mao, Hanbo Bi, Chenglong Liu, Xian Sun, Kun Fu
As drone technology advances, using unmanned aerial vehicles for aerial surveys has become the dominant trend in modern low-altitude remote sensing. The surge in aerial video data necessitates accurate prediction of future scenarios and motion states of the target of interest, particularly in applications like traffic management and disaster response. Existing video prediction methods focus solely
-
Few-shot Online Anomaly Detection and Segmentation arXiv.cs.CV Pub Date : 2024-03-27 Shenxing Wei, Xing Wei, Zhiheng Ma, Songlin Dong, Shaochen Zhang, Yihong Gong
Detecting anomaly patterns from images is a crucial artificial intelligence technique in industrial applications. Recent research in this domain has emphasized the necessity of a large volume of training data, overlooking the practical scenario where, post-deployment of the model, unlabeled data containing both normal and abnormal samples can be utilized to enhance the model's performance. Consequently
-
Middle Fusion and Multi-Stage, Multi-Form Prompts for Robust RGB-T Tracking arXiv.cs.CV Pub Date : 2024-03-27 Qiming Wang, Yongqiang Bai, Hongxing Song
RGB-T tracking, a vital downstream task of object tracking, has made remarkable progress in recent years. Yet, it remains hindered by two major challenges: 1) the trade-off between performance and efficiency; 2) the scarcity of training data. To address the latter challenge, some recent methods employ prompts to fine-tune pre-trained RGB tracking models and leverage upstream knowledge in a parameter-efficient
-
LayoutFlow: Flow Matching for Layout Generation arXiv.cs.CV Pub Date : 2024-03-27 Julian Jorge Andrade Guerreiro, Naoto Inoue, Kento Masui, Mayu Otani, Hideki Nakayama
Finding a suitable layout represents a crucial task for diverse applications in graphic design. Motivated by simpler and smoother sampling trajectories, we explore the use of Flow Matching as an alternative to current diffusion-based layout generation models. Specifically, we propose LayoutFlow, an efficient flow-based model capable of generating high-quality layouts. Instead of progressively denoising
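For reference, the conditional flow matching objective with the linear (rectified-flow) path commonly used in such models; this is the general formula, not LayoutFlow's specific parameterization:

```latex
% Linear interpolation path: x_t = (1 - t) x_0 + t x_1,
% with x_0 ~ noise and x_1 ~ data. The network v_theta regresses the
% constant target velocity x_1 - x_0; sampling integrates
% dx/dt = v_theta(x_t, t) from t = 0 to 1.
\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{t,\,x_0,\,x_1}
    \bigl\| v_\theta\bigl((1-t)\,x_0 + t\,x_1,\; t\bigr)
            - (x_1 - x_0) \bigr\|^2 .
```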
-
Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting arXiv.cs.CV Pub Date : 2024-03-27 Haiwei Chen, Yajie Zhao
We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes. Our method learns latent priors, discretized as tokens, by only performing computations at the visible locations of the image. This is realized by a restrictive partial encoder that predicts the token label for each visible block, a bidirectional transformer that infers the missing
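As a rough illustration of inferring hidden tokens from visible ones with a bidirectional transformer, here is a minimal MaskGIT-style sampler over a 1-D token grid; `transformer`, `mask_id`, and the confidence-based commit schedule are illustrative assumptions, not the paper's exact sampler:

```python
import torch

@torch.no_grad()
def infill_tokens(transformer, tokens: torch.Tensor,
                  visible: torch.Tensor, steps: int = 8,
                  mask_id: int = 0) -> torch.Tensor:
    """Iteratively fill hidden positions; visible tokens stay fixed.

    transformer: callable mapping (N,) token ids to (N, V) logits.
    visible: (N,) boolean mask of known positions.
    """
    tokens = torch.where(visible, tokens, torch.full_like(tokens, mask_id))
    hidden = ~visible
    for step in range(steps):
        remaining = int(hidden.sum())
        if remaining == 0:
            break
        logits = transformer(tokens)
        conf, pred = logits.softmax(-1).max(-1)      # per-position confidence
        conf = conf.masked_fill(~hidden, -1.0)       # only fill hidden slots
        n_commit = max(1, remaining // (steps - step))
        idx = conf.topk(n_commit).indices            # most confident first
        tokens[idx] = pred[idx]
        hidden[idx] = False
    return tokens
```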
-
Multi-Layer Dense Attention Decoder for Polyp Segmentation arXiv.cs.CV Pub Date : 2024-03-27 Krushi Patel, Fengjun Li, Guanghui Wang
Detecting and segmenting polyps is crucial for expediting the diagnosis of colon cancer. This is a challenging task due to the large variations of polyps in color, texture, and lighting conditions, along with subtle differences between the polyp and its surrounding area. Recently, vision Transformers have shown robust abilities in modeling global context for polyp segmentation. However, they face two
-
The Effects of Short Video-Sharing Services on Video Copy Detection arXiv.cs.CV Pub Date : 2024-03-26 Rintaro Yanagi, Yamato Okamoto, Shuhei Yokoo, Shin'ichi Satoh
The short video-sharing services that allow users to post 10-30 second videos (e.g., YouTube Shorts and TikTok) have attracted a lot of attention in recent years. However, conventional video copy detection (VCD) methods mainly focus on general video-sharing services (e.g., YouTube and Bilibili), and the effects of short video-sharing services on video copy detection are still unclear. Considering that
-
EgoLifter: Open-world 3D Segmentation for Egocentric Perception arXiv.cs.CV Pub Date : 2024-03-26 Qiao Gu, Zhaoyang Lv, Duncan Frost, Simon Green, Julian Straub, Chris Sweeney
In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects
-
TDIP: Tunable Deep Image Processing, a Real Time Melt Pool Monitoring Solution arXiv.cs.CV Pub Date : 2024-03-26 Javid Akhavan, Youmna Mahmoud, Ke Xu, Jiaqi Lyu, Souran Manoochehri
In the era of Industry 4.0, Additive Manufacturing (AM), particularly metal AM, has emerged as a significant contributor due to its innovative and cost-effective approach to fabricate highly intricate geometries. Despite its potential, this industry still lacks real-time capable process monitoring algorithms. Recent advancements in this field suggest that Melt Pool (MP) signatures during the fabrication
-
QuakeSet: A Dataset and Low-Resource Models to Monitor Earthquakes through Sentinel-1 arXiv.cs.CV Pub Date : 2024-03-26 Daniele Rege Cambrin, Paolo Garza
Earthquake monitoring is necessary to promptly identify the affected areas, the severity of the events, and, finally, to estimate damages and plan the actions needed for the restoration process. The use of seismic stations to monitor the strength and origin of earthquakes is limited when dealing with remote areas (we cannot have global capillary coverage). Identification and analysis of all affected
-
Segment Any Medical Model Extended arXiv.cs.CV Pub Date : 2024-03-26 Yihao Liu, Jiaming Zhang, Andres Diaz-Pinto, Haowei Li, Alejandro Martin-Gomez, Amir Kheradmand, Mehran Armand
The Segment Anything Model (SAM) has drawn significant attention from researchers who work on medical image segmentation because of its generalizability. However, researchers have found that SAM may have limited performance on medical images compared to state-of-the-art non-foundation models. Regardless, the community sees potential in extending, fine-tuning, modifying, and evaluating SAM for analysis