-
Shotit: compute-efficient image-to-video search engine for the cloud arXiv.cs.MM Pub Date : 2024-04-18 Leslie Wong
With the rapid growth of information technology, users are exposed to a massive amount of online data, including images, music, and video. This has created a strong need for effective corresponding search services, such as image, music, and video search. Most of these services are keyword-based, i.e., they use keywords to find related images, music, and videos. Additionally, there are image-to-image
-
A Perspective on Deep Vision Performance with Standard Image and Video Codecs arXiv.cs.MM Pub Date : 2024-04-18 Christoph Reich, Oliver Hahn, Daniel Cremers, Stefan Roth, Biplob Debnath
Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference with deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required
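To make the studied setting concrete, here is a minimal sketch of the lossy coding round-trip an edge device would apply before cloud-side inference, using Pillow's JPEG encoder; the quality settings and the placeholder input are illustrative assumptions, not the paper's protocol.

    # JPEG round-trip an image before (hypothetical) cloud-side inference,
    # mimicking the lossy coding applied to satisfy network constraints.
    import io
    from PIL import Image

    def jpeg_round_trip(image, quality=50):
        buffer = io.BytesIO()
        image.convert("RGB").save(buffer, format="JPEG", quality=quality)
        buffer.seek(0)
        return Image.open(buffer)

    if __name__ == "__main__":
        img = Image.new("RGB", (224, 224), color=(120, 80, 200))  # placeholder input
        for q in (90, 50, 10):
            degraded = jpeg_round_trip(img, quality=q)
            # A deep vision model would be evaluated on `degraded` here to measure
            # how coding artifacts at each quality level affect its predictions.
            print(q, degraded.size)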
-
Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion arXiv.cs.MM Pub Date : 2024-04-17 Xinghan Wang, Zixi Kang, Yadong Mu
Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments
-
Prompt-Guided Generation of Structured Chest X-Ray Report Using a Pre-trained LLM arXiv.cs.MM Pub Date : 2024-04-17 Hongzhao Li, Hongyu Wang, Xia Sun, Hua He, Jun Feng
Medical report generation automates radiology descriptions from images, easing the burden on physicians and minimizing errors. However, current methods lack structured outputs and physician interactivity for clear, clinically relevant reports. Our method introduces a prompt-guided approach to generate structured chest X-ray reports using a pre-trained large language model (LLM). First, we identify
-
DRepMRec: A Dual Representation Learning Framework for Multimodal Recommendation arXiv.cs.MM Pub Date : 2024-04-17 Kangning Zhang, Yingjie Qin, Ruilong Su, Yifan Liu, Jiarui Jin, Weinan Zhang, Yong Yu
Multimodal Recommendation focuses mainly on how to effectively integrate behavior and multimodal information in the recommendation task. Previous works suffer from two major issues. Firstly, the training process tightly couples the behavior module and the multimodal module by jointly optimizing them with shared model parameters, which leads to suboptimal performance since behavior signals and modality
-
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning arXiv.cs.MM Pub Date : 2024-04-16 Zhengyang Liang, Meiyu Liang, Wei Huang, Yawen Li, Zhe Xue
In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose a novel
-
Retrieval Augmented Verification: Unveiling Disinformation with Structured Representations for Zero-Shot Real-Time Evidence-guided Fact-Checking of Multi-modal Social media posts arXiv.cs.MM Pub Date : 2024-04-16 Arka Ujjal Dey, Artemis Llabrés, Ernest Valveny, Dimosthenis Karatzas
Social media posts, in which real images are unscrupulously reused alongside provocative text to promote a particular idea, have been one of the major sources of disinformation. By design, these claims lack editorial oversight yet are accessible to a vast population that may otherwise not have access to multiple information sources. This implies the need to fact-check these posts and clearly explain
-
AllTheDocks road safety dataset: A cyclist's perspective and experience arXiv.cs.MM Pub Date : 2024-04-16 Chia-Yen Chiang, Ruikang Zhong, Jennifer Ding, Joseph Wood, Stephen Bee, Mona Jaber
Active travel is an essential component of intelligent transportation systems. Cycling, as a form of active travel, shares road space with motorised traffic, which often affects cyclists' safety and comfort and therefore people's propensity to take up cycling instead of driving. This paper presents a unique dataset, collected by cyclists across London, that includes video footage, accelerometer
-
Referring Flexible Image Restoration arXiv.cs.MM Pub Date : 2024-04-16 Runwei Guan, Rongsheng Hu, Zhuhao Zhou, Tianlang Xue, Ka Lok Man, Jeremy Smith, Eng Gee Lim, Weiping Ding, Yutao Yue
In reality, images often exhibit multiple degradations, such as rain and fog at night (triple degradations). However, in many cases, individuals may not want to remove all degradations, for instance, a blurry lens revealing a beautiful snowy landscape (double degradations). In such scenarios, people may only desire to deblur. These situations and requirements shed light on a new challenge in image
-
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search arXiv.cs.MM Pub Date : 2024-04-16 Jintao Sun, Zhedong Zheng, Gangyi Ding
In text-based person search endeavors, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation. Although the amount of synthesized data can in theory be infinite, the scientific question persists of how much generated data optimally fuels subsequent model training. We observe that only a subset of the data in these
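As a toy illustration of the filtering idea, the sketch below keeps only the top-scoring fraction of synthetic samples for training; the length-based scorer is a stand-in assumption, not the paper's actual selection criterion.

    # Keep only the highest-scoring fraction of generated samples (toy sketch).
    from typing import Callable, List

    def filter_top_fraction(samples: List[str],
                            score_fn: Callable[[str], float],
                            keep_fraction: float = 0.3) -> List[str]:
        ranked = sorted(samples, key=score_fn, reverse=True)  # rank by score
        k = max(1, int(len(ranked) * keep_fraction))          # size of the kept subset
        return ranked[:k]

    # Toy proxy: prefer longer captions as a stand-in for informativeness.
    captions = ["a person", "a person in a red coat crossing the street", "someone"]
    print(filter_top_fraction(captions, score_fn=len, keep_fraction=0.5))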
-
ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis arXiv.cs.MM Pub Date : 2024-04-15 Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee
Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate
-
Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics arXiv.cs.MM Pub Date : 2024-04-14 Haosong Peng, Wei Feng, Hao Li, Yufeng Zhan, Qihua Zhou, Yuanqing Xia
The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architectures (e.g., CNNs and RNNs), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have
-
Do LLMs Understand Visual Anomalies? Uncovering LLM Capabilities in Zero-shot Anomaly Detection arXiv.cs.MM Pub Date : 2024-04-15 Jiaqi Zhu, Shaofeng Cai, Fang Deng, Junran Wu
Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly
-
State Space Model for New-Generation Network Alternative to Transformers: A Survey arXiv.cs.MM Pub Date : 2024-04-15 Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, Ziwen Wang, Bo Jiang, Chenglong Li, Yaowei Wang, Yonghong Tian, Jin Tang
In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred many researchers. To further reduce the complexity of attention models, numerous efforts have been made to design more efficient methods. Among them, the State
-
Seeing Text in the Dark: Algorithm and Benchmark arXiv.cs.MM Pub Date : 2024-04-13 Chengpei Xu, Hao Fu, Long Ma, Wenjing Jia, Chengqi Zhang, Feng Xia, Xiaoyu Ai, Binghao Li, Wenjie Zhang
Localizing text in low-light environments is challenging due to visual degradations. Although a straightforward solution involves a two-stage pipeline with low-light image enhancement (LLE) as the initial step followed by a detector, LLE is primarily designed for human vision rather than machine vision and can accumulate errors. In this work, we propose an efficient and effective single-stage approach for localizing
-
Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation arXiv.cs.MM Pub Date : 2024-04-12 Yichen Yan, Xingjian He, Sihan Chen, Jing Liu
Referring image segmentation aims to segment an object referred to by natural language expression from an image. The primary challenge lies in the efficient propagation of fine-grained semantic information from textual features to visual features. Many recent works utilize a Transformer to address this challenge. However, conventional transformer decoders can distort linguistic information with deeper
-
Multimodal Emotion Recognition by Fusing Video Semantic in MOOC Learning Scenarios arXiv.cs.MM Pub Date : 2024-04-11 Yuan Zhang, Xiaomei Tao, Hanxu Ai, Tao Chen, Yanling Gan
In the Massive Open Online Courses (MOOC) learning scenario, learners mainly acquire knowledge by watching instructional videos, and the semantic information in these videos directly affects learners' emotional states. However, few studies have paid attention to the potential influence of the semantic
-
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models arXiv.cs.MM Pub Date : 2024-04-10 Zewei Zhang, Huan Liu, Jun Chen, Xiangyu Xu
In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose
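The sketch below illustrates the alternating schedule in the spirit of the description above: several drag updates interleaved with a denoising pass each round, rather than denoising only at the end. The two step functions are toy numeric stand-ins, not the paper's diffusion operators.

    # Alternate drag and denoise operations instead of dragging all the way first.
    import numpy as np

    def drag_step(latent, target, lr=0.1):
        return latent + lr * (target - latent)               # toy 'drag' toward the target

    def denoise_step(latent, strength=0.5):
        return latent - strength * (latent - latent.mean())  # toy perturbation cleanup

    def alternating_drag_denoise(latent, target, n_rounds=5, drags_per_round=3):
        for _ in range(n_rounds):
            for _ in range(drags_per_round):
                latent = drag_step(latent, target)
            latent = denoise_step(latent)                    # correct accumulated perturbations
        return latent

    print(alternating_drag_denoise(np.zeros(4), np.ones(4)))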
-
Efficient and Generic Point Model for Lossless Point Cloud Attribute Compression arXiv.cs.MM Pub Date : 2024-04-10 Kang You, Pan Gao, Zhan Ma
The past several years have witnessed the emergence of learned point cloud compression (PCC) techniques. However, current learning-based lossless point cloud attribute compression (PCAC) methods either suffer from high computational complexity or deteriorated compression performance. Moreover, the significant variations in point cloud scale and sparsity encountered in real-world applications make developing
-
Demonstration of MaskSearch: Efficiently Querying Image Masks for Machine Learning Workflows arXiv.cs.MM Pub Date : 2024-04-09 Lindsey Linxi Wei, Chung Yik Edward Yeung, Hongjian Yu, Jingchuan Zhou, Dong He, Magdalena Balazinska
We demonstrate MaskSearch, a system designed to accelerate queries over databases of image masks generated by machine learning models. MaskSearch formalizes and accelerates a new category of queries for retrieving images and their corresponding masks based on mask properties, which support various applications, from identifying spurious correlations learned by models to exploring discrepancies between
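For intuition, the plain-NumPy sketch below expresses the kind of mask-property query MaskSearch is built to accelerate: retrieving masks whose positive-pixel count inside a region of interest exceeds a threshold. It illustrates the query semantics only and is not MaskSearch's actual API.

    # Naive scan over a mask database; MaskSearch's contribution is accelerating
    # exactly this class of query with indexes rather than a full scan.
    import numpy as np

    def region_pixel_count(mask, roi):
        y0, y1, x0, x1 = roi                  # roi = (y0, y1, x0, x1)
        return int(mask[y0:y1, x0:x1].sum())  # positive pixels inside the region

    masks = {f"img_{i}": (np.random.rand(64, 64) > 0.5).astype(np.uint8)
             for i in range(100)}
    roi, threshold = (0, 32, 0, 32), 600
    hits = [name for name, m in masks.items() if region_pixel_count(m, roi) > threshold]
    print(len(hits), "masks match")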
-
Dynamic Resolution Guidance for Facial Expression Recognition arXiv.cs.MM Pub Date : 2024-04-09 Jie Ou, Xu Li, Tianxiang Jiang, Yuanlun Xie
Facial expression recognition (FER) is vital for human-computer interaction and emotion analysis, yet recognizing expressions in low-resolution images remains challenging. This paper introduces a practical method called Dynamic Resolution Guidance for Facial Expression Recognition (DRGFER) to effectively recognize facial expressions in images with varying resolutions without compromising FER model
-
ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos arXiv.cs.MM Pub Date : 2024-04-09 Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C. -W. Phan
Human action or activity recognition in videos is a fundamental task in computer vision with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using Cross-Architecture
-
Zero-Shot Relational Learning for Multimodal Knowledge Graphs arXiv.cs.MM Pub Date : 2024-04-09 Rui Cai, Shichao Pei, Xiangliang Zhang
Relational learning is an essential task in the domain of knowledge representation, particularly in knowledge graph completion (KGC). While relational learning in traditional single-modal settings has been extensively studied, exploring it within a multimodal KGC context presents distinct challenges and opportunities. One of the major challenges is inference on newly discovered relations without any
-
Band-Attention Modulated RetNet for Face Forgery Detection arXiv.cs.MM Pub Date : 2024-04-09 Zhida Zhang, Jie Cao, Wenkui Yang, Qihang Fan, Kai Zhou, Ran He
Transformer networks are extensively utilized in face forgery detection due to their scalability across large datasets. Despite their success, transformers face challenges in balancing the capture of global context, which is crucial for unveiling forgery clues, with computational complexity. To mitigate this issue, we introduce Band-Attention modulated RetNet (BAR-Net), a lightweight network designed
-
BatSort: Enhanced Battery Classification with Transfer Learning for Battery Sorting and Recycling arXiv.cs.MM Pub Date : 2024-04-08 Yunyi Zhao, Wei Zhang, Erhai Hu, Qingyu Yan, Cheng Xiang, King Jet Tseng, Dusit Niyato
Battery recycling is a critical process for minimizing environmental harm and resource waste from used batteries. However, it is challenging, largely because sorting batteries by type is costly and hard to automate. In this paper, we introduce a machine learning-based approach for battery-type classification and address the daunting problem of data scarcity for the
-
3DMambaIPF: A State Space Model for Iterative Point Cloud Filtering via Differentiable Rendering arXiv.cs.MM Pub Date : 2024-04-08 Qingyuan Zhou, Weidong Yang, Ben Fei, Jingyi Xu, Rui Zhang, Keyi Liu, Yeqi Luo, Ying He
Noise is an inevitable aspect of point cloud acquisition, necessitating filtering as a fundamental task within the realm of 3D vision. Existing learning-based filtering methods have shown promising capabilities on small-scale synthetic or real-world datasets. Nonetheless, the effectiveness of these methods is constrained when dealing with a substantial quantity of point clouds. This limitation primarily
-
TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis arXiv.cs.MM Pub Date : 2024-04-06 Ming Zhou, Weize Quan, Ziqi Zhou, Kai Wang, Tong Wang, Dong-Ming Yan
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities. Despite the remarkable performance exhibited by previous MSA approaches, the presence of inherent multimodal heterogeneities poses a challenge, with the contribution of different modalities varying considerably. Past research predominantly focused on improving representation
-
WebXR, A-Frame and Networked-Aframe as a Basis for an Open Metaverse: A Conceptual Architecture arXiv.cs.MM Pub Date : 2024-04-08 Giuseppe Macario
This work proposes a WebXR-based cross-platform conceptual architecture, leveraging the A-Frame and Networked-Aframe frameworks, in order to facilitate the development of an open, accessible, and interoperable metaverse. By introducing the concept of spatial web app, this research contributes to the discourse on the metaverse, offering an architecture that democratizes access to virtual environments
-
Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM arXiv.cs.MM Pub Date : 2024-04-07 Pingping Zhang, Tianyu Yan, Yang Liu, Huchuan Lu
As an important pillar of underwater intelligence, Marine Animal Segmentation (MAS) involves segmenting animals within marine environments. Previous methods do not excel at extracting long-range contextual features and overlook the connectivity between discrete pixels. Recently, the Segment Anything Model (SAM) has offered a universal framework for general segmentation tasks. Unfortunately, trained with natural
-
D2SL: Decouple Defogging and Semantic Learning for Foggy Domain-Adaptive Segmentation arXiv.cs.MM Pub Date : 2024-04-07 Xuan Sun, Zhanfu An, Yuyu Liu
We investigated domain adaptive semantic segmentation in foggy weather scenarios, which aims to enhance the utilization of unlabeled foggy data and improve the model's adaptability to foggy conditions. Current methods rely on clear images as references, jointly learning defogging and segmentation for foggy images. Despite making some progress, there are still two main drawbacks: (1) the coupling of
-
InstructHumans: Editing Animated 3D Human Textures with Instructions arXiv.cs.MM Pub Date : 2024-04-05 Jiayin Zhu, Linlin Yang, Angela Yao
We present InstructHumans, a novel framework for instruction-driven 3D human texture editing. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models. This work shows that naively using such scores is harmful to editing as they destroy consistency with the source avatar. Instead, we propose an alternate SDS for Editing (SDS-E) that selectively
-
WorDepth: Variational Language Prior for Monocular Depth Estimation arXiv.cs.MM Pub Date : 2024-04-04 Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong
Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, i.e., scale. Predicting a 3D scene from text description(s) is similarly ill-posed, i.e., the spatial arrangement of the objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we
-
BioVL-QR: Egocentric Biochemical Video-and-Language Dataset Using Micro QR Codes arXiv.cs.MM Pub Date : 2024-04-04 Taichi Nishimura, Koki Yamamoto, Yuto Haneji, Keiya Kajimura, Chihiro Nishiwaki, Eriko Daikoku, Natsuko Okuda, Fumihito Ono, Hirotaka Kameko, Shinsuke Mori
This paper introduces a biochemical vision-and-language dataset consisting of 24 egocentric experiment videos, corresponding protocols, and video-and-language alignments. The key challenge in the wet-lab domain is that detecting equipment, reagents, and containers is difficult: the lab environment is cluttered with objects on the table, and some objects are visually indistinguishable. Therefore
-
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement arXiv.cs.MM Pub Date : 2024-04-03 Hao Wu, Huabin Liu, Yu Qiao, Xiao Sun
We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that improves the quality of generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs), we generate rich DVC-oriented caption candidates and optimize the corresponding
-
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model arXiv.cs.MM Pub Date : 2024-04-02 Xu He, Qiaochu Huang, Zhensong Zhang, Zhiwei Lin, Zhiyong Wu, Sicheng Yang, Minglei Li, Zhiyi Chen, Songcen Xu, Xiaofei Wu
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed
-
CIRP: Cross-Item Relational Pre-training for Multimodal Product Bundling arXiv.cs.MM Pub Date : 2024-04-02 Yunshan Ma, Yingzhi He, Wenjun Zhong, Xiang Wang, Roger Zimmermann, Tat-Seng Chua
Product bundling has been a prevailing marketing strategy that is beneficial in the online shopping scenario. Effective product bundling methods depend on high-quality item representations, which need to capture both the individual items' semantics and cross-item relations. However, previous item representation learning methods, either feature fusion or graph learning, suffer from inadequate cross-modal
-
SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding arXiv.cs.MM Pub Date : 2024-04-01 Wenrui Li, Xiaopeng Hong, Xiaopeng Fan
Temporal video grounding (TVG) is a critical task in video content understanding. Despite significant advancements, existing methods often fall short in capturing the fine-grained relationships between multimodal inputs and incur high computational costs when processing long video sequences. To address these limitations, we introduce SpikeMba: a novel multi-modal spiking saliency mamba for temporal video
-
Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation arXiv.cs.MM Pub Date : 2024-03-31 Taekyung Ki, Dongchan Min, Gyeongsu Chae
In this paper, we present Export3D, a one-shot 3D-aware portrait animation method that is able to control the facial expression and camera view of a given portrait image. To achieve this, we introduce a tri-plane generator that directly generates a tri-plane of 3D prior by transferring the expression parameter of 3DMM into the source image. The tri-plane is then decoded into the image of different
-
Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey arXiv.cs.MM Pub Date : 2024-03-31 Qijiong Liu, Jieming Zhu, Yanting Yang, Quanyu Dai, Zhaocheng Du, Xiao-Ming Wu, Zhou Zhao, Rui Zhang, Zhenhua Dong
Personalized recommendation serves as a ubiquitous channel for users to discover information or items tailored to their interests. However, traditional recommendation models primarily rely on unique IDs and categorical features for user-item matching, potentially overlooking the nuanced essence of raw item contents across multiple modalities such as text, image, audio, and video. This underutilization
-
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models arXiv.cs.MM Pub Date : 2024-03-29 Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang
This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on a
-
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions arXiv.cs.MM Pub Date : 2024-03-28 Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent work leverages text instructions to allow users to more freely express their search intents. However, existing work primarily focuses on image pairs that are visually similar and/or can be characterized
-
Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization arXiv.cs.MM Pub Date : 2024-03-28 Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Oliver Deussen, Weiming Dong, Jintao Li, Tong-Yee Lee
Personalized generation paradigms empower designers to customize visual intellectual properties with the help of textual descriptions by tuning or adapting pre-trained text-to-image models on a few images. Recent works explore approaches for concurrently customizing both content and detailed visual style appearance. However, these existing approaches often generate images where the content and style
-
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding arXiv.cs.MM Pub Date : 2024-03-27 Xintong Wang, Jingheng Pan, Liang Ding, Chris Biemann
Large Vision-Language Models (LVLMs) are increasingly adept at generating contextually detailed and coherent responses from visual inputs. However, their application in multimodal decision-making and open-ended generation is hindered by a notable rate of hallucinations, where generated text inaccurately represents the visual contents. To address this issue, this paper introduces the Instruction Contrastive
-
Bringing Textual Prompt to AI-Generated Image Quality Assessment arXiv.cs.MM Pub Date : 2024-03-27 Bowen Qu, Haohui Li, Wei Gao
AI-Generated Images (AGIs) have an inherently multimodal nature. Unlike traditional image quality assessment (IQA) of natural scenes, AGI quality assessment (AGIQA) takes the correspondence between an image and its textual prompt into consideration. This correspondence is entangled in the ground-truth score, which confuses unimodal IQA methods. To solve this problem, we introduce IP-IQA (AGIs Quality Assessment via Image
-
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models arXiv.cs.MM Pub Date : 2024-03-27 Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang
Visual representation learning has been a cornerstone in computer vision, evolving from supervised learning with human-annotated labels to aligning image-text pairs from the Internet. Despite recent advancements in multi-modal large language models (MLLMs), the visual representations they rely on, such as CLIP embeddings, often lack access to external world knowledge critical for real-world visual
-
Spectral Convolutional Transformer: Harmonizing Real vs. Complex Multi-View Spectral Operators for Vision Transformer arXiv.cs.MM Pub Date : 2024-03-26 Badri N. Patro, Vinay P. Namboodiri, Vijay S. Agneeswaran
Transformers used in vision have been investigated through diverse architectures - ViT, PVT, and Swin. These have worked to improve the attention mechanism and make it more efficient. Separately, the need to include local information led to incorporating convolutions in transformers such as CPVT and CvT. Global information is captured using a complex Fourier basis to achieve global
-
Boosting Diffusion Models with Moving Average Sampling in Frequency Domain arXiv.cs.MM Pub Date : 2024-03-26 Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei
Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior
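As a toy sketch of the moving-average mechanism described above (ignoring the paper's frequency-domain treatment), each step's clean-sample estimate is ensembled with all prior estimates via an exponential moving average; predict_x0 is a placeholder for a diffusion model's per-step prediction.

    # Ensemble per-step denoised estimates with an exponential moving average.
    import numpy as np

    def predict_x0(x_t, t):
        return x_t * (1.0 - 0.1 * t / 50)  # stand-in denoiser, not a real model

    def sample_with_moving_average(x_T, steps=50, beta=0.9):
        x_t, ema = x_T, None
        for t in reversed(range(steps)):
            x0_hat = predict_x0(x_t, t)
            # Averaging over all prior estimates damps step-to-step instability.
            ema = x0_hat if ema is None else beta * ema + (1 - beta) * x0_hat
            x_t = ema  # simplified update: feed the averaged estimate forward
        return ema

    print(sample_with_moving_average(np.random.randn(4)).round(3))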
-
FastPerson: Enhancing Video Learning through Effective Video Summarization that Preserves Linguistic and Visual Contexts arXiv.cs.MM Pub Date : 2024-03-26 Kazuki Kawamura, Jun Rekimoto
Quickly understanding lengthy lecture videos is essential for learners with limited time and interest in various topics to improve their learning efficiency. To this end, video summarization has been actively researched to enable users to view only important scenes from a video. However, these studies focus on either the visual or audio information of a video and extract important segments in the video
-
Panonut360: A Head and Eye Tracking Dataset for Panoramic Video arXiv.cs.MM Pub Date : 2024-03-26 Yutong Xu, Junhao Du, Jiahe Wang, Yuwei Ning, Sihan Zhou, Yang Cao
With the rapid development and widespread application of VR/AR technology, maximizing the quality of immersive panoramic video services that match users' personal preferences and habits has become a long-standing challenge. Understanding the saliency region where users focus, based on data collected with HMDs, can promote multimedia encoding, transmission, and quality assessment. At the same time,
-
Towards Low-Latency and Energy-Efficient Hybrid P2P-CDN Live Video Streaming arXiv.cs.MM Pub Date : 2024-03-25 Reza Farahani, Christian Timmerer, Hermann Hellwagner
Streaming segmented videos over the Hypertext Transfer Protocol (HTTP) is an increasingly popular approach in both live and video-on-demand (VoD) applications. However, designing a scalable and adaptable framework that reduces servers' energy consumption and supports low-latency, high-quality services, particularly for live video streaming scenarios, is still challenging for Over-The-Top (OTT) service
-
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models arXiv.cs.MM Pub Date : 2024-03-25 Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, Tao Mei
Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate a static image (i.e., image-to-video generation). The difficulty originates from the fact that the diffusion process of subsequent animated frames should not only preserve faithful alignment with the given image
-
VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation arXiv.cs.MM Pub Date : 2024-03-25 Yang Chen, Yingwei Pan, Haibo Yang, Ting Yao, Tao Mei
Recent innovations on text-to-3D generation have featured Score Distillation Sampling (SDS), which enables the zero-shot learning of implicit 3D models (NeRF) by directly distilling prior knowledge from 2D diffusion models. However, current SDS-based models still struggle with intricate text prompts and commonly result in distorted 3D models with unrealistic textures or cross-view inconsistency issues
-
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution arXiv.cs.MM Pub Date : 2024-03-25 Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei
Diffusion models are just at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation
-
Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization arXiv.cs.MM Pub Date : 2024-03-24 Linzhi Wu, Xingyu Zhang, Yakun Zhang, Changyan Zheng, Tiejun Liu, Liang Xie, Ye Yan, Erwei Yin
Lip reading, the process of interpreting silent speech from visual lip movements, has gained increasing attention for its wide range of realistic applications. Deep learning approaches have greatly improved current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, poses a challenging problem due to inter-speaker variability. A well-trained lip reading system
-
DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes arXiv.cs.MM Pub Date : 2024-03-23 Hao Yan, Zhihui Ke, Xiaobo Zhou, Tie Qiu, Xidong Shi, Dadong Jiang
Implicit neural representations for video (NeRV) have recently become a novel way to achieve high-quality video representation. However, existing works employ a single network to represent the entire video, which implicitly confuses static and dynamic information. This leads to an inability to effectively compress the redundant static information and a lack of explicit modeling of global temporal-coherent
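A toy decomposition in the spirit of the idea above: one static code shared by every frame plus a lightweight dynamic code per frame, combined by a decoder. The decoder here is a trivial stand-in for a NeRV-style network.

    # Decompose a video representation into shared static and per-frame dynamic codes.
    import numpy as np

    static_code = np.random.randn(16)       # global appearance, stored once
    dynamic_codes = np.random.randn(30, 4)  # one small code per frame (30 frames)

    def decode_frame(static, dynamic):
        # Stand-in decoder mixing global appearance with per-frame dynamics.
        return np.outer(dynamic, static).ravel()

    frames = [decode_frame(static_code, d) for d in dynamic_codes]
    print(len(frames), frames[0].shape)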
-
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models arXiv.cs.MM Pub Date : 2024-03-22 Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant to downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less
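The sketch below conveys the skipping idea: score each attention block's importance and bypass the least important ones at inference. The magnitude-based score and stub blocks are illustrative assumptions, not the paper's EAS criterion.

    # Skip attention blocks whose residual updates are judged redundant.
    import numpy as np

    def attention_stub(x, scale):
        return x + scale * np.tanh(x)      # stand-in for an MHA block's residual update

    layer_scales = [0.01, 0.5, 0.02, 0.8]  # toy per-layer update magnitudes

    def forward_with_skipping(x, keep_top=2):
        kept = set(np.argsort(layer_scales)[::-1][:keep_top].tolist())  # most important layers
        for i, scale in enumerate(layer_scales):
            if i in kept:                  # evaluate only the important blocks
                x = attention_stub(x, scale)
        return x

    print(forward_with_skipping(np.ones(3)))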
-
A Picture Is Worth a Graph: Blueprint Debate on Graph for Multimodal Reasoning arXiv.cs.MM Pub Date : 2024-03-22 Changmeng Zheng, Dayong Liang, Wengyu Zhang, Xiao-Yong Wei, Tat-Seng Chua, Qing Li
This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the issue
-
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks arXiv.cs.MM Pub Date : 2024-03-21 Max Ku, Cong Wei, Weiming Ren, Huan Yang, Wenhu Chen
Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free
-
DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance arXiv.cs.MM Pub Date : 2024-03-20 Zixuan Wang, Jia Jia, Shikun Sun, Haozhe Wu, Rong Han, Zhenyu Li, Di Tang, Jiaqing Zhou, Jiebo Luo
Choreographers determine what the dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains an unsolved challenging problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset, which for the first
-
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis arXiv.cs.MM Pub Date : 2024-03-20 Yumeng Li, William Beluch, Margret Keuper, Dan Zhang, Anna Khoreva
Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often