-
Shotit: compute-efficient image-to-video search engine for the cloud arXiv.cs.MM Pub Date : 2024-04-18 Leslie Wong
With the rapid growth of information technology, users are exposed to a massive amount of online data, including images, music, and video. This has created a strong need for effective corresponding search services, such as image, music, and video search. Most of these services are keyword-based, i.e., they use keywords to find related images, music, and videos. Additionally, there are image-to-image
-
A Perspective on Deep Vision Performance with Standard Image and Video Codecs arXiv.cs.MM Pub Date : 2024-04-18 Christoph Reich, Oliver Hahn, Daniel Cremers, Stefan Roth, Biplob Debnath
Resource-constrained hardware, such as edge devices or cell phones, often relies on cloud servers to provide the required computational resources for inference with deep vision models. However, transferring image and video data from an edge or mobile device to a cloud server requires coding to deal with network constraints. The use of standardized codecs, such as JPEG or H.264, is prevalent and required
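To make the studied setting concrete, here is a minimal sketch of the lossy coding round-trip an edge device would apply before cloud-side inference, using Pillow's JPEG encoder; the quality settings and the placeholder input are illustrative assumptions, not the paper's protocol.

    # JPEG round-trip an image before (hypothetical) cloud-side inference,
    # mimicking the lossy coding applied to satisfy network constraints.
    import io
    from PIL import Image

    def jpeg_round_trip(image, quality=50):
        buffer = io.BytesIO()
        image.convert("RGB").save(buffer, format="JPEG", quality=quality)
        buffer.seek(0)
        return Image.open(buffer)

    if __name__ == "__main__":
        img = Image.new("RGB", (224, 224), color=(120, 80, 200))  # placeholder input
        for q in (90, 50, 10):
            degraded = jpeg_round_trip(img, quality=q)
            # A deep vision model would be evaluated on `degraded` here to measure
            # how coding artifacts at each quality level affect its predictions.
            print(q, degraded.size)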
-
Text-controlled Motion Mamba: Text-Instructed Temporal Grounding of Human Motion arXiv.cs.MM Pub Date : 2024-04-17 Xinghan Wang, Zixi Kang, Yadong Mu
Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments
-
Prompt-Guided Generation of Structured Chest X-Ray Report Using a Pre-trained LLM arXiv.cs.MM Pub Date : 2024-04-17 Hongzhao Li, Hongyu Wang, Xia Sun, Hua He, Jun Feng
Medical report generation automates radiology descriptions from images, easing the burden on physicians and minimizing errors. However, current methods lack structured outputs and physician interactivity for clear, clinically relevant reports. Our method introduces a prompt-guided approach to generate structured chest X-ray reports using a pre-trained large language model (LLM). First, we identify
-
DRepMRec: A Dual Representation Learning Framework for Multimodal Recommendation arXiv.cs.MM Pub Date : 2024-04-17 Kangning Zhang, Yingjie Qin, Ruilong Su, Yifan Liu, Jiarui Jin, Weinan Zhang, Yong Yu
Multimodal Recommendation focuses mainly on how to effectively integrate behavior and multimodal information in the recommendation task. Previous works suffer from two major issues. Firstly, the training process tightly couples the behavior module and the multimodal module by jointly optimizing them with shared model parameters, which leads to suboptimal performance since behavior signals and modality
-
Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning arXiv.cs.MM Pub Date : 2024-04-16 Zhengyang Liang, Meiyu Liang, Wei Huang, Yawen Li, Zhe Xue
In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose a novel
-
Retrieval Augmented Verification: Unveiling Disinformation with Structured Representations for Zero-Shot Real-Time Evidence-guided Fact-Checking of Multi-modal Social media posts arXiv.cs.MM Pub Date : 2024-04-16 Arka Ujjal Dey, Artemis Llabrés, Ernest Valveny, Dimosthenis Karatzas
Social media posts, in which real images are unscrupulously reused alongside provocative text to promote a particular idea, have been one of the major sources of disinformation. By design, these claims lack editorial oversight yet are accessible to a vast population that may otherwise not have access to multiple information sources. This implies the need to fact-check these posts and clearly explain
-
AllTheDocks road safety dataset: A cyclist's perspective and experience arXiv.cs.MM Pub Date : 2024-04-16 Chia-Yen Chiang, Ruikang Zhong, Jennifer Ding, Joseph Wood, Stephen Bee, Mona Jaber
Active travel is an essential component of intelligent transportation systems. Cycling, as a form of active travel, shares road space with motorised traffic, which often affects cyclists' safety and comfort and therefore people's propensity to take up cycling instead of driving. This paper presents a unique dataset, collected by cyclists across London, that includes video footage, accelerometer
-
Referring Flexible Image Restoration arXiv.cs.MM Pub Date : 2024-04-16 Runwei Guan, Rongsheng Hu, Zhuhao Zhou, Tianlang Xue, Ka Lok Man, Jeremy Smith, Eng Gee Lim, Weiping Ding, Yutao Yue
In reality, images often exhibit multiple degradations, such as rain and fog at night (triple degradations). However, in many cases, individuals may not want to remove all degradations, for instance, a blurry lens revealing a beautiful snowy landscape (double degradations). In such scenarios, people may only desire to deblur. These situations and requirements shed light on a new challenge in image
-
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search arXiv.cs.MM Pub Date : 2024-04-16 Jintao Sun, Zhedong Zheng, Gangyi Ding
In text-based person search endeavors, data generation has emerged as a prevailing practice, addressing concerns over privacy preservation and the arduous task of manual annotation. Although the amount of synthesized data can in theory be infinite, the scientific question persists of how much generated data optimally fuels subsequent model training. We observe that only a subset of the data in these
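As a toy illustration of the filtering idea, the sketch below keeps only the top-scoring fraction of synthetic samples for training; the length-based scorer is a stand-in assumption, not the paper's actual selection criterion.

    # Keep only the highest-scoring fraction of generated samples (toy sketch).
    from typing import Callable, List

    def filter_top_fraction(samples: List[str],
                            score_fn: Callable[[str], float],
                            keep_fraction: float = 0.3) -> List[str]:
        ranked = sorted(samples, key=score_fn, reverse=True)  # rank by score
        k = max(1, int(len(ranked) * keep_fraction))          # size of the kept subset
        return ranked[:k]

    # Toy proxy: prefer longer captions as a stand-in for informativeness.
    captions = ["a person", "a person in a red coat crossing the street", "someone"]
    print(filter_top_fraction(captions, score_fn=len, keep_fraction=0.5))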
-
ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis arXiv.cs.MM Pub Date : 2024-04-15 Aashish Anantha Ramakrishnan, Sharon X. Huang, Dongwon Lee
Text-to-Image (T2I) Synthesis has made tremendous strides in enhancing synthesized image quality, but current datasets evaluate model performance only on descriptive, instruction-based prompts. Real-world news image captions take a more pragmatic approach, providing high-level situational and Named-Entity (NE) information and limited physical object descriptions, making them abstractive. To evaluate
-
Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics arXiv.cs.MM Pub Date : 2024-04-14 Haosong Peng, Wei Feng, Hao Li, Yufeng Zhan, Qihua Zhou, Yuanqing Xia
The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architectures (e.g., CNNs and RNNs), employ various strategies to filter out non-region-of-interest content to minimize bandwidth and computation consumption but show inferior performance in adverse environments. Recently, visual foundation models based on transformers have
-
Do LLMs Understand Visual Anomalies? Uncovering LLM Capabilities in Zero-shot Anomaly Detection arXiv.cs.MM Pub Date : 2024-04-15 Jiaqi Zhu, Shaofeng Cai, Fang Deng, Junran Wu
Large vision-language models (LVLMs) are markedly proficient in deriving visual representations guided by natural language. Recent explorations have utilized LVLMs to tackle zero-shot visual anomaly detection (VAD) challenges by pairing images with textual descriptions indicative of normal and abnormal conditions, referred to as anomaly prompts. However, existing approaches depend on static anomaly
-
State Space Model for New-Generation Network Alternative to Transformers: A Survey arXiv.cs.MM Pub Date : 2024-04-15 Xiao Wang, Shiao Wang, Yuhe Ding, Yuehang Li, Wentao Wu, Yao Rong, Weizhe Kong, Ju Huang, Shihao Li, Haoxiang Yang, Ziwen Wang, Bo Jiang, Chenglong Li, Yaowei Wang, Yonghong Tian, Jin Tang
In the post-deep learning era, the Transformer architecture has demonstrated its powerful performance across pre-trained big models and various downstream tasks. However, the enormous computational demands of this architecture have deterred many researchers. To further reduce the complexity of attention models, numerous efforts have been made to design more efficient methods. Among them, the State
-
Seeing Text in the Dark: Algorithm and Benchmark arXiv.cs.MM Pub Date : 2024-04-13 Chengpei Xu, Hao Fu, Long Ma, Wenjing Jia, Chengqi Zhang, Feng Xia, Xiaoyu Ai, Binghao Li, Wenjie Zhang
Localizing text in low-light environments is challenging due to visual degradations. Although a straightforward solution involves a two-stage pipeline with low-light image enhancement (LLE) as the initial step followed by a detector, LLE is primarily designed for human vision rather than machine vision and can accumulate errors. In this work, we propose an efficient and effective single-stage approach for localizing
-
Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation arXiv.cs.MM Pub Date : 2024-04-12 Yichen Yan, Xingjian He, Sihan Chen, Jing Liu
Referring image segmentation aims to segment an object referred to by natural language expression from an image. The primary challenge lies in the efficient propagation of fine-grained semantic information from textual features to visual features. Many recent works utilize a Transformer to address this challenge. However, conventional transformer decoders can distort linguistic information with deeper
-
Multimodal Emotion Recognition by Fusing Video Semantic in MOOC Learning Scenarios arXiv.cs.MM Pub Date : 2024-04-11 Yuan Zhang, Xiaomei Tao, Hanxu Ai, Tao Chen, Yanling Gan
In the Massive Open Online Courses (MOOC) learning scenario, learners mainly acquire knowledge by watching instructional videos, and the semantic information in these videos directly affects learners' emotional states. However, few studies have paid attention to the potential influence of the semantic
-
GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models arXiv.cs.MM Pub Date : 2024-04-10 Zewei Zhang, Huan Liu, Jun Chen, Xiangyu Xu
In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose
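The sketch below illustrates the alternating schedule in the spirit of the description above: several drag updates interleaved with a denoising pass each round, rather than denoising only at the end. The two step functions are toy numeric stand-ins, not the paper's diffusion operators.

    # Alternate drag and denoise operations instead of dragging all the way first.
    import numpy as np

    def drag_step(latent, target, lr=0.1):
        return latent + lr * (target - latent)               # toy 'drag' toward the target

    def denoise_step(latent, strength=0.5):
        return latent - strength * (latent - latent.mean())  # toy perturbation cleanup

    def alternating_drag_denoise(latent, target, n_rounds=5, drags_per_round=3):
        for _ in range(n_rounds):
            for _ in range(drags_per_round):
                latent = drag_step(latent, target)
            latent = denoise_step(latent)                    # correct accumulated perturbations
        return latent

    print(alternating_drag_denoise(np.zeros(4), np.ones(4)))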
-
Efficient and Generic Point Model for Lossless Point Cloud Attribute Compression arXiv.cs.MM Pub Date : 2024-04-10 Kang You, Pan Gao, Zhan Ma
The past several years have witnessed the emergence of learned point cloud compression (PCC) techniques. However, current learning-based lossless point cloud attribute compression (PCAC) methods either suffer from high computational complexity or deteriorated compression performance. Moreover, the significant variations in point cloud scale and sparsity encountered in real-world applications make developing
-
Demonstration of MaskSearch: Efficiently Querying Image Masks for Machine Learning Workflows arXiv.cs.MM Pub Date : 2024-04-09 Lindsey Linxi Wei, Chung Yik Edward Yeung, Hongjian Yu, Jingchuan Zhou, Dong He, Magdalena Balazinska
We demonstrate MaskSearch, a system designed to accelerate queries over databases of image masks generated by machine learning models. MaskSearch formalizes and accelerates a new category of queries for retrieving images and their corresponding masks based on mask properties, which support various applications, from identifying spurious correlations learned by models to exploring discrepancies between
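For intuition, the plain-NumPy sketch below expresses the kind of mask-property query MaskSearch is built to accelerate: retrieving masks whose positive-pixel count inside a region of interest exceeds a threshold. It illustrates the query semantics only and is not MaskSearch's actual API.

    # Naive scan over a mask database; MaskSearch's contribution is accelerating
    # exactly this class of query with indexes rather than a full scan.
    import numpy as np

    def region_pixel_count(mask, roi):
        y0, y1, x0, x1 = roi                  # roi = (y0, y1, x0, x1)
        return int(mask[y0:y1, x0:x1].sum())  # positive pixels inside the region

    masks = {f"img_{i}": (np.random.rand(64, 64) > 0.5).astype(np.uint8)
             for i in range(100)}
    roi, threshold = (0, 32, 0, 32), 600
    hits = [name for name, m in masks.items() if region_pixel_count(m, roi) > threshold]
    print(len(hits), "masks match")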
-
Dynamic Resolution Guidance for Facial Expression Recognition arXiv.cs.MM Pub Date : 2024-04-09 Jie Ou, Xu Li, Tianxiang Jiang, Yuanlun Xie
Facial expression recognition (FER) is vital for human-computer interaction and emotion analysis, yet recognizing expressions in low-resolution images remains challenging. This paper introduces a practical method called Dynamic Resolution Guidance for Facial Expression Recognition (DRGFER) to effectively recognize facial expressions in images with varying resolutions without compromising FER model
-
ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos arXiv.cs.MM Pub Date : 2024-04-09 Sharana Dharshikgan Suresh Dass, Hrishav Bakul Barua, Ganesh Krishnasamy, Raveendran Paramesran, Raphael C. -W. Phan
Human action or activity recognition in videos is a fundamental task in computer vision with applications in surveillance and monitoring, self-driving cars, sports analytics, human-robot interaction and many more. Traditional supervised methods require large annotated datasets for training, which are expensive and time-consuming to acquire. This work proposes a novel approach using Cross-Architecture
-
Zero-Shot Relational Learning for Multimodal Knowledge Graphs arXiv.cs.MM Pub Date : 2024-04-09 Rui Cai, Shichao Pei, Xiangliang Zhang
Relational learning is an essential task in the domain of knowledge representation, particularly in knowledge graph completion (KGC). While relational learning in traditional single-modal settings has been extensively studied, exploring it within a multimodal KGC context presents distinct challenges and opportunities. One of the major challenges is inference on newly discovered relations without any
-
Band-Attention Modulated RetNet for Face Forgery Detection arXiv.cs.MM Pub Date : 2024-04-09 Zhida Zhang, Jie Cao, Wenkui Yang, Qihang Fan, Kai Zhou, Ran He
Transformer networks are extensively utilized in face forgery detection due to their scalability across large datasets. Despite their success, transformers face challenges in balancing the capture of global context, which is crucial for unveiling forgery clues, with computational complexity. To mitigate this issue, we introduce Band-Attention modulated RetNet (BAR-Net), a lightweight network designed
-
BatSort: Enhanced Battery Classification with Transfer Learning for Battery Sorting and Recycling arXiv.cs.MM Pub Date : 2024-04-08 Yunyi Zhao, Wei Zhang, Erhai Hu, Qingyu Yan, Cheng Xiang, King Jet Tseng, Dusit Niyato
Battery recycling is a critical process for minimizing environmental harm and resource waste from used batteries. However, it is challenging, largely because sorting batteries by type is costly and hard to automate. In this paper, we introduce a machine learning-based approach for battery-type classification and address the daunting problem of data scarcity for the
-
3DMambaIPF: A State Space Model for Iterative Point Cloud Filtering via Differentiable Rendering arXiv.cs.MM Pub Date : 2024-04-08 Qingyuan Zhou, Weidong Yang, Ben Fei, Jingyi Xu, Rui Zhang, Keyi Liu, Yeqi Luo, Ying He
Noise is an inevitable aspect of point cloud acquisition, necessitating filtering as a fundamental task within the realm of 3D vision. Existing learning-based filtering methods have shown promising capabilities on small-scale synthetic or real-world datasets. Nonetheless, the effectiveness of these methods is constrained when dealing with a substantial quantity of point clouds. This limitation primarily
-
TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis arXiv.cs.MM Pub Date : 2024-04-06 Ming Zhou, Weize Quan, Ziqi Zhou, Kai Wang, Tong Wang, Dong-Ming Yan
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment by leveraging language, visual, and acoustic modalities. Despite the remarkable performance exhibited by previous MSA approaches, the presence of inherent multimodal heterogeneities poses a challenge, with the contribution of different modalities varying considerably. Past research predominantly focused on improving representation
-
WebXR, A-Frame and Networked-Aframe as a Basis for an Open Metaverse: A Conceptual Architecture arXiv.cs.MM Pub Date : 2024-04-08 Giuseppe Macario
This work proposes a WebXR-based cross-platform conceptual architecture, leveraging the A-Frame and Networked-Aframe frameworks, in order to facilitate the development of an open, accessible, and interoperable metaverse. By introducing the concept of spatial web app, this research contributes to the discourse on the metaverse, offering an architecture that democratizes access to virtual environments
-
Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM arXiv.cs.MM Pub Date : 2024-04-07 Pingping Zhang, Tianyu Yan, Yang Liu, Huchuan Lu
As an important pillar of underwater intelligence, Marine Animal Segmentation (MAS) involves segmenting animals within marine environments. Previous methods do not excel at extracting long-range contextual features and overlook the connectivity between discrete pixels. Recently, the Segment Anything Model (SAM) has offered a universal framework for general segmentation tasks. Unfortunately, trained with natural
-
D2SL: Decouple Defogging and Semantic Learning for Foggy Domain-Adaptive Segmentation arXiv.cs.MM Pub Date : 2024-04-07 Xuan Sun, Zhanfu An, Yuyu Liu
We investigated domain adaptive semantic segmentation in foggy weather scenarios, which aims to enhance the utilization of unlabeled foggy data and improve the model's adaptability to foggy conditions. Current methods rely on clear images as references, jointly learning defogging and segmentation for foggy images. Despite making some progress, there are still two main drawbacks: (1) the coupling of
-
InstructHumans: Editing Animated 3D Human Textures with Instructions arXiv.cs.MM Pub Date : 2024-04-05 Jiayin Zhu, Linlin Yang, Angela Yao
We present InstructHumans, a novel framework for instruction-driven 3D human texture editing. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models. This work shows that naively using such scores is harmful to editing as they destroy consistency with the source avatar. Instead, we propose an alternate SDS for Editing (SDS-E) that selectively
-
WorDepth: Variational Language Prior for Monocular Depth Estimation arXiv.cs.MM Pub Date : 2024-04-04 Ziyao Zeng, Daniel Wang, Fengyu Yang, Hyoungseob Park, Yangchao Wu, Stefano Soatto, Byung-Woo Hong, Dong Lao, Alex Wong
Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities, i.e., scale. Predicting a 3D scene from text description(s) is similarly ill-posed, i.e., the spatial arrangement of the objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this, we
-
BioVL-QR: Egocentric Biochemical Video-and-Language Dataset Using Micro QR Codes arXiv.cs.MM Pub Date : 2024-04-04 Taichi Nishimura, Koki Yamamoto, Yuto Haneji, Keiya Kajimura, Chihiro Nishiwaki, Eriko Daikoku, Natsuko Okuda, Fumihito Ono, Hirotaka Kameko, Shinsuke Mori
This paper introduces a biochemical vision-and-language dataset consisting of 24 egocentric experiment videos, corresponding protocols, and video-and-language alignments. The key challenge in the wet-lab domain is that detecting equipment, reagents, and containers is difficult: the lab environment is cluttered with objects on the table, and some objects are visually indistinguishable. Therefore
-
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement arXiv.cs.MM Pub Date : 2024-04-03 Hao Wu, Huabin Liu, Yu Qiao, Xiao Sun
We present Dive Into the BoundarieS (DIBS), a novel pretraining framework for dense video captioning (DVC) that improves the quality of generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs), we generate rich DVC-oriented caption candidates and optimize the corresponding
-
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model arXiv.cs.MM Pub Date : 2024-04-02 Xu He, Qiaochu Huang, Zhensong Zhang, Zhiwei Lin, Zhiyong Wu, Sicheng Yang, Minglei Li, Zhiyi Chen, Songcen Xu, Xiaofei Wu
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed
-
CIRP: Cross-Item Relational Pre-training for Multimodal Product Bundling arXiv.cs.MM Pub Date : 2024-04-02 Yunshan Ma, Yingzhi He, Wenjun Zhong, Xiang Wang, Roger Zimmermann, Tat-Seng Chua
Product bundling has been a prevailing marketing strategy that is beneficial in the online shopping scenario. Effective product bundling methods depend on high-quality item representations, which need to capture both the individual items' semantics and cross-item relations. However, previous item representation learning methods, either feature fusion or graph learning, suffer from inadequate cross-modal
-
SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding arXiv.cs.MM Pub Date : 2024-04-01 Wenrui Li, Xiaopeng Hong, Xiaopeng Fan
Temporal video grounding (TVG) is a critical task in video content understanding. Despite significant advancements, existing methods often fall short in capturing the fine-grained relationships between multimodal inputs and incur high computational costs when processing long video sequences. To address these limitations, we introduce SpikeMba: a novel multi-modal spiking saliency mamba for temporal video
-
Learning to Generate Conditional Tri-plane for 3D-aware Expression Controllable Portrait Animation arXiv.cs.MM Pub Date : 2024-03-31 Taekyung Ki, Dongchan Min, Gyeongsu Chae
In this paper, we present Export3D, a one-shot 3D-aware portrait animation method that is able to control the facial expression and camera view of a given portrait image. To achieve this, we introduce a tri-plane generator that directly generates a tri-plane of 3D prior by transferring the expression parameter of 3DMM into the source image. The tri-plane is then decoded into the image of different
-
Multimodal Pretraining, Adaptation, and Generation for Recommendation: A Survey arXiv.cs.MM Pub Date : 2024-03-31 Qijiong Liu, Jieming Zhu, Yanting Yang, Quanyu Dai, Zhaocheng Du, Xiao-Ming Wu, Zhou Zhao, Rui Zhang, Zhenhua Dong
Personalized recommendation serves as a ubiquitous channel for users to discover information or items tailored to their interests. However, traditional recommendation models primarily rely on unique IDs and categorical features for user-item matching, potentially overlooking the nuanced essence of raw item contents across multiple modalities such as text, image, audio, and video. This underutilization
-
ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models arXiv.cs.MM Pub Date : 2024-03-29 Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang
This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts a three-level multimodal capability hierarchy, mimicking human cognitive processes by stacking up perception, reasoning, and creativity. Each level focuses on a
-
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions arXiv.cs.MM Pub Date : 2024-03-28 Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, Ming-Wei Chang
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent work leverages text instructions to allow users to more freely express their search intents. However, existing work primarily focuses on image pairs that are visually similar and/or can be characterized
-
Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization arXiv.cs.MM Pub Date : 2024-03-28 Yu Xu, Fan Tang, Juan Cao, Yuxin Zhang, Oliver Deussen, Weiming Dong, Jintao Li, Tong-Yee Lee
Personalized generation paradigms empower designers to customize visual intellectual properties with the help of textual descriptions by tuning or adapting pre-trained text-to-image models on a few images. Recent works explore approaches for concurrently customizing both content and detailed visual style appearance. However, these existing approaches often generate images where the content and style
-
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding arXiv.cs.MM Pub Date : 2024-03-27 Xintong Wang, Jingheng Pan, Liang Ding, Chris Biemann
Large Vision-Language Models (LVLMs) are increasingly adept at generating contextually detailed and coherent responses from visual inputs. However, their application in multimodal decision-making and open-ended generation is hindered by a notable rate of hallucinations, where generated text inaccurately represents the visual contents. To address this issue, this paper introduces the Instruction Contrastive
-
Bringing Textual Prompt to AI-Generated Image Quality Assessment arXiv.cs.MM Pub Date : 2024-03-27 Bowen Qu, Haohui Li, Wei Gao
AI-Generated Images (AGIs) have an inherently multimodal nature. Unlike traditional image quality assessment (IQA) of natural scenes, AGI quality assessment (AGIQA) takes the correspondence between an image and its textual prompt into consideration. This correspondence is entangled in the ground-truth score, which confuses unimodal IQA methods. To solve this problem, we introduce IP-IQA (AGIs Quality Assessment via Image
-
Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models arXiv.cs.MM Pub Date : 2024-03-27 Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang
Visual representation learning has been a cornerstone in computer vision, evolving from supervised learning with human-annotated labels to aligning image-text pairs from the Internet. Despite recent advancements in multi-modal large language models (MLLMs), the visual representations they rely on, such as CLIP embeddings, often lack access to external world knowledge critical for real-world visual
-
Spectral Convolutional Transformer: Harmonizing Real vs. Complex Multi-View Spectral Operators for Vision Transformer arXiv.cs.MM Pub Date : 2024-03-26 Badri N. Patro, Vinay P. Namboodiri, Vijay S. Agneeswaran
Transformers used in vision have been investigated through diverse architectures - ViT, PVT, and Swin. These have worked to improve the attention mechanism and make it more efficient. Separately, the need to include local information led to incorporating convolutions in transformers such as CPVT and CvT. Global information is captured using a complex Fourier basis to achieve global
-
Boosting Diffusion Models with Moving Average Sampling in Frequency Domain arXiv.cs.MM Pub Date : 2024-03-26 Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei
Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities, most of these models rely on the current sample to denoise the next one, possibly resulting in denoising instability. In this paper, we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior
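As a toy sketch of the moving-average mechanism described above (ignoring the paper's frequency-domain treatment), each step's clean-sample estimate is ensembled with all prior estimates via an exponential moving average; predict_x0 is a placeholder for a diffusion model's per-step prediction.

    # Ensemble per-step denoised estimates with an exponential moving average.
    import numpy as np

    def predict_x0(x_t, t):
        return x_t * (1.0 - 0.1 * t / 50)  # stand-in denoiser, not a real model

    def sample_with_moving_average(x_T, steps=50, beta=0.9):
        x_t, ema = x_T, None
        for t in reversed(range(steps)):
            x0_hat = predict_x0(x_t, t)
            # Averaging over all prior estimates damps step-to-step instability.
            ema = x0_hat if ema is None else beta * ema + (1 - beta) * x0_hat
            x_t = ema  # simplified update: feed the averaged estimate forward
        return ema

    print(sample_with_moving_average(np.random.randn(4)).round(3))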
-
FastPerson: Enhancing Video Learning through Effective Video Summarization that Preserves Linguistic and Visual Contexts arXiv.cs.MM Pub Date : 2024-03-26 Kazuki Kawamura, Jun Rekimoto
Quickly understanding lengthy lecture videos is essential for learners with limited time and interest in various topics to improve their learning efficiency. To this end, video summarization has been actively researched to enable users to view only important scenes from a video. However, these studies focus on either the visual or audio information of a video and extract important segments in the video
-
Panonut360: A Head and Eye Tracking Dataset for Panoramic Video arXiv.cs.MM Pub Date : 2024-03-26 Yutong Xu, Junhao Du, Jiahe Wang, Yuwei Ning, Sihan Zhou, Yang Cao
With the rapid development and widespread application of VR/AR technology, maximizing the quality of immersive panoramic video services that match users' personal preferences and habits has become a long-standing challenge. Understanding the saliency region where users focus, based on data collected with HMDs, can promote multimedia encoding, transmission, and quality assessment. At the same time,
-
Towards Low-Latency and Energy-Efficient Hybrid P2P-CDN Live Video Streaming arXiv.cs.MM Pub Date : 2024-03-25 Reza Farahani, Christian Timmerer, Hermann Hellwagner
Streaming segmented videos over the Hypertext Transfer Protocol (HTTP) is an increasingly popular approach in both live and video-on-demand (VoD) applications. However, designing a scalable and adaptable framework that reduces servers' energy consumption and supports low-latency, high-quality services, particularly for live video streaming scenarios, is still challenging for Over-The-Top (OTT) service
-
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models arXiv.cs.MM Pub Date : 2024-03-25 Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, Tao Mei
Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless, the problem is not trivial when shaping diffusion models to animate a static image (i.e., image-to-video generation). The difficulty originates from the fact that the diffusion process of subsequent animated frames should not only preserve faithful alignment with the given image
-
VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation arXiv.cs.MM Pub Date : 2024-03-25 Yang Chen, Yingwei Pan, Haibo Yang, Ting Yao, Tao Mei
Recent innovations on text-to-3D generation have featured Score Distillation Sampling (SDS), which enables the zero-shot learning of implicit 3D models (NeRF) by directly distilling prior knowledge from 2D diffusion models. However, current SDS-based models still struggle with intricate text prompts and commonly result in distorted 3D models with unrealistic textures or cross-view inconsistency issues
-
Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution arXiv.cs.MM Pub Date : 2024-03-25 Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei
Diffusion models are just at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation
-
Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization arXiv.cs.MM Pub Date : 2024-03-24 Linzhi Wu, Xingyu Zhang, Yakun Zhang, Changyan Zheng, Tiejun Liu, Liang Xie, Ye Yan, Erwei Yin
Lip reading, the process of interpreting silent speech from visual lip movements, has gained increasing attention for its wide range of realistic applications. Deep learning approaches have greatly improved current lip reading systems. However, lip reading in cross-speaker scenarios, where the speaker identity changes, poses a challenging problem due to inter-speaker variability. A well-trained lip reading system
-
DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes arXiv.cs.MM Pub Date : 2024-03-23 Hao Yan, Zhihui Ke, Xiaobo Zhou, Tie Qiu, Xidong Shi, Dadong Jiang
Implicit neural representations for video (NeRV) have recently become a novel way to achieve high-quality video representation. However, existing works employ a single network to represent the entire video, which implicitly confuses static and dynamic information. This leads to an inability to effectively compress the redundant static information and a lack of explicit modeling of global temporal-coherent
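A toy decomposition in the spirit of the idea above: one static code shared by every frame plus a lightweight dynamic code per frame, combined by a decoder. The decoder here is a trivial stand-in for a NeRV-style network.

    # Decompose a video representation into shared static and per-frame dynamic codes.
    import numpy as np

    static_code = np.random.randn(16)       # global appearance, stored once
    dynamic_codes = np.random.randn(30, 4)  # one small code per frame (30 frames)

    def decode_frame(static, dynamic):
        # Stand-in decoder mixing global appearance with per-frame dynamics.
        return np.outer(dynamic, static).ravel()

    frames = [decode_frame(static_code, d) for d in dynamic_codes]
    print(len(frames), frames[0].shape)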
-
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models arXiv.cs.MM Pub Date : 2024-03-22 Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji
In this paper, we propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs), termed Efficient Attention Skipping (EAS). Concretely, we first reveal that multi-head attentions (MHAs), the main computational overhead of MLLMs, are often redundant to downstream tasks. Based on this observation, EAS evaluates the attention redundancy and skips the less
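The sketch below conveys the skipping idea: score each attention block's importance and bypass the least important ones at inference. The magnitude-based score and stub blocks are illustrative assumptions, not the paper's EAS criterion.

    # Skip attention blocks whose residual updates are judged redundant.
    import numpy as np

    def attention_stub(x, scale):
        return x + scale * np.tanh(x)      # stand-in for an MHA block's residual update

    layer_scales = [0.01, 0.5, 0.02, 0.8]  # toy per-layer update magnitudes

    def forward_with_skipping(x, keep_top=2):
        kept = set(np.argsort(layer_scales)[::-1][:keep_top].tolist())  # most important layers
        for i, scale in enumerate(layer_scales):
            if i in kept:                  # evaluate only the important blocks
                x = attention_stub(x, scale)
        return x

    print(forward_with_skipping(np.ones(3)))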
-
A Picture Is Worth a Graph: Blueprint Debate on Graph for Multimodal Reasoning arXiv.cs.MM Pub Date : 2024-03-22 Changmeng Zheng, Dayong Liang, Wengyu Zhang, Xiao-Yong Wei, Tat-Seng Chua, Qing Li
This paper presents a pilot study aimed at introducing multi-agent debate into multimodal reasoning. The study addresses two key challenges: the trivialization of opinions resulting from excessive summarization and the diversion of focus caused by distractor concepts introduced from images. These challenges stem from the inductive (bottom-up) nature of existing debating schemes. To address the issue
-
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks arXiv.cs.MM Pub Date : 2024-03-21 Max Ku, Cong Wei, Weiming Ren, Huan Yang, Wenhu Chen
Video-to-video editing involves editing a source video along with additional control (such as text prompts, subjects, or styles) to generate a new video that aligns with the source video and the provided control. Traditional methods have been constrained to certain editing types, limiting their ability to meet the wide range of user demands. In this paper, we introduce AnyV2V, a novel training-free
-
DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance arXiv.cs.MM Pub Date : 2024-03-20 Zixuan Wang, Jia Jia, Shikun Sun, Haozhe Wu, Rong Han, Zhenyu Li, Di Tang, Jiaqing Zhou, Jiebo Luo
Choreographers determine what the dances look like, while cameramen determine the final presentation of dances. Recently, various methods and datasets have showcased the feasibility of dance synthesis. However, camera movement synthesis with music and dance remains an unsolved challenging problem due to the scarcity of paired data. Thus, we present DCM, a new multi-modal 3D dataset, which for the first
-
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis arXiv.cs.MM Pub Date : 2024-03-20 Yumeng Li, William Beluch, Margret Keuper, Dan Zhang, Anna Khoreva
Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the necessary visual change-over-time implied in the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often