arXiv - CS - Databases: latest papers, Computer Science, Software Engineering journals

Real-Time Trajectory Synthesis with Local Differential Privacy

arXiv.cs.DB Pub Date : 2024-04-17
Yujia Hu, Yuntao Du, Zhikun Zhang, Ziquan Fang, Lu Chen, Kai Zheng, Yunjun Gao

Trajectory streams are being generated from location-aware devices, such as smartphones and in-vehicle navigation systems. Due to the sensitive nature of the location data, directly sharing user trajectories suffers from privacy leakage issues. Local differential privacy (LDP), which perturbs sensitive data on the user side before it is shared or analyzed, emerges as a promising solution for private
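The idea of perturbing data on the user side can be illustrated with k-ary (generalized) randomized response, a textbook LDP mechanism. This is only a sketch of the general principle, not the trajectory synthesis method of the paper; the function names, the cell labels, and the choice of mechanism are assumptions made for illustration.

```python
import math
import random

def grr_keep_probability(epsilon, domain_size):
    """Probability of reporting the true value under k-ary randomized response."""
    e = math.exp(epsilon)
    return e / (e + domain_size - 1)

def grr_perturb(value, domain, epsilon, rng=random):
    """Perturb one categorical value (e.g. a grid cell) on the user's device."""
    if rng.random() < grr_keep_probability(epsilon, len(domain)):
        return value  # report truthfully
    # Otherwise report one of the other values uniformly at random.
    others = [v for v in domain if v != value]
    return rng.choice(others)

# Hypothetical usage: a user located in cell "A" reports a perturbed cell.
cells = ["A", "B", "C", "D"]
report = grr_perturb("A", cells, epsilon=1.0, rng=random.Random(0))
```

The true value is reported with probability e^ε / (e^ε + k − 1) and every other value with probability 1 / (e^ε + k − 1), so the ratio between any two report probabilities is at most e^ε, which is the ε-LDP condition.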

Updated: 2024-04-18

LLMTune: Accelerate Database Knob Tuning with Large Language Models

arXiv.cs.DB Pub Date : 2024-04-17
Xinmei Huang, Haoyang Li, Jing Zhang, Xinxin Zhao, Zhiming Yao, Yiyan Li, Zhuohao Yu, Tieying Zhang, Hong Chen, Cuiping Li

Database knob tuning is a critical challenge in the database community, aiming to optimize knob values to enhance database performance for specific workloads. DBMSs often feature hundreds of tunable knobs, posing a significant challenge for DBAs to recommend optimal configurations. Consequently, many machine learning-based tuning methods have been developed to automate this process. Despite the introduction

Updated: 2024-04-18

climber++: Pivot-Based Approximate Similarity Search over Big Data Series

arXiv.cs.DB Pub Date : 2024-04-15
Liang Zhang, Mohamed Y. Eltabakh, Elke A. Rundensteiner, Khalid Alnuaim

The generation and collection of big data series are becoming an integral part of many emerging applications in sciences, IoT, finance, and web applications among several others. The terabyte-scale of data series has motivated recent efforts to design fully distributed techniques for supporting operations such as approximate kNN similarity search, which is a building block operation in most analytics

Updated: 2024-04-16

Optimizing Disjunctive Queries with Tagged Execution

arXiv.cs.DB Pub Date : 2024-04-14
Albert Kim, Samuel Madden

Despite decades of research into query optimization, optimizing queries with disjunctive predicate expressions remains a challenge. Solutions employed by existing systems (if any) are often simplistic and lead to much redundant work being performed by the execution engine. To address these problems, we propose a novel form of query execution called tagged execution. Tagged execution groups tuples into

Updated: 2024-04-16

Can LLMs substitute SQL? Comparing Resource Utilization of Querying LLMs versus Traditional Relational Databases

arXiv.cs.DB Pub Date : 2024-04-12
Xiang Zhang, Khatoon Khedri, Reza Rawassizadeh

Large Language Models (LLMs) can automate or substitute different types of tasks in the software engineering process. This study evaluates the resource utilization and accuracy of LLMs in interpreting and executing natural language queries against traditional SQL within relational database management systems. We empirically examine the resource utilization and accuracy of nine LLMs varying from 7 to

Updated: 2024-04-16

Enc2DB: A Hybrid and Adaptive Encrypted Query Processing Framework

arXiv.cs.DB Pub Date : 2024-04-10
Hui Li, Jingwen Shi, Qi Tian, Zheng Li, Yan Fu, Bingqing Shen, Yaofeng Tu

As cloud computing gains traction, data owners are outsourcing their data to cloud service providers (CSPs) for Database Service (DBaaS), bringing in a deviation of data ownership and usage, and intensifying privacy concerns, especially with potential breaches by hackers or CSP insiders. To address that, encrypted database services propose encrypting every tuple and query statement before submitting

Updated: 2024-04-11

Automatic Configuration Tuning on Cloud Database: A Survey

arXiv.cs.DB Pub Date : 2024-04-09
Limeng Zhang, M. Ali Babar

Faced with the challenges of big data, modern cloud database management systems are designed to efficiently store, organize, and retrieve data, supporting optimal performance, scalability, and reliability for complex data processing and analysis. However, achieving good performance in modern databases is non-trivial as they are notorious for having dozens of configurable knobs, such as hardware setup

Updated: 2024-04-10

PM4Py.LLM: a Comprehensive Module for Implementing PM on LLMs

arXiv.cs.DB Pub Date : 2024-04-09
Alessandro Berti

pm4py is a process mining library for Python implementing several process mining (PM) artifacts and algorithms. It also offers methods to integrate PM with large language models (LLMs). This paper examines how the current paradigms of PM on LLM are implemented in pm4py, identifying challenges such as privacy, hallucinations, and the context window limit.

Updated: 2024-04-10

Balanced Partitioning for Optimizing Big Graph Computation: Complexities and Approximation Algorithms

arXiv.cs.DB Pub Date : 2024-04-09
Baoling Ning, Jianzhong Li

Graph partitioning is a key fundamental problem in the area of big graph computation. Previous works do not consider the practical requirements when optimizing the big data analysis in real applications. In this paper, motivated by optimizing the big data computing applications, two typical problems of graph partitioning are studied. The first problem is to optimize the performance of specific workloads

Updated: 2024-04-10

IA2: Leveraging Instance-Aware Index Advisor with Reinforcement Learning for Diverse Workloads

arXiv.cs.DB Pub Date : 2024-04-08
Taiyi Wang, Eiko Yoneki

This study introduces the Instance-Aware Index Advisor (IA2), a novel deep reinforcement learning (DRL)-based approach for optimizing index selection in databases facing large action spaces of potential candidates. IA2 introduces the Twin Delayed Deep Deterministic Policy Gradient - Temporal Difference State-Wise Action Refinery (TD3-TD-SWAR) model, enabling efficient index selection by understanding

Updated: 2024-04-10

Faster Algorithms for Fair Max-Min Diversification in $\mathbb{R}^d$

arXiv.cs.DB Pub Date : 2024-04-06
Yash Kurkure, Miles Shamo, Joseph Wiseman, Sainyam Galhotra, Stavros Sintos

The task of extracting a diverse subset from a dataset, often referred to as maximum diversification, plays a pivotal role in various real-world applications that have far-reaching consequences. In this work, we delve into the realm of fairness-aware data subset selection, specifically focusing on the problem of selecting a diverse set of size $k$ from a large collection of $n$ data points (FairDiv)

Updated: 2024-04-09

Aleph Filter: To Infinity in Constant Time

arXiv.cs.DB Pub Date : 2024-04-06
Niv Dayan, Ioana Bercea, Rasmus Pagh

Filter data structures are widely used in various areas of computer science to answer approximate set-membership queries. In many applications, the data grows dynamically, requiring their filters to expand along with the data that they represent. However, existing methods for expanding filters cannot maintain stable performance, memory footprint, and false positive rate at the same time. We address

Updated: 2024-04-09

Qr-Hint: Actionable Hints Towards Correcting Wrong SQL Queries

arXiv.cs.DB Pub Date : 2024-04-05
Yihao Hu, Amir Gilad, Kristin Stephens-Martinez, Sudeepa Roy, Jun Yang

We describe a system called Qr-Hint that, given a (correct) target query Q* and a (wrong) working query Q, both expressed in SQL, provides actionable hints for the user to fix the working query so that it becomes semantically equivalent to the target. It is particularly useful in an educational setting, where novices can receive help from Qr-Hint without requiring extensive personal tutoring. Since

Updated: 2024-04-09

SLSM: An Efficient Strategy for Lazy Schema Migration on Shared-Nothing Databases

arXiv.cs.DB Pub Date : 2024-04-05
Zhilin Zeng, Hui Li, Xiyue Gao, Hui Zhang, Huiquan Zhang, Jiangtao Cui

By introducing intermediate states for metadata changes and ensuring that at most two versions of metadata exist in the cluster at the same time, shared-nothing databases are capable of making online, asynchronous schema changes. However, this method leads to delays in the deployment of new schemas since it requires waiting for massive data backfill. To shorten the service vacuum period before the

Updated: 2024-04-08

Semantic SQL -- Combining and optimizing semantic predicates in SQL

arXiv.cs.DB Pub Date : 2024-04-05
Akash Mittal, Anshul Bheemreddy, Huili Tao

In recent years, the surge in unstructured data analysis, facilitated by advancements in Machine Learning (ML), has prompted diverse approaches for handling images, text documents, and videos. Analysts, leveraging ML models, can extract meaningful information from unstructured data and store it in relational databases, allowing the execution of SQL queries for further analysis. Simultaneously, vector

Updated: 2024-04-08

Reservoir Sampling over Joins

arXiv.cs.DB Pub Date : 2024-04-04
Binyang Dai, Xiao Hu, Ke Yi

Sampling over joins is a fundamental task in large-scale data analytics. Instead of computing the full join results, which could be massive, a uniform sample of the join results would suffice for many purposes, such as answering analytical queries or training machine learning models. In this paper, we study the problem of how to maintain a random sample over joins while the tuples are streaming in
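As background for maintaining a uniform sample while tuples stream in, the classical single-stream reservoir sampling scheme (Algorithm R) can be sketched as follows. It samples from one stream only and is not the join-aware algorithm studied in the paper; the function name and parameters are assumptions for illustration.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of at most k items from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # The first k items fill the reservoir directly.
            sample.append(item)
        else:
            # The n-th item replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                sample[j] = item
    return sample

# Hypothetical usage: sample 10 tuples from a stream of 1000.
picked = reservoir_sample(range(1000), 10, rng=random.Random(1))
```

After n items have arrived, each of them is in the reservoir with probability k / n, and the memory footprint stays at k items regardless of the stream length.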

Updated: 2024-04-05

NL2KQL: From Natural Language to Kusto Query

arXiv.cs.DB Pub Date : 2024-04-03
Amir H. Abdi, Xinye Tang, Jeremias Eichelbaum, Mahan Das, Alex Klein, Nihal Irmak Pakis, William Blum, Daniel L Mace, Tanvi Raja, Namrata Padmanabhan, Ye Xing

Data is growing rapidly in volume and complexity. Proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is significant opportunity to enhance database query languages. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data such as logs, telemetries, and time-series for big data analytics

Updated: 2024-04-05

What Blocks My Blockchain's Throughput? Developing a Generalizable Approach for Identifying Bottlenecks in Permissioned Blockchains

arXiv.cs.DB Pub Date : 2024-04-02
Orestis Papageorgiou, Lasse Börtzler, Egor Ermolaev, Jyoti Kumari, Johannes Sedlmeir

Permissioned blockchains have been proposed for a variety of use cases that require decentralization yet address enterprise requirements that permissionless blockchains to date cannot satisfy -- particularly in terms of performance. However, popular permissioned blockchains still exhibit a relatively low maximum throughput in comparison to established centralized systems. Consequently, researchers

Updated: 2024-04-05

Practical Persistent Multi-Word Compare-and-Swap Algorithms for Many-Core CPUs

arXiv.cs.DB Pub Date : 2024-04-02
Kento Sugiura, Manabu Nishimura, Yoshiharu Ishikawa

In the last decade, academic and industrial researchers have focused on persistent memory because of the development of the first practical product, Intel Optane. One of the main challenges of persistent memory programming is to guarantee consistent durability over separate memory addresses, and Wang et al. proposed a persistent multi-word compare-and-swap (PMwCAS) algorithm to solve this problem.

Updated: 2024-04-03

Mining Sequential Patterns in Uncertain Databases Using Hierarchical Index Structure

arXiv.cs.DB Pub Date : 2024-03-31
Kashob Kumar Roy, Md Hasibul Haque Moon, Md Mahmudur Rahman, Chowdhury Farhan Ahmed, Carson K. Leung

In this uncertain world, data uncertainty is inherent in many applications and its importance is growing drastically due to the rapid development of modern technologies. Nowadays, researchers have paid more attention to mining patterns in uncertain databases. A few recent works attempt to mine frequent uncertain sequential patterns. Despite their success, they are unable to reduce the number of

Updated: 2024-04-03

GTS: GPU-based Tree Index for Fast Similarity Search

arXiv.cs.DB Pub Date : 2024-04-01
Yifan Zhu, Ruiyao Ma, Baihua Zheng, Xiangyu Ke, Lu Chen, Yunjun Gao

Similarity search, the task of identifying objects most similar to a given query object under a specific metric, has gathered significant attention due to its practical applications. However, the absence of coordinate information to accelerate similarity search and the high computational cost of measuring object similarity hinder the efficiency of existing CPU-based methods. Additionally, these methods

Updated: 2024-04-02

SoK: The Faults in our Graph Benchmarks

arXiv.cs.DB Pub Date : 2024-03-31
Puneet Mehrotra, Vaastav Anand, Daniel Margo, Milad Rezaei Hajidehi, Margo Seltzer

Graph-structured data is prevalent in domains such as social networks, financial transactions, brain networks, and protein interactions. As a result, the research community has produced new databases and analytics engines to process such data. Unfortunately, there is not yet widespread benchmark standardization in graph processing, and the heterogeneity of evaluations found in the literature can lead

Updated: 2024-04-02

Mining Weighted Sequential Patterns in Incremental Uncertain Databases

arXiv.cs.DB Pub Date : 2024-03-31
Kashob Kumar Roy, Md Hasibul Haque Moon, Md Mahmudur Rahman, Chowdhury Farhan Ahmed, Carson Kai-Sang Leung

Due to the rapid development of science and technology, the importance of imprecise, noisy, and uncertain data is increasing at an exponential rate. Thus, mining patterns in uncertain databases has drawn the attention of researchers. Moreover, frequent sequences of items from these databases need to be discovered for meaningful knowledge with great impact. In many real cases, weights of items and

Updated: 2024-04-02

Multi-Objective Genetic Algorithm for Materialized View Optimization in Data Warehouses

arXiv.cs.DB Pub Date : 2024-03-29
Mahdi Manavi

Materialized views can significantly improve database query performance but identifying the optimal set of views to materialize is challenging. Prior work on automating and optimizing materialized view selection has limitations in execution time and total cost. In this paper, we present a novel genetic algorithm based approach to materialized view selection that aims to minimize execution time and

Updated: 2024-04-02

Dataversifying Natural Sciences: Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life & Earth Sciences

arXiv.cs.DB Pub Date : 2024-03-29
Genoveva Vargas-Solar (LIRIS), Jérôme Darmont (ERIC), Alejandro Adorjan (LIRIS), Javier A. Espinosa-Oviedo (LIRIS), Carmem Hara (ERIC), Sabine Loudcher (ERIC), Regina Motz (DIMAP), Martin Musicante (DIMAP), José-Luis Zechinelli-Martini

This vision paper introduces a pioneering data lake architecture designed to meet Life & Earth sciences' burgeoning data management needs. As the data landscape evolves, the imperative to navigate and maximize scientific opportunities has never been greater. Our vision paper outlines a strategic approach to unify and integrate diverse datasets, aiming to cultivate a collaborative space conducive to

Updated: 2024-04-01

PURPLE: Making a Large Language Model a Better SQL Writer

arXiv.cs.DB Pub Date : 2024-03-29
Tonghui Ren, Yuankai Fan, Zhenying He, Ren Huang, Jiaqi Dai, Can Huang, Yinan Jing, Kai Zhang, Yifan Yang, X. Sean Wang

Large Language Model (LLM) techniques play an increasingly important role in Natural Language to SQL (NL2SQL) translation. LLMs trained by extensive corpora have strong natural language understanding and basic SQL generation abilities without additional tuning specific to NL2SQL tasks. Existing LLMs-based NL2SQL approaches try to improve the translation by enhancing the LLMs with an emphasis on user

Updated: 2024-04-01

Cleaning data with Swipe

arXiv.cs.DB Pub Date : 2024-03-28
Toon Boeckling, Antoon Bronselaer

The repair problem for functional dependencies is the problem where an input database needs to be modified such that all functional dependencies are satisfied and the difference with the original database is minimal. The output database is then called an optimal repair. If the allowed modifications are value updates, finding an optimal repair is NP-hard. A well-known approach to find approximations

Updated: 2024-03-29

Proving correctness for SQL implementations of OCL constraints

arXiv.cs.DB Pub Date : 2024-03-27
Hoang Nguyen Phuoc Bao, Manuel Clavel

In the context of the model-driven development of data-centric applications, OCL constraints play a major role in adding precision to the source models (e.g., data models and security models). Several code-generators have been proposed to bridge the gap between source models with OCL constraints and their corresponding database implementations. However, the database queries produced by these code-generators

Updated: 2024-03-28

Sm-Nd Isotope Data Compilation from Geoscientific Literature Using an Automated Tabular Extraction Method

arXiv.cs.DB Pub Date : 2024-03-27
Zhixin Guo, Tao Wang, Chaoyang Wang, Jianping Zhou, Guanjie Zheng, Xinbing Wang, Chenghu Zhou

The rare earth elements Sm and Nd significantly address fundamental questions about crustal growth, such as its spatiotemporal evolution and the interplay between orogenesis and crustal accretion. Their relative immobility during high-grade metamorphism makes the Sm-Nd isotopic system crucial for inferring crustal formation times. Historically, data have been disseminated sporadically in the scientific

Updated: 2024-03-28

Empirical Analysis of EIP-3675: Miner Dynamics, Transaction Fees, and Transaction Time

arXiv.cs.DB Pub Date : 2024-03-26
Umesh Bhatt, Sarvesh Pandey

The Ethereum Improvement Proposal 3675 (EIP-3675) marks a significant shift, transitioning from a Proof of Work (PoW) to a Proof of Stake (PoS) consensus mechanism. This transition resulted in a staggering 99.95% decrease in energy consumption. However, the transition prompts two critical questions: (1) How does EIP-3675 affect miners' dynamics? and (2) How do users determine priority fees, considering

Updated: 2024-03-27

Query Refinement for Diverse Top-$k$ Selection

arXiv.cs.DB Pub Date : 2024-03-26
Felix S. Campbell, Alon Silberstein, Yuval Moskovitch, Julia Stoyanovich

Database queries are often used to select and rank items as decision support for many applications. As automated decision-making tools become more prevalent, there is a growing recognition of the need to diversify their outcomes. In this paper, we define and study the problem of modifying the selection conditions of an ORDER BY query so that the result of the modified query closely fits some user-defined

Updated: 2024-03-27

When View- and Conflict-Robustness Coincide for Multiversion Concurrency Control

arXiv.cs.DB Pub Date : 2024-03-26
Brecht Vandevoort, Bas Ketsman, Frank Neven

A DBMS allows trading consistency for efficiency through the allocation of isolation levels that are strictly weaker than serializability. The robustness problem asks whether, for a given set of transactions and a given allocation of isolation levels, every possible interleaved execution of those transactions that is allowed under the provided allocation, is always safe. In the literature, safe is

Updated: 2024-03-27

Disambiguate Entity Matching through Relation Discovery with Large Language Models

arXiv.cs.DB Pub Date : 2024-03-26
Zezhou Huang

Entity matching is a critical challenge in data integration and cleaning, central to tasks like fuzzy joins and deduplication. Traditional approaches have focused on overcoming fuzzy term representations through methods such as edit distance, Jaccard similarity, and more recently, embeddings and deep neural networks, including advancements from large language models (LLMs) like GPT. However, the core
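The two classical string measures named above, edit distance and Jaccard similarity, can be sketched in a few lines. This illustrates the traditional baselines only, not the relation-discovery approach of the paper; the whitespace tokenization and the function names are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two strings over lower-cased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)

def edit_distance(s, t):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

# Hypothetical usage on two noisy entity names.
token_sim = jaccard("New York City", "york new")
char_dist = edit_distance("kitten", "sitting")
```

Both measures compare surface forms only, which is why they handle misspellings and token reordering but say nothing about how two records are related.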

Updated: 2024-03-27

Corra: Correlation-Aware Column Compression

arXiv.cs.DB Pub Date : 2024-03-25
Hanwen Liu, Mihail Stoian, Alexander van Renen, Andreas Kipf

Column encoding schemes have witnessed a spark of interest lately. This is not surprising -- as data volume increases, being able to keep one's dataset in main memory for fast processing is a coveted desideratum. However, it also seems that single-column encoding schemes have reached a plateau in terms of the compression size they can achieve. We argue that this is because they do not exploit correlations

Updated: 2024-03-27

GGDMiner - Discovery of Graph Generating Dependencies for Graph Data Profiling

arXiv.cs.DB Pub Date : 2024-03-25
Larissa C. Shimomura, Nikolay Yakovets, George Fletcher

With the increasing use of graph-structured data, there is also increasing interest in investigating graph data dependencies and their applications, e.g., in graph data profiling. Graph Generating Dependencies (GGDs) are a class of dependencies for property graphs that can express the relation between different graph patterns and constraints based on their attribute similarities. Rich syntax and semantics

Updated: 2024-03-27

On Reporting Durable Patterns in Temporal Proximity Graphs

arXiv.cs.DB Pub Date : 2024-03-24
Pankaj K. Agarwal, Xiao Hu, Stavros Sintos, Jun Yang

Finding patterns in graphs is a fundamental problem in databases and data mining. In many applications, graphs are temporal and evolve over time, so we are interested in finding durable patterns, such as triangles and paths, which persist over a long time. While there has been work on finding durable simple patterns, existing algorithms do not have provable guarantees and run in strictly super-linear

Updated: 2024-03-26

ByteCard: Enhancing Data Warehousing with Learned Cardinality Estimation

arXiv.cs.DB Pub Date : 2024-03-24
Yuxing Han, Haoyu Wang, Lixiang Chen, Yifeng Dong, Xing Chen, Benquan Yu, Chengcheng Yang, Weining Qian

Cardinality estimation is a critical component and a longstanding challenge in modern data warehouses. ByteHouse, ByteDance's cloud-native engine for big data analysis in exabyte-scale environments, serves numerous internal decision-making business scenarios. With the increasing demand of ByteHouse, cardinality estimation becomes the bottleneck for efficiently processing queries. Specifically, the

Updated: 2024-03-26

Efficiently Estimating Mutual Information Between Attributes Across Tables

arXiv.cs.DB Pub Date : 2024-03-22
Aécio Santos, Flip Korn, Juliana Freire

Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, it is challenging to efficiently discover relevant external tables to join with a given input table. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on overlap

Updated: 2024-03-26

On Enforcing Existence and Non-Existence Constraints in MatBase

arXiv.cs.DB Pub Date : 2024-03-20
Christian Mancas

Existence constraints were defined in the Relational Data Model, but, unfortunately, are not provided by any Relational Database Management System, except for their NOT NULL particular case. Our (Elementary) Mathematical Data Model extended them to function products and introduced their dual non-existence constraints. MatBase, an intelligent data and knowledge base management system prototype based

Updated: 2024-03-25

Quantifying Semantic Query Similarity for Automated Linear SQL Grading: A Graph-based Approach

arXiv.cs.DB Pub Date : 2024-03-21
Leo Köberlein, Dominik Probst, Richard Lenz

Quantifying the semantic similarity between database queries is a critical challenge with broad applications, ranging from query log analysis to automated educational assessment of SQL skills. Traditional methods often rely solely on syntactic comparisons or are limited to checking for semantic equivalence. This paper introduces a novel graph-based approach to measure the semantic dissimilarity between

Updated: 2024-03-22

Gen-T: Table Reclamation in Data Lakes

arXiv.cs.DB Pub Date : 2024-03-21
Grace Fan, Roee Shraga, Renée J. Miller

We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete

Updated: 2024-03-22

Space-Efficient Indexes for Uncertain Strings

arXiv.cs.DB Pub Date : 2024-03-21
Esteban Gabory, Chang Liu, Grigorios Loukides, Solon P. Pissis, Wiktor Zuba

Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ on an alphabet $\Sigma$ is a sequence of $n$ probability distributions over $\Sigma$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities

Updated: 2024-03-22

Distance Comparison Operators for Approximate Nearest Neighbor Search: Exploration and Benchmark

arXiv.cs.DB Pub Date : 2024-03-20
Zeyu Wang, Haoran Xiong, Zhenying He, Peng Wang, Wei Wang

Approximate nearest neighbor search (ANNS) on high-dimensional vectors has become a fundamental and essential component in various machine learning tasks. Prior research has shown that the distance comparison operation is the bottleneck of ANNS, which determines the query and indexing performance. To overcome this challenge, some novel methods have been proposed recently. The basic idea is to estimate

Updated: 2024-03-21

Efficient k-step Weighted Reachability Query Processing Algorithms

arXiv.cs.DB Pub Date : 2024-03-19
Lian Chen, Junfeng Zhou, Ming Du, Sheng Yu, Xian Tang, Ziyang Chen

Given a data graph G, a source vertex u and a target vertex v of a reachability query, the reachability query is used to answer whether there exists a path from u to v in G. Reachability query processing is one of the fundamental operations in graph data management, which is widely used in biological networks, communication networks, and social networks to assist data analysis. The data graphs in practical
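The basic reachability query defined above (is there a path from u to v in G?) can be answered with a breadth-first search. The sketch below covers only this plain, unweighted version, not the k-step weighted variant named in the title; the adjacency-dict representation and the function name are assumptions.

```python
from collections import deque

def reachable(graph, u, v):
    """Return True if a directed path from u to v exists in graph.

    graph maps each vertex to an iterable of its out-neighbours.
    """
    if u == v:
        return True
    seen = {u}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in graph.get(x, ()):
            if y == v:
                return True
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

# Hypothetical data graph with edges d -> a -> b -> c.
g = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
```

Each query costs time linear in the size of the graph in the worst case, which is the cost that index-based reachability methods aim to avoid.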

Updated: 2024-03-21

Secure Query Processing with Linear Complexity

arXiv.cs.DB Pub Date : 2024-03-20
Qiyao Luo, Yilei Wang, Wei Dong, Ke Yi

We present LINQ, the first join protocol with linear complexity (in both running time and communication) under the secure multi-party computation model (MPC). It can also be extended to support all free-connex queries, a large class of select-join-aggregate queries, still with linear complexity. This matches the plaintext result for the query processing problem, as free-connex queries are the largest

Updated: 2024-03-21

Quantixar: High-performance Vector Data Management System

arXiv.cs.DB Pub Date : 2024-03-19
Gulshan Yadav, RahulKumar Yadav, Mansi Viramgama, Mayank Viramgama, Apeksha Mohite

Traditional database management systems struggle to efficiently represent and query the complex, high-dimensional data prevalent in modern applications. Vector databases offer a solution by storing data as numerical vectors within a multi-dimensional space. This enables similarity-based search and analysis, such as image retrieval, recommendation engine generation, and natural language processing

Updated: 2024-03-20

Evaluating Datalog over Semirings: A Grounding-based Approach

arXiv.cs.DB Pub Date : 2024-03-19
Hangdong Zhao, Shaleen Deep, Paraschos Koutris, Sudeepa Roy, Val Tannen

Datalog is a powerful yet elegant language that allows expressing recursive computation. Although Datalog evaluation has been extensively studied in the literature, so far, only loose upper bounds are known on how fast a Datalog program can be evaluated. In this work, we ask the following question: given a Datalog program over a naturally-ordered semiring $\sigma$, what is the tightest possible runtime

Updated: 2024-03-20

Algorithmic Complexity Attacks on Dynamic Learned Indexes

arXiv.cs.DB Pub Date : 2024-03-19
Rui Yang, Evgenios M. Kornaropoulos, Yue Cheng

Learned Index Structures (LIS) view a sorted index as a model that learns the data distribution, takes a data element key as input, and outputs the predicted position of the key. The original LIS can only handle lookup operations with no support for updates, rendering it impractical to use for typical workloads. To address this limitation, recent studies have focused on designing efficient dynamic
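The idea of a model that maps a key to a predicted position can be sketched with a single linear model plus a bounded local search. This is a static, read-only toy that assumes a non-empty sorted list of numeric keys; it is not any of the dynamic learned indexes the paper studies, and the class and method names are hypothetical.

```python
import bisect

class LinearLearnedIndex:
    """Static learned index: one least-squares line from key to position."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)  # assumed non-empty and sorted
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # The worst prediction error over the stored keys bounds the search window.
        self.max_err = max(abs(i - self._predict(k))
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        n = len(self.keys)
        guess = self._predict(key)
        lo = min(max(0, guess - self.max_err), n)
        hi = max(lo, min(n, guess + self.max_err + 1))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None

# Hypothetical usage on a small sorted key set.
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
index = LinearLearnedIndex(primes)
```

Lookup cost depends on `max_err`, that is, on how well the model fits the key distribution; a skewed distribution widens the search window and slows every lookup.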

Updated: 2024-03-20

Benchmarking Analytical Query Processing in Intel SGXv2

arXiv.cs.DB Pub Date : 2024-03-18
Adrian Lutsch (Technical University of Darmstadt), Muhammad El-Hindi (Technical University of Darmstadt), Matthias Heinrich (Technical University of Darmstadt), Daniel Ritter (SAP SE), Zsolt István (Technical University of Darmstadt), Carsten Binnig (Technical University of Darmstadt, DFKI)

The recently introduced second generation of Intel SGX (SGXv2) lifts memory size limitations of the first generation. Theoretically, this promises to enable secure and highly efficient analytical DBMSs in the cloud. To validate this promise, in this paper, we conduct the first in-depth evaluation study of running analytical query processing algorithms inside SGXv2. Our study reveals that state-of-the-art

Updated: 2024-03-19

Models for Storage in Database Backends

arXiv.cs.DB Pub Date : 2024-03-18
Edgard Schiebelbein (Amazon Web Services), Saalik Hatia (Amazon Web Services), Annette Bieniusa (Amazon Web Services), Gustavo Petri (Amazon Web Services), Carla Ferreira (NOVA), Marc Shapiro (DELYS)

This paper describes ongoing work on developing a formal specification of a database backend. We present the formalisation of the expected behaviour of a basic transactional system that calls into a simple store API, and instantiate in two semantic models. The first one is a map-based, classical versioned key-value store; the second one, journal-based, appends individual transaction effects to a journal

Updated: 2024-03-19

Graph Theory for Consent Management: A New Approach for Complex Data Flows

arXiv.cs.DB Pub Date : 2024-03-17
Dorota Filipczuk, Enrico H. Gerding, George Konstantinidis

Through legislation and technical advances users gain more control over how their data is processed, and they expect online services to respect their privacy choices and preferences. However, data may be processed for many different purposes by several layers of algorithms that create complex data workflows. To date, there is no existing approach to automatically satisfy fine-grained privacy constraints

Updated: 2024-03-19

Exploring Distance Query Processing in Edge Computing Environments

arXiv.cs.DB Pub Date : 2024-03-17
Xiubo Zhang, Yujie He, Ye Li, Yan Li, Zijie Zhou, Dongyao Wei, Ryan

In the context of changing travel behaviors and the expanding user base of Geographic Information System (GIS) services, conventional centralized architectures responsible for handling shortest distance queries are facing increasing challenges, such as heightened load pressure and longer response times. To mitigate these concerns, this study is the first to develop an edge computing framework specially

Updated: 2024-03-19

Wait to be Faster: a Smart Pooling Framework for Dynamic Ridesharing

arXiv.cs.DB Pub Date : 2024-03-17
Xiaoyao Zhong, Jiabao Jin, Peng Cheng, Wangze Ni, Libin Zheng, Lei Chen, Xuemin Lin

Ridesharing services, such as Uber or Didi, have attracted considerable attention in recent years due to their positive impact on environmental protection and the economy. Existing studies require quick responses to orders, which lack the flexibility to accommodate longer wait times for better grouping opportunities. In this paper, we address an NP-hard ridesharing problem, called Minimal Extra Time

Updated: 2024-03-19

Bangladesh Agricultural Knowledge Graph: Enabling Semantic Integration and Data-driven Analysis--Full Version

arXiv.cs.DB Pub Date : 2024-03-18
Rudra Pratap Deb Nath, Tithi Rani Das, Tonmoy Chandro Das, S. M. Shafkat Raihan

In Bangladesh, agriculture is a crucial driver for addressing Sustainable Development Goal 1 (No Poverty) and 2 (Zero Hunger), playing a fundamental role in the economy and people's livelihoods. To enhance the sustainability and resilience of the agriculture industry through data-driven insights, the Bangladesh Bureau of Statistics and other organizations consistently collect and publish agricultural

Updated: 2024-03-19

Vector search with small radiuses

arXiv.cs.DB Pub Date : 2024-03-16
Gergely Szilvasy, Pierre-Emmanuel Mazaré, Matthijs Douze

In recent years, the dominant accuracy metric for vector search has been the recall of a result list of fixed size (top-k retrieval), considering as ground truth the exact vector retrieval results. Although convenient to compute, this metric is only distantly related to the end-to-end accuracy of a full system that integrates vector search. In this paper we focus on the common case where a hard decision needs

Updated: 2024-03-19

Accelerating Regular Path Queries over Graph Database with Processing-in-Memory

arXiv.cs.DB Pub Date : 2024-03-15
Ruoyan Ma, Shengan Zheng, Guifeng Wang, Jin Pu, Yifan Hua, Wentao Wang, Linpeng Huang

Regular path queries (RPQs) in graph databases are bottlenecked by the memory wall. Emerging processing-in-memory (PIM) technologies offer a promising solution to dispatch and execute path matching tasks in parallel within PIM modules. We present Moctopus, a PIM-based data management system for graph databases that supports efficient batch RPQs and graph updates. Moctopus employs a PIM-friendly dynamic

Updated: 2024-03-18

Interactive Trimming against Evasive Online Data Manipulation Attacks: A Game-Theoretic Approach

arXiv.cs.DB Pub Date : 2024-03-15
Yue Fu, Qingqing Ye, Rong Du, Haibo Hu

With the exponential growth of data and its crucial impact on our lives and decision-making, the integrity of data has become a significant concern. Malicious data poisoning attacks, where false values are injected into the data, can disrupt machine learning processes and lead to severe consequences. To mitigate these attacks, distance-based defenses, such as trimming, have been proposed, but they

Updated: 2024-03-18

KIF: A Framework for Virtual Integration of Heterogeneous Knowledge Bases using Wikidata

arXiv.cs.DB Pub Date : 2024-03-15
Guilherme Lima, Marcelo Machado, Elton Soares, Sandro R. Fiorini, Raphael Thiago, Leonardo G. Azevedo, Viviane T. da Silva, Renato Cerqueira

We present a knowledge integration framework (called KIF) that uses Wikidata as a lingua franca to integrate heterogeneous knowledge bases. These can be triplestores, relational databases, CSV files, etc., which may or may not use the Wikidata dialect of RDF. KIF leverages Wikidata's data model and vocabulary plus user-defined mappings to expose a unified view of the integrated bases while keeping

Updated: 2024-03-18

Rule based Complex Event Processing for an Air Quality Monitoring System in Smart City

arXiv.cs.DB Pub Date : 2024-03-16
Shashi Shekhar Kumar, Ritesh Chandra, Sonali Agarwal

In recent years, smart city-based development has gained momentum due to its versatile nature in architecture and planning for the systematic habitation of human beings. According to a World Health Organization (WHO) report, air pollution causes serious respiratory diseases. Hence, real-time monitoring of air quality becomes necessary to minimize its effect through time-bound decisions by the stakeholders

Updated: 2024-03-16

Query Rewriting via Large Language Models

arXiv.cs.DB Pub Date : 2024-03-14
Jie Liu, Barzan Mozafari

Query rewriting is one of the most effective techniques for coping with poorly written queries before passing them down to the query optimizer. Manual rewriting is not scalable, as it is error-prone and requires deep expertise. Similarly, traditional query rewriting algorithms can only handle a small subset of queries: rule-based techniques do not generalize to new query patterns and synthesis-based

Updated: 2024-03-15
