-
Real-Time Trajectory Synthesis with Local Differential Privacy arXiv.cs.DB Pub Date : 2024-04-17 Yujia Hu, Yuntao Du, Zhikun Zhang, Ziquan Fang, Lu Chen, Kai Zheng, Yunjun Gao
Trajectory streams are being generated from location-aware devices, such as smartphones and in-vehicle navigation systems. Due to the sensitive nature of the location data, directly sharing user trajectories suffers from privacy leakage issues. Local differential privacy (LDP), which perturbs sensitive data on the user side before it is shared or analyzed, emerges as a promising solution for private
-
LLMTune: Accelerate Database Knob Tuning with Large Language Models arXiv.cs.DB Pub Date : 2024-04-17 Xinmei Huang, Haoyang Li, Jing Zhang, Xinxin Zhao, Zhiming Yao, Yiyan Li, Zhuohao Yu, Tieying Zhang, Hong Chen, Cuiping Li
Database knob tuning is a critical challenge in the database community, aiming to optimize knob values to enhance database performance for specific workloads. DBMS often feature hundreds of tunable knobs, posing a significant challenge for DBAs to recommend optimal configurations. Consequently, many machine learning-based tuning methods have been developed to automate this process. Despite the introduction
-
climber++: Pivot-Based Approximate Similarity Search over Big Data Series arXiv.cs.DB Pub Date : 2024-04-15 Liang Zhang, Mohamed Y. Eltabakh, Elke A. Rundensteiner, Khalid Alnuaim
The generation and collection of big data series are becoming an integral part of many emerging applications in sciences, IoT, finance, and web applications among several others. The terabyte-scale of data series has motivated recent efforts to design fully distributed techniques for supporting operations such as approximate kNN similarity search, which is a building block operation in most analytics
-
Optimizing Disjunctive Queries with Tagged Execution arXiv.cs.DB Pub Date : 2024-04-14 Albert Kim, Samuel Madden
Despite decades of research into query optimization, optimizing queries with disjunctive predicate expressions remains a challenge. Solutions employed by existing systems (if any) are often simplistic and lead to much redundant work being performed by the execution engine. To address these problems, we propose a novel form of query execution called tagged execution. Tagged execution groups tuples into
-
Can LLMs substitute SQL? Comparing Resource Utilization of Querying LLMs versus Traditional Relational Databases arXiv.cs.DB Pub Date : 2024-04-12 Xiang Zhang, Khatoon Khedri, Reza Rawassizadeh
Large Language Models (LLMs) can automate or substitute different types of tasks in the software engineering process. This study evaluates the resource utilization and accuracy of LLM in interpreting and executing natural language queries against traditional SQL within relational database management systems. We empirically examine the resource utilization and accuracy of nine LLMs varying from 7 to
-
Enc2DB: A Hybrid and Adaptive Encrypted Query Processing Framework arXiv.cs.DB Pub Date : 2024-04-10 Hui Li, Jingwen Shi, Qi Tian, Zheng Li, Yan Fu, Bingqing Shen, Yaofeng Tu
As cloud computing gains traction, data owners are outsourcing their data to cloud service providers (CSPs) for Database Service (DBaaS), bringing in a deviation of data ownership and usage, and intensifying privacy concerns, especially with potential breaches by hackers or CSP insiders. To address that, encrypted database services propose encrypting every tuple and query statement before submitting
-
Automatic Configuration Tuning on Cloud Database: A Survey arXiv.cs.DB Pub Date : 2024-04-09 Limeng Zhang, M. Ali Babar
Faced with the challenges of big data, modern cloud database management systems are designed to efficiently store, organize, and retrieve data, supporting optimal performance, scalability, and reliability for complex data processing and analysis. However, achieving good performance in modern databases is non-trivial as they are notorious for having dozens of configurable knobs, such as hardware setup
-
PM4Py.LLM: a Comprehensive Module for Implementing PM on LLMs arXiv.cs.DB Pub Date : 2024-04-09 Alessandro Berti
pm4py is a process mining library for Python implementing several process mining (PM) artifacts and algorithms. It also offers methods to integrate PM with large language models (LLMs). This paper examines how the current paradigms of PM on LLM are implemented in pm4py, identifying challenges such as privacy, hallucinations, and the context window limit.
-
Balanced Partitioning for Optimizing Big Graph Computation: Complexities and Approximation Algorithms arXiv.cs.DB Pub Date : 2024-04-09 Baoling Ning, Jianzhong Li
Graph partitioning is a key fundamental problem in the area of big graph computation. Previous works do not consider the practical requirements when optimizing the big data analysis in real applications. In this paper, motivated by optimizing the big data computing applications, two typical problems of graph partitioning are studied. The first problem is to optimize the performance of specific workloads
-
IA2: Leveraging Instance-Aware Index Advisor with Reinforcement Learning for Diverse Workloads arXiv.cs.DB Pub Date : 2024-04-08 Taiyi Wang, Eiko Yoneki
This study introduces the Instance-Aware Index Advisor (IA2), a novel deep reinforcement learning (DRL)-based approach for optimizing index selection in databases facing large action spaces of potential candidates. IA2 introduces the Twin Delayed Deep Deterministic Policy Gradient - Temporal Difference State-Wise Action Refinery (TD3-TD-SWAR) model, enabling efficient index selection by understanding
-
Faster Algorithms for Fair Max-Min Diversification in $\mathbb{R}^d$ arXiv.cs.DB Pub Date : 2024-04-06 Yash Kurkure, Miles Shamo, Joseph Wiseman, Sainyam Galhotra, Stavros Sintos
The task of extracting a diverse subset from a dataset, often referred to as maximum diversification, plays a pivotal role in various real-world applications that have far-reaching consequences. In this work, we delve into the realm of fairness-aware data subset selection, specifically focusing on the problem of selecting a diverse set of size $k$ from a large collection of $n$ data points (FairDiv)
-
Aleph Filter: To Infinity in Constant Time arXiv.cs.DB Pub Date : 2024-04-06 Niv Dayan, Ioana Bercea, Rasmus Pagh
Filter data structures are widely used in various areas of computer science to answer approximate set-membership queries. In many applications, the data grows dynamically, requiring their filters to expand along with the data that they represent. However, existing methods for expanding filters cannot maintain stable performance, memory footprint, and false positive rate at the same time. We address
-
Qr-Hint: Actionable Hints Towards Correcting Wrong SQL Queries arXiv.cs.DB Pub Date : 2024-04-05 Yihao Hu, Amir Gilad, Kristin Stephens-Martinez, Sudeepa Roy, Jun Yang
We describe a system called Qr-Hint that, given a (correct) target query Q* and a (wrong) working query Q, both expressed in SQL, provides actionable hints for the user to fix the working query so that it becomes semantically equivalent to the target. It is particularly useful in an educational setting, where novices can receive help from Qr-Hint without requiring extensive personal tutoring. Since
-
SLSM: An Efficient Strategy for Lazy Schema Migration on Shared-Nothing Databases arXiv.cs.DB Pub Date : 2024-04-05 Zhilin Zeng, Hui Li, Xiyue Gao, Hui Zhang, Huiquan Zhang, Jiangtao Cui
By introducing intermediate states for metadata changes and ensuring that at most two versions of metadata exist in the cluster at the same time, shared-nothing databases are capable of making online, asynchronous schema changes. However, this method leads to delays in the deployment of new schemas since it requires waiting for massive data backfill. To shorten the service vacuum period before the
-
Semantic SQL -- Combining and optimizing semantic predicates in SQL arXiv.cs.DB Pub Date : 2024-04-05 Akash Mittal, Anshul Bheemreddy, Huili Tao
In recent years, the surge in unstructured data analysis, facilitated by advancements in Machine Learning (ML), has prompted diverse approaches for handling images, text documents, and videos. Analysts, leveraging ML models, can extract meaningful information from unstructured data and store it in relational databases, allowing the execution of SQL queries for further analysis. Simultaneously, vector
-
Reservoir Sampling over Joins arXiv.cs.DB Pub Date : 2024-04-04 Binyang Dai, Xiao Hu, Ke Yi
Sampling over joins is a fundamental task in large-scale data analytics. Instead of computing the full join results, which could be massive, a uniform sample of the join results would suffice for many purposes, such as answering analytical queries or training machine learning models. In this paper, we study the problem of how to maintain a random sample over joins while the tuples are streaming in
-
NL2KQL: From Natural Language to Kusto Query arXiv.cs.DB Pub Date : 2024-04-03 Amir H. Abdi, Xinye Tang, Jeremias Eichelbaum, Mahan Das, Alex Klein, Nihal Irmak Pakis, William Blum, Daniel L Mace, Tanvi Raja, Namrata Padmanabhan, Ye Xing
Data is growing rapidly in volume and complexity. Proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is significant opportunity to enhance database query languages. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data such as logs, telemetries, and time-series for big data analytics
-
What Blocks My Blockchain's Throughput? Developing a Generalizable Approach for Identifying Bottlenecks in Permissioned Blockchains arXiv.cs.DB Pub Date : 2024-04-02 Orestis Papageorgiou, Lasse Börtzler, Egor Ermolaev, Jyoti Kumari, Johannes Sedlmeir
Permissioned blockchains have been proposed for a variety of use cases that require decentralization yet address enterprise requirements that permissionless blockchains to date cannot satisfy -- particularly in terms of performance. However, popular permissioned blockchains still exhibit a relatively low maximum throughput in comparison to established centralized systems. Consequently, researchers
-
Practical Persistent Multi-Word Compare-and-Swap Algorithms for Many-Core CPUs arXiv.cs.DB Pub Date : 2024-04-02 Kento Sugiura, Manabu Nishimura, Yoshiharu Ishikawa
In the last decade, academic and industrial researchers have focused on persistent memory because of the development of the first practical product, Intel Optane. One of the main challenges of persistent memory programming is to guarantee consistent durability over separate memory addresses, and Wang et al. proposed a persistent multi-word compare-and-swap (PMwCAS) algorithm to solve this problem.
-
Mining Sequential Patterns in Uncertain Databases Using Hierarchical Index Structure arXiv.cs.DB Pub Date : 2024-03-31 Kashob Kumar Roy, Md Hasibul Haque Moon, Md Mahmudur Rahman, Chowdhury Farhan Ahmed, Carson K. Leung
In this uncertain world, data uncertainty is inherent in many applications and its importance is growing drastically due to the rapid development of modern technologies. Nowadays, researchers have paid more attention to mine patterns in uncertain databases. A few recent works attempt to mine frequent uncertain sequential patterns. Despite their success, they are incompetent to reduce the number of
-
GTS: GPU-based Tree Index for Fast Similarity Search arXiv.cs.DB Pub Date : 2024-04-01 Yifan Zhu, Ruiyao Ma, Baihua Zheng, Xiangyu Ke, Lu Chen, Yunjun Gao
Similarity search, the task of identifying objects most similar to a given query object under a specific metric, has gathered significant attention due to its practical applications. However, the absence of coordinate information to accelerate similarity search and the high computational cost of measuring object similarity hinder the efficiency of existing CPU-based methods. Additionally, these methods
-
SoK: The Faults in our Graph Benchmarks arXiv.cs.DB Pub Date : 2024-03-31 Puneet Mehrotra, Vaastav Anand, Daniel Margo, Milad Rezaei Hajidehi, Margo Seltzer
Graph-structured data is prevalent in domains such as social networks, financial transactions, brain networks, and protein interactions. As a result, the research community has produced new databases and analytics engines to process such data. Unfortunately, there is not yet widespread benchmark standardization in graph processing, and the heterogeneity of evaluations found in the literature can lead
-
Mining Weighted Sequential Patterns in Incremental Uncertain Databases arXiv.cs.DB Pub Date : 2024-03-31 Kashob Kumar Roy, Md Hasibul Haque Moon, Md Mahmudur Rahman, Chowdhury Farhan Ahmed, Carson Kai-Sang Leung
Due to the rapid development of science and technology, the importance of imprecise, noisy, and uncertain data is increasing at an exponential rate. Thus, mining patterns in uncertain databases have drawn the attention of researchers. Moreover, frequent sequences of items from these databases need to be discovered for meaningful knowledge with great impact. In many real cases, weights of items and
-
Multi-Objective Genetic Algorithm for Materialized View Optimization in Data Warehouses arXiv.cs.DB Pub Date : 2024-03-29 Mahdi Manavi
Materialized views can significantly improve database query performance but identifying the optimal set of views to materialize is challenging. Prior work on automating and optimizing materialized view selection has limitations in execution time and total cost. In this paper, we present a novel genetic algorithm based approach to materialized view selection that aims to minimize execution time and
-
Dataversifying Natural Sciences: Pioneering a Data Lake Architecture for Curated Data-Centric Experiments in Life & Earth Sciences arXiv.cs.DB Pub Date : 2024-03-29 Genoveva Vargas-Solar (LIRIS), Jérôme Darmont (ERIC), Alejandro Adorjan (LIRIS), Javier A. Espinosa-Oviedo (LIRIS), Carmem Hara (ERIC), Sabine Loudcher (ERIC), Regina Motz (DIMAP), Martin Musicante (DIMAP), José-Luis Zechinelli-Martini
This vision paper introduces a pioneering data lake architecture designed to meet Life & Earth sciences' burgeoning data management needs. As the data landscape evolves, the imperative to navigate and maximize scientific opportunities has never been greater. Our vision paper outlines a strategic approach to unify and integrate diverse datasets, aiming to cultivate a collaborative space conducive to
-
PURPLE: Making a Large Language Model a Better SQL Writer arXiv.cs.DB Pub Date : 2024-03-29 Tonghui Ren, Yuankai Fan, Zhenying He, Ren Huang, Jiaqi Dai, Can Huang, Yinan Jing, Kai Zhang, Yifan Yang, X. Sean Wang
Large Language Model (LLM) techniques play an increasingly important role in Natural Language to SQL (NL2SQL) translation. LLMs trained by extensive corpora have strong natural language understanding and basic SQL generation abilities without additional tuning specific to NL2SQL tasks. Existing LLMs-based NL2SQL approaches try to improve the translation by enhancing the LLMs with an emphasis on user
-
Cleaning data with Swipe arXiv.cs.DB Pub Date : 2024-03-28 Toon Boeckling, Antoon Bronselaer
The repair problem for functional dependencies is the problem where an input database needs to be modified such that all functional dependencies are satisfied and the difference with the original database is minimal. The output database is then called an optimal repair. If the allowed modifications are value updates, finding an optimal repair is NP-hard. A well-known approach to find approximations
-
Proving correctness for SQL implementations of OCL constraints arXiv.cs.DB Pub Date : 2024-03-27 Hoang Nguyen Phuoc Bao, Manuel Clavel
In the context of the model-driven development of data-centric applications, OCL constraints play a major role in adding precision to the source models (e.g., data models and security models). Several code-generators have been proposed to bridge the gap between source models with OCL constraints and their corresponding database implementations. However, the database queries produced by these code-generators
-
Sm-Nd Isotope Data Compilation from Geoscientific Literature Using an Automated Tabular Extraction Method arXiv.cs.DB Pub Date : 2024-03-27 Zhixin Guo, Tao Wang, Chaoyang Wang, Jianping Zhou, Guanjie Zheng, Xinbing Wang, Chenghu Zhou
The rare earth elements Sm and Nd significantly address fundamental questions about crustal growth, such as its spatiotemporal evolution and the interplay between orogenesis and crustal accretion. Their relative immobility during high-grade metamorphism makes the Sm-Nd isotopic system crucial for inferring crustal formation times. Historically, data have been disseminated sporadically in the scientific
-
Empirical Analysis of EIP-3675: Miner Dynamics, Transaction Fees, and Transaction Time arXiv.cs.DB Pub Date : 2024-03-26 Umesh Bhatt, Sarvesh Pandey
The Ethereum Improvement Proposal 3675 (EIP-3675) marks a significant shift, transitioning from a Proof of Work (PoW) to a Proof of Stake (PoS) consensus mechanism. This transition resulted in a staggering 99.95% decrease in energy consumption. However, the transition prompts two critical questions: (1). How does EIP-3675 affect miners' dynamics? and (2). How do users determine priority fees, considering
-
Query Refinement for Diverse Top-$k$ Selection arXiv.cs.DB Pub Date : 2024-03-26 Felix S. Campbell, Alon Silberstein, Yuval Moskovitch, Julia Stoyanovich
Database queries are often used to select and rank items as decision support for many applications. As automated decision-making tools become more prevalent, there is a growing recognition of the need to diversify their outcomes. In this paper, we define and study the problem of modifying the selection conditions of an ORDER BY query so that the result of the modified query closely fits some user-defined
-
When View- and Conflict-Robustness Coincide for Multiversion Concurrency Control arXiv.cs.DB Pub Date : 2024-03-26 Brecht Vandevoort, Bas Ketsman, Frank Neven
A DBMS allows trading consistency for efficiency through the allocation of isolation levels that are strictly weaker than serializability. The robustness problem asks whether, for a given set of transactions and a given allocation of isolation levels, every possible interleaved execution of those transactions that is allowed under the provided allocation, is always safe. In the literature, safe is
-
Disambiguate Entity Matching through Relation Discovery with Large Language Models arXiv.cs.DB Pub Date : 2024-03-26 Zezhou Huang
Entity matching is a critical challenge in data integration and cleaning, central to tasks like fuzzy joins and deduplication. Traditional approaches have focused on overcoming fuzzy term representations through methods such as edit distance, Jaccard similarity, and more recently, embeddings and deep neural networks, including advancements from large language models (LLMs) like GPT. However, the core
-
Corra: Correlation-Aware Column Compression arXiv.cs.DB Pub Date : 2024-03-25 Hanwen Liu, Mihail Stoian, Alexander van Renen, Andreas Kipf
Column encoding schemes have witnessed a spark of interest lately. This is not surprising -- as data volume increases, being able to keep one's dataset in main memory for fast processing is a coveted desideratum. However, it also seems that single-column encoding schemes have reached a plateau in terms of the compression size they can achieve. We argue that this is because they do not exploit correlations
-
GGDMiner - Discovery of Graph Generating Dependencies for Graph Data Profiling arXiv.cs.DB Pub Date : 2024-03-25 Larissa C. Shimomura, Nikolay Yakovets, George Fletcher
With the increasing use of graph-structured data, there is also increasing interest in investigating graph data dependencies and their applications, e.g., in graph data profiling. Graph Generating Dependencies (GGDs) are a class of dependencies for property graphs that can express the relation between different graph patterns and constraints based on their attribute similarities. Rich syntax and semantics
-
On Reporting Durable Patterns in Temporal Proximity Graphs arXiv.cs.DB Pub Date : 2024-03-24 Pankaj K. Agarwal, Xiao Hu, Stavros Sintos, Jun Yang
Finding patterns in graphs is a fundamental problem in databases and data mining. In many applications, graphs are temporal and evolve over time, so we are interested in finding durable patterns, such as triangles and paths, which persist over a long time. While there has been work on finding durable simple patterns, existing algorithms do not have provable guarantees and run in strictly super-linear
-
ByteCard: Enhancing Data Warehousing with Learned Cardinality Estimation arXiv.cs.DB Pub Date : 2024-03-24 Yuxing Han, Haoyu Wang, Lixiang Chen, Yifeng Dong, Xing Chen, Benquan Yu, Chengcheng Yang, Weining Qian
Cardinality estimation is a critical component and a longstanding challenge in modern data warehouses. ByteHouse, ByteDance's cloud-native engine for big data analysis in exabyte-scale environments, serves numerous internal decision-making business scenarios. With the increasing demand of ByteHouse, cardinality estimation becomes the bottleneck for efficiently processing queries. Specifically, the
-
Efficiently Estimating Mutual Information Between Attributes Across Tables arXiv.cs.DB Pub Date : 2024-03-22 Aécio Santos, Flip Korn, Juliana Freire
Relational data augmentation is a powerful technique for enhancing data analytics and improving machine learning models by incorporating columns from external datasets. However, it is challenging to efficiently discover relevant external tables to join with a given input table. Existing approaches rely on data discovery systems to identify joinable tables from external sources, typically based on overlap
-
On Enforcing Existence and Non-Existence Constraints in MatBase arXiv.cs.DB Pub Date : 2024-03-20 Christian Mancas
Existence constraints were defined in the Relational Data Model, but, unfortunately, are not provided by any Relational Database Management System, except for their NOT NULL particular case. Our (Elementary) Mathematical Data Model extended them to function products and introduced their dual non-existence constraints. MatBase, an intelligent data and knowledge base management system prototype based
-
Quantifying Semantic Query Similarity for Automated Linear SQL Grading: A Graph-based Approach arXiv.cs.DB Pub Date : 2024-03-21 Leo Köberlein, Dominik Probst, Richard Lenz
Quantifying the semantic similarity between database queries is a critical challenge with broad applications, ranging from query log analysis to automated educational assessment of SQL skills. Traditional methods often rely solely on syntactic comparisons or are limited to checking for semantic equivalence. This paper introduces a novel graph-based approach to measure the semantic dissimilarity between
-
Gen-T: Table Reclamation in Data Lakes arXiv.cs.DB Pub Date : 2024-03-21 Grace Fan, Roee Shraga, Renée J. Miller
We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete
-
Space-Efficient Indexes for Uncertain Strings arXiv.cs.DB Pub Date : 2024-03-21 Esteban Gabory, Chang Liu, Grigorios Loukides, Solon P. Pissis, Wiktor Zuba
Strings in the real world are often encoded with some level of uncertainty. In the character-level uncertainty model, an uncertain string $X$ of length $n$ on an alphabet $\Sigma$ is a sequence of $n$ probability distributions over $\Sigma$. Given an uncertain string $X$ and a weight threshold $\frac{1}{z}\in(0,1]$, we say that pattern $P$ occurs in $X$ at position $i$, if the product of probabilities
-
Distance Comparison Operators for Approximate Nearest Neighbor Search: Exploration and Benchmark arXiv.cs.DB Pub Date : 2024-03-20 Zeyu Wang, Haoran Xiong, Zhenying He, Peng Wang, Wei Wang
Approximate nearest neighbor search (ANNS) on high-dimensional vectors has become a fundamental and essential component in various machine learning tasks. Prior research has shown that the distance comparison operation is the bottleneck of ANNS, which determines the query and indexing performance. To overcome this challenge, some novel methods have been proposed recently. The basic idea is to estimate
-
Efficient k-step Weighted Reachability Query Processing Algorithms arXiv.cs.DB Pub Date : 2024-03-19 Lian Chen, Junfeng Zhou, Ming Du, Sheng Yu, Xian Tang, Ziyang Chen
Given a data graph G, a source vertex u and a target vertex v of a reachability query, the reachability query is used to answer whether there exists a path from u to v in G. Reachability query processing is one of the fundamental operations in graph data management, which is widely used in biological networks, communication networks, and social networks to assist data analysis. The data graphs in practical
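The plain reachability query defined in this abstract (is there a path from u to v in G?) can be illustrated with a breadth-first search. This is only a minimal sketch of the basic definition, not the k-step weighted algorithms the paper proposes; the adjacency-dict representation, the function name `reachable`, and the convention that a vertex reaches itself are assumptions made for illustration.

```python
from collections import deque


def reachable(graph, u, v):
    """Return True if a directed path from u to v exists in graph.

    graph is assumed to be a dict mapping each vertex to a list of
    its out-neighbours (a hypothetical representation).
    """
    if u == v:
        return True  # assumption: every vertex trivially reaches itself
    seen = {u}
    queue = deque([u])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == v:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Small example graph: d -> a -> b -> c
G = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
```

Each query costs time linear in the size of the graph in the worst case, which is why index-based approaches are commonly studied for large graphs.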
-
Secure Query Processing with Linear Complexity arXiv.cs.DB Pub Date : 2024-03-20 Qiyao Luo, Yilei Wang, Wei Dong, Ke Yi
We present LINQ, the first join protocol with linear complexity (in both running time and communication) under the secure multi-party computation model (MPC). It can also be extended to support all free-connex queries, a large class of select-join-aggregate queries, still with linear complexity. This matches the plaintext result for the query processing problem, as free-connex queries are the largest
-
Quantixar: High-performance Vector Data Management System arXiv.cs.DB Pub Date : 2024-03-19 Gulshan Yadav, RahulKumar Yadav, Mansi Viramgama, Mayank Viramgama, Apeksha Mohite
Traditional database management systems need help to efficiently represent and query the complex, high-dimensional data prevalent in modern applications. Vector databases offer a solution by storing data as numerical vectors within a multi-dimensional space. This enables similarity-based search and analysis, such as image retrieval, recommendation engine generation, and natural language processing
-
Evaluating Datalog over Semirings: A Grounding-based Approach arXiv.cs.DB Pub Date : 2024-03-19 Hangdong Zhao, Shaleen Deep, Paraschos Koutris, Sudeepa Roy, Val Tannen
Datalog is a powerful yet elegant language that allows expressing recursive computation. Although Datalog evaluation has been extensively studied in the literature, so far, only loose upper bounds are known on how fast a Datalog program can be evaluated. In this work, we ask the following question: given a Datalog program over a naturally-ordered semiring $\sigma$, what is the tightest possible runtime
-
Algorithmic Complexity Attacks on Dynamic Learned Indexes arXiv.cs.DB Pub Date : 2024-03-19 Rui Yang, Evgenios M. Kornaropoulos, Yue Cheng
Learned Index Structures (LIS) view a sorted index as a model that learns the data distribution, takes a data element key as input, and outputs the predicted position of the key. The original LIS can only handle lookup operations with no support for updates, rendering it impractical to use for typical workloads. To address this limitation, recent studies have focused on designing efficient dynamic
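The idea stated in this abstract, a model that takes a key and outputs its predicted position in a sorted index, can be sketched with a single linear model plus a bounded local search. This is a hedged illustration of the static, lookup-only setting only; the class name `LinearLearnedIndex`, the least-squares fit, and the error-bound correction step are assumptions for the example and are not taken from the paper or from any specific learned-index system.

```python
import bisect


class LinearLearnedIndex:
    """Toy static learned index: position ~ slope * key + intercept."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = sorted(keys)
        n = len(self.keys)
        # Least-squares fit of array position against key value.
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Largest prediction error over the stored keys bounds the search window.
        self.max_err = max(
            abs(i - self._predict(k)) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        n = len(self.keys)
        pos = self._predict(key)
        lo = min(max(0, pos - self.max_err), n)
        hi = max(lo, min(n, pos + self.max_err + 1))
        # Binary search restricted to the error window around the prediction.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


keys = [2, 3, 5, 8, 13, 21, 34]
index = LinearLearnedIndex(keys)
```

Because the model is fitted once over a fixed key set, inserting new keys would require refitting, which mirrors the update limitation the abstract attributes to the original design.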
-
Benchmarking Analytical Query Processing in Intel SGXv2 arXiv.cs.DB Pub Date : 2024-03-18 Adrian Lutsch (Technical University of Darmstadt), Muhammad El-Hindi (Technical University of Darmstadt), Matthias Heinrich (Technical University of Darmstadt), Daniel Ritter (SAP SE), Zsolt István (Technical University of Darmstadt), Carsten Binnig (Technical University of Darmstadt, DFKI)
The recently introduced second generation of Intel SGX (SGXv2) lifts memory size limitations of the first generation. Theoretically, this promises to enable secure and highly efficient analytical DBMSs in the cloud. To validate this promise, in this paper, we conduct the first in-depth evaluation study of running analytical query processing algorithms inside SGXv2. Our study reveals that state-of-the-art
-
Models for Storage in Database Backends arXiv.cs.DB Pub Date : 2024-03-18 Edgard Schiebelbein (Amazon Web Services), Saalik Hatia (Amazon Web Services), Annette Bieniusa (Amazon Web Services), Gustavo Petri (Amazon Web Services), Carla Ferreira (NOVA), Marc Shapiro (DELYS)
This paper describes ongoing work on developing a formal specification of a database backend. We present the formalisation of the expected behaviour of a basic transactional system that calls into a simple store API, and instantiate in two semantic models. The first one is a map-based, classical versioned key-value store; the second one, journal-based, appends individual transaction effects to a journal
-
Graph Theory for Consent Management: A New Approach for Complex Data Flows arXiv.cs.DB Pub Date : 2024-03-17 Dorota Filipczuk, Enrico H. Gerding, George Konstantinidis
Through legislation and technical advances users gain more control over how their data is processed, and they expect online services to respect their privacy choices and preferences. However, data may be processed for many different purposes by several layers of algorithms that create complex data workflows. To date, there is no existing approach to automatically satisfy fine-grained privacy constraints
-
Exploring Distance Query Processing in Edge Computing Environments arXiv.cs.DB Pub Date : 2024-03-17 Xiubo Zhang, Yujie He, Ye Li, Yan Li, Zijie Zhou, Dongyao Wei, Ryan
In the context of changing travel behaviors and the expanding user base of Geographic Information System (GIS) services, conventional centralized architectures responsible for handling shortest distance queries are facing increasing challenges, such as heightened load pressure and longer response times. To mitigate these concerns, this study is the first to develop an edge computing framework specially
-
Wait to be Faster: a Smart Pooling Framework for Dynamic Ridesharing arXiv.cs.DB Pub Date : 2024-03-17 Xiaoyao Zhong, Jiabao Jin, Peng Cheng, Wangze Ni, Libin Zheng, Lei Chen, Xuemin Lin
Ridesharing services, such as Uber or Didi, have attracted considerable attention in recent years due to their positive impact on environmental protection and the economy. Existing studies require quick responses to orders, which lack the flexibility to accommodate longer wait times for better grouping opportunities. In this paper, we address an NP-hard ridesharing problem, called Minimal Extra Time
-
Bangladesh Agricultural Knowledge Graph: Enabling Semantic Integration and Data-driven Analysis--Full Version arXiv.cs.DB Pub Date : 2024-03-18 Rudra Pratap Deb Nath, Tithi Rani Das, Tonmoy Chandro Das, S. M. Shafkat Raihan
In Bangladesh, agriculture is a crucial driver for addressing Sustainable Development Goal 1 (No Poverty) and 2 (Zero Hunger), playing a fundamental role in the economy and people's livelihoods. To enhance the sustainability and resilience of the agriculture industry through data-driven insights, the Bangladesh Bureau of Statistics and other organizations consistently collect and publish agricultural
-
Vector search with small radiuses arXiv.cs.DB Pub Date : 2024-03-16 Gergely Szilvasy, Pierre-Emmanuel Mazaré, Matthijs Douze
In recent years, the dominant accuracy metric for vector search is the recall of a result list of fixed size (top-k retrieval), considering as ground truth the exact vector retrieval results. Although convenient to compute, this metric is distantly related to the end-to-end accuracy of a full system that integrates vector search. In this paper we focus on the common case where a hard decision needs
-
Accelerating Regular Path Queries over Graph Database with Processing-in-Memory arXiv.cs.DB Pub Date : 2024-03-15 Ruoyan Ma, Shengan Zheng, Guifeng Wang, Jin Pu, Yifan Hua, Wentao Wang, Linpeng Huang
Regular path queries (RPQs) in graph databases are bottlenecked by the memory wall. Emerging processing-in-memory (PIM) technologies offer a promising solution to dispatch and execute path matching tasks in parallel within PIM modules. We present Moctopus, a PIM-based data management system for graph databases that supports efficient batch RPQs and graph updates. Moctopus employs a PIM-friendly dynamic
-
Interactive Trimming against Evasive Online Data Manipulation Attacks: A Game-Theoretic Approach arXiv.cs.DB Pub Date : 2024-03-15 Yue Fu, Qingqing Ye, Rong Du, Haibo Hu
With the exponential growth of data and its crucial impact on our lives and decision-making, the integrity of data has become a significant concern. Malicious data poisoning attacks, where false values are injected into the data, can disrupt machine learning processes and lead to severe consequences. To mitigate these attacks, distance-based defenses, such as trimming, have been proposed, but they
-
KIF: A Framework for Virtual Integration of Heterogeneous Knowledge Bases using Wikidata arXiv.cs.DB Pub Date : 2024-03-15 Guilherme Lima, Marcelo Machado, Elton Soares, Sandro R. Fiorini, Raphael Thiago, Leonardo G. Azevedo, Viviane T. da Silva, Renato Cerqueira
We present a knowledge integration framework (called KIF) that uses Wikidata as a lingua franca to integrate heterogeneous knowledge bases. These can be triplestores, relational databases, CSV files, etc., which may or may not use the Wikidata dialect of RDF. KIF leverages Wikidata's data model and vocabulary plus user-defined mappings to expose a unified view of the integrated bases while keeping
-
Rule based Complex Event Processing for an Air Quality Monitoring System in Smart City arXiv.cs.DB Pub Date : 2024-03-16 Shashi Shekhar Kumar, Ritesh Chandra, Sonali Agarwal
In recent years, smart city-based development has gained momentum due to its versatile nature in architecture and planning for the systematic habitation of human beings. According to a World Health Organization (WHO) report, air pollution causes serious respiratory diseases. Hence, real-time monitoring of air quality becomes necessary to minimize its effect through time-bound decisions taken by the stakeholders
-
Query Rewriting via Large Language Models arXiv.cs.DB Pub Date : 2024-03-14 Jie Liu, Barzan Mozafari
Query rewriting is one of the most effective techniques for coping with poorly written queries before passing them down to the query optimizer. Manual rewriting is not scalable, as it is error-prone and requires deep expertise. Similarly, traditional query rewriting algorithms can only handle a small subset of queries: rule-based techniques do not generalize to new query patterns and synthesis-based