-
Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-04-10 Adriane Chapman, Luca Lauro, Paolo Missier, Riccardo Torlone
Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified, and its effect on the data explained. In this framework, we aim at providing data scientists with facilities to gain an in-depth understanding of how each step in the
-
Sharing Queries with Nonequivalent User-defined Aggregate Functions ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-04-10 Chao Zhang, Toumani Farouk
This article presents Sharing User-Defined Aggregate Function (SUDAF), a declarative framework that allows users to write User-defined Aggregate Functions (UDAFs) as mathematical expressions and use them in Structured Query Language statements. SUDAF rewrites partial aggregates of UDAFs using built-in aggregate functions and supports efficient dynamic caching and reusing of partial aggregates. Our
-
Database Repairing with Soft Functional Dependencies ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-04-10 Nofar Carmeli, Martin Grohe, Benny Kimelfeld, Ester Livshits, Muhammad Tibi
A common interpretation of soft constraints penalizes the database for every violation of every constraint, where the penalty is the cost (weight) of the constraint. A computational challenge is that of finding an optimal subset: a collection of database tuples that minimizes the total penalty when each tuple has a cost of being excluded. When the constraints are strict (i.e., have an infinite cost)
-
The Ring: Worst-case Optimal Joins in Graph Databases using (Almost) No Extra Space ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-03-23 Diego Arroyuelo, Adrián Gómez-Brandón, Aidan Hogan, Gonzalo Navarro, Juan Reutter, Javiel Rojas-Ledesma, Adrián Soto
We present an indexing scheme for triple-based graphs that supports join queries in worst-case optimal (wco) time within compact space. This scheme, called a ring, regards each triple as a cyclic string of length 3. Each rotation of the triples is lexicographically sorted and the values of the last attribute are stored as a column, so we obtain the order of the next column by stably re-sorting the
-
Fast Parallel Hypertree Decompositions in Logarithmic Recursion Depth ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-02-28 Georg Gottlob, Matthias Lanzinger, Cem Okulmus, Reinhard Pichler
Various classic reasoning problems with natural hypergraph representations are known to be tractable if a hypertree decomposition (HD) of low width exists. The resulting algorithms are attractive for practical use in fields like databases and constraint satisfaction. However, algorithmic use of HDs relies on the difficult task of first computing a decomposition of the hypergraph underlying a given
-
Linking Entities across Relations and Graphs ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-02-28 Wenfei Fan, Ping Lu, Kehan Pang, Ruochun Jin, Wenyuan Yu
This article proposes a notion of parametric simulation to link entities across a relational database 𝒟 and a graph G. Taking functions and thresholds for measuring vertex closeness, path associations, and important properties as parameters, parametric simulation identifies tuples t in 𝒟 and vertices v in G that refer to the same real-world entity, based on both topological and semantic matching
-
Ad Hoc Transactions through the Looking Glass: An Empirical Study of Application-Level Transactions in Web Applications ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-02-28 Zhaoguo Wang, Chuzhe Tang, Xiaodong Zhang, Qianmian Yu, Binyu Zang, Haibing Guan, Haibo Chen
Many transactions in web applications are constructed ad hoc in the application code. For example, developers might explicitly use locking primitives or validation procedures to coordinate critical code fragments. We refer to database operations coordinated by application code as ad hoc transactions. Until now, little is known about them. This paper presents the first comprehensive study on ad hoc
-
Identifying the Root Causes of DBMS Suboptimality ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-02-28 Sabah Currim, Richard T. Snodgrass, Young-Kyoon Suh
The query optimization phase within a database management system (DBMS) ostensibly finds the fastest query execution plan from a potentially large set of enumerated plans, all of which correctly compute the same result of the specified query. Sometimes the cost-based optimizer selects a slower plan, for a variety of reasons. Previous work has focused on increasing the performance of specific components
-
A family of centrality measures for graph data based on subgraphs ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-02-23 Sebastián Bugedo, Cristian Riveros, Jorge Salas
We present the theoretical foundations and first experimental study of a new approach in centrality measures for graph data. The main principle is straightforward: the more relevant subgraphs around a vertex, the more central it is in the network. We formalize the notion of “relevant subgraphs” by choosing a family of subgraphs that, given a graph G and a vertex v, assigns a subset of connected subgraphs
-
GraphZeppelin: How to Find Connected Components (Even When Graphs Are Dense, Dynamic, and Massive) ACM Trans. Database Syst. (IF 1.8) Pub Date : 2024-02-20 David Tench, Evan West, Victor Zhang, Michael A. Bender, Abiyaz Chowdhury, Daniel Delayo, J. Ahmed Dellas, Martín Farach-Colton, Tyler Seip, Kenny Zhang
Finding the connected components of a graph is a fundamental problem with uses throughout computer science and engineering. The task of computing connected components becomes more difficult when graphs are very large, or when they are dynamic, meaning the edge set changes over time subject to a stream of edge insertions and deletions. A natural approach to computing the connected components problem
-
Partial Order Multiway Search ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-11-13 Lu Shangqi, Wim Martens, Matthias Niewerth, Yufei Tao
Partial order multiway search (POMS) is a fundamental problem that finds applications in crowdsourcing, distributed file systems, software testing, and more. This problem involves an interaction between an algorithm 𝒜 and an oracle, conducted on a directed acyclic graph 𝒢 known to both parties. Initially, the oracle selects a vertex t in 𝒢 called the target. Subsequently, 𝒜 must identify the target
-
Cost-based Data Prefetching and Scheduling in Big Data Platforms over Tiered Storage Systems ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-11-13 Herodotos Herodotou, Elena Kakoulli
The use of storage tiering is becoming popular in data-intensive compute clusters due to the recent advancements in storage technologies. The Hadoop Distributed File System, for example, now supports storing data in memory, SSDs, and HDDs, while OctopusFS and hatS offer fine-grained storage tiering solutions. However, current big data platforms (such as Hadoop and Spark) are not exploiting the presence
-
DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-09-12 Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald
Modern data lakes are heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: How can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that
-
Model Counting Meets F0 Estimation ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-08-09 A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel
Constraint satisfaction problems (CSPs) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities
-
Enabling Timely and Persistent Deletion in LSM-Engines ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-08-09 Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, Manos Athanassoulis
Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete operation inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art
-
Efficient Bi-objective SQL Optimization for Enclaved Cloud Databases with Differentially Private Padding ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-06-26 Yaxing Chen, Qinghua Zheng, Zheng Yan
Hardware-enabled enclaves have been applied to efficiently enforce data security and privacy protection in cloud database services. Such enclaved systems, however, are reported to suffer from I/O-size (also referred to as communication-volume)-based side-channel attacks. Albeit differentially private padding has been exploited to defend against these attacks as a principal method, it introduces a challenging
-
Model Counting meets F0 Estimation ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-06-20 A. Pavan, N. V. Vinodchandran, Arnab Bhattacharyya, Kuldeep S. Meel
Constraint satisfaction problems (CSP’s) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities
-
Enabling Timely and Persistent Deletion in LSM-Engines ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-06-08 Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, Manos Athanassoulis
Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete operation inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art
-
Proportionality on Spatial Data with Context ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-05-13 Georgios J. Fakas, Georgios Kalamatianos
More often than not, spatial objects are associated with some context, in the form of text, descriptive tags (e.g., points of interest, flickr photos), or linked entities in semantic graphs (e.g., Yago2, DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but also the context of the objects, especially when the retrieved objects are too many and the query
-
Reversible Database Watermarking Based on Order-preserving Encryption for Data Sharing ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-05-13 Donghui Hu, Qing Wang, Song Yan, Xiaojun Liu, Meng Li, Shuli Zheng
In the era of big data, data sharing not only boosts the economy of the world but also brings about problems of privacy disclosure and copyright infringement. The collected data may contain users’ sensitive information; thus, privacy protection should be applied to the data prior to them being shared. Moreover, the shared data may be re-shared to third parties without the consent or awareness of the
-
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-03-13 Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, Mirek Riedewald
We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is
-
Robust and Efficient Sorting with Offset-value Coding ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-03-13 Thanh Do, Goetz Graefe
Sorting and searching are large parts of database query processing, e.g., in the forms of index creation, index maintenance, and index lookup, and comparing pairs of keys is a substantial part of the effort in sorting and searching. We have worked on simple, efficient implementations of decades-old, neglected, effective techniques for fast comparisons and fast sorting, in particular offset-value coding
-
Efficiently Cleaning Structured Event Logs: A Graph Repair Approach ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-03-13 Ruihong Huang, Jianmin Wang, Shaoxu Song, Xuemin Lin, Xiaochen Zhu, Jian Pei
Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause serious damage to real applications, such as inaccurate provenance answers, poor profiling results, or concealing interesting patterns from event data. Cleaning dirty event data is strongly demanded. While existing event data cleaning techniques view event logs as sequences, structural
-
Efficient Sorting, Duplicate Removal, Grouping, and Aggregation ACM Trans. Database Syst. (IF 1.8) Pub Date : 2023-01-06 Thanh Do, Goetz Graefe, Jeffrey Naughton
Database query processing requires algorithms for duplicate removal, grouping, and aggregation. Three algorithms exist: in-stream aggregation is most efficient by far but requires sorted input; sort-based aggregation relies on external merge sort; and hash aggregation relies on an in-memory hash table plus hash partitioning to temporary storage. Cost-based query optimization chooses which algorithm
-
Proximity Queries on Terrain Surface ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-12-16 Victor Junqiu Wei, Raymond Chi-Wing Wong, Cheng Long, David Mount, Hanan Samet
Due to the advance of the geo-spatial positioning and the computer graphics technology, digital terrain data has become increasingly popular nowadays. Query processing on terrain data has attracted considerable attention from both the academic and the industry communities. Proximity queries such as the shortest path/distance query, k nearest/farthest neighbor query, and top-k closest/farthest pairs
-
Deciding Robustness for Lower SQL Isolation Levels ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-11-06 Bas Ketsman, Christoph Koch, Frank Neven, Brecht Vandevoort
While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application
-
Conjunctive Queries: Unique Characterizations and Exact Learnability ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-11-06 Balder Ten Cate, Victor Dalmau
We answer the question of which conjunctive queries are uniquely characterized by polynomially many positive and negative examples and how to construct such examples efficiently. As a consequence, we obtain a new efficient exact learning algorithm for a class of conjunctive queries. At the core of our contributions lie two new polynomial-time algorithms for constructing frontiers in the homomorphism
-
Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-08-18 Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, Nicole Schweikardt
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports
-
On Finding Rank Regret Representatives ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-08-18 Abolfazl Asudeh, Gautam Das, H. V. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, Nan Zhang, Jianwen Zhao
Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: Different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative
-
Persistent Summaries ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-08-18 Tianjing Zeng, Zhewei Wei, Ge Luo, Ke Yi, Xiaoyong Du, Ji-Rong Wen
A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried
-
Influence Maximization Revisited: Efficient Sampling with Bound Tightened ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-08-18 Qintian Guo, Sibo Wang, Zhewei Wei, Wenqing Lin, Jing Tang
Given a social network G with n nodes and m edges, a positive integer k, and a cascade model C, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model C is maximized. The state-of-the-art approximate solutions run in O(k(n+m) log n/ε²) expected time while returning a (1 - 1/e - ε) approximate solution with at
-
Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-06-25 Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, Nicole Schweikardt
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a polylogarithmic delay after a linear-time preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports
-
Persistent Data Sketching ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-05-23
A persistent data structure, also known as a multiversion data structure in the database literature, is a data structure that preserves all its previous versions as it is updated over time. Every update (inserting, deleting, or changing a data record) to the data structure creates a new version, while all the versions are kept in the data structure so that any previous version can still be queried
-
Conjunctive Regular Path Queries with Capture Groups ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-05-23 Markus L. Schmid
In practice, regular expressions are usually extended by so-called capture groups or capture variables, which allow to capture a subexpression by a variable that can be referenced in the regular expression in order to describe repetitions of subwords. We investigate how this concept could be used for pattern-based graph querying; i.e., we investigate conjunctive regular path queries (CRPQs) that are
-
Incremental Graph Computations: Doable and Undoable ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-05-23 Wenfei Fan, Chao Tian
The incremental problem for a class 𝒬 of graph queries aims to compute, given a query Q ∈ 𝒬, graph G, answers Q(G) to Q in G and updates ΔG to G as input, changes ΔO to output Q(G) such that Q(G⊕ΔG) = Q(G)⊕ΔO. It is called bounded if its cost can be expressed as a polynomial function in the sizes of Q, ΔG and ΔO, which reduces the computations on possibly big
-
Mining Order-preserving Submatrices under Data Uncertainty: A Possible-world Approach and Efficient Approximation Methods ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-05-23 Ji Cheng, Da Yan, Wenwen Qu, Xiaotian Hao, Cheng Long, Wilfred Ng, Xiaoling Wang
Given a data matrix D, a submatrix S of D is an order-preserving submatrix (OPSM) if there is a permutation of the columns of S, under which the entry values of each row in S are strictly increasing. OPSM mining is widely used in real-life applications such as identifying coexpressed genes and finding customers with similar preference. However, noise is ubiquitous in real
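The OPSM definition in the abstract above can be illustrated with a small check. This is only a sketch of the definition for a fully known (noise-free) submatrix, not the mining method of the article; the function name `opsm_column_order` and the list-of-rows input format are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def opsm_column_order(rows: Sequence[Sequence[float]]) -> Optional[List[int]]:
    """Return a column permutation under which every row is strictly
    increasing, or None if no such permutation exists.

    Hypothetical helper: `rows` is the candidate submatrix S, one sequence per row.
    """
    if not rows:
        return []
    # Any valid permutation must sort the first row, so derive it from there.
    first = rows[0]
    order = sorted(range(len(first)), key=lambda j: first[j])
    # Every row (including the first, to rule out ties) must be strictly
    # increasing when its columns are read in that order.
    for row in rows:
        for a, b in zip(order, order[1:]):
            if row[a] >= row[b]:
                return None
    return order


# Usage: both rows rank the columns identically (column 1 < column 2 < column 0).
print(opsm_column_order([[3, 1, 2], [30, 10, 20]]))
```

Because the order is fixed by the first row, the check costs one sort plus a single pass over the entries.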
-
Optimal Joins Using Compressed Quadtrees ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-05-23 Diego Arroyuelo, Gonzalo Navarro, Juan L. Reutter, Javiel Rojas-Ledesma
Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now count several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality one either needs to build completely new indexes or must
-
Influence Maximization Revisited: Efficient Sampling with Bound Tightened ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-05-19 Qintian Guo, Sibo Wang, Zhewei Wei, Wenqing Lin, Jing Tang
Given a social network G with n nodes and m edges, a positive integer k, and a cascade model 𝒞, the influence maximization (IM) problem asks for k nodes in G such that the expected number of nodes influenced by the k nodes under cascade model 𝒞 is maximized. The state-of-the-art approximate solutions run in O(k(n + m) log n/ϵ²) expected time while returning a (1 − 1/e
-
Unified Route Planning for Shared Mobility: An Insertion-based Framework ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-05-02 Yongxin Tong, Yuxiang Zeng, Zimu Zhou, Lei Chen, Ke Xu
There has been a dramatic growth of shared mobility applications such as ride-sharing, food delivery, and crowdsourced parcel delivery. Shared mobility refers to transportation services that are shared among users, where a central issue is route planning. Given a set of workers and requests, route planning finds for each worker a route, i.e., a sequence of locations to pick up and drop off passengers/parcels
-
On Finding Rank Regret Representatives ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-04-27 Abolfazl Asudeh, Gautam Das, H. V. Jagadish, Shangqi Lu, Azade Nazi, Yufei Tao, Nan Zhang, Jianwen Zhao
Selecting the best items in a dataset is a common task in data exploration. However, the concept of “best” lies in the eyes of the beholder: different users may consider different attributes more important and, hence, arrive at different rankings. Nevertheless, one can remove “dominated” items and create a “representative” subset of the data, comprising the “best items” in it. A Pareto-optimal representative
-
Height Optimized Tries ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-04-06 Robert Binna, Eva Zangerle, Martin Pichl, Günther Specht, Viktor Leis
We present the Height Optimized Trie (HOT), a fast and space-efficient in-memory index structure. The core algorithmic idea of HOT is to dynamically vary the number of bits considered at each node, which enables a consistently high fanout and thereby good cache efficiency. For a fixed maximum node fanout, the overall tree height is minimal and its structure is deterministically defined. Multiple carefully
-
The Space-Efficient Core of Vadalog ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-04-06 Gerald Berger, Georg Gottlob, Andreas Pieris, Emanuel Sallinger
Vadalog is a system for performing complex reasoning tasks such as those required in advanced knowledge graphs. The logical core of the underlying Vadalog language is the warded fragment of tuple-generating dependencies (TGDs). This formalism ensures tractable reasoning in data complexity, while a recent analysis focusing on a practical implementation led to the reasoning algorithm around which the
-
Incremental Graph Computations: Doable and Undoable ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-03-10 Wenfei Fan, Chao Tian
The incremental problem for a class 𝒬 of graph queries aims to compute, given a query Q ∈ 𝒬, graph G, answers Q(G) to Q in G and updates ΔG to G as input, changes ΔO to output Q(G) such that Q(G⊕ΔG) = Q(G)⊕ΔO. It is called bounded if its cost can be expressed as a polynomial function in the sizes of Q, ΔG and ΔO, which reduces the computations on possibly big G
-
Sampling a Near Neighbor in High Dimensions (Just Accepted) ACM Trans. Database Syst. (IF 1.8) Pub Date : 2022-02-04 Martin Aumüller, Sariel Har-Peled, Sepideh Mahabadi, Rasmus Pagh, Francesco Silvestri
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal
-
A Formal Framework for Complex Event Recognition ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-12-08 Alejandro Grez, Cristian Riveros, Martín Ugarte, Stijn Vansummeren
Complex event recognition (CER) has emerged as the unifying field for technologies that require processing and correlating distributed data sources in real time. CER finds applications in diverse domains, which has resulted in a large number of proposals for expressing and processing complex events. Existing CER languages lack a clear semantics, however, which makes them hard to understand and generalize
-
On Directed Densest Subgraph Discovery ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-11-15 Chenhao Ma, Yixiang Fang, Reynold Cheng, Laks V. S. Lakshmanan, Wenjie Zhang, Xuemin Lin
Given a directed graph G, the directed densest subgraph (DDS) problem refers to the finding of a subgraph from G, whose density is the highest among all the subgraphs of G. The DDS problem is fundamental to a wide range of applications, such as fraud detection, community mining, and graph compression. However, existing DDS solutions suffer from efficiency and scalability problems: on a 3,000-edge graph
-
Timely Reporting of Heavy Hitters Using External Memory ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-11-15 Shikha Singh, Prashant Pandey, Michael A. Bender, Jonathan W. Berry, Martín Farach-Colton, Rob Johnson, Thomas M. Kroeger, Cynthia A. Phillips
Given an input stream S of size N, a ɸ-heavy hitter is an item that occurs at least ɸN times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = ɸN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem
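To make the heavy-hitter definition above concrete, here is a minimal in-memory sketch that reports each item at the moment it reaches its T-th occurrence. It keeps exact counts in RAM and assumes the stream length N is known up front, so it illustrates the problem statement only and is not the external-memory technique the article studies; the function name `timely_heavy_hitters` is hypothetical.

```python
from collections import Counter
from math import ceil
from typing import Hashable, Iterable, Iterator, Tuple


def timely_heavy_hitters(
    stream: Iterable[Hashable], n: int, phi: float
) -> Iterator[Tuple[int, Hashable]]:
    """Yield (position, item) as soon as `item` has occurred T times,
    where T is the smallest integer with T >= phi * n.

    Assumption: `n` is the total stream length and is known in advance.
    """
    threshold = max(1, ceil(phi * n))
    counts: Counter = Counter()
    for position, item in enumerate(stream, start=1):
        counts[item] += 1
        # Report exactly once, at the occurrence that crosses the threshold.
        if counts[item] == threshold:
            yield position, item


# Usage: with N = 8 and phi = 0.25, an item is a heavy hitter after 2 occurrences.
print(list(timely_heavy_hitters("abacabad", n=8, phi=0.25)))
```

Exact counting needs memory proportional to the number of distinct items, which is the cost that motivates more compact or external-memory approaches.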
-
Balancing Expressiveness and Inexpressiveness in View Design ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-11-15 Michael Benedikt, Pierre Bourhis, Louis Jachiet, Efthymia Tsamoura
We study the design of data publishing mechanisms that allow a collection of autonomous distributed data sources to collaborate to support queries. A common mechanism for data publishing is via views: functions that expose derived data to users, usually specified as declarative queries. Our autonomy assumption is that the views must be on individual sources, but with the intention of supporting integrated
-
SkinnerDB: Regret-bounded Query Evaluation via Reinforcement Learning ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-09-28 Immanuel Trummer, Junxiong Wang, Ziyun Wei, Deepak Maram, Samuel Moseley, Saehan Jo, Joseph Antonakakis, Ankush Rayabhari
SkinnerDB uses reinforcement learning for reliable join ordering, exploiting an adaptive processing engine with specialized join algorithms and data structures. It maintains no data statistics and uses no cost or cardinality models. Also, it uses no training workloads nor does it try to link the current query to seemingly similar queries in the past. Instead, it uses reinforcement learning to learn
-
Stream Data Cleaning under Speed and Acceleration Constraints ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-09-28 Shaoxu Song, Fei Gao, Aoqian Zhang, Jianmin Wang, Philip S. Yu
Stream data are often dirty, for example, owing to unreliable sensor reading or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that the cleaning should avoid changing those originally correct/clean data, a.k.a. the minimum modification rule in data cleaning
-
Error Bounded Line Simplification Algorithms for Trajectory Compression: An Experimental Evaluation ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-09-28 Xuelian Lin, Shuai Ma, Jiahao Jiang, Yanchen Hou, Tianyu Wo
Nowadays, various sensors are collecting, storing, and transmitting tremendous trajectory data, and it is well known that the storage, network bandwidth, and computing resources could be heavily wasted if raw trajectory data is directly adopted. Line simplification algorithms are effective approaches to attacking this issue by compressing a trajectory to a set of continuous line segments, and are commonly
-
Bag Query Containment and Information Theory ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-09-28 Mahmoud Abo Khamis, Phokion G. Kolaitis, Hung Q. Ngo, Dan Suciu
The query containment problem is a fundamental algorithmic problem in data management. While this problem is well understood under set semantics, it is by far less understood under bag semantics. In particular, it is a long-standing open question whether or not the conjunctive query containment problem under bag semantics is decidable. We unveil tight connections between information theory and the
-
On the Enumeration Complexity of Unions of Conjunctive Queries ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-05-30 Nofar Carmeli, Markus Kröll
We study the enumeration complexity of Unions of Conjunctive Queries (UCQs). We aim to identify the UCQs that are tractable in the sense that the answer tuples can be enumerated with a linear preprocessing phase and a constant delay between every successive tuples. It has been established that, in the absence of self-joins and under conventional complexity assumptions, the CQs that admit such an evaluation
-
Optimizing One-time and Continuous Subgraph Queries using Worst-case Optimal Joins ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-05-30 Amine Mhedhbi, Chathura Kankanamge, Semih Salihoglu
We study the problem of optimizing one-time and continuous subgraph queries using the new worst-case optimal join plans. Worst-case optimal plans evaluate queries by matching one query vertex at a time using multiway intersections. The core problem in optimizing worst-case optimal plans is to pick an ordering of the query vertices to match. We make two main contributions: 1. A cost-based dynamic programming
-
Embedded Functional Dependencies and Data-completeness Tailored Database Design ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-05-30 Ziheng Wei, Sebastian Link
We establish a principled schema design framework for data with missing values. The framework is based on the new notion of an embedded functional dependency, which is independent of the interpretation of missing values, able to express completeness and integrity requirements on application data, and capable of capturing redundant data value occurrences that may cause problems with processing data
-
Graph Indexing for Efficient Evaluation of Label-constrained Reachability Queries ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-05-30 Yangjun Chen, Gagandeep Singh
Given a directed edge labeled graph G, to check whether vertex v is reachable from vertex u under a label set S is to know if there is a path from u to v whose edge labels across the path are a subset of S. Such a query is referred to as a label-constrained reachability (LCR) query. In this article, we present a new approach to store a compressed transitive closure of G in the form of intervals
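The LCR query defined above can be answered without any index by a breadth-first search that ignores edges whose label is outside S. The sketch below shows that baseline only; it is not the interval-based index proposed in the article, and the name `lcr_reachable` and the `(u, label, v)` edge-triple format are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable, Hashable]  # (source vertex, label, target vertex)


def lcr_reachable(
    edges: Iterable[Edge], source: Hashable, target: Hashable, allowed: Set[Hashable]
) -> bool:
    """Return True if `target` is reachable from `source` using only edges
    whose label is in `allowed` (plain BFS baseline, no index)."""
    # Keep only edges with a permitted label.
    adjacency: Dict[Hashable, List[Hashable]] = {}
    for u, label, v in edges:
        if label in allowed:
            adjacency.setdefault(u, []).append(v)

    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True  # also covers source == target (empty path)
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Usage on a tiny hypothetical graph: u -a-> x -b-> v and u -c-> v.
example_edges = [("u", "a", "x"), ("x", "b", "v"), ("u", "c", "v")]
print(lcr_reachable(example_edges, "u", "v", {"a", "b"}))
```

Each query in this baseline scans the whole edge list, which is the per-query cost that a precomputed index aims to avoid.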
-
Constant-Delay Enumeration for Nondeterministic Document Spanners ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-04-14 Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth
We consider the information extraction framework known as document spanners and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results
-
Scotty ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-03-27 Jonas Traub, Philipp Marian Grulich, Alejandro Rodríguez Cuéllar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, or minimizing memory usage. However, each technique operates under different assumptions with respect to workload characteristics, such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions)
-
An Empirical Study of Moment Estimators for Quantile Approximation ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-03-18 Rory Mitchell, Eibe Frank, Geoffrey Holmes
We empirically evaluate lightweight moment estimators for the single-pass quantile approximation problem, including maximum entropy methods and orthogonal series with Fourier, Cosine, Legendre, Chebyshev and Hermite basis functions. We show how to apply stable summation formulas to offset numerical precision issues for higher-order moments, leading to reliable single-pass moment estimators up to order
-
Evaluation of Machine Learning Algorithms in Predicting the Next SQL Query from the Future ACM Trans. Database Syst. (IF 1.8) Pub Date : 2021-03-18 Venkata Vamsikrishna Meduri, Kanchan Chowdhury, Mohamed Sarwat
Prediction of the next SQL query from the user, given her sequence of queries until the current timestep, during an ongoing interaction session of the user with the database, can help in speculative query processing and increased interactivity. While existing machine learning (ML)-based approaches use recommender systems to suggest relevant queries to a user, there has been no exhaustive study on