TY - JOUR
T1 - Discovery of Approximate Lexicographical Order Dependencies
AU - Jin, Yifeng
AU - Tan, Zijing
AU - Chen, Jixuan
AU - Ma, Shuai
N1 - Publisher Copyright:
© 1989-2012 IEEE.
PY - 2023/4/1
Y1 - 2023/4/1
N2 - Lexicographical order dependencies (LODs) specify orders between list of attributes, and are proven useful in optimizing SQL queries with order by clauses. To discover hidden dependencies from dirty data in practice, approximate dependency discoveries are actively studied, aiming at automatically discovering dependencies that hold on data with some exceptions. In this paper we study the discovery of approximate LODs. (1) We adapt two error measures, namely g g1 and g3, to LODs. We prove their desirable properties, present efficient algorithms for computing the measures and related lower and upper bounds, and study the relationship between the two measures. (2) We present an efficient approximate LOD discovery algorithm that is well suited to the two error measures, with a set of pruning rules, optimization techniques and ranking functions. (3) We study techniques for estimating g1 by sampling, with high accuracy and far less time. (4) We conduct extensive experiments to verify the effectiveness and scalability of our methods, using both real-life and synthetic data.
AB - Lexicographical order dependencies (LODs) specify orders between list of attributes, and are proven useful in optimizing SQL queries with order by clauses. To discover hidden dependencies from dirty data in practice, approximate dependency discoveries are actively studied, aiming at automatically discovering dependencies that hold on data with some exceptions. In this paper we study the discovery of approximate LODs. (1) We adapt two error measures, namely g g1 and g3, to LODs. We prove their desirable properties, present efficient algorithms for computing the measures and related lower and upper bounds, and study the relationship between the two measures. (2) We present an efficient approximate LOD discovery algorithm that is well suited to the two error measures, with a set of pruning rules, optimization techniques and ranking functions. (3) We study techniques for estimating g1 by sampling, with high accuracy and far less time. (4) We conduct extensive experiments to verify the effectiveness and scalability of our methods, using both real-life and synthetic data.
KW - Data profiling
KW - data dependency
KW - metadata
UR - https://www.scopus.com/pages/publications/85149961008
U2 - 10.1109/TKDE.2021.3130227
DO - 10.1109/TKDE.2021.3130227
M3 - 文章
AN - SCOPUS:85149961008
SN - 1041-4347
VL - 35
SP - 3684
EP - 3698
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 4
ER -