Abstract
Lexicographical order dependencies (LODs) specify orders between list of attributes, and are proven useful in optimizing SQL queries with order by clauses. To discover hidden dependencies from dirty data in practice, approximate dependency discoveries are actively studied, aiming at automatically discovering dependencies that hold on data with some exceptions. In this paper we study the discovery of approximate LODs. (1) We adapt two error measures, namely g g1 and g3, to LODs. We prove their desirable properties, present efficient algorithms for computing the measures and related lower and upper bounds, and study the relationship between the two measures. (2) We present an efficient approximate LOD discovery algorithm that is well suited to the two error measures, with a set of pruning rules, optimization techniques and ranking functions. (3) We study techniques for estimating g1 by sampling, with high accuracy and far less time. (4) We conduct extensive experiments to verify the effectiveness and scalability of our methods, using both real-life and synthetic data.
| Original language | English |
|---|---|
| Pages (from-to) | 3684-3698 |
| Number of pages | 15 |
| Journal | IEEE Transactions on Knowledge and Data Engineering |
| Volume | 35 |
| Issue number | 4 |
| DOIs | |
| State | Published - 1 Apr 2023 |
Keywords
- Data profiling
- data dependency
- metadata
Fingerprint
Dive into the research topics of 'Discovery of Approximate Lexicographical Order Dependencies'. Together they form a unique fingerprint.Cite this
- APA
- Author
- BIBTEX
- Harvard
- Standard
- RIS
- Vancouver