TY - GEN
T1 - Improving data quality
T2 - 33rd International Conference on Very Large Data Bases, VLDB 2007
AU - Cong, Gao
AU - Fan, Wenfei
AU - Geerts, Floris
AU - Jia, Xibei
AU - Ma, Shuai
N1 - Publisher Copyright:
Copyright 2007 VLDB Endowment, ACM.
PY - 2007
Y1 - 2007
N2 - Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and "minimally" differs from D. Equally important is to ensure that the automatically-generated repair D′ is accurate, or makes sense, i.e., D′ differs from the "correct" data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. We employ a class of conditional functional dependencies (CFDs) proposed in [6] to specify the consistency of the data, which are able to capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D′ that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.
AB - Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and "minimally" differs from D. Equally important is to ensure that the automatically-generated repair D′ is accurate, or makes sense, i.e., D′ differs from the "correct" data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. We employ a class of conditional functional dependencies (CFDs) proposed in [6] to specify the consistency of the data, which are able to capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D′ that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.
UR - https://www.scopus.com/pages/publications/84959912087
M3 - 会议稿件
AN - SCOPUS:84959912087
T3 - 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings
SP - 315
EP - 326
BT - 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings
A2 - Gehrke, Johannes
A2 - Koch, Christoph
A2 - Garofalakis, Minos
A2 - Aberer, Karl
A2 - Kanne, Carl-Christian
A2 - Neuhold, Erich J.
A2 - Ganti, Venkatesh
A2 - Klas, Wolfgang
A2 - Chan, Chee-Yong
A2 - Srivastava, Divesh
A2 - Florescu, Dana
A2 - Deshpande, Anand
PB - Association for Computing Machinery, Inc
Y2 - 23 September 2007 through 27 September 2007
ER -