
Detecting inconsistencies in distributed data

  • Wenfei Fan*
  • Floris Geerts
  • Shuai Ma
  • Heiko Müller
  • *Corresponding author for this work
  • University of Edinburgh
  • Bell Laboratories Inc.

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

One of the central problems for data quality is inconsistency detection. Given a database D and a set Σ of dependencies as data quality rules, we want to identify tuples in D that violate some rules in Σ. When D is a centralized database, there have been effective SQL-based techniques for finding violations. It is, however, far more challenging when data in D is distributed, in which inconsistency detection often necessarily requires shipping data from one site to another. This paper develops techniques for detecting violations of conditional functional dependencies (CFDs) in relations that are fragmented and distributed across different sites. (1) We formulate the detection problem in various distributed settings as optimization problems, measured by either network traffic or response time. (2) We show that it is beyond reach in practice to find optimal detection methods: the detection problem is NP-complete when the data is partitioned either horizontally or vertically, and when we aim to minimize either data shipment or response time. (3) For data that is horizontally partitioned, we provide several algorithms to find violations of a set of CFDs, leveraging the structure of CFDs to reduce data shipment or increase parallelism. (4) We verify experimentally that our algorithms are scalable on large relations and complex CFDs. (5) For data that is vertically partitioned, we provide a characterization for CFDs to be checked locally without requiring data shipment, in terms of dependency preservation. We show that it is intractable to minimally refine a partition and make it dependency preserving.
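
To make the problem statement concrete, the following is a minimal Python sketch of CFD violation detection. It is not one of the paper's algorithms. It assumes a CFD is given as a left-hand side, a right-hand side, and a pattern in which `_` is a wildcard. The function name `cfd_violations`, the attributes `cc`, `zip`, `street`, and the sample rows are all hypothetical. The example checks two horizontal fragments separately and then together, to illustrate why purely local checks can miss violations and some data shipment may be unavoidable.

```python
from collections import defaultdict

WILDCARD = "_"  # assumed notation for "any value" in a pattern


def matches(row, attrs, pattern):
    """True if row agrees with every constant of the pattern over attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def cfd_violations(rows, lhs, rhs, pattern):
    """Return indices of rows that violate the CFD (lhs -> rhs, pattern)."""
    violating = set()
    groups = defaultdict(list)  # LHS values -> rows matching the LHS pattern
    for i, row in enumerate(rows):
        if not matches(row, lhs, pattern):
            continue  # the CFD does not apply to this row
        if not matches(row, rhs, pattern):
            violating.add(i)  # single-tuple violation of an RHS constant
        groups[tuple(row[a] for a in lhs)].append(i)
    for members in groups.values():
        # Pair violation: same LHS values but different RHS values.
        if len({tuple(rows[i][a] for a in rhs) for i in members}) > 1:
            violating.update(members)
    return violating


# Hypothetical CFD: for cc = "44", zip determines street.
lhs, rhs = ["cc", "zip"], ["street"]
pattern = {"cc": "44", "zip": WILDCARD, "street": WILDCARD}

# Two hypothetical horizontal fragments stored at different sites.
fragment_a = [
    {"cc": "44", "zip": "EH4 1DT", "street": "Mayfield"},
    {"cc": "01", "zip": "07974", "street": "Mtn Ave"},
]
fragment_b = [
    {"cc": "44", "zip": "EH4 1DT", "street": "Crichton"},
    {"cc": "01", "zip": "07974", "street": "Main St"},
]

local_a = cfd_violations(fragment_a, lhs, rhs, pattern)
local_b = cfd_violations(fragment_b, lhs, rhs, pattern)
combined = cfd_violations(fragment_a + fragment_b, lhs, rhs, pattern)
print(local_a, local_b, combined)
```

In this sketch each fragment is consistent on its own, yet the two rows with `cc = "44"` conflict once the fragments are brought together. Concatenating the fragments is the naive approach of shipping everything to one site. The abstract's optimization problems concern how to do less than that, measured by network traffic or response time.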

Original language: English
Title of host publication: 26th IEEE International Conference on Data Engineering, ICDE 2010 - Conference Proceedings
Pages: 64-75
Number of pages: 12
DOI
Publication status: Published - 2010
Externally published: Yes
Event: 26th IEEE International Conference on Data Engineering, ICDE 2010 - Long Beach, CA, United States
Duration: 1 Mar 2010 to 6 Mar 2010

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627

Conference

Conference: 26th IEEE International Conference on Data Engineering, ICDE 2010
Country/Territory: United States
City: Long Beach, CA
Period: 1/03/10 to 6/03/10
