A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification
01 January 2005
When overlapping or redundant information from multiple sources is integrated, inconsistencies or conflicts in the data may emerge as violations of integrity constrtaints on the integrated data. Previous work has addressed finding minimal repairs to such inconsistent databases, as well as query answers consistent with all minimal repairs. In this work, a database repair is usually defined as a set of tuple insertions and deletions that ensure constraints are satisified. In this paper, we follow recent work to define a database repair as a proposed set of value modifications which resolve constraint violations. We introduce a cost-based framework for evaluating repairs which allows for the application of techniques from record-linkage to the search for good repairs. We prove that for simple cost models, finding minimial-cost repairs is NP-complete in the size of the database, and show that obvious heuristic repair techniques are inadequate in that they may fail to terminate. We introduce a novel repair- construction framework based on equivalence classes of attribute values in which repair algorithms are guaranteed to terminate, and propose variants of the greedy method within this framework. While the naive greedy algorithms take time cubic in the size of the database, we present optimizations inspired by algorithms for duplicate-record detection which greatly improve the scalability of our algorithms at a small cost in repair quality. We evaluate the effectiveness and scalability of our algorithms on synthetic data from the TPC-H benchmark and real data extracted from web sites.