
面向数据集成的数据清理关键技术研究 (Research on Key Technologies of Data Cleaning for Data Integration)

Data Cleaning in Data Integration

【Author】 刘杰 (Liu Jie)

【Supervisor】 黄涛 (Huang Tao)

【Author Information】 University of Science and Technology of China, Computer Software and Theory, 2010, Doctoral Dissertation

【摘要 (Abstract)】 Data integration is the process of physically or logically consolidating data of different origins, formats, and semantics to provide a unified view. Demand for data integration keeps growing, but because integration environments are complex, the completeness, consistency, and accuracy of data are hard to guarantee; data quality problems delay many enterprise integration projects and greatly increase their cost. Data quality tools have become an indispensable part of enterprise data management, and data quality assurance has long been an important research area in computer science. Integrity constraints let users define, in a declarative language, the dependencies that data must satisfy, and support implication reasoning among constraints; in classical relational database research they have long been used to guarantee the correctness of database schemas. How to derive and mine data cleaning rules on the basis of integrity constraint theory, and thereby guarantee data consistency, is a new and active problem in data quality assurance. This thesis studies the problem in the data integration scenario and proposes new methods for automatically and efficiently detecting and cleaning inconsistent data.

First, the thesis presents an original study of how, once a data integration flow has been designed, the quality constraints that the data sources must satisfy can be inferred from the quality constraints on the target, so that exceptional data can be detected at the source side. Data from a source, after being processed by the flow, may violate the integrity constraints of the target, causing loads to fail or becoming dirty data in the target database; because data volumes are large and remote data transfer may be involved, locating problem data by executing and debugging the flow is too costly. The thesis proposes Backwards Constraint Propagation (BCP), which models the integration flow as a directed acyclic graph and automatically propagates the integrity constraints of the target database backwards along the data flow toward the sources; the derived source constraints can be used to detect exceptional data and guide designers in filtering it or improving the flow design. Constraint propagation rules for the basic relational algebra operators are defined and proved in first-order logic, and further rules are defined for complex data operations annotated with two abstract operations, attribute mapping and tuple mapping, so that BCP supports most kinds of data operations. Case studies and experiments show that the method effectively helps capture exceptional data and improves the efficiency of designing integration flows (a minimal illustrative sketch of the backward propagation idea follows this abstract).

Second, the thesis proposes a consistent query answering method based on NULL repairs, which filters out inconsistent attribute values of inconsistent data sources at query time. After data from multiple sources has been integrated, a large amount of data violating integrity constraints may remain because there is not enough auxiliary information to clean it. Consistent query answering (CQA) studies how to obtain consistent results at query time through virtual repairs, but most existing approaches are based on tuple-deletion repair semantics, which can lose information, and for most constraints computing CQA is intractable. We restrict constraints to the attribute level, so that only the attributes violating a constraint are treated as inconsistent information, and propose a NULL-based repair semantics in which every inconsistent attribute value is replaced with NULL to obtain a virtual repair. Because NULL repairs may introduce new inconsistent attributes, a constraint extension algorithm is proposed to locate all potentially inconsistent attributes. On top of the NULL repair semantics, an SQL rewriting algorithm is given to compute CQA. Experiments and performance analysis of the inconsistent attribute location algorithm and the SQL rewriting method show that the computation cost grows linearly with the database size and the proportion of inconsistent data across the query types studied.

Third, the thesis studies how to optimize the performance of data cleaning flows through flow restructuring and how to extend the method to web data mashups. As data volumes grow rapidly, performance becomes the bottleneck of data cleaning; optimizing the logical model of a cleaning flow can improve performance without adding resources. The thesis presents a general logical optimization framework for data cleaning flows: candidate flows are generated by semantics-preserving structural transformations, their execution costs are estimated, and the optimal flow is selected. The framework supports annotating operator components with feature attributes describing their operational semantics and defining domain-specific transformation rules, and it builds a cost partial-order graph from the relative ordering of flow costs to improve the accuracy of flow selection. To demonstrate its applicability and effectiveness, the framework is applied to a web data mashup tool as a case study, and experiments show that it effectively reduces mashup response time.

Finally, the thesis designs and implements OnceDQ, a model-driven development platform for data integration flows, on which the proposed data cleaning techniques are implemented and applied. The platform uses the Eclipse plug-in mechanism to make data operation components extensible, supports user-defined operator components and data source interfaces, and uses a code generation tool to automatically translate user-designed flows into platform-independent Java code that can be deployed across platforms.
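To make the backward propagation idea concrete, the following is a minimal sketch, not the thesis's actual formalization: it assumes a linear flow of two illustrative operators, a selection filter and an attribute mapping, and rewrites a target-side NOT NULL and domain constraint step by step back to the source, where violating rows can be flagged as exceptional before the flow runs. All class, attribute, and value names (Constraint, Filter, AttributeMap, src_age, country) are hypothetical.

```java
// Minimal, hypothetical sketch of backwards constraint propagation over a
// linear data-integration flow. Names are illustrative, not from the thesis.
import java.util.*;
import java.util.function.Predicate;

interface Constraint { boolean holds(Map<String, Object> row); }

interface Op {
    // Given a constraint required on this operator's output, return the
    // constraint that input rows must satisfy so the output stays clean.
    Constraint propagateBackwards(Constraint target);
}

/** Selection: rows removed by the predicate never reach the target,
 *  so only rows passing the predicate must satisfy the target constraint. */
class Filter implements Op {
    private final Predicate<Map<String, Object>> pred;
    Filter(Predicate<Map<String, Object>> pred) { this.pred = pred; }
    public Constraint propagateBackwards(Constraint target) {
        return row -> !pred.test(row) || target.holds(row);
    }
}

/** Attribute mapping (e.g. a rename): rewrite the constraint so it reads
 *  the source attribute that feeds each target attribute. */
class AttributeMap implements Op {
    private final Map<String, String> targetToSource;
    AttributeMap(Map<String, String> targetToSource) { this.targetToSource = targetToSource; }
    public Constraint propagateBackwards(Constraint target) {
        return row -> {
            Map<String, Object> asTarget = new HashMap<>();
            targetToSource.forEach((t, s) -> asTarget.put(t, row.get(s)));
            return target.holds(asTarget);
        };
    }
}

public class BcpSketch {
    public static void main(String[] args) {
        // Target constraint: the loaded column "age" must be non-null and non-negative.
        Constraint onTarget = row -> row.get("age") != null && ((Integer) row.get("age")) >= 0;

        // A two-step flow, listed from source to target:
        //   source --[keep only country = "CN"]--> --[rename src_age -> age]--> target
        List<Op> flow = List.of(
            new Filter(r -> "CN".equals(r.get("country"))),
            new AttributeMap(Map.of("age", "src_age")));

        // Walk the flow backwards, rewriting the constraint at every step.
        Constraint onSource = onTarget;
        for (int i = flow.size() - 1; i >= 0; i--)
            onSource = flow.get(i).propagateBackwards(onSource);

        // The derived source constraint flags rows that would become dirty data.
        Map<String, Object> row = new HashMap<>(Map.of("country", "CN"));
        row.put("src_age", null);                 // missing age at the source
        System.out.println(onSource.holds(row));  // false -> exceptional row
    }
}
```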

【Abstract】 Data integration collects data of various sources, formats, and semantics, integrates them physically or logically, and provides a unified view to access them. Due to the large amount of data and the increasing complexity of business intelligence application requirements, it is hard to ensure the integrity, consistency, and accuracy of data, and data quality issues make data integration projects error-prone and labor-intensive to develop. Integrity constraints give users a way to define data dependencies declaratively to ensure consistency, and there is a sound theoretical basis for implication analysis of integrity constraints. Inducing and mining data quality rules based on constraint theories is an active research area. This thesis addresses this problem in the integration scenario and presents new methods to automatically and efficiently detect and clean data.

First, we present a novel method to induce data quality constraints for the data sources from the data quality constraints defined on the target database. The quality of data in a source may fall short of what designers assumed when validation and transformation rules were specified, causing loads of the target database to fail due to constraint violations or flushing dirty data into the target. Because of the large data volume and the possible need to transfer data between distributed servers, it is costly to debug a data integration flow (DIF) by executing it. We design a general framework for this problem, called Backwards Constraint Propagation (BCP), which automatically analyzes a DIF, generates data quality rules from the constraints defined in the data warehouse (DW), and propagates them backwards from the target to the sources. The derived data quality rules can be used to detect exceptional data in the sources and help designers improve the DIFs. By defining constraint propagation rules, BCP supports most relational algebra operators and data transformation functions. Case studies and experiments demonstrate the correctness and efficiency of BCP.

Second, we present a method to automatically filter inconsistent attributes from data sources based on virtual repair with NULLs. Although integrity constraints can successfully capture data semantics, the actual data in a database often violates them. When a DIF can be transformed into a relational algebra query, we can apply consistent query answering (CQA) to obtain an answer that is true in every minimal repair of the inconsistent database. It has been proved that, under repair semantics based on tuple deletions or insertions, CQA is intractable for most constraints and queries, and repairing by deleting tuples also loses information. We present a new repair semantics, repairing with nulls, which replaces inconsistent attribute values with NULLs. To capture all inconsistent attribute values, we study the transitivity of nulls and provide an algorithm to extend the original constraints. Under repairing with nulls there is only one repair, and CQA can be computed in PTIME by SQL query rewriting. We study the performance of the new approach in detailed experiments, and a small illustrative sketch of the rewriting idea follows.
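A small sketch of the NULL-repair idea, under strong simplifying assumptions: it handles only single-tuple, attribute-level check constraints and masks a violating attribute value with NULL via a CASE expression, which amounts to querying the single virtual repair. The thesis's actual rewriting algorithm and its constraint extension step for newly introduced NULLs are not reproduced here, and the table and column names are invented.

```java
// Minimal, hypothetical sketch of SQL rewriting for CQA under "repair with NULLs":
// attribute values violating an attribute-level constraint are masked with NULL.
import java.util.*;

public class NullRepairRewrite {

    /** Wrap a column so that values violating its consistency condition become NULL. */
    static String maskColumn(String column, String consistentWhen) {
        return "CASE WHEN " + consistentWhen + " THEN " + column
             + " ELSE NULL END AS " + column;
    }

    /** Rewrite SELECT cols FROM table so it reads the virtual NULL repair. */
    static String rewrite(String table, List<String> cols,
                          Map<String, String> attributeConstraints) {
        StringJoiner select = new StringJoiner(", ");
        for (String c : cols) {
            String cond = attributeConstraints.get(c);
            select.add(cond == null ? c : maskColumn(c, cond));
        }
        // Querying this rewritten statement is equivalent to querying the single
        // NULL repair, so consistent answers are obtained in PTIME.
        return "SELECT " + select + " FROM " + table;
    }

    public static void main(String[] args) {
        // Attribute-level constraints: age must lie in [0, 150], salary must be non-negative.
        Map<String, String> constraints = Map.of(
            "age", "age BETWEEN 0 AND 150",
            "salary", "salary >= 0");

        String q = rewrite("employee", List.of("name", "age", "salary"), constraints);
        System.out.println(q);
        // Prints (wrapped here for readability):
        //   SELECT name,
        //          CASE WHEN age BETWEEN 0 AND 150 THEN age ELSE NULL END AS age,
        //          CASE WHEN salary >= 0 THEN salary ELSE NULL END AS salary
        //   FROM employee
    }
}
```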
Third, we study how to improve the performance of data cleaning processes by automatically refactoring the structure of their data flows. A set of operational-semantics features is selected for annotating the operators in data flows, and refactoring rules are defined to generate all candidate semantically equivalent data flows. A heuristic algorithm then searches accurately and quickly for the data flow with minimal execution time by constructing a partially ordered set of data flows based on their estimated costs. To validate the framework, we apply it to mashups. Mashup tools usually let end users quickly and graphically build complex mashups by using pipes to connect web data sources into a data flow; because end users have varying degrees of technical expertise, the designed data flows may be inefficient, which increases the response time of mashups. A case study shows that the framework is applicable to general mashup data flows without complete knowledge of the operational semantics of their operators, and experiments demonstrate the efficiency improvement (a small cost-based selection sketch follows this abstract).

Finally, we study a model-driven development method for data integration processes and implement a development platform, OnceDQ, on which the research work above is implemented; the implementation details are discussed.
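As a rough illustration of cost-based flow selection, the sketch below enumerates a few candidate orderings of a two-step cleaning flow and picks the one with the lowest estimated cost. The cost model (per-row cost times the number of rows reaching each step, with selectivities) and the single "adjacent steps commute" rewriting rule are assumptions made for illustration; they stand in for the framework's annotated operator semantics, domain-specific transformation rules, and cost partial-order graph.

```java
// Minimal, hypothetical sketch of choosing among semantically equivalent
// data-cleaning flows by estimated cost. Operators and numbers are invented.
import java.util.*;

public class FlowRefactorSketch {

    /** A pipeline step with a per-row cost and the fraction of rows it keeps. */
    record Step(String name, double costPerRow, double selectivity) {}

    /** Estimated cost of a linear flow: each step pays for the rows reaching it. */
    static double estimateCost(List<Step> flow, double inputRows) {
        double rows = inputRows, total = 0;
        for (Step s : flow) {
            total += rows * s.costPerRow();
            rows *= s.selectivity();
        }
        return total;
    }

    /** Generate alternative orderings by swapping adjacent steps; here every swap
     *  is simply assumed to be semantics-preserving, whereas a real rule set
     *  would check the operators' semantic annotations first. */
    static List<List<Step>> candidates(List<Step> flow) {
        List<List<Step>> out = new ArrayList<>();
        out.add(flow);
        for (int i = 0; i + 1 < flow.size(); i++) {
            List<Step> alt = new ArrayList<>(flow);
            Collections.swap(alt, i, i + 1);
            out.add(alt);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Step> designed = List.of(
            new Step("dedupe (expensive)", 5.0, 0.9),
            new Step("filter invalid rows", 0.1, 0.4));

        // Pick the cheapest semantically equivalent candidate for one million rows.
        List<Step> best = candidates(designed).stream()
            .min(Comparator.comparingDouble(f -> estimateCost(f, 1_000_000)))
            .orElse(designed);

        best.forEach(s -> System.out.println(s.name()));
        // filter invalid rows   <- the cheap, selective step moves first
        // dedupe (expensive)
    }
}
```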
