
Theories and Approach of Data Lineage Tracing in Data Warehouse Environment

(Original title: 数据仓库中数据志跟踪的理论与方法研究)

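【Author】 Dai Chaofan (戴超凡)

【Advisor】 Chen Wenwei (陈文伟)

【Author information】 National University of Defense Technology of the Chinese People's Liberation Army, Management Science and Engineering, 2002, Ph.D.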

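【Abstract (translated from the Chinese original)】 In a data warehouse system, the precise history of a warehouse data item, that is, the description of the complete process from its acquisition, transformation, and integration to its current state, is called its data lineage. Data lineage has two parts: the originating data set and the data processing applied to that data set. The process of obtaining data lineage is called data lineage tracing. Data lineage tracing is a recent frontier topic in data warehouse research. It supports more comprehensive and deeper data analysis, and it helps technical staff verify the correctness of source data, cleaning rules, and transformations, thereby improving the quality of the data warehouse.

Starting from the definition of the derivation set, the author identifies the general laws of derivation sets, proves theorems about them, proposes a derivation set tracing method based on weak inversion and verification of attribute mappings, gives a series of algorithms for derivation set tracing, and designs the basic process of data lineage tracing, forming a systematic body of theory and methods. The main work and contributions are as follows.

The author first completes and refines the concepts related to data lineage, gives a formal definition of the derivation set, and introduces the concepts of supplementary-set independence and supplementary-set dependence. These definitions and concepts are the basis for tracing derivation sets and the criterion for checking tracing results. On this basis, the author proves five theorems about derivation sets. They establish the relationships between transformations and attribute mappings, between derivation sets and attribute mappings, and between derivation sets and contribution sets, and they prove the supplementary-set independence of several classes of transformations. These theorems provide the basis and guiding idea for constructing and verifying weak derivation sets according to the invertibility of attribute mappings, and they enrich the basic theory of data lineage tracing.

Based on the ideas of invertibility and weak invertibility, the author proposes the Wivem method (Weak Inversion and Verification of Attribute Mapping) for computing the attribute-level derivation set of an attribute mapping. On this basis, the author analyzes the invertibility of transformations, gives a formal definition of weakly invertible transformations, and computes the tuple-level weak derivation set of a transformation by single-dimension and multi-dimension merging of the weak derivation sets obtained from the weak inverse mappings.

The author proves a uniqueness theorem and a solution theorem for the derivation sets of basic operations. The uniqueness theorem guarantees the correctness of the computed derivation sets. The solution theorem gives formulas from which the exact derivation sets of these basic operations can be computed directly, without verification and generally without accessing the input data sets, so the computation performs well.

Based on derived relations, the author gives a formal definition of the derivation set of a transformation diagram and proves the transitivity theorem for derivation sets. On this basis, the author designs the basic process for tracing the data lineage of a transformation diagram. For the stage of constructing weak derivation sets, the author introduces the concept of continuing traceability and gives an algorithm for deciding continuing traceability and an algorithm for filtering continuing-traceable weak inverse mappings. For the stage of verifying weak derivation sets, the author gives verification algorithms for the different types of transformations and attribute mappings.

To validate the proposed theory and methods, the author ran data lineage tracing experiments on the representative relational queries Q2 and Q12 of the TPC-H benchmark, confirmed the effectiveness of the derivation set theory and methods, and compared them in detail with Dr. Cui's method of tracing queries based on transformation properties. The experimental results show that, judged by tracing response time, storage requirements, and precision of the results, the Wivem method performs better overall than Dr. Cui's method.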

【Abstract】 The exact history of a given warehouse data item, including the complete description of its acquisition, transformation, and integration, is termed its data lineage. Data lineage includes two parts: (1) the set of source data items that exactly produces the warehouse data item; (2) the processes applied to that set of source data items. Identifying the data lineage of a given warehouse data item is termed data lineage tracing. As one of the most advanced research problems in data warehouse systems, data lineage tracing can play an important role in in-depth data analysis and can help validate the source data, cleaning rules, and transformation rules, thus improving the quality of the data warehouse.

Beginning with the formal definition of the derivation set, this thesis finds the general laws of derivation sets, proves theorems about them, proposes an approach to data lineage tracing based on weak inversion and verification of attribute mappings, gives a series of algorithms for data lineage tracing, describes the basic process of data lineage tracing, and thereby forms a systematic body of theory and methods. The primary work and contributions of this thesis are as follows.

First, the concepts related to data lineage tracing are completed and refined, and formal definitions of the derivation set and of supplementary-set independence and dependence are provided. These definitions form the basis for derivation set tracing and are also the criterion for verifying tracing results. The thesis then proves five theorems about derivation sets, which establish the relationships between transformations and attribute mappings, between derivation sets and attribute mappings, and between derivation sets and contribution sets, as well as the supplementary-set independence of several classes of transformations. These theorems are the basis and guideline for constructing and verifying weak derivation sets according to the invertibility of attribute mappings, and they enrich the basic theory of data lineage tracing.

Next, this thesis presents a data lineage tracing approach, Wivem (Weak Inversion and Verification of Attribute Mapping), which computes the attribute-level derivation set of an attribute mapping. The thesis then analyzes the invertibility of transformations, presents the formal definition of a weakly invertible transformation, and computes the tuple-level derivation set of a transformation by one-dimension and multi-dimension merging of the weak derivation sets obtained from the weak inverse attribute mappings. The thesis also proves the uniqueness and solution theorems for the derivation sets of basic relational operators.

Then, this thesis presents the formal definition of the derivation set of a transformation diagram, proves the derivation set transitivity theorem, and describes the basic process for tracing a transformation diagram. For the construction of weak derivation sets, the thesis introduces the concept of continuing traceability and provides a decision algorithm for the continuing traceability of a transformation sequence and a filtering algorithm for continuing-traceable weak inverse attribute mappings. For the verification of weak derivation sets, the thesis gives a series of verification algorithms based on the properties of the attribute mapping or transformation involved.

Finally, to validate the theory and approach, this thesis conducts data lineage tracing experiments with the relational queries Q2 and Q12 of TPC Benchmark H and compares the tracing performance with the query-tracing approach presented by Dr. Cui. The results show that the Wivem approach performs better overall than Cui's approach in terms of tracing time, storage cost, and precision of the tracing result.
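As a rough illustration of the "weak inversion, then verification" idea described in the abstract, the following Python sketch traces one cleaned attribute value back to its source rows. It is not the thesis's algorithm: the cleaning rule `clean_country`, the helper names `weak_inverse` and `trace`, and the sample rows are all hypothetical and chosen only to show how a loose candidate set (a weak derivation set) can be narrowed by re-applying the mapping.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]  # a source tuple, represented as a plain dict


def clean_country(raw: str) -> str:
    """Hypothetical attribute mapping: trim whitespace and upper-case."""
    return raw.strip().upper()


def weak_inverse(output_value: str) -> Callable[[str], bool]:
    """Hypothetical weak inverse: a necessary but not sufficient test.

    Every source value that maps to output_value passes this test,
    but some values that pass it map to something else.
    """
    return lambda raw: output_value in raw.upper()


def trace(
    sources: List[Row],
    column: str,
    mapping: Callable[[str], str],
    weak_inv: Callable[[str], Callable[[str], bool]],
    output_value: str,
) -> Tuple[List[Row], List[Row]]:
    """Return (weak candidate set, verified set) for one output value."""
    test = weak_inv(output_value)
    # Step 1: construct the candidate set with the weak inverse.
    weak_set = [row for row in sources if test(str(row[column]))]
    # Step 2: verify each candidate by re-applying the mapping.
    verified = [
        row for row in weak_set if mapping(str(row[column])) == output_value
    ]
    return weak_set, verified


SOURCE_ROWS: List[Row] = [
    {"id": 1, "country": " us"},
    {"id": 2, "country": "US"},
    {"id": 3, "country": "usa"},
    {"id": 4, "country": "de"},
]

weak_set, lineage = trace(
    SOURCE_ROWS, "country", clean_country, weak_inverse, "US"
)
print([row["id"] for row in weak_set])  # candidates from the weak inverse
print([row["id"] for row in lineage])   # candidates that survive verification
```

In this toy case the weak inverse admits row 3 (`"usa"`), which the verification step then rejects because it cleans to `"USA"` rather than `"US"`.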

  • 【Classification number】 TP311.131
  • 【Times cited】 10
  • 【Downloads】 421