异构数据映射技术研究

Research on Mapping of Heterogeneous Data Integration

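【Author】 缪嘉嘉 (Miao Jiajia)

【Advisor】 吴泉源 (Wu Quanyuan)

【Author Information】 National University of Defense Technology, Computer Science and Technology, 2008, Ph.D.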

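【Chinese Abstract】 Data integration is the foundation of information integration. As the demand for comprehensive use of information keeps growing, integrating large-scale heterogeneous data has become a research focus in the information integration field. The key to heterogeneous data integration is using mapping techniques to establish consistency among heterogeneous data, including the consistency of data attributes or schemas and the consistency of data entities or tuple instances. This dissertation studies the mapping and matching techniques that establish schema and data consistency in large-scale data integration. It uses machine learning, natural language processing, and fuzzy theory to develop and improve existing methods for schema mapping, instance matching, and broken-mapping detection, and it extends the heterogeneous data integration platform StarEAI, validating the proposed methods and techniques in real applications. The main contributions are as follows. 1. For schema-level consistency, the dissertation proposes MSMA, a multi-strategy schema mapping method based on data instances. For instance data with good structural features, it first designs learners based on instance structure (data format, constraint, mean, Bayesian, and others) from the feature information of a large number of samples and produces predictive classification models; machine learning is then used to extract the feature information of the data to be matched and perform schema mapping. It further improves the combination algorithm by taking labels as the combiner's input, which effectively reduces the computational complexity of the combination algorithm. Experiments show that MSMA reaches a recall of up to 89% and a precision of up to 93%, and that when schema information is missing its accuracy is 7% higher than that of the well-known mapping method LSD. 2. For data-level consistency, the dissertation proposes HIMA, a tuple instance matching method based on cluster analysis. In terms of framework, HIMA uses a clustering algorithm and is therefore more efficient than one-to-one matching algorithms. Within the clustering algorithm, the distance between elements is computed with a string similarity measure based on conditional probability distributions, which effectively improves matching accuracy. In addition, because instance descriptions are verbose in some applications, the dissertation proposes keyword extraction based on the maximum entropy model to remove irrelevant information. Experiments show that HIMA, using the conditional probability distribution distance measure and the keyword extraction algorithm, reaches an accuracy of 83%, 6% higher than distance-based and token-based algorithms. 3. For schema mappings that break at run time, the dissertation proposes BSMD, a broken-mapping detection method based on fuzzy aggregation operators. It studies the various cases of fusing the results of the value, trend, layout, and other learners, and adds a disjunction-weighted fuzzy aggregation operator to improve fusion precision. When fusing the results of training on artificial data and real data, it introduces a variable-weight method so that the fused result reflects both the preference for the relative importance of each factor and the preference for a balanced state among the factors. Experiments show that BSMD reaches an average accuracy of 85%, 7% higher than the existing Maveric method. 4. Building on the above research, the heterogeneous data integration platform StarEAI, a National 863 Program result of our institute, was extended with automatic schema mapping, tuple instance matching, and run-time broken-mapping detection. The extended platform has been applied successfully in a network monitoring data integration project and in military projects.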
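
The cluster-based instance matching idea behind HIMA can be illustrated with a minimal sketch. It is not the dissertation's algorithm: the standard-library `difflib.SequenceMatcher` ratio stands in for the conditional-probability string measure, and the leader-style clustering, the function names, and the 0.8 threshold are all assumptions chosen for illustration. The point it shows is that records pooled from any number of sources are compared against one representative per cluster, not matched source against source.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1].

    Stand-in measure (assumption): the dissertation uses a measure based on
    conditional probability distributions, which is not reproduced here.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_instances(records, threshold=0.8):
    """Group records that appear to describe the same entity.

    Leader-style clustering (assumption): each record is compared only with
    the first member of every existing cluster and joins the most similar
    cluster whose score reaches the threshold; otherwise it starts a new one.
    """
    clusters = []
    for record in records:
        best_cluster, best_score = None, threshold
        for cluster in clusters:
            score = similarity(record, cluster[0])
            if score >= best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([record])
        else:
            best_cluster.append(record)
    return clusters


if __name__ == "__main__":
    # Hypothetical records pooled from several data sources.
    pooled = ["IBM Corporation", "Microsoft Corp", "ibm corporaton", "MICROSOFT CORP"]
    for group in cluster_instances(pooled):
        print(group)
```

Comparing against cluster representatives keeps the number of similarity computations proportional to the number of clusters per record, which is one simple way to avoid matching every pair of sources separately.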

【Abstract】 Data integration is the basis of information integration. As demands on the use of information keep growing, large-scale heterogeneous data integration has become a hot topic in information research. Mapping technology is the key to establishing consistency among heterogeneous data, including consistency of the data model and consistency of data instances. This dissertation studies the mapping and matching technologies that maintain consistency among heterogeneous data. By introducing machine learning, natural language processing, and fuzzy theory, we improve the schema mapping and instance matching approaches and optimize the broken-mapping detection algorithm. In practice, we extend the heterogeneous data integration platform StarEAI and verify our approaches in widely used real-world applications. This dissertation makes four contributions. First, to address consistency at the schema level, we propose an Instance-based Multi-Strategy Schema Matching Approach (MSMA). Schema mapping uses schema information and other descriptions, along with the characteristics of instances, to identify the relations between different schemas. Both rule-based and machine-learning-based approaches tackle this problem. Existing mapping approaches build the decision model either automatically or manually, and the machine-learning-based approach is more adaptable. A single learner decides whether a relationship holds from one type of available information, whereas the multi-strategy approach considers a variety of information. Consequently, the multi-strategy approach makes fuller use of the available information and can improve mapping accuracy. 
MSMA designs a number of learners to capture the information in instances and improves the multi-strategy approach. The experimental results show that the precision of MSMA is up to 89% and its recall is up to 93%. When schema information is lacking, MSMA is more precise than the original approach. Second, for consistency at the instance level, we propose a Holistic Data Instance Matching Approach (HIMA). Heterogeneous instances are different descriptions of the same entity in different data sources, and instance matching eliminates this heterogeneity. We measure the similarity of instances with a string distance algorithm; the conditional-probability-based algorithm improves the accuracy of the whole approach. In terms of framework, traditional methods take only two input data sources and perform pairwise matching. HIMA uses a clustering algorithm, so it can handle a large number of data sources holistically. In addition, we use a keyword extraction method based on the maximum entropy model to remove useless information. The experimental results show that the keyword extraction algorithm achieves 70% precision and that the conditional-probability-based algorithm is more precise than the token-based algorithm. HIMA achieves 83% accuracy. Third, to handle run-time broken-mapping detection, we put forward a Fuzzy-based Broken Schema Mapping Detecting Approach (BSMD). In a dynamic distributed environment, data sources tend to undergo changes that invalidate the mappings. Continuously monitoring mappings for such changes is extremely labor intensive and poses a key bottleneck to the widespread deployment of data integration systems. Like the Maveric system, the kernel of BSMD is a set of computationally inexpensive modules called sensors, which capture salient characteristics of data sources. 
We develop two novel improvements: Disjunction-Weighted Average Operators are used to calculate the score that indicates whether a mapping is broken, and Change Weight Operators are introduced to combine artificial data with real data in the training phase. Experiments over real-world sources demonstrate the effectiveness of our fuzzy-based approach over existing solutions, as well as the utility of our improvements. Finally, based on the above studies, we extend the heterogeneous data integration platform StarEAI, which is the outcome of an 863 project, with three modules: the automatic schema mapping module, the instance matching module, and the broken-mapping detection module. The resulting StarEAI+ system has been successfully deployed in armed forces and network monitoring projects.
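
The idea of fusing sensor scores with weights that change according to the scores themselves can be sketched as follows. This is an illustration only, not the operators defined in the dissertation: it uses one common penalty-type variable-weight form, w_j(x) = w_j · x_j^(α−1) / Σ_k w_k · x_k^(α−1) with 0 < α ≤ 1, and the function names, the example scores, and the 0.6 threshold are assumptions. Each score is taken to lie in [0, 1], with 1 meaning the sensor sees the data source as consistent with the mapping.

```python
def variable_weights(base_weights, scores, alpha=0.5, eps=1e-9):
    """Rescale base weights so that low-scoring factors gain influence.

    Penalty-type variable weights (assumed form): with alpha = 1 the base
    weights are returned unchanged (after normalization); smaller alpha
    shifts weight toward factors whose score is low.
    """
    raw = [w * max(s, eps) ** (alpha - 1.0) for w, s in zip(base_weights, scores)]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate(base_weights, scores, alpha=0.5):
    """Weighted average of sensor scores under variable weights."""
    weights = variable_weights(base_weights, scores, alpha)
    return sum(w * s for w, s in zip(weights, scores))


def is_broken(base_weights, scores, alpha=0.5, threshold=0.6):
    """Flag the mapping as broken when the fused score falls below the
    threshold (the threshold value is a hypothetical choice)."""
    return aggregate(base_weights, scores, alpha) < threshold


if __name__ == "__main__":
    # Hypothetical sensors: value, trend, layout. The layout sensor fails badly.
    base = [0.5, 0.3, 0.2]
    scores = [0.9, 0.8, 0.1]
    print(aggregate(base, scores, alpha=1.0))  # constant weights
    print(aggregate(base, scores, alpha=0.5))  # variable weights
    print(is_broken(base, scores, alpha=0.5))
```

With constant weights, one strongly failing sensor can be masked by two healthy ones. Under the variable-weight form the failing sensor gains weight, so the fused score reflects both the relative importance of the sensors and how balanced their states are.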
