面向多类型数据源的数据仓库构建及ETL关键技术的研究

Research on Key Techniques of Data Warehousing and ETL for Multi-type Data Sources

【Author】 Song Jie (宋杰)

【Supervisor】 Yu Ge (于戈)

【Author information】 Northeastern University (东北大学), Computer Software and Theory, 2008, PhD

【摘要】 数据仓库的创建与应用是企业信息化发展的必由之路。近十年来,为满足数据的集成、管理和决策支持的目的,在世界各地出现了大量的、不同规模的数据仓库系统。数据仓库数据源的类型也越来越多样化。尤其是Web数据源,文本数据源等实时数据源的出现,给数据仓库的构建以及ETL提出新的挑战。数据仓库技术面临若干紧迫问题:如何构建一个完善的数据仓库体系以适应多种类型的数据源;如何高效实现数据仓库体系中各个层次的ETL过程;如何保证ETL的实时性以及如何改进数据仓库的访问控制模型等。本文针对多类型数据源的特点,首先分析现有数据仓库的需求和数据源的种类。本文以国家海洋数据仓库系统为例,利用局部ETL和全局ETL两段式ETL过程;演化面向多类型数据源的数据仓库体系结构,包括抽取层、归档层、汇总层、仓库层和应用层,并且详细论证了每一层的设计思路和作用。基于此,本文研究了每一层涉及的若干关键问题。抽取层和归档层主要完成数据的抽取和归档工作,该层的ETL软件实现从数据源中抽取数据并装载到归档库中,因此称为局部ETL。本文重点研究了无结构的Web页面,半结构化文本和结构化的关系型数据库这三种数据源的局部ETL技术。首先,针对无结构的Web页面数据源的局部ETL问题,提出一种较传统方式更为高效的Web页面采集存储方法。把页面按照其布局特点分为若干个区域,把这些区域作为变化检测、存储和处理单元。其次,针对半结构化文本数据源的局部ETL问题,重点研究了半结构化非自描述型科学文本数据,提出了一种文本数据关系化方法,实现从文本模型到对象模型进而到关系模型的转换。此外如何保障关系化的效率和安全性也是本研究的重点。再次,针对结构化关系数据库数据源的局部ETL问题,本文分析和总结了影响ETL引擎性能的主要因素,提出了一种基于分布式数据库的ETL新方法,还提出了一种元数据驱动的ETL方法来克服现有ETL工具和手工编码方式的不足。基于E-LT方法,本文利用SQL语言实现了元数据驱动的ETL工具并详细测试了其执行性能。汇总层和仓库层完成从各个数据源的归档区到数据仓库的数据集成工作,该ETL过程称为全局ETL。由于数据仓库的实时性要求,多数据源全局ETL不仅要面临数据集成问题,还要保证ETL的实时或是近实时调度。本文提出了按照集成的自身规则触发ETL过程,并分配资源,以解决全局ETL的调度执行,以及它和其它数据仓库应用之间争夺数据仓库资源的问题。由于实时ETL执行过程中独占数据仓库资源,应用端一时无法连接数据仓库而处于一种离线状态。本文设计了一个支持离线运行的客户端框架,使得短时离线的过程对客户端用户透明。该离线客户端框架属于环境可感知软件框架,具有一定的通用性。数据仓库应用层主要包含查询检索,OLAP,数据挖掘等应用,还包括各应用的访问控制系统。数据仓库应用乃至数据仓库自身都需要一种良好的访问控制机制。本文提出两种访问控制模型。基于角色和上下文的访问控制模型是经典的基于角色的访问控制模型的扩展,适用于数据仓库应用以及任何面向最终用户的软件系统的访问控制。基于意图的访问控制模型适用于数据库系统,数据仓库系统等面向应用软件的系统的访问控制。本研究还在后者的基础上进一步研究了意图间的层次关系挖掘算法。总之,本文提出了一种面向多类型数据源的数据仓库体系结构和层次划分,基于该体系结构对各层次的关键问题进行分析和研究。所提出的所有模型和算法均给出实现方法或运用在实际项目中,理论分析和实验证明了所提出方法和技术的可行性和有效性。整个研究内容围绕着数据仓库和ETL过程的设计和实施,保证了数据仓库系统中数据的流动和访问的实时、灵活、高效,对数据仓库的建设和ETL的实施有一定指导作用。

【Abstract】 Building and applying data warehouses is an essential step in enterprise informatization. Over the past decade, a large number of data warehouse systems of different scales have appeared around the world to meet the needs of data integration, management, and decision support. The data sources of data warehouses have also become increasingly diverse. In particular, the emergence of real-time data sources such as Web and text data poses new challenges to data warehouse construction and ETL. Data warehouse technology faces several pressing problems: how to build a complete data warehouse architecture that accommodates multiple types of data sources; how to implement the ETL process efficiently at each layer of the architecture; how to guarantee real-time or near-real-time ETL; and how to improve the access control model of the data warehouse.

Focusing on the characteristics of multi-type data sources, this dissertation first analyzes the requirements of existing data warehouses and the categories of data sources. Taking the national marine data warehouse system as an example and using a two-stage ETL process consisting of local ETL and global ETL, it proposes a data warehouse architecture for multi-type data sources that comprises an extraction layer, an archive layer, a summary layer, a warehouse layer, and an application layer, and it discusses the design and role of each layer in detail. On this basis, the key problems of each layer are studied.

The extraction and archive layers mainly extract and archive data. The ETL software at these layers extracts data from the data sources and loads it into the archive database, so it is called local ETL. This dissertation studies local ETL for three kinds of data sources: unstructured Web pages, semi-structured text, and structured relational databases.

First, for local ETL on unstructured Web page data sources, a more efficient approach to collecting and storing Web pages than the conventional one is proposed.
The approach divides a Web page into several blocks according to its layout and treats these blocks as the units of version comparison, incremental storage, and subsequent processing.

Second, for local ETL on semi-structured text data sources, the dissertation studies non-self-describing, semi-structured scientific text data and proposes an approach for the relationalization of textual data, which converts the text model to an object model and then to a relational model. The efficiency and security of the relationalization are also emphasized.

Third, for local ETL on structured relational database data sources, the main factors affecting the performance of the ETL engine are analyzed and summarized, and a new ETL approach based on a distributed database is proposed. In addition, a metadata-driven ETL approach is proposed to overcome the shortcomings of existing ETL tools and hand coding, offering better flexibility, extensibility, and operability. Based on the E-LT approach, a metadata-driven ETL tool is implemented in SQL and its execution performance is tested in detail.

The summary and warehouse layers integrate data from the archive areas of the various data sources into the data warehouse; this ETL process is called global ETL. Because of the real-time requirements of the data warehouse, global ETL over multiple data sources must not only handle data integration but also guarantee real-time or near-real-time ETL scheduling. To decide when global ETL should be scheduled and to resolve its competition with other data warehouse applications for data warehouse resources, a new scheduling approach for real-time ETL is proposed, which triggers the ETL process and allocates resources according to the integration rules. Because real-time ETL uses data warehouse resources exclusively while it executes, running applications temporarily lose their connections to the data warehouse and are effectively offline.
To make these short offline periods transparent to end users, a client framework that supports offline operation is designed. The offline client framework is an environment-aware software framework with a degree of generality.

The application layer of the data warehouse includes query and retrieval, OLAP, and data mining applications, as well as the access control system for these applications. Both the applications and the data warehouse itself need a sound access control mechanism. Two access control models are proposed in this dissertation. The role- and context-based access control model extends the classical role-based access control (RBAC) model and is suited to data warehouse applications and to any end-user-oriented software system. The purpose-based access control model is suited to database systems, data warehouse systems, and other systems that serve application software. Building on the latter model, an algorithm for mining the hierarchical relationships among purposes is also studied.

In conclusion, this dissertation proposes a data warehouse architecture for multi-type data sources together with its layer division and, based on this architecture, analyzes and studies the key problems of each layer. All proposed models and algorithms are either given an implementation method or applied in real projects, and theoretical analysis and experiments demonstrate the feasibility and effectiveness of the proposed methods and techniques. The research centers on the design and implementation of data warehouses and ETL processes, ensures timely, flexible, and efficient data flow and data access in the data warehouse system, and offers some guidance for building data warehouses and implementing ETL.
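
The block-level Web page collection idea mentioned in the abstract (treating layout blocks as the units of version comparison and incremental storage) can be illustrated with a minimal Python sketch. This is not the dissertation's algorithm: the abstract does not say how blocks are identified or compared, so the sketch assumes the page has already been split into named blocks, and it uses SHA-256 digests as a stand-in comparison method. The names `block_digest`, `detect_changes`, and `store_increment` and the block ids are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def block_digest(text: str) -> str:
    """Fingerprint one block; whitespace runs are collapsed first (an assumption)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(
    stored: Dict[str, str], blocks: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Compare a new crawl against the digests of the previous version.

    stored: block id -> digest saved from the previous version
    blocks: block id -> block text from the current crawl
    Returns (changed_or_new_ids, removed_ids), both sorted.
    """
    changed = sorted(
        bid for bid, text in blocks.items()
        if stored.get(bid) != block_digest(text)
    )
    removed = sorted(bid for bid in stored if bid not in blocks)
    return changed, removed


def store_increment(stored: Dict[str, str], blocks: Dict[str, str]) -> List[str]:
    """Update the digest store in place and return the ids that had to be rewritten."""
    changed, removed = detect_changes(stored, blocks)
    for bid in changed:
        stored[bid] = block_digest(blocks[bid])
    for bid in removed:
        del stored[bid]
    return changed


# Hypothetical page, already split into layout blocks.
page_v1 = {"header": "Ocean Data Portal", "news": "Buoy 12 online", "footer": "Contact"}
page_v2 = {"header": "Ocean  Data Portal", "news": "Buoy 12 offline"}

digests: Dict[str, str] = {}
first_write = store_increment(digests, page_v1)   # every block is new
second_diff = detect_changes(digests, page_v2)    # only "news" differs; "footer" is gone
```

In this sketch only the blocks whose digest differs are rewritten on a later crawl, which is the intuition behind storing increments per block rather than per page.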

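  • 【Online publication contributor】 Northeastern University (东北大学)
  • 【Online publication year and issue】 2011, Issue 06
  • 【Classification code】 TP311.13
  • 【Citations】 20
  • 【Downloads】 1159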