流式数据多维建模与查询关键技术研究

Research on Key Issues of Data Stream Multi-dimensional Modeling and Querying

【Author】 Hou Dongfeng (侯东风)

【Advisor】 Zhang Weiming (张维明)

【Author information】 National University of Defense Technology (国防科学技术大学), Management Science and Engineering, 2010, PhD

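【Abstract, translated from the Chinese】 In recent years, as data stream applications have expanded, users urgently need to discover anomalous patterns, patterns of interest, and development trends in these data from different dimensional perspectives and at different levels of data granularity, in order to support real-time decision making. Research on multidimensional modeling and querying of data streams responds to this practical need. Data streams differ from traditional data in being dynamic, unbounded, and bursty. In addition, multidimensional queries are more complex than traditional analysis methods, so data modeling, data organization, and query computation all face major challenges.

To address these challenges, this dissertation studies several key techniques: the multi-level time window model, the multidimensional model of data streams, stream cube computation, and multidimensional continuous query computation. The multi-level time window model bounds the infiniteness of the data stream while expressing its multiple time granularities and its dynamic character; an adaptive aggregate tree structure maintains the aggregates at different time granularities within the current time window, supporting aggregate computation over data streams. A multidimensional model of data streams is defined in which the multi-level time window model describes the time dimension and thereby the dynamics of the model, and a basic operation algebra, an analysis operation algebra, and a maintenance operation algebra are defined, laying the theoretical foundation for multidimensional organization and querying of data streams. For the multidimensional organization problem, a stream cube computation method based on an interesting view subset is proposed; it uses a multi-way aggregate tree structure to materialize the cells of the interesting views and supports dynamic cube updates and multidimensional query computation. For the multidimensional continuous query problem, a basic computation framework based on query state maintenance is proposed, and an index structure is built to improve the execution efficiency of multidimensional continuous queries. Finally, Web log analysis is used as an example to illustrate the practical application of multidimensional queries over data streams.

The main contributions of the dissertation are as follows:

(1) A multi-level time window model is proposed. A time granularity system describes the aggregation relationships among data at different time granularity levels, the unbounded set of stream data is mapped onto a bounded sliding window, and the model accommodates the multiple time granularities required by user queries, providing basic support for data stream processing. An adaptive aggregate computation method is proposed to maintain aggregates over the multi-level time window: an adaptive hierarchical aggregate tree maintains the aggregates in the current time window, and its sparse parts retain only high-level aggregate values. Experimental results show that the method has a fairly clear advantage in aggregate computation over non-stationary data streams.

(2) A multidimensional model of data streams is proposed. The multi-level time window model describes the time dimension, bounding its infiniteness while expressing the multi-granularity of the dimension, and the continual change of the stream facts in the window describes the dynamic character of the model. To support multidimensional query applications over data streams, a basic operation algebra, an analysis operation algebra, and a maintenance operation algebra are defined over multidimensional instances of data streams. Finally, the applicable scope and constraints of the model are analyzed with respect to the infiniteness of the time dimension, the dynamics of stream facts, and the complexity of aggregate functions. The definitions of the model and its operation algebras reflect the dynamic character of multidimensional computation and lay the theoretical foundation for data organization and querying.

(3) For the dynamic multidimensional organization of data streams, a stream cube computation method based on an interesting view subset is proposed. The interesting views reflect users' query needs and make up only a small part of the full set of views; materializing their cells reduces storage consumption while still meeting most user needs. The method uses a multi-way aggregate tree structure to maintain the materialized cells and their aggregation relationships, supporting fast data updates, ad hoc queries, and analysis. During computation, multi-level time window constraints and an adaptive partitioning strategy further reduce the storage used. Experimental results show that the method meets user query needs with high time and space efficiency.

(4) For the multidimensional continuous query computation problem, a query computation framework based on query state maintenance is proposed. The state of a multidimensional continuous query maintains the cells that affect the results of continuous execution, and update, remove, and output-result operations support the dynamic computation of the query. To improve the efficiency of continuous query computation, an index tree structure is built on the selection conditions of the continuous queries to support fast update and maintenance of query state. Experimental results show that the method provides an effective way to implement multidimensional continuous queries, with high time and space efficiency.

In summary, this dissertation carries out breakthrough research on key techniques for multidimensional modeling, multidimensional organization, and query computation over data streams, and proposes corresponding theories and methods. It has theoretical and practical significance for promoting the development of data stream applications.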

【Abstract】 In recent years, as data stream applications have spread across a wide range of fields, users want to discover trends and unusual or interesting patterns from diverse combinations of dimensions and at different granularities, in support of real-time decision making. Research on multidimensional modeling and querying of data streams is conducted to meet this requirement. Compared with traditional data, data streams are variable, unbounded, and bursty; and compared with traditional analysis methods, multidimensional queries are highly sophisticated. Together these properties present major challenges to data modeling, storage, and querying.

In response to these challenges, this dissertation addresses several key problems: the multi-level time window model, the multidimensional model of data streams, stream cube computation, and multidimensional continuous queries. The multi-level time window model bounds the infiniteness of the data stream and describes its multiple time granularities and its variability. The aggregated values of the multi-level time window are maintained in an adaptive hierarchical aggregate tree for aggregate computation. A multidimensional model of data streams is defined for data organization. Its time dimension follows the multi-level time window model, which represents the dynamic property of the multidimensional model, and a basic algebra, an analysis algebra, and a maintenance algebra are defined to lay the theoretical foundation for multidimensional organization and querying. A stream cubing method based on an interesting view subset is put forward for the multidimensional organization of data streams; in this method, a multi-way aggregate tree is established to maintain the cells of the interesting views, and the tree can be updated dynamically to answer multidimensional queries. A framework based on query state maintenance is designed for computing multidimensional continuous queries, and an index structure is built to improve the efficiency of query execution. Finally, the application of multidimensional queries is illustrated with a Web log analysis case.

The main contributions of the dissertation are as follows:

(1) A multi-level time window model is put forward that maps the unbounded data stream onto a sliding window. The relations between levels are described by a time granularity system, which satisfies the multiple time granularities of queries and provides foundational support for data stream processing. An adaptive method for computing aggregates over the time window is studied. It adopts an adaptive hierarchical aggregate tree as its basic structure, in whose sparse parts only the high-level values are held. Experiments show that the method outperforms others on bursty data streams.

(2) A multidimensional model of data streams is proposed. The time dimension is described by the multi-level time window model, which bounds the infiniteness of the time dimension and expresses its multiple granularities, and the dynamics of the model are depicted by the evolution of the stream facts. Algebras are defined for expressing multidimensional queries, including a basic algebra, an analysis algebra, and a maintenance algebra. The scope and restrictions of the model are then analyzed with respect to the infiniteness of the time dimension, the dynamics of stream facts, and the sophistication of aggregate functions. The definitions of the multidimensional model and its algebras depict the dynamics of multidimensional computation and lay the theoretical foundation for organizing and querying data streams.

(3) A stream cube computation method based on an interesting view subset is proposed for the dynamic multidimensional organization of data streams. The interesting view set reflects the requirements of queries and covers only a small part of all views. Materializing the cells of the interesting views reduces memory consumption while still satisfying the needs of most users. In this method, a multi-way aggregate tree is adopted to maintain the cells and their relations; it supports fast cube updates and ad hoc query answering. At run time, the storage required by the structure is reduced by the multi-level time window and an adaptive partition strategy. Experiments show that the method satisfies users' requirements and is efficient in time and space.

(4) A query computation framework based on query state maintenance is proposed for multidimensional continuous queries. The state of a continuous query holds the cells that may contribute to any future query result, and the update, remove, and generate-result operators support the dynamic computation of multidimensional continuous queries. An index tree is also constructed on the selection predicates of the continuous queries to improve the efficiency of state updates. Experiments show that the method is effective for implementing multidimensional continuous queries and is efficient in time and space.

In conclusion, this dissertation focuses on several key issues in multidimensional modeling and querying of data streams and studies a series of algorithms and theories. It is significant in theory and practice for the development of data stream applications.
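
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea behind a multi-level time window: recent data is kept in fine-grained buckets, and buckets that age out are rolled up into a coarser level, so the whole window stays bounded. The class name `MultiLevelWindow`, the parameters `widths` and `capacities`, the choice of sum as the aggregate, and the assumption that timestamps arrive in non-decreasing order are all hypothetical. The sketch does not reproduce the dissertation's adaptive hierarchical aggregate tree or its handling of sparse regions.

```python
from collections import deque


class MultiLevelWindow:
    """Bounded window keeping sums at several time granularities.

    Hypothetical illustration only: level 0 holds the finest buckets;
    a bucket evicted from one level is merged into the next, coarser level.
    Assumes timestamps are inserted in non-decreasing order.
    """

    def __init__(self, widths, capacities):
        # widths[i]: bucket width of level i (each a multiple of the previous)
        # capacities[i]: maximum number of buckets kept at level i
        self.widths = widths
        self.capacities = capacities
        self.levels = [deque() for _ in widths]  # entries: [bucket_start, sum]

    def insert(self, timestamp, value):
        self._add(0, timestamp, value)

    def _add(self, level, timestamp, value):
        width = self.widths[level]
        start = timestamp - timestamp % width  # align to the bucket boundary
        buckets = self.levels[level]
        if buckets and buckets[-1][0] == start:
            buckets[-1][1] += value  # same bucket: accumulate
        else:
            buckets.append([start, value])
        # Evict the oldest buckets once the level exceeds its capacity.
        while len(buckets) > self.capacities[level]:
            old_start, old_sum = buckets.popleft()
            if level + 1 < len(self.levels):
                self._add(level + 1, old_start, old_sum)  # roll up
            # Otherwise the bucket leaves the window entirely.

    def snapshot(self):
        """Return the buckets of every level as lists of (start, sum)."""
        return [[(s, v) for s, v in level] for level in self.levels]

    def total(self):
        """Sum of everything still inside the window."""
        return sum(v for level in self.levels for _, v in level)


# Usage: unit-width buckets for the last 3 time units, then width-10 buckets.
window = MultiLevelWindow(widths=[1, 10], capacities=[3, 2])
for t in range(25):
    window.insert(t, 1)
print(window.snapshot())
# prints [[(22, 1), (23, 1), (24, 1)], [(10, 10), (20, 2)]]
```

In this sketch the oldest coarse bucket (time units 0 to 9) has already dropped out of the window, which is how the structure keeps an unbounded stream within a fixed amount of memory.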
