节点文献

高性能数据立方体及其语义研究

Research on High Performance Data Cube and Its Semantics

【作者】 师智斌

【导师】 黄厚宽;

【作者基本信息】 北京交通大学 , 计算机应用技术, 2010, 博士

【摘要】 数据立方体技术是联机分析处理的主要手段。随着数据规模的扩大和维数的增加,数据立方体的操作代价急剧增加,需要进行优化处理。目前数据立方体的研究包括:物化、索引、近似、压缩、约简以及联机聚集等。形式概念分析理论(FCA)是以形式化的概念和概念层次为基础的数学分析工具。研究发现,概念格作为FCA的核心结构与数据立方体格都基于序结构,并且以数据仓库中的基本表作为形式背景,FCA理论中与概念相对应的等价特征组与数据立方体的覆盖等价类对数据单元具有相同的划分结果。本文将FCA和概念格理论引入数据立方体的研究,进行高性能数据立方体及其语义研究。研究表明,FCA及相关理论的引入,为数据立方体研究提供了一个新的有力的分析工具,利用该工具可以从数据内部特性入手,实现结构简单、体积较小且性能较优的数据立方体,并使数据立方体语义的理解更深刻,更易于实现。主要的研究工作如下:(1)提出基于形式概念格结构表达的数据立方体。首先对数据立方体与形式概念格进行相关分析,以概念格结构表达数据立方体,提出聚集概念和聚集概念格结构(ACL)。ACL是一种完全的数据立方体结构,由于其内具有相同聚集值的若干单元用一个聚集概念表示,因此能实现与商立方体相同的约简。另外,ACL结构中概念间的泛化和例化关系反映了约简后数据之间的层次关联,可表达比商立方体更清晰的数据立方体语义关系。其次,在ACL基础上,本文提出约简聚集概念结构(RAC)。基于形式概念分析理论中G偏序关系的性质研究发现,由于基本表的完备性,基本表中各个元组与ACL结构中的对象概念一一对应,因此基本表可以看作是所有对象概念的集合。RAC结构对ACL进一步约简,去除所有对象概念和特殊概念(Ω,null)。与基本表联合,RAC仍然是完全的立方体结构,但能实现比商立方体和ACL结构更大的约简,且仍能保持所有非对象聚集概念之间的语义关系。第三,基于形式概念分析理论中M偏序集的性质,提出基于ACL和RAC高效的查询方法。该方法利用属性概念内涵m″确定在ACL和RAC上的查询搜索路径,避免全范围的搜索,查询效率较高。最后,对形式背景进行讨论,将概念格的属性约简理论应用于数据立方体,通过合并相对必要属性、删除绝对不必要属性实现形式背景的简化,最终实现数据立方体相关操作的简化。(2)研究形式背景的属性蕴含关系,采用关系系统存储,提出基于属性蕴含的约简聚集概念数据立方体结构(RAC-AI)。根据形式概念分析理论,研究形式背景中描述概念格的两类非平凡属性蕴含:前提是伪内涵的蕴含和前提是真前提的蕴含。研究通过属性蕴含而不再依赖概念格结构确定概念内涵。在RAC结构基础上,提出两种基于属性蕴含的约简聚集概念数据立方体结构(RAC-AI):基于前提是伪内涵和基于前提是真前提的RAC-AI结构。RAC-AI结构摒弃RAC复杂的概念格结构,增加属性蕴含表,记录形式背景中所有非平凡的蕴含,并采用主流的关系系统存储所有非对象聚集概念。理论分析和实验结果表明,RAC-AI体积小,结构简单,构建和增量维护代价较低,查询响应速度也较快,是目前综合性能较优的数据立方体。(3)数据立方体语义关系的挖掘和应用直接影响联机分析处理的各种操作。本文研究基于FCA和概念格理论的数据立方体语义操作实现。首先讨论形式背景的净化和约简,消除形式背景中的冗余信息。现有的数据立方体语义研究都未考虑对数据本身进行约简,大量冗余信息的存在干扰了对语义的理解和发现。其次,利用形式概念分析的M偏序集理论,将M偏序关系作为生成概念分层的一种启发式的规则,形成属性级别的概念分层语义,而现有的概念分层一般只进行到维级别。第三,利用M偏序关系和非平凡的属性蕴含,实现数据立方体单元之间的上卷和下钻语义操作。通过分析等价特征组上界和下界的特性,获得等价特征组的结构,实现具有相同聚集值单元之间的上卷和下钻语义操作。利用非平凡的属性蕴含获取任意概念的父概念和子概念的内涵,实现不同聚集值单元的上卷和下钻语义操作。该方法不依赖任何特殊结构,实现从数据立方体任意单元出发的上卷和下钻操作,重复这个过程,能在数据立方体格中漫游,而不必生成完整的数据立方体。现有的数据立方体上卷和下钻语义操作一般只进行到视图级别,能达到单元级别的一般要依赖复杂特殊的结构实现。(4)范围查询是应用于多维数据立方体的有效的分析工具,预计算技术是提高范围查询响应速度的一种方法。本文在现有prefix sum技术和分块技术基础上,提出基于前缀区域的不规则方体的分块方法PRC,这种分块方法利于从起始单元开始的前缀区域聚集值的计算。对d维数据立方体(假定每维的度都为n),PRC在分块及区域求和时使用回归分割技术,在不增加额外空间的基础上,实现范围查询和数据更新的代价都为O(log~dn)。

【Abstract】 Data cube is the main means to implement On Line Analytical Processing (OLAP). With sharply increasing of number of data dimensions and data scale, operation cost on it is very high and optimization should be done for data cube. The current technologies of data cube include partially materialization, index methods, approximate query, data compress, data reduction, online aggregation and so on.The Formal Concept Analysis (FCA) is a mathematical tool for analysis based on formal concept and concept hierarchy and has been widely used in the fields of knowledge discovery and data analysis. A comparative examination of data cube lattice and formal concept lattice shows that each of them is based on order-structure and furthermore if base table in data warehousing is looked as a concept context, equivalent character groups corresponding to concept in FCA and cover equivalence classes of cells in data cube have the same partitions for data. The theories of FCA and concept lattice are introduced into the research of data cube in this paper firstly and research on high performance data cube and implementation of semantics in data cube are carried out in succession. Study shows that introduce of FCA theories into data cube can provide a new powerful analytical tool for data cube. And using this tool and beginning with data characteristics, high performance data cube that has simple structure, low cost and small size can be obtained. Semantics in data cube can be understood deeply and be realized easily. Some achievements are obtained as follows:(1)Data cube structures based on concept lattice are put forward.Firstly, correlations of concept lattice and data cube lattice are analyzed and concept lattice is applied to data cube lattice. Aggregate Concept and Aggregate Concept Lattice (ACL) are brought forward in this paper. ACL can contain all aggregations of cube integrally. As a same aggregate value possessed by several cells is recorded with one aggregate concept, ACL can reduce data cube in same ratio with queotient cube. In addition relations of generalization and specialization among aggregate concepts in ACL can express the hierarchical relations among cells after cube reduction and keep semantic relations of data cube.Secondly, based on ACL, Reductive Aggregate Concept structure (RAC) is proposed in this paper. Based on properties of G partial relations in theories of FCA and completeness of base table in data warehousing, the one to one corresponding of object concepts derived from base table in ACL and tuples in base table is found. So base table can be looked as set of all object concepts. All object concepts and special (Ω, null) are removed based on ACL in RAC. Combined with base table, RAC is still a fully pre-computed cube and can realize reduction of data cube in more high ratio compared with ACL and quotient cube. And in RAC hierarchical relations among non-object concepts can be still preserved.Thirdly, based on properties of M partial relations in theories of FCA, an efficient method of query answering in ACL and RAC is proposed. The search path in ACL and RAC when query can be determined using intent of attribute concept m" avoiding whole scope searching and query can be achived effectively.Lastly, formal context is discussed and attribute reductive theories of concept lattice are applied to research of data cube. The base table can be reduced and raletive operations of data cube can be simplified through removing unnecessary attributes and combining relative necessary attributes.(2) The attribute implicit relations in formal context are studied and based on relational system, Reductive Aggregate Concept structure based on Attribute Implication (RAC-AI) is proposed in this paper.According to theories of FCA, two kinds of nontrivial attribute implication which describes concept lattice are discussed. One is attribute implication in which preconditions are real-preconditions and the other is attribute implication in which preconditions are quasi-intensions. The intent of concept can be obtained through attribute implications but not concept lattice. Based on RAC two structure of RAC-AI that preconditions are real-preconditions and quasi-intensions are put forward. In RAC-AI discarding the complex structure of concept lattice, the table of attribute implication is added that record all nontrivial attribute implication and all non-object aggregate concepts are stored using mainstream relational system. Both theories and experiments show that RAC-AI has simple structure and small cube size. The costs of construction and maintenance incrementally are low, too. Furthermore, speed of response for query is high. So RAC-AI is a good structure of cube with high integrative performance.(3)The semantic relations among cells in data cube are more important for efficient query and OLAP. In this paper implementation of semantic operation in data cube based on theories of FCA and concept lattice is studied.Firstly, purification and reduction of formal context are discussed to remove redundant information. The current semantic research did not consider to reduct data themselves and plenty of redundant information left over in data disturb semantic finding and understanding.Secondly, based on properties of M partial relations in theories of FCA, M partial relations can be used as heuristic rules for concept hierarchy and semantic concept hierarchy can be obtained at attribute level but not dimension level liking in current methods.Thirdly, using properties of M partial relations and nontrivial attribute implication in theories of FCA, semantic meaning of roll-up and drill-down among data cells is realized. Through analyzing properties of upper bound and lower bound of equivalent character groups, computational methods of upper bound and lower bound are proposed to obtaine the structure of equivalent character groups and then obtain semantic meaning of roll-up and drill-dwon among data cells whose aggregates are same in cube. Based on properties of M partial relations and nontrivial attribute implication, intent of parents and children concept of any concept can be obtained and then obtain semantic meaning of roll-up and drill-down among data cells whose aggregates are different in cube. This method avoids depending any special structure and achieves roll-up and drill-down operations starting from any data cell. Roaming can be done in data cube through repeating this process and full data cube is not necessary. The current methods of semantic data cube were achieved at view level commonly or achieved at cell level depending on complex special structure.(4) Range query is a very effective tool to analyze data for Multi-dimensional data cubes. Pre-computing is one mean to improve range query. This paper work on partition scheme and analyze how to organize cells and carry on prefix sum in partitions and propose a Prefix Region data Cube structure (PRC). PRC partitions data cube into several irregular boxes in favor of pre-computing of prefix region. Technology of recursive partition is used to partition data cube and organize cells of partitions to carry prefix sum in PRC. For data cube with d dimensions (suporsed that the cardinality of each dimension is n) without adding any space overhead compared to storing the original array both of the range query and update costs of PRC are O(lgo~dn).

节点文献中: 

本文链接的文献网络图示:

本文的引文网络