节点文献

数据流上Skyline查询处理算法研究

Research on Algoritms to Process Skyline Query over Data Stream

【作者】 孙圣力

【导师】 朱扬勇;

【作者基本信息】 复旦大学 , 计算机软件与理论, 2008, 博士

【摘要】 数据流是连续、实时、有序的数据项序列。数据流广泛存在于因特网与传感器网络、交通与环境监控、工业控制、金融股市与电子商务交易等应用中。流数据挖掘与管理是近年来学术界和工业界所共同关注的问题。作为一种重要的数据挖掘技术,Skvline查询是近年来的研究热点。Skvline是定义在一个多维对象集上的集合,它由所有不被其它对象所支配的对象组成。Skyline对于数据挖掘可视化、用户偏好查询及多约束决策等问题具有重要意义。自从Skyline查询在学术界被提出以来,对该课题的研究迄今为止都非常活跃。数据流的特点是数据实时到达、规模宏大、次序独立以及数据往往只能一次读取。这就要求Skvline查询处理算法必需高效地处理数据流中到来的每一个对象,并且要求算法具有较低的时间复杂度。学术界已经出现了一些有关该课题的研究成果,但这些成果仅涉及数据流上全空间Skyline的查询处理,并且数据形式也仅限于集中式、确定性数据流。此外,现存算法没有很好地解决求影响时间与淘汰被新来对象所支配的对象这两个关键问题,导致算法性能较低。本文在对现有技术之不足进行了彻底改进的基础上,进一步解决了一些新的重要实际应用与需求。用户对数据对象各属性的关注往往是有差异的;实际应用中的数据流往往以分布式的形式出现,例如:金融股票交易、环境监测以及网络通讯等应用;最近两年以来,一种全新的被称为概率数据流的数据形态逐步引起了研究者们的关注,针对概率数据流的挖掘与分析方兴未艾。这些新的应用需求为数据流上的Skyline查询处理带了新的挑战。本文对数据流上的Skyline查询处理算法进行了系统地研究,主要成果概括为以下几个方面:(1).提出了一个高效地处理滑动窗口上Skyline持续查询的算法CridSky。对于数据流上的Skvline查询处理,决定其性能的主要因素是计算新到达对象的影响时间以及淘汰被新到达对象所支配的对象这两个关键过程的效率。对这两个关键过程,现有工作中只是简单地依靠R树上的窗口查询机制来实施,直接导致了算法性低下。GridSky算法采用更适合数据流这种频繁更新环境的基本网格作为索引结构,并在此基础上开发了基于时间戳的剪枝策略、基于网格的的剪枝策略以及分层遍历策略等优化措施,大幅度地提高了算法的性能。大量的对比实验表明,在空间复杂度略低的情况下,GridSky与其竞争对手相比时间性能优势在1个数量级以上。此外,GridSky算法对不同的数据分布、数据规模与维度具有很强的可扩展性。(2).研究了分布式数据流上的Skyline查询问题,提出了一个通信最优的分布式算法BOCS。近年来随着大规模分布式应用的涌现,分布式数据流上的查询处理与数据挖掘越来越引起了研究者们的关注。此前相关工作局限于集中式数据流上的Skyline计算,没有人考虑吏具挑战性的分布式数据流上的Skyline查询问题。本文围绕着降低系统反应延迟与最小化通信负荷的目标,在对GridSky进行重大适应性改造的基础上,提出了一个两阶段渐进求解的分布式算法BOCS。并对BOCS的关键实现环节,如:远程站点与协调站点间的通信协议、Skyline增量的计算等进行了优化,使算法达到了通信效率与反应延迟的合理均衡。理论分析证明在所有基于非共享策略的算法中,BOCS算法通信最优。大量的对比实验也表明,BOCS算法高效、稳定且具有良好的可扩展性。(3).提出了一个高效地计算滑动窗口中任意子空间上Skyline的算法StreamSubsky。此前相关工作仅限于维护滑动窗口全空间上的Skyline,未涉及到子空间上Skyline的计算问题。而用户的偏好与倾向性天然不同,这就催生了滑动窗口中子空间上的Skyline查询问题的研究。本文首次提出并较好地解决了该问题,提出的StreamSubsky算法巧妙地利用了全空间与各子空间上Skyline之间的关系,采用自顶向下的方式通过两个阶段增量地返回目标子空间上的计算结果。此外,算法还采用了多个优化策略显著地提高了计算效率。理论分析和实验结果表明,与同类算法相比StreamSubsky以极少的时间开销就能使查询得到响应,算法具有良好的时间与空间性能。(4).对概率数据流上的Skyline查询问题进行了深入研究,提出了一个高效的解决方案。数据的内在不确定性存信息集成、RFID以及传感器网络等普适计算环境中普遍存在。对概率数据流进行管理与分析近年来引起了研究者们的广泛关注,而此前尚无解决概率数据流上Skyline持续查询的算法出现。本文基于“可能世界”语义对该问题首次进行了建模,并提出了一个高效的查询处理算法SOPDS。与传统确定性数据流上的Skyline查询处理不同,SOPDS算法主要考虑以下两个基本问题:一是如何高效地确定对象的身份(判断其是否为Skyline对象),即减少身份判断过程中支配测试的次数以降低CPU开销;二是在保证算法正确性的前提下尽可能早地淘汰那些不再有机会加入Skyline的对象,以减少内存开销。围绕着以上两个基本问题,本文先后提出了概率定界、逐步求精、提前淘汰与选择补偿等优化措施对算法从时间与空间上进行了系统地优化。理论分析与详细的对比实验表明,SOPDS算法在时间与空间上具有较高的整体性能,算法高效、稳定。本文研究了数据流上Skyline查询的四个重要问题,并分别提出了有效的解决方案。本文提出GridSky算法对现有技术进行了彻底地改进,而提出的BOCS、StreamSubsky和SOPDS算法则进一步填补了一些重要新兴应用的空白。理论分析证明这些算法高效地解决了相应的问题;大量的对于比实验也表明,与现有技术相比本文提出的算法在存储空间、查询处理速度等方面具有明显的优势。

【Abstract】 Data stream is a sequence of digitally encoded signals used to represent information in transmission, which is generated from applications such as network routing, traffic and environmental monitoring, stock trading and electronic business activities etc. The study on streaming data is one of the hot topics among the database circle all over the world recently. As an important data mining technique, skyline query has been receiving considerable attention among database community. Skyline query returns a set of interesting objects which are not dominated by any other objects within the base dataset. Skyline query is of great significance for urban navigation, data mining visualization, multi-criteria decision making and preference query. As skyline query is proposed, techniques to process skyline query mushroom.Item within data stream can only be treated once because streaming data is of the nature of rapid arriving rate, large size and independence. Novel algorithm to process skyline query over data stream efficiently in a limited memory is still on demand. As to the topic of skyline query processing over sliding window, there already exist some methods. But some important issues about this topic are still open questions. Skyline query in full-space and only over centric, certain data stream is addressed in existing work. Besides, there exist critical defects about them. Existing approaches are improved fundamentally in this thesis at first. Furthermore, we set forward to solve some emerging issues about this topic. Attributes of object tend to be regarded differently; data stream in reality is always generated by a distributed manner; more and more attention has been attached to probabilistic data stream analyzing recently, but skyline computation over probabilistic data stream is still at large. These emerging requirements entail great challenges to skyline query processing over data stream. We are committed to subdue these challenges effectively and efficiently in this thesis, and our endeavors and contributions can be concluded as following:(1).We proposed a novel algorithm GridSky to process skyline query over sliding window efficiently. As to the algorithm processing skyline query over sliding window, its performance is decided by efficiency of the critical procedure to compute influence time of new arriving object and to eliminate the dominated objects. This problem is not well addressed in existing work, hence leads to low performance of existing approaches. More adaptive index basic grid, which is more applicable to data stream environment, is adopted by GridSky. Based on this index structure, pruning rule based on timestamp, pruning rule based grid and optimizing rule based on layered visit are developed to speed up GridSky progressively. Massive comparing experiments demonstrate that, GridSky outperforms its competitors by at least one order of magnitude as to the point of time complexity while consuming less memory space. Besides, GridSky is more scalable over data distribution, dimensionality and cardinality than its competitor.(2).We studied the issue of skyline query over distributed data streams, and proposed a communication-optimal algorithm BOCS. Query processing and data mining over distributed data streams have received considerable attention within database community recently. We are the first to address skyline query processing over distributed data streams where streams derive from multiple horizontally-spit data sources. Based on strategy of progressive refinement, skyline is computed incrementally in BOCS through two phases. In the first phase, local skylines on remote sites are maintained efficiently by GridSky and only skyline increments on remote sites are transmitted to coordinator at each time. Some skillful measures are developed to compute delta skyline efficiently in this phase. In the second phase, global skyline is obtained by integrating remote increments with the latest global skyline. Theoretical analysis is provided and shows that BOCS is communication optimal among all algorithms which are out of share-nothing strategy. Extensive experiments demonstrate that our proposal is efficient, scalable and stable.(3). We proposed an efficient algorithm StreamSubsky to handle subspace skyline computation over sliding window. Streaming data mining and skyline computation within subspaces have recently received much attention in data mining community. Previous work only sought to maintain full space skyline, we are the first to handle the problem of subspace skyline computation over sliding window. Some useful rules between full-space and subspace skyline are found by us in StreamSubsky, we propose an efficient top-town method to incrementally output the skyline objects within specified subspaces henceforth. Moreover, we present a set of optimizing heuristics to speed up critical procedures. Theoretical analysis and extensive experiments demonstrate that our algorithm is able to output the first result only taking a very short time compared with previous approaches, and our method are both efficient and scalable.(4).Inherent uncertainty and unreliability of data exist widely in emerging applications like information integration, RFID and sensor network based ubiquitous computation etc. The management of uncertain, probabilistic data stream has recently attracted considerable attention among researchers. Although previous work has well addressed skyline computation over static data or traditional certain data stream, skyline query over probabilistic data stream is still at large. To the best of our knowledge, we are the first to model this issue formally based on "possible worlds" semantics, furthermore an effective and efficient algorithm SOPDS is proposed to handle this issue. Compared with skyline query over traditional data stream, there are two basic different problems within this topic: how to efficiently confirm the status of new arrival and how to knock out unpromising objects as early as possible. Based on more adaptable index structure, a set of heuristic rules like probability bounding, progressive refinement, pre-elimination and selective compensation are developed to improve the comprehensive performance of SOPDS from the point of reducing both CPU overhead and memory cost. Massive back to back experiments demonstrate that our algorithm is of high overall performance, SOPDS is efficient, stable and scalable.In one word, four important issues about skyline query over data stream are well studied in this thesis and efficient algorithms to handle these issues are proposed respectively. Our approaches are great complement and improvement to existing skyline algorithm. Theoretical analysis shows our approaches are effective and efficient, extensive comparing experiments validate the theoretical observation.

  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2009年 03期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络