节点文献

不确定数据的世系管理和相似性查询

Data Provenance Management and Similarity Query Over Uncertain Data

【作者】 高明

【导师】 周傲英;

【作者基本信息】 复旦大学 , 计算机软件与理论, 2011, 博士

【摘要】 不确定性数据在很多应用中广泛出现,例如经济、军事、物流、金融、电信等,其表现形式多种多样,包括关系型数据、半结构化数据、图数据、流数据、移动对象数据以及无结构化的Web数据等。目前,根据应用的特点与数据形式的多样性,已经出现了多种不确定数据模型,这些模型的核心思想都源自可能世界模型。该模型从一个不确定的数据源演化出诸多确定性的可能世界实例,所有实例的概率之和等于1。尽管可以针对各个实例单独进行查询处理,合并中间结果并获取最终结果,但是可能世界实例的数量远大于不确定数据库的规模,从而导致可能世界模型在实践应用中并不可行。因此必须采用排序、剪枝等启发式技术进行优化处理以提高查询处理效率。针对不确定数据管理的挑战,本文主要考察不确定数据查询处理的优化。主要工作分为两部分:不确定数据世系管理和相似性查询。具体的,针对数据的不确定性,研究如何通过不确定数据的世系追踪数据不确定性的来源和大小,以及对不确定性集合数据进行相似度评价,最后提出了不确定数据流上ER-topk查询的精确算法。本文的主要贡献如下:●首先研究了如何利用数据世系追踪数据不确定性的来源和大小。基于PHP-tree数据结构,近似描述不确定数据的How世系,避免了追踪数据演化的中间结果,同时也避免了运用可能世界模型对不确定性数据进行建模;基于PHP-tree,可以追踪日标数据的不确定性来源,以及对目标数据的不确定性大小进行评价。·针对不确定集合,定义了不确定性集合的期望相似度算子,提出了不确定集合期望相似度的精确和近似算法。具体的,运用动态规划方法在多项式时间内给出不确定集合期望相似度的精确算法,而不必扩展可能世界实例;考虑到精确算法需要耗费大量的时间和空间,为克服可扩展性差的缺点,我们运用Monte-Carlo方法在线性时间内近似计算不确定集合的期望相似度。●考虑到不确定集合相似度的多样性,又评价了不确定性集合的概率阈值相似度。给出了不确定集合的概率阈值相似度算子的定义,以及精确和近似算法。运用动态规划方法在多项式时间内给出不确定集合概率阈值相似度的精确计算过程;同时考虑到概率阈值相似度的计算结果是一个概率值,当用户给定相似度的阈值,利用尾概率不等式提出了一个线性时间内的剪枝规则,大大加快了精确解的计算过程;考虑到没有被剪枝的不确定集合的精确算法需要耗费大量的时间和空间,我们运用Monte-Carlo方法近似计算不确定集合的概率阈值相似度。●基于界标模型提出了不确定数据流响应ER-topk查询的精确算法,该方案将所有不断到来的元组分成两组,一组包含ER-topk查询的候选结果,剩下的元组包含在另外一组中,我们分别用数据结构domGraph和probTree来维护这两类元组;基于期望的线性性,我们避免了扩展所有可能世界实例,在次线性时间内给出查询的结果。本文研究了不确定数据的查询处理,主要工作包括不确定数据世系管理和不确定数据的相似性查询,通过大量的实验验证了提出算法的效率和可扩展性等。

【Abstract】 Appearing widely in various fields, inclusive of economy, military, logistic, fi-nance and telecommunication, et al., uncertain data has many different styles, such as relational data, semi-structure data, streaming data, and moving objects. accord-ing to scenarios and data characteristics. tens of data models have been developed. stemming from the core possible world model that contains a huge number of the possible world instances with the sum of probabilities equal to 1. However, the num-ber of the possible world instances is far greater than the volume of the uncertain database, making it infeasible to combine intermediate results generated from all of possible world instances for the final query results. Thus, some heuristic techniques, such as ordering, pruning, must be used to reduce the computation cost for the high efficiency.In this thesis, a comprehensive survey on the techniques for data management in uncertainty, and data provenance. However, traditional techniques cannot be adopted in uncertain data management because of some challenges in uncertain data management. Focusing on the challenges of uncertain data management and weak point of traditional techniques for it, we track the origin and value of uncertainty of data, evaluate the similarity of uncertain set for un-structured data, and study expected ranking top-k (shorted in ER-topk) query over uncertain data stream. The contributions of this thesis are summarized as follows:●Focusing on the uncertainty of data, we track the origin and value of uncer-tainty of data by how provenance at first. We propose an approach, named PHP-tree, to approximately model how-provenance upon probabilistic databases. we also show how to evaluate probability based on a PHP-tree. Compared with Trio style lineage, our approach is independent of intermediate results and can calculate the probability both cases of restricted and complete propagation of data provenance.●Based on uncertain set, we propose the expected set similarity operator based on the possible worlds model, over several mainstream similarity metrics, in-cluding Jaccard, Dice and Cosine. Our first work is to define the expected set similarity over probabilistic sets. Then we design exact and approximate algo-rithms for calculating them. In detail, we employ the dynamic programming to exactly calculate the expected set similarity in polynomial time and space consumption. Due to large cost both in time and space of exact algorithms, we also provide approximate solutions that are both time-and space-efficient with high approximate precision by Monte-Carlo.●Since diversity of set similarity of probabilistic sets, based on the possible worlds model, we propose the probability threshold set similarity operator, over several mainstream similarity metrics, including Jaccard, Dice and Cosine, to evaluate similarity of probabilistic sets. We also design exact algorithms for calculating them. In detail, we employ the dynamic programming to calculate the probability threshold set similarity in polynomial time and space consump-tion exactly. Furthermore, we propose probabilistic threshold similarity query algorithm. We provide a pruning rule in linear time and space for Jaccard, Dice, and Cosine. Due to large cost both in time and space of exact algo-rithms, we also provide approximate solutions that are both time-and space-efficient with high approximate precision by Monte-Carlo.●Based on the landmark model, we give an exact solution for ER-topk query over uncertain data stream. In our solution, all arriving tuples in the data stream are divided into two groups. One group contains candidate top-k tu-ples, i.e, the tuples having chance to belong to the query result, and the other contains the rest. We construct and maintain two structures, namely domGraph and probTree, to describe the two groups for efficiency. Since the linearity of expectation, our solution can provide answer of query in sub-linear time.In this thesis, our work focus on the query processing over uncertain data. Mainly work consist of data provenance management and similarity query over un-certain data. Finally, we verify the efficiency and scalability of our proposed algo-rithms by lot of experiments.

  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2011年 12期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络