节点文献

高维数据挖掘中若干关键问题的研究

The Research on a Few Key Issues in High Dimensional Data Mining

【作者】 杨风召

【导师】 朱扬勇;

【作者基本信息】 复旦大学 , 计算机软件与理论, 2003, 博士

【摘要】 数据挖掘指的是从大量的数据中提取隐含的、事先未知的、并且潜在有用的知识的技术,是目前国际上数据库和信息决策领域最前沿的研究方向之一。在实际应用中经常会碰到高维数据,如交易数据、文档词频数据、用户评分数据、WEB使用数据及多媒体数据等。由于这种数据存在的普遍性,使得对高维数据挖掘的研究有着非常重要的意义。但由于“维灾”的影响,也使得高维数据挖掘变得异常地困难,必须采用一些特殊的手段进行处理。 随着数据维数的升高,高维索引结构的性能迅速下降,在低维空间中,我们经常采用L_p距离作为数据之间的相似性度量,在高维空间中很多情况下这种相似性的概念不复存在,这就给高维数据挖掘带来了很严峻的考验,一方面引起基于索引结构的数据挖掘算法的性能下降,另一方面很多基于全空间距离函数的挖掘方法也会失效。解决的方法可以有以下几种:一个可以通过降维将数据从高维降到低维,然后用低维数据的处理办法进行处理;对算法效率下降问题可以通过设计更为有效的索引结构、采用增量算法及并行算法等来提高算法的性能;对失效的问题通过重新定义使其获得新生。 本文对高维数据挖掘中的相似性搜索、高维数据聚类、高维数据异常检测及电子商务中的协同过滤技术进行了研究,指出了高维给这些领域带来的影响,提出了一些解决问题的方法,具有一定的理论意义和现实的指导意义。 本文的主要工作如下: (1)通过对高维数据特点的分析,提出了一种新的相似性度量函数Hsim(),该函数可以避免在高维空间中分辨能力下降的问题,还可以将数值型的数据和二值型数据相似性的计算整合在一个统一的框架中。并将它与其它的相似性函数进行了比较; (2)结合量化交易数据的特点,提出了一种新的量化交易数据相似性搜索方法,这种算法基于一种称为特征表的结构,对数据有较高的修剪率,能大大提高相似性搜索的速度; (3)提出了一种新的基于用户评分数据的协同过滤算法,并通过实验证明该算法不仅提高了推荐的效率,还对推荐精度有一定的提高; (4)分析了高维数据聚类的算法,提出了基于对象相似性的高维数据聚类框架; (5)对高维对异常检测算法的影响进行了分析,给出了投影异常检测的概念。提出了一种动态环境下局部异常的增量挖掘算法IncLOF,并通过实验和LOF算 摘 要法进行了比较,结果表明在动态高维的环境下,当高维索引结构失效的情况下。能大大提高局部异常的挖掘效率。

【Abstract】 Data mining refers to extracting implicit, previously unknown and usable knowledge from large amounts of data. It is one of the frontiers of research in the fields of database and DSS. The high dimensional data are frequently met when we apply data mining, for example transaction data, term-frequency data, rating data, WEB usage data and multimedia data. The universality of high dimensional data makes researches on high dimensional data mining very important. But mining in high dimensional data is extraordinarily difficult because of the curse of dimensionality. So we must adopt some special means to solve these problems.The performance of similarity indexing structures in high dimensions degrades rapidly. In lower dimensional space, we often use Lp-norm to measure the proximity between two points, but in many case the concept of this proximity is never meaningless in high dimensional space. These issues bring high dimensional data mining two challenges. One is the performance of data mining algorithms degrades, the other is many distance-based and density-based algorithms maybe not effective. These problems can be solved by the following methods: l)Transport the data from high dimensional space to lower dimensional space by dimensionality reduction, then process the data as lower dimensional data. 2)To improve the performance of mining algorithms, we can design more effective indexing structures, adopt incremental algorithms and parallel algorithms and so on. 3)Redefine some concepts in a meaningful way for high dimensional domains.Similarity search, cluster analysis and outlier detection in high dimensional data mining, as well as collaborative filtering in e-commerce are studied in this paper. We point the effect of high dimensionality on these domains and present some method to solve these problems. The researches in this paper have much important theoretical and practical significance.The majority of our work is summarized here:(1)A new function Hsim( ) to measure the proximity of objects in high dimensional spaces is presented by analyzing the characteristic of the high dimensional data. The function can not only avoid the problem which the Lp-norm lead to the non-contrasting behavior of distance in high dimensional space, but also adapt to both binary and numerical data. We also made a comparison between Hsirn()and other similarity functions.(2) According to the characteristic of quantitative transaction data, a new method based on signature table for similarity search on quantitative transaction data is presented. Experiments demonstrate this method have very good pruning efficiency for similarity search on the quantitative transaction data, so can greatly speed thesimilarity search.(3) Put forward a new algorithm for performing ratings-based collaborative filtering. Our preliminary experiments based on a number of real data sets show that the new method can both improve the scalability and quality of collaborative filtering.(4) Analyze the algorithms for high dimensional data cluster, and present a framework of similarity-based cluster analysis for high dimensional data.(5) Analyze the effects of high dimensionality on outlier dection algorithms, give a concept of projected outlier detection. The first incremental outlier detection algorithm IncLOF is presented and compared with LOF algorithm. The results from a study on synthetic data sets demonstrate that the runtime of IncLOF is much less than LOF in dynamic and high dimensional environment where high indexing structures are ineffective.

  • 【网络出版投稿人】 复旦大学
  • 【网络出版年期】2004年 02期
  • 【分类号】TP311.13
  • 【被引频次】52
  • 【下载频次】2208
  • 攻读期成果
节点文献中: 

本文链接的文献网络图示:

本文的引文网络