Clustering Methods in Data Mining with Its Applications (数据挖掘中的聚类方法及其应用)

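【Author】 Yin Ruifei (殷瑞飞)

【Advisor】 Zhu Jianping (朱建平)

【Author information】 Xiamen University, Statistics, 2008, PhD

【Subtitle】 A Study from a Statistical Perspective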

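【Abstract (translated from the Chinese)】 Data mining is an emerging technology that has developed in recent years alongside databases and artificial intelligence. It extracts implicit, useful information and knowledge from large volumes of raw data, helping decision makers find latent relationships in the data and discover overlooked factors. Because of its large commercial potential, data mining has become one of the frontier research directions in databases and information-based decision making, and has attracted wide attention from academia and industry. When facing massive data, the first task is to group it, and cluster analysis is a method for grouping raw data sensibly. As an important function of data mining, cluster analysis can serve as a standalone tool for obtaining the distribution of the data, examining the characteristics of each cluster, and focusing further analysis on particular clusters. It can also serve as a preprocessing step for other algorithms. Cluster analysis has therefore become a very active research topic in data mining.

A large number of clustering methods already exist in the data mining literature. So far, however, research on clustering in data mining has been concentrated in computer science. It emphasizes clustering algorithms, or algorithmic improvements to existing methods, and rarely analyzes the clustering problem in depth from a genuinely statistical standpoint. This thesis takes a statistical perspective. Grounded in statistical theory, and following the basic idea of combining statistical methods with algorithms, it brings established statistical methods such as factor analysis, correspondence analysis, and functional data analysis into data mining so that they can be applied to the clustering of massive data.

The thesis has six chapters, organized as follows.

Chapter 1 introduces the background of the topic, the research content, and the main innovations of the thesis.

Chapter 2 first briefly describes the definition, functions, and common techniques of data mining. It then reviews the main clustering methods currently used in data mining and their research progress, and compares and summarizes these methods from three angles: clustering criteria, cluster representation, and clustering algorithm framework.

Chapter 3 improves the classical Q-mode factor model to overcome its computational inefficiency and proposes a new clustering method for massive data, the Q-mode factor clustering method. The method is applied successfully to the sector analysis of listed companies, in support of investment decisions.

Chapter 4 builds on the basic idea of Benzécri's correspondence analysis, combined with the idea of Q-mode factor analysis, to propose the correspondence analysis clustering method for data mining. The method is used to cluster a large database of monthly mobile communication consumption, yielding a segmentation of the mobile communication consumer market.

Chapter 5 draws on the basic ideas and methods of functional data analysis to establish a general framework for clustering time series databases. It extends the method to the multivariate case, solving the problem of clustering multivariate time series. The method is applied to portfolio risk management, where using the clustering results for asset selection effectively reduces portfolio risk.

Chapter 6 summarizes the main work of the thesis and points out its shortcomings and directions for further research.

The thesis attempts to be innovative in the following respects:

1. By improving the classical Q-mode factor model to overcome its computational inefficiency, it proposes a new clustering method for massive data, the "Q-mode factor clustering method".
2. It proposes the "correspondence analysis clustering method" for data mining. The method resolves the computational efficiency problem of Q-mode factor analysis and also addresses several defects of traditional correspondence analysis, such as the lack of an objective classification criterion and serious loss of information.
3. In developing the correspondence analysis clustering method, it constructs a standardized factor loading matrix for correspondence analysis, gives a method for computing factor scores in correspondence analysis, and for the first time introduces factor rotation into correspondence analysis, extending the methods and theory of correspondence analysis to some degree.
4. Drawing on the basic ideas and methods of functional data analysis, it establishes a general framework for clustering time series databases. Within this framework, a large number of traditional static clustering methods can be applied to time series clustering.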

【Abstract】 Data mining is a new technology that has developed alongside databases and artificial intelligence. It is a process of extracting credible, novel, effective, and understandable patterns from databases. Owing to its tremendous business prospects, data mining has become one of the most popular research areas in database and information technology and has received increasing attention in recent years.

Cluster analysis is an important data mining technique used to find data segmentation and pattern information. By clustering the data, one can obtain the data distribution, observe the characteristics of each cluster, and study particular clusters further. In addition, cluster analysis often serves as a preprocessing step for other data mining operations. Cluster analysis has therefore become a very active research topic in data mining.

With the development of data mining, a number of clustering methods have been proposed. Recent studies of clustering methods in data mining come mostly from computer science and pay more attention to clustering algorithms. Studies of clustering techniques from a statistical perspective, however, are relatively scarce. Based on statistical theory, this thesis attempts to combine statistical methods with computer algorithms and to introduce established statistical methods, including factor analysis, correspondence analysis, and functional data analysis, into data mining.

This thesis consists of six chapters, whose main contents are outlined as follows.

Chapter 1 is the introduction, which briefly presents the research background and issues, the contents and framework, and the contributions of the thesis.

Chapter 2 first reviews data mining, the main clustering methods, and their recent advances. It then analyzes these methods systematically from three viewpoints: clustering criteria, cluster representation, and algorithm framework.

Chapter 3 improves the algorithm of the classical Q-mode factor model and puts forward a new clustering method for large-scale databases, the Q-Mode Factor Clustering Method. At the end of the chapter, the method is applied successfully to the sector analysis of listed companies.

Chapter 4, based on the ideas of Q-mode factor analysis and correspondence analysis, proposes the Correspondence Analysis Clustering Method, a new clustering approach in data mining. Clustering mobile communication consumption data with this method yields a segmentation of the mobile communication consumption market.

Chapter 5 establishes a general framework for time series clustering by virtue of the ideas and techniques of functional data analysis. Extending this method to the multivariate case resolves the problem of multivariate time series clustering. Finally, the proposed method is applied to portfolio risk diversification, and its validity is verified through bootstrap simulation.

Chapter 6 summarizes the whole thesis, including the research conclusions, limitations, and directions for future research.

The main innovations of this thesis are as follows:

1. By modifying the classical Q-mode factor model, we put forward the Q-Mode Factor Clustering Method, which dramatically reduces the time complexity of the algorithm.
2. We propose a new clustering approach, the Correspondence Analysis Clustering Method. The approach is computationally efficient, whereas computation is an obstacle in Q-mode factor analysis. Additionally, the approach overcomes the subjectivity of traditional correspondence analysis and avoids the loss of information.
3. In developing the Correspondence Analysis Clustering Method, we construct a standardized factor loading matrix, give a method for computing factor scores in correspondence analysis, and for the first time introduce factor rotation into correspondence analysis. This work extends, to some extent, the methodology and theory of correspondence analysis.
4. By virtue of the ideas and techniques of functional data analysis, we establish a general framework for time series clustering, under which many traditional static clustering methods can be applied to time series data.
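
The idea in innovation 4, turning each time series into a function and then applying a static clustering method, can be illustrated with a rough sketch. This is not the thesis's algorithm, which the abstract does not specify. It assumes a Fourier basis fitted by least squares as the functional representation and plain k-means on the basis coefficients as the static method. All function names, parameters, and the synthetic data are hypothetical.

```python
import numpy as np

def fourier_basis(t, n_harmonics):
    """Design matrix: a constant column plus sine/cosine pairs on [0, 1)."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t))
        cols.append(np.cos(2 * np.pi * k * t))
    return np.column_stack(cols)

def to_coefficients(series, t, n_harmonics=3):
    """Represent each series (one per row) by its least-squares basis coefficients."""
    B = fourier_basis(t, n_harmonics)                      # shape (T, K)
    coef, *_ = np.linalg.lstsq(B, series.T, rcond=None)    # shape (K, n)
    return coef.T                                          # shape (n, K)

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def cluster_time_series(series, t, k, n_harmonics=3):
    """Step 1: series -> coefficient vectors. Step 2: static clustering of the vectors."""
    return kmeans(to_coefficients(series, t, n_harmonics), k)

# Synthetic demonstration data (an assumption, not data from the thesis):
# five noisy copies of one curve shape followed by five of another.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50, endpoint=False)
group_a = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal((5, t.size))
group_b = np.cos(4 * np.pi * t) + 0.1 * rng.standard_normal((5, t.size))
series = np.vstack([group_a, group_b])
labels = cluster_time_series(series, t, k=2)
```

Because the clustering step only sees coefficient vectors, k-means could be swapped for any other static method, such as hierarchical clustering, without changing the first step. A multivariate series could be handled under the same assumption by concatenating the coefficient vectors of its component series.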

【Key words】 Data Mining; Cluster Analysis; Statistical Method
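  • 【Online publication contributor】 Xiamen University
  • 【Online publication year and issue】 2009, Issue 08
  • 【Classification code】 C811
  • 【Citation count】 42
  • 【Download count】 6249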