
交易数据的聚类分析

Clustering Transactional Data

【Author】 晏华 (Yan Hua)

【Supervisor】 章毅 (Zhang Yi)

【Author Information】 电子科技大学 (University of Electronic Science and Technology of China), Computer Application Technology, 2008, PhD dissertation

【Abstract】 Clustering is the process of partitioning a set of physical or abstract objects into disjoint groups such that objects within a group are similar to each other and objects in different groups are dissimilar. As an important data mining technique, clustering has been studied extensively in recent years and applied in many areas.

With the development of information and Internet technologies, the data people hold are not only growing rapidly in volume but also becoming more complex in type and more varied in structure. Existing clustering algorithms therefore face two problems in real applications: 1) when processing large-scale data their performance degrades sharply, or they cannot complete the analysis at all, i.e. they do not scale; 2) many clustering algorithms stop at theoretical analysis and seldom consider the characteristics of real data or the differences among applications, so their practical usefulness is limited.

Transactional data is a special kind of categorical data characterized by large volume and high dimensionality. Typical examples include market-basket data, web log data, customer profiles, patient diagnosis records, and image features, which commonly arise in retail, e-commerce, and healthcare as well as in the telecommunications, insurance, and banking industries. Studying scalable clustering methods for transactional data is therefore both challenging and of practical significance. This thesis takes large-scale transactional data as its research object and focuses on several problems in clustering such data. The main research contents and contributions are as follows:

(1) A scalable clustering analysis framework for large-scale transactional data, SCALE (Sampling, Clustering structure Assessment, cLustering and domain-specific Evaluation), is proposed. The design of SCALE has the following features: 1) according to the characteristics of transactional data, the set-based similarity measures Coverage Density and Weighted Coverage Density are introduced to measure the overall similarity of a group of transactions effectively (a sketch of this idea follows the abstract); 2) a scalable transactional clustering algorithm, WCD, is designed and implemented on the basis of Weighted Coverage Density; 3) a clustering-structure detection method generates candidate numbers of clusters, which effectively reduces the search over the parameter space of the clustering algorithm; 4) clustering result evaluation is integrated into the framework, and domain-specific measures help users select the best clustering result. Experimental results show that transactional clustering under the SCALE framework can efficiently generate high-quality clustering results.

(2) The problem of detecting the clustering structure of transactional datasets is studied. The generic categorical clustering-structure identification method BKPlot has two weaknesses: it yields many noisy candidate cluster numbers, and its performance drops on transactional datasets with a large number of items. A new method, DMDI, is therefore proposed for finding a set of candidate optimal cluster numbers "Ks" in a transactional dataset. Based on a self-defined Transactional-cluster-modes Dissimilarity measure, an agglomerative hierarchical clustering algorithm, ACTD, is designed and implemented, and the pair-cluster merging index values generated during the ACTD clustering procedure are used to find the candidate optimal cluster numbers (sketched below). Experiments show that DMDI can identify the clustering structure of transactional data effectively.

(3) The stability of transactional clustering results is studied. The results of traditional partition-based clustering methods are often trapped in local optima, whereas SOM neural networks produce stable clustering results but can only handle numerical data. A GHSOM-based clustering method for transactional data, GHSOM-CD, is therefore proposed; it introduces the concept of Coverage Density into the GHSOM learning algorithm and improves the neuron weight-update rule and the network training stop criterion. Experiments show that GHSOM-CD produces more meaningful clustering results on transactional datasets and extends SOM neural networks to the clustering of categorical data.

(4) The problem of compressing frequent itemsets is studied. To address the excessive number of itemsets generated by frequent itemset mining, a dynamic clustering method, EESC, is proposed to compress frequent itemsets approximately. The method is based on two self-defined intra-cluster similarity measures for frequent itemsets: expression similarity and support similarity (sketched below). Experimental results show that this approximate compression of frequent itemsets is feasible and that the compression quality is good.
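
The abstract does not give the formal definitions of Coverage Density and Weighted Coverage Density. Purely as an illustration of the set-based idea (viewing a cluster of transactions as a small item-by-transaction matrix and measuring how densely it is filled), the following is a minimal Python sketch; the exact formulas used in the thesis may differ, and the function names are chosen here for illustration only.

```python
from collections import Counter

def coverage_density(cluster):
    """Fill ratio of the cluster's item-transaction matrix.

    `cluster` is a list of transactions, each a set of items.
    CD = S / (N * W): S = total item occurrences, N = number of
    transactions, W = number of distinct items in the cluster.
    """
    occ = Counter(item for t in cluster for item in t)
    s = sum(occ.values())        # total item occurrences
    n = len(cluster)             # number of transactions
    w = len(occ)                 # number of distinct items
    return s / (n * w) if n and w else 0.0

def weighted_coverage_density(cluster):
    """Like CD, but items that occur often inside the cluster weigh more
    (one plausible reading of the "weighted" variant)."""
    occ = Counter(item for t in cluster for item in t)
    s = sum(occ.values())
    w = len(occ)
    if not s or not w:
        return 0.0
    return sum(c * c for c in occ.values()) / (s * w)

if __name__ == "__main__":
    tight = [{"milk", "bread"}, {"milk", "bread", "eggs"}]       # similar baskets
    loose = [{"milk", "bread"}, {"phone", "case", "charger"}]    # unrelated baskets
    print(coverage_density(tight), coverage_density(loose))      # ~0.83 vs 0.5
    print(weighted_coverage_density(tight), weighted_coverage_density(loose))  # 0.6 vs 0.2
```

With this reading, a cluster of near-identical transactions scores close to 1, while a cluster mixing unrelated transactions scores much lower, which is the behaviour a clustering criterion for transactional data needs.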

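The abstract describes DMDI and ACTD only at a high level: agglomerative hierarchical clustering over transactional data whose per-merge index values reveal the candidate optimal cluster numbers "Ks". The sketch below illustrates that general mechanism with a plain Jaccard distance between the item sets of clusters and a simple jump-detection rule; it is not the thesis's Transactional-cluster-modes Dissimilarity, and the names, the naive O(n^3) merging loop, and the `min_jump` threshold are assumptions made for the example.

```python
import itertools

def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| between two item sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def agglomerate(transactions):
    """Naive agglomerative clustering over item sets.

    Returns a list of (num_clusters_after_merge, merge_index) pairs,
    where merge_index is the dissimilarity of the two clusters merged.
    """
    clusters = [frozenset(t) for t in transactions]
    history = []
    while len(clusters) > 1:
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: jaccard_distance(clusters[p[0]], clusters[p[1]]))
        d = jaccard_distance(clusters[i], clusters[j])
        merged = clusters[i] | clusters[j]   # crude cluster "mode": union of items
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append((len(clusters), d))
    return history

def candidate_ks(history, min_jump=0.3):
    """Cluster counts right before a large jump in the merge index."""
    return [k_after for (k_after, d), (_, d_next) in zip(history, history[1:])
            if d_next - d >= min_jump]

if __name__ == "__main__":
    data = [{"a", "b"}, {"a", "b", "c"}, {"x", "y"}, {"x", "y", "z"}]
    hist = agglomerate(data)
    print(hist)            # cheap merges inside each group, one costly merge at the end
    print(candidate_ks(hist))   # [2] for this toy dataset
```

On the toy data, merges inside each of the two natural groups are cheap and the final merge across groups is expensive, so the jump in the merge index points to K = 2.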

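The two EESC similarity measures, expression similarity and support similarity, are only named in the abstract. A common way to realize such measures is to compare the itemsets' item composition (Jaccard overlap) and the closeness of their supports, so the sketch below uses that reading; the formulas, thresholds, and the greedy one-pass grouping are illustrative assumptions rather than the thesis's EESC definitions.

```python
def expression_similarity(items_a, items_b):
    """Overlap of the two itemsets' item compositions (Jaccard)."""
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 1.0

def support_similarity(sup_a, sup_b):
    """Closeness of two supports, in (0, 1]; equal supports give 1.0."""
    return min(sup_a, sup_b) / max(sup_a, sup_b)

def compress(frequent_itemsets, min_expr=0.6, min_sup=0.8):
    """Greedy one-pass clustering of (itemset, support) pairs.

    Each itemset joins the first cluster whose representative is similar
    enough on both measures; otherwise it starts a new cluster.  The
    representatives form the approximate, compressed collection.
    """
    representatives = []   # list of (frozenset, support)
    for items, sup in sorted(frequent_itemsets, key=lambda x: -len(x[0])):
        for rep_items, rep_sup in representatives:
            if (expression_similarity(items, rep_items) >= min_expr
                    and support_similarity(sup, rep_sup) >= min_sup):
                break      # covered by an existing representative
        else:
            representatives.append((items, sup))
    return representatives

if __name__ == "__main__":
    patterns = [
        (frozenset("abc"), 0.30),
        (frozenset("ab"),  0.32),   # near-duplicate of {a, b, c}
        (frozenset("a"),   0.55),
        (frozenset("xy"),  0.20),
    ]
    for items, sup in compress(patterns):
        print(sorted(items), sup)
```

The representatives returned by `compress` stand in for the full collection: here the near-duplicate pattern {a, b} is absorbed by {a, b, c} because both similarity measures exceed their thresholds, compressing four itemsets to three.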