
Study on Clustering for Large Data Sets and Its Applications

【Author】 钱鹏江 (Qian Pengjiang)

【Advisor】 王士同 (Wang Shitong)

【Author Information】 Jiangnan University, Light Industry Information Technology and Engineering, 2011, Ph.D.

【摘要】 Clustering has long been a hot topic in pattern recognition, and many clustering methods have emerged. Most of them perform well on small data sets suited to their characteristics, but on large data sets they are often inefficient or even infeasible. Motivated by this difficulty, this thesis studies clustering for large data sets and proposes four clustering methods and one fundamental theorem, as follows.

Chapter 2 first introduces a constraint condition and a linear term into the objective expression of the Graph-based Relaxed Clustering (GRC) algorithm, yielding the Constrained Graph-based Relaxed Clustering (CGRC) algorithm. Since CGRC can be viewed as a center-constrained Minimal Enclosing Ball (MEB) problem, the core-set-based fast MEB approximation technique is then applied to obtain the Fast Graph-based Relaxed Clustering (FGRC) algorithm, whose chief merit is an asymptotic time complexity linear in the sample size.

Probability density estimation is one of the foundations of pattern recognition, and much follow-up work builds on it. This is the case for the Fast Adaptable Similarity-based Clustering Method (FASCM) proposed in Chapter 3 and the Fast Mean Shift Spectral Clustering (FMSSC) algorithm proposed in Chapter 4, both of which build on the Fast Reduced Set Density Estimator (FRSDE). Chapter 3 first proves that the similarity measure of the Similarity-based Clustering Method (SCM) is equivalent to a Gaussian-kernel probability density estimate, so FRSDE can quickly produce a similarity function with sparse weight coefficients, greatly reducing the computational cost of the SCA phase of SCM. Graph-based relaxed clustering then replaces the hierarchical clustering step, making the algorithm self-adaptive and free of dependence on manual experience. This is the main idea of FASCM.

Chapter 4 points out that the heavy computational cost of the original Mean Shift Spectral Clustering (MSSC) algorithm stems from its use of the Parzen-window (PW) density estimator. The chapter therefore redesigns the MSSC framework, replacing PW with FRSDE and the simple mode-merging step with the CGRC algorithm of Chapter 2, yielding the Fast Mean Shift Spectral Clustering (FMSSC) algorithm. FMSSC is markedly more practical than MSSC, and its overall time complexity is approximately linear in the sample size.

Chapter 5 shows that the objective expression of GRC can be written as "a weighted sum of PW terms plus a quadratic entropy", so GRC can also be regarded as a kernel density estimation (KDE) problem. Using the KDE approximation theorem, the chapter proposes SUGRC-KDEA, a new method for scaling up graph-based relaxed clustering to large data sets via KDE approximation. The key to SUGRC-KDEA is a suitable sampling size, so the chapter also proposes the Hypersphere Segmentation Based Random Sampling (HSBRS) algorithm, which guarantees both an appropriately sized subset and one that reflects the distribution of the original data.

Chapter 6 establishes a fundamental result, the Fast Kernel Density Estimation (FKDE) theorem. Using the Cauchy-Schwarz inequality, it proves that the error bound between the KDE built on a sampled subset and the KDE built on the full data set depends only on the sampling size and the kernel parameter, and on no other factor. That is, with a suitable sampling size and kernel width, a sampled subset can replace the original data set for kernel density estimation. This theorem provides new theoretical support for all pattern recognition methods and techniques based on data sampling, a category that includes all the work in this thesis.

【Abstract】 Clustering analysis has long been a topical subject in the field of pattern recognition, and many clustering methods have emerged. Most of them perform well on small data sets suited to their characteristics; however, they are often inefficient, impractical, or even invalid on large ones. Motivated by the challenges of clustering for large data sets, this study addresses several issues, organized into the following five parts.

In Chapter 2, the Constrained Graph-based Relaxed Clustering (CGRC) algorithm is first presented by introducing a constraint condition and a linear term into the objective expression of the original Graph-based Relaxed Clustering (GRC) algorithm. Since CGRC can be regarded as a center-constrained Minimal Enclosing Ball (MEB) problem, the Fast Graph-based Relaxed Clustering (FGRC) algorithm is then derived using the core-set-based MEB approximation technique. The chief merit of FGRC is that its asymptotic time complexity is linear in the data size.

Probability density estimation is a theoretical foundation of pattern recognition on which much follow-up work builds. This is the case for the Fast Adaptable Similarity-based Clustering Method (FASCM) proposed in Chapter 3 and the Fast Mean Shift Spectral Clustering (FMSSC) algorithm presented in Chapter 4, both of which rest on the Fast Reduced Set Density Estimator (FRSDE). Chapter 3 first deduces that the similarity measure of the Similarity-based Clustering Method (SCM) is equivalent to a Gaussian kernel density function; hence a similarity function with sparse weight coefficients, which sharply reduces the computational cost of the SCA phase embedded in SCM, can be obtained quickly via FRSDE. Then, replacing the Agglomerative Hierarchical Clustering (AHC) step with GRC makes the proposed method self-adaptable and independent of manual experience. This is FASCM.

In Chapter 4, it is pointed out that the Parzen-window (PW) density estimator is responsible for the expensive time cost of the original Mean Shift Spectral Clustering (MSSC) algorithm. To address this, the FMSSC algorithm is presented within a new framework in which FRSDE substitutes for PW and the CGRC algorithm of Chapter 2 replaces the simple mode-merging method. This makes FMSSC more practical than MSSC, with a time complexity roughly linear in the whole data size.

In Chapter 5, because the objective function of GRC can be decomposed into two parts, a weighted sum of PW terms and a quadratic entropy, GRC can be considered a Kernel Density Estimation (KDE) problem. Accordingly, the Scaling-Up GRC by KDE Approximation (SUGRC-KDEA) method is proposed based on the KDE approximation theorem. The key point of SUGRC-KDEA is the sampling size; to ensure that the sampled subset is of suitable size and reflects the underlying distribution of the whole data set, the Hypersphere Segmentation Based Random Sampling (HSBRS) algorithm is presented as well.

In Chapter 6, the Fast Kernel Density Estimation (FKDE) theorem, a fundamental result of this study, is proved via the Cauchy-Schwarz inequality. It shows that the Integrated Squared Error (ISE) between the KDE generated by all data points and the KDE generated by a sampled subset depends only on the sampling size and the kernel width. That is, a subset of appropriate size can be employed for KDE instead of the whole data set. This theorem provides new theoretical support for data-sampling-based methods and technologies in pattern recognition, including all of the research in this thesis.
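Both abstracts revolve around the Parzen-window (Gaussian kernel) density estimator, whose quadratic evaluation cost is what FRSDE and the sampling-based methods of Chapters 3-6 aim to avoid. A minimal sketch of that estimator may make the cost concrete (illustrative code, not from the thesis; the function name and bandwidth value are assumptions):

```python
import numpy as np

def parzen_window_kde(data, x, h):
    """Gaussian Parzen-window estimate of the density at query points x.

    data : (n, d) array of samples; x : (m, d) query points; h : bandwidth.
    Evaluation costs O(n * m), which is the bottleneck that motivates
    reduced-set and sampling-based approximations such as FRSDE.
    """
    n, d = data.shape
    # Pairwise squared distances between query points and samples
    diff = x[:, None, :] - data[None, :, :]          # shape (m, n, d)
    sq = np.sum(diff ** 2, axis=-1)                  # shape (m, n)
    norm = (2.0 * np.pi * h ** 2) ** (d / 2.0)       # Gaussian normalizer
    return np.exp(-sq / (2.0 * h ** 2)).sum(axis=1) / (n * norm)

# Example: estimate a 1-D standard normal density at two points
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(2000, 1))
x = np.array([[0.0], [3.0]])
p = parzen_window_kde(data, x, h=0.3)
```

Every query point touches every sample, so the estimate at the mode (`p[0]`) is accurate but the cost grows linearly in both the data size and the number of queries, i.e. quadratically for mean-shift-style iteration over all samples.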
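The FKDE theorem's qualitative claim, that the ISE between the subset-based KDE and the full-data KDE is controlled by the sampling size and kernel width, can be checked empirically. The following is a hedged sketch under arbitrary choices of data size, bandwidth, and evaluation grid, not the thesis's experimental setup:

```python
import numpy as np

def gaussian_kde(data, x, h):
    """Gaussian KDE of 1-D `data` evaluated at query points `x`."""
    sq = (x[:, None] - data[None, :]) ** 2
    return np.exp(-sq / (2.0 * h ** 2)).mean(axis=1) / np.sqrt(2.0 * np.pi * h ** 2)

rng = np.random.default_rng(1)
full = rng.normal(0.0, 1.0, 20000)          # "whole data set"
grid = np.linspace(-4.0, 4.0, 400)          # evaluation grid
p_full = gaussian_kde(full, grid, h=0.3)

# Approximate the Integrated Squared Error between the full-data KDE
# and the KDE of a random subsample, for growing subsample sizes.
dx = grid[1] - grid[0]
errors = []
for m in (50, 500, 5000):
    sub = rng.choice(full, size=m, replace=False)
    p_sub = gaussian_kde(sub, grid, h=0.3)
    errors.append(np.sum((p_full - p_sub) ** 2) * dx)
```

On this synthetic data the ISE shrinks steadily as the subsample grows, consistent with the theorem's statement that a suitably sized sample can stand in for the whole data set.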

  • 【Online Publisher】 Jiangnan University
  • 【Online Publication Year/Issue】 2011, No. 07