基于流形学习的数据约简方法研究与应用

The Research of Data Reduction Methods Based on Manifold Learning and Application

【Author】 闫志敏

【Supervisor】 刘希玉

【Author information】 Shandong Normal University, Computer Software and Theory, 2012, Master's thesis

【摘要】 With the continuing development of information technology, scientific research frequently encounters data sets with high-dimensional characteristics, and this high dimensionality makes it very difficult to uncover the intrinsic laws and structure of the data. Appropriate data reduction methods are therefore needed to process such data sets. Data reduction, also called dimensionality reduction, behaves differently on different data sets. Judged by the structure the data exhibits, manifold-learning-based data reduction methods fall into two categories: linear and nonlinear. Linear methods handle data sets with linear structure, or Gaussian data sets, effectively; nonlinear methods project data embedded in a high-dimensional space onto low-dimensional coordinates, so that the intrinsic geometric structure of the data can be explored further. Manifold learning uses data-analysis techniques to expose the geometric information within a sample set, that is, it represents complex high-dimensional data with a concise low-dimensional structure. Its main goal is to find the intrinsic distribution of data embedded in a high-dimensional space, and it has become a research focus in machine learning and related fields.

This thesis studies manifold-learning-based data reduction from two angles, the selection of the neighborhood parameter and the handling of newly added data points, applies the improved method to text clustering, and verifies the effectiveness and feasibility of the method experimentally. The main work is summarized as follows:

1. A method for judging whether the chosen neighborhood parameter is suitable. Kernel principal component analysis (KPCA) is used to reconstruct the data error, the reconstructed errors are clustered, and the suitability of the neighborhood choice is judged from the number of clusters. KPCA is used because it is a nonlinear extension of principal component analysis: it replaces the inner product of data vectors with a kernel function while retaining the properties of PCA. Mapping the original data into a high-dimensional feature space with a nonlinear function requires inner-product computations; replacing these with kernel evaluations on the original data greatly reduces the computational cost. The number of clusters is judged with the AIC information criterion: if the errors form a single cluster, the chosen neighborhood parameter has not changed the error structure and the value is suitable; if they form more than one cluster, the parameter has seriously changed the error structure and the value is unsuitable.

2. A new dimensionality reduction method. Local tangent space alignment (LTSA) is currently used relatively rarely because it has defects in some situations; for example, on large sample sets the intrinsic structure of the data may become distorted or incomplete, so LTSA does not handle newly added sample points well. Optimized linear discriminant analysis is a linear dimensionality reduction method that optimizes the Fisher criterion of the original method, making it more convenient to apply. The thesis combines optimized linear discriminant analysis with LTSA, using the optimized Fisher criterion to solve and transform the within-class and between-class projection matrices and obtain the optimal projection matrix of the data. The combination of the two methods handles newly added data points effectively.

3. Application of manifold-learning-based dimensionality reduction to text clustering. Text information is usually obtained by building matrices from the frequency of terms appearing in the text, and these matrices are high-dimensional. Exploring the intrinsic laws of text data further requires an appropriate dimensionality reduction method, and in recent years data reduction techniques have gradually been applied to text clustering. The thesis applies the LTSA method based on optimized linear discriminant analysis to reduce the dimensionality of high-dimensional text data: it obtains the local neighborhoods and local tangent-space coordinates of the data, aligns the local coordinates in the low-dimensional space to express the global coordinates, and aligns the local and global tangent-space coordinates by minimizing the local error. To obtain a good visualization, the processed data are clustered with the k-means method, and cluster quality is evaluated with an entropy measure.
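The suitability test in point 1 (cluster the KPCA reconstruction errors, then let AIC decide how many clusters there are) can be sketched as below. The thesis gives no formulas, so everything here is an illustrative assumption rather than the author's algorithm: the errors are treated as one-dimensional values, the clustering is a quantile-initialized k-means, the likelihood is a hard-assignment Gaussian mixture, and the helper names (`kmeans_1d`, `gaussian_aic`, `neighborhood_suitable`) are invented for the sketch.

```python
import math

def kmeans_1d(xs, k, iters=50):
    # Quantile-based initialization keeps the sketch deterministic.
    xs = sorted(xs)
    centers = [xs[int(i * (len(xs) - 1) / max(k - 1, 1))] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[j].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return [g for g in groups if g]

def gaussian_aic(xs, k):
    # AIC = 2p - 2 ln L with a hard-assignment likelihood: each cluster
    # is modeled as one Gaussian with its own weight, mean and variance.
    groups = kmeans_1d(xs, k)
    n = len(xs)
    loglik = 0.0
    for g in groups:
        mu = sum(g) / len(g)
        var = sum((x - mu) ** 2 for x in g) / len(g) + 1e-12
        w = len(g) / n
        loglik += sum(math.log(w) - 0.5 * math.log(2 * math.pi * var)
                      - (x - mu) ** 2 / (2 * var) for x in g)
    p = 3 * len(groups) - 1  # mean, variance and weight per cluster
    return 2 * p - 2 * loglik

def neighborhood_suitable(errors, kmax=3):
    # The neighborhood parameter counts as suitable when AIC prefers a
    # single cluster of reconstruction errors, as described in the thesis.
    best = min(range(1, kmax + 1), key=lambda k: gaussian_aic(errors, k))
    return best == 1

# A flat band of reconstruction errors versus a clearly split one.
uniform_errors = [0.5 + 0.001 * i for i in range(60)]
split_errors = ([0.1 + 0.0005 * i for i in range(30)]
                + [1.0 + 0.0005 * i for i in range(30)])
```

Under these assumptions, the flat band keeps the AIC minimum at one cluster (parameter suitable), while the split band pushes it to two or more (parameter unsuitable).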

【Abstract】 With the continuous development of information technology, scientific studies often encounter data sets with high-dimensional characteristics, and this high dimensionality creates great difficulty in obtaining the inner laws and structures of the data. Appropriate data reduction methods are therefore needed to process these data sets. Data reduction, also called dimension reduction, produces different results on different data sets. Viewed from the structure the data presents, data reduction methods based on manifold learning can be classified into two categories: linear methods and nonlinear methods. Linear dimension reduction methods can process data sets with linear structure, or Gaussian data sets, effectively; nonlinear methods can project data embedded in a high-dimensional space onto low-dimensional coordinates, so that the inherent geometric structure of the data can be explored further. Manifold learning exposes the inherent geometric structure of the data through data-analysis techniques, representing complex high-dimensional data with a concise low-dimensional structure. Its main purpose is to find the internal distribution of data embedded in high-dimensional space, and in recent years it has become a hot research topic in machine learning and related fields. This thesis studies manifold-learning-based data reduction from two aspects, neighborhood parameter selection and the processing of new data points, applies the improved method to text clustering, and uses experimental results to verify the feasibility and effectiveness of the method. The main work is summarized as follows:
1. A method for judging the suitability of the neighborhood parameter selection. Kernel principal component analysis (KPCA) is used to reconstruct the data error, the reconstructed errors are clustered, and the suitability of the neighborhood choice is judged from the number of clusters. KPCA is chosen because it is a nonlinear method built on principal component analysis: it replaces the inner product of data vectors with a kernel function while retaining the characteristics of PCA. Mapping the original data into a high-dimensional feature space with a nonlinear function requires inner-product computations; replacing them with kernel evaluations on the original data greatly reduces the computation. The AIC information criterion is used to judge the number of clusters in evaluating the clustering: when the data errors gather into a single class, the selected neighborhood parameter has not changed the error structure and the neighborhood value is suitable; when they gather into more than one class, the parameter has seriously changed the error structure and the value is not appropriate.

2. A new dimension reduction method. From current studies, local tangent space alignment (LTSA) is used rarely because it has defects in some cases: the inner structure of the data may be distorted or incomplete when processing large data sets, so LTSA is not ideal for processing new data sample points. Optimized linear discriminant analysis is a linear dimension reduction method that optimizes the Fisher criterion of the original method, making it more convenient to apply.
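The claim in point 1 that a kernel evaluated on the original vectors can stand in for an inner product in a high-dimensional feature space (the reason KPCA is cheap) can be checked on a textbook example; the degree-2 polynomial kernel and explicit feature map below are standard illustrations, not code from the thesis.

```python
import math

def poly_kernel(x, y):
    # Degree-2 polynomial kernel k(x, y) = (x . y + 1)^2, computed
    # directly on the low-dimensional inputs.
    return (sum(a * b for a, b in zip(x, y)) + 1) ** 2

def phi(x):
    # Explicit degree-2 feature map for a 2-D input; the kernel above
    # equals the inner product of these 6-D feature vectors.
    x1, x2 = x
    s = math.sqrt(2)
    return [x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0]

x, y = [1.0, 2.0], [3.0, -1.0]
lhs = poly_kernel(x, y)                               # kernel on raw data
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))      # inner product in feature space
# The two agree, so KPCA can work entirely with kernel evaluations and
# never needs to form the high-dimensional features explicitly.
```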
The thesis combines the optimized linear discriminant analysis method with LTSA, using the optimized Fisher criterion to solve and transform the within-class and between-class projection matrices and finally obtain the optimal projection matrix of the data. Through the combination of the two methods, new data points can be processed effectively.

3. Application of dimension reduction based on manifold learning to text clustering. In general, text information is obtained by building matrices from the frequency of terms appearing in the text, and these matrices are high-dimensional. Exploring the inner rules of text data further requires a proper dimension reduction method, and in recent years data reduction technology has gradually been applied to text clustering. The thesis uses the LTSA method based on optimized linear discriminant analysis to process high-dimensional text data: it obtains the local neighborhoods and local tangent-space coordinates of the data, aligns the local coordinates in the low-dimensional space to express the global coordinates, and aligns the local and global tangent-space coordinates by minimizing the local error. To obtain a good visual effect, the processed data are clustered with the k-means method, and an entropy value is used to evaluate the cluster quality.
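The entropy-based quality evaluation mentioned for the k-means step can be sketched with the standard weighted cluster-entropy measure; the thesis does not list its formula, so the function below and the toy labels are an illustrative sketch under that assumption.

```python
import math
from collections import Counter

def clustering_entropy(clusters, classes):
    # Weighted average entropy (in bits) of the true-class distribution
    # inside each cluster; 0 means every cluster is pure, and lower is
    # better as a cluster-quality score.
    n = len(classes)
    by_cluster = {}
    for c, t in zip(clusters, classes):
        by_cluster.setdefault(c, []).append(t)
    total = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        h = -sum((m / len(members)) * math.log2(m / len(members))
                 for m in counts.values())
        total += (len(members) / n) * h
    return total

# Two pure clusters versus one cluster mixing two classes evenly.
pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = clustering_entropy([0, 0, 0, 0], ["a", "a", "b", "b"])
```

A perfectly pure clustering scores 0, while a single cluster containing two equally frequent classes scores exactly 1 bit, which is why lower entropy indicates better agreement with the true classes.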
