
Research and Improvement of the Semi-supervised K-means Clustering Algorithm in Data Mining

(Original title: 数据挖掘中半监督K-均值聚类算法的研究与改进)

【Author】 Liu Fang (刘方)

【Advisor】 Liang Yanchun (梁艳春)

【Author information】 Jilin University, Bioinformatics, 2010, Master's degree

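【Summary】 Data mining is an important topic in current research on machine learning, pattern recognition, computer science, intelligent computing, applied mathematics, statistical learning methods, and intelligent robotics. It analyzes, refines, and mines implicit, previously unknown knowledge of potential value for decision-making from existing data. This thesis focuses on cluster analysis in data mining and studies both algorithms and applications. Building on the traditional K-means clustering algorithm, and in order to improve its efficiency, it proposes two improved K-means clustering algorithms that select the initial cluster centers using a data segmentation technique; applied to statistical data on the basic income and expenditure of urban households in the regions of China, these algorithms gave good results. Drawing on semi-supervised learning, it proposes a semi-supervised K-means clustering algorithm together with two improved variants that target the selection of the initial cluster centers; the algorithms before and after improvement were applied to statistical data on the height and weight of men and women in China, again with good results.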

【Abstract】 Data mining is an important topic in current research on machine learning, pattern recognition, computer science, intelligent computing, applied mathematics, statistical learning methods, and intelligent robotics. Data mining techniques are applied in databases, statistics, optimization, artificial intelligence, pattern recognition, parallel computing, machine learning, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. With the rapid development of modern computer, information, and communication technology, how to analyze, refine, and extract implicit, previously unknown, novel knowledge of potential value for decision-making from the available data has become a problem in urgent need of a solution.

This thesis focuses on cluster analysis within data mining and covers both algorithms and applications. Building on the traditional K-means clustering algorithm, and in order to improve its efficiency, it presents two improved K-means clustering algorithms that use data segmentation to select the initial cluster centers; applied to statistics on the basic income and expenditure of urban households in the regions of China, these algorithms achieved good results. Drawing on semi-supervised learning, it proposes a semi-supervised K-means clustering algorithm and, for the choice of the initial cluster centers, two improved semi-supervised K-means clustering algorithms; applied to statistical data on the height and weight of men and women in China, they also gave good results. The main contributions and findings of this thesis are as follows:

1. An overview of data mining research. The thesis introduces and summarizes the significance, main content, and applications of data mining, discusses its current problems, and points out directions for future research and development. Data mining emerged as an interdisciplinary field in the late 1980s. The current state of data mining capabilities and products results from the combined influence of several disciplines: databases, information retrieval, statistics, algorithms, and machine learning. With the rapid development of modern information, communication, and computer technology, the scope, depth, and scale of database applications keep expanding. Most traditional information systems are query-driven: treating the database as a store of historical knowledge works well for ordinary queries, but when the volume of data and the size of databases grow sharply, the query, retrieval, and statistical analysis mechanisms of traditional database management systems can no longer meet practical needs, and extracting useful information and knowledge from databases automatically, intelligently, and quickly becomes an urgent requirement. In general, data mining tasks fall into two categories: descriptive and predictive. Data mining is widely applied in financial data analysis, the study of gene sequence composition, retail data analysis, telecommunications, and other areas. Wherever there is data, there is data mining.

2. An introduction to and analysis of the theory and methods of clustering in data mining. The clustering problem is to identify the classes implicit in the data, where a class is a set of data with similar properties. Different notions of similarity lead to different clustering methods; for example, similarity can be described by distance. The way similarity is described is generally specified by the user or an expert. A good clustering method produces clusters with low similarity between classes and high similarity within each class. Clustering algorithms can be divided into two major categories: hierarchical methods and partitioning methods. The thesis introduces hierarchical algorithms and describes partitioning algorithms, singling out the K-means clustering algorithm among the latter and briefly working through an example of its solution process. Finally, a comparison of the relevant algorithms shows that K-means has the lowest space and time complexity among them.

3. An introduction to and analysis of semi-supervised learning methods. In traditional supervised learning, the learner is trained on a large amount of labeled data to build a model that predicts the labels of unlabeled data. Labeled data, however, are often difficult, expensive, and time-consuming to obtain, and labeling usually requires experienced researchers. With the rapid development of data collection and storage technology, unlabeled data are very easy to collect, but clustering with unlabeled data alone can produce large errors. Conversely, using only a small amount of "expensive" labeled data while ignoring the large amount of "cheap" unlabeled data wastes a great deal of data. Semi-supervised learning handles a large amount of unlabeled data together with a small amount of labeled data. By combining the two it avoids this waste, and it is of great significance in both theoretical research and practical applications. The thesis presents five learning methods for semi-supervised classification and gives a graphical illustration of semi-supervised clustering.

4. A study of the K-means clustering algorithm, with two improved algorithms, a semi-supervised algorithm, and two improved semi-supervised algorithms. K-means determines k cluster centers by means: once a class is fixed, its center is the mean of the data points within that class. The traditional K-means algorithm selects its initial cluster centers at random, which can make the algorithm less efficient: it needs more iterations and a longer CPU running time. To address this, the thesis proposes an improved method of selecting the initial points, called the improved K-means clustering algorithm. The algorithm uses data segmentation: the sample points of the data set are divided into k segments, and a center within each segment is taken as an initial center. This avoids choosing initial centers that are too close to one another. Experiments in the thesis show that the algorithm is effective. Combining this with the idea of semi-supervised learning, the thesis also presents a semi-supervised K-means clustering algorithm, extending the method of choosing the initial cluster centers to the semi-supervised setting. In the semi-supervised K-means clustering algorithm the choice of labeled points is very important and has a significant influence on the clustering results. The algorithm was applied to the clustering of two-dimensional data, which verified its effectiveness.

The results of this research enrich theoretical and applied work on the clustering problem in data mining. The study of cluster analysis, K-means clustering, and semi-supervised K-means clustering has both theoretical and practical value.
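
The two initialization ideas described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the thesis's actual procedure: the abstract does not say how the data are ordered before being cut into k segments, which point of a segment serves as its center, or how labeled points enter the semi-supervised variant. The sketch assumes lexicographic sorting, the segment mean as the center, and the per-class mean of the labeled points as the semi-supervised initial center. The function names (`segment_initial_centers`, `seeded_initial_centers`, `kmeans`) are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def segment_initial_centers(points, k):
    """Cut the sorted data into k contiguous segments and return the mean
    of each segment as an initial center.

    Assumptions of this sketch (not stated in the abstract): points are
    sorted lexicographically, and the segment mean is used as its center.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    ordered = sorted(points)
    n = len(ordered)
    bounds = [i * n // k for i in range(k + 1)]  # segment boundaries
    return [mean_point(ordered[bounds[i]:bounds[i + 1]]) for i in range(k)]


def seeded_initial_centers(labeled):
    """Return one initial center per class from a few labeled points.

    `labeled` maps a class label to a non-empty list of labeled points.
    Using the per-class mean is an assumption of this sketch.
    """
    return [mean_point(labeled[label]) for label in sorted(labeled)]


def kmeans(points, centers, max_iter=100):
    """Standard K-means iterations starting from the given centers.

    Returns (centers, assignments, passes), where `passes` counts the
    assignment passes performed before the assignments stopped changing.
    """
    centers = list(centers)
    assign = None
    for passes in range(1, max_iter + 1):
        # Assignment step: each point goes to its nearest center.
        new_assign = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        if new_assign == assign:
            return centers, assign, passes
        assign = new_assign
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean_point(members)
    return centers, assign, max_iter


# Toy two-dimensional data, made up for illustration only.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centers, assignments, passes = kmeans(data, segment_initial_centers(data, 2))
print(centers, assignments, passes)
```

Because each segment contributes one center, the initial centers are spread across the data rather than drawn at random, which is the property the abstract credits with reducing the number of iterations. The sketch makes no claim about the size of that effect.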

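  • 【Online publication contributor】 Jilin University
  • 【Online publication year and issue】 2010, Issue 08
  • 【Classification number】 TP311.13
  • 【Times cited】 10
  • 【Downloads】 565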