
Research on Web Page Clustering Algorithms Based on Map/Reduce (基于映射/规约的网页聚类算法研究)

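【Author】 Yu Zhihai (于治海)

【Advisor】 Guo Lin (国林)

【Author Information】 Harbin Engineering University, Computer Application Technology, 2011, Master's degree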

【Abstract】 As network applications have become widespread, the volume of online information has grown rapidly, and extracting useful knowledge from massive data has become increasingly important. After long research and exploration, data mining emerged as an interdisciplinary field whose techniques can effectively extract the knowledge users need; cluster analysis is one of its core topics and basic tools. The exponential growth of data volume and the complexity of application development have seriously hindered the development of multi-core and multi-processor systems, so that data cannot be used effectively. The classic approach is to build a distributed system on the Message Passing Interface (MPI). Because MPI offers only fine-grained control over a parallel application, the abstraction and complexity of this approach exceed existing computing capability. Compared with traditional distributed systems, the Map/Reduce framework provides a higher level of abstraction than MPI and can be applied to many data-intensive batch-processing tasks, and its abstraction and complexity can be handled within existing computing capability.

Building on a study of parallel computing and the Map/Reduce programming framework, this thesis proposes a theoretical improvement to the Map/Reduce framework that raises its processing performance. On top of the improved framework, it implements MRK-Means, a Map/Reduce-based K-Means algorithm that computes iteratively by executing the Map/Reduce operation multiple times. Combining this algorithm with the characteristics of web pages (massive volume, dynamic content, and fast updates), the thesis further proposes an attribute-aware online algorithm, OMRK-Means, which improves the scalability and accuracy of online clustering, shortens clustering time, and handles incremental data effectively. Experiments show that MRK-Means, while preserving clustering quality, is faster than the traditional K-Means algorithm. Experimental analysis of the convergence, execution time, accuracy, and scalability of OMRK-Means shows that, when data arrives incrementally and in parallel, the online algorithm speeds up interactive analysis of large data sets, improves clustering accuracy, and supports scalable web mining.
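
The iterative scheme described in the abstract, in which each K-Means iteration is one Map/Reduce pass, can be illustrated with a minimal single-process sketch. This is not the thesis's MRK-Means implementation: the function names (`map_phase`, `reduce_phase`, `mr_kmeans`), the Euclidean distance, the stopping rule, and the sample data are all assumptions made for illustration, and a real deployment would run the two phases on a distributed Map/Reduce framework rather than in one Python process.

```python
import math
from collections import defaultdict


def map_phase(points, centroids):
    """Map step (illustrative): emit (nearest centroid index, point) pairs."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce step (illustrative): average the points emitted for each key."""
    sums = {}
    counts = defaultdict(int)
    for idx, p in pairs:
        if idx in sums:
            sums[idx] = [s + x for s, x in zip(sums[idx], p)]
        else:
            sums[idx] = list(p)
        counts[idx] += 1

    new_centroids = []
    for i, old in enumerate(centroids):
        if counts[i]:
            new_centroids.append(tuple(s / counts[i] for s in sums[i]))
        else:
            # Assumption: a cluster that received no points keeps its centroid.
            new_centroids.append(old)
    return new_centroids


def mr_kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat the map/reduce pass until the centroids stop moving."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(mr_kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
    # prints [(0.0, 1.0), (10.0, 11.0)]
```

In this sketch the map step is independent per point, which is what would let it be spread across workers, while the reduce step only needs the per-cluster sums and counts.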

【Key words】 Clustering; Parallel; Map/Reduce; Web page