Research on Web Structure Mining Algorithm in Cloud Computing

【Author】 Li Yuanfang (李远方)

【Advisor】 Deng Shikun (邓世昆)

【Author information】 Yunnan University, Computer Application Technology, 2011, Master's thesis

【Abstract】 Data mining analyzes large volumes of heterogeneous data and extracts useful knowledge and information from them. Web pages are now the most important information repository on the Internet, so research on Web data mining is of considerable importance. As the Internet grows rapidly, however, the daily volume of Web information increases exponentially, and the computing and storage capacity of a single node has become a bottleneck for extracting useful information from it. The recently proposed cloud computing paradigm offers a new solution: powerful computing and storage capacity can be obtained over the network, so that knowledge and information can be mined efficiently and in parallel. After an overview of cloud computing, Web structure mining, Hadoop, and related background, this thesis integrates Web structure mining algorithms with cloud computing (the open-source Hadoop cloud platform) and makes the following contributions:

1) It abstracts the Web link structure as a graph and describes in detail how to obtain Web graph structure data, giving the mining algorithms a unified data representation.

2) It gives a MapReduce abstraction of data objects such as databases and file systems, showing that the MapReduce model is widely applicable and can meet practical needs.

3) It studies Hadoop's block size (BlockSize) strategy, establishes a corresponding mathematical model, and explores it tentatively in the experiments.

4) It improves and ports the traditional parallel PageRank algorithm and proposes the K-span algorithm. The idea of the K-step span algorithm (K-span) is to reduce, as far as possible, the number of communications between Hadoop cluster nodes during parallel PageRank iteration, so that the total iteration time decreases and convergence is reached faster. Concretely, while Hadoop is running, the values of D^k and (A^T)^k can be computed in advance, one after another, and stored in a location that all Hadoop nodes can access, which avoids frequent communication between nodes.

Finally, a Hadoop platform is set up to estimate the time and space costs of the traditional parallel PageRank algorithm and the parallel K-span algorithm. The experimental results show that K-span has a better execution time but incurs additional storage overhead, a trade-off that is worthwhile given the large storage capacity of the cloud platform.
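The abstract does not give the K-span formulas, so the following is only a minimal single-machine sketch of the general idea, not the thesis's Hadoop implementation. It assumes the standard damped PageRank update r' = d·M·r + (1-d)/n, reads D as the damping factor d and A^T as the column-normalized link matrix M, and folds k updates into one precomputed operator. All function names are hypothetical.

```python
def build_matrix(links, n):
    """M[i][j] = probability of moving from page j to page i (assumed model)."""
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        targets = links.get(j, [])
        if targets:
            for i in targets:
                M[i][j] += 1.0 / len(targets)
        else:
            # Assumption: a page without out-links jumps to any page uniformly.
            for i in range(n):
                M[i][j] = 1.0 / n
    return M


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mat_mul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def plain_step(M, r, d):
    """One ordinary PageRank update: r' = d*M*r + (1-d)/n."""
    n = len(r)
    return [d * x + (1.0 - d) / n for x in mat_vec(M, r)]


def k_span_operator(M, d, k):
    """Precompute P = (d*M)^k and c = sum_{j<k} (d*M)^j * b, with b = (1-d)/n."""
    n = len(M)
    G = [[d * x for x in row] for row in M]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    b = [(1.0 - d) / n] * n
    c = [0.0] * n
    for _ in range(k):
        # Invariant before each pass: P = G^i and c = sum_{j<i} G^j * b.
        c = [x + y for x, y in zip(c, mat_vec(P, b))]
        P = mat_mul(G, P)
    return P, c


def k_span_step(P, c, r):
    """Advance the rank vector by k ordinary updates in a single pass."""
    return [x + y for x, y in zip(mat_vec(P, r), c)]


# Toy graph: page index -> list of pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
M = build_matrix(links, 4)
r0 = [0.25] * 4

plain = r0
for _ in range(6):
    plain = plain_step(M, plain, 0.85)

P, c = k_span_operator(M, 0.85, 3)
spanned = k_span_step(P, c, k_span_step(P, c, r0))

print(plain)
print(spanned)
```

In this sketch, two span steps with k = 3 reproduce six ordinary updates up to floating-point rounding, so a distributed run would need fewer synchronization rounds. The precomputed operator is generally denser than the original link matrix, which is consistent with the extra storage overhead the abstract mentions.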

【Key words】 Cloud Computing; Web Structure Mining; PageRank; Hadoop; K-span
  • 【Online publication contributor】 Yunnan University
  • 【Online publication year and issue】 2012, Issue 04