节点文献

基于MapReduce的信息检索相关算法并行化研究与实现

The Research and Implementation of Parallelism of Information Retrieval Related Algorithms Based on Mapreduce

【作者】 肖韬

【导师】 袁春风;

【作者基本信息】 南京大学 , 计算机软件与理论, 2012, 硕士

【摘要】 随着Internet的日益普及与迅速发展,互联网上的信息量呈几何级数增长,信息爆炸已成为当今网络时代的特征之一。作为访问互联网的重要入口,搜索引擎在帮助用户从浩如烟海的Internet中快速准确地获得所需信息方面起到了日益重要的作用,人们的生产生活已经越来越依赖搜索引擎。搜索引擎检索的对象是整个互联网上的全部数据,包括网页、图片、音乐、视频、FTP资源等。这些海量的数据对信息检索系统的高效运行提出了新的挑战:一方面,单台计算机的处理能力受到CPU时钟频率、内存容量、磁盘读写速度和网络带宽等因素的制约,无法在理想的时间内独自处理全部的数据;另一方面,这些海量数据并非存储在单台计算机上或者单个数据库中,而是分布在整个Internet上,这就需要成千上万台计算机以“相互合作”的方式对这些海量数据进行处理。因此,为搜索引擎设计能够高效地处理海量Internet数据的并行算法成为了学术界和工业界共同的研究方向与追求目标。在过去的数十年中,并行计算领域的研究取得了长足的进步,一些经典的并行计算平台相继出现,如MPI、OpenMP、OpenCL、CUD A等,特别是Google于2004年提出的MapReduce并行计算模型,以其良好的可扩展性、可靠性和易用性,为并行计算提供了简单、高效的计算模型和运行环境,降低了并行计算从理论向应用转化的难度,为并行计算的实际应用提供了一个简单易用的平台。信息检索领域的传统算法发展至今已日趋成熟,然而,有些算法并非是专为并行环境设计的,面临着无法直接处理大规模的海量数据或者无法在有效的时间内完成对海量数据的计算的窘境。因此,如果能够将这些算法加以改造,使其能够分布在多台计算机上并行地运行,则可以大大提高对海量数据的处理效率,更加快速地响应人们的搜索需求,改善用户的搜索体验。在信息检索领域中,查询推荐(Query Suggestion)与网页排序(Page Rank)是两项重要的研究内容:查询推荐可以帮助用户更加精确有效地查询并节省搜索时间,而网页排序则可以改善搜索质量、帮助用户更容易地找到所需的网页。如果能够对这两个领域中的一些串行算法进行并行化改造,使其能够并行地运行于计算机集群中,则能够有效提升搜索引擎对大规模数据的处理能力,加快搜索引擎在查询推荐和网页排序方面的更新速度,提高用户对检索的满意度。本文研究了查询推荐领域的QUBIC算法和基于频繁项集挖掘的网页排序算法,以对海量Internet数据的并行处理作为研究背景,基于MapReduce并行计算模型对QUBIC算法和基于频繁项集挖掘的网页排序算法进行了并行化改造,使得QUBIC算法和基于频繁项集挖掘的网页排序算法能够运行于MapReduce并行计算框架之中,并利用Hadoop并行计算软件框架实现了一个原型系统。具体而言,本文的主要研究工作包含以下方面:(1)对QUBIC算法进行基于MapReduce模型的并行化改造,提出了数据分布和并行计算的具体方法,包括:搜索引擎日志文件的分布存储,Query-URL二部图的构造,Jaccard相似系数的计算,QAG的生成,QAG中连通分量的计算以及对Query的排序。(2)对传统的SON频繁项集挖掘算法进行基于MapReduce模型的并行化改造,提出频繁项集并行挖掘的PSON算法,并将其应用于对频繁URL的挖掘。在计算出搜索引擎返回结果中关联性较大的一组URL后,按照其重要程度降序呈现给用户。本文在Hadoop并行计算平台上实现了本文对原算法进行并行化改造的思想,并进行了实验。实验表明,本文提出的对相关算法进行并行化改造的方法是行之有效的,并且具有良好的可扩展性能和加速比性能。最后,本文实现了一个原型系统,从整体上演示了QUBIC并行算法和频繁URL并行挖掘算法的运行效果,验证了这两类算法的正确性和有效性。

【Abstract】 With the growing population and fast development of Internet, the scale of data on Internet is increasing exponentially and information explosion has been one of the epochal features. As an important access point of Internet, search engine is playing an more and more important role in helping users find information they need precisely from massive scale of Internet, and people depend increasingly on search engine in life and production. The searching object of search engine is the total Internet, including web pages, graphs, music, vedio, FTP resource and so on. The massive data on Internet has brought a big challenge for the efficient running of information retrieval system:on one side, limited by the CPU clock frequency, memory capacity, disk I/O rate, and network band width, etc.,a single computer can not finish processing all the data alone; on the other side, the massive data is distributed over the whole Internet instead of stored in a single computer or database, which means that thousands or even tens of thousands of computers need to process these data together in a coorperational way. As a result, proposing parallel algorithms for search engine that can process massive data on Internet efficiently has become a common target of academia and industry. The research in parallel computing domain has made a great progress in the past decades, and some classical parallel computing platforms have been proposed, including MPI, OpenMP, OpenCL and CUDA, etc. Among these platforms, the MapReduce parallel computing paradigm proposed by Google in2004provides parallel computing a simple and efficient computing paradigm and environment by its excellent scalability, reliability and ease of use. MapReduce makes it easier to transform parallel computation from theory to application.The traditional algorithms in information retrieval domain have become increasingly mature, however, some algorithms are not designed especially for parallel computing environment, which means they cannot process massive data directly or finish processing massive data in an acceptable time period. They will run on multiple computers in parallel when parallelized, which can enhance the efficiency of processing massive data, respond to the searching needs of users faster and improve uses’searching experience. Query suggestion and page rank are two important research issues in information retrieval domain:query suggestion can help users get search results more precisely in a shorter time period, while page rank can help improve searching experience and help users find their target pages easily. If we can parallelize some of these serial algorithms in information retrieval domain and make them run in a cluster in parallel, then the processing capacity of search engine for massive data can be enhanced effectively, the update cycle of query suggestion and page rank can be shorter, and users will be more satisfied with the searching process.In this paper, we studied the QUBIC algorithm in query suggestion domain and page rank algorithms based on frequent sets mining in the research background of parallel processing of massive Internet data, and parallelized QUBIC and frequent sets mining based page rank algorithm based upon MapReduce parallel computing paradigm, which makes QUBIC and page rank algorithms be able to run in MapReduce parallel computing framework. Besides we implemented a prototype system in Hadoop. Specifically, our work mainly consists of these parts below:(1) We parallelized QUBIC algorithm based on MapReduce parallel computing paradigm and proposed specific methods for data distribution and parallel computation, including the distributed storage of search engine log files, the construction of Query-URL bipartite, the computation of Jaccard similarity coefficient, the generation of QAG, the computation of connected components in QAG and the ranking of Query.(2) We parallelized traditional SON frequent sets mining algorithm based on MapReduce parallel computing paradigm and proposed PSON algorithm which is later applied to mine frequent URLs by computing aset of related URLs from searching results and displaying them according to their order of importance.We implemented our parallelization design of original algorithms in Hadoop parallel computing platform and conducted experiments. Experimental results proved that our parallelization of original algorithms is effective which can achieve good performance in scale-up and speed-up. Finally we implemented a prototype system where we demonstrated the running procedure of parallel QUBIC algorithm and frequent URL sets mining algorithm, which proved the correctness and efficiency of these two algorithms.

  • 【网络出版投稿人】 南京大学
  • 【网络出版年期】2012年 10期
节点文献中: 

本文链接的文献网络图示:

本文的引文网络