
Research on the Result Ranking Strategy of an Agricultural Search Engine Based on Nutch

(Original title: 基于Nutch的农业搜索引擎检索结果排序策略的研究)

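【Author】 Wang Chunhua (王春花)

【Advisor】 Zhu Junping (朱俊平)

【Author information】 Northwest A&F University (西北农林科技大学), Computer Software and Theory, 2010, Master's thesis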

【Abstract】 A search engine is a technology for locating information on the Internet quickly and effectively. The part closest to the user is the ranking of retrieval results, because the ranking directly shapes the user experience; to a considerable extent, good ranking makes a good search engine. As computers have spread through rural China, the volume of agricultural information has grown sharply, and agricultural search engines have become an active research topic. The goal of this study is to analyze result-ranking strategies for search engines in depth, to improve the traditional PageRank algorithm, and to apply the improved algorithm in an agricultural search engine built on Nutch.

The thesis first analyzes the search engine workflow and examines the factors in web crawling, index construction, and query execution that affect ranking. Second, it analyzes the ranking process and identifies the key factors that affect ranking, together with their underlying principles. Third, it analyzes classic ranking algorithms and how they are implemented. It then analyzes the open-source Nutch search engine, studies its ranking algorithm, and improves that algorithm in two respects: authority based on hyperlink analysis, and relevance based on content analysis. Finally, an agricultural search engine is built on Nutch by controlling the entry addresses used for web crawling, and the proposed ranking algorithm is applied to it.

The experiments give the specific procedure for building the Nutch-based agricultural search engine. The improved algorithm is evaluated with two common measures, P@n and the first-page repetition rate. The experiments analyze the algorithm quantitatively and show that its user satisfaction and first-page repetition rate are about 7% higher than those of the algorithm before improvement.

The main contribution of the thesis is an improvement to the hyperlink-analysis (authority) component of PageRank, in two parts: an implementation, based on depth-2 link analysis, of the idea that a parent page passes its weight to its child pages non-uniformly; and a compensation strategy for newly created resources and isolated resources. The thesis sets out the basic idea behind these two innovations, proposes specific formulas for them, and analyzes them briefly. For relevance based on content analysis, it introduces the concept of an agricultural topic vector, describes how the vector is computed and constructed, and gives a formula for a document's agricultural relevance. Finally, these elements are combined into an algorithm that incorporates content analysis and passes weight non-uniformly according to the relevance between parent and child pages.
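
The abstract names two ideas that a short example may make more concrete: the P@n evaluation measure, and a PageRank variant in which a parent page passes its weight to its children non-uniformly. The Python sketch below is illustrative only. The abstract does not reproduce the thesis's formulas, so the weighting rule here (each child receives a share proportional to a relevance score) is an assumption, and the function names, the `relevance` dictionary, and the damping value of 0.85 are hypothetical. The compensation strategy for new and isolated resources is not modeled.

```python
def precision_at_n(ranked_results, relevant, n):
    """P@n: the fraction of the top-n ranked results judged relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_results[:n]
    hits = sum(1 for doc in top if doc in relevant)
    return hits / n


def weighted_pagerank(links, relevance, damping=0.85, iterations=50):
    """PageRank where a parent splits its score among its children in
    proportion to each child's relevance score.

    This proportional split is an assumed, illustrative weighting;
    it is not the formula proposed in the thesis.
    """
    pages = sorted(set(links) | {c for cs in links.values() for c in cs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the uniform "teleport" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for parent in pages:
            children = links.get(parent, [])
            total = sum(relevance.get(c, 0.0) for c in children)
            if not children or total == 0.0:
                # No out-links or no usable weights: spread evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[parent] / n
                continue
            for child in children:
                share = relevance.get(child, 0.0) / total
                new_rank[child] += damping * rank[parent] * share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical three-page graph: "hub" links to an agricultural page
    # and to an unrelated news page; both link back to "hub".
    links = {"hub": ["farm", "news"], "farm": ["hub"], "news": ["hub"]}
    relevance = {"hub": 0.5, "farm": 0.9, "news": 0.1}
    ranks = weighted_pagerank(links, relevance)
    ordered = sorted(ranks, key=ranks.get, reverse=True)
    print(ordered)
    print(precision_at_n(ordered, {"hub", "farm"}, 2))
```

With uniform relevance scores the split reduces to the even division used by standard PageRank, which makes the sketch a convenient baseline for comparing a non-uniform scheme against the original algorithm.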

  • 【Classification number】TP391.3
  • 【Times cited】10
  • 【Downloads】306