

Research and Development of Search Engine Based on Focus Relevance Ranking

【Author】 Wen Quan (温泉)

【Advisor】 Ding Xiangwu (丁祥武)

【Author Information】 Donghua University, Computer Architecture, 2010, Master's degree

【Abstract】 Search engines are an important tool for retrieving useful information from massive volumes of web data, and they are a key topic in the research and application of web information. With the explosive growth and diversification of web information, obtaining the required information quickly and effectively has become increasingly difficult. General-purpose search engines can no longer meet users' requirements for retrieval accuracy, and specialized, topic-oriented vertical search engines have become a research focus. Relevance ranking is one of the key technologies of a search engine: it plays a vital role both in acquiring topic-relevant data and in returning relevant query result sets. This thesis studies relevance techniques in vertical search engines, analyzes their shortcomings, and proposes improved models and algorithms for topic crawling, link-structure-based ranking, and page-weight-based ranking, in order to raise the quality of relevance ranking and thereby improve the performance of vertical search engines. Finally, a domain-oriented vertical search engine system is designed and implemented. The main contributions of the thesis are:

(1) To address the inability of topic crawlers to pass through "dark tunnels", an online learning method combined with an auxiliary function is used to improve the crawler's topic crawling strategy, so that it can fetch topic data of higher relevance.

(2) The PageRank algorithm and its improved variants are studied. By modeling how users click through web pages, the way PageRank values are passed along links is modified, yielding an improved algorithm. Experiments show that the algorithm effectively avoids topic drift without requiring additional storage space.

(3) To address the excessively high dimensionality of feature extraction models for page weight, a custom method for defining page weight is proposed. It defines the factors that make up page weight and uses a separability criterion to measure the weight of each factor, giving an evaluation function for page weight that effectively reduces the dimensionality of the page feature space.

(4) The three improvements above are combined into a focus relevance ranking scheme, which is applied in the implementation of the search engine.

(5) Using the Lucene full-text search engine framework, a vertical search engine system for automobile-related resources is implemented. In practical use, focus relevance ranking improved the relevance, recall, and precision of this vertical search engine to varying degrees.
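Contribution (2) builds on PageRank, but the abstract does not state the improved rule for passing rank along links. As a point of reference only, the following minimal sketch shows the standard PageRank power iteration that such variants start from; it is not the thesis's algorithm. The function name `pagerank`, its parameters, and the dictionary-based graph representation are illustrative assumptions.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Standard PageRank by power iteration (baseline, not the thesis's variant).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from a uniform distribution.
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}

        # Each page passes its damped rank equally to its out-links.
        # An improved variant would change how this share is split.
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share

        # Stop once the total change between iterations is negligible.
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break

    return rank
```

In this baseline every out-link receives an equal share of its source page's rank. A click-behavior model such as the one the abstract mentions would presumably replace that equal split with link-dependent weights, though the exact weighting is not given here.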

  • 【Online Publication Submitter】 Donghua University
  • 【Online Publication Year and Issue】 2010, Issue 08