Research and Design of an Intelligent Vertical Search Engine (智能垂直搜索引擎的研究与设计)

【Author】 Huang Shenggen (黄胜根)

【Advisor】 Chen Shuyu (陈蜀宇)

【Author information】 Chongqing University, Computer System Architecture, 2010, Master's degree

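【Summary】 With the rapid development of the Internet, the information and resources on the Web keep growing. Faced with this mass of information, how to obtain the needed resources faster and better has become an issue of increasing concern. The result pages returned by general-purpose search engines contain a large number of "noise" pages, so users have to pick out the topics they care about by hand. Vertical search engines give users a faster, more specialized, and more precise retrieval service for network resources. A vertical search engine aims to build a repository of Internet information resources for a particular subject area or discipline, and it intelligently collects from the Internet the resources that match a set topic or meet the needs of that discipline. Because it targets a single topic, it can offer a more focused and more specialized search service. Building on a study of the key technologies of vertical search engines, this thesis studies and designs the focused crawling module, the index module, and the retrieval module of a vertical search engine, and implements a prototype system. The main work is as follows:

① To address "topic drift", a pressing problem for current vertical search engines, the thesis proposes an improved focused crawling model. It mainly comprises a feedback-based topic knowledge base, a topic judgment model, and a link analysis model. Topic keywords are continually extracted from the topic page database and fed back to enrich and refine the topic knowledge base, which gives the knowledge base a degree of learning and adaptive capability. Taking the weights of different HTML tags into account, an improved vector space model algorithm judges the topic similarity of a page, which improves the effectiveness and accuracy of topic judgment. Based on the idea of the Shark algorithm, HTML documents are parsed into a DOM tree and a link-context threshold is set, yielding a DOM-based model that judges the topic similarity of a link from its link context, so that the topic similarity of URLs is judged better and the direction of focused crawling is guided.

② On the basis of a study of the basic principles of full-text retrieval and the organization of the inverted index, and combining character indexing, word indexing, and the characteristics of topic pages, the thesis proposes a hybrid index model based on the topic knowledge base, which improves the efficiency and accuracy of index construction. It also designs the workflow of a retriever based on the hybrid index and, in combination with the vector space model, analyzes and discusses the ranking of retrieval results.

③ Finally, a prototype vertical search engine for the "hardware" (五金, metal goods) domain is implemented on the Nutch framework. Experimental tests on the prototype show that the system achieves good precision, is adaptive, shows a degree of intelligence, and solves the "topic drift" problem to a certain extent. The research goals of the thesis are largely met, and the work provides a theoretical and experimental basis for follow-up research.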
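The tag-weighted topic judgment in item ① can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the weights in `TAG_WEIGHTS`, the names `WeightedTermParser` and `topic_similarity`, and the regex tokenizer are all assumptions made for illustration (the abstract gives no weight values, and Chinese pages would need a word segmenter rather than `\w+`). Each term's frequency is scaled by the highest-weighted enclosing tag, and the resulting page vector is compared with a topic vector by cosine similarity.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the abstract does not state the values used.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "a": 1.5, "b": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermParser(HTMLParser):
    """Collects term frequencies scaled by the weight of the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # currently open tags that carry a weight
        self.weights = Counter()  # term -> accumulated weighted frequency

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened matching weighted tag, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i:]
                break

    def handle_data(self, data):
        # Use the largest weight among the enclosing weighted tags.
        weight = max((TAG_WEIGHTS[t] for t in self.open_tags),
                     default=DEFAULT_WEIGHT)
        for term in re.findall(r"\w+", data.lower()):
            self.weights[term] += weight


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_similarity(html, topic_vector):
    """Scores an HTML page against a topic vector (term -> weight)."""
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return cosine(parser.weights, topic_vector)
```

With these assumed weights, a page whose title contains a topic keyword scores higher than one that mentions it only in body text. A crawler could compare the score with a threshold to decide whether to keep the page.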

【Abstract】 With the rapid development of the Internet, the resources on the Web keep expanding. Faced with this mass of information, more and more people are concerned with how to access resources better and faster. The result pages of general-purpose search engines contain many "noise" pages, so users have to pick out what they need themselves. Vertical search engines provide faster, more specialized, and more accurate search services for network resources. A vertical search engine collects Internet information resources that match specific topics, which lets it offer a more specialized search service. This thesis designs a prototype vertical search engine system comprising a focused crawling model, an index model, and a retrieval model. The main work is as follows: ① The thesis presents an improved focused crawling model that addresses the "topic drift" problem. It comprises a feedback-based subject knowledge base, a topic identification model, and a link analysis model. Through continuous feedback of theme words, the subject knowledge base gains a degree of adaptive capacity; taking the different weights of HTML tags into account, the thesis presents an improved vector space model (VSM) algorithm to determine the topic similarity of a page; by parsing the HTML document into a DOM tree, the thesis proposes a link context model to determine the topic similarity of a URL more accurately. ② The thesis studies the principles of full-text search and the structure of the inverted index in depth. On this basis, it presents a hybrid index model based on the subject knowledge base to improve the efficiency and accuracy of indexing. It then designs the search workflow based on the hybrid index and analyzes the ranking model for search results in combination with the vector space model. ③ Finally, the thesis implements a prototype vertical search engine for the hardware (metal goods) domain on the Nutch framework. Experiments show that the system achieves good precision and a degree of self-adaptivity, alleviates the "topic drift" problem to some extent, and largely meets the research goals; it also provides a theoretical and experimental basis for follow-up study.
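The link context idea in item ① can be sketched as follows, in Python. This is a simplified illustration and not the model from the thesis: the abstract does not define the link-context threshold, so the sketch assumes it is a minimum number of context terms (`min_terms`), and the scoring rule (the share of context terms that are topic keywords) and the names `DomBuilder`, `link_context`, and `link_scores` are hypothetical. The context of each link starts as its anchor text and widens to ancestor DOM nodes until the threshold is met.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal DOM node: children are Node objects or text strings."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []

    def text(self):
        return " ".join(c if isinstance(c, str) else c.text()
                        for c in self.children)


class DomBuilder(HTMLParser):
    """Builds a simple DOM tree and records every <a href=...> node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root
        self.links = []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag == "a" and node.attrs.get("href"):
            self.links.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore strays.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def link_context(link, min_terms):
    """Widens from the anchor to its ancestors until min_terms is reached."""
    node = link
    terms = tokenize(node.text())
    while len(terms) < min_terms and node.parent is not None:
        node = node.parent
        terms = tokenize(node.text())
    return terms


def link_scores(html, topic_terms, min_terms=5):
    """Maps each href to the share of its context terms that are on topic."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    scores = {}
    for link in builder.links:
        terms = link_context(link, min_terms)
        hits = sum(1 for t in terms if t in topic_terms)
        scores[link.attrs["href"]] = hits / len(terms) if terms else 0.0
    return scores
```

When two links share the same uninformative anchor text (for example "details"), the anchor alone cannot separate them, but the text of the enclosing list item can. A focused crawler could use such scores to order its URL queue.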

  • 【Online publication contributor】 Chongqing University
  • 【Online publication year/issue】 2011, Issue 04