
Research on the Design and Implementation of Extensible Distributed Vertical Search Engines

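[Author] Li Bin (黎斌)

[Advisor] Xian Ming (鲜明)

[Author information] National University of Defense Technology, Electronics and Communication Engineering, 2008, Master's thesis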

[Abstract] The Internet contains a large volume of hidden web resources that users cannot easily discover, for a variety of reasons. Because these hidden resources surpass ordinary web resources in both quantity and quality, research on uncovering them is becoming increasingly important. General-purpose search engines cannot capture this information comprehensively because their crawl depth is limited. In addition, many websites impose access restrictions that block ordinary crawlers, and the page parsing of general-purpose engines cannot adapt to the widely varying forms of web pages. Vertical search engines, by contrast, perform better at mining hidden information. They apply crawling strategies and parsing methods customized to the characteristics of the target resources, extract web information with very high precision, and let users query carefully filtered information within a particular domain.

This thesis studies the technologies underlying search engines. Based on an analysis of the crawling strategies used by focused crawlers, it proposes a crawling method for the resources of foreign military forum websites that relies on their tree-shaped site structure. Forums generally conform strictly to a tree structure in how their pages are linked, so a link-selection mechanism can be added that makes the crawler fetch only the post pages that hold information. For information classification, forum posts contain a great deal of useless content (replies, malicious posts), and statistics show that such content typically has two characteristics: few words and few paragraphs. Exploiting this, the thesis proposes a classification method based on fuzzy pattern recognition. The word count and paragraph count of a post are extracted as influence factors, their degree of influence and weights are determined by sample analysis, and a membership function for classification is derived from the shape of the S-function, which effectively improves classification quality. For indexing and retrieval, the thesis studies the indexing method of Lucene, the indexing software commonly used in vertical search engines, and proposes a result-caching method for user queries, implemented with OSCache, which greatly improves retrieval response speed.

Building on this study, a military-information search engine covering part of the content of the Military.com forum was built in Java, implementing the results above. Finally, the thesis studies the system architectures and operating mechanisms of distributed search engines, proposes a system framework for a distributed vertical search engine based on a distributed meta-search engine system, and proposes a CORBA-based distributed implementation method.
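
The link-selection idea for tree-structured forums can be illustrated with a minimal sketch. It is not the thesis's implementation (which was written in Java). The URL patterns, the `crawl_posts` function, and the `get_links` callback are all hypothetical, chosen only to show a crawler that descends through listing pages and collects post pages while ignoring every other link.

```python
import re
from collections import deque

# Hypothetical URL patterns for a forum laid out as a tree:
# root -> board (thread listing) pages -> individual post pages.
THREAD_LIST = re.compile(r"/forum/board/\d+(\?page=\d+)?$")
POST_PAGE = re.compile(r"/forum/thread/\d+$")


def crawl_posts(root, get_links):
    """Breadth-first walk that only follows listing pages and collects post pages.

    get_links(url) is an assumed callback returning the outgoing links of a page.
    """
    queue = deque([root])
    seen = {root}
    posts = []
    while queue:
        url = queue.popleft()
        for link in get_links(url):
            if link in seen:
                continue
            if POST_PAGE.search(link):
                # Leaf of the tree: a post page that holds content.
                seen.add(link)
                posts.append(link)
            elif THREAD_LIST.search(link):
                # Inner node: a listing page worth expanding further.
                seen.add(link)
                queue.append(link)
            # Any other link (login, about, ads) is skipped entirely.
    return posts
```

In this sketch the selection rule is purely pattern-based, so pages outside the tree are never requested at all.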
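
The fuzzy classification step can be sketched as follows. The abstract does not give the actual membership formula, so this sketch assumes Zadeh's standard S-function and combines the two factors as a weighted sum. The thresholds, weights, and function names are illustrative assumptions, not values from the thesis.

```python
def s_membership(x, a, b):
    """Zadeh's S-shaped membership function: 0 at or below a, 1 at or above b."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    mid = (a + b) / 2.0
    if x <= mid:
        return 2.0 * ((x - a) / (b - a)) ** 2
    return 1.0 - 2.0 * ((x - b) / (b - a)) ** 2


def usefulness(word_count, paragraph_count,
               word_range=(20, 200), para_range=(1, 5),
               word_weight=0.6, para_weight=0.4):
    """Weighted membership of a post in the 'useful' class.

    All ranges and weights are assumed for illustration; the thesis
    derives its own values through sample analysis.
    """
    mu_words = s_membership(word_count, *word_range)
    mu_paras = s_membership(paragraph_count, *para_range)
    return word_weight * mu_words + para_weight * mu_paras


def is_useful(word_count, paragraph_count, threshold=0.5):
    """Classify a post as useful when its membership reaches the threshold."""
    return usefulness(word_count, paragraph_count) >= threshold
```

A short one-paragraph reply scores near 0 under these assumed parameters, while a long multi-paragraph post scores near 1.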
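
The query result cache can be sketched in the same spirit. The thesis implements its cache with OSCache in Java. The sketch below does not use or imitate the OSCache API. It is a plain in-memory least-recently-used cache, and the `QueryResultCache` class, its capacity, and the `search_fn` callback are assumptions made for illustration.

```python
from collections import OrderedDict


class QueryResultCache:
    """In-memory LRU cache placed in front of a (hypothetical) search function."""

    def __init__(self, search_fn, capacity=1000):
        self._search_fn = search_fn      # assumed callback: query string -> results
        self._capacity = capacity
        self._entries = OrderedDict()    # normalized query -> cached results
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(query):
        # Treat queries differing only in case or spacing as the same key.
        return " ".join(query.lower().split())

    def search(self, query):
        key = self._normalize(query)
        if key in self._entries:
            self._entries.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        results = self._search_fn(key)       # the index sees the normalized query
        self._entries[key] = results
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry
        return results
```

Repeated queries are then answered from memory without touching the index, which is the effect the abstract attributes to result caching.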
