Research and Implementation of a Web Spider Based on Web Data Mining
【Author】 Zhan Jingjing
【Advisor】 Ni Ziwei
【Author information】 Xiamen University, Computer Software and Theory, 2007, Master's degree
      
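【Summary】 Search engines are a shortcut to obtaining information resources from the WWW quickly and effectively, and web spider technology is the key to a search engine. Focusing on Web information mining, a frontier research area, and following the overall requirements of a search engine framework, this thesis implements a web spider that roams the Internet and stores page data in a local database, laying a good foundation for a later web search engine. The thesis first reviews the classification and components of search engines to gain a preliminary understanding of how they operate internally, and then analyzes in detail the functions a web spider must implement and its search strategies. The research mainly covers the following. First, it analyzes the working principle of search engines and implements the first step of their work: fetching pages from the Internet. Second, it describes and analyzes the technologies used, in particular the HTTP protocol, regular expressions, multithreading, and ADO.NET. Building on existing web spider technology, it analyzes and designs the spider system and, using a breadth-first search strategy combined with multithreading, implements algorithms for crawling intranet and external pages and analyzing page content. The innovations are as follows. First, regular expressions are applied to Web page content extraction so that the URLs in a page are extracted quickly and effectively; the page data is then compressed with Zlib and stored in the local database. Second, in the design of the page-reading module, a special error-URL handling strategy is adopted to speed up page retrieval: the server's response time determines whether the function returns the HTTP page, and URLs that time out are placed in an error queue to await the error-handling process. Depending on network conditions, the spider therefore processes URLs whose servers respond quickly first, which raises its overall speed. Experiments on the campus network, together with reading back the page data stored in the database, verify the feasibility of the spider and show that the system has achieved its expected goals. Finally, the thesis summarizes the work and gives a brief outlook on the next steps.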
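The crawl pipeline described in the summary above (regular-expression URL extraction, a breadth-first frontier, and Zlib-compressed storage in a local database) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the thesis names .NET technologies such as ADO.NET, whereas this sketch uses Python with `sqlite3` standing in for the database, runs single-threaded, and crawls a made-up in-memory site (`SITE`). The names `HREF_RE`, `extract_urls`, `crawl_bfs`, and `load_page`, as well as the table layout, are hypothetical.

```python
import re
import sqlite3
import zlib
from collections import deque
from urllib.parse import urldefrag, urljoin

# Hypothetical pattern: captures the quoted href value of an <a> tag.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_urls(base_url, html):
    """Return absolute http(s) URLs found in html, in order, without duplicates."""
    seen, urls = set(), []
    for raw in HREF_RE.findall(html):
        # Resolve relative links against the page URL and drop any #fragment.
        url, _fragment = urldefrag(urljoin(base_url, raw.strip()))
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def crawl_bfs(seed, fetch, db, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML text or None."""
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body BLOB)")
    queue, visited, order = deque([seed]), {seed}, []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # FIFO queue gives breadth-first order
        html = fetch(url)
        if html is None:
            continue
        order.append(url)
        # Store the page body compressed with zlib.
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                   (url, zlib.compress(html.encode("utf-8"))))
        for link in extract_urls(url, html):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    db.commit()
    return order


def load_page(db, url):
    """Read a stored page back and decompress it, or return None if absent."""
    row = db.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
    return None if row is None else zlib.decompress(row[0]).decode("utf-8")


# Made-up site used instead of real HTTP requests (an assumption for the demo).
SITE = {
    "http://example.edu/": '<a href="/a.html">A</a> <a href=\'b.html\'>B</a>',
    "http://example.edu/a.html": '<a href="http://example.edu/c.html#top">C</a>',
    "http://example.edu/b.html": '<a href="/">home</a> <a href="mailto:x@example.edu">m</a>',
    "http://example.edu/c.html": "no links",
}

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    print(crawl_bfs("http://example.edu/", SITE.get, db))
```

Passing `fetch` as a parameter keeps the traversal logic separate from the network layer, so a real HTTP fetcher or a pool of worker threads could be substituted without changing the queue handling.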
【Abstract】 Web spider technology is the key component of a search engine, which is a convenient and effective means of obtaining information from the WWW. Centered on Web data mining and guided by the overall requirements of a search engine framework, this thesis implements a spider that roams the Internet and stores page data in a local database, laying a firm foundation for a future intelligent search engine. The main contents are as follows. First, the thesis analyzes the working principle of search engines and implements the first step of their work: fetching page data from the Internet. Second, it describes the technologies used, such as the HTTP protocol, regular expressions, multithreading, and ADO.NET. Building on existing web spider techniques, it analyzes and designs a new spider system; using a breadth-first search (BFS) strategy combined with multithreading, it implements algorithms for crawling pages from internal and external networks and analyzing their content. The innovations are as follows. First, regular expressions are applied to Web content extraction so that URLs are extracted from pages quickly and efficiently, which supports the crawling and content-analysis algorithms; the page data is then compressed with Zlib and stored in the local database. Second, to increase speed, a special strategy handles problematic URLs: the server's response time determines whether the HTTP page is retrieved, and URLs that time out are placed in an error queue to await the error-handling thread. Third, experiments on the campus network, together with the page data read back from the database, validate the feasibility of the spider and show that the system achieves its intended goals. Finally, the thesis summarizes the whole system and outlines future work.
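The error-URL strategy mentioned in the abstract can be illustrated with a short sketch. It assumes a fetch function that raises `TimeoutError` when the server does not answer in time; slow URLs are moved to an error queue and retried later with a longer timeout, so fast servers are handled first. The timeout values, the retry-once policy, and the names `http_fetch` and `crawl_with_error_queue` are assumptions made for illustration. The abstract describes a separate error-handling thread, which is simplified here to a sequential second pass, and Python replaces the .NET stack the thesis names.

```python
import socket
import urllib.error
import urllib.request
from collections import deque


def http_fetch(url, timeout):
    """Fetch url; raise TimeoutError if the server is slower than timeout seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except socket.timeout as exc:  # timeout while reading the response
        raise TimeoutError(url) from exc
    except urllib.error.URLError as exc:  # timeout while connecting
        if isinstance(exc.reason, socket.timeout):
            raise TimeoutError(url) from exc
        raise


def crawl_with_error_queue(urls, fetch, fast_timeout=2.0, slow_timeout=10.0):
    """Fetch urls with a short timeout; defer the ones that time out.

    Returns (pages, failed): pages maps url -> content, failed lists URLs
    that also timed out when retried with the longer timeout.
    """
    pages, error_queue, failed = {}, deque(), []
    # First pass: only servers that answer within fast_timeout are handled.
    for url in urls:
        try:
            pages[url] = fetch(url, fast_timeout)
        except TimeoutError:
            error_queue.append(url)
    # Error handling: retry each deferred URL once with a longer timeout.
    while error_queue:
        url = error_queue.popleft()
        try:
            pages[url] = fetch(url, slow_timeout)
        except TimeoutError:
            failed.append(url)
    return pages, failed


# Simulated server latencies in seconds (made-up values, no network needed).
LATENCY = {
    "http://fast.example/": 0.1,
    "http://slow.example/": 5.0,
    "http://dead.example/": 60.0,
}


def fake_fetch(url, timeout):
    """Stand-in for http_fetch that times out based on LATENCY."""
    if LATENCY[url] > timeout:
        raise TimeoutError(url)
    return b"page from " + url.encode("ascii")


if __name__ == "__main__":
    pages, failed = crawl_with_error_queue(list(LATENCY), fake_fetch)
    print(sorted(pages), failed)
```

Replacing `fake_fetch` with `http_fetch` would run the same logic against real servers; in a multithreaded spider the second loop would typically run in its own worker that consumes the error queue.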
- 【Online publication contributor】 Xiamen University 【Online publication issue】 2008, Issue 07
- 【Classification code】 TP391.3
- 【Citations】 2
- 【Downloads】 375