

Research of Intranet Information Supervision System Based on Net Crawler and Full-text Search Engine

【Author】 傅翔华 (Fu Xianghua)

【Advisor】 郭文明 (Guo Wenming)

【Author Information】 Southern Medical University (南方医科大学), Biomedical Engineering, 2009, Master's

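【Abstract, translated from the Chinese】 With the development of computer network technology and the steady deepening of informatization, the level of network application within organizations and departments keeps rising, and the focus of network development and construction has shifted from the Internet application services of the early build-out period to expanding Intranet applications inside the organization. Organizations generally build on their own business operations and have established a number of networked application systems on the Intranet. In these systems the Web has become the mainstream platform for application development. As dynamic scripting and database technologies became the mainstream techniques for Web application development, the capacity to publish information on the Web grew greatly, and interactive websites carrying all kinds of information sprang up. With this shift of focus and the adoption of new technologies, the network application level and information publishing capacity of organizations rose to a new level, followed by explosive growth of network information in the Intranet environment. How to supervise and manage this information effectively has become a new problem for the network management departments of organizations. Unlike public information on the Internet, the information in Intranet applications is closely tied to the work, business, and daily life inside the organization. As the network, a new medium, plays an increasingly important role in daily life, the importance and influence of this information also grow, so supervising it effectively is a problem that network administrators urgently need to solve, while the massive volume of network information and the diversity of its forms add to the technical difficulty.
To address this situation, this thesis proposes a method of building an Intranet information supervision system based on information collection and full-text retrieval. Computer technology is used to collect the Web information on the Intranet effectively and to screen it preliminarily, which gives network administrators a feasible way to supervise network information on the Intranet. Using the crawler technology of current search engines (Web Crawler, Web Spider), the system's data collection module can collect and organize the Web information on the Intranet in a relatively short time, after which the database's full-text retrieval technology performs a preliminary search and screening of the large volume of collected data. During development, the crawler technology was adapted to the characteristics of Intranet information: technical ideas such as "site-by-site search" and configurable "search rules" were adopted to improve the accuracy and efficiency of information collection. The system provides a user interface based on the B/S (browser/server) architecture and serves users in the manner of a search engine. On the one hand, this gives Intranet users a practical and convenient network search service. On the other hand, widening the system's use improves its ability to identify sensitive information: the historical keywords generated by users are recorded and analyzed and, combined with the relevant parameter settings of the full-text search engine in the SQL Server database, are used to further improve the scope and degree of the system's coverage of sensitive information.
The thesis first briefly analyzes the situation and difficulties currently facing Intranet information management and summarizes the characteristics of network information in the Intranet environment. On this basis, it proposes a technical approach that uses computer software to supervise network information effectively, offers solutions to some of the technical difficulties in building the system, and briefly describes the system's software structure and implementation. Finally, it summarizes the goals the current system has achieved, the problems that remain, and the aspects that need improvement.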

【Abstract】 With the development of network technology and informatization, the level of network-based applications within organizations has risen steadily. The focus of network development and construction has shifted from the Internet to the Intranet. Most organizations and departments have built a variety of networked application systems on their Intranets, and technologies such as dynamic server-side scripting and databases are now widely used in Web application development. As a result of this shift in focus and the adoption of new technologies, the information on Intranets is growing rapidly. How to supervise network information efficiently, especially on the Intranet, has become a challenge that network administrators have to face. Information on the Intranet differs from information on the Internet: it plays an important role and has a more significant influence on organizations and departments. Network management should therefore pay attention to supervising it efficiently. However, the characteristics of Intranet information make it technically difficult to manage.
To address these issues, a software method is introduced that builds an Intranet information supervision system on data collection and a full-text search engine. With this system, network administrators can collect information and filter data quickly, which helps them supervise Web information on the Intranet. Development adopts some popular software technologies, such as the web crawler, which is widely used in Web search engines, and an RDBMS-based full-text search engine. In addition, considering the characteristics of Intranet information, the researchers take some additional technical measures to make the system work more efficiently, such as a "site by site search mode" and "restriction search rules". After the data collection module completes the search task on the Intranet, the researchers use the RDBMS-based full-text search engine to process the data, for example by merging and filtering it, and to extract valuable information. The system also implements a web module that combines the Web with the full-text search engine and the RDBMS and provides an easy-to-use, browser-based user interface, which gives users convenient access to the system. Users can search for any keyword they are interested in. Meanwhile, analyzing the keyword log, which records every keyword users have entered, helps to find what users are most interested in. As a result, network administrators can further improve the supervising ability of the system.
The paper first analyzes the difficulties that network administrators confront. It then summarizes the characteristics of information on the Intranet. After that, it presents a software solution for supervising information on the Intranet and describes the software architecture and implementation of the system. Finally, it summarizes the goals the system has achieved, its shortcomings, and possible improvements.
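
The abstract names a "site by site search mode" and "restriction search rules" without giving details. The following is only a minimal sketch of how such a crawl could look, not the thesis's implementation. It assumes that "site by site" means finishing one host before starting the next, and that the rules are a depth limit and a list of excluded URL patterns. The names `crawl_site`, `crawl_intranet`, `max_depth`, `exclude`, and the `FAKE_WEB` pages are all hypothetical, and the in-memory dictionary stands in for real HTTP requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import re


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(seed, fetch, max_depth=2, exclude=()):
    """Breadth-first crawl of one site; links to other hosts are ignored."""
    host = urlparse(seed).netloc
    skip = [re.compile(pattern) for pattern in exclude]  # assumed rule format
    seen = {seed}
    queue = deque([(seed, 0)])
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:  # page missing or unreachable
            continue
        pages[url] = html
        if depth >= max_depth:  # depth rule: do not expand further
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if urlparse(link).netloc != host:
                continue
            if any(pattern.search(link) for pattern in skip):
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


def crawl_intranet(seeds, fetch, **rules):
    """Site by site: each seed's site is crawled fully before the next one."""
    return {seed: crawl_site(seed, fetch, **rules) for seed in seeds}


# Hypothetical in-memory "Intranet" used instead of real HTTP requests.
FAKE_WEB = {
    "http://news.example/": '<a href="/a.html">A</a> <a href="http://hr.example/">HR</a>',
    "http://news.example/a.html": '<a href="/admin/x.html">X</a> <a href="/">home</a>',
    "http://news.example/admin/x.html": "restricted page",
    "http://hr.example/": "<p>HR home</p>",
}

result = crawl_intranet(
    ["http://news.example/", "http://hr.example/"],
    FAKE_WEB.get,
    max_depth=2,
    exclude=[r"/admin/"],
)
```

Passing the `fetch` function as a parameter keeps the sketch testable offline; a real collector would replace `FAKE_WEB.get` with an HTTP client and hand the returned pages to the database's full-text index.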


