Design and Implementation of Web Information Collection System

【Author】 Zhou Linyun (周林云)

【Advisor】 Hu Xiaopeng (胡晓鹏)

【Author information】 Southwest Jiaotong University, Computer Software and Theory, 2013, Master's thesis

【Abstract】 With the rapid development and spread of mobile terminals, people increasingly obtain the information they care about through reading applications installed on those devices. Platform vendors (and content providers) must therefore build technical platforms to support this business model. The content of such a platform can be obtained in two ways: by manual editing, or by automatically collecting content from information sources with a program. This thesis designs a Web information collection solution for the latter. It first introduces the research background and current state of research, the relevant information extraction techniques, and the working principles of information collection, and analyzes web page structure. It then analyzes the system's functions and target users, models the system with use case diagrams and use case specifications, and analyzes its non-functional requirements. Next, it presents the overall system design and the database design, followed by the detailed design and implementation. Finally, the system is tested to verify the effectiveness of the solution. The main work is as follows: 1. The thesis studies how to quickly locate target information in an HTML document. Extraction rules are designed from HTML tags, attributes, and DOM path expressions, and are generated automatically through a visual interface with simple human-computer interaction. On this basis, a practical solution for removing noise from the main text is designed. 2. The system consists of two parts: a collection configuration subsystem and a collection subsystem. The configuration subsystem passes configured collection tasks to the collection subsystem through a socket mechanism, thereby controlling the starting and stopping of tasks, so that users obtain collection results without having to attend to the collection process. 3. Based on the tasks the user has configured, the collection subsystem uses multithreading, database connection pooling, a dynamic collection strategy, and multi-page merging to periodically collect, extract, de-noise, and deduplicate information from the configured websites, keeping specific information from those sites updated on a schedule.
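
The abstract does not give the thesis's actual rule format, so the following is only a minimal sketch of the general idea in point 1: an extraction rule expressed as a path over HTML tags and attributes, plus a simple de-noising step. The rule dictionary `RULES`, the `NOISE_TAGS` set, the helper names, and the sample page are all hypothetical. The sketch also assumes well-formed markup so that Python's standard `xml.etree.ElementTree` can parse it; real-world HTML would normally need a more tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule format: field name -> ElementTree path expression
# built from tag names and attribute predicates.
RULES = {
    "title": ".//div[@class='article']/h1",
    "body": ".//div[@class='article']/div[@id='content']/p",
}

# Tags treated as noise inside the main text (an assumption for this sketch).
NOISE_TAGS = {"script", "style"}


def text_without_noise(element):
    """Return the element's text, skipping subtrees rooted at noise tags."""
    if element.tag in NOISE_TAGS:
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(text_without_noise(child))
        # Text that follows the child still belongs to the parent element.
        parts.append(child.tail or "")
    return "".join(parts)


def extract(page_source, rules):
    """Apply each rule to the page and collect the non-empty matches."""
    root = ET.fromstring(page_source)
    result = {}
    for field, path in rules.items():
        texts = [text_without_noise(el).strip() for el in root.findall(path)]
        result[field] = [t for t in texts if t]
    return result


SAMPLE_PAGE = """<html><body>
<div class="nav"><p>Home</p></div>
<div class="article">
  <h1>Sample headline</h1>
  <div id="content">
    <p>First paragraph.<script>track()</script></p>
    <p>Second <b>paragraph</b>.</p>
  </div>
</div>
</body></html>"""

record = extract(SAMPLE_PAGE, RULES)
print(record)
```

In this sketch the navigation paragraph is ignored because it lies outside the path named by the rule, and the inline script is dropped by the noise filter. A rule generated through a visual interface, as the abstract describes, could plausibly be stored in a similar path form, but that is an assumption rather than something the abstract states.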
