Research of Web Information Extraction in Informatization Education

【Author】 Qiu Yana (邱亚娜)

【Advisor】 Zhang Guiyun (张桂芸)

【Author information】 Tianjin Normal University, Educational Technology, 2008, Master's thesis

【Abstract】 The rapid development of computer technology and the Internet has turned the Web into a global, vast, distributed and shared information space. As an enormous resource base, the Web has brought great convenience to people's study, life and work. Yet faced with the mass of information on the Web, people find themselves in the awkward position of being "data rich but knowledge poor". Because most Web data currently takes the form of HTML, applications cannot obtain information from the Web directly. Web information extraction technology emerged against this background. This thesis analyzes the technical characteristics of several typical information extraction (IE) systems and explores how, in informatization education, personalized service information can be extracted starting from learners' needs. It implements a personalized information extraction system based on the document structure tree. The system consists of two parts: the definition of extraction rules and the execution of extraction rules. In the rule definition phase, the retrieved HTML page is first normalized and converted into a well-formed, semantically clear XML file, and the DOM tree of that document is generated; the user then specifies the location of the information to be extracted and the schema of the corresponding target table; finally, the extraction rules are generated from this information. In the rule execution phase, the system extracts Web data according to the user-defined extraction rules and loads it into the target table at the specified location.
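The two phases described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the class `TreeBuilder`, the function `extract`, the layout of the `RULE` dictionary, the sample page and the use of an in-memory SQLite table are all assumptions made for illustration, and the HTML clean-up handles only void tags and elements left open at a closing tag.

```python
import sqlite3
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Build a well-formed element tree from loosely written HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:  # void tags never get a closing tag
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name and anything left open inside it.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        node = self.stack[-1]
        if len(node):  # text after a child element belongs to that child's tail
            node[-1].tail = (node[-1].tail or "") + text
        else:
            node.text = (node.text or "") + text


def extract(html, rule, conn):
    """Rule execution: locate records in the tree and load them into the target table."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    columns = list(rule["columns"])
    # The table and column names come from a trusted, user-defined rule (assumption).
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS {rule["table"]} '
        f'({", ".join(c + " TEXT" for c in columns)})'
    )
    rows = []
    for record in builder.root.findall(rule["record_path"]):
        row = [(record.findtext(p) or "").strip() for p in rule["columns"].values()]
        if any(row):  # skip records where nothing matched, e.g. header rows
            rows.append(row)
    placeholders = ", ".join("?" * len(columns))
    conn.executemany(f'INSERT INTO {rule["table"]} VALUES ({placeholders})', rows)
    return rows


# Rule definition: where the data sits in the tree and which table schema it maps to.
RULE = {
    "table": "courses",
    "record_path": ".//table[@id='courses']/tr",
    "columns": {"title": "td[1]", "teacher": "td[2]"},
}

# Sample page with an unclosed <p>, a bare <br> and an unquoted attribute.
PAGE = """
<html><body>
<p>Recommended courses<br>
<table id=courses>
<tr><td>Data Mining</td><td>Zhang</td></tr>
<tr><td>XML Basics</td><td>Li</td></tr>
</table>
</body></html>
"""
```

Calling `extract(PAGE, RULE, sqlite3.connect(":memory:"))` loads the two course rows into the `courses` table. The paths in `RULE` use the limited XPath subset supported by `xml.etree.ElementTree`; a fuller system would need a more complete HTML normalizer and path language.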

【Key words】 Web Information Extraction; HTML; XML; DOM tree; Informatization Education
  • 【Classification number】 G434
  • 【Times cited】 1
  • 【Downloads】 164