

Research on Web Information Extraction Technology Based on Frame Semantic Tagging

【Author】 白鹏洲

【Advisor】 牛之贤

【Author Information】 Taiyuan University of Technology, Computer Software and Theory, 2008, Master's

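【Summary】 With the rapid development of the Internet, the web has become a global information source and a good platform for sharing information and resources. However, it is difficult to find the needed information quickly and accurately with traditional search engines. Information extraction technology arose against this background: it is a technique for automatically extracting useful information from web pages (text), and it is currently an important research topic in intelligent information processing. Information that an extraction system obtains from the web can be provided directly to users and can also serve as the basis for building intelligent query systems and data mining systems, so the technology has broad application prospects. This thesis first introduces the background and history of information extraction systems, surveys the current state of research on information extraction technology, and analyzes several important current information extraction tools together with some of their defects: a lack of semantics, or semantic models that are too simple. To address this shortcoming, it uses the strengths of frame semantics in representing semantic information to solve the problem of missing or overly simple semantic information in extraction results, and proposes a method of information extraction based on frame semantic tagging. The thesis illustrates the idea of this method, which combines frame semantic network technology, domain ontology knowledge, and information extraction technology, by building a web book information extraction system based on frame semantic tagging. When information is extracted from free text, frame semantic tagging is performed first, and extraction rules are then generated from the tagging results combined with domain ontology knowledge. The distinguishing feature of the method is that frame semantic tagging serves as the basis for building extraction rules and that a unified approach guides the extraction process: information patterns are built around semantic roles, which raises pattern construction to the level of semantic roles, so that the extracted information carries explicit semantic information. The system is of practical significance for research on semantics-based information extraction. In addition, its architecture and the design ideas of its main modules are a valuable reference for designing and implementing information extraction systems for other kinds of documents.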

【Abstract】 With the rapid development of the Internet, the web has become a global information source that provides an ideal place for sharing and communicating information. However, it is hard for users to find the information they need quickly and accurately with traditional search engines. Information extraction was proposed in response. It automatically extracts useful information from the web (text), and it has become an important research topic in intelligent information processing. Information extracted from web sites can be provided to users directly and can also serve as a foundation for intelligent query systems and data mining systems, so information extraction has very broad application prospects. This paper presents the background and history of information extraction, reviews the state of information extraction on the Internet, and analyzes several important information extraction tools along with some disadvantages of current techniques. Because frame semantics has advantages in representing semantic information, a new method, information extraction based on frame semantic tagging, is put forward to resolve the problem of missing or overly simple semantic information in extraction results. The paper explains the idea behind this method by constructing a web book information extraction system based on frame semantic tagging, which integrates frame semantic network technology, domain ontology, and information extraction technology. When information is extracted from text, the text is first tagged, and extraction rules are then derived from the tagging results and the knowledge in the domain ontology. The method is characterized by using frame semantic tagging as the basis for building extraction rules and by guiding the extraction process with a unified approach: information models are built around semantic roles, raising the model to the level of semantic roles, so that the extracted information carries clear semantic information. The system is of great importance for semantics-based information extraction. Furthermore, the architecture of the system and the design of its main components are also valuable for other IE systems.
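
The tag-then-derive-rules pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the ontology entries, frame and role names, sample spans, and the helpers `build_rules` and `extract` are all hypothetical, and the frame semantic tagging step is assumed to have been done already.

```python
from dataclasses import dataclass

# Hypothetical toy domain ontology for books: maps a (frame, semantic role)
# pair to a field of the book record. Invented for illustration only.
BOOK_ONTOLOGY = {
    ("Text_creation", "Author"): "author",
    ("Text_creation", "Text"): "title",
    ("Publishing", "Publisher"): "publisher",
}


@dataclass(frozen=True)
class TaggedSpan:
    frame: str  # frame evoked by the sentence's target word
    role: str   # semantic role (frame element) assigned to the span
    text: str   # the annotated text


def build_rules(tagged_examples, ontology):
    """Keep one rule per (frame, role) pair that occurs in the tagged
    examples and is also known to the domain ontology."""
    rules = {}
    for span in tagged_examples:
        key = (span.frame, span.role)
        if key in ontology:
            rules[key] = ontology[key]
    return rules


def extract(tagged_spans, rules):
    """Fill a record by matching semantic roles, not surface patterns."""
    record = {}
    for span in tagged_spans:
        field = rules.get((span.frame, span.role))
        if field is not None:
            record[field] = span.text
    return record


# Assumed output of the tagging step for one sentence (made-up data).
tagged = [
    TaggedSpan("Text_creation", "Author", "Jane Doe"),
    TaggedSpan("Text_creation", "Text", "Example Book"),
    TaggedSpan("Publishing", "Publisher", "Example Press"),
    TaggedSpan("Publishing", "Place", "Taiyuan"),  # role not in the ontology
]

rules = build_rules(tagged, BOOK_ONTOLOGY)
record = extract(tagged, rules)
print(record)
```

Because the rules are keyed by semantic role, every extracted value arrives already labeled with its meaning (author, title, publisher), which is the property the abstract emphasizes; spans whose roles the ontology does not cover are simply skipped.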

  • 【Classification Number】TP18
  • 【Citations】2
  • 【Downloads】359