

Building Semantic Knowledge-Bank Based on the Binary Combinatorial Grammar

【Author】 Xu Zhongming (徐忠明)

【Advisor】 Wan Jiancheng (万建成)

【Author information】 Shandong University, Computer Software and Theory, 2008, Master's thesis

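【Abstract (translated from the Chinese original)】 Syntactic parsing has long been a central topic in natural language processing. Since the 1980s, the focus of parsing has gradually shifted toward semantic processing, and research on word-level language units is in turn the focus of semantic processing. Whether for machine translation, information extraction, or word sense disambiguation, semantic knowledge is an indispensable foundational resource. This thesis first introduces the Binary Combinatorial Grammar on which both the thesis and the overall system are based, and then presents the overall architecture of the parsing system. During parsing, syntactic and semantic analysis interact, and the semantic knowledge bank is the knowledge source for semantic analysis and semantic disambiguation. Subsequent chapters introduce the main theories of semantics used in the design and the currently representative semantic knowledge dictionaries. Semantic theory is the theoretical basis for designing the semantic knowledge bank. The description systems of the existing dictionaries cover many aspects, including hierarchical classification relations as well as synonymy and same-class relations. On the whole, however, none of them directly meets the needs of Chinese information processing applications, although they can serve as learning resources for this knowledge bank. Starting from the practical needs of parsing, we designed the description system and organizational structure of the semantic knowledge bank. The bank consists of a word library, a semantic collocation attribute library, a hierarchy library, a class-membership library, and a maintenance subsystem. The word library is at the center of the bank; the semantic collocation attribute library stores binary semantic collocation attribute relations between words; the class-membership library describes the relative position of words within a classification system; and the component-relation library describes whole-part relations between words. The maintenance subsystem maintains the bank and provides interfaces for retrieving, adding, and deleting semantic knowledge. The thesis then discusses methods for adding semantic knowledge to the bank. It first introduces the dependency treebank of Harbin Institute of Technology (HIT) and proves that a dependency tree can be converted into a binary combination tree. Drawing on statistics-based collocation recognition algorithms, it combines collocation attribute categories with statistics to extract collocation attribute knowledge directly from the dependency treebank; compared with using statistics alone, this improves accuracy and recall and rapidly enlarges the collocation attribute library. For the hierarchy and class-membership libraries, HowNet and WordNet serve as knowledge sources, and knowledge is found and judged mainly by hand so that the hierarchy stays consistent; pattern-based recognition of hierarchical knowledge is then used to extract such knowledge automatically from text. The result is a semantic knowledge bank that can preliminarily meet the needs of semantics-based parsing. Building a semantic knowledge bank involves a large amount of work and considerable difficulty, and for now it can only proceed under limited goals. Nevertheless, we have found a feasible technical path and provided a basic resource for implementing the parsing system. The bank can also serve as a basic resource for other Chinese information processing applications, and its application prospects are broad.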
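The composition of the knowledge bank described in the abstract (a central word library, a library of binary collocation relations, classification and whole-part relations, and a maintenance interface for retrieving, adding, and deleting entries) can be pictured with a small in-memory sketch. This is only an illustration under assumptions: the class name, method names, and the "verb-object" attribute label are hypothetical and are not taken from the thesis, which does not specify its storage format here.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SemanticKnowledgeBank:
    """Toy in-memory bank; every name in this class is hypothetical."""
    # Word library: the central set that all other libraries refer to.
    words: set = field(default_factory=set)
    # Collocation library: (word, word) -> collocation attribute.
    collocations: dict = field(default_factory=dict)
    # Classification relations: word -> set of categories it belongs to.
    categories: dict = field(default_factory=lambda: defaultdict(set))
    # Component relations: whole -> set of its parts.
    parts: dict = field(default_factory=lambda: defaultdict(set))

    def add_collocation(self, left, right, attribute):
        self.words.update((left, right))
        self.collocations[(left, right)] = attribute

    def add_is_a(self, word, category):
        self.words.update((word, category))
        self.categories[word].add(category)

    def add_part_of(self, part, whole):
        self.words.update((part, whole))
        self.parts[whole].add(part)

    def lookup_collocation(self, left, right):
        # Returns the stored attribute, or None if the pair is unknown.
        return self.collocations.get((left, right))

    def remove_collocation(self, left, right):
        # Returns True if an entry was actually deleted.
        return self.collocations.pop((left, right), None) is not None


bank = SemanticKnowledgeBank()
bank.add_collocation("eat", "apple", "verb-object")  # illustrative label
bank.add_is_a("apple", "fruit")
bank.add_part_of("wheel", "car")
```

Keeping the word library as the shared key space means each relation library stores only references to words, which matches the abstract's description of the word library as the center of the bank.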

【Abstract】 Syntactic analysis has long been one of the most important areas of natural language processing, and research in this area has made great progress. Since the 1980s, the focus of syntactic analysis has gradually shifted to semantic processing, and word-level units are in turn the focus of semantic processing. Whether for machine translation, information extraction, or lexical disambiguation, a semantic representation system is an essential foundational resource. This thesis first describes the Binary Combinatorial Grammar on which the whole system and the semantic system are based, and then introduces the overall architecture of the syntactic analysis system. During parsing, syntactic and semantic analysis interact, and the semantic knowledge bank is the knowledge source for semantic analysis and disambiguation. The following chapter introduces the main theories of semantic design and representative semantic knowledge banks. Their description systems cover many aspects, including classification relations as well as synonymy and similarity relations. In general, however, they do not directly meet the needs of Chinese information processing applications, although they can serve as learning resources for our bank. Starting from the actual needs of syntactic analysis, we designed the structure of the semantic knowledge bank. The bank is composed of a word library, a semantic collocation library, a class library, and a maintenance subsystem. The word library is the center of the whole bank. The semantic collocation library stores binary semantic collocation relations between two words. The classification library describes the relative relationship of words within a given classification system, and the component library describes whole-part relations. The last chapter discusses methods for collecting semantic knowledge. We first introduce the HIT Treebank and prove that a dependency tree can be converted to a binary tree. Then, building on statistical collocation-matching algorithms, we combine collocation types with statistical methods, which significantly improves accuracy and recall. We relied mainly on manual judgment to collect classification and component knowledge from HowNet and WordNet, so as to ensure the accuracy of the knowledge. We then adopted a pattern-recognition method to find knowledge in corpora. In this way we built a preliminary semantic knowledge bank that meets the needs of syntactic analysis. The project is complex and difficult, so we could only carry out our research in a limited domain. However, we have found a viable technical path for providing the basic resources needed to realize the parsing system. The semantic knowledge bank can also be used in other Chinese information processing applications as a basic source of knowledge. The application prospects are bright.
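The abstract states that a dependency tree can be converted to a binary tree and that collocation types are combined with statistics. The thesis's own construction is not given here, so the following is a minimal sketch under stated assumptions: each head is combined with its dependents one at a time, nearest first, so every internal node has exactly two children and corresponds to one head-dependent pair. The function names, the input encoding (a head-index list with -1 for the root), the English example sentence, and the relation labels are all hypothetical choices for illustration.

```python
from collections import Counter


def binarize(tokens, heads, labels):
    """Turn a dependency tree into a nested 2-tuple tree.

    tokens[i] is a word, heads[i] the index of its head (-1 for the root),
    labels[i] the relation between token i and its head.
    Returns (tree, pairs), where pairs lists (head, dependent, label).
    """
    children = {i: [] for i in range(len(tokens))}
    root = None
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)

    pairs = []

    def build(i):
        tree = tokens[i]
        # Attach dependents one by one, nearest to the head first.
        for d in sorted(children[i], key=lambda d: abs(d - i)):
            sub = build(d)
            pairs.append((tokens[i], tokens[d], labels[d]))
            # Keep surface order: left dependents go on the left.
            tree = (sub, tree) if d < i else (tree, sub)
        return tree

    return build(root), pairs


def count_collocations(pair_lists, allowed_labels, min_count=1):
    """Keep pairs whose label is an allowed category and that occur often enough."""
    counts = Counter(
        p for pairs in pair_lists for p in pairs if p[2] in allowed_labels
    )
    return {p: c for p, c in counts.items() if c >= min_count}


# Illustrative input; the labels are placeholders for treebank relation types.
tokens = ["he", "eats", "red", "apples"]
heads = [1, -1, 3, 1]
labels = ["SBV", "HED", "ATT", "VOB"]
tree, pairs = binarize(tokens, heads, labels)
frequent = count_collocations([pairs, pairs], {"VOB"}, min_count=2)
```

In this sketch the category filter (`allowed_labels`) is applied before the frequency threshold, which is one simple way to read "collocation types plus statistics"; the thesis may weight or order these steps differently.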

  • 【Online publication contributor】 Shandong University
  • 【Online publication year and issue】 2009, Issue 01

