面向中文信息处理的复句关系词自动标识研究

Research of Auto-identifying the Relation Markers of Compound Sentence for Chinese Information Processing

【Author】 Shu Jiangbo (舒江波)

【Advisor】 Hu Jinzhu (胡金柱)

【Author information】 Central China Normal University, Chinese Information Processing, 2011, PhD

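【Abstract, translated from the Chinese】 The compound sentence is an important unit of Chinese grammar; it has received considerable attention from grammarians and has produced a substantial body of theory. From the perspective of Chinese information processing, however, results on the computational processing of Chinese compound sentences are still few, and compound-sentence information engineering has not yet achieved a substantive breakthrough. There are three reasons. First, the research is not yet comprehensive or deep, and existing results do not cover every stage and difficulty of compound-sentence information processing. Second, most results are oriented toward human readers, and many of the methods are hard to operationalize in information processing. Third, the individual studies are relatively isolated and have not been linked into an organic whole. At present, computational research on compound sentences concentrates on distinguishing clauses from non-clauses and on identifying the hierarchical relations of compound sentences, and both presuppose the extraction of relation words (关系词). Automatic extraction of relation words is therefore the foundation on which the other lines of work rest, and relation words, as a component of the compound sentence, also deserve in-depth study in their own right. Against this background, this thesis takes Chinese information processing as its starting point and Xing Fuyi's theory of compound sentences as its guide, and studies methods for automatically identifying relation words in compound sentences. With automata theory and formal logic as auxiliary tools, it models the problems involved in identification, formally describes and stores the rules it derives, and studies rule-based methods, with the aim of identifying the relation words of compound sentences automatically. The research proceeds along four lines. 1. A comprehensive summary of the factors that affect automatic identification. Five classes of factors affect identification accuracy: relational adverbs; prepositions; different usages of relation markers (the connective and non-connective uses of homographs that differ in meaning, in structure, or in word class); collocations of relation markers; and the overt or covert form of relation markers. For each class, the thesis analyzes its characteristics and discusses corresponding methods and strategies. 2. An in-depth study of consecutive markers. The thesis examines the syntactic and semantic functions and the categories of each marker when two or three markers are used in succession. For two consecutive markers it identifies two types, a contradiction type and a restriction type. This distinction both spares the computer unnecessary computation and provides an entry point for compound-sentence analysis. For three consecutive markers, different relation words require different identification methods; there is no unified, fine-grained strategy, and each case has to be analyzed individually. 3. A study of the relation between sentence patterns and relation-word identification. Three kinds of pattern are examined. The first, called special patterns, have a fixed and unambiguous form, but their semantic relation is hard to determine, which makes the scope of the relation words hard to determine. The second are extended patterns, for which ordinary collocation-based algorithms handle identification poorly. The third are the ordinary patterns of multi-level compound sentences, in which the marker sequence of a sentence instance contains several marker pairs. For special patterns, the thesis associates form with meaning and maps each marker sequence one-to-one to a processing result. For extended patterns, it builds a model with automata theory, which keeps the method operable while still generalizing over the linguistic phenomenon. For ordinary patterns, it abstracts the problem into a mathematical model and processes the marker sequence by solving over the solution space. For these patterns it builds a rule base and discusses rule-based automatic identification. 4. A study of identification in the partially saturated and non-saturated modes. The thesis first supplements the theory of semantic relevance between clauses: it proposes 14 semantic relevance features in 3 major categories, draws up a priority diagram for feature analysis, and revises the method for computing the degree of semantic relevance between clauses. In the saturated mode, the markers examined are mainly "不是……就是……" (bushi…jiushi…) and "虽然……但是……所以……" (suiran…danshi…suoyi…). The study finds that "不是……就是……" can be handled by polarity analysis (极值分析法), whereas for "虽然……但是……所以……" there is as yet no good strategy, and a common-sense knowledge base needs to be built. In the non-saturated mode, the thesis mainly examines the identification of relation words in three-clause sentences, and finds that starting from the typical and atypical attributes of relation markers, combining collocation knowledge, and using the semantic relevance features of clauses makes it possible to identify the relation words in each clause fairly accurately.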
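The abstract says only that extended patterns are modeled with automata theory; it does not give the automaton. As a minimal sketch of the general idea, the Python below defines a deterministic finite automaton over a sequence of markers. The pattern it accepts, "既" followed by one or more "又", and the names `TRANSITIONS`, `ACCEPTING` and `accepts` are assumptions chosen for illustration and are not taken from the thesis.

```python
# Hypothetical transition table (an assumption, not from the thesis):
# state -> {marker: next state}. It accepts "既" followed by one or more "又".
TRANSITIONS = {
    "start": {"既": "opened"},
    "opened": {"又": "closed"},
    "closed": {"又": "closed"},  # the trailing marker may repeat
}
ACCEPTING = {"closed"}


def accepts(markers):
    """Return True if the marker sequence is accepted by the automaton."""
    state = "start"
    for marker in markers:
        state = TRANSITIONS.get(state, {}).get(marker)
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    print(accepts(["既", "又", "又"]))
    print(accepts(["又", "既"]))
```

A table-driven automaton like this keeps the pattern description as data, so a repeated marker needs one extra transition rather than one collocation rule per repetition count.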

【Abstract】 The compound sentence is an important unit of Chinese grammar; it has attracted considerable attention from grammarians and has generated a substantial body of theory. From the perspective of Chinese information processing, however, there are few results on processing Chinese compound sentences, and compound-sentence information engineering has not yet made substantial progress. There are three reasons. First, the research has not gone deep enough, and existing studies do not yet cover all aspects and problems of compound-sentence information processing. Second, most research results are oriented toward human readers, and many of the methods are hard to operationalize in information processing. Third, the individual studies are relatively isolated and have not been linked together into an organized whole. Current computational work on compound sentences mainly concerns distinguishing clauses from non-clauses and identifying the hierarchical layers of compound sentences, and both presuppose the extraction of relation markers. On the one hand, automatic extraction of relation markers is therefore the basis on which the other studies can be carried out; on the other hand, relation markers, as a component of the compound sentence, themselves need further study. Against this background, this thesis takes Chinese information processing as its starting point and Xing Fuyi's theory of compound sentences as its guide, and studies methods for automatically identifying and marking up relation markers. With automata theory and formal logic as auxiliary tools, it models the problems involved in identifying relation markers, describes and stores the rules, and designs a prototype of a rule-based system for automatically identifying relation markers. The study has four parts. First, it comprehensively summarizes the factors that influence the automatic markup of relation markers.
These factors fall into five main categories: the influence of adverbs, of prepositions, of the different usages of relation markers, of collocations, and of the occurrence or non-occurrence of markers. For each type of factor, the thesis analyzes its features and proposes corresponding strategies. Second, it studies the co-occurrence of relation markers, focusing on the syntactic and semantic functions and the types of the markers when two or three occur together. Two-marker co-occurrence falls into two types, the contradiction type and the restriction type. Distinguishing the two types both reduces unnecessary computation during processing and provides an entry point for analyzing compound sentences. For three-marker co-occurrence there is no unified strategy; different markers require different identification methods. Third, it studies the relation between sentence patterns and the markup of relation markers, examining three kinds of pattern. The first, called the special pattern, has a fixed and unambiguous form, but the scope of its markers is hard to determine because its semantic relation is hard to identify. The second is the extended pattern, for which ordinary algorithms cannot handle the identification of relation markers. The third is the ordinary pattern, in which compound sentences have multiple semantic layers and multiple pairs of relation markers. For the special pattern, the thesis uses a mapping strategy that maps each sequence of relation markers to its corresponding markup result. For the extended pattern, it builds a model with automata theory, which ensures both operability and generality over the linguistic phenomenon.
For the ordinary pattern, the strategy is to abstract the problem, transform it into a mathematical model, and then process the sequence of relation markers by solving over the solution space. Fourth, it studies the identification of relation markers in the partly saturated and non-saturated modes. The thesis first supplements the theory of semantic relevance: it proposes 14 semantic relevance features in three categories, draws up a priority diagram for feature analysis, and revises the method for computing the degree of semantic relevance. In the saturated mode it mainly studies the relation markers "bushi…jiushi…" and "suiran…danshi…suoyi…". It finds that polarity analysis can handle "bushi…jiushi…", whereas for "suiran…danshi…suoyi…" there is as yet no effective method, and a common-sense knowledge base would need to be built. In the non-saturated mode it mainly studies the relation markers of sentences with three clauses. It finds that by considering the typical and atypical attributes of relation markers, combining collocation knowledge, and using the semantic relevance features of clauses, the relation markers can be marked up fairly accurately.
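The abstract does not specify the mathematical model used for the ordinary pattern. One way to picture "solving over the solution space" is to enumerate every way of grouping a marker sequence into collocation pairs. The Python sketch below does this under two assumptions that are not stated in the source: every marker belongs to exactly one pair, and pairs may nest or follow one another but never cross. The table `COLLOCATIONS` and the function `pairings` are hypothetical names introduced for illustration.

```python
# Hypothetical collocation table of (front marker, back marker) pairs.
COLLOCATIONS = {("因为", "所以"), ("虽然", "但是")}


def pairings(markers, lo=0, hi=None):
    """Yield each complete, non-crossing pairing of markers[lo:hi].

    A pairing is a list of (i, j) index pairs with i < j whose markers
    form a pair in COLLOCATIONS.
    """
    if hi is None:
        hi = len(markers)
    if lo == hi:  # empty span: exactly one (empty) pairing
        yield []
        return
    # The first marker must pair with a later one; stepping by 2 leaves
    # an even number of markers inside the pair.
    for k in range(lo + 1, hi, 2):
        if (markers[lo], markers[k]) in COLLOCATIONS:
            for inner in pairings(markers, lo + 1, k):
                for rest in pairings(markers, k + 1, hi):
                    yield [(lo, k)] + inner + rest


if __name__ == "__main__":
    sequence = ["因为", "虽然", "但是", "所以"]
    for solution in pairings(sequence):
        print(solution)
```

When the generator yields more than one pairing, the marker sequence is ambiguous under the collocation table alone, and further evidence would be needed to choose among the candidates.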

  • 【Classification code】H146
  • 【Times cited】11
  • 【Downloads】321