汉语动词名物化复合结构的语义解释
The Semantic Interpretation of Chinese Verb Nominalization Compound
【Author】 Zhao Jinglei (赵京雷);
【Advisor】 Lu Ruzhan (陆汝占);
【Author information】 Shanghai Jiao Tong University, Computer Software and Theory, 2008, PhD
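【Abstract (translated from the Chinese)】 Web information and information retrieval have become an indispensable part of daily life. Most of this information takes the form of language, yet what people actually care about is the content that the linguistic form carries, which in essence involves the semantic concepts of natural language. Computing the semantic relations between the constituent structures of natural language is the key to natural language understanding; in essence it means computing the correspondence between linguistic structure and linguistic meaning. Finding new ideas, theories, and methods that bring linguistic structure and semantics as close to an isomorphic correspondence as possible, and in particular that support dynamic computation of the conceptual meaning of compound structures, is of both theoretical importance and broad practical promise. Although linguistic expressions take many forms, such as sentences and phrases, from the standpoint of conceptual analysis they all reduce to the combination and superposition of lexical concepts. This is consistent with the current focus on lexical theories in linguistics in China and abroad.

A compound is a linguistic structure formed by directly combining several nominal words, which as a whole functions as a new nominal word. Unlike phrases and sentences, compounds are formed without functional markers, which greatly hinders their semantic computation and in practice makes them a bottleneck. This dissertation addresses the semantic interpretation of Chinese compounds that contain a verb nominalization. The study starts from the analysis of examples: it examines the shortcomings of earlier grammatical research, annotates the conceptual relations between the constituents of compounds, and summarizes the intrinsic characteristics of concept coupling in compounds as well as the potential for expressions in different languages to align naturally at the compound level. First, as data preparation, the dissertation studies the identification of verb nominalization compounds. It then builds semantic interpretation models for the two basic types of verb nominalization compounds (NV and VN). Finally, it explores the use of attribute knowledge in compound interpretation. Specifically, the contributions are as follows:

1. A compound identification method based on a thesaurus and the World Wide Web. To resolve the structural ambiguity that arises when Chinese nouns and verbs combine, two new sets of classification features are constructed: lexical compounding ability and referential pattern features. The features are acquired from two independent resources, a thesaurus and the Web. Because they do not depend on the specific context in which a compound occurs, they can be used to identify low-frequency compounds in documents, which earlier identification models could not do. Machine learning experiments show that the two new feature sets greatly improve the performance of verb nominalization compound identification.

2. A summary of the semantic relations found in Chinese NV compounds, and a compound interpretation model based on lexical syntactic patterns. The model defines a new form of lexical pattern, the functional lexicalized pattern, and uses it as a classification feature for labeling the semantic relations between the words of a compound. The main advantage of the model is its low dependence on resources: earlier methods relied mainly on lexical ontologies and syntactically annotated corpora, whereas this model acquires its classification features from plain text, which greatly increases its applicability and portability. Experiments show that the model based on functional lexicalized patterns performs well.

3. A set of semantic relation labels for Chinese VN compounds, and a machine-translation-driven compound interpretation model. Based on the hypothesis that compounds are naturally isomorphic across languages, the model first automatically translates a Chinese compound into an aligned English compound, and then uses the English compound as additional information for interpreting the Chinese compound. The main advantage of the model is that it can draw on cross-lingual resources and interpret aligned compounds in several languages at the same time, which to some extent alleviates the shortage of resources in some languages. Experiments confirm that the bilingual interpretation model outperforms the monolingual model.

4. A framework for acquiring an attribute knowledge base. A lexical concept can be described as a set of attributes and attribute values, and attribute knowledge is very important for compound interpretation. Attribute acquisition has two stages: acquiring attribute words and determining attribute hosts. For attribute word acquisition, a co-bootstrapping algorithm over a machine-readable dictionary and the Web is designed. The algorithm exploits the way Chinese words are built from meaning-bearing morphemes and uses both the machine-readable dictionary and the Web as sources of attribute knowledge. Determining the attribute host is treated as a selectional constraint resolution problem: the host of an attribute is determined by evaluating the selectional association between the attribute and the candidate concept classes. The method can acquire attribute-centered lexical knowledge dynamically and efficiently.

5. A word similarity model based on attribute words, built on the acquired attribute knowledge. Unlike earlier similarity measures based on lexical hierarchies, the model represents a lexical concept by the attribute words it can take. Attribute words characterize different aspects of a concept; the more key attribute information two lexical concepts share, the more similar they are. Representing lexical concepts as attribute word vectors therefore captures finer distinctions between them. In evaluation on standard datasets and in application to compound interpretation, the model outperforms other word similarity models.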
【Abstract】 Information on the Web and information retrieval have become an essential part of daily life. Language is the main form of information, and there is an urgent need to make computers understand the content and semantics of linguistic information. The computation of semantic relations between natural language structures is the key to natural language understanding. The essence of this semantic computation is to compute the correspondence between structures and semantic representations. Although there are many forms of language structure, they can all be reduced to word combinations. This is in accordance with the trend toward lexical approaches in language theory.

A compound is a consecutive sequence of nominal words that functions as a new nominal word as a whole. The semantic problem in word combination has been a major concern for scholars working in this area, because research on it is significant in both theory and application. However, unlike other language structures, compound formation offers no semantic clues such as function words, which presents a big challenge for computing the semantics of compound expressions. This dissertation focuses on the semantic interpretation of a subset of Chinese compounds in which a verb nominalization is involved. First, as data preparation, it explores the problem of compound identification. Second, it constructs interpretation models for the two basic types of verb nominalization compounds (NV compounds and VN compounds), respectively. Finally, it explores the application of attribute knowledge to compound interpretation.

Specifically, the contributions of this dissertation are as follows:

1. The author proposes a method for compound identification based on a thesaurus and the Web. To resolve the structural ambiguity in Chinese verb and noun combinations, the identification model introduces two novel feature sets: compounding ability and referential patterns. The acquisition of these features does not rely on the context of the compound candidate. Instead, it uses two independent sources: a thesaurus and the Web. The advantage of this approach is that it can recognize compounds that occur with low frequency in text. The machine learning experiments show that the novel features greatly improve the performance of compound identification.

2. The author introduces the semantic relations involved in Chinese NV compounds and then implements a compound interpretation model based on lexical syntactic patterns (LSPs). A new form of LSP is defined, called functional lexicalized patterns (FLPs). The FLP vector of an NV compound is used as the feature set for labeling its semantic relations. Unlike previous approaches, which mainly rely on ontologies or treebanks, the model acquires its classification features from plain text, which makes it more robust and easier to generalize.

3. The author presents the set of semantic relations of Chinese VN compounds and then proposes a translation-driven bilingual compound interpretation model. The model first translates Chinese compounds into their English equivalents. It then uses the English compounds as additional information to interpret the Chinese VN compounds. The main merit of the model is that it can use cross-linguistic resources to interpret multilingual compounds at the same time. The experiments verify that the bilingual model performs better than the monolingual model.

4. For application in compound interpretation, the author designs a framework for constructing an attribute knowledge base. It includes two phases: attribute word acquisition and attribute host computation. For the first phase, the author proposes a multi-resource bootstrapping algorithm that starts from a set of Chinese morphemes and exploits both a machine-readable dictionary (MRD) and the Web as resources. The second phase is modeled as a problem of selectional constraint resolution. The characteristic of the framework is that it can dynamically acquire attribute-centered knowledge. The experiments show that the algorithms are effective.

5. Applying the acquired attribute knowledge, the author presents an attribute-based word similarity model. Unlike previous models, which mainly exploit IS-A taxonomies, the proposed model represents a word concept by the attribute words it can take. The more important attributes two concepts share, the more similar they are. Such an attribute representation can capture more fine-grained differences between word concepts. The attribute-based word similarity model achieves good results both on standard datasets and in the application to compound interpretation.
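
The attribute-based similarity idea in item 5 can be illustrated with a minimal sketch. The abstract does not state how attributes are weighted or which similarity measure is used, so everything here is an assumption made for illustration: the cosine measure, the raw-count weights, the function names, and the toy attribute data are all hypothetical and are not taken from the dissertation.

```python
import math

# Hypothetical toy data: each word concept maps attribute words to a weight
# (here a made-up co-occurrence count). Not data from the dissertation.
TOY_ATTRIBUTES = {
    "car": {"speed": 3, "price": 2, "color": 1},
    "truck": {"speed": 2, "price": 2, "capacity": 3},
    "idea": {"originality": 3, "value": 1},
}


def attribute_vector(attribute_weights):
    """Scale a sparse attribute-weight mapping to unit length."""
    norm = math.sqrt(sum(w * w for w in attribute_weights.values()))
    if norm == 0:
        return {}
    return {attr: w / norm for attr, w in attribute_weights.items()}


def attribute_similarity(weights_a, weights_b):
    """Cosine similarity between two attribute-word vectors.

    Cosine is an assumed stand-in measure: concepts that share more
    heavily weighted attributes score closer to 1, and concepts with
    no shared attributes score 0.
    """
    vec_a = attribute_vector(weights_a)
    vec_b = attribute_vector(weights_b)
    return sum(w * vec_b.get(attr, 0.0) for attr, w in vec_a.items())


if __name__ == "__main__":
    car, truck, idea = (TOY_ATTRIBUTES[w] for w in ("car", "truck", "idea"))
    print("car vs truck:", round(attribute_similarity(car, truck), 3))
    print("car vs idea:", round(attribute_similarity(car, idea), 3))
```

In this toy setting, "car" and "truck" share the attributes "speed" and "price" and so score higher than "car" and "idea", which share none. A real system would need attribute weights derived from the acquired attribute knowledge base rather than hand-written counts.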
【Key words】 Compound; Verb Nominalization; Semantic Interpretation; Attribute Knowledge; Word Similarity;