

【Author】 袁彩霞

【Advisors】 钟义信; 任福继;

【Author information】 Beijing University of Posts and Telecommunications, Signal and Information Processing, 2009, PhD

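【Summary】 In recent years, Chinese natural language processing has made considerable progress in word segmentation, part-of-speech tagging, and related tasks. However, applied NLP systems such as information extraction and question answering require a deeper interpretation of text. Functional chunk analysis automatically labels sentence constituents with function tags such as subject, object, temporal adverbial, and locative adverbial. As one way of realizing syntactic analysis and semantic understanding, it is clearly defined and easy to evaluate, and it has attracted growing attention from researchers. This thesis proposes an automatic Chinese functional chunk analysis technique based on a sequence discriminative model. It extends the traditional support vector machine classification model to sequence learning tasks and makes flexible use of multiple interdependent features between the input and output sequences. The experiments show that the proposed method achieves the best performance reported so far for Chinese functional chunk analysis, with an overall system F1 score of 93.76, and that it extends well to different feature sets and suits many natural language processing problems. The first part of the thesis sets out several aspects of functional chunk analysis: the motivation for and significance of the topic, the state of related research, and the focus of this thesis. It then briefly introduces the corpus used in this work, the Penn Chinese Treebank from the University of Pennsylvania, and discusses the criteria for delimiting Chinese functional chunks. Next, taking the comprehensive-information methodology of natural language understanding as its basis, it analyzes the position and role of functional chunk tags in syntactic, semantic, and pragmatic understanding. Finally, it describes several commonly used metrics for evaluating chunk labeling performance. The second part presents the theoretical basis of the research in detail and describes how the sequence discriminative model is applied to Chinese functional chunk recognition. Building on an analysis of the algorithmic model, it constructs an automatic Chinese functional chunk labeler and, through extensive experiments, analyzes system performance from several angles, examining how different features affect performance and how those effects can be explained linguistically. It then compares two different ways of building a functional chunk labeler: one that uses simple lexical information (words, part-of-speech tags, etc.) and one that uses full parse tree information (phrase type, parse tree path, etc.). The experiments show that the labeler based on lexical information is stable in performance and adapts well across domains, is suitable for languages that lack syntactic resources or for which parsing is itself difficult, and is an effective method for Chinese functional chunk analysis. The third part describes the application of functional chunk tags in natural language processing systems. We take the identification of opinion elements in text opinion mining as the application domain and, using the functional categories of sentence constituents as the basis, build an automatic opinion topic identification system, verifying the feasibility of the approach experimentally. Finally, the thesis gives its conclusions and directions for future research. This work not only implements an automatic analysis system for Chinese functional chunks but also gives them a clear definition from the perspective of computational linguistics, providing a reference for related applied research. The current experimental results also indicate that functional chunk analysis has very promising applications.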

【Abstract】 As researchers improve results on various problems in "pure" natural language processing (e.g. part-of-speech tagging, parsing), those who work in the more "applied" NLP fields (e.g. question answering, information extraction) are seeking more powerful sorts of linguistic annotation as input for their own systems. Function tags are a context-sensitive annotation applied to words and phrases of natural language text, marking their syntactic or semantic role within a larger utterance. In this thesis we develop a sequential prediction model for Chinese function tag labeling. We show that this method provides state-of-the-art accuracy, yielding an F1 score of 93.76, is extensible through the feature set, and can be implemented efficiently. Furthermore, we describe the specific properties of Chinese function tags by comparing them with those of English, and show their practical applicability through integration into an opinion holder recognition system. In the first part of the thesis, we present the problem of function tag labeling: why it is an interesting problem, who else has worked on similar problems, and what exactly we intend to do. We then briefly review the dataset we work with, the Penn Chinese Treebank, and explain the specific metrics by which we evaluate our work. In the second part of the thesis, we present a sequential prediction model. This leads to the heart of the thesis: automatic function tag labeling. Here we formulate function tag labeling as a sequence learning problem within structural spaces, yielding state-of-the-art accuracy and high robustness. We then analyze which features prove most helpful for Chinese function tag assignment and why we think they are useful in this task, and introduce two entirely different function labeling systems: one assigning function tags to unparsed text using simple lexical features (word, part-of-speech tag, etc.), and one assigning function tags to parsed text using features collected from the full parse trees (phrase type, tree path, etc.). We discuss the advantages and disadvantages of each system in various situations, and compare our function tagger with other state-of-the-art systems. Finally, in the third part of the thesis, we show how this work benefits applications in text opinion mining. We introduce our preliminary work on opinion holder recognition using function tags as clues, to show its applicability to a real-world problem. Lastly, we present a comparison with other systems performing related tasks and speculate on some interesting future work. The proposed work defines Chinese function tags from a computational perspective and yields an automatic Chinese function tag labeler. The results offer guidance and a useful reference for other related work. In addition, the experiments suggest that function tags have promising applications.


