A Study on Chinese Functional Chunk Parsing (汉语功能块的自动识别研究)

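【Author】 Liu Haixia (刘海霞)

【Advisor】 Huang Degen (黄德根)

【Author information】 Dalian University of Technology, Computer Application Technology, 2011, Master's thesis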

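【Abstract (Chinese original, translated)】 Chinese functional chunks are functional constituents defined at the sentence level. They generally occupy functional positions in the sentence such as subject, predicate, object, adverbial, attributive, and head, and they reflect the basic skeleton of a Chinese sentence. The goal of functional chunk recognition is to correctly annotate the functional chunk labels of a sentence, covering each basic information unit produced by top-down decomposition of event sentence patterns, so as to expose the basic structure and skeleton of the sentence at the clause level and to provide the minimal functional chunk description sequence for subsequent event skeleton tree analysis. This thesis recasts the automatic recognition of Chinese functional chunks as a sequence labeling problem and uses conditional random fields (CRFs) as the sequence labeler. A CRF is a conditional probability model based on an undirected graph. Arbitrary effective features can be added to it, it can express long-distance dependencies and overlapping features, and it handles problems such as label bias well. For these reasons the thesis uses CRFs to build the sequence labeling model for functional chunks. To build a good automatic recognition system, the thesis first recognizes Chinese functional chunks with a feature template optimization strategy, obtaining a precision, recall, and F1-measure of 85.84%, 85.07%, and 85.45% respectively. The F1-measures of the four typical functional chunks (subject, predicate, object, and adverbial) reach 85.16%, 88.22%, 81.75%, and 91.98% respectively. On this basis, the thesis introduces semantic information into a Chinese functional chunk recognition system for the first time. It takes Tongyici Cilin (《同义词词林》), a thesaurus that organizes words by sense-aggregation relations, as the semantic resource and adds its semantic information as features in the recognition process. This mitigates the effect of data sparseness and ambiguity on the results and raises the three measures above to 86.21%, 85.31%, and 85.76% respectively, a considerable improvement over using the conditional random field model alone.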
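The reported F1-measures can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below does that arithmetic; the function name `f1_score` is an illustrative choice and is not taken from the thesis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract.
baseline = round(f1_score(85.84, 85.07), 2)        # feature template optimization only
with_semantics = round(f1_score(86.21, 85.31), 2)  # with Tongyici Cilin features added

print(baseline, with_semantics)  # prints: 85.45 85.76
```

Both values agree with the F1-measures stated in the abstract, and the gain from the semantic features works out to 0.31 points of F1.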

【Abstract】 This thesis recasts the automatic parsing of Chinese functional chunks as a sequence labeling problem. We build a sequence labeling model for Chinese functional chunks based on Conditional Random Fields (CRFs), a conditional probability model defined on an undirected graph. Arbitrary effective features can be added to a CRF model. CRFs can express long-distance dependencies and overlapping features, so they handle the label bias problem well. In addition, all features are normalized globally, which allows a globally optimal solution to be found. Unlike Hidden Markov Models, CRFs do not impose strong assumptions on the probability distribution of the input or output, so they are well suited to sequence labeling, and we choose them for labeling Chinese functional chunks. We focus on building a system that labels Chinese functional chunks by detecting chunk boundaries and assigning functional labels in sentences that already have correct word segmentation and POS tagging. The thesis proposes an approach that combines a feature template optimization strategy with a CRF model for automatically labeling Chinese functional chunks.
On the test data set, the precision, recall, and F1-measure for Chinese functional chunks reach 85.84%, 85.07%, and 85.45% respectively, and the F1-measures for the subject, predicate, object, and adverbial chunks reach 85.16%, 88.22%, 81.75%, and 91.98% respectively. The system ranked first in the closed test of CIPS-ParsEval-2009 Task 3 (Function Chunk). Building on this combination of feature template optimization and the CRF model, an existing language resource, the Chinese thesaurus "Tongyici Cilin", is introduced into the processing module, and its semantic information is added to the feature templates. This mitigates the effects of data sparseness and ambiguity and raises the three performance measures to 86.21%, 85.31%, and 85.76% respectively, better than the previous method based on the CRF model alone.
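To make the sequence labeling formulation concrete, the following is a minimal sketch of how chunk spans might be encoded as per-token tags and how a feature window with a semantic-class feature might be built for a CRF. Everything in it is an assumption for illustration: the BIO tag scheme, the label names (`SUBJ`, `PRED`, `OBJ`, `ADV`), the feature names, the toy sentence, and the placeholder semantic codes (which are not real Tongyici Cilin entries). The thesis's actual feature templates are not given in the abstract.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]       # (word, POS tag), assumed already segmented and tagged
Chunk = Tuple[int, int, str]  # (start index, end index exclusive, chunk label)


def chunks_to_bio(n_tokens: int, chunks: List[Chunk]) -> List[str]:
    """Convert chunk spans into one BIO tag per token."""
    tags = ["O"] * n_tokens  # "O" marks tokens outside any chunk
    for start, end, label in chunks:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


def token_features(tokens: List[Token], i: int,
                   sem_codes: Dict[str, str]) -> Dict[str, str]:
    """Build a +/-1 window of word and POS features plus a semantic-class feature."""
    word, pos = tokens[i]
    feats = {"w0": word, "p0": pos, "sem0": sem_codes.get(word, "UNK")}
    if i > 0:
        feats["w-1"], feats["p-1"] = tokens[i - 1]
    else:
        feats["BOS"] = "1"  # beginning of sentence
    if i < len(tokens) - 1:
        feats["w+1"], feats["p+1"] = tokens[i + 1]
    else:
        feats["EOS"] = "1"  # end of sentence
    return feats


# Invented toy sentence: "他 昨天 读 了 一 本 书" (He read a book yesterday).
tokens = [("他", "r"), ("昨天", "t"), ("读", "v"), ("了", "u"),
          ("一", "m"), ("本", "q"), ("书", "n")]
chunks = [(0, 1, "SUBJ"), (1, 2, "ADV"), (2, 4, "PRED"), (4, 7, "OBJ")]
sem_codes = {"他": "CLASS_PERSON", "书": "CLASS_OBJECT"}  # placeholder codes only

tags = chunks_to_bio(len(tokens), chunks)
features = [token_features(tokens, i, sem_codes) for i in range(len(tokens))]
print(tags)
```

The resulting feature dictionaries and tag sequences are the kind of input a CRF toolkit consumes for training. A word missing from the thesaurus falls back to `UNK`, while words sharing a semantic class share a feature value, which is the intuition behind using thesaurus classes to reduce data sparseness.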
