
中文自然语言理解中基于条件随机场理论的词法分析研究

Research on Morpheme Analysis Based on Conditional Random Fields in Chinese Natural Language Understanding

【Author】 Xiong Ying

【Supervisor】 Zhu Jie

【Author Information】 Shanghai Jiao Tong University, Circuits and Systems, 2009, Ph.D.

【摘要】 With the continuous development of computer technology and the wide spread of the Internet, people urgently need a natural and convenient way to communicate with computers, so that computers can "understand" human language. Speech recognition is the key technology for realizing such a human-computer interface, and the statistical language model, one of the cornerstones of current continuous speech recognition technology, cannot do without the support of natural language processing. For Chinese, morpheme analysis is the foundation and key of Chinese information processing: it directly affects subsequent syntactic analysis and semantic understanding, and ultimately the practical application system. It has therefore long been a hot and difficult topic in Chinese information processing research. This dissertation systematically introduces the conditional random fields (CRFs) model and its application to Chinese morpheme analysis, and analyzes the mainstream CRF training criteria and parameter optimization methods. Taking Chinese morpheme analysis as the application background, it then studies CRF training criteria from the perspective of discriminative principles, proposes a CRF-based method for resolving overlapping ambiguities, and discusses new-word extraction and lexicon optimization algorithms for specific domains, providing new methods and ideas for research on Chinese morpheme analysis. Finally, the application of Chinese morpheme research to Chinese speech recognition is briefly described.

First, discriminative training criteria for CRFs are studied. Current CRF parameter training is mainly based on maximum likelihood / maximum a posteriori estimation, whose goal is to maximize the probability of the correct label sequences in the training corpus. A model built with this objective, however, cannot guarantee that the best label sequence will be found in a real test environment and that a high labeling accuracy will be obtained; there is thus a mismatch between the current training criteria and the evaluation metrics of sequence labeling. To address this problem, this dissertation proposes a new discriminative training criterion, minimum tag error (MTE). The criterion weights each candidate path by its accuracy relative to the reference path and takes the maximization of the average accuracy over the training corpus as the objective function. To compute the average accuracy efficiently, a new forward-backward algorithm is also proposed and the solution of the accuracy expectation is derived. Experiments show that the criterion not only slightly raises the segmentation F-score but also significantly improves the recall of out-of-vocabulary (OOV) words; that is, it has a clear advantage in recognizing unknown words. Its performance on named entity recognition is also considerably improved.

Second, since probabilistic graphical models such as CRFs lack the good generalization ability of the support vector machine (SVM), this dissertation draws on the large-margin principle and proposes a large-margin-style discriminative CRF training method, boosted conditional random fields (BCRF). The method not only inherits the convexity of traditional CRFs, guaranteeing a globally optimal solution, but also incorporates the generalization ability of large-margin models: it can be understood as inserting a "soft margin" between the correct label sequence and each candidate sequence, proportional to the Hamming distance between the two (the number of wrongly labeled elements in the candidate sequence). Experimental results show that the method clearly outperforms the traditional maximum a posteriori approach, improving both segmentation accuracy and the recognition of OOV words and named entities. Compared with MTE, its segmentation accuracy and recognition performance decrease slightly, but its parameter computation is simpler, requiring no second forward-backward pass.

Third, methods for resolving Chinese overlapping ambiguities are discussed. Exploiting the SVM's excellent classification performance and its suitability for high-dimensional data, the principles of feature selection and representation for SVM-based overlapping ambiguity resolution are studied. By analyzing the differences between the two segmentation forms of an overlapping ambiguous string, four statistics (mutual information, accessor variety, two-character word frequency and single-character word frequency) are used for feature representation and fusion, and the effect of different feature representations on classification performance is compared. Experiments show that feature selection and representation are crucial to SVM classification performance: high-dimensional feature vectors composed of complementary features greatly improve the disambiguation ability of SVM classifiers. Since the SVM approach must convert an ambiguous string whose chain length exceeds 1 into several strings of chain length 1, which is inconvenient, this dissertation proposes a CRF-based disambiguation method that turns the traditional binary classification problem into a sequence labeling problem. The method can handle ambiguous strings of any chain length and, for truly ambiguous strings, makes full use of context to output the correct segmentation in different linguistic environments. Experimental results show that the method achieves the best performance to date.

Next, new-word extraction and lexicon optimization algorithms for specific domains are discussed. When domain-specific training corpora are scarce, supervised machine learning cannot play to its strengths; and although dictionary-based maximum matching is the simplest and most effective segmentation method, the lack of domain-specific dictionaries and the constant emergence of new words severely hurt its segmentation accuracy in specific domains. Starting from a general-purpose dictionary as the initial lexicon and applying heuristic disambiguation rules on a coarse segmentation, this dissertation proposes an improved new-word extraction and lexicon optimization algorithm. The algorithm takes the minimization of language model perplexity as the extraction criterion, automatically extracts new words from the candidate set, and adds them to the initial dictionary to obtain an expanded dictionary suitable for the specific domain. To compute the change in model perplexity before and after a candidate word is added, a simple and effective approximate computation method is proposed. Experimental results show that the algorithm extracts many domain-specific terms, effectively reduces model perplexity, and raises segmentation accuracy.

Finally, the application of language models in speech recognition systems is briefly introduced, and the role of Chinese morpheme research in statistical language modeling and its influence on the performance of speech recognition systems are analyzed.

【Abstract】 With the constant development of computer technology and the widespread popularity of the Internet, people urgently need a natural and convenient way to communicate with computers so that computers can "understand" human language. Speech recognition is the key technology for realizing this human-computer interface. The statistical language model, one of the cornerstones of current continuous speech recognition technology, requires the support of natural language processing. For Chinese, morpheme analysis is the basis and key of Chinese information processing because it directly affects subsequent syntactic analysis and semantic understanding, and ultimately the practical application system. Chinese morpheme analysis has therefore long been a hot and difficult topic in Chinese information processing research.

In this dissertation, we first study the conditional random fields (CRFs) model and its application to Chinese morpheme analysis, and analyze the dominant training criteria and parameter optimization methods. Against the background of Chinese morpheme analysis, new training criteria based on the discriminative principle are studied, a CRF-based method for overlapping ambiguity resolution is proposed, and an algorithm for new-word extraction and lexicon optimization in specific domains is discussed, all of which provide new approaches and ideas for Chinese morpheme analysis. Finally, we briefly describe the application of Chinese morpheme analysis to speech recognition.

Discriminative training criteria for CRFs are investigated first. Current CRF training methods are mainly based on maximum likelihood (ML) or maximum a posteriori (MAP) estimation, which aim to maximize the probability of the correct label sequence in the training data.
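The ML objective described above can be made concrete with a toy linear-chain model: parameters are chosen so that the reference label sequence receives maximal conditional probability. The sketch below is purely illustrative (all weights and characters are invented, and brute-force enumeration stands in for the dynamic programming a real CRF toolkit would use):

```python
import itertools
import math

# BMES tagset commonly used for Chinese word segmentation:
# Begin / Middle / End of a word, or Single-character word.
LABELS = ["B", "M", "E", "S"]

def score(x, y, w_emit, w_trans):
    """Unnormalized linear-chain score: emission plus transition weights."""
    s = sum(w_emit.get((xi, yi), 0.0) for xi, yi in zip(x, y))
    s += sum(w_trans.get((a, b), 0.0) for a, b in zip(y, y[1:]))
    return s

def prob(x, y, w_emit, w_trans):
    """P(y|x) by enumerating all label sequences (toy-sized inputs only)."""
    z = sum(math.exp(score(x, list(c), w_emit, w_trans))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(x, y, w_emit, w_trans)) / z

# Invented weights that favour the reference tagging of a 3-character input.
w_emit = {("上", "B"): 1.5, ("海", "E"): 1.2, ("人", "S"): 1.0}
w_trans = {("B", "E"): 0.8, ("E", "S"): 0.5}

x = ["上", "海", "人"]
ref = ["B", "E", "S"]  # reference segmentation: 上海 / 人
# ML/MAP training drives log P(ref | x) upward over the whole corpus.
log_likelihood = math.log(prob(x, ref, w_emit, w_trans))
```

With these weights the reference sequence is also the model's single best path, which is exactly the situation ML training tries to create for every training sentence.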
The best sequence selected by such models is not guaranteed to reach high accuracy in a real test environment, so there is a mismatch between the training criterion and the performance evaluation metric in the task of sequence labeling. A new discriminative training criterion called minimum tag error (MTE), integrated with sentence tagging accuracy, is proposed in this dissertation. The MTE objective function aims to maximize the expected tagging accuracy on the training corpus. To calculate this expected accuracy efficiently, a new forward-backward algorithm is presented and the accuracy expectation is derived. Experiments show that the MTE criterion not only improves the F-score but also increases the OOV recall significantly; that is, the MTE criterion has a clear advantage in recognizing out-of-vocabulary (OOV) words. At the same time, MTE training improves the performance of named entity recognition.

Secondly, since probabilistic graphical models such as CRFs do not enjoy the good generalization ability of the support vector machine (SVM), a new discriminative training method named boosted conditional random fields (BCRF), motivated by the theory of large margins, is proposed. The new method not only inherits the convexity of CRFs, which guarantees a globally optimal solution, but also incorporates the generalization ability of large-margin models. BCRF can be understood as enforcing a soft margin between the reference sequence and each hypothesized one, proportional to their Hamming distance (the number of labeling errors in the hypothesized sequence). Experiments show that the presented method achieves significant improvement over the traditional MAP method: it improves segmentation accuracy as well as OOV identification and named entity recognition.
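The MTE objective can be sketched as follows: rather than the probability of the single reference path, it maximizes the expected per-tag accuracy under the model distribution, so every candidate path contributes in proportion to both its probability and its accuracy. The toy below (invented weights and data; brute-force enumeration in place of the dissertation's modified forward-backward algorithm) computes that expectation:

```python
import itertools
import math

LABELS = ["B", "M", "E", "S"]  # BMES segmentation tagset

def seq_score(x, y, w):
    """Unnormalized score; emission weights only, for brevity."""
    return sum(w.get((xi, yi), 0.0) for xi, yi in zip(x, y))

def expected_accuracy(x, ref, w):
    """MTE-style objective: E_{y ~ P(y|x)}[per-tag accuracy(y, ref)].

    Enumeration stands in for the efficient forward-backward computation
    of the same expectation described in the dissertation."""
    cands = [list(c) for c in itertools.product(LABELS, repeat=len(x))]
    exp_scores = [math.exp(seq_score(x, y, w)) for y in cands]
    z = sum(exp_scores)
    total = 0.0
    for y, es in zip(cands, exp_scores):
        acc = sum(a == b for a, b in zip(y, ref)) / len(ref)
        total += (es / z) * acc
    return total

x = ["中", "国"]
ref = ["B", "E"]                           # reference: 中国 as one word
w = {("中", "B"): 2.0, ("国", "E"): 2.0}   # invented weights

# Under a flat model every tag is equally likely; sharpening the weights
# toward the reference raises the expectation, which is what MTE training does.
flat = expected_accuracy(x, ref, {})
trained = expected_accuracy(x, ref, w)
```

With four labels and a uniform model, each position matches the reference with probability 1/4, so the flat expectation is 0.25; the trained weights push it well above that.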
Compared with the MTE criterion, however, although the segmentation accuracy and recognition performance obtained by the BCRF method decrease slightly, its parameter optimization is comparatively simple, requiring no second forward-backward pass.

Thirdly, Chinese overlapping ambiguity resolution is discussed in this dissertation. Since the SVM performs remarkably well on classification tasks and can handle high-dimensional vectors, feature selection and representation for SVM-based resolution of Chinese overlapping ambiguous strings are studied. Based on the two possible segmentation forms of an ambiguous string, four statistics (mutual information, accessor variety, two-character word frequency and single-character word frequency) are adopted to build feature vectors of different dimensions, and classification performance under the different representations is compared. The experiments show that feature selection and representation are vitally important for classification performance: high-dimensional features built from complementary statistics greatly improve the disambiguation ability of SVM classifiers. However, it is inconvenient for SVM classifiers to handle ambiguous strings longer than three characters, because such strings must first be converted into multiple three-character ambiguous strings. To solve this problem, a new CRF-based method is proposed: instead of treating overlapping ambiguity as a binary classification problem, as traditional methods do, the new method regards it as a sequence labeling problem. The proposed method can handle overlapping ambiguous strings of any length, whether the ambiguity is pseudo or true, while simultaneously exploiting context information and the dependencies among the predicted labels.
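The reduction from binary classification to sequence labeling can be illustrated on the classic overlapping ambiguous string 结合成, which reads either 结合/成 ("combine into") or 结/合成 ("... synthesize"). Under a BMES tagset the two readings are two different tag sequences, so choosing a segmentation becomes a labeling decision driven by context. The context features and weights below are invented for illustration; a trained CRF would learn such scores from data:

```python
# The two rival readings of the overlapping ambiguous string 结合成,
# each encoded as a BMES tag sequence over its three characters.
READINGS = {
    "结合/成": ["B", "E", "S"],  # 结合 (combine) + 成
    "结/合成": ["S", "B", "E"],  # 结 + 合成 (synthesize)
}

def choose_reading(left_context, w):
    """Pick the reading whose tag sequence scores highest in context.

    `w` maps (context_word, reading) to an invented feature weight,
    standing in for trained CRF context features."""
    return max(READINGS, key=lambda r: w.get((left_context, r), 0.0))

# Invented context weights: after 两者 ("the two"), 结合/成 fits;
# after 聚 ("gather"), the 合成 reading is more plausible.
w = {("两者", "结合/成"): 1.0, ("聚", "结/合成"): 1.0}

reading_a = choose_reading("两者", w)
reading_b = choose_reading("聚", w)
```

Because each reading is just a different label sequence over the same characters, strings of any chain length fit the same formulation without being split into three-character pieces first.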
The experimental results show that this method achieves state-of-the-art performance.

New-word extraction and lexicon optimization in specific domains are then studied. Since training data for a specific domain are extremely scarce, supervised machine learning methods cannot exploit their advantages. Although dictionary-based maximum matching is simple and efficient, its segmentation accuracy suffers seriously from the lack of a domain-specific lexicon and the constant emergence of new words. In this dissertation, an initial segmentation is obtained with heuristic rules, using a general-purpose lexicon as the original lexicon. Based on this initial segmentation, we present an improved method for new-word extraction and lexicon optimization. The proposed approach selects new words under a perplexity minimization criterion, extracts them from the candidate word lists, and adds them to the original lexicon; the augmented lexicon, which contains the new words, can then be regarded as a domain-specific lexicon. To efficiently calculate the language model perplexity before and after a candidate word is added to the lexicon, a simple substitution method is proposed to approximate the perplexity change. Experiments show that this method not only extracts many domain-specific new words, but also reduces model perplexity and improves segmentation accuracy.

Finally, the application of the language model to speech recognition systems is briefly introduced, and the role of Chinese morpheme research in statistical language modeling and its influence on speech recognition performance are analyzed.
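The dictionary-plus-perplexity idea can be sketched in miniature: segment a raw domain corpus with maximum matching, measure language model perplexity, and accept a candidate word only if adding it to the lexicon lowers the perplexity. This is a toy version, not the dissertation's actual algorithm; the corpus, lexicon and candidate are invented, and a unigram MLE model replaces a real language model:

```python
import math
from collections import Counter

def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching segmentation."""
    out, i = [], 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            if l == 1 or text[i:i + l] in lexicon:
                out.append(text[i:i + l])
                i += l
                break
    return out

def perplexity(words):
    """Unigram perplexity with MLE estimates from the same token list
    (a stand-in for the dissertation's language-model criterion)."""
    counts, n = Counter(words), len(words)
    logp = sum(math.log(counts[w] / n) for w in words)
    return math.exp(-logp / n)

# Invented toy corpus from a hypothetical medical domain: 糖尿病 (diabetes)
# is absent from the general-purpose lexicon, so it gets over-segmented.
corpus = "糖尿病患者糖尿病治疗糖尿病预防"
lexicon = {"患者", "治疗", "预防"}

base = perplexity(max_match(corpus, lexicon))
candidate = "糖尿病"
trial = perplexity(max_match(corpus, lexicon | {candidate}))
# Accept the candidate only when it lowers the model's perplexity.
accepted = trial < base
```

Recomputing the full perplexity for every candidate is what the dissertation's approximate substitution method avoids; the toy simply shows why a genuine domain word drives the perplexity down.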
