Research on Tibetan Word Segmentation and Part-of-Speech Tagging

【Author】 康才畯

【Supervisors】 江荻 (Jiang Di); 潘悟云 (Pan Wuyun)

【Author Information】 Shanghai Normal University, Chinese Minority Languages and Literature, 2014, Doctoral dissertation

【Abstract (translated from the Chinese)】 Tibetan information processing technology has developed for more than twenty years and has achieved considerable results, both in Tibetan script information processing research and the formulation of related standards, and in the development of Tibetan language information processing applications. The field has gradually moved up to the level of language information processing. Although Tibetan information processing research follows closely behind English, Chinese, and other languages in technique, the corpus resources on which such research rests are relatively scarce. The publicly available Tibetan corpora are all unannotated raw corpora of very limited application value. Because fundamental research on Tibetan itself is not deep enough, many properties valuable to Tibetan information processing have not been mined and described, which restricts the development and range of application of Tibetan information processing technology. To address these problems, this dissertation applies a variety of statistical models and methods to Tibetan word segmentation and part-of-speech tagging, with the following main results.

First, a word-position-based Tibetan segmentation method is proposed, one of the earliest efforts at home or abroad to integrate the characteristics of Tibetan abbreviated (adhesion) written forms into Tibetan segmentation research. We adopt a word-position-based statistical method that casts Tibetan segmentation as a sequence labeling problem and implement a Tibetan segmentation system. The system uses a conditional random field model and, in view of the grammatical characteristics of Tibetan abbreviated forms, extends the four-tag word-position set commonly used in Chinese segmentation to a six-tag set better suited to Tibetan; the model is trained on more than one million syllable characters of repeatedly hand-proofread corpus. In tests on large-scale real corpora, the system reached an F-value of 91% in the open test, a basically satisfactory segmentation performance. Further analysis showed that segmentation precision was limited mainly by the recognition of Tibetan abbreviated forms. Considering the complexity and variety of these forms, we built on earlier work and added a rule-based post-processing stage, after which the F-value of the final tests exceeded 95%, sufficient for the practical needs of Tibetan corpus construction.

Second, building on the segmentation work, Tibetan personal-name recognition is investigated on the basis of the characteristics of Tibetan names. After studying these characteristics, we summarized several recognition strategies and finally chose a statistics-based approach. Using a conditional random field model with features such as name boundaries, prefixes and suffixes, and context, we present a method for recognizing Tibetan personal names. The final system achieved an F-value of 91.26% in the open test. Because the fact that names often share the same form as ordinary words, a major source of ambiguity, was not further exploited, recognition performance fell short of ideal; it can be improved by adjusting the feature tag set and optimizing the feature templates.

Third, several statistical models are combined for Tibetan part-of-speech tagging; this is the first work at home or abroad to combine a maximum entropy model with a conditional random field model for Tibetan part-of-speech tagging. Through a study of Tibetan parts of speech, and while still meeting the basic needs of lexical analysis, we reduced the Tibetan word-class tag set to a scale that statistical models can realistically handle, then built a Tibetan part-of-speech tagging system with a maximum entropy model and trained it on a small corpus. Under small-scale training, the maximum entropy tagger reached an accuracy of 87.76%, close to what lexical analysis requires. On top of the maximum entropy model, we propose a correction model based on conditional random fields. The correction model is trained on the outputs of the maximum entropy model, so that correct tags hidden among the second-best and third-best maximum-entropy candidates can be selected, improving tagging accuracy. Experiments show that, with the same training and test corpora, the combined maximum entropy and conditional random field tagger reached an accuracy of 89.12%, approaching the level of comparable Chinese part-of-speech tagging systems.

Fourth, a unified model for joint Tibetan segmentation and part-of-speech tagging based on conditional random fields is implemented, integrating the two tasks into a single system and offering a new route for Tibetan lexical analysis. We exploit the deeper dependencies between segmentation and tagging, using part-of-speech information within the joint model to resolve ambiguities encountered during segmentation. With a relatively small training corpus, the joint model reached a segmentation F-value of 89.0% in the open test, showing that it combines word-position information with the part of speech of the containing word effectively and can improve segmentation precision more effectively; its segmentation quality basically meets the needs of corpus construction for automatic segmentation. The joint model's tagging accuracy reached 85.35%, slightly behind the stand-alone part-of-speech tagger, but enlarging the training corpus should bring a further improvement.

【Abstract】 Tibetan information processing technology has been developing for over twenty years, and great achievements have been made both in Tibetan information processing research and in application development. Tibetan information processing has gradually entered the level of language information processing. Although Tibetan information processing follows English and Chinese technically, the Tibetan corpora on which such research depends are relatively scarce. Almost all openly available corpora are untagged and of limited value. Research on Tibetan itself is not deep enough, so many properties valuable to Tibetan information processing have not been mined and described, and the development and scope of Tibetan information processing technology are limited. To solve the above problems, we adopt several statistical models and methods to study Tibetan word segmentation and part-of-speech tagging, with achievements in the following aspects.

First, we put forward a Tibetan word segmentation method based on word position, among the first work at home or abroad to take full advantage of Tibetan abbreviated forms in Tibetan word segmentation research. We adopted a statistical method based on word position that turns Tibetan word segmentation into a sequence labeling task, and built a Tibetan word segmentation system. The system is based on conditional random fields and extends the 4-tag set used in Chinese word segmentation to a 6-tag set that, according to the grammatical features of Tibetan abbreviated forms, is more suitable for Tibetan word segmentation. We trained the conditional random field model with a manually proofread corpus of more than 1 million syllable characters. The large-scale corpus experiment showed that the F-value of the system reached 91% in the open test, which is satisfactory. In further research, we found that precision was limited by the recognition results for Tibetan abbreviated forms. In consideration of the complexity of these forms, we summarized predecessors' research results and introduced a rule-based post-processing module. In the final experiment, the F-value of the open test exceeded 95%, which means the system can meet the actual demands of Tibetan corpus construction.

Second, building on the word segmentation research, we studied the features of Tibetan names and discussed a recognition method. Through research on Tibetan names, we summarized several strategies for Tibetan name recognition and finally chose a statistical approach. The approach is again based on conditional random fields, using features such as name boundaries, prefixes and suffixes, and context. The experiment showed that the F-value of the approach reached 91.26% in the open test. Regrettably, we did not solve the problem of Tibetan names that share the same form as ordinary words, which limits recognition performance; however, by adjusting the tag set and optimizing the feature templates, the performance of Tibetan name recognition should improve further.

Third, we used a combination of several statistical models to study Tibetan part-of-speech tagging, and for the first time combined a maximum entropy model with a conditional random field model for Tibetan part-of-speech tagging. Through research on Tibetan parts of speech, we first reduced the Tibetan part-of-speech tag set to a size usable by statistical models, then used a maximum entropy model to construct a Tibetan part-of-speech tagging system and trained it with a small-scale corpus. The experiment showed that the precision of the maximum entropy tagger reached 87.76%, which nearly meets the demands of lexical analysis. Based on the maximum entropy model, we put forward an error-correction model using conditional random fields. The correction model is trained on the outputs of the maximum entropy model so that it can pick the right tag from the three highest-probability candidates and improve tagging precision. The experiments showed that, with the same training and test corpora, the model combining maximum entropy with conditional random fields reached 89.12% accuracy, close to the level of comparable Chinese part-of-speech tagging systems.

Fourth, we built an integrated model of Tibetan word segmentation and part-of-speech tagging based on conditional random fields. By integrating segmentation and tagging into a unified system, we put forward a new approach to Tibetan lexical analysis. We took full advantage of the deep dependencies between word segmentation and part-of-speech tagging and used lexical information to resolve ambiguity during segmentation. With a small-scale training corpus, the segmentation F-value of the integrated model reached 89.0%, which shows that the model combines word-position information with part-of-speech context well and can effectively improve segmentation precision; its performance can meet the demands of corpora for automatic word segmentation. The part-of-speech accuracy of the integrated model reached 85.35%, still behind the independent part-of-speech tagging model, but its performance should improve with a larger training corpus.
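The abstract describes casting Tibetan segmentation as syllable-level sequence labeling with a conditional random field over a six-tag word-position set extended from the usual four tags. Below is a minimal, hypothetical sketch of that setup using the third-party sklearn-crfsuite package; the tag names, the feature template, and the adhesion cue are illustrative assumptions, not the thesis's actual configuration.

```python
import sklearn_crfsuite

# Hypothetical 6-tag word-position set: the usual B/M/E/S plus two extra tags
# for syllables carrying an attached (abbreviated) particle; the thesis's
# actual tag inventory is not spelled out in the abstract.
TAGS = ["B", "M", "E", "S", "BS", "ES"]

def syllable_features(syls, i):
    """Context features for the i-th tsheg-delimited syllable."""
    return {
        "syl": syls[i],
        "prev": syls[i - 1] if i > 0 else "<BOS>",
        "next": syls[i + 1] if i + 1 < len(syls) else "<EOS>",
        # Crude cue for abbreviated (adhesion) forms: the syllable ends in a
        # letter that often belongs to an attached case particle (illustrative).
        "adhesion_suffix": syls[i].endswith(("འི", "ས", "ར")),
    }

def sent_features(syls):
    return [syllable_features(syls, i) for i in range(len(syls))]

# Training and prediction with sklearn-crfsuite (corpus not included here):
# crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=200)
# crf.fit([sent_features(s) for s in train_sents], train_tag_seqs)
# tags = crf.predict([sent_features(test_syls)])[0]
```

A rule-based post-processing pass over the predicted tags, as the dissertation adds for abbreviated forms, would then adjust boundaries around syllables flagged by cues like the one above.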
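The third contribution pairs a maximum entropy tagger with a CRF correction model trained on the MaxEnt outputs so that correct tags among the second- and third-best candidates can be recovered. The sketch below shows one plausible way to wire such a two-stage pipeline, with multinomial logistic regression standing in for the MaxEnt stage; all feature names and the exact handoff between the stages are assumptions, not the dissertation's implementation.

```python
import numpy as np
import sklearn_crfsuite
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def word_features(words, i):
    return {
        "w": words[i],
        "prev": words[i - 1] if i > 0 else "<BOS>",
        "next": words[i + 1] if i + 1 < len(words) else "<EOS>",
    }

# Stage 1: a maximum-entropy tagger (multinomial logistic regression).
vec = DictVectorizer()
maxent = LogisticRegression(max_iter=1000)

def train_maxent(sentences, tag_seqs):
    X = [word_features(s, i) for s in sentences for i in range(len(s))]
    y = [t for tags in tag_seqs for t in tags]
    maxent.fit(vec.fit_transform(X), y)

def top3(words, i):
    """The three most probable POS tags for word i, with their probabilities."""
    probs = maxent.predict_proba(vec.transform([word_features(words, i)]))[0]
    best = np.argsort(probs)[::-1][:3]
    return [(maxent.classes_[k], float(probs[k])) for k in best]

# Stage 2: a CRF correction model whose features are the MaxEnt proposals,
# trained against the gold tags so it can promote a lower-ranked candidate.
def correction_features(words, i):
    feats = {"w": words[i]}
    for rank, (tag, p) in enumerate(top3(words, i), start=1):
        feats[f"cand{rank}"] = tag
        feats[f"cand{rank}_p"] = p
    return feats

# corrector = sklearn_crfsuite.CRF(algorithm="lbfgs")
# corrector.fit([[correction_features(s, i) for i in range(len(s))]
#                for s in sentences], tag_seqs)
```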
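The fourth contribution labels each syllable with a tag that encodes both its word position and the part of speech of the word it belongs to, so that a single CRF performs segmentation and tagging jointly. The helpers below show one simple way such a joint label set can be built and decoded; the B/M/E/S scheme and the POS abbreviations are illustrative assumptions (the thesis uses a six-tag position set whose details are not given in the abstract).

```python
def combine_labels(word_position_tags, word_pos_tags, word_lengths):
    """Fuse syllable-level word-position tags with the POS of the containing
    word, e.g. ("B", "n") -> "B-n", so one CRF can learn both decisions."""
    pos_per_syllable = []
    for pos, length in zip(word_pos_tags, word_lengths):
        pos_per_syllable.extend([pos] * length)
    assert len(pos_per_syllable) == len(word_position_tags)
    return [f"{wp}-{pos}" for wp, pos in zip(word_position_tags, pos_per_syllable)]

def split_labels(joint_labels):
    """Recover segmentation spans and word-level POS tags from joint labels."""
    spans, pos_tags, current = [], [], []
    for label in joint_labels:
        wp, pos = label.split("-", 1)
        current.append(wp)
        if wp in ("E", "S"):          # word-final positions under a BMES-style scheme
            spans.append(current)
            pos_tags.append(pos)
            current = []
    return spans, pos_tags

# A two-word toy sentence of 3 + 1 syllables, tagged noun ("n") and verb ("v"):
joint = combine_labels(["B", "M", "E", "S"], ["n", "v"], [3, 1])
print(joint)                 # ['B-n', 'M-n', 'E-n', 'S-v']
print(split_labels(joint))   # ([['B', 'M', 'E'], ['S']], ['n', 'v'])
```

Training a CRF on such combined labels lets the part-of-speech context disambiguate segmentation decisions, which is the dependency the joint model in the dissertation is designed to exploit.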
