
基于最大熵的汉语词性标注

Chinese POS Tagging Based on Maximum Entropy

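【Author】 Kong Haixia (孔海霞)

【Advisor】 Huang Degen (黄德根)

【Author information】 Dalian University of Technology, Computer Applications, 2007, Master's thesis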

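【Abstract (Chinese original)】 Part-of-speech (POS) tagging assigns the correct part of speech to every word in a text. It is a foundation of natural language processing (NLP), and its accuracy affects the accuracy of later syntactic analysis or chunk analysis. Errors made during POS tagging are amplified along the subsequent NLP processing chain, so tagging parts of speech correctly is of great importance to NLP. The aim of this thesis is to implement Chinese POS tagging on segmented text and so provide a basis for later lexical analysis and other NLP tasks. The thesis first describes the state of research on Chinese POS tagging and its significance, then implements a Chinese POS tagging system based on maximum entropy, building on a thorough understanding of maximum entropy theory, and finally applies statistical rules and a POS restriction method to further tag unlogged (out-of-vocabulary) words. Different templates are used to feed different context information into the maximum entropy model; four maximum entropy tagging models are built, and the template with the best tagging performance is chosen as the final template. To simplify the model, three different feature selection methods are used to prune the candidate features of the maximum entropy model. To further improve tagging accuracy, rules and POS restriction are combined with maximum entropy to further tag unlogged words. The thesis gives the algorithm of the maximum entropy tagging model, the tagging results, and the results after further tagging of unlogged words. POS tagging is a relatively complex problem; because maximum entropy can make full use of a word's context information at different levels, it handles such complex problems well, and using it for POS tagging gave good results. The experiments show that maximum entropy is effective for Chinese POS tagging: accuracy on the open test is 94.96%, and tagging accuracy on unlogged words is 63.32%. The results of this thesis can be applied in practical translation systems and provide a basis for later stages of natural language processing. They can also be applied to other NLP areas such as information retrieval and text classification.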

【Abstract】 Part-of-speech (POS) tagging is the problem of assigning a POS, or lexical category, to every word in a text. It is a basic task in Natural Language Processing (NLP), and its accuracy strongly affects the later steps of syntactic analysis or chunk analysis. Errors made during POS tagging propagate through the processing chain, so correct POS tagging is of great significance in NLP. The main goal of this thesis is to implement Chinese POS tagging on top of word segmentation and to provide a basis for later syntactic parsing and other NLP tasks. The thesis first introduces the current state of research on POS tagging and its significance, then implements a Chinese POS tagging system based on Maximum Entropy (ME), building on a thorough understanding of ME theory, and finally uses statistical rules and POS confinement to tag unlogged (out-of-vocabulary) words. Different context information is introduced into the ME model through different templates; four ME POS tagging models are built, and the template with the highest tagging accuracy is selected as the final template. To simplify the model, three feature selection methods are used to reduce the ME model's candidate features. To further improve tagging accuracy, rules and POS confinement are combined with ME. The thesis presents the algorithm of the ME tagging model and its results, as well as the results of the further tagging of unlogged words. POS tagging is comparatively complex. Because ME can make full use of a word's context at different levels to solve complex problems, it was used for POS tagging and achieved good results. The experimental results show that ME is effective for Chinese POS tagging: accuracy on the open test is 94.96%, and tagging accuracy on unlogged words is 63.32%. The POS tagging approaches introduced in this thesis can be used in practical machine translation (MT) systems and provide a basis for further NLP tasks. Moreover, the research can be applied to other NLP tasks such as information retrieval and text classification.
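The abstract describes feeding context information into an ME model through templates but does not list the templates or the training procedure. The following is only a minimal sketch of the general idea under stated assumptions: the four-feature template (`template_features`), the toy corpus `TOY`, the tag names, and the use of plain stochastic gradient ascent on the log-likelihood are all hypothetical choices made for illustration, not the thesis's own templates, data, or estimation algorithm.

```python
import math
from collections import defaultdict

def template_features(words, i, prev_tag):
    """Hypothetical context template: current word, neighbours, previous tag."""
    prev_w = words[i - 1] if i > 0 else "<BOS>"
    next_w = words[i + 1] if i + 1 < len(words) else "<EOS>"
    return [f"w0={words[i]}", f"w-1={prev_w}", f"w+1={next_w}", f"t-1={prev_tag}"]

def probabilities(weights, feats, tags):
    """ME form: p(t | context) = exp(sum of active (feature, tag) weights) / Z."""
    scores = {t: sum(weights[(f, t)] for f in feats) for t in tags}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def train(sentences, epochs=50, lr=0.5):
    """Fit weights by stochastic gradient ascent on the conditional log-likelihood
    (an assumption; the thesis does not state its estimation algorithm here)."""
    tags = sorted({t for sent in sentences for _, t in sent})
    weights = defaultdict(float)
    for _ in range(epochs):
        for sent in sentences:
            words = [w for w, _ in sent]
            prev = "<BOS>"
            for i, (_, gold) in enumerate(sent):
                feats = template_features(words, i, prev)
                p = probabilities(weights, feats, tags)
                for t in tags:
                    # Gradient: observed feature count minus expected count.
                    grad = (1.0 if t == gold else 0.0) - p[t]
                    for f in feats:
                        weights[(f, t)] += lr * grad
                prev = gold  # the gold previous tag is used during training
    return weights, tags

def tag(words, weights, tags):
    """Greedy left-to-right decoding with the predicted previous tag."""
    out, prev = [], "<BOS>"
    for i in range(len(words)):
        p = probabilities(weights, template_features(words, i, prev), tags)
        prev = max(tags, key=lambda t: p[t])
        out.append(prev)
    return out

# Toy segmented corpus with made-up tags: r = pronoun, v = verb, ns = place name.
TOY = [
    [("我", "r"), ("爱", "v"), ("北京", "ns")],
    [("他", "r"), ("去", "v"), ("上海", "ns")],
    [("我", "r"), ("去", "v"), ("北京", "ns")],
]
weights, tags = train(TOY)
print(tag(["他", "爱", "上海"], weights, tags))  # ['r', 'v', 'ns']
```

Swapping in a different `template_features` function and comparing accuracy on held-out text would mirror, in spirit, the template comparison the abstract mentions; a real system would also need regularization or feature pruning, which this sketch omits.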

【Key words】 Part of Speech (POS); Maximum Entropy (ME); Template; Unlogged Words
  • 【Classification code】TP391.1
  • 【Times cited】9
  • 【Downloads】464