Research on the Methods of Automatic Correction of Chinese Word Segmentation and Part-of-Speech Tagging

【Author】 Qian Yili (钱揖丽)

【Advisor】 Zheng Jiaheng (郑家恒)

【Author information】 Shanxi University, Computer Application Technology, 2003, Master's thesis

【Abstract】 Corpus construction is foundational work for research in Chinese information processing. The basic processing of a Chinese corpus comprises two stages: automatic word segmentation and part-of-speech tagging. Both play a key role in many practical applications (automatic retrieval, filtering, classification, and summarization of Chinese text; automatic proofreading of Chinese text; Chinese-to-foreign-language machine translation; post-processing for Chinese character recognition and Chinese speech recognition; Chinese speech synthesis; sentence-level Chinese keyboard input; conversion between simplified and traditional characters; and others), and they provide important resources and strong support for a wide range of corpus-based research.

How effectively a corpus can be used depends largely on the level and quality of its segmentation and annotation. Current processing of Chinese corpora has achieved some results, but national evaluations show that it still falls well short of practical needs and requires further improvement.

This thesis aims to further improve the accuracy of word segmentation and part-of-speech tagging in Chinese corpora and to raise the overall quality of corpus processing. It studies the two stages separately:

1. It reviews the state of automatic word segmentation and proposes a rule-based method for automatically correcting the segmentation of Chinese text. The method learns from machine-segmented corpora and manually corrected corpora, automatically acquires segmentation correction rules, and applies these rules to correct machine segmentation results.

2. It reviews the state of part-of-speech tagging and proposes a rough-set-based method for automatically acquiring correction rules for the part-of-speech tags of multi-category words (兼类词, words that can take more than one part of speech). Using a large-scale Chinese corpus as the basis and rough set theory as the tool, the method mines correction rules and applies them to correct machine tagging results.

3. It designs and implements an experimental system for the automatic correction of Chinese word segmentation and part-of-speech tagging, and reports closed tests, open tests, and an analysis of the results. In the experiments, the accuracy of segmentation correction was 93.75% in the closed test and 81.05% in the open test; the accuracy of part-of-speech tagging correction was 90.40% in the closed test and 84.85% in the open test.
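The idea of acquiring segmentation correction rules by comparing machine-segmented text with manually corrected text can be illustrated with a minimal sketch. The abstract does not give the thesis's rule format or learning procedure, so everything here is an assumption made for illustration: the function names (`diff_spans`, `learn_rules`, `apply_rules`) are hypothetical, and the rules are simple context-free rewrites of token sequences, which is likely cruder than the rules used in the thesis.

```python
from collections import Counter


def boundaries(tokens):
    """Return the character offset at which each token ends."""
    pos, ends = 0, []
    for tok in tokens:
        pos += len(tok)
        ends.append(pos)
    return ends


def diff_spans(machine, gold):
    """Yield (wrong, right) token tuples for each minimal span in which
    the machine segmentation and the corrected segmentation disagree."""
    if "".join(machine) != "".join(gold):
        raise ValueError("both segmentations must cover the same text")
    m_ends, g_ends = boundaries(machine), boundaries(gold)
    shared = sorted(set(m_ends) & set(g_ends))  # offsets where both cut
    start_i = start_j = 0
    for end in shared:
        i, j = start_i, start_j
        while m_ends[i] != end:
            i += 1
        while g_ends[j] != end:
            j += 1
        wrong = tuple(machine[start_i:i + 1])
        right = tuple(gold[start_j:j + 1])
        if wrong != right:
            yield wrong, right
        start_i, start_j = i + 1, j + 1


def learn_rules(pairs):
    """Assumed rule format: map each wrongly segmented token sequence to
    its most frequently observed correction."""
    counts = Counter()
    for machine, gold in pairs:
        counts.update(diff_spans(machine, gold))
    rules = {}
    for (wrong, right), _ in counts.most_common():
        rules.setdefault(wrong, right)  # keep the most frequent correction
    return rules


def apply_rules(tokens, rules):
    """Rewrite a machine segmentation, preferring the longest matching rule."""
    max_len = max((len(w) for w in rules), default=0)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in rules:
                out.extend(rules[key])
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


# Toy training pair (illustrative data, not taken from the thesis).
training = [(["研究生", "命", "的", "起源"], ["研究", "生命", "的", "起源"])]
rules = learn_rules(training)
print(apply_rules(["他", "研究生", "命", "科学"], rules))
# prints ['他', '研究', '生命', '科学']
```

Aligning the two segmentations on shared character offsets keeps each learned rule as small as possible. A practical system would probably also condition rules on surrounding context and discard rules seen too rarely to be reliable.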

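  • 【Online publication contributor】 Shanxi University
  • 【Online publication year and issue】 2004, Issue 01
  • 【Classification number】 TP391.12
  • 【Times cited】 3
  • 【Downloads】 524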