

【Author】 蔡灿民

【Advisor】 吴晟

【Author Information】 Kunming University of Science and Technology, Computer Software and Theory, 2008, Master's

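【Summary】 Chinese automatic word segmentation is a key technology in Chinese information processing and also its first processing step. It underlies natural language understanding, machine translation, electronic dictionaries, text classification, and other Chinese information processing tasks. As Chinese information technology has developed, automatic word segmentation has become the "bottleneck" of automatic Chinese information processing, and it is therefore one of the important research topics in computer science in China. Current segmentation methods fall into three classes: mechanical segmentation based on string matching, also called the dictionary method; segmentation based on statistical language models; and segmentation built on knowledge bases and semantic rules, collectively called the artificial intelligence method. Each has its own strengths and weaknesses. Mechanical segmentation is the most widely used, but despite the many techniques now applied to it, it still cannot effectively handle out-of-vocabulary word recognition or ambiguity resolution. Segmentation based on statistical language models cannot effectively raise segmentation recall (分全率) to the level that general Chinese information processing applications need. The artificial intelligence method cannot solve the practical problems of applying rule bases and semantics, and remains largely at the research stage. To address these problems, this thesis uses the ability of statistical language model segmentation to recognize the first category of out-of-vocabulary words and to resolve some ambiguities in order to compensate for the weaknesses of string-matching mechanical segmentation in those same areas. It proposes the concept of an intelligent dictionary with a self-learning mechanism, sketches a preliminary basic model of the intelligent dictionary, argues theoretically for the feasibility of a Chinese automatic word segmentation system built on it, and describes in detail the basic principles and implementation of such a system. Finally, the thesis summarizes the work, analyzes the shortcomings of the system, and discusses directions for future development.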

【Abstract】 Chinese automatic word segmentation is a key technology in Chinese information processing and the foundation of natural language processing, machine translation, electronic dictionaries, text classification, and similar tasks. With the continued growth of Chinese information technology, automatic word segmentation has become the bottleneck of automatic Chinese information processing, so it is currently one of the important research tasks. There are three kinds of word segmentation methods at present. The first is mechanical word segmentation, which is based on string matching; the second is based on statistical language models; the third is the artificial intelligence method, which is based on knowledge bases and semantic rules. Each of these methods has its advantages and disadvantages. Mechanical word segmentation cannot handle new words or resolve ambiguity. Segmentation based on statistical language models cannot effectively improve segmentation recall or adapt to general applications. Segmentation based on artificial intelligence cannot solve the problems of applying a rule base, and is still under study. This thesis targets these problems in the segmentation process and uses the advantages of statistical language model segmentation to remedy the disadvantages of mechanical word segmentation. It proposes the concept of an intelligent dictionary, which can extract new words and resolve some ambiguity problems.
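
The string-matching (dictionary) method named in the abstract can be illustrated with forward maximum matching, one common variant of mechanical segmentation. The sketch below is only an illustration and is not taken from the thesis: the function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all assumptions made for the example. It also shows the weakness the abstract points out, namely that a word missing from the dictionary falls apart into single characters.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (illustrative sketch, not the thesis's system).

    At each position, take the longest substring of at most max_len
    characters that is in the dictionary; fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, chosen only for this example.
toy_dictionary = {"中文", "自动", "分词", "技术", "研究"}

# All words are in the dictionary, so segmentation succeeds.
print(forward_max_match("中文自动分词技术", toy_dictionary))

# "昆明" is not in the dictionary, so it is split into single characters.
print(forward_max_match("昆明研究分词", toy_dictionary))
```

In this sketch the second call splits the out-of-vocabulary place name into "昆" and "明", which is the kind of failure that a dictionary able to learn new words would aim to avoid.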

  • 【Classification Code】TP391.1
  • 【Citations】7
  • 【Downloads】341

