A New Method for Automatic Identification of Unknown Words Based on a Forum Corpus

【Author】 都菁

【Advisor】 熊海灵

【Author Information】 Southwest University (西南大学), Computer Software and Theory, 2010, Master's thesis

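【Abstract (translated from the Chinese)】 Unknown word identification has long been a bottleneck in Chinese word segmentation research. To address the low efficiency of unknown word identification in Chinese word segmentation, this thesis proposes a new method that identifies Chinese unknown words from a forum corpus. First, a web spider downloads forum pages. The resulting corpus is then updated periodically to keep it fresh, which yields a highly up-to-date corpus. Next, the corpus is segmented: the Mutual Information function and the Duplicated Combination Frequency function are linearly combined into a new statistic, MD (named after the initials of the two functions), and the MD function is used to segment the corpus and produce a candidate word list. Finally, unknown words are found by comparing the candidate word list with the original word list, and the identified unknown words are added to the original core lexicon so that they can be recognized directly in the next segmentation pass. Chinese word segmentation differs from English word segmentation: the structure and usage conventions of Chinese make it much harder. Three traditional families of Chinese word segmentation algorithms have emerged in this field: mechanical matching algorithms based on string lookup, understanding-based algorithms, and statistics-based algorithms. All three have problems, to varying degrees, with unknown words. Mechanical matching algorithms are fundamentally unable to identify unknown words; understanding-based algorithms are complex and hard to implement, so they are not widely developed or deployed; statistics-based algorithms can handle some unknown words and were for a time relatively popular, but existing statistical algorithms still produce many misjudgments and undecidable cases. Overall, statistics-based algorithms are a relatively feasible approach in practice, so this thesis proposes an improved statistical algorithm for identifying unknown words. The strategy is as follows. First, this thesis introduces an online forum, the Tianya forum (天涯论坛), into unknown word identification research for the first time, and uses a web spider to download its pages. Second, a corpus is built by preprocessing the pages, and it is updated periodically to obtain timely material. Third, the Mutual Information function and the Duplicated Combination Frequency function are linearly combined into the new statistic MD, and the MD function is used to segment the corpus and produce a candidate word list. Fourth, a good threshold is selected through repeated training of the function, and unknown words are found by comparing the candidate word list with the original word list. Finally, a test plan is designed on this basis and a test environment is set up. Results on two metrics, new word recall and segmentation accuracy, indicate that the proposed method for automatic identification of unknown words is feasible.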

【Abstract】 Identification of unknown Chinese words is a bottleneck in the field of Chinese word segmentation. This paper proposes downloading sufficient web documents from a BBS with a web spider to construct a corpus that is updated periodically. A candidate word list is then generated by extracting words from the corpus with a new function. Finally, the candidate word list is compared with the existing lexicon to recognize unknown words. Experiments showed that the proposed method was more efficient. Chinese words have characteristics that differ from those of English words: because of the composition and usage habits of the Chinese language, segmenting Chinese text is harder than segmenting English. At present there are three main kinds of Chinese word segmentation algorithm: those based on string matching, those based on understanding, and those based on statistics. All three have problems with unknown words to varying degrees. String-matching algorithms fundamentally cannot recognize unknown words. Understanding-based algorithms are difficult to implement and have high time and space complexity, so they are not widely used. Statistics-based algorithms are currently the more feasible and popular approach, but they still make identification errors. Overall, statistics-based algorithms are relatively feasible in practical applications, so this paper studies unknown Chinese word identification on the basis of statistical algorithms. First, Chinese word segmentation is described, with emphasis on unknown word recognition. Second, the traditional word segmentation algorithms and segmentation systems are analyzed and compared. Mechanical matching algorithms are fundamentally unable to extract unknown words; understanding-based algorithms are complex and difficult, so they are not widely developed or applied; statistical algorithms can handle some unknown words and became popular, but existing statistical algorithms still produce many misjudgments and cases that cannot be decided. To address these shortcomings of the traditional approaches, this paper downloads sufficient web documents from a BBS with a web spider to construct a corpus that is updated periodically, which ensures the timeliness of the corpus. A candidate word list is then generated by extracting words from the corpus with the new function MD (a new statistic constructed by combining the Mutual Information function and the Duplicated Combination Frequency function). The candidate word list is compared with the existing lexicon to recognize unknown words. A test scheme was then designed along these lines and a test environment was set up. Results on two indicators, new word recall and segmentation accuracy, show that the proposed method for automatic identification of unknown words is feasible.
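The candidate-generation and comparison steps can be illustrated with a minimal Python sketch. The abstract does not give the definitions of the two functions, the weights of the linear combination, or the threshold, so everything below is an assumption made for illustration: MI is taken as the pointwise mutual information of an adjacent character pair, Duplicated Combination Frequency is approximated by the pair's count normalized by the largest pair count, and `alpha`, `threshold`, `md_scores`, and `find_unknown_words` are hypothetical names. Only two-character candidates are considered.

```python
import math
from collections import Counter


def md_scores(sentences, alpha=0.5):
    """Score adjacent character pairs with an assumed MD statistic.

    MD = alpha * MI + (1 - alpha) * DCF, where both components and the
    weight alpha are illustrative assumptions, not the thesis's definitions.
    """
    unigrams = Counter()
    bigrams = Counter()
    for s in sentences:
        unigrams.update(s)
        bigrams.update(s[i:i + 2] for i in range(len(s) - 1))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if n_bi == 0:
        return {}
    max_bi = max(bigrams.values())

    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        mi = math.log2(p_xy / (p_x * p_y))  # pointwise mutual information
        dcf = count / max_bi                # assumed stand-in for DCF
        scores[pair] = alpha * mi + (1 - alpha) * dcf
    return scores


def find_unknown_words(sentences, lexicon, threshold, alpha=0.5):
    """Return candidates scoring at least `threshold` that are not in `lexicon`."""
    scores = md_scores(sentences, alpha)
    candidates = {w for w, s in scores.items() if s >= threshold}
    return candidates - set(lexicon)
```

For instance, calling `find_unknown_words(["给力啊", "真给力", "给力"], {"论坛"}, threshold=1.5)` keeps only the repeated pair `给力`, since the pairs that occur once score lower under this assumed statistic. In the workflow described in the abstract, the returned words would then be added to the lexicon before the next pass.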

  • 【Online Publication Contributor】 Southwest University (西南大学)
  • 【Online Publication Issue】 2010, Issue 08