

Research on the Application of WAF in Text Processing

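【Author】 Zhang Li (张黎)

【Advisor】 Xu Weiran (徐蔚然)

【Author information】 Beijing University of Posts and Telecommunications (北京邮电大学), Signal and Information Processing, 2013, Master's thesis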

【Abstract】 Chinese word segmentation and new word discovery are among the most basic and important problems in Chinese text processing and natural language processing, and their quality directly affects downstream research in these fields. Existing methods depend on dictionaries, depend on annotated corpora, and are inefficient at discovering low-frequency words. This thesis improves the WAF (Word Activation Forces) model by combining it with a bigram language model and, on that basis, proposes a statistics-based unsupervised machine learning approach that relies on neither a dictionary nor an annotated corpus: it builds words from characters and performs word segmentation and new word discovery at the same time. For both tasks, the thesis tests the maximum matching method, the in-link/out-link comparison method, and the ranking method in combination with the improved WAF model, and finally proposes an iterative dynamic programming method. This method extracts candidate strings from the relationships between characters, which addresses the low efficiency of low-frequency word discovery; it performs disambiguation with dynamic programming, which removes the dependence on annotated corpora; and it filters the word list using the segmentation results, which addresses the filtering of garbage strings. Experiments on 100,000 collected microblog posts show that the proposed WAF-based method effectively addresses these problems and that the WAF model performs well in text processing.
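The abstract does not give the improved WAF formula or the objective used in the dynamic programming step, so the following is only a minimal sketch under stated assumptions. It uses the WAF definition as commonly stated, waf_ij = (f_ij / f_i)(f_ij / f_j) / d_ij^2, restricted to adjacent characters so that d_ij = 1, and it scores a candidate word by how far its internal character pairs exceed a threshold. The names `char_waf` and `segment`, the parameters `threshold` and `max_len`, and the scoring rule itself are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def char_waf(corpus):
    """Adjacent-character WAF: (f_ij / f_i) * (f_ij / f_j), with distance d = 1.

    Assumption: only directly adjacent characters are counted, so the
    1 / d^2 factor of the general WAF definition equals 1.
    """
    char_freq = Counter()
    pair_freq = Counter()
    for line in corpus:
        char_freq.update(line)
        pair_freq.update(zip(line, line[1:]))
    return {
        (a, b): (f / char_freq[a]) * (f / char_freq[b])
        for (a, b), f in pair_freq.items()
    }


def word_score(word, waf, threshold):
    """Hypothetical cohesion score: sum of (WAF - threshold) over internal pairs.

    A single character has no internal pair and scores 0.
    """
    return sum(waf.get(pair, 0.0) - threshold for pair in zip(word, word[1:]))


def segment(text, waf, threshold=0.5, max_len=4):
    """Dynamic programming over split points, maximizing the total word score."""
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + word_score(text[j:i], waf, threshold)
            if score > best[i]:
                best[i] = score
                back[i] = j
    # Walk the back-pointers to recover the words in order.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]
```

With this scoring rule, characters are joined into a word only when their pairwise WAF exceeds `threshold`, and `max_len` caps the word length. Multi-character words produced this way could then be collected as candidate new words. In a tiny corpus, rare pairs receive inflated WAF values, so any threshold would need tuning on realistic data.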
