
Text Clustering with Noise and Application in Anti-spam

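【Author】 Zhou Xin (周鑫)

【Advisor】 Hao Zhifeng (郝志峰)

【Author information】 Guangdong University of Technology, Computer Application Technology, 2012, Master's thesis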

【Abstract】 With the rapid development of Internet technology, the volume of text data is growing exponentially, and text mining has emerged to uncover the intrinsic relationships and implicit information in such data. Cluster analysis, an important function of data mining, plays a key role in text mining. This thesis discusses clustering methods for text that contains interference information (noise).

Traditional text mining methods first represent text in the vector space model, convert each document into a vector using TF-IDF weights, and then compute text similarity in that vector space. Because the traditional vector space model ignores conceptual similarity between words, clustering accuracy suffers. A similarity measure for Chinese based on the HowNet model and a semantic inner product was therefore proposed. However, that method is not suited to clustering spam. After composing a message, spammers use something like find-and-replace to swap standard sensitive keywords for variants that are obfuscated by inserting symbols, reordering characters, or even substituting pinyin, yet remain understandable to readers, in order to evade mail filters. Traditional methods apply a series of preprocessing steps that filter out this interference information, which lowers the accuracy of the similarity computed between spam messages and ultimately degrades clustering quality.

To address the poor similarity measurement caused by the heavy noise in spam, this thesis takes a non-metric approach and applies the Needleman-Wunsch algorithm to text similarity computation. Based on this similarity measure, a Needleman-Wunsch-based clustering algorithm is proposed to perform the text clustering. Compared with approaches based on the vector space model, computing text similarity with the Needleman-Wunsch algorithm avoids word segmentation, reduces semantic loss, and retains all of the text information, which safeguards clustering quality. In addition, preprocessing divides the document content into Chinese characters, English strings, and symbol strings, which alleviates data sparseness and reduces the number of character comparisons, thereby speeding up processing. Simulation experiments comparing the method with traditional clustering algorithms show large improvements in both clustering quality and efficiency. This indicates that the proposed algorithm is suitable for spam clustering and thus provides an effective spam filtering technique: spam and legitimate mail are clustered together with the proposed method, grouped into different categories according to their document similarity values, and thereby distinguished from each other.
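The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the scoring scheme, the normalization, or the clustering procedure, so the token pattern `TOKEN_RE`, the scores (`match=1`, `mismatch=-1`, `gap=-1`), the length-based normalization in `similarity`, and the greedy threshold clustering in `cluster` are all assumptions chosen for illustration. Only the split into Chinese characters, English strings, and symbol strings, and the use of Needleman-Wunsch global alignment, come from the abstract.

```python
import re

# Assumed token pattern: a single Chinese character, a run of ASCII
# letters/digits, or a run of any other non-space characters (symbols).
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+|[^\u4e00-\u9fffA-Za-z0-9\s]+")


def tokenize(text):
    """Split text into Chinese characters, English strings, and symbol strings."""
    return TOKEN_RE.findall(text)


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two token sequences.

    Only the previous DP row is kept, so memory is O(len(b)).
    The scoring values are illustrative assumptions.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity(x, y):
    """Alignment score scaled to [0, 1] by the longer token sequence (assumed)."""
    a, b = tokenize(x), tokenize(y)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, nw_score(a, b) / longest)


def cluster(docs, threshold=0.5):
    """Greedy clustering (assumed): join the first cluster whose first
    member is similar enough, otherwise start a new cluster."""
    clusters = []
    for i, doc in enumerate(docs):
        for members in clusters:
            if similarity(doc, docs[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = ["代开发票 请联系", "代开发*票 请联系", "meeting at 10 tomorrow"]
groups = cluster(docs)
```

In this toy input the second message inserts a symbol into a keyword. Because the symbol is kept as a token rather than filtered out, the alignment pays for a single gap and the two messages still land in the same group, while the unrelated English message forms its own group.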
