Research on an Email Community Partitioning Method Based on EVS Similarity

【Author】 Wang Fang (王芳)

【Advisor】 Fan Ming (范明)

【Author Information】 Zhengzhou University, Computer Software and Theory, 2010, Master's thesis

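【Abstract (translated from the Chinese)】 In recent years, discovering community structure in complex networks and mining knowledge about social relations have become a research hot spot in data mining. The mail communication network of an email system is a relatively simple social network, and its community partitioning problem is essentially a sparse-graph clustering problem. The core of any clustering method is its proximity measure, so finding new and more effective proximity measures that improve the quality of email community partitioning is of great significance for later spam identification and filtering and for research on large complex networks. Against the background of web communities, this thesis focuses on communities in email communication networks. The main work is as follows: (1) A new proximity measure, EVS, is proposed to guide email community clustering. After studying various proximity measures and related complex-network community mining methods from China and abroad, the thesis converts email community partitioning into graph clustering. It first introduces the vector representation of email features and constructs the email feature matrix. On this basis, it fits the communication feature information between emails with a transformed extreme value distribution function model, and then constructs EVS on the transformed information matrix. (2) Combining micro-clustering and macro-clustering techniques, an EVS-based email community clustering algorithm is proposed and used to verify the effectiveness of EVS. The thesis introduces classical similarity measures such as cosine similarity and the Pearson correlation coefficient into email community partitioning for comparative analysis, and evaluates partition quality according to the characteristics of specific email communities. (3) Experimental results show that, on real test data sets, the EVS-based email community clustering algorithm is more effective than the cosine-based and Pearson-based email clustering methods and is better at finding high-quality communities. The research has strong practical value and is significant for the further development of spam identification and filtering, for community discovery in large complex social networks, and for some commercial applications.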

【Abstract】 Community detection in complex networks and the mining of knowledge about social relations have become a hot spot in data mining. An email communication network is a relatively simple social network, and its community partitioning problem is essentially a sparse-graph clustering problem. The key to clustering is an effective proximity measure between objects, so finding new and more effective similarity measures helps improve the quality of community partitioning. This work is relevant to spam recognition and filtering and to research on large complex networks. Against the background of web communities, the thesis explores the communities of mail communication networks in depth. The main contributions are as follows: (1) A new proximity measure, EVS (Extreme Value distribution Similarity), is proposed for email community clustering. After analyzing various similarity measures and the research on web communities at home and abroad, we transform the mail community partitioning problem into a graph clustering problem. The thesis first introduces the email feature vector and constructs the email feature matrix, then models the email feature information with a transformed extreme value distribution. EVS is constructed on this basis. (2) To validate EVS, the thesis proposes a new mail community clustering algorithm based on EVS in combination with a micro-macro clustering technique. In addition, we introduce cosine-based similarity and the Pearson correlation coefficient into the email community partitioning problem. Based on an experimental comparison of these measures, we evaluate the quality of the email communities according to the specific characteristics of email communities. (3) The experiments show that, compared with cosine-based similarity and the Pearson correlation coefficient, the algorithm based on EVS is more effective at detecting email communities on real test data sets. This research has high practical value. It is important and useful for the development of spam detection techniques, for community detection in large complex social networks, and for some commercial applications.
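
The two baseline measures named in the abstract, cosine-based similarity and the Pearson correlation coefficient, are standard and can be sketched briefly. The abstract does not give the definition of EVS or the layout of the email feature matrix, so the sketch below covers only the baselines; the `features` dictionary, its account names, and its columns (message counts exchanged with four contacts) are hypothetical illustrations, not the thesis's data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_similarity(centered_u, centered_v)


# Hypothetical feature matrix (an assumption for illustration only):
# one row per mail account, one column per contact, values are message counts.
features = {
    "alice": [5, 3, 0, 0],
    "bob": [10, 6, 0, 0],
    "carol": [0, 0, 4, 7],
}

if __name__ == "__main__":
    names = sorted(features)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            cos = cosine_similarity(features[a], features[b])
            pcc = pearson_correlation(features[a], features[b])
            print(f"{a}-{b}: cosine={cos:.3f} pearson={pcc:.3f}")
```

In this toy data, the two accounts that write to the same contacts in the same proportions score as highly similar under both measures, while accounts with disjoint contacts have zero cosine similarity and a negative Pearson correlation. A clustering step would then group accounts using whichever proximity measure is chosen.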

  • 【Online publication contributor】 Zhengzhou University
  • 【Online publication year and issue】 2011, Issue 06