基于数据挖掘技术的电子邮件地址聚类系统设计与实现

Design and Implementation on Email Address Clustering System Based on Data Mining Technology

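【Author】 Zhang Dan (张丹)

【Advisor】 Huang Yongzhong (黄永忠)

【Author information】 PLA Information Engineering University (解放军信息工程大学), Computer Software and Theory, 2007, Master's thesis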

【Abstract】 Most current methods for processing email information analyze and filter the content of individual messages, but content alone cannot support highly accurate classification. How to apply mature data mining techniques to extract useful knowledge from massive volumes of email has therefore become a pressing problem. Cluster analysis is an important research direction in data mining: it partitions sample data into classes or clusters so that samples within a cluster are highly similar while samples in different clusters differ substantially. This thesis describes an email address clustering system based on data mining technology. From the sending and receiving relationships between email addresses, the system constructs similarity-measure attributes for the addresses and applies DBSCAN, a density-based clustering algorithm, to group addresses by how closely they are connected and to identify the more active addresses. This narrows the set of addresses that must be examined and makes email analysis more targeted and effective. During information extraction, the system decodes massive volumes of email and stores the extracted attributes by category. Without altering the original characteristics of the data, it preprocesses the email information by deduplication, filling in missing values, pruning, and traversal search, which reduces the data volume as far as possible and addresses the speed problem of processing massive data. The system also provides two kinds of statistics, the number of messages sent and received by a given address and the contact status of a given address, which offer an intuitive way to analyze patterns in the data and obtain an overview of it, and which serve as a reference for validating the clustering results. Finally, the thesis validates and analyzes the system. The results show that, while maintaining a fairly fast running speed, the system meets its design goals of grouping email addresses by closeness of contact and visualizing the address statistics, which confirms its effectiveness.
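The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity attributes or the parameter values, so everything below is an assumption made for illustration: the contact-set representation, the Jaccard distance, the values of `eps` and `min_pts`, and all function names are hypothetical and are not taken from the thesis. The DBSCAN routine is a plain textbook version rather than the system's implementation.

```python
from collections import defaultdict

NOISE = -1  # label for addresses that belong to no dense group


def contact_sets(messages):
    """Map each address to the set of addresses it exchanged mail with (plus itself).

    `messages` is an iterable of (sender, recipient) pairs; duplicate
    records are dropped first, mirroring the deduplication step.
    """
    contacts = defaultdict(set)
    for sender, recipient in set(messages):
        contacts[sender].update((sender, recipient))
        contacts[recipient].update((sender, recipient))
    return dict(contacts)


def jaccard_distance(a, b):
    """Assumed dissimilarity: 1 - |A ∩ B| / |A ∪ B| over contact sets."""
    return 1.0 - len(a & b) / len(a | b)


def dbscan(points, distance, eps, min_pts):
    """Textbook DBSCAN; returns {point: cluster id}, NOISE for outliers."""
    def neighbours(p):
        return [q for q in points if distance(p, q) <= eps]

    labels = {}
    cluster = 0
    for p in points:
        if p in labels:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = NOISE  # may later be relabelled as a border point
            continue
        labels[p] = cluster
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster  # border point reached from a core point
            if q in labels:
                continue
            labels[q] = cluster
            q_neighbours = neighbours(q)
            if len(q_neighbours) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_neighbours)
        cluster += 1
    return labels


def cluster_addresses(messages, eps=0.3, min_pts=3):
    """Cluster addresses by the overlap of their contact sets."""
    contacts = contact_sets(messages)
    points = sorted(contacts)
    return dbscan(
        points,
        lambda p, q: jaccard_distance(contacts[p], contacts[q]),
        eps,
        min_pts,
    )


# Made-up sample data: two tightly connected groups and one peripheral address.
messages = [
    ("a", "b"), ("b", "a"), ("a", "c"), ("b", "c"), ("c", "a"),
    ("x", "y"), ("y", "z"), ("x", "z"),
    ("a", "b"),  # duplicate record
    ("q", "a"),
]
labels = cluster_addresses(messages)
```

With this sample data, `a`, `b`, and `c` fall into one cluster, `x`, `y`, and `z` into another, and `q` is labelled as noise. A density-based method fits this kind of task because it needs no preset number of groups and leaves sparsely connected addresses unassigned.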

【Key words】 Data Mining; Email; Clustering; Density
  • 【Classification code】 TP311.13; TP311.52
  • 【Downloads】 167