

Study on the Authorship Mining for Chinese E-mail Documents Based on SVM

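【Author】 Ma Jianbin (马建斌)

【Supervisor】 Teng Guifa (滕桂法);

【Author Information】 Hebei Agricultural University (河北农业大学), Agricultural Mechanization Engineering, 2004, Master's thesis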

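【Abstract (translated from the Chinese)】 As computer technology and informatization advance, and especially as the Internet becomes ever more widespread, e-mail has become an indispensable, economical and practical means of exchanging information. Unfortunately, e-mail abuse occurs from time to time on the Internet, for example spam, fraudulent mail, threatening mail and reactionary mail. In such mail the sender always tries to hide his or her true identity to evade investigation: through an anonymous mail server the sender can alter or forge the sending address, change the real name, and so on. It is therefore very difficult to determine the real identity of the author from the message itself. A method for identifying the true author of an original e-mail, which supplies evidence for computer forensics and allows the authors of illegal e-mail to be held criminally liable, would undoubtedly be an effective way to curb illegal e-mail. Building on an analysis of various data mining techniques, this thesis proposes a method for automatically identifying or classifying the authorship of anonymous e-mail. It uses the support vector machine as the classification algorithm and extracts several kinds of e-mail features, including linguistic features, header information and structural features, to classify each message automatically into a predefined author category. The thesis makes considerable progress on the classification algorithm and the feature extraction strategy, and experiments on a limited data set gave satisfactory results, which makes authorship identification feasible. However, the classification precision does not yet reach the level needed for computer forensics, and further research is required.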

【Abstract】 With the rapid growth of computer technology and informatization, and especially the spread of the Internet, e-mail has become a convenient and economical form of communication. Unfortunately, e-mail misuse is common on the Internet: junk mail, fraudulent mail, threatening mail, antisocial mail, and so on. In such mail the sender usually attempts to hide his or her true identity in order to avoid detection. The sender's address can be forged or routed through an anonymous mail server, and the sender's name may be altered, so it is difficult to determine the real identity of the author from the message itself. A method that identifies the original author of an illegitimate e-mail and provides evidence for computer forensics would therefore be an effective means of controlling illegitimate e-mail. In this paper, building on an analysis of various data mining techniques, we propose a method that automatically identifies or classifies the authorship of anonymous e-mail. We adopt the support vector machine as the classification algorithm, extract various features of e-mail documents, including linguistic features, header information and structural characteristics, and attribute each message to one of a predefined list of authors. Great progress has been made on the classification algorithm and the feature extraction strategy. Experiments on a limited number of e-mail documents gave satisfactory results, which makes authorship identification feasible. However, the classification precision still falls short of the standard required for computer forensics, and further research is needed.
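
The three feature groups named in the abstract (linguistic features, header information, structural characteristics) can be illustrated with a minimal Python sketch. The abstract does not list the concrete features used in the thesis, so every feature below (`type_token_ratio`, `punctuation_rate`, `has_reply_to`, `subject_length`, `line_count`, `blank_line_ratio`) and the function name `extract_features` are assumptions chosen only for illustration. The sketch covers feature extraction only; the resulting numeric values would then be passed to a support vector machine classifier, which is not shown.

```python
import email
import re


def extract_features(raw_message: str) -> dict:
    """Map one raw e-mail to a few illustrative numeric features.

    The feature set is hypothetical; it is not the set used in the thesis.
    """
    msg = email.message_from_string(raw_message)
    payload = msg.get_payload()
    # Multipart messages are ignored in this sketch.
    body = payload if isinstance(payload, str) else ""

    lines = body.splitlines()
    # Crude tokenisation; real Chinese text would need a word segmenter.
    tokens = re.findall(r"\w+", body.lower())
    # Western and Chinese punctuation marks.
    punctuation = re.findall(r"[,.!?;:，。！？；：]", body)

    return {
        # Linguistic features (assumed examples)
        "type_token_ratio": len(set(tokens)) / max(len(tokens), 1),
        "punctuation_rate": len(punctuation) / max(len(body), 1),
        # Header information (assumed examples)
        "has_reply_to": 1.0 if msg["Reply-To"] else 0.0,
        "subject_length": float(len(msg["Subject"] or "")),
        # Structural characteristics (assumed examples)
        "line_count": float(len(lines)),
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip())
        / max(len(lines), 1),
    }


sample = "From: a@example.com\nSubject: Hello\n\nHi there.\n\nSee you soon, bye.\n"
features = extract_features(sample)
```

Each message becomes one fixed-length vector once the dictionary values are read in a fixed key order, which is the input form a support vector machine expects.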

  • 【Classification Number】TP393.09
  • 【Citation Count】7
  • 【Download Count】213

