
Research on Key Technologies of Image Spam Filtering (图像型垃圾邮件过滤关键技术研究)

【Author】 Li Peng (李鹏)

【Advisor】 Cui Gang (崔刚)

【Author information】 Harbin Institute of Technology, Computer System Architecture, 2013, Ph.D.

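【Abstract, translated from the Chinese】 While email makes communication more convenient, it has also become an easy channel for people with ulterior motives to send advertisements, spread obscene and pornographic content, commit fraud, and propagate reactionary ideas and rhetoric. Filtering of text-based spam now achieves good results. Since 2006, however, spammers have tried to evade traditional filtering systems by moving the text of a message into an image, often adding deformed text and various kinds of noise to resist filtering further; these measures greatly degrade filter performance. Compared with traditional spam, image spam is better concealed, consumes more network bandwidth, computing, and storage resources, and poses a greater security risk to society, so filtering it effectively has become urgent. To curb the further spread of image spam, this thesis studies several key problems, guided by the different characteristics of spam images and by practical application requirements.

An analysis of how spam is generated and sent shows that spam images are sent in batches: images from the same source are mostly generated from the same template and usually share similar structure or regions. Based on this characteristic, the thesis analyzes the main problems in near-duplicate image detection and proposes a method that improves the accuracy of near-duplicate image matching by combining the neighborhood geometric context of local feature points with a global geometric consistency check among matched points. First, weakly stable feature points corresponding to each SIFT local feature point are extracted and used to generate geometric context information, which avoids the loss of distinctiveness caused by quantizing feature points into visual words. Then, the method checks whether the matched point pairs of two images contain a subset that satisfies a consistent global geometric relationship, to further verify that a candidate image match is correct. Experimental results show that the method effectively improves recognition accuracy for partially near-duplicate images, which is useful for filtering spam images when samples are available.

Another important characteristic of spam images is that they often contain a large amount of text, so the content-based detection used by traditional spam filters can be borrowed to determine whether an email image contains specific sensitive keywords. The thesis proposes a method for image keyword recognition based on visual phrases of character primitives. First, maximally stable extremal regions (MSERs) are extracted from the image and used to construct character primitives. Then, visual phrases of character primitives are constructed from the adjacency of the ellipses fitted to the MSER regions; primitives belonging to the same image keyword usually fall in the same visual phrase. Finally, visual phrase similarity is judged by combining element similarity with geometric adjacency relationships. The method requires no preprocessing such as binarization, layout analysis, or text region localization, and is flexible and robust.

In addition, drawing on the geometric blur descriptor, the thesis proposes a method for recognizing Chinese image keywords in scenes with complex interference. Applying Gaussian blur with a variable kernel effectively reduces the influence of noise. First, feature points are matched using geometric blur, and potential mismatches are filtered out by analyzing the layout of the matched feature points. Then, because Chinese keywords often contain similarly shaped characters, which usually share the same radical, the thesis analyzes the size of the region covered by unmatched points in the sample image to further improve matching accuracy. Experimental results show that the method performs well at keyword spotting in complex scenes, effectively distinguishes similarly shaped characters, and resists the interference types commonly used in spam images.

Spam images are highly varied, and different types usually differ considerably in their features. In practice, moreover, missed spam is tolerable to some degree, whereas misclassifying a legitimate email usually causes the user a greater loss. The thesis therefore proposes describing images with both local and global features and filtering different types of spam images layer by layer with a cascade of classifiers. To limit the impact of misclassification, information entropy is used to evaluate classification results; images whose classification is uncertain are either judged multiple times or treated directly as normal images. The goal is to reduce the miss rate for spam images as far as possible while reducing false positives on legitimate email images.

To defeat filters, spammers often add large amounts of interfering noise to spam images, so the noise itself can serve as important evidence for identifying spam images. Based on this characteristic, the thesis proposes a method for analyzing noise in the background regions of email images. First, a wavelet transform is applied to obtain a noise feature image for the non-text regions of the email image; then, the noise is measured and classified by connected component analysis of the feature image. The method can serve as a feature extraction module for email images, whose output indicates the “amount of noise” and the “type of noise” in an image. Although the noise content of an image cannot be used directly to decide whether it is a spam image, it provides important evidence for subsequent decisions.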
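The entropy-based evaluation of cascade results described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual classifiers, entropy definition, or thresholds, so everything here is an assumption: each cascade stage is represented only by a hypothetical spam probability, uncertainty is measured with binary Shannon entropy, and `max_entropy` is an illustrative threshold.

```python
import math


def binary_entropy(p_spam: float) -> float:
    """Shannon entropy, in bits, of the two-class distribution [p_spam, 1 - p_spam]."""
    if p_spam <= 0.0 or p_spam >= 1.0:
        return 0.0  # a certain outcome carries no uncertainty
    q = 1.0 - p_spam
    return -(p_spam * math.log2(p_spam) + q * math.log2(q))


def cascade_filter(stage_scores, max_entropy: float = 0.5) -> str:
    """Label an image from per-stage spam probabilities (hypothetical inputs).

    A stage may reject the image as spam only when its output is both
    spam-leaning and low-entropy. Uncertain outputs are passed on to the
    next stage; if no stage is confident, the image is treated as normal
    ("ham"), which favors a low false positive rate over a low miss rate.
    """
    for p_spam in stage_scores:
        confident = binary_entropy(p_spam) <= max_entropy
        if confident and p_spam >= 0.5:
            return "spam"
    return "ham"


if __name__ == "__main__":
    # The first stage is unsure (p = 0.70), the second is confident (p = 0.95).
    print(cascade_filter([0.70, 0.95]))
    # No stage is confident, so the image falls back to "ham".
    print(cascade_filter([0.70, 0.60]))
```

With the default threshold, a stage output of 0.70 has an entropy of about 0.88 bits and is treated as uncertain, while 0.95 has about 0.29 bits and is accepted. Falling back to "ham" when every stage is uncertain mirrors the stated preference for tolerating some missed spam over misjudging legitimate mail.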

【Abstract】 Email has become an indispensable communication tool in daily life. In recent years, however, it has also become a convenient way for people with ulterior motives to send advertising, pornographic material, fraudulent messages, and reactionary ideology and rhetoric. Text-based filters have grown in sophistication and effectiveness at filtering spam. In response, since 2006 spammers have adopted a number of countermeasures to circumvent these text-based filters. One of the most popular spam construction techniques today is embedding the text of a message in an image, often together with deformed characters and different kinds of noise to further defeat the filters, which poses a new challenge for spam researchers. Image spam is better concealed, consumes more network bandwidth, computing, and storage resources, and brings a greater security risk to the community, so filtering it effectively has become urgent. To prevent the further proliferation of image spam, we study several key issues according to the different characteristics of spam images and the requirements of practical applications.

An analysis of how image spam is generated and sent shows that spam images are usually sent in batches. Spam images from the same source are often generated from the same template and therefore commonly share similar structure and regions. Based on this feature, this paper analyzes the main problems in near-duplicate image detection (NDII) and proposes a novel scheme that combines the neighborhood information of a single local feature with the global geometric consistency of multiple local features to improve the accuracy of near-duplicate image detection. First, we construct geometric contextual information for image local features to enhance the distinctiveness of visual words. Then, we verify the global geometric consistency of a subset of features to further improve the accuracy of retrieval results. Experimental results show that the proposed method markedly improves the accuracy of NDII, which is useful for image spam filtering when sample images are available.

One of the most important features of spam images is that they often contain large amounts of text. Therefore, in the same way as for text-based spam, we can judge whether an email image contains certain sensitive keywords. This paper proposes a new approach to image keyword spotting using visual phrases of character primitives. First, maximally stable extremal regions are extracted from a given image and normalized to serve as character primitives. The primitives of the same keyword often lie within the same phrase. Then, we measure similarity using element similarity and geometric structure consistency. This method does not require image binarization, layout analysis, or text area localization, and it is more flexible and robust.

In addition, this paper proposes a method based on geometric blur descriptors for image keyword spotting in cluttered scenes. Blurring the image with variable Gaussian kernels reduces the impact of noise interference. First, we obtain the initial correspondences of local feature points using geometric blur and filter out mismatches by layout analysis. Because Chinese characters often share the same radicals, we use the ratio of the area of the unmatched feature points in the sample image to the area of the whole image to further improve matching accuracy. The experimental results show that our method can recognize and spot keyword images with high accuracy, and that it resists the noise used in spam images well.

Spam images are varied, and different kinds of spam images often have different types of features. Furthermore, false positives bring greater losses for email users, while false negatives are tolerable to some extent in practice. Therefore, this paper proposes using both local and global features to describe spam images and a cascade of classifiers for hierarchical filtering of different types of spam images. To avoid false positives, we use classification entropy to decide whether an image should be judged multiple times or treated as a normal image. The experimental results show that we can not only reduce the false positive rate of filters as much as possible but also improve accuracy.

Spam images commonly contain many background noise components intended to defeat spam filters. Therefore, the presence of background noise can be considered an indication that an email is spam. Based on this feature, this paper proposes obtaining a noise feature image using a wavelet transform and then measuring and classifying the noise by connected component analysis of the noise feature image. This technique is intended to be used as a specific module of a spam filter, whose output indicates the “amount” and “type” of noise in email images. Since noise can also be present in legitimate images, the results of noise analysis cannot establish with certainty that an email is spam, but they can be taken as an indication of tricks introduced to defeat OCR tools.
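The noise analysis step (a wavelet transform followed by connected component analysis) can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's implementation: the wavelet is a one-level Haar transform of which only the diagonal detail band is kept, the input is assumed to be a grayscale background region with text already masked out, and the names `noise_summary`, `threshold`, `small`, and the labels "speckle" and "structured" are hypothetical.

```python
from collections import deque


def haar_diagonal_detail(img):
    """One-level 2-D Haar diagonal (HH) coefficients, one per 2x2 pixel block."""
    rows = len(img) // 2
    cols = len(img[0]) // 2 if img else 0
    return [[(img[2 * r][2 * c] - img[2 * r][2 * c + 1]
              - img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 2.0
             for c in range(cols)]
            for r in range(rows)]


def component_sizes(mask):
    """Sizes of the 4-connected components of True cells, found by BFS."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            size = 0
            while queue:
                r, c = queue.popleft()
                size += 1
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and mask[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            sizes.append(size)
    return sizes


def noise_summary(img, threshold=10.0, small=2):
    """Return (amount, kind): the fraction of noisy cells and a coarse label.

    The threshold and the two labels are illustrative assumptions only.
    """
    detail = haar_diagonal_detail(img)
    mask = [[abs(v) > threshold for v in row] for row in detail]
    cells = sum(len(row) for row in mask)
    sizes = component_sizes(mask)
    if cells == 0 or not sizes:
        return 0.0, "none"
    kind = "speckle" if max(sizes) <= small else "structured"
    return sum(sizes) / cells, kind
```

A flat background yields no noisy cells, an isolated bright pixel yields one small component, and a dense alternating pattern yields one large component; the two returned values correspond loosely to the “amount” and “type” outputs mentioned in the abstract.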

  • 【Classification number】TP393.098; TP391.41
  • 【Times cited】1
  • 【Downloads】295