垃圾图像特征提取与选择研究

Research on Feature Extraction and Selection for Spam Image Recognition

【Author】 程红蓉 (Cheng Hongrong)

【Advisor】 秦志光 (Qin Zhiguang)

【Author Information】 University of Electronic Science and Technology of China, Information and Communication Engineering, 2011, PhD

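【Summary】 Spam image recognition is currently one of the active topics in Internet spam filtering research. Its goal is to address the sharp drop in performance, or outright failure, of traditional spam filtering methods when they are applied to spam delivered as images. The key to spam image recognition lies in the feature extraction and feature selection methods used in feature modeling. Because email is currently one of the main channels for spreading spam images, this dissertation takes the spam images contained in email as its research object and studies interference-robust extraction of image region and image edge features, supervised feature selection based on information measure criteria, and semi-supervised feature selection for coping with the labeling bottleneck. The main innovative results of this dissertation fall into the following four areas:
1. An interference-robust method for automatic text region extraction is proposed, which relaxes the requirement of existing methods for high image quality. The method includes an eight-neighborhood algorithm for removing small regions and a screening mechanism for candidate text regions, which together reduce the interference that complex backgrounds and irregular image text cause in text region segmentation. On this basis, the method includes a Hough transform based algorithm for finding the minimum enclosing rectangle of a labeled region, overcoming the inability of existing methods to extract slanted text regions effectively. Experimental results show that the method improves the precision of text region extraction and thus yields more effective text region features.
2. A method for extracting edge features from email images is proposed. The method introduces the higher-order local autocorrelation (HLAC) function to extract the edge features of email images. The resulting HLAC features reflect the inherent edge correlation of the image content, are insensitive to shifts and scale changes, and show strong robustness to interference. This overcomes the restrictions that existing algorithms place on the edge distribution of an image or on the amount of text it contains. Experimental results on real data sets confirm that HLAC features are effective discriminative features.
3. A feature selection algorithm based on an information measure criterion is proposed. Existing algorithms evaluate redundant features in isolation from the classification setting. To address this, the algorithm defines the notion of a classification redundant feature and designs a quantitative index, the classification information gain degree, so that classification redundant features are removed before candidate features are evaluated, reducing their interference with the evaluation. Most information measure criteria also cannot correctly handle cooperation among features. To address this, the algorithm uses conditional mutual information to design an information measure criterion for evaluating features. Experimental results show that the algorithm reduces the complexity of the feature space and improves the performance of the classification model.
4. A graph-based semi-supervised feature selection algorithm is proposed. Taking the cluster assumption as its theoretical basis, the algorithm extends Laplacian Score, an unsupervised feature selection algorithm based on spectral graph theory. It constructs within-class similarity and between-class scatter matrices for the sample data to examine how well each feature preserves global and local structure, and it uses the classification information gain degree to remove redundant features, which existing algorithms of this kind cannot handle. Experimental results show that, on data sets with a very low proportion of labeled samples, the algorithm removes redundant features and selects feature subsets with strong predictive power.
These results offer new research ideas and promising solutions for the automatic discrimination of spam images and, in turn, for the filtering of spam image content.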
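
Contribution 3 above relies on conditional mutual information to credit features that are informative only in cooperation with others. The sketch below is a rough illustration of that quantity, not the dissertation's actual criterion. It estimates I(X;Y) and I(X;Y|Z) from discrete samples by simple counting. The function names and the XOR toy data are assumptions made for this example.

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(x)
    pxy = Counter(zip(x, y))   # joint counts
    px = Counter(x)            # marginal counts of X
    py = Counter(y)            # marginal counts of Y
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) in bits: the average of I(X;Y) within each group Z = z."""
    n = len(z)
    total = 0.0
    for value, count in Counter(z).items():
        idx = [i for i in range(n) if z[i] == value]
        total += (count / n) * mutual_information(
            [x[i] for i in idx], [y[i] for i in idx]
        )
    return total

# Toy data (made up): the class label is the XOR of two binary features,
# so neither feature is informative alone but the pair determines the label.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
label = [a ^ b for a, b in zip(f1, f2)]

print(mutual_information(f1, label))                  # 0.0
print(conditional_mutual_information(f1, label, f2))  # 1.0
```

A criterion that scores `f1` by its individual mutual information with the label would discard it, whereas conditioning on `f2` shows that it carries one full bit of class information.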

【Abstract】 Spam image recognition is an active topic in current research on Internet spam filtering. It aims to address the problem that traditional text-based spam filtering methods may fail to discriminate image-based spam. The feature extraction and feature selection approaches used to build the feature model play a critical role in spam image recognition. Since email is one of the most widely used channels for delivering spam images, this dissertation focuses on spam images in email and studies noise-robust feature extraction for image regions and edges, supervised feature selection based on information criteria, and semi-supervised feature selection for dealing with the label bottleneck problem. The main contributions of this dissertation are as follows:
1. A noise-robust method for automatic text region extraction is proposed to relax the constraints on image quality. In this method, an algorithm that removes small regions based on eight-neighborhood pixels and a false text region filtering scheme are designed to reduce the noise that complex backgrounds and irregular image text introduce into text segmentation. A Hough transform based algorithm for calculating the minimum enclosing rectangle is then proposed to solve the problem of extracting non-horizontal text regions. The experimental results show that the proposed method improves extraction precision, so that more effective text region features can be obtained.
2. An algorithm is proposed to extract edge features from images in email. The algorithm uses the higher-order local autocorrelation (HLAC) function to extract image edge features. The extracted HLAC features reflect the inherent edge correlation of the image content and are insensitive to shifts and scale changes. The new algorithm is therefore noise robust and places no limitations on edge distribution or on the amount of image text. The experimental results show that HLAC features are effective for spam image discrimination.
3. A new feature selection algorithm based on an information criterion is proposed. Prevalent algorithms evaluate feature redundancy independently of the classification task at hand. To solve this problem, the algorithm defines the classification redundant feature and a measure for it, the classification information gain. Based on this measure, the classification redundant features in the candidate feature subset can be removed beforehand to reduce noise. Since prevalent information criteria cannot handle feature synergy correctly, the algorithm also proposes a new information criterion based on conditional mutual information to appropriately estimate the information of feature interaction. The experimental results show that the new algorithm reduces the dimension of the feature space and improves classification performance.
4. A graph-based semi-supervised feature selection algorithm is proposed. This algorithm exploits the cluster assumption to extend Laplacian Score, a graph-based unsupervised feature selection algorithm. By constructing a between-class scatter matrix and a within-class similarity matrix, the new algorithm evaluates features according to their power to preserve the global and local structure of the samples. In addition, the new algorithm removes redundant features by exploiting the classification information gain. It thus solves the problem that popular score-function-based algorithms cannot deal with redundant features. The experimental results show that the new algorithm reduces the redundancy of the feature space and improves classification performance when labeled samples in the data set are scarce.
The findings and conclusions presented above provide a new perspective on automatic spam image recognition and encourage promising investigations along the lines suggested.
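
For orientation, the unsupervised Laplacian Score that contribution 4 takes as its starting point can be sketched as follows. This is only the commonly described baseline (a k-nearest-neighbor graph with heat-kernel weights), not the dissertation's semi-supervised extension. The parameter names `k` and `t`, the function name, and the toy data are assumptions made for this example.

```python
from math import exp

def laplacian_scores(samples, k=2, t=1.0):
    """Laplacian Score of each feature column; a smaller score means the
    feature better preserves the local neighbourhood structure."""
    n, d = len(samples), len(samples[0])

    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # k-nearest-neighbour graph, symmetrised, with heat-kernel weights.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sqdist(samples[i], samples[j]))
        for j in order[:k]:
            w = exp(-sqdist(samples[i], samples[j]) / t)
            S[i][j] = S[j][i] = w

    degree = [sum(row) for row in S]   # diagonal of the degree matrix D
    total_degree = sum(degree)

    scores = []
    for r in range(d):
        f = [row[r] for row in samples]
        # Remove the degree-weighted mean of the feature.
        mean = sum(fi * di for fi, di in zip(f, degree)) / total_degree
        g = [fi - mean for fi in f]
        # g^T L g with L = D - S equals 1/2 * sum_ij S_ij (g_i - g_j)^2.
        num = 0.5 * sum(S[i][j] * (g[i] - g[j]) ** 2
                        for i in range(n) for j in range(n))
        den = sum(gi * gi * di for gi, di in zip(g, degree))  # g^T D g
        scores.append(num / den if den > 0 else float("inf"))
    return scores

# Toy data (made up): column 0 separates two tight clusters, column 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
     [5.0, 0.8], [5.1, 0.2], [5.2, 0.7]]
scores = laplacian_scores(X)
print(scores)  # column 0 receives the smaller (better) score
```

Because each feature is scored on its own, two highly correlated features receive similar scores and are both kept, which is the redundancy limitation the abstract attributes to score-function-based algorithms.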
