

Study and Application of Correlation Analysis Methods in Anomaly Detection

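【Author】 Wang Baonan (汪保男)

【Advisor】 Li Aiguo (李爱国)

【Author Information】 Xi'an University of Science and Technology, Computer Software and Theory, 2010, Master's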

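【Summary】 This thesis studies the application of correlation analysis methods to anomaly detection, applying them to feature selection and to anomaly detection in earthquake feature data. The main research contents are as follows.

A feature subset selection method is proposed that is based on Binary Particle Swarm Optimization (BPSO) and uses Overlap Information Entropy (OIE) as the fitness function. The method does not depend on a classifier. The main idea is as follows: a number of particles are first generated at random; the OIE between the feature attribute set and the class attribute serves as the fitness function of the BPSO algorithm, and its value indicates how strongly the selected feature subset correlates with the class attribute; BPSO then optimizes the feature subset, and the subset with the largest OIE with respect to the class attribute is taken as the optimal feature subset. Experimental results show that the method not only finds the optimal feature subset effectively but also reduces feature dimensionality and removes redundant information, and that its classification results are no worse than those obtained with all attributes.

The concept of a new nonlinear correlation information entropy is proposed, and several of its properties are derived and proved, showing that it satisfies the basic properties of Shannon entropy. The new correlation information entropy is a criterion for measuring the degree of correlation in multivariable, nonlinear systems. As an uncertainty measure of the correlation among multiple variables, it takes smaller values the more strongly the variables are correlated. It offers a new method and a new line of thought for research on correlation analysis theory, and the results of application examples indicate that it is an effective and useful way to measure the uncertainty of nonlinear systems.

Based on this research, a prototype analysis software system was developed that uses data mining techniques for earthquake trend forecasting and assessment; the system is intended to lay a foundation for further research. The work in this thesis mainly developed the system's correlation analysis module, which also provides users with a visual operation interface. Its main functions are feature selection and anomaly detection, which are used to assess the effectiveness of the proposed feature selection method. Tests on Wenchuan aftershock feature data show that the system functions correctly.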
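The BPSO-based subset search described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the record does not define OIE, so the sketch substitutes the mutual information between the joint value of the selected features and the class label as a stand-in fitness, and it breaks ties in favor of smaller subsets. The function names (`mutual_information`, `bpso_select`), the parameter values, and the toy data are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(rows, labels, mask):
    """I(S; C) in bits, where S is the joint value of the features kept by mask.

    Stand-in fitness (assumption): the thesis uses OIE, which is not defined here.
    """
    n = len(rows)
    keys = [tuple(v for v, keep in zip(row, mask) if keep) for row in rows]
    joint = Counter(zip(keys, labels))
    count_s = Counter(keys)
    count_c = Counter(labels)
    return sum(
        (c / n) * math.log2(c * n / (count_s[s] * count_c[y]))
        for (s, y), c in joint.items()
    )


def bpso_select(rows, labels, n_particles=20, n_iter=60,
                w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Binary PSO over feature masks; returns (best_mask, best_fitness_value)."""
    rng = random.Random(seed)
    d = len(rows[0])

    def fitness(mask):
        # Tie-break (assumption): among equal scores, prefer fewer features.
        return (round(mutual_information(rows, labels, mask), 9), -sum(mask))

    # Randomly generated particles: each position is a 0/1 feature mask.
    xs = [[rng.randint(0, 1) for _ in range(d)] for _ in range(n_particles)]
    vs = [[0.0] * d for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pbest_fit = [fitness(x) for x in xs]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(d):
                v = (w * vs[i][j]
                     + c1 * rng.random() * (pbest[i][j] - xs[i][j])
                     + c2 * rng.random() * (gbest[j] - xs[i][j]))
                v = max(-v_max, min(v_max, v))  # clamp the velocity
                vs[i][j] = v
                # The sigmoid of the velocity is the probability that the bit is 1.
                xs[i][j] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-v)) else 0
            f = fitness(xs[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = xs[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = xs[i][:], f
    return gbest, gbest_fit[0]


# Toy data (hypothetical): feature 0 equals the class, feature 1 duplicates
# feature 0, and features 2 and 3 are independent of the class.
rows = [[i & 1, i & 1, (i >> 1) & 1, (i >> 2) & 1] for i in range(16)]
labels = [row[0] for row in rows]
mask, score = bpso_select(rows, labels)
print(mask, score)
```

Because a joint-feature mutual information never decreases when a feature is added, the tie-break is what lets this sketch drop redundant and irrelevant features; how the thesis's OIE achieves feature reduction is not stated in this record.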

【Abstract】 This thesis studies the application of correlation analysis methods to anomaly detection and applies them to feature selection and to anomaly detection in earthquake feature data. A prototype software system that uses data mining theory and technology to forecast and assess earthquake trends was also developed. The main contents are as follows.

This thesis proposes a new feature subset selection method based on Binary Particle Swarm Optimization (BPSO) and Overlap Information Entropy (OIE). The method does not depend on a classifier. The main idea is as follows. First, a group of particles is generated randomly. The OIE between the attribute set and the class attribute is used as the fitness function of the BPSO algorithm; its value indicates the degree of correlation between the selected attribute set and the class attribute. The feature subset is then optimized by BPSO. Finally, the feature subset with the largest OIE with respect to the class attribute is selected as the optimal feature subset. Experimental results show that this method not only finds the optimal feature subset effectively but also reduces the number of features and removes redundant information, and that its classification results are no worse than those obtained with all features.

The concept of a New Nonlinear Correlation Information Entropy (NNCIE) is proposed, based on a study of Correlation Information Entropy (CIE) and Hpal Entropy. Under the condition of the largest partition of finite sets, several properties of this entropy are derived and proved theoretically, and these properties satisfy the basic properties of the information entropy proposed by C. E. Shannon. The NNCIE is a criterion for measuring the degree of correlation in multivariable, nonlinear systems. As an uncertainty measure of multivariable correlation, the NNCIE takes smaller values the more correlation information the variables share. The NNCIE contributes to information fusion and provides a new method and a new idea for research on correlation analysis theory. Application results show that the NNCIE is an effective and useful measure of the uncertainty of nonlinear systems.

Based on the above results, a prototype software system that uses data mining theory and technology to predict and assess earthquake trends was developed. This system is not production application software; its development only lays a foundation for subsequent research. The correlation analysis module is one of its main components, and it uses the NNCIE as the fitness function of the feature selection method proposed in this thesis. The module also provides a visual operation interface for the user. Its main functions are feature selection and anomaly detection, which are used to assess the effectiveness of the proposed feature selection method. The experimental data are Wenchuan aftershock feature data, and the test results show that the software runs correctly.
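The inverse relation between correlation strength and entropy value can be illustrated with a small sketch. The NNCIE itself is not defined in this record, so the code below does not implement it. Instead it assumes one eigenvalue-based formulation of correlation entropy, H = -Σ (λ_i / N) log_N (λ_i / N), where λ_i are the eigenvalues of the N×N Pearson correlation matrix. Under that assumption H is 1 for mutually uncorrelated variables and 0 for perfectly correlated ones. All function names are hypothetical, and the columns are assumed to be non-constant.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix of equal-length, non-constant numeric columns."""
    n = len(columns[0])
    unit = []
    for col in columns:
        mean = sum(col) / n
        dev = [x - mean for x in col]
        norm = math.sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def correlation_entropy(columns):
    """Assumed eigenvalue-based correlation entropy; not the thesis's NNCIE."""
    k = len(columns)
    total = 0.0
    for lam in symmetric_eigenvalues(correlation_matrix(columns)):
        p = lam / k
        if p > 1e-12:  # treat 0 * log(0) as 0
            total -= p * math.log(p, k)
    return total


x = [1.0, 2.0, 3.0, 4.0]
print(correlation_entropy([x, [1.0, 3.0, 2.0, 4.0]]))   # partially correlated pair
print(correlation_entropy([x, [2 * t + 1 for t in x]]))  # perfectly correlated pair
```

In this sketch a perfectly correlated set concentrates the whole spectrum in one eigenvalue and drives the entropy toward zero, which matches the qualitative behavior the abstract attributes to the NNCIE.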
