Data Mining and Graphics Mode on Influential Point
【Author】 Zhang Sen (张森)
【Author information】 Chongqing University, Applied Mathematics, 2002, Master's thesis
【Abstract】 With the wide application of data mining in modern business, mining outliers and influential points has become a widely studied topic in economics, statistics, and related fields. Data mining and statistical diagnostics are young disciplines that developed only over roughly the past half century; although they have produced many results, many problems remain open. Building on a review of existing methods for mining influential points, in China and abroad, and of the current state of research, and taking the viewpoint of exploratory data analysis, this thesis proposes two new methods for mining influential points: relationship-based warp-departure analysis and contribution-score dimension reduction analysis. The main work and conclusions are as follows:
· Relationship-based warp-departure analysis: Using the method of relationship analysis, the warp coefficient and the departure coefficient of the k-th observation relative to the center are computed, and the warp-departure degree, obtained from their inner product, is used to identify influential points. The method is applied to several typical examples, for which the corresponding programs were written. Theoretical analysis and numerical results show that: (1) the influential points identified agree with those found by classical diagnostic methods; (2) the method works with very small samples, since the warp-departure degree can be computed and analyzed with more than 3 data points; (3) the computational cost is low, with time complexity O().
· Contribution-score dimension reduction analysis: Principal component analysis is applied to the variables and contribution scores are computed, which reduces the dimension of high-dimensional data. After data are deleted, K-means clustering is used to compute an influential distance, which identifies influential points. Computations on examples show that: (1) before and after dimension reduction, the influential points found with the influential distance and with the Cook distance are the same, which indicates that the dimension reduction is feasible; (2) the influential distance agrees with the classical Cook distance on which points are influential, which indicates that the proposed influential distance method is also feasible; (3) dimension reduction makes it possible to display the influential points of high-dimensional data graphically.
· A data mining system for influential points was designed and developed.
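The abstract uses the Cook distance as the classical benchmark against which both proposed methods are checked. As a point of reference, the following is a minimal sketch of the textbook Cook distance for a simple linear regression with one predictor, written in plain Python. It is not the thesis's warp-departure degree or influential distance, whose definitions are not given in the abstract. The function name `cooks_distance` and the sample data are assumptions made up for illustration.

```python
def cooks_distance(x, y):
    """Cook's distance of each observation for the least-squares fit y ~ a + b*x."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Ordinary residuals and the residual variance estimate s^2 = SSE / (n - p).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)
    # Leverage of each point (diagonal of the hat matrix for one predictor).
    leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    # D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2
    return [e * e / (p * s2) * h / (1.0 - h) ** 2
            for e, h in zip(residuals, leverages)]


# Hypothetical data: five points close to y = 2x, plus one point at x = 10
# that lies far from that trend and far from the other x values.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 5.0]
distances = cooks_distance(x, y)
most_influential = max(range(len(x)), key=lambda i: distances[i])
```

Here `most_influential` is the index of the observation with the largest Cook distance, which is the kind of "point with the largest distance" comparison the abstract describes when checking the new methods against the classical one.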
【Key words】 influential point; data mining; diagnosis; graphics mode; warp-departure degree; dimension reduction; influential distance; Cook-distance;
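- 【Online publication contributor】 Chongqing University 【Online publication year and issue】 2003, Issue 02
- 【Classification number】 O213
- 【Citation count】 6
- 【Download count】 170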