

【Author】 Li Dan (李丹)

【Advisor】 Hong Xiaoguang (洪晓光)

【Author Information】 Shandong University, Computer Software and Theory, 2005, Master's degree

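【Abstract (translated from the Chinese)】 This thesis describes the application of outlier mining and cluster analysis in the taxation sector. As database technology has become widespread in tax administration, tax authorities have accumulated large volumes of raw data but cannot use these resources effectively. Obtaining useful knowledge from such data is precisely the problem that data mining sets out to solve. Data mining is a new technology that has developed since the 1980s; its main purpose is to extract implicit, previously unknown, yet potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random real-world data. Outlier mining is one of the important research areas in data mining. Its role is to discover the “small patterns” in the data, that is, objects in a data set that differ markedly from the rest. This is a very effective form of data mining for taxation. Unusual modes of production and operation, exceptionally large taxpaying enterprises (known in the tax sector as key tax sources), and even various tax-related crimes all produce anomalous data, and these data are exactly what tax authorities focus on. Finding such data quickly and effectively is of great significance to the tax sector. This thesis explores outlier data mining in the taxation sector. It first presents the basic concepts and methods of data mining and introduces the typical objects of data mining research and its representative applications. It then studies clustering and outlier mining techniques in detail, explains the general criteria for evaluating clustering and outlier mining algorithms, and introduces some typical algorithms of both kinds. It reviews the development and current state of research on outlier mining, introduces the main algorithms for outlier discovery based on distance, density, and deviation, as well as those for high-dimensional data, analyzes the main content of each algorithm, and on that basis summarizes and compares their strengths, weaknesses, and range of application. The focus of this thesis is the use of a density-based method to perform cluster analysis on the tax data held by tax authorities, discovering meaningful models and anomalous data in it. Given the characteristics of the tax sector, outlier mining has very broad application prospects. Building on a study of existing cluster analysis and outlier mining algorithms, and starting from the practical needs of the tax sector and the characteristics of its data, this thesis takes the outlier-factor-based …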

【Abstract】 In this thesis we apply clustering and outlier detection methods to tax data. As database technology has come into wide use in tax administration, the tax authorities have accumulated a large amount of raw data, which is stored in databases to little avail. Extracting knowledge from these data is the key task of data mining. Data mining is a new technique that has developed since the 1980s. It aims to extract implicit, previously unknown, and potentially useful knowledge from voluminous, incomplete, noisy, fuzzy, and random data. Outlier analysis is an important part of data mining research. Its purpose is to find the "small patterns" in a dataset. An outlier is an object that is considerably dissimilar to or inconsistent with the remainder of the data. This is very useful in taxation. Outliers in a tax database can be generated by an unusual mode of production, a large-scale taxpayer, or even tax-related crime. All of these are under special supervision by the tax authorities, so it is important to find them quickly and accurately. This thesis discusses outlier detection techniques adapted to taxation. First, we describe the basic concepts and methods of data mining, and then introduce its common objects and representative applications. We study clustering and outlier detection techniques, describe the general criteria for evaluating them, and introduce some clustering and outlier detection algorithms. The research history and current state of outlier detection are reviewed. Outlier detection algorithms based on distance, density, and deviation, as well as those for high-dimensional data, are introduced, their content is analyzed, and their advantages and disadvantages are compared. The emphasis of this thesis is on using ODACDS (outlier detection algorithm on Continued Data Sets), a density-based clustering method, to analyze tax data. The algorithm can discover clusters of arbitrary shape and can distinguish noise. Owing to the characteristics of tax data, outlier detection can be used widely in this field.
To meet the needs of the tax authorities, we studied a range of outlier detection and clustering algorithms. Building on a study of the clustering algorithm based on the outlier factor, we propose an outlier detection algorithm on Continued Data Sets. The new algorithm can be used to find deviations in the data. We first introduce the concept of the outlier factor, then explain the idea and process of the algorithm, and discuss its details and exceptional cases.

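  • 【Online Publication Contributor】 Shandong University
  • 【Online Publication Issue】 2005, Issue 08
  • 【Classification Number】 TP311.13
  • 【Citation Count】 8
  • 【Download Count】 423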

