
Research on the Application of Data Mining Technologies in Online Auditing

Research on Data Mining Technologies in Online Audit

【Author】 谢岳山

【Supervisors】 樊晓平; 廖志芳

【Author Information】 Central South University, Computer Application Technology, 2013, Doctoral Dissertation

【Abstract】 With the development of information systems, large state-owned enterprises, central ministries, customs, and other departments have accumulated large volumes of business data. Every year the National Audit Office spends considerable effort auditing these data in order to discover violations, report them to the central government, and supervise rectification. As the volume of business data keeps growing, it has become necessary to analyze this information with appropriate technologies. Data mining is one of the technologies widely used for large-scale data analysis: by processing massive business data it can surface suspicious records, so that audit analysis can focus on those records. This reduces the data volume and the audit workload, excludes human interference, and yields more objective audit results.

Based on data mining techniques, and using data from customs, social security, and national tax authorities as carriers, this dissertation analyzes the characteristics of audit data and studies it in depth from three aspects — data preprocessing, formation of suspicious audit data sets, and audit method matching — to provide decision support for the final audit.

The dissertation first analyzes the data distribution characteristics of state-owned enterprises and central ministries. According to audit networking requirements, it proposes a network topology consisting of a data acquisition LAN, a data transmission LAN, and a data storage LAN. In the data acquisition LAN, a front-end data acquisition machine collects the data; to ensure security between the audited unit and the National Audit Office, a dual-switch arrangement physically isolates the two systems. In the data transmission LAN, mature transmission technologies such as SDH/ATM/ADSL are adopted, and a dedicated audit VPN provides security. In the data storage LAN, centralized, distributed, and shared storage configurations are set up according to the data characteristics of the different units, and three typical networking modes are proposed: centralized, distributed, and point-to-point.

For noisy audit data, after analyzing and comparing dimensionality reduction methods, the dissertation proposes a semi-supervised denoising and dimensionality reduction algorithm based on L2,1-norm principal component analysis. Because PCA is sensitive to noise in the data, the L2,1 norm is used to improve it; and because L2,1-norm PCA achieves dimensionality reduction by lowering the rank of a matrix, whose computation is expensive, the dissertation proposes replacing the matrix rank with the trace norm to simplify the L2,1-PCA computation and improve efficiency. To obtain the optimal solution, a semi-supervised denoising optimization model based on L2,1-PCA is then built; using the trace norm, matrix transformations, the characteristic-equation method, and the Lyapunov-equation method, the model's optimal solution is derived and its stability is proved. Experimental results show that the model achieves good denoising and dimensionality reduction.

Because most audit data are time series, the dissertation proposes a peak-removed prominent streak algorithm for identifying suspicious audit data. Building on existing anomaly-detection algorithms for time series and on the prominent streak algorithm, it further reduces the computation of candidate streak groups and improves efficiency. Experiments on customs data identify the prominent data sequences in the data set; auditing these sequences further improves audit efficiency.

To improve audit efficiency by drawing on past audit methods, the dissertation proposes a basic approach to building an audit method base. For audit method matching, it proposes a HowNet-based sentence matching algorithm: noting that previous matching methods ignore word frequency, it constructs frequency and weight functions and incorporates the frequency function into the matching algorithm, fully accounting for the weight of different words. Experiments show that this method achieves a more effective matching degree, and applying it to the lookup and matching of audit methods yields good efficiency.

The dissertation concludes by summarizing its contributions and outlining directions for future research. 46 figures, 31 tables, 137 references.

【Abstract】 With the development of information systems, a large amount of business information from corporations and central ministries has become available for auditing. This information needs to be analyzed because useful facts are often concealed from general processing methods. Data mining is one of the technologies widely used for such analysis: by processing large volumes of business data it can identify suspicious records for the audit department to examine, reducing the data volume and the audit workload while excluding the interference of human factors, so that the audit results are more objective. This dissertation studies the processing of audit information with data mining techniques, covering data preprocessing, formation of training samples, construction of an audit method rule base, and decision support for auditors.

Three novel online audit network modes — centralized, distributed, and point-to-point — are introduced according to the data storage practices of most ministries and commissions. Each mode is described in detail, including its basic elements, key problems, and implementation, using the centralized online audit network as an example. Together the three modes provide an efficient audit network platform for most ministries and commissions.

Traditional dimension reduction methods reduce noise by explicit rank reduction and dimension reduction simultaneously. Principal component analysis (PCA) is widely used for dimensionality reduction, denoising, feature selection, subspace detection, and other purposes. However, traditional PCA is based on the Frobenius norm and therefore suffers from both outliers and large feature noise. The dissertation presents a denoising method with a robust formulation that uses the L2,1 norm together with rank reduction but without dimension reduction. L2,1-norm-based PCA (L2,1-PCA) replaces the Frobenius norm with the L2,1 norm and is well suited to overcoming this difficulty; the dissertation proposes a computationally efficient algorithm to solve the L2,1-PCA problem. Both numerical and visual results show that L2,1-PCA is consistently better than standard PCA.

The dissertation then studies prominent streak discovery in time-series data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (or small) values. Existing algorithms for computing candidate prominent streaks always include some peak streaks; if the peak value of such a streak is not the maximum (or minimum) of the sequence, the streak cannot be a prominent streak. To improve efficiency, the dissertation introduces a new concept called the Peak-removed Local Prominent Streak (PLPS) and, based on it, a new linear PLPS-based algorithm that removes these peak streaks during the computation of candidate prominent streaks. Experimental results show that the linear PLPS-based algorithm is more efficient than other common algorithms and reduces time complexity.

Finally, the dissertation adopts HowNet as the basic semantic dictionary in its semantic-similarity work, studying and improving a semantic similarity algorithm for sentences. It first introduces the structure of HowNet, then studies the details and formulas of semantic similarity between sememes, concepts, and words. It also introduces a corpus, discusses how to compute the frequency of each word in the corpus, and improves the HowNet-based semantic similarity algorithm by adding a word-frequency function. In application, since the audit methods summarized by the auditing administration are often duplicated and unclear, an audit rule base is built to reuse them: the system uses the improved algorithm to compute the similarity between user input and the rules in the audit rule base, a process called rule matching. Experimental results show that the improved similarity algorithm returns audit methods that satisfy the given conditions appropriately and improves the accuracy of the matching process.
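The dissertation's L2,1-PCA solver is not reproduced in this abstract. As a minimal sketch of why the L2,1 norm is more robust to sample-level outliers than the Frobenius norm (the key motivation stated above), the following compares the two residual norms on a matrix with one corrupted row; all function names and data here are illustrative assumptions, not the thesis's algorithm:

```python
import numpy as np

def frobenius_norm(R):
    # Square root of the sum of squared entries; squaring lets a single
    # outlier row dominate the objective.
    return np.sqrt((R ** 2).sum())

def l21_norm(R):
    # L2,1 norm: sum over rows of each row's Euclidean norm; an outlier
    # row contributes only linearly, not quadratically.
    return np.sqrt((R ** 2).sum(axis=1)).sum()

# Residual of a hypothetical low-rank fit: small noise plus one outlier row.
rng = np.random.default_rng(0)
R = 0.01 * rng.standard_normal((100, 5))
R_out = R.copy()
R_out[0] = 100.0  # one corrupted sample

# The Frobenius norm blows up far more than the L2,1 norm under the same
# corruption, which is why minimizing an L2,1 residual is robust to outliers.
print(frobenius_norm(R_out) / frobenius_norm(R))
print(l21_norm(R_out) / l21_norm(R))
```

The same effect is why the thesis replaces the Frobenius norm in PCA with the L2,1 norm before applying the trace-norm relaxation of the rank.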
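The linear PLPS-based algorithm itself is not given in the abstract. The following is a brute-force sketch of the underlying notion of a prominent streak only — an interval that no other interval dominates in both length and minimum value; the O(n^2) enumeration and all names and data are illustrative assumptions, whereas the thesis's contribution is precisely to avoid enumerating dominated and non-extremal peak candidates:

```python
def prominent_streaks(seq):
    """Return non-dominated (length, min_value, start) streaks of seq.

    A streak is prominent if no other streak is at least as long AND has
    at least as large a minimum, with one of the two strictly larger.
    Brute-force O(n^2) baseline for illustration only.
    """
    candidates = []
    for i in range(len(seq)):
        m = seq[i]
        for j in range(i, len(seq)):
            m = min(m, seq[j])  # running minimum of seq[i..j]
            candidates.append((j - i + 1, m, i))

    def dominates(d, c):
        return d[0] >= c[0] and d[1] >= c[1] and (d[0] > c[0] or d[1] > c[1])

    return sorted(c for c in set(candidates)
                  if not any(dominates(d, c) for d in candidates))

# Example: daily declaration counts at one customs office (made-up data).
print(prominent_streaks([3, 1, 7, 7, 2, 0, 4]))
```

Each returned tuple is a Pareto-optimal trade-off between streak length and streak floor; a value series whose prominent streaks change abruptly between periods is a natural candidate for further audit.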
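HowNet is a proprietary lexical resource, and its sememe-similarity formulas are not reproduced here. As a hedged sketch of the frequency-weighting idea alone, the following computes a weighted word-overlap similarity in which rarer corpus words carry more weight, so that frequent filler words contribute little to rule matching; the IDF-style weight and all data are illustrative assumptions, not the thesis's frequency function:

```python
import math
from collections import Counter

def word_weight(word, corpus_freq, total):
    # Rarer words are more informative; an IDF-style weight stands in
    # for the thesis's frequency function (assumption).
    return math.log((total + 1) / (corpus_freq.get(word, 0) + 1))

def weighted_similarity(s1, s2, corpus_freq, total):
    """Weighted overlap of two tokenized sentences: shared weight over
    total weight of the union of their words."""
    w1, w2 = set(s1), set(s2)
    shared = sum(word_weight(w, corpus_freq, total) for w in w1 & w2)
    union = sum(word_weight(w, corpus_freq, total) for w in w1 | w2)
    return shared / union if union else 0.0

# Tiny made-up "audit rule base" of tokenized rule sentences.
rules = [
    ["check", "invoice", "amount", "against", "declared", "value"],
    ["check", "tax", "rate", "for", "declared", "goods"],
    ["verify", "invoice", "number", "sequence"],
]
corpus_freq = Counter(w for r in rules for w in r)
total = sum(corpus_freq.values())

# Rule matching: rank the rules by similarity to a user query.
query = ["check", "invoice", "amount"]
best = max(rules, key=lambda r: weighted_similarity(query, r, corpus_freq, total))
print(best)
```

In the thesis's setting, the word-level similarity would come from HowNet sememe distances rather than exact token overlap, with the frequency function modulating each word's contribution as sketched here.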

  • 【Online Publisher】 Central South University
  • 【Online Publication Year/Issue】 2014, No. 12