
基于SSAS的数据挖掘算法研究与实现

Research and Implementation of Data Mining Algorithms Based on SSAS

【Author】 Guo Xing (郭醒)

【Advisor】 Liu Dayou (刘大有)

【Author information】 Jilin University, Software Engineering, 2008, Master's degree

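【Summary】 Against the background of research on data mining and association rule mining, this thesis focuses on methods for discovering the maximal frequent item sets of association rules under SSAS (SQL Server 2005 Analysis Services), and on an optimized decision tree algorithm. The thesis first discusses the background from which data mining emerged, the basic data mining process, and the main data mining tasks. It then introduces the basic concepts of association rule mining and studies algorithms for discovering maximal frequent item sets. In the Microsoft SSAS environment, the thesis improves and implements FM_CUBE, an algorithm that describes item sets with a set-enumeration tree and quickly discovers maximal frequent item sets based on a data cube, which significantly improves discovery efficiency. FM_CUBE provides an effective and fast algorithm for data mining applications that need to discover maximal frequent item sets. Finally, through a study of decision tree algorithms, a new pruning optimization algorithm is designed on the basis of minimum error pruning. Experimental results show that the proposed algorithm performs better in running time than the Microsoft algorithm.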
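The notion of a maximal frequent item set used in this record can be illustrated with a small sketch. The code below is not FM_CUBE or Max-Miner: it is a plain level-wise (Apriori-style) search followed by a maximality filter, and it counts support directly over an in-memory transaction list, whereas the thesis reads support from a data cube. The function names, the `min_count` threshold, and the sample transactions are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Level-wise (Apriori-style) search for all frequent item sets."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i]), transactions) >= min_count]
    frequent = []
    while level:
        frequent.extend(level)
        known = set(level)
        k = len(level[0]) + 1
        # Join step: unions of two frequent sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in known for s in combinations(c, k - 1))
                 and support_count(c, transactions) >= min_count]
    return frequent


def maximal_frequent_itemsets(transactions, min_count):
    """Frequent item sets that have no frequent proper superset."""
    freq = frequent_itemsets(transactions, min_count)
    return [s for s in freq if not any(s < other for other in freq)]


# Hypothetical sample data, not taken from the thesis.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
)]
maximal = maximal_frequent_itemsets(transactions, min_count=3)
```

Because every frequent item set is a subset of some maximal one, the short list in `maximal` summarizes the full set of frequent item sets; algorithms such as Max-Miner aim to reach it without enumerating every frequent subset, which this sketch does not attempt.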

【Abstract】 With the continuous development of database technology and the widespread use of database management systems, the volume of data stored in databases is growing rapidly. A great deal of important information is hidden in this data, and it could support better decision-making. At present, database systems can only provide access to the stored data; what people obtain this way is only part of the picture, whereas the more important information lies in the overall characteristics of the data and in forecasts of its development trends. Such information has important reference value in the decision-making process. The demands on data-processing technology are therefore rising: deeper levels of data processing are needed in order to obtain the overall characteristics of the data and forecasts of its trends.

Data mining is the discovery of interesting knowledge from large data sets (which may be incomplete, noisy, uncertain, and stored in various forms), knowledge that is implicit, previously unknown, and potentially valuable for decision-making. The extracted knowledge can be expressed as concepts, rules, regularities, and patterns. As a new field of study, data mining draws on machine learning, pattern recognition, statistics, databases, artificial intelligence, mathematics, visualization technology, and other disciplines, and is an emerging research field with broad applications.

This thesis discusses the basic process of data mining and its main tasks. It also examines the entire data mining process: data integration, data cleansing, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
Association rule mining and decision tree construction are studied in depth.

The second chapter studies algorithms for discovering maximal frequent item sets. In the Microsoft SSAS environment, we improve and implement the FM_CUBE algorithm, which uses a set-enumeration tree to describe item sets and is based on a data cube. The FM_CUBE algorithm significantly improves the efficiency of discovery. Identifying frequent item sets is the key technique and the computationally intensive step in association mining, and every frequent item set is a subset of some maximal frequent item set. FM_CUBE thus provides an effective and fast method for data mining applications that need to find maximal frequent item sets. Chapter II first introduces association rules and explains the classical Apriori algorithm, and then proposes the FM_CUBE algorithm for maximal frequent item sets. Unlike the entity-relationship model of relational databases, the data model of a data warehouse is multidimensional and organizes data as a data cube. A multidimensional data cube holds aggregated statistics: an item set corresponds to a combination of members of the cube's dimensions, and the support of the item set is the corresponding measure value. Algorithms that discover frequent item sets on a data cube therefore generally compute support from the cube. Some scholars have proposed frequent item set algorithms that combine the data cube with Apriori. The author programmed the Max-Miner and FM_CUBE algorithms in C#, used SQL Server 2005 Analysis Services to generate the data cube, and accessed the cube through ADOMD.NET and MDX.

In the third chapter, a decision tree algorithm is designed on the basis of a study of minimum error pruning. The experimental results show that the proposed algorithm performs better in running time than the Microsoft algorithm. Chapter III first analyzes the classic decision tree algorithms ID3 and C4.5.
The training set is a collection of records with the same structure, each consisting of a number of attribute-value pairs, one of which gives the record's category. The problem is to construct a decision tree that correctly predicts the value of the category attribute from the values of the non-category attributes. The main advantages and disadvantages of the two algorithms are then analyzed. Next, the whole process of generating a decision tree is described in more detail. Decision tree construction has two main parts. The first is tree generation: initially all the data is at the root node, and the data is then partitioned recursively. The second is tree pruning, which removes branches likely to reflect noise or abnormal data. Splitting stops at a node when all its data belongs to the same category or when no attribute remains that can be used for splitting.

The main pruning methods are then studied and discussed in detail: pessimistic error pruning (PEP), minimum error pruning (MEP), cost-complexity pruning (CCP), and error-based pruning (EBP), together with their pros and cons.

Finally, the Microsoft decision tree algorithm is described, and decision tree mining is carried out in SSAS on the example databases CollegePlans and MovieClick.

In the fourth part, building on the study of pruning algorithms and following the principle of minimum error pruning, an optimized ID3 algorithm is proposed. Its greatest advantage is that it can, according to the characteristics and attributes of the data, choose the node-generation scheme with the lowest error rate, which both improves the system's running efficiency and reduces the error rate. Based on the pruning methods above, we then implement the optimized decision tree algorithm in C#.
Here, the TreeView control of Microsoft Visual Studio 2005 is used to build the tree display. Finally, the proposed decision tree algorithm is shown to be more efficient than the Microsoft decision tree algorithm.

Finally, the thesis is summarized and the future of data mining is discussed. The results of the thesis, especially those on maximal frequent item sets and decision trees, are of both theoretical and practical benefit to further research.
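The abstract names ID3 and minimum error pruning (MEP) without giving formulas, so the following is a minimal sketch of their textbook forms, not the thesis's optimized algorithm. It assumes ID3's entropy-based information gain as the split criterion and the Niblett-Bratko expected error, (N - n_c + k - 1) / (N + k), as the MEP estimate, where N is the number of records at a node, n_c the count of the majority class, and k the number of classes. All function names and the sample records are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction from splitting on attribute (ID3 criterion)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


def mep_expected_error(class_counts, num_classes):
    """Niblett-Bratko expected error if the node were made a leaf."""
    n = sum(class_counts)
    return (n - max(class_counts) + num_classes - 1) / (n + num_classes)


def backed_up_error(children_counts, num_classes):
    """Expected error of the subtree: weighted average over the children."""
    total = sum(sum(c) for c in children_counts)
    return sum(sum(c) / total * mep_expected_error(c, num_classes)
               for c in children_counts)


# Hypothetical records: category attribute "play", two other attributes.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["no", "no", "yes", "yes"]
gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")

# Pruning check for a node with class counts [3, 1] and two children.
static_error = mep_expected_error([3, 1], num_classes=2)
subtree_error = backed_up_error([[2, 1], [1, 0]], num_classes=2)
prune = static_error <= subtree_error
```

In this toy data, splitting on `outlook` separates the classes completely while `windy` carries no information, so ID3 would choose `outlook`. In the pruning check, the node's own expected error is no worse than that of its children, so MEP would replace the subtree with a leaf.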

  • 【Online publication contributor】 Jilin University
  • 【Online publication year and issue】 2008, Issue 11
  • 【Classification code】 TP311.13
  • 【Times cited】 3
  • 【Downloads】 278