
Research on Data Models and Data Mining Approaches for Semi-structured Data

【Author】 Sun Tao

【Advisor】 Li Xiongfei

【Author information】 Jilin University, Computer Application Technology, 2010, Ph.D.

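【Abstract (translated from Chinese)】 With the rapid development of computer technology, the Internet, and database technology, the volume of semi-structured data and information accumulated in every field has grown sharply. There is an urgent need to design semi-structured data models oriented toward knowledge discovery and to use them to store and describe both the content and the structure of semi-structured data. Effective semi-structured data mining algorithms are also needed to extract, from large collections of semi-structured documents, deeper content that describes information, structural features, and trends, and to combine content and structure in discovering latent knowledge. This thesis studies semi-structured data models and data mining methods in depth. The main contents are as follows. (1) Starting from the overall scope of semi-structured data research, we survey the field in detail. We summarize the semi-structured data models and data schemas proposed so far; review progress in semi-structured data mining from the perspectives of feature extraction, frequent structure discovery, and document clustering and classification; and describe the functional characteristics of currently popular data mining systems. (2) To handle imprecise and uncertain knowledge under semi-structured data models, we design LTRS, a rough set model based on labeled trees. LTRS analyzes semi-structured data from both structure and content; working on the tree representation, it generates decision rules from both perspectives, describing the compositional relationships between tree nodes and the knowledge reduction of content. Existing semi-structured data models lack a formal definition of the trend and degree of data change and cannot describe the dynamic properties of data well. To address this, we propose ADAWT, a tree model that carries dynamic change information such as the average depth and average width of the tree, which lays the foundation for efficient discovery of spatially dynamic structures. (3) We propose SSGP, a new data-level balancing method for the classification of skewed data sets, which are inherent in semi-structured data. The algorithm handles data sets with several minority classes; it also extends and applies a modulus operation on examples, which considerably improves computational efficiency. (4) Clustering and classification of XML and other semi-structured data sets both suffer from overlapping class boundaries and from boundary noise, which degrade clustering quality or classification accuracy. Drawing on the notion of direction and on the law of universal gravitation in physics, we take the interaction between data objects as the basis and discuss density-based clustering in terms of scalar influence and directional influence, proposing VICA, a density clustering algorithm that considers vector influence between objects. Two methods of computing the vector influence function, the direction similarity method and the vector summation method, are used to judge neighborhood balance, so that data can be clustered when boundaries are sparse, object density is unevenly distributed, and boundary noise points are present. (5) Traditional static mining algorithms are not suited to knowledge discovery on dynamically changing XML documents. Using the proposed ADAWT model, we design the SCSFinder algorithm, which finds spatial structures whose change in average depth and average width reaches a level of interest specified by the user. In addition, we build a feature space from the extracted dynamic structures, represent XML documents as feature vectors, and cluster large collections of XML documents with an improved clustering algorithm. (6) Building on existing semi-structured data mining theory and combining the strengths of data mining products that are popular and mature in industry and research (such as SAS Enterprise Miner and Weka), we design DBIN Miner, a multi-strategy data mining prototype system. The system stores semi-structured XML data and integrates the mining algorithms described above together with commonly used basic data mining algorithms. To address the difficulty that data mining techniques and systems face with large-scale data, we focus the design and implementation on scalability and extensibility through buffering and plug-in techniques. This thesis carries out research on semi-structured data model design, on classification and clustering for semi-structured data applications, and on document clustering based on dynamic feature extraction from semi-structured data, laying a theoretical foundation for knowledge discovery in semi-structured data. Applying the theory to the design and implementation of a data mining prototype system also lays a foundation for its commercial application.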
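The abstract does not give the vector influence function itself, so the following is only a minimal sketch of how a vector summation check for neighborhood balance might look. It assumes that each neighbor contributes a unit direction vector and that the length of the averaged vector measures imbalance. The names `direction_imbalance` and `is_balanced` and the threshold `0.5` are hypothetical and are not taken from the thesis.

```python
import math


def direction_imbalance(point, neighbors):
    """Length of the mean unit vector from `point` to its neighbors.

    Assumed measure, not the thesis's definition: 0.0 means the neighbors
    surround the point evenly, 1.0 means they all lie in one direction.
    """
    dim = len(point)
    total = [0.0] * dim
    count = 0
    for q in neighbors:
        diff = [qi - pi for pi, qi in zip(point, q)]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm == 0.0:
            continue  # a coincident point carries no direction
        for i in range(dim):
            total[i] += diff[i] / norm
        count += 1
    if count == 0:
        return 0.0
    return math.sqrt(sum(t * t for t in total)) / count


def is_balanced(point, neighbors, threshold=0.5):
    """Treat a point as balanced when the imbalance stays under the
    (hypothetical) threshold."""
    return direction_imbalance(point, neighbors) <= threshold


interior = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # neighbors on all sides
boundary = [(1, 0), (2, 0), (1, 1)]             # neighbors on one side only
print(is_balanced((0, 0), interior), is_balanced((0, 0), boundary))
```

Under these assumptions, a point surrounded on all sides is kept as a balanced core point, while a point whose neighbors all lie on one side is flagged as a likely boundary or noise point.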

【Abstract】 As society enters the information age and computer networks and computer technology are applied ever more widely, databases in every industry are accumulating large amounts of data. The need to use these data, and to extract useful information and knowledge from them to guide the production and distribution of enterprises, has given rise to data mining, a widely used and highly practical computer technology. With the spread of the Internet, network data keep growing, and much of it is semi-structured. Semi-structured data are favored for data storage and data exchange because they are scalable, self-describing, and dynamic. They give flexibility to system implementation and make it easier for organizations to share resources. Because semi-structured data lack a rigid and complete structure, they carry both content and structure information; the structure may be implicit and may even be modified constantly. It is therefore necessary to design data models that, based on data analysis requirements, better describe the characteristics of semi-structured data. Well-designed models provide a stable basis for data storage, index construction, query optimization, and knowledge discovery. In addition, the flexibility of semi-structured data causes many problems in application analysis, such as data skewness, obscure cluster boundaries, and noise at cluster boundaries, so semi-structured data mining algorithms must be designed to solve them. The structure and content of semi-structured data may be modified continuously and are highly dynamic. These changes in structure and content reflect how the data evolve over time.
Two questions follow: how to find dynamic structures in the history of changes, and how to use these dynamic structures and their information, together with clustering and classification methods, to analyze semi-structured data. Answering them is of great significance for making better use of the flexibility and dynamics of semi-structured data. As data scale expands and analysis requirements increase, many kinds of data analysis tools and data mining systems need to be developed. Mining historical data yields decision rules that guide management and development and bring greater economic benefit to enterprises. Data mining has been application-oriented from the beginning, and its wide use and popularization in turn promote research on data mining theory. The main results of this thesis are summarized as follows: 1) We analyze current research on semi-structured data models and data mining. From the relevant literature, we summarize the characteristics of semi-structured data and the data schemas that have been proposed, and point out where they describe data poorly in applications. From the application of semi-structured data, we identify problems such as data skewness and obscure cluster boundaries. We then review research on feature extraction, frequent structure discovery, and document clustering and classification, and describe the characteristics of popular data mining systems. This literature review forms the basis of the thesis. 2) Based on data mining requirements, we design two semi-structured data models, LTRS and ADAWT. To characterize and handle the vagueness and uncertainty of semi-structured data, as well as the composition and content implied within semi-structured data models, we present the Labeled Tree Rough Set model (LTRS), which extends the traditional rough set model.
Using the structure and content of semi-structured data, we redefine the information system and the basic concepts of rough sets on the tree structure, such as the equivalence relation, the indiscernibility relation, and the upper and lower approximations. We also describe the discernibility matrix and decision rules. By analyzing XML data sets with the LTRS model, we can construct decision rules from structure and content at the same time, and describe the compositional relationships between tree nodes and the knowledge reduct of content. Existing semi-structured data models lack a formal definition of the direction and degree of data change and give no precise description of the dynamic properties of data. We therefore present ADAWT, a tree model that carries dynamic change information about tree depth and width. The model integrates the dynamic change information of tree-shaped documents such as XML across N historical versions and provides the basis for effective discovery of dynamic structures. 3) We put forward SSGP, a data balancing algorithm for the classification of skewed semi-structured data. Skewed data are common in Web applications of semi-structured data, and traditional classifiers handle them poorly. A classifier may partly or completely ignore the positive examples, or even predict every example as negative. Prediction and analysis of under-represented examples is therefore an important branch of data mining, and classification algorithms are needed for the widespread problem of skewed semi-structured data. To balance training sets that contain several classes, we introduce SSGP, which is based on the idea that cases of the same class differ little from one another. SSGP forms new minority class cases by interpolating between several minority class cases that lie close together.
It is proved that SSGP does not add noise to the data set. To improve efficiency, SSGP uses the modulus instead of computing a large number of dissimilarities between cases. Tests of the balancing effect with a decision tree classifier show that SSGP can improve the predictive accuracy for several minority classes in a single run. 4) We present VICA, a clustering algorithm that considers vector influence between objects, to deal with obscure cluster boundaries and clustering noise. While working on the classification of skewed semi-structured data, we found that both clustering and classification lose precision because of obscure cluster boundaries and clustering noise. We present a density-based clustering algorithm that considers vector influence between objects. From the point of view of the law of gravity, the influence between particles has two aspects, namely distance and direction. We define the Vector Influence Function by introducing a scalar influence function and a direction influence function. Moreover, we propose two methods, the similarity method and the summation method, to compute the direction influence. The VICA algorithm normalizes the projections of the objects in the neighborhood of a core point, checks whether the core point is balanced, and then expands into a cluster the objects that are reachable from balanced core points with balanced density. Theoretical analysis and experimental results indicate that the algorithm can discover clusters of arbitrary shape and can effectively eliminate noise such as sparse boundary points. It addresses problems caused by obscure cluster boundaries in high-dimensional data, uneven density distributions, and large numbers of cluster boundary objects.
The algorithm improves clustering accuracy and gives better clustering results on various data sets. 5) We study dynamic feature extraction and document clustering for XML data. Because traditional static mining algorithms cannot discover knowledge in dynamically changing XML documents, we summarize the basic concepts and definitions of existing work on finding FSC and FS, and design a corresponding structure-finding algorithm based on the temporal data model, which reduces the substantial time spent on change detection between versions. We then present the ADAWT model from the viewpoint of spatial scale change between XML versions. Moreover, we construct a feature space from the various extracted dynamic structures, represent XML documents as feature vectors, and cluster large-scale XML document collections with the VICA algorithm. 6) We design DBIN Miner, a multi-strategy data mining system. The development of database technology and the widespread use of database management systems have led to growing data volumes and analysis requirements, and many kinds of data mining systems and business intelligence software are continuously being developed. We review the development history of data mining systems, analyze the characteristics of typical systems, and design a multi-strategy data mining system. To deal with large-scale data, we introduce and design the idea of algorithm components (plug-ins), buffer processing, and XML-based configuration files. The system integrates the algorithms designed above and is readily extensible. The results of this thesis advance research on semi-structured data models, on classification and clustering for semi-structured data analysis, and on dynamic feature extraction and document clustering of semi-structured data. Our theoretical research and prototype design have both theoretical significance and application value.
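To make the idea of tracking depth and width changes across XML versions concrete, here is a rough sketch. It is not the ADAWT model itself, whose formal definitions the abstract does not give. It assumes that "average depth" is the mean depth of leaf elements and that "average width" is the mean number of elements per level. The function names `shape_stats` and `shape_changes` are hypothetical.

```python
import xml.etree.ElementTree as ET


def shape_stats(xml_text):
    """Return (average leaf depth, average elements per level) of one version.

    Both definitions are assumptions made for illustration only.
    """
    root = ET.fromstring(xml_text)
    leaf_depths = []
    level_counts = {}
    stack = [(root, 1)]  # the root element sits at depth 1
    while stack:
        node, depth = stack.pop()
        level_counts[depth] = level_counts.get(depth, 0) + 1
        children = list(node)
        if not children:
            leaf_depths.append(depth)
        for child in children:
            stack.append((child, depth + 1))
    avg_depth = sum(leaf_depths) / len(leaf_depths)
    avg_width = sum(level_counts.values()) / len(level_counts)
    return avg_depth, avg_width


def shape_changes(versions):
    """Change in (average depth, average width) between consecutive versions."""
    stats = [shape_stats(v) for v in versions]
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(stats, stats[1:])]


history = ["<a><b/><c/></a>", "<a><b><d/></b><c/></a>"]
for delta_depth, delta_width in shape_changes(history):
    print(f"depth change {delta_depth:+.2f}, width change {delta_width:+.2f}")
```

A finder in the spirit of SCSFinder could then keep only those structures whose changes exceed a user-chosen level of interest; that filtering step is left out here because the abstract does not specify it.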

  • 【Online publication submitter】 Jilin University
  • 【Online publication issue】 2010, Issue 08